{"title": "Maximum Expected Hitting Cost of a Markov Decision Process and Informativeness of Rewards", "book": "Advances in Neural Information Processing Systems", "page_first": 7679, "page_last": 7687, "abstract": "We propose a new complexity measure for Markov decision processes (MDPs), the maximum expected hitting cost (MEHC). This measure tightens the closely related notion of diameter [JOA10] by accounting for the reward structure.\nWe show that this parameter replaces diameter in the upper bound on the optimal value span of an extended MDP, thus refining the associated upper bounds on the regret of several UCRL2-like algorithms.\nFurthermore, we show that potential-based reward shaping [NHR99] can induce equivalent reward functions with varying informativeness, as measured by MEHC.\nBy analyzing the change in the maximum expected hitting cost, this work presents a formal understanding of the effect of potential-based reward shaping on regret (and sample complexity) in the undiscounted average reward setting.\nWe further establish that shaping can reduce or increase MEHC by at most a factor of two in a large class of MDPs with finite MEHC and unsaturated optimal average rewards.", "full_text": "Maximum Expected Hitting Cost of a Markov\n\nDecision Process and Informativeness of Rewards\n\nToyota Technological Institute at Chicago\n\nToyota Technological Institute at Chicago\n\nMatthew R. Walter\n\nChicago, IL, USA 60637\n\nmwalter@ttic.edu\n\nFalcon Z. Dai\n\nChicago, IL, USA 60637\n\ndai@ttic.edu\n\nAbstract\n\nWe propose a new complexity measure for Markov decision processes (MDPs), the\nmaximum expected hitting cost (MEHC). This measure tightens the closely related\nnotion of diameter [JOA10] by accounting for the reward structure. We show that\nthis parameter replaces diameter in the upper bound on the optimal value span of an\nextended MDP, thus re\ufb01ning the associated upper bounds on the regret of several\nUCRL2-like algorithms. 
Furthermore, we show that potential-based reward shaping [NHR99] can induce equivalent reward functions with varying informativeness, as measured by MEHC. We further establish that shaping can reduce or increase MEHC by at most a factor of two in a large class of MDPs with finite MEHC and unsaturated optimal average rewards.\n\n1 Introduction\n\nIn the average reward setting of reinforcement learning (RL) [Put94; SB98], an algorithm learns to maximize its average rewards by interacting with an unknown Markov decision process (MDP). Similar to analysis in multi-armed bandits and other online machine learning problems, (cumulative) regret provides a natural model to evaluate the efficiency of a learning algorithm. With the UCRL2 algorithm, Jaksch, Ortner, and Auer [JOA10] show a problem-dependent bound of $\tilde{O}(DS\sqrt{AT})$ on regret and an associated logarithmic bound on the expected regret, where $D$ is the diameter of the actual MDP (Definition 1), $S$ the size of the state space, and $A$ the size of the action space. Many subsequent algorithms [FLP19] enjoy similar diameter-dependent bounds. This establishes diameter as an important measure of complexity for an MDP. However, strikingly, this measure is independent of rewards and is a function of only the transitions. This is obviously peculiar, as two MDPs differing only in their rewards would have the same regret bounds even if one gives the maximum reward for all transitions. We review the related key observation by Jaksch, Ortner, and Auer [JOA10], and refine it with a new lemma (Lemma 1), establishing a reward-sensitive complexity measure that we refer to as the maximum expected hitting cost (MEHC, Definition 2), which tightens the regret bounds of UCRL2 and similar algorithms by replacing diameter (Theorem 1).\nNext, with respect to this new complexity measure, we describe a notion of reward informativeness (Section 2.4). 
Intuitively speaking, in an environment, the same desired policies can be motivated by different (immediate) rewards. These differing definitions of rewards can be more or less informative of useful actions, i.e., of actions yielding high long-term rewards. To formalize this intuition, we study a way to reparametrize rewards via potential-based reward shaping (PBRS) [NHR99] that can produce different rewards with the same near-optimal policies (Section 2.5). We show that the MEHC changes under reparametrization by PBRS and, in turn, so do regret and sample complexity, substantiating this notion of informativeness. Lastly, we study the extent of its impact. In particular, we show that there is a factor-of-two limit on its impact on MEHC in a large class of MDPs (Theorem 2). This result and the concept of reward informativeness may be useful for a task designer crafting a reward function (Section 3). The detailed proofs are deferred to Appendix A.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe main contributions of this work are two-fold:\n\n\u2022 We propose a new MDP structural parameter, maximum expected hitting cost (MEHC), that accounts for both transitions and rewards. This parameter replaces diameter in the regret bounds of several model-based RL algorithms.\n\n\u2022 We show that potential-based reward shaping can change the maximum expected hitting cost of an MDP and thus the regret bound. This results in a set of equivalent MDPs with different learning difficulties as measured by regret. Moreover, we show that their MEHCs differ by a factor of at most two in a large class of MDPs.\n\n1.1 Related work\n\nThis work is closely related to the study of diameter as an MDP complexity measure [JOA10], which is prevalent in the regret bounds of RL algorithms in the average reward setting [FLP19]. 
As noted by Jaksch, Ortner, and Auer [JOA10], unlike some previous measures of MDP complexity such as the return mixing time [KS02; BT02], diameter depends only on the transitions, but not the rewards. The core reason for the presence of diameter in the regret analysis is that it upper bounds the optimal value span of the extended MDP that summarizes the observations (Section 2.3 and Equation 8). We review and update this observation with a reward-dependent parameter we call the maximum expected hitting cost (Lemma 1). Interestingly, the gap between diameter and MEHC can be arbitrarily large, $\kappa(M) \le r_{\max} D(M)$; there are MDPs with finite MEHC and infinite diameter. These MDPs are non-communicating but have saturated optimal average rewards $\rho^*(M) = r_{\max}$. Intuitively, there is a state $s$ in these MDPs from which the learner cannot visit some other state $s'$, but can nonetheless achieve the maximum possible average reward, thus allowing for good regret guarantees; the unreachable states will not seem better than the reachable ones under the principle of optimism in the face of uncertainty (OFU). We will use UCRL2 [JOA10] as an example algorithm throughout the rest of the article; however, the main results do not depend on it. In particular, with MEHC, its regret bounds are updated (Theorem 1).\nAnother important comparison is with optimal bias span [Put94; BT09; Fru+18], a reward-dependent parameter of MDPs. Here, we again find that the gap can be arbitrarily large, $\mathrm{sp}(M) \le \kappa(M)$.[1] These non-communicating MDPs would have unsaturated optimal average rewards $\rho^*(M) < r_{\max}$. But as shown elsewhere [FPL18; Fru+18], extra knowledge of (some upper bound on) the optimal bias span is necessary for an algorithm to enjoy a regret that scales with this smaller parameter. 
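The claim that the diameter/MEHC gap can be arbitrarily large can be checked numerically. Below is a minimal sketch (ours, not the authors' code), assuming a hypothetical two-state MDP in which state 0 self-loops with reward $r_{\max} = 1$ and state 1 is unreachable: value iteration on the expected hitting time diverges (infinite diameter), while value iteration on the expected hitting cost, whose per-step cost is $r_{\max} - \bar{r}$, stays at zero.

```python
# Sketch (assumed 2-state MDP, not from the paper): value iteration for the
# minimum expected hitting time/cost. State 1 is unreachable from state 0,
# and state 0 self-loops with reward r_max = 1.

def hitting_value(P, cost, target, iters):
    """Approximate min over policies of E[sum of per-step costs before
    hitting `target`], by `iters` rounds of value iteration."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [0.0 if s == target else
             min(cost[s][a] + sum(p * v[s2] for s2, p in enumerate(P[s][a]))
                 for a in range(len(P[s])))
             for s in range(n)]
    return v

r_max = 1.0
P = [[[1.0, 0.0]],       # state 0: a single action that stays in state 0
     [[0.0, 1.0]]]       # state 1: absorbing, never reached from state 0
mean_r = [[1.0], [1.0]]  # state 0 always yields r_max

# Hitting *cost* uses per-step cost r_max - mean reward; hitting *time* uses 1.
cost_mehc = [[r_max - mean_r[s][a] for a in range(len(P[s]))] for s in range(2)]
cost_time = [[1.0 for _ in P[s]] for s in range(2)]

print(hitting_value(P, cost_mehc, target=1, iters=1000)[0])  # 0.0
print(hitting_value(P, cost_time, target=1, iters=1000)[0])  # 1000.0
```

The hitting-time iterate grows linearly with the number of iterations (the true expectation is infinite), while the hitting-cost iterate stays at zero, matching the statement that $\kappa(M)$ can be finite when $D(M)$ is not.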
In contrast, UCRL2, which scales with MEHC, does not need to know the diameter or MEHC of the actual MDP.\nPotential-based reward shaping [NHR99] was originally proposed as a solution technique for a programmer to influence the sample complexity of their reinforcement learning algorithm, without changing the near-optimal policies in episodic and discounted settings. Prior theoretical analysis involving PBRS [NHR99; Wie03; WCE03; ALZ08; Grz17] mostly focuses on the consistency of RL against the shaped rewards, i.e., the resulting learned behavior is also (near-)optimal in the original MDP, while suggesting empirically that the sample complexity can be changed by a well specified potential. In this work, we use PBRS to construct $\Pi$-equivalent reward functions in the average reward setting (Section 2.4) and show that two reward functions related by a shaping potential can have different MEHCs, and thus different regrets and sample complexities (Section 2.5). However, a subtle but important technical requirement of $[0, r_{\max}]$-boundedness of MDPs makes it difficult to immediately apply our results (Section 2.5 and Theorem 2) to the treatment of PBRS as a solution technique, because an arbitrary potential function picked without knowledge of the original MDP may not preserve the $[0, r_{\max}]$-boundedness. Nevertheless, we think our work may bring some new perspectives to this topic.\n\n[1] This inequality can be derived as a consequence of Lemma 1: as $N(s, a) \to \infty$, $M^+$ has very tight confidence intervals around the actual transitions and mean rewards of $M$. Observe that the span of $u_i$ is equal to $\mathrm{sp}(M)$ in the limit of $i \to \infty$ [JOA10, remark 8].\n\n2 Results\n\n2.1 Markov decision process\n\nA Markov decision process is defined by the tuple $M = (S, A, p, r)$, where $S$ is the state space, $A$ is the action space, $p : S \times A \to P(S)$ is the transition function, and $r : S \times A \to P([0, r_{\max}])$ is the reward function with mean $\bar{r}(s, a) := E[r(s, a)]$. We assume that the state and action spaces are finite, with sizes $S := |S|$ and $A := |A|$, respectively. At each time step $t = 0, 1, 2, \ldots$, an algorithm $L$ chooses an action $a_t \in A$ based on the observations up to that point. The state transitions to $s_{t+1}$ with probability $p(s_{t+1}|s_t, a_t)$ and a reward $r_t \in [0, r_{\max}]$ is drawn according to the distribution $r(s_t, a_t)$.[2] The transition probabilities and reward function of the MDP are unknown to the learner. The sequence of random variables $(s_t, a_t, r_t)_{t \ge 0}$ forms a stochastic process. Note that a stationary deterministic policy $\pi : S \to A$ is a restrictive type of algorithm whose action $a_t$ depends only on $s_t$. We refer to stationary deterministic policies as policies in the rest of the paper.\nRecall that in a Markov chain, the hitting time of state $s'$ starting at state $s$ is a random variable $h_{s \to s'} := \inf\{t \in \mathbb{N}_0 : s_t = s' \text{ and } s_0 = s\}$[3] [LPW08].\nDefinition 1 (Diameter, [JOA10]). Suppose in the stochastic process induced by following a policy $\pi$ in MDP $M$, the time to hit state $s'$ starting at state $s$ is $h_{s \to s'}(M, \pi)$. We define the diameter of $M$ to be\n\n$D(M) := \max_{s, s' \in S} \min_{\pi : S \to A} E[h_{s \to s'}(M, \pi)]$.\n\nWe incorporate rewards into diameter, and introduce a novel MDP parameter.\nDefinition 2 (Maximum expected hitting cost). 
We define the maximum expected hitting cost of a Markov decision process $M$ to be\n\n$\kappa(M) := \max_{s, s' \in S} \min_{\pi : S \to A} E\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\right]$.\n\nObserve that MEHC is a smaller parameter, that is, $\kappa(M) \le r_{\max} D(M)$, since for any $s, s', \pi$, we have $r_{\max} - r_t \le r_{\max}$.\n\n2.2 Average reward criterion, and regret\n\nThe accumulated reward of an algorithm $L$ after $T$ time steps in MDP $M$ starting at state $s$ is a random variable\n\n$R(M, L, s, T) := \sum_{t=0}^{T-1} r_t$.\n\nWe define the average reward or gain [Put94] as\n\n$\rho(M, L, s) := \lim_{T \to \infty} \tfrac{1}{T} E[R(M, L, s, T)]$.   (1)\n\nWe will evaluate policies by their average reward. This can be maximized by a stationary deterministic policy, and we define the optimal average reward of $M$ starting at state $s$ as\n\n$\rho^*(M, s) := \max_{\pi : S \to A} \rho(M, \pi, s)$.   (2)\n\nFurthermore, we assume the optimal average reward starting at any state to be the same, i.e., $\rho^*(M, s) = \max_{s'} \rho^*(M, s')$ for any state $s$. This is a natural requirement of an MDP in the online setting to allow for any hope of a vanishing regret. Otherwise, the learner may take actions leading to states with a lower optimal average reward due to ignorance and incur linear regret when compared with the optimal policy starting at the initial state. In particular, this condition is true for communicating MDPs [Put94] by virtue of their transitions, but it is also possible for non-communicating MDPs with appropriate rewards.\n\n[2] It is important to assume that the support of rewards lies in a known bounded interval, often $[0, 1]$ by convention. This is sometimes referred to as a bounded MDP in the literature. Analogous to bandits, the details of the reward distribution are often unimportant, and it suffices to specify an MDP with the mean rewards $\bar{r}$.\n\n[3] 0-indexing ensures that $h_{s \to s} = 0$. Note also that by convention, $\inf \varnothing = \infty$. 
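To make Definitions 1 and 2 concrete, here is a small numeric sketch (ours, not the authors' code) computing $\kappa(M)$ by value iteration on expected hitting costs, for the two-state MDP used later as the toy example (Figure 1, Section 2.5): action $a_1$ stays put, action $a_2$ switches states with probability $\varepsilon$, and the mean reward of both actions is $1 - \alpha$ in state $s_1$ and $1 - \beta$ in state $s_2$, with $r_{\max} = 1$; the text there concludes $\kappa(M) = \alpha/\varepsilon$.

```python
# Sketch (2-state MDP of Figure 1, not the authors' code): kappa(M) is the
# max over ordered state pairs of the minimum expected hitting cost, where
# the per-step cost is r_max - mean reward.

def min_hitting_cost(P, cost, target, iters=5000):
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [0.0 if s == target else
             min(cost[s][a] + sum(p * v[s2] for s2, p in enumerate(P[s][a]))
                 for a in range(len(P[s])))
             for s in range(n)]
    return v

alpha, beta, eps, r_max = 0.11, 0.10, 0.05, 1.0
P = [
    [[1.0, 0.0], [1.0 - eps, eps]],  # s1: a1 stays, a2 switches w.p. eps
    [[0.0, 1.0], [eps, 1.0 - eps]],  # s2: a1 stays, a2 switches w.p. eps
]
mean_r = [[1 - alpha, 1 - alpha], [1 - beta, 1 - beta]]
cost = [[r_max - mean_r[s][a] for a in range(2)] for s in range(2)]

kappa = max(min_hitting_cost(P, cost, t)[s] for s in range(2) for t in range(2))
print(round(kappa, 6))  # 2.2, i.e., alpha / eps
```

The costly pair is $s_1 \to s_2$: paying $\alpha$ per step for an expected $1/\varepsilon$ steps gives $\alpha/\varepsilon = 2.2$.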
We will write $\rho^*(M) := \max_{s'} \rho^*(M, s')$.\nWe will compete with the expected cumulative reward of an optimal policy on its trajectory, and define the regret of a learning algorithm $L$ starting at state $s$ after $T$ time steps as\n\n$\Delta(M, L, s, T) := T \rho^*(M) - R(M, L, s, T)$.   (3)\n\n2.3 Optimism in the face of uncertainty, extended MDP, and UCRL2\n\nThe principle of optimism in the face of uncertainty (OFU) [SB98] states that for uncertain state-action pairs, i.e., those that we have not visited enough up to this point, we should be optimistic about their outcomes. The intuition for doing so is that, by taking reward-maximizing actions with respect to this optimistic model (in terms of both transitions and immediate rewards for these uncertain state-action pairs), we will have no regret if the optimism is well placed and will otherwise quickly learn more about these suboptimal state-action pairs to avoid them in the future. This fruitful idea has been the basis for many model-based RL algorithms [FLP19] and, in particular, UCRL2 [JOA10], which keeps track of the statistical uncertainty via upper confidence bounds.\nSuppose we have visited a particular state-action pair $(s, a)$ $N(s, a)$-many times. With confidence at least $1 - \delta$, we can establish a confidence interval for both its mean reward $\bar{r}(s, a)$ and its transition $p(\cdot|s, a)$ from the Chernoff-Hoeffding inequality (or Bernstein's, [FPL18]). Let $b(\delta, n) \in \mathbb{R}$ be the $\delta$-confidence bound after observing $n$ i.i.d. samples of a $[0, 1]$-bounded random variable, $\hat{r}(s, a)$ the empirical mean of $r(s, a)$, and $\hat{p}(\cdot|s, a)$ the empirical transition of $p(\cdot|s, a)$. 
The statistically plausible mean rewards are\n\n$B(s, a) := \{r' \in \mathbb{R} : |r' - \hat{r}(s, a)| \le r_{\max}\, b(\delta, N(s, a))\} \cap [0, r_{\max}]$\n\nand the statistically plausible transitions are\n\n$C(s, a) := \{p' \in P(S) : \|p'(\cdot) - \hat{p}(\cdot|s, a)\|_1 \le b(\delta, N(s, a))\}$.\n\nWe define an extended MDP $M^+ := (S, A^+, p^+, r^+)$ to summarize these statistics [GLD00; SL05; TB07; JOA10], where $S$ is the same state space as in $M$, the action space $A^+$ is a union over state-specific actions\n\n$A^+_s := \{(a, p', r') : a \in A,\, p' \in C(s, a),\, r' \in B(s, a)\}$,   (4)\n\nwhere $A$ is the same action space as in $M$, $p^+$ is the transitions according to the selected distribution $p'$,\n\n$p^+(\cdot\,|\,s, (a, p', r')) := p'(\cdot)$,   (5)\n\nand $r^+$ is the rewards according to the selected mean reward $r'$,\n\n$r^+(s, (a, p', r')) := r'$.   (6)\n\nIt is not hard to see that $M^+$ is indeed an MDP with an infinite but compact action space.\nBy OFU, we want to find an optimal policy for an optimistic MDP within the set of statistically plausible MDPs. As observed in [JOA10], this is equivalent to finding an optimal policy $\pi^+ : S \to A^+$ in the extended MDP $M^+$, which specifies a policy in $M$ via $\pi(s) := \pi_1(\pi^+(s))$, where $\pi_i$ is the projection map onto the $i$-th coordinate (and an optimistic MDP $\tilde{M} = (S, A, \tilde{p}, \tilde{r})$ via transitions $\tilde{p}(\cdot|s, \pi(s)) := \pi_2(\pi^+(s))$ and mean rewards $\tilde{r}(s, \pi(s)) := \pi_3(\pi^+(s))$ over actions selected by $\pi$[4]).\nBy construction of the extended MDP $M^+$, $M$ is in $M^+$ with high confidence, i.e., $\bar{r}(s, a) \in B(s, a)$ and $p(\cdot|s, a) \in C(s, a)$ for all $s \in S$, $a \in A$. At the heart of UCRL2-type regret analysis, there is a key observation [JOA10, equation (11)]: we can bound the span of optimal values in the extended MDP $M^+$ by the diameter of the actual MDP $M$, under the condition that $M$ is in $M^+$. This observation is needed to characterize how good following the \u201coptimistic\u201d policy $\pi_1(\pi^+)$ in the actual MDP $M$ is. 
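Optimizing over the compact action set $A^+_s$ involves, for each $(s, a)$, maximizing the expected next-state value over the $L^1$ ball $C(s, a)$. This admits a greedy solution, in the spirit of extended value iteration [JOA10, section 3.1]: shift as much probability mass as the ball allows toward the highest-value state, removing it from the lowest-value states. An illustrative sketch (ours, not the authors' code; the specific mass-shifting details are our reading of [JOA10, Figure 2]):

```python
# Greedy inner maximization of sum_s' p'(s') u(s') over the L1 ball
# ||p' - p_hat||_1 <= b. Illustrative sketch, not the authors' code.

def max_plausible_transition(p_hat, u, b):
    n = len(p_hat)
    best = max(range(n), key=lambda s: u[s])
    p = list(p_hat)
    p[best] = min(1.0, p_hat[best] + b / 2.0)       # boost the best state
    excess = sum(p) - 1.0
    for s in sorted(range(n), key=lambda s: u[s]):  # shave worst states first
        if excess <= 0.0:
            break
        if s == best:
            continue
        shaved = min(p[s], excess)
        p[s] -= shaved
        excess -= shaved
    return p

p_hat = [0.5, 0.3, 0.2]  # empirical transition estimate
u = [0.0, 1.0, 2.0]      # current value estimates
p_opt = max_plausible_transition(p_hat, u, b=0.2)
print([round(x, 6) for x in p_opt])                         # [0.4, 0.3, 0.3]
print(round(sum(p * v for p, v in zip(p_opt, u)), 6))       # 0.9
```

The resulting $p'$ sits on the boundary of the ball ($\|p' - \hat{p}\|_1 = b$) and raises the expected value from 0.7 under $\hat{p}$ to 0.9.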
For $i \ge 0$, the $i$-step optimal value $u_i(s)$ of $M^+$ is the expected total reward obtained by following an optimal non-stationary $i$-step policy starting at state $s \in S$. We can also define them recursively (via dynamic programming[5])\n\n$u_0(s) := 0$\n\n$u_{i+1}(s) := \max_{(a, p', r') \in A^+_s} \left[ r^+(s, (a, p', r')) + \sum_{s'} p^+(s'|s, (a, p', r'))\, u_i(s') \right]$\n\n$= \max_{(a, p', r') \in A^+_s} \left[ r' + \sum_{s'} p'(s')\, u_i(s') \right]$   (by (5) and (6))\n\n$= \max_{a \in A} \left[ \max_{r' \in B(s, a)} r' + \max_{p' \in C(s, a)} \sum_{s'} p'(s')\, u_i(s') \right]$.   (by (4))   (7)\n\n[4] We can set transitions and mean rewards over actions $a \ne \pi(s)$ to $\hat{p}$ and $\hat{r}$, respectively.\n\nWe are now ready to restate the observation. If $M$ is in $M^+$, which happens with high probability, Jaksch, Ortner, and Auer [JOA10] observe that\n\n$\max_s u_i(s) - \min_{s'} u_i(s') \le r_{\max} D(M)$.   (8)\n\nHowever, this bound is too conservative because it fails to account for the rewards collected. By patching this, we tighten the upper bound with MEHC.\nLemma 1 (MEHC upper bounds the span of values). Assuming that the actual MDP $M$ is in the extended MDP $M^+$, i.e., $\bar{r}(s, a) \in B(s, a)$ and $p(\cdot|s, a) \in C(s, a)$ for all $s \in S$, $a \in A$, we have\n\n$\max_s u_i(s) - \min_{s'} u_i(s') \le \kappa(M)$,\n\nwhere $u_i(s)$ is the $i$-step optimal undiscounted value of state $s$.\nThis refined upper bound immediately plugs into the main theorems of [JOA10, equations 19 and 22, theorem 2].\nTheorem 1 (Reward-sensitive regret bound of UCRL2). 
With probability of at least $1 - \delta$, for any initial state $s$, any $T > 1$, and $\kappa := \kappa(M)$, the regret of UCRL2 is bounded by\n\n$\Delta(M, \mathrm{UCRL2}, s, T) \le \kappa \sqrt{\tfrac{5}{2} T \log\left(\tfrac{8T}{\delta}\right)} + \sqrt{T} + \kappa S A \log_2\left(\tfrac{8T}{SA}\right) + \left( \kappa \sqrt{14 S \log\left(\tfrac{2AT}{\delta}\right)} + \sqrt{14 \log\left(\tfrac{2SAT}{\delta}\right)} + 2 \right) (\sqrt{2} + 1) \sqrt{SAT} \le 34 \max\{1, \kappa\}\, S \sqrt{A T \log\left(\tfrac{T}{\delta}\right)}$.\n\nAs a corollary, Theorem 1 implies that UCRL2 offers $O\left(\tfrac{\kappa^2 S^2 A}{\varepsilon^2} \log \tfrac{\kappa S A}{\delta \varepsilon}\right)$ sample complexity [Kak03], by inverting the regret bound and demanding that the per-step regret is at most $\varepsilon$ with probability of at least $1 - \delta$ [JOA10, corollary 3]. Similarly, we have an updated logarithmic bound on the expected regret [JOA10, theorem 4], $E[\Delta(M, \mathrm{UCRL2}, s, T)] = O\left(\tfrac{\kappa^2 S^2 A \log T}{g}\right)$, where $g$ is the gap in average reward between the best policy and the second best policy.\n\n[5] In fact, the exact maximization of Equation 7 can be found via extended value iteration [JOA10, section 3.1].\n\n2.4 Informativeness of rewards\n\nInformally, it is not hard to appreciate the challenge imposed by the delayed feedback inherent in MDPs, as actions with high immediate rewards do not necessarily lead to a high optimal value. Are there different but \u201cequivalent\u201d reward functions that differ in their informativeness, with the more informative ones being easier to reinforcement learn? Suppose we have two MDPs differing only in their rewards, $M_1 = (S, A, p, r_1)$ and $M_2 = (S, A, p, r_2)$; then they will have the same diameters $D(M_1) = D(M_2)$ and thus the same diameter-dependent regret bounds from previous works. With MEHC, however, we may get a more meaningful answer.\nFirstly, let us make precise a notion of equivalence. We say that $r_1$ and $r_2$ are $\Pi$-equivalent if for any policy $\pi : S \to A$, its average rewards are the same under the two reward functions: $\rho(M_1, \pi, s) = \rho(M_2, \pi, s)$. 
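This notion of $\Pi$-equivalence can be sanity-checked numerically for the potential-based construction studied next (Section 2.5): shaping changes each state's mean reward by $-\phi(s)$ plus the expected next-state potential, and the stationary average of that change is zero by telescoping. A small sketch with a made-up two-state chain and potential (ours, not from the paper):

```python
# Numeric check (made-up chain and potential, not from the paper): the
# average reward of a fixed policy is invariant under potential-based
# shaping of the mean rewards.

def stationary(P, iters=10000):
    """Power iteration for the stationary distribution of the chain P."""
    mu = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        mu = [sum(mu[s] * P[s][s2] for s in range(len(P)))
              for s2 in range(len(P))]
    return mu

P = [[0.9, 0.1], [0.2, 0.8]]  # transition matrix induced by some fixed policy
r = [0.3, 0.7]                # mean reward of each state under that policy
phi = [0.0, 5.0]              # an arbitrary potential

# Shaped mean reward: r(s) - phi(s) + E[phi(next state)]
r_shaped = [r[s] - phi[s] + sum(P[s][s2] * phi[s2] for s2 in range(2))
            for s in range(2)]

mu = stationary(P)
gain = sum(m * x for m, x in zip(mu, r))
gain_shaped = sum(m * x for m, x in zip(mu, r_shaped))
print(round(gain, 9), round(gain_shaped, 9))  # the two averages agree
```

Incidentally, the shaped mean reward at the second state here is negative, which illustrates the $[0, r_{\max}]$-boundedness caveat raised in Section 1.1: an arbitrary potential need not keep the shaped rewards in the required interval.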
Formally, we will study the MEHC of a class of $\Pi$-equivalent reward functions related via a potential.\n\n2.5 Potential-based reward shaping\n\nOriginally introduced by Ng, Harada, and Russell [NHR99], potential-based reward shaping (PBRS) takes a potential $\phi : S \to \mathbb{R}$ and defines shaped rewards\n\n$r^\phi_t := r_t - \phi(s_t) + \phi(s_{t+1})$.   (9)\n\nWe can think of the stochastic process $(s_t, a_t, r^\phi_t)_{t \ge 0}$ as being generated from an MDP $M^\phi = (S, A, p, r^\phi)$ with reward function $r^\phi : S \times A \to P([0, r_{\max}])$[6] whose mean rewards are\n\n$\bar{r}^\phi(s, a) = \bar{r}(s, a) - \phi(s) + E_{s' \sim p(\cdot|s, a)}[\phi(s')]$.\n\nIt is easy to check that $r^\phi$ and $r$ are indeed $\Pi$-equivalent. For any policy $\pi$,\n\n$\rho(M^\phi, \pi, s) = \lim_{T \to \infty} \tfrac{1}{T} E[R(M^\phi, \pi, s, T)]$\n$= \lim_{T \to \infty} \tfrac{1}{T} E\left[\sum_{t=0}^{T-1} r^\phi_t\right]$\n$= \lim_{T \to \infty} \tfrac{1}{T} E\left[\sum_{t=0}^{T-1} \left(r_t - \phi(s_t) + \phi(s_{t+1})\right)\right]$\n$= \lim_{T \to \infty} \tfrac{1}{T} E\left[-\phi(s_0) + \phi(s_T) + \sum_{t=0}^{T-1} r_t\right]$   (by telescoping sums of potential terms over consecutive $t$)\n$= \lim_{T \to \infty} \tfrac{1}{T} \left(-\phi(s) + E[\phi(s_T)] + E[R(M, \pi, s, T)]\right)$\n$= \lim_{T \to \infty} \tfrac{1}{T} E[R(M, \pi, s, T)]$   (the first two terms vanish in the limit)\n$= \rho(M, \pi, s)$.   (10)\n\nTo get some intuition, it is instructive to consider a toy example (Figure 1). Suppose $0 < \beta < \alpha$ and $\varepsilon \in (0, 1)$; then the optimal average reward in this MDP is $1 - \beta$, and the optimal stationary deterministic policy is $\pi^*(s_1) := a_2$ and $\pi^*(s_2) := a_1$, as staying in state $s_2$ yields the highest average reward. As the expected number of steps needed to transition from state $s_1$ to $s_2$ and vice versa are both $1/\varepsilon$ via action $a_2$, we conclude that $\kappa(M) = \max\{\alpha, \alpha/\varepsilon, \beta/\varepsilon, \beta\} = \alpha/\varepsilon$. Furthermore, notice that taking action $a_2$ in either state transitions to the other state with probability $\varepsilon$; however, the immediate rewards are the same as taking the alternative action $a_1$ to stay in the current state\u2014the immediate rewards are not informative. 
We can differentiate the actions better by shaping with a potential of $\phi(s_1) := 0$ and $\phi(s_2) := (\alpha - \beta)/(2\varepsilon)$. The shaped mean rewards become, at $s_1$,\n\n$\bar{r}^\phi(s_1, a_2) = 1 - \alpha - \phi(s_1) + \varepsilon\, \phi(s_2) + (1 - \varepsilon)\, \phi(s_1) = 1 - (\alpha + \beta)/2 > 1 - \alpha = \bar{r}^\phi(s_1, a_1)$\n\n[6] One needs to ensure that $\phi$ respects the $[0, r_{\max}]$-boundedness of $M$.\n\n[Figure 1 diagram: two states $s_1$ and $s_2$; at either state, action $a_1$ stays put with probability 1 and action $a_2$ moves to the other state with probability $\varepsilon$ (staying with probability $1 - \varepsilon$); the mean reward of both actions is $1 - \alpha$ at $s_1$ and $1 - \beta$ at $s_2$.]\nFigure 1: Circular nodes represent states and square nodes represent actions. The solid edges are labeled by the transition probabilities and the dashed edges are labeled by the mean rewards. Furthermore, $r_{\max} = 1$. For concreteness, one can consider setting $\alpha = 0.11$, $\beta = 0.1$, $\varepsilon = 0.05$.\n\nand at $s_2$,\n\n$\bar{r}^\phi(s_2, a_2) = 1 - \beta - \phi(s_2) + \varepsilon\, \phi(s_1) + (1 - \varepsilon)\, \phi(s_2) = 1 - (\alpha + \beta)/2 < 1 - \beta = \bar{r}^\phi(s_2, a_1)$.\n\nThis encourages taking action $a_2$ at state $s_1$ and discourages taking action $a_2$ at state $s_2$ simultaneously. The maximum expected hitting cost becomes smaller:\n\n$\kappa(M^\phi) = \max\left\{\alpha,\, \beta,\, \phi(s_1) - \phi(s_2) + \tfrac{\alpha}{\varepsilon},\, \phi(s_2) - \phi(s_1) + \tfrac{\beta}{\varepsilon}\right\} = \max\left\{\alpha,\, \beta,\, \tfrac{\alpha + \beta}{2\varepsilon}\right\} = \tfrac{\alpha + \beta}{2\varepsilon} < \tfrac{\alpha}{\varepsilon} = \kappa(M)$.\n\nIn this example, MEHC is halved at best, when $\beta$ is made arbitrarily close to zero. Noting that the original MDP $M$ is equivalent to $M^\phi$ shaped with potential $-\phi$, i.e., $M = (M^\phi)^{-\phi}$ from (9), we see that MEHC can be almost doubled. It turns out that halving or doubling the MEHC is the most PBRS can do in a large class of MDPs.\nTheorem 2 (MEHC under PBRS). 
Given an MDP $M$ with finite maximum expected hitting cost $\kappa(M) < \infty$ and an unsaturated optimal average reward $\rho^*(M) < r_{\max}$, the maximum expected hitting cost of any PBRS-parameterized MDP $M^\phi$ is bounded within a multiplicative factor of two:\n\n$\tfrac{1}{2} \kappa(M) \le \kappa(M^\phi) \le 2 \kappa(M)$.\n\nThe key observation is that the expected total reward along a loop remains unchanged by shaping, which originally motivated PBRS [NHR99]. To see this, consider a loop as a concatenation of two paths, one from $s$ to $s'$ and the other from $s'$ back to $s$. Under the shaping of a potential $\phi$, the expected total reward of the former is increased by $\phi(s') - \phi(s)$ and that of the latter is decreased by the same amount. For more details, see Appendix A.2.\n\n3 Discussion\n\nIf we view RL as an engineering tool that \u201ccompiles\u201d an arbitrary reward function into a behavior (as represented by a policy) in an environment, then a programmer\u2019s primary responsibility would be to craft a reward function that faithfully expresses the intended goal. However, this problem of reward design is complicated by practical concerns about the difficulty of learning. As recognized by Kober, Bagnell, and Peters [KBP13, section 3.4], \u201c[t]here is also a trade-off between the complexity of the reward function and the complexity of the learning problem.\u201d\nAccurate rewards are often easy to specify in a sparse manner (reaching a position, capturing the king, etc.), and are thus hard to learn from, whereas dense rewards, providing more feedback, are harder to specify accurately, leading to incorrectly trained behaviors. The recent rise of deep RL also exposes \u201cbugs\u201d in some of these designed rewards [CA16]. 
Our results show that the informativeness of rewards, an aspect of \u201cthe complexity of the learning problem\u201d, can be controlled to some extent by a well specified potential without inadvertently changing the intended behaviors of the original reward. Therefore, we propose to separate the definitional concern from the training concern. Rewards should first be defined to faithfully express the intended task, and then any extra knowledge can be incorporated via a shaping potential to reduce the sample complexity of training to obtain the same desired behaviors. That is not to say that it is generally easy to find a helpful potential that makes the rewards more informative.\nThough Theorem 2 might be a disappointing result for PBRS, we wish to emphasize that this result most directly concerns algorithms whose regret scales with MEHC, such as UCRL2. It is conceivable that in a different setting, such as discounted total reward, or for a different RL algorithm, such as SARSA with epsilon-greedy exploration [NHR99, footnote 4], PBRS might have a greater impact on learning efficiency.\n\nAcknowledgments\n\nThis work was supported in part by the National Science Foundation under Grant No. 1830660. We thank Avrim Blum for many insightful comments. In particular, his challenge to find a better example has led to Theorem 2. We also thank Ronan Fruit for a discussion of a concept similar to the proposed maximum expected hitting cost, which he independently developed in his thesis draft.\n\nReferences\n\n[ALZ08] John Asmuth, Michael L. Littman, and Robert Zinkov. \u201cPotential-based Shaping in Model-based Reinforcement Learning\u201d. In: Proceedings of the National Conference on Artificial Intelligence (AAAI). 2008, pp. 604\u2013609.\n[BT02] Ronen I. Brafman and Moshe Tennenholtz. \u201cR-max - a general polynomial time algorithm for near-optimal reinforcement learning\u201d. In: Journal of Machine Learning Research 3.Oct (2002), pp. 213\u2013231.\n[BT09] Peter L. Bartlett and Ambuj Tewari. \u201cREGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs\u201d. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2009, pp. 35\u201342.\n[CA16] Jack Clark and Dario Amodei. Faulty Reward Functions in the Wild. 2016. URL: https://openai.com/blog/faulty-reward-functions/ (visited on 10/27/2019).\n[FLP19] Ronan Fruit, Alessandro Lazaric, and Matteo Pirotta. \u201cExploration-Exploitation in Reinforcement Learning\u201d. Tutorial at the Algorithmic Learning Theory conference. 2019. URL: https://rlgammazero.github.io/docs/2019_ALT_exptutorial.pdf.\n[FPL18] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. \u201cNear optimal exploration-exploitation in non-communicating Markov decision processes\u201d. In: Advances in Neural Information Processing Systems (NeurIPS). 2018, pp. 2994\u20133004.\n[Fru+18] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. \u201cEfficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning\u201d. In: Proceedings of the International Conference on Machine Learning (ICML). 2018, pp. 1573\u20131581.\n[GLD00] Robert Givan, Sonia Leach, and Thomas Dean. \u201cBounded-parameter Markov decision processes\u201d. In: Artificial Intelligence 122.1\u20132 (2000), pp. 71\u2013109.\n[Grz17] Marek Grze\u015b. \u201cReward shaping in episodic reinforcement learning\u201d. In: Proceedings of the International Conference on Autonomous Agents and MultiAgent Systems (AAMAS). International Foundation for Autonomous Agents and Multiagent Systems, 2017, pp. 565\u2013573.\n[JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. \u201cNear-optimal regret bounds for reinforcement learning\u201d. In: Journal of Machine Learning Research 11.Apr (2010), pp. 1563\u20131600.\n[Kak03] Sham Machandranath Kakade. \u201cOn the sample complexity of reinforcement learning\u201d. PhD thesis. 2003.\n[KBP13] Jens Kober, J. Andrew Bagnell, and Jan Peters. \u201cReinforcement learning in robotics: A survey\u201d. In: The International Journal of Robotics Research 32.11 (2013), pp. 1238\u20131274.\n[KS02] Michael Kearns and Satinder Singh. \u201cNear-optimal reinforcement learning in polynomial time\u201d. In: Machine Learning 49.2\u20133 (2002), pp. 209\u2013232.\n[LPW08] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008.\n[NHR99] Andrew Y. Ng, Daishi Harada, and Stuart Russell. \u201cPolicy invariance under reward transformations: Theory and application to reward shaping\u201d. In: Proceedings of the International Conference on Machine Learning (ICML). Vol. 99. 1999, pp. 278\u2013287.\n[Put94] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.\n[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.\n[SL05] Alexander L. Strehl and Michael L. Littman. \u201cA theoretical analysis of model-based interval estimation\u201d. In: Proceedings of the International Conference on Machine Learning (ICML). ACM, 2005, pp. 856\u2013863.\n[TB07] Ambuj Tewari and Peter L. Bartlett. \u201cBounded parameter Markov decision processes with average reward criterion\u201d. In: International Conference on Computational Learning Theory (COLT). Springer, 2007, pp. 263\u2013277.\n[WCE03] Eric Wiewiora, Garrison W. Cottrell, and Charles Elkan. \u201cPrincipled methods for advising reinforcement learning agents\u201d. In: Proceedings of the International Conference on Machine Learning (ICML). 2003, pp. 792\u2013799.\n[Wie03] Eric Wiewiora. \u201cPotential-based shaping and Q-value initialization are equivalent\u201d. In: Journal of Artificial Intelligence Research 19 (2003), pp. 205\u2013208.\n", "award": [], "sourceid": 4174, "authors": [{"given_name": "Falcon", "family_name": "Dai", "institution": "TTI-Chicago"}, {"given_name": "Matthew", "family_name": "Walter", "institution": "TTI-Chicago"}]}