{"title": "How hard is my MDP?\" The distribution-norm to the rescue\"", "book": "Advances in Neural Information Processing Systems", "page_first": 1835, "page_last": 1843, "abstract": "In Reinforcement Learning (RL), state-of-the-art algorithms require a large number of samples per state-action pair to estimate the transition kernel $p$. In many problems, a good approximation of $p$ is not needed. For instance, if from one state-action pair $(s,a)$, one can only transit to states with the same value, learning $p(\\cdot|s,a)$ accurately is irrelevant (only its support matters). This paper aims at capturing such behavior by defining a novel hardness measure for Markov Decision Processes (MDPs) we call the {\\em distribution-norm}. The distribution-norm w.r.t.~a measure $\\nu$ is defined on zero $\\nu$-mean functions $f$ by the standard variation of $f$ with respect to $\\nu$. We first provide a concentration inequality for the dual of the distribution-norm. This allows us to replace the generic but loose $||\\cdot||_1$ concentration inequalities used in most previous analysis of RL algorithms, to benefit from this new hardness measure. We then show that several common RL benchmarks have low hardness when measured using the new norm. The distribution-norm captures finer properties than the number of states or the diameter and can be used to assess the difficulty of MDPs.", "full_text": "\u201cHow hard is my MDP?\u201d\n\nThe distribution-norm to the rescue\n\nOdalric-Ambrym Maillard\nThe Technion, Haifa, Israel\n\nTimothy A. Mann\n\nThe Technion, Haifa, Israel\n\nodalric-ambrym.maillard@ens-cachan.org\n\nmann.timothy@gmail.com\n\nShie Mannor\n\nThe Technion, Haifa, Israel\n\nshie@ee.technion.ac.il\n\nAbstract\n\nIn Reinforcement Learning (RL), state-of-the-art algorithms require a large num-\nber of samples per state-action pair to estimate the transition kernel p. In many\nproblems, a good approximation of p is not needed. 
For instance, if from one state-action pair (s, a), one can only transit to states with the same value, learning p(·|s, a) accurately is irrelevant (only its support matters). This paper aims at capturing such behavior by defining a novel hardness measure for Markov Decision Processes (MDPs) based on what we call the distribution-norm. The distribution-norm w.r.t. a measure ν is defined on zero ν-mean functions f by the standard deviation of f with respect to ν. We first provide a concentration inequality for the dual of the distribution-norm. This allows us to replace the problem-free, loose || · ||_1 concentration inequalities used in most previous analyses of RL algorithms with a tighter problem-dependent hardness measure. We then show that several common RL benchmarks have low hardness when measured using the new norm. The distribution-norm captures finer properties than the number of states or the diameter and can be used to assess the difficulty of MDPs.\n\n1 Introduction\nThe motivation for this paper started with a question: Why is the number of samples needed for Reinforcement Learning (RL) in practice so much smaller than the number given by theory? Can we improve this? In Markov Decision Processes (MDPs, Puterman (1994)), when the performance is measured by (1) the sample complexity (Kearns and Singh, 2002; Kakade, 2003; Strehl and Littman, 2008; Szita and Szepesvári, 2010) or (2) the regret (Bartlett and Tewari, 2009; Jaksch, 2010; Ortner, 2012), algorithms have been developed that achieve provably near-optimal performance. Despite this, one can often solve MDPs in practice with far fewer samples than required by current theory. One possible reason for this disconnect between theory and practice is that the analysis of RL algorithms has focused on bounds that hold for the most difficult MDPs. 
While it is interesting to know how an RL algorithm will perform for the hardest MDPs, most MDPs we want to solve in practice are far from pathological. Thus, we want algorithms (and analyses) that perform appropriately with respect to the hardness of the MDP they are facing.\nA natural way to fill this gap is to formalize a \u201chardness\u201d metric for MDPs and show that MDPs from the literature that were solved with few samples are not \u201chard\u201d according to this metric. For finite-state MDPs, usual metrics appearing in performance bounds of MDPs include the number of states and actions, the maximum of the value function in the discounted setting, and the diameter or sometimes the span of the bias function in the undiscounted setting. They only capture limited properties of the MDP. Our goal in this paper is to propose a more refined notion of hardness.\n\nPrevious work Despite the rich literature on MDPs, there has been surprisingly little work on metrics capturing the difficulty of learning MDPs. In Jaksch (2010), the authors introduce the UCRL algorithm for undiscounted MDPs, whose regret scales with the diameter D of the MDP, a quantity that captures the time to reach any state from any other. In Bartlett and Tewari (2009), the authors modify UCRL to achieve regret that scales with the span of the bias function, which can be arbitrarily smaller than D. The resulting algorithm, REGAL, achieves smaller regret, but it is an open question whether the algorithm can be implemented. Closely related to our proposed solution, in Filippi et al. (2010) the authors provide a modified version of UCRL, called KL-UCRL, that uses modified confidence intervals on the transition kernel based on the Kullback-Leibler divergence rather than || · ||_1 control on the error. The resulting algorithm is reported to work better in practice, although this is not reflected in the theoretical bounds. 
Farahmand (2011) introduced a metric for MDPs called the action-gap. This work is the closest in spirit to our approach. The action-gap captures the difficulty of distinguishing the optimal policy from near-optimal policies, and is complementary to the notion of hardness proposed here. However, the action-gap has mainly been used for planning, instead of learning, which is our main focus. In the discounted setting, several works have improved the bounds with respect to the number of states (Szita and Szepesvári, 2010) and the discount factor (Lattimore and Hutter, 2012). However, these analyses focus on worst-case bounds that do not scale with the hardness of the MDP, missing an opportunity to help bridge the gap between theory and practice.\nContributions Our main contribution is a refined metric for the hardness of MDPs that captures the observed \u201ceasiness\u201d of common benchmark MDPs. To accomplish this we first introduce a norm induced by a distribution ν, a.k.a. the distribution-norm. For functions f with zero ν-expectation, ||f||_ν is the standard deviation of f under ν. We define the dual of this norm in Lemma 1, and then study its concentration properties in Theorem 1. This central result is of independent interest beyond its application in RL. More precisely, for a discrete probability measure p and its empirical version p̂_n built from n i.i.d. samples, we control ||p − p̂_n||_{*,p} in O((n p_0)^{−1/2}), where p_0 is the minimum mass of p on its support. Second, we define a hardness measure for MDPs based on the distribution-norm. This measure captures stochasticity along the value function. This quantity is naturally small in MDPs that are nearly deterministic, but it can also be small in MDPs with highly stochastic transition kernels. For instance, this is the case when all states reachable from a state have the same value. 
We show that some common benchmark MDPs have small hardness measure. This illustrates that our proposed norm is a useful tool for the analysis and design of existing and future RL algorithms.\nOutline In Section 2, we formalize the distribution-norm, and give intuition about the interplay with its dual. We compare to distribution-independent norms. Theorem 1 provides a concentration inequality for the dual of this norm, that is of independent interest beyond the MDP setting. Section 3 uses these insights to define a problem-dependent hardness metric for both undiscounted and discounted MDPs (Definition 2, Definition 1), that we call the environmental norm. Importantly, we show in Section 3.2 that common benchmark MDPs have small environmental norm C in this sense, and compare our bound to approaches bounding the problem-free || · ||_1 norm.\n\n2 The distribution-norm and its dual\nIn Machine Learning (ML), norms often play a crucial role in obtaining performance bounds. One typical example is the following. Let X be a measurable space equipped with an unknown probability measure ν ∈ M1(X) with density p. Based on some procedure, an algorithm produces a candidate measure ν̃ ∈ M1(X) with density p̃. One is then interested in the loss with respect to a continuous function f. It is natural to look at the mismatch between ν and ν̃ on f. That is\n\n(ν − ν̃, f) = ∫_X f(x)(ν − ν̃)(dx) = ∫_X f(x)(p(x) − p̃(x)) dx .\n\nA typical bound on this quantity is obtained by applying a Hölder inequality to f and p − p̃, which gives (ν − ν̃, f) ≤ ||p − p̃||_1 ||f||_∞ . Assuming a bound is known for ||f||_∞, this inequality can be controlled with a bound on ||p − p̃||_1. When X is finite and p̃ is the empirical distribution p̂_n estimated from n i.i.d. samples of p, results such as Weissman et al. (2003) can be applied to bound this term with high probability.\nHowever, in this learning problem, what matters is not f but the way f behaves with respect to ν. Thus, trying to capture the properties of f via the distribution-free ||f||_∞ bound is not satisfactory. So we propose, instead, a norm || · ||_ν driven by ν. A well-behaved f will have small norm ||f||_ν, whereas a badly-behaved f will have large norm ||f||_ν. Every distribution has a natural norm associated with it that measures the quadratic variations of f with respect to ν. This quantity is at the heart of many key results in mathematical statistics, and is formally defined by\n\n||f||_ν = √( ∫_X (f(x) − E_ν f)² ν(dx) ) .  (1)\n\nTo get a norm, we restrict C(X) to the space of continuous functions E_ν = {f ∈ C(X) : ||f||_ν < ∞, supp(ν) ⊂ supp(f), E_ν f = 0} . We then define the corresponding dual space in a standard way by E*_ν = {μ : ||μ||_{*,ν} < ∞} where\n\n||μ||_{*,ν} = sup_{f∈E_ν} ( ∫_X f(x) μ(dx) ) / ||f||_ν .\n\nNote that for f ∈ E_ν, using the fact that ν(X) = ν̃(X) = 1 and that x → f(x) − E_ν f is a zero-mean function, we immediately have\n\n(ν − ν̃, f) = (ν − ν̃, f − E_ν f) ≤ ||p − p̃||_{*,ν} ||f − E_ν f||_ν .  (2)\n\nThe key difference with the generic Hölder inequality is that || · ||_ν is now capturing the behavior of f with respect to ν, as opposed to || · ||_∞. 
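To make the contrast between || · ||_ν and || · ||_∞ concrete, here is a minimal numerical sketch (the distribution and function below are illustrative values of our own, not taken from the paper's experiments):

```python
import numpy as np

def distribution_norm(f, nu):
    """||f||_nu from Eq. (1): the standard deviation of f under a
    discrete distribution nu on {0, ..., S-1}."""
    f = np.asarray(f, dtype=float)
    nu = np.asarray(nu, dtype=float)
    mean = np.dot(nu, f)
    return np.sqrt(np.dot(nu, (f - mean) ** 2))

# f is constant on the support of nu, so it is "well-behaved":
# its distribution-norm vanishes although its centered sup-norm is large.
nu = np.array([0.5, 0.5, 0.0])    # nu puts no mass on the last state
f = np.array([10.0, 10.0, 0.0])   # f varies only outside supp(nu)

print(distribution_norm(f, nu))             # 0.0
print(np.max(np.abs(f - np.dot(nu, f))))    # 10.0
```

With such an f, the problem-dependent bound (2) is tight, while the generic Hölder bound ||p − p̃||_1 ||f||_∞ can be loose by an arbitrary factor.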
Conceptually, using a quadratic norm instead of an L1 norm, as we do here, is analogous to moving from Hoeffding's inequality to Bernstein's inequality in the framework of concentration inequalities.\nWe are interested in situations where ||f||_ν is much smaller than ||f||_∞. That is, f is well-behaved with respect to ν. In such cases, we can get an improved bound ||p − p̃||_{*,ν} ||f − E_ν f||_ν instead of the best possible generic bound inf_{c∈R} ||p − p̃||_1 ||f − c||_∞.\nSimply controlling either ||p − p̃||_{*,ν} (respectively ||p − p̃||_1) or ||f||_ν (respectively ||f||_∞) is not enough. What matters is the product of these quantities. For our choice of norm, we show that ||p − p̃||_{*,ν} concentrates at essentially the same speed as ||p − p̃||_1, but ||f||_∞ is typically much larger than ||f||_ν for the typical functions met in the analysis of MDPs. We do not claim that the norm defined in equation (1) is the best norm that leads to a minimal ||p − p̃||_{*,ν} ||f − E_ν f||_ν, but we show that it is an interesting candidate.\nWe proceed in two steps. First, we design in Section 2 a concentration bound for ||p − p̂_n||_{*,ν} that is not much larger than the Weissman et al. (2003) bound on ||p − p̂_n||_1. (Note that ||p − p̂_n||_{*,ν} must be larger than ||p − p̂_n||_1, as it captures a refined property.) Second, in Section 3, we consider RL in an MDP where p represents the transition kernel of a state-action pair and f represents the value function of the MDP for a policy. The value function and p are strongly linked by construction, and the distribution-norm helps us capture their interplay. We show in Section 3.2 that common benchmark MDPs have optimal value functions with small || · ||_ν norm. 
This naturally introduces a new way to capture the hardness of MDPs, besides the diameter (Jaksch, 2010) or the span (Bartlett and Tewari, 2009). Our formal notion of MDP hardness is summarized in Definitions 1 and 2, for discounted and undiscounted MDPs, respectively.\n\n2.1 A dual-norm concentration inequality\nFor convenience we consider a finite space X = {1, . . . , S} with S points. We focus on the first term on the right-hand side of (2), which corresponds to the dual norm when p̃ = p̂_n is the empirical mean built from n i.i.d. samples from the distribution ν. We denote by p the probability vector corresponding to ν. The following lemma, whose proof is in the supplementary material, provides a convenient way to compute the dual norm.\n\nLemma 1 Assume that X = {1, . . . , S}, and, without loss of generality, that supp(p) = {1, . . . , K}, with K ≤ S. Then the following equality holds true\n\n||p̂_n − p||_{*,p} = √( Σ_{s=1}^K (p̂²_{n,s} − p²_s) / p_s ) .\n\nNow we provide a finite-sample bound on our proposed norm.\n\nTheorem 1 (Main result) Assume that supp(p) = {1, . . . , K}, with K ≤ S. Then for all δ ∈ (0, 1), with probability higher than 1 − δ,\n\n||p̂_n − p||_{*,p} ≤ min{ √(1/p_(K) − 1), √((K − 1)/n) + 2 √( ((2n − 1) ln(1/δ)/n²) (1/p_(K) − 1/p_(1)) ) } ,  (3)\n\nwhere p_(K) is the smallest non-zero component of p = (p_1, . . . , p_S), and p_(1) the largest one.\nThe proof follows an adaptation of Maurer and Pontil (2009) for empirical Bernstein bounds, and uses results for self-bounded functions from the same paper. This gives tighter bounds than naive concentration inequalities (Hoeffding, Bernstein, etc.). 
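As a quick sanity check of Lemma 1 and Theorem 1, the dual norm of the empirical distribution can be simulated directly; this sketch (with toy values of our own) computes ||p̂_n − p||_{*,p} in the equivalent chi-square form Σ_s (p̂_{n,s} − p_s)²/p_s over the support of p, and compares it to the (n p_(K))^{−1/2} scale the theorem predicts:

```python
import numpy as np

def dual_norm(p_hat, p):
    """||p_hat - p||_{*,p}, computed on supp(p) in the equivalent
    chi-square form sqrt(sum_s (p_hat_s - p_s)^2 / p_s)."""
    supp = p > 0
    d = p_hat[supp] - p[supp]
    return np.sqrt(np.sum(d * d / p[supp]))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])       # K = 3 support points, p_(K) = 0.2
n = 1000
p_hat = rng.multinomial(n, p) / n   # empirical distribution from n samples

# Theorem 1 predicts deviations of order (n * p_(K))^(-1/2).
print(dual_norm(p_hat, p))
print(1.0 / np.sqrt(n * p.min()))   # reference scale, ~0.07 here
```

Repeating the draw shows the realized dual norm concentrating at the predicted scale, not at the worst-case √(1/p_(K) − 1).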
We indeed get a O(n^{−1/2}) scaling, whereas using simpler techniques would lead to a weak O(n^{−1/4}) scaling.\nProof We will apply Theorem 7 of Maurer and Pontil (2009). Using the notation of this theorem, we denote the sample by X = (X_1, . . . , X_n) and the function we want to control by\n\nV(X) = ||p̂_n − p||²_{*,p} .\n\nWe now introduce, for any s ∈ S, the modified sample X^{i_0,s} = (X_1, . . . , X_{i_0−1}, s, X_{i_0+1}, . . . , X_n). We are interested in the quantity V(X) − V(X^{i_0,s}). To apply Theorem 7 of Maurer and Pontil (2009), we need to identify constants a, b such that\n\n∀i ∈ [n], V(X) − inf_{s∈S} V(X^{i,s}) ≤ b ,\nΣ_{i=1}^n ( V(X) − inf_{s∈S} V(X^{i,s}) )² ≤ a V(X) .\n\nThe two following lemmas enable us to identify a and b. They follow from simple algebra and are proved in Appendix A in the supplementary material.\n\nLemma 2 V(X) satisfies E_p[V(X)] = (K − 1)/n. Moreover, for all i ∈ {1, . . . , n} we have that\n\nV(X) − inf_{s∈S} V(X^{i,s}) ≤ b , where b = ((2n − 1)/n²) (1/p_(K) − 1/p_(1)) .\n\nLemma 3 V(X) = ||p̂_n − p||²_{*,p} satisfies\n\nΣ_{i=1}^n ( V(X) − inf_{s∈S} V(X^{i,s}) )² ≤ 2b V(X) .\n\nThus, we can choose a = 2b. 
By application of Theorem 7 of Maurer and Pontil (2009) to Ṽ(X) = V(X)/b, we deduce that for all ε > 0,\n\nP( Ṽ(X) − E Ṽ(X) > ε ) ≤ exp( − ε² / (4 E Ṽ(X) + 2ε) ) ,\n\nthat is,\n\nP( ||p̂_n − p||²_{*,p} > (K − 1)/n + ε ) ≤ exp( − (ε²/b) / (4(K − 1)/n + 2ε) ) .\n\nAfter inverting this bound in ε and using the fact that √(a + b) ≤ √a + √b for non-negative a, b, we deduce that for all δ ∈ (0, 1), with probability higher than 1 − δ, then\n\n||p̂_n − p||²_{*,p} ≤ E V(X) + 2 √( E V(X) b ln(1/δ) ) + 2b ln(1/δ) = ( √(E V(X)) + √(b ln(1/δ)) )² + b ln(1/δ) .\n\nThus, we deduce from this inequality that\n\n||p̂_n − p||_{*,p} ≤ √(E V(X)) + 2 √( b ln(1/δ) ) = √( (K − 1)/n ) + 2 √( ((2n − 1) ln(1/δ)/n²) (1/p_(K) − 1/p_(1)) ) ,\n\nwhich concludes the proof. We recover here a O(n^{−1/2}) behavior, more precisely a O((n p_(K))^{−1/2}) scaling, where p_(K) is the smallest non-zero probability mass of p. □\n\n3 Hardness measure in Reinforcement Learning using the distribution-norm\nIn this section, we apply the insights from Section 2 for the distribution-norm to learning in Markov Decision Processes (MDPs). We start by defining a formal notion of hardness C for discounted MDPs and undiscounted MDPs with average reward, that we call the environmental norm. Then, we show in Section 3.2 that several benchmark MDPs have small environmental norm. 
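Before the formal definitions, the environmental norm introduced next can be previewed concretely: it is the largest standard deviation of the (value or bias) function under the next-state distributions. A minimal sketch for a hypothetical 3-state discounted MDP with a fixed policy (all numbers are illustrative assumptions of ours):

```python
import numpy as np

gamma = 0.95
# Hypothetical policy-induced kernel: P[s, s'] = p(s' | s, pi(s)).
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.5, 0.5],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.5, 1.0])   # deterministic rewards

# Policy evaluation: V = (I - gamma * P)^{-1} r
V = np.linalg.solve(np.eye(3) - gamma * P, r)

def environmental_norm(P, V):
    """Max over states (policy fixed here) of ||V||_{p(.|s,pi(s))},
    the standard deviation of V under the next-state distribution."""
    means = P @ V
    variances = P @ (V ** 2) - means ** 2
    return float(np.sqrt(np.maximum(variances, 0.0)).max())

C = environmental_norm(P, V)
print(C)                        # hardness of this toy MDP for pi
print(np.max(V) - np.min(V))    # span(V): the generic, larger bound
```

If all rows of P led to next states with equal values of V, C would be exactly 0, however noisy the transitions are.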
In Section 3.1, we present a regret bound for a modification of UCRL whose regret scales with C, without having to know C in advance.\n\nDefinition 1 (Discounted MDP) Let M = ⟨S, A, r, p, γ⟩ be a γ-discounted MDP, with reward function r and transition kernel p. We denote V^π the value function corresponding to a policy π (Puterman, 1994). We define the environmental-value norm of policy π in MDP M by\n\nC^π_M = max_{(s,a)∈S×A} ||V^π||_{p(·|s,a)} .\n\nDefinition 2 (Undiscounted MDP) Let M = ⟨S, A, r, p⟩ be an undiscounted MDP, with reward function r and transition kernel p. We denote by h^π the bias function for policy π (Puterman, 1994; Jaksch, 2010). We define the environmental-value norm of policy π in MDP M by the quantity\n\nC^π_M = max_{(s,a)∈S×A} ||h^π||_{p(·|s,a)} .\n\nIn the discounted setting with bounded rewards in [0, 1], V^π ≤ 1/(1 − γ) and thus C^π_M ≤ 1/(1 − γ) as well. In the undiscounted setting, ||h^π||_{p(·|s,a)} ≤ span(h^π), and thus C^π_M ≤ span(h^π). We define the class of C-\u201chard\u201d MDPs by M_C = {M : C^{π*}_M ≤ C}, that is, the class of MDPs whose optimal policy has a low environmental-value norm, or for short, MDPs with low environmental norm.\nImportant note It may be tempting to think that, since the above definition captures a notion of variance, an MDP that is very noisy will have a high environmental norm. However, this reasoning is incorrect. The environmental norm of an MDP is not the variance of a roll-out trajectory, but rather captures the variations of the value (or bias) function with respect to the transition kernel. For example, consider a fully connected MDP with a transition kernel that transits to every state uniformly at random, but with a constant reward function. In this trivial MDP, C^π_M = 0 for all policies π, even though the MDP is extremely noisy, because the value function is constant. In general MDPs, the environmental norm depends on how varied the value function is at the possible next states and on the distribution over next states. Note also that we use the term hardness rather than complexity to avoid confusion with such concepts as Rademacher or VC complexity.\n\n3.1 \u201cEasy\u201d MDPs and algorithms\nIn this section, we demonstrate how the dual norm (instead of the usual || · ||_1 norm) can lead to improved bounds for learning in MDPs with small environmental norm.\nDiscounted MDPs Due to space constraints, we only report one proposition that illustrates the kind of achievable results. Indeed, our goal is not to derive a modified version of each existing algorithm for the discounted scenario, but rather to instill the key idea of using a refined hardness measure when deriving the core lemmas underlying the analysis of previous (and future) algorithms.\nThe analysis of most RL algorithms for the discounted case uses a \u201csimulation lemma\u201d (Kearns and Singh, 2002); see also Strehl and Littman (2008) for a refined version. A simulation lemma bounds the error in the value function of running a policy planned on an estimated MDP in the MDP where the samples were taken from. This effectively controls the number of samples needed from each state-action pair to derive a near-optimal policy. The following result is a simulation lemma exploiting our proposed notion of hardness (the environmental norm).\nProposition 1 Let M be a γ-discounted MDP with deterministic rewards. For a policy π, let us denote its corresponding value V^π. 
We denote by p the transition kernel of M, and for convenience use the notation p^π(s′|s) for p(s′|s, π(s)). Now, let p̂ be an estimate of the transition kernel such that max_{s∈S} ||p^π(·|s) − p̂^π(·|s)||_{*,p^π(·|s)} ≤ ε, and let us denote by V̂^π its corresponding value in the MDP with kernel p̂. Then, the maximal expected error between the two values is bounded by\n\nE^π_rr := max_{s_0∈S} | E_{p^π(·|s_0)}[V^π] − E_{p̂^π(·|s_0)}[V̂^π] | ≤ ε C^π / (1 − γ) ,\n\nwhere C^π = max_{(s,a)∈S×A} ||V^π||_{p(·|s,a)}. In particular, for the optimal policy π*, then C^{π*} ≤ C.\nTo understand when this lemma results in smaller sample sizes, we need to compare to what one would get using the standard || · ||_1 decomposition, for an MDP with rewards in [0, 1]. If max_{s∈S} ||p^π(·|s) − p̂^π(·|s)||_1 ≤ ε′, then one would get\n\nE^π_rr ≤ ε′ span(V^π)/(1 − γ) ≤ ε′ V*_MAX/(1 − γ) ≤ ε′/(1 − γ)² .\n\nWhen, for example, C is a bound with respect to all policies, this simulation lemma can be plugged directly into the analysis of R-MAX (Kakade, 2003) or MBIE (Strehl and Littman, 2008) to obtain a hardness-sensitive bound on the sample complexity. Now, in most analyses, one only needs to bound the hardness with respect to the optimal policy and to the optimistic/greedy policies actually used by the algorithm. For an optimal policy π̃ computed from an (ε, ε′)-approximate model (see Lemma 4 for details), it is not difficult to show that C^π̃ ≤ C^{π*} + (ε′ C^{π*} + ε)/(1 − γ), which thus allows for a tighter analysis. We do not report further results here, to avoid distracting the reader from the main message of the paper, which is the introduction of a distribution-dependent hardness metric for MDPs. Likewise, we do not detail the steps that lead from this result to the various sample-complexity bounds one can find in the abundant literature on the topic, as it would not be more illuminating than Proposition 1.\nUndiscounted MDPs In the undiscounted setting, with average reward criterion, it is natural to consider the UCRL algorithm from Jaksch (2010). We modify the definition of plausible MDPs used in the algorithm as follows: using the same notations as that of Jaksch (2010), we replace the admissibility condition for a candidate transition kernel p̃ at the beginning of episode k at time t_k\n\n||p̂_k(·|s, a) − p̃(·|s, a)||_1 ≤ √( 14 S log(2At_k/δ) / max{1, N_k(s, a)} )\n\nwith the following condition involving the result of Theorem 1\n\n||p̂_k(·|s, a) − p̃(·|s, a)||_{*,p̃(·|s,a)} ≤ B_k(s, a) := min{ √(1/p_0 − 1), √( (K − 1)/max{1, N_k(s, a)} ) + 2 √( ((2N_k(s, a) − 1) ln(t_k SA/δ) / max{1, N_k(s, a)}²) (1/p̃_(K) − 1/p̃_(1)) ) } ,  (4)\n\nwhere p̃_(K) is the smallest non-zero component of p̃(·|s, a), p̃_(1) the largest one, and K is the size of the support of p̃(·|s, a). We here assume for simplicity that the transition kernel p of the MDP always puts at least p_0 mass on each point of its support, and thus constrain an admissible kernel p̃ to satisfy the same condition. One restriction of the current (simple) analysis is that the algorithm needs to know a bound on p_0 in advance. We believe it is possible to remove such an assumption by estimating p_0 and taking care of the additional low-probability event corresponding to the estimation error. 
As this comes at the price of a more complicated algorithm and analysis, we do not report this extension here for clarity. Note that the optimization problem corresponding to Extended Value Iteration with (4) can still be solved by optimizing over the simplex. We refer to Jaksch (2010) for implementation details. Naturally, similar modifications apply also to REGAL and other UCRL variants introduced in the MDP literature.\nIn order to assess the performance of the policy chosen by UCRL it is useful to show the following:\nLemma 4 Let M and M̃ be two communicating MDPs over the same state-action space such that one is an (ε, ε′)-approximation of the other, in the sense that for all s, a, |r(s, a) − r̃(s, a)| ≤ ε and ||p̃(·|s, a) − p(·|s, a)||_{*,p(·|s,a)} ≤ ε′. Let ρ*(M) denote the average value function of M. Then\n\n|ρ*(M) − ρ*(M̃)| ≤ ε′ min{C_M, C_M̃} + ε .\n\nLemma 4 is a simple adaptation from Ortner et al. (2014). We now provide a bound on the regret of this modified UCRL algorithm. The regret bound turns out to be a bit better than UCRL in the case of an MDP M ∈ M_C with a small C.\nProposition 2 Let us consider a finite-state MDP with S states, low environmental norm (M ∈ M_C) and diameter D. Assume moreover that the transition kernel always puts at least p_0 mass on each point of its support. Then, the modified UCRL algorithm run with condition (4) is such that for all δ, with probability higher than 1 − δ, for all T, the regret after T steps is bounded by\n\nR_T = O( ( DC√(SA) ( √(log(TSA/δ)/p_0) + √S ) + D ) √( (T/p_0) log(TSA/δ) ) ) .\n\nThe regret bound for the original UCRL from Jaksch (2010) scales as O( DS √( AT log(TSA/δ) ) ). Since we used some crude upper bounds in parts of the proof of Proposition 2, we believe the right scaling for the bound of Proposition 2 is O( C √( (TSA/p_0) log(TSA/δ) ) ). The cruder factors come from some second-order terms that we controlled trivially to avoid technical and not very illuminating considerations. What matters here is that C appears as a factor of the leading term. Indeed, Proposition 2 is mostly here for illustration purposes of what one can achieve, and improving on the other terms is technical and goes beyond the scope of this paper. Comparing the two regret bounds, the result of Proposition 2 provides a qualitative improvement over the result of Jaksch (2010) whenever C < D√(Sp_0) (respectively C < √(Sp_0)) for the conjectured (resp. current) result.\nNote. The modified UCRL algorithm does not need to know the environmental norm C of the MDP in advance. It only appears in the analysis and in the final regret bound. This property is similar to that of UCRL with respect to the diameter D.\n\n3.2 The hardness of benchmark MDPs\nIn this section, we consider the hardness of a set of MDPs that have appeared in past literature. Table 1 summarizes the results for six MDPs that were chosen to be both representative of typical finite-state MDPs and to cover a diverse range of tasks. 
These MDPs are also significant in the sense that good solutions for them have been learned with far fewer samples than suggested by existing theoretical bounds. The metrics we report include the number of states S, the number of actions A, the maximum of V* (denoted V*_MAX), the span of V*, the environmental norm C^{π*}_M, and p_0 = min_{s∈S,a∈A} min_{s′∈supp(p(·|s,a))} p(s′|s, a), that is, the minimum non-zero probability mass given by the transition kernel of the MDP. While we cannot compute the hardness for all policies, the hardness with respect to π* is significant because it indicates how hard it is to learn the value function V* of the optimal policy. Notice that C^{π*}_M is significantly smaller than both V*_MAX and span(V*) in all the MDPs. This suggests that a model accurately representing the optimal value function can be derived with a small number of samples (and a bound based on || · ||_1 V*_MAX is overly conservative).\n\nMDP | S | A | V*_MAX | Span(V*) | C^{π*}_M | p_0\nbottleneck McGovern and Barto (2001) | 231 | 4 | 19.999 | 19.999 | 0.526 | 0.1\nred herring Hester and Stone (2009) | 121 | 4 | 17.999 | 17.999 | 4.707 | 0.1\ntaxi † Dietterich (1998) | 500 | 6 | 7.333 | 0.885 | 0.043 | 0.055\ninventory † Mankowitz et al. (2014) | 101 | 2 | 19.266 | 0.963 | 0.263 | < 10^{−3}\nmountain car † ⋆ ‡ Sutton and Barto (1998) | 150 | 3 | 19.999 | 19.999 | 1.296 | 0.322\npinball † ⋆ ‡ Konidaris and Barto (2009) | 2304 | 5 | 19.999 | 19.991 | 0.059 | < 10^{−3}\n\nTable 1: MDPs marked with a † indicate that the true MDP was not available and so it was estimated from samples. We estimated these MDPs with 10,000 samples from each state-action pair. MDPs marked with a ⋆ indicate that the original MDP is deterministic and therefore we added noise to the transition dynamics. 
For the Mountain Car problem, we added a small amount of noise to the vehicle's velocity during each step: pos_{t+1} = pos_t + vel_t(1 + X), where X is a random variable taking the equally probable values {−vel_MAX, 0, vel_MAX}. For the pinball domain we added noise similar to Tamar et al. (2013). MDPs marked with a ◦ were discretized to create a finite state MDP. The rewards of all MDPs were normalized to [0, 1] and a discount factor γ = 0.95 was used.

To understand the environmental-value norm of near-optimal policies π in an MDP, we ran policy iteration on each of the benchmark MDPs from Table 1 for 100 iterations (see supplementary material for further details). We computed the environmental-value norm of all encountered policies and selected the policy π with maximal norm and its corresponding worst-case distribution. Figure 1 compares the Weissman et al. (2003) bound × V_MAX to the bound given by Theorem 1 × C^π_M as the number of samples increases. It is indeed the comparison of these products that matters for the learning regret, rather than that of either factor alone. In each MDP, we see an order of magnitude improvement by exploiting the distribution-norm. This is particularly significant because the Weissman et al. (2003) bound is quite close to the behavior observed in experiments. The result in Figure 1 strengthens support for our theoretical findings, suggesting that bounds based on the distribution-norm scale with the MDP's hardness.

Figure 1: Comparison of the Weissman et al. (2003) bound times V_MAX to (3) of Theorem 1 times C^π_M in the benchmark MDPs. In each MDP, we selected the policy π (from the policies encountered during policy iteration) that gave the largest C^π and the worst next-state distribution for our bound. In each MDP, the improvement with the distribution-norm is an order of magnitude (or more) better than using the distribution-free Weissman et al. (2003) bound.

4 Discussion and conclusion
In the early days of learning theory, sample-independent quantities such as the VC-dimension and later the Rademacher complexity were used to derive generalization bounds for supervised learning. Later on, data-dependent bounds (empirical VC or empirical Rademacher) replaced these quantities to obtain better bounds. In a similar spirit, we proposed the first analysis in RL where, instead of considering generic a-priori bounds, one can use stronger MDP-specific bounds. Similarly to supervised learning, where generalization bounds have been used to drive model selection algorithms and structural risk minimization, our proposed distribution-dependent norm suggests a similar approach to solving RL problems. Although we do not claim to close the gap between theoretical and empirical bounds, this paper opens an interesting direction of research towards this goal, and achieves a significant first step. It inspires at least a modification of the whole family of UCRL-based algorithms, and could also benefit other fundamental problems in RL such as basis-function adaptation or model selection, but efficient implementation should not be overlooked. We chose a natural weighted L2 norm induced by a distribution, due to its simplicity of interpretation, and showed that several benchmark MDPs have low hardness. A natural question is how much benefit can be obtained by studying other Lp or Orlicz distribution-norms?
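Concretely, the weighted L2 distribution-norm reduces to a standard deviation under the measure ν. The following is a minimal sketch (the `distribution_norm` helper is hypothetical, not the authors' code) of why a value function that is constant on the support of p(·|s,a) contributes zero hardness, matching the motivating example from the introduction:

```python
import numpy as np

def distribution_norm(f, nu):
    """Distribution-norm of f w.r.t. the measure nu: the standard
    deviation of f under nu (f is centered to have zero nu-mean)."""
    f, nu = np.asarray(f, dtype=float), np.asarray(nu, dtype=float)
    mean = np.dot(nu, f)                          # E_nu[f]
    return np.sqrt(np.dot(nu, (f - mean) ** 2))   # sqrt(Var_nu(f))

# Toy example: a value function that is constant on the support of
# p(.|s,a) has norm zero, so estimating p(.|s,a) precisely is
# irrelevant for value estimation (only its support matters).
V = np.array([5.0, 5.0, 5.0, 0.0])      # hypothetical value function
p = np.array([0.3, 0.5, 0.2, 0.0])      # next-state distribution p(.|s,a)
q = np.array([0.25, 0.25, 0.25, 0.25])  # distribution reaching the low-value state

print(distribution_norm(V, p))   # 0.0: V is constant where p puts mass
print(distribution_norm(V, q))   # ~2.165: V varies over the support of q
```

A hardness measure built from this norm is therefore small whenever transitions mostly connect states of similar value, regardless of how complex p itself is.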
Further, one may wish to create other distribution-dependent norms that emphasize certain areas of the state space in order to better capture desired (or undesired) phenomena. This is left for future work.
In our analysis we showed how to adapt existing algorithms to use the new distribution-dependent hardness measure. We believe this is only the beginning of what is possible, and that new algorithms will be developed to best utilize distribution-dependent norms in MDPs.

Acknowledgements This work was supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 306638 (SUPREL) and the Technion.

References
Bartlett, P. L. and Tewari, A. (2009). REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42.

Dietterich, T. G. (1998). The MAXQ method for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 118–126.

Farahmand, A. M. (2011). Action-gap phenomenon in reinforcement learning. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K.
Q., editors, Proceedings of the 25th Annual Conference on Neural Information Processing Systems, pages 172–180, Granada, Spain.

Filippi, S., Cappé, O., and Garivier, A. (2010). Optimism in reinforcement learning and Kullback-Leibler divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 115–122. IEEE.

Hester, T. and Stone, P. (2009). Generalized model learning for reinforcement learning in factored domains. In The Eighth International Conference on Autonomous Agents and Multiagent Systems (AAMAS).

Jaksch, T. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600.

Kakade, S. M. (2003). On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London.

Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209–232.

Konidaris, G. and Barto, A. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A., editors, Advances in Neural Information Processing Systems 22, pages 1015–1023.

Lattimore, T. and Hutter, M. (2012). PAC bounds for discounted MDPs. In Algorithmic Learning Theory, pages 320–334. Springer.

Mankowitz, D. J., Mann, T. A., and Mannor, S. (2014). Time-regularized interrupting options (TRIO). In Proceedings of the 31st International Conference on Machine Learning.

Maurer, A. and Pontil, M. (2009). Empirical Bernstein bounds and sample-variance penalization. In Conference On Learning Theory (COLT).

McGovern, A. and Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning, pages 361–368, San Francisco, USA.

Ortner, R. (2012).
Online regret bounds for undiscounted continuous reinforcement learning. In Neural Information Processing Systems 25, pages 1772–1780.

Ortner, R., Maillard, O.-A., and Ryabko, D. (2014). Selecting near-optimal approximate state representations in reinforcement learning. Technical report, Montanuniversitaet Leoben.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.

Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331.

Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.

Szita, I. and Szepesvári, C. (2010). Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning.

Tamar, A., Castro, D. D., and Mannor, S. (2013). TD methods for the variance of the reward-to-go. In Proceedings of the 30th International Conference on Machine Learning.

Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. (2003). Inequalities for the l1 deviation of the empirical distribution. Technical report, Hewlett-Packard Labs.