{"title": "Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2818, "page_last": 2826, "abstract": "Recently, there has been significant progress in understanding reinforcement learning in discounted infinite-horizon Markov decision processes (MDPs) by deriving tight sample complexity bounds. However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee). In this paper, we derive an upper PAC bound of order O(|S|\u00b2|A|H\u00b2 log(1/\u03b4)/\u025b\u00b2) and a lower PAC bound \u03a9(|S||A|H\u00b2 log(1/(\u03b4+c))/\u025b\u00b2) (ignoring log-terms) that match up to log-terms and an additional linear dependency on the number of states |S|. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least H\u00b3.", "full_text": "Sample Complexity of Episodic Fixed-Horizon\n\nReinforcement Learning\n\nChristoph Dann\n\nMachine Learning Department\nCarnegie Mellon University\n\ncdann@cdann.net\n\nEmma Brunskill\n\nComputer Science Department\nCarnegie Mellon University\n\nebrun@cs.cmu.edu\n\nAbstract\n\nRecently, there has been signi\ufb01cant progress in understanding reinforcement\nlearning in discounted in\ufb01nite-horizon Markov decision processes (MDPs) by de-\nriving tight sample complexity bounds. 
However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee). In this paper, we derive an upper PAC bound Õ(|S|²|A|H²/ε² ln(1/δ)) and a lower PAC bound Ω̃(|S||A|H²/ε² ln(1/(δ+c))) that match up to log-terms and an additional linear dependency on the number of states |S|. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least H³.\n\n1 Introduction and Motivation\n\nConsider test preparation software that tutors students for a national advanced placement exam taken at the end of a year, or maximizing business revenue by the end of each quarter. Each individual task instance requires making a sequence of decisions for a fixed number of steps H (e.g., tutoring one student to take an exam in spring 2015 or maximizing revenue for the end of the second quarter of 2014). Therefore, they can be viewed as a finite-horizon sequential decision making under uncertainty problem, in contrast to an infinite-horizon setting in which the number of time steps is infinite. When the domain parameters (e.g. 
Markov decision process parameters) are not known in advance, and there is the opportunity to repeat the task many times (teaching a new student for each year's exam, maximizing revenue for each new quarter), this can be treated as episodic fixed-horizon reinforcement learning (RL). One important question is to understand how much experience is required to act well in this setting. We formalize this as the sample complexity of reinforcement learning [1], which is the number of time steps on which the algorithm may select an action whose value is not near-optimal. RL algorithms with a sample complexity that is a polynomial function of the domain parameters are referred to as Probably Approximately Correct (PAC) [2, 3, 4, 1]. Though there has been significant work on PAC RL algorithms for the infinite-horizon setting, there has been relatively little work on the finite-horizon scenario.\n\nIn this paper we present the first, to our knowledge, lower bound, and a new upper bound on the sample complexity of episodic finite-horizon PAC reinforcement learning in discrete state-action spaces. Our bounds are tight up to log-factors in the time horizon H, the accuracy ε, the number of actions |A| and up to an additive constant in the failure probability δ. These bounds improve upon existing results by a factor of at least H. Our results also apply when the reward model is a function of the within-episode time step in addition to the state and action space. While we assume a stationary transition model, our results can be extended readily to time-dependent state-transitions. 
Our proposed UCFH (Upper-Confidence Fixed-Horizon RL) algorithm that achieves our upper PAC guarantee can be applied directly to a wide range of fixed-horizon episodic MDPs with known rewards.1 It does not require additional structure such as assuming access to a generative model [8] or that the state transitions are sparse or acyclic [6].\n\nThe limited prior research on upper-bound PAC results for finite-horizon MDPs has focused on different settings, such as partitioning a longer trajectory into fixed-length segments [4, 1], or considering a sliding time window [9]. The tightest dependence on the horizon in terms of the number of episodes presented in these approaches is at least H³, whereas our dependence is only H². More importantly, such alternative settings require the optimal policy to be stationary, whereas in general in finite-horizon settings the optimal policy is nonstationary (i.e., a function of both the state and the within-episode time step).2 Fiechter [10, 11] and Reveliotis and Bountourelis [12] do tackle a closely related setting, but find a dependence that is at least H⁴.\n\nOur work builds on recent work [6, 8] on PAC infinite-horizon discounted RL that offers much tighter upper and lower sample complexity bounds than were previously known. To use an infinite-horizon algorithm in a finite-horizon setting, a simple change is to augment the state space by the time step (ranging over 1, ..., H), which enables the learned policy to be nonstationary in the original state space (or equivalently, stationary in the newly augmented space). 
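The time-index augmentation just described can be made concrete with a short sketch. The dictionary-based encoding and function names below are our own illustrative choices, not code from the paper; note how the state count grows from |S| to |S|·H, which is the source of the extra H factors discussed next.

```python
from itertools import product

def augment_with_time(states, actions, p, horizon):
    """Augment each state with the within-episode time step t = 1..H.

    p maps (s, a) to a {next_state: probability} dict (stationary kernel).
    The returned kernel is over pairs (s, t), so a stationary policy on the
    augmented space corresponds to a nonstationary policy on the original one.
    """
    aug_states = list(product(states, range(1, horizon + 1)))
    aug_p = {}
    for (s, t), a in product(aug_states, actions):
        if t < horizon:
            # advance the clock deterministically, the state stochastically
            aug_p[((s, t), a)] = {(s2, t + 1): q for s2, q in p[(s, a)].items()}
        else:
            aug_p[((s, t), a)] = {}  # the episode ends after step H
    return aug_states, aug_p
```

For a two-state MDP with horizon 3 this yields 6 augmented states, illustrating the |S|·H blow-up.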
Unfortunately, since these recent bounds are in general a quadratic function of the state-space size, the proposed state-space expansion would introduce at least an additional H² factor in the sample complexity term, yielding at least an H⁴ dependence in the number of episodes for the sample complexity.\n\nSomewhat surprisingly, we prove an upper bound on the sample complexity for the finite-horizon case that only scales quadratically with the horizon. A key part of our proof is that the variance of the value function in the finite-horizon setting satisfies a Bellman equation. We also leverage recent insights that state-action pairs can be estimated to different precisions depending on the frequency with which they are visited under a policy, extending these ideas to also handle the case where the policy followed is nonstationary. Our lower bound analysis is quite different from some prior infinite-horizon results, and involves a construction of parallel multi-armed bandits where it is required that the best arm in a certain portion of the bandits is identified with high probability to achieve near-optimality.\n\n2 Problem Setting and Notation\n\nWe consider episodic fixed-horizon MDPs, which can be formalized as a tuple M = (S, A, r, p, p0, H). Both the state space S and the action space A are finite sets. The learning agent interacts with the MDP in episodes of H time steps. At time t = 1, ..., H, the agent observes a state st and chooses an action at based on a policy π that potentially depends on the within-episode time step, i.e., at = πt(st) for t = 1, ..., H. The next state is sampled from the stationary transition kernel st+1 ~ p(·|st, at) and the initial state from s1 ~ p0. In addition the agent receives a reward drawn from a distribution3 with mean rt(st) determined by the reward function. The reward function r is possibly time-dependent and takes values in [0, 1]. 
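The interaction protocol just described (draw s1 ~ p0, act with at = πt(st), accrue mean reward rt(st)) can be sketched as a small rollout routine. The dictionary encoding and names are illustrative assumptions of ours, not part of the paper:

```python
import random

def draw(dist, rng=random):
    """Sample a key from a {outcome: probability} dictionary."""
    u, acc = rng.random(), 0.0
    for x, q in dist.items():
        acc += q
        if u <= acc:
            return x
    return x  # guard against floating-point round-off

def sample_episode(p0, p, r, policy, horizon):
    """One fixed-horizon episode: s1 ~ p0, a_t = pi_t(s_t), s_{t+1} ~ p(.|s_t, a_t).

    Returns the accumulated mean reward sum_t r_t(s_t); rewards are treated as
    known, matching the paper's simplifying assumption. r[t][s] and policy[t][s]
    are 0-indexed per-step tables (the policy is time-dependent).
    """
    s, total = draw(p0), 0.0
    for t in range(horizon):
        a = policy[t][s]
        total += r[t][s]
        s = draw(p[(s, a)])
    return total
```

On a deterministic single-state chain with unit rewards, the routine returns exactly H.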
The quality of a policy π is evaluated by the total expected reward of an episode, R^π_M = E[ Σ_{t=1}^H rt(st) ]. For simplicity,1 we assume that the reward function r is known to the agent but the transition kernel p is unknown. The question we study is for how many episodes a learning agent follows a policy π that is not ε-optimal, i.e., R*_M − ε > R^π_M, with probability at least 1 − δ, for any chosen accuracy ε and failure probability δ.\n\nNotation. In the following sections, we reason about the true MDP M, an empirical MDP M̂ and an optimistic MDP M̃ which are identical except for their transition probabilities p, p̂ and p̃t. We will provide more details about these MDPs later. We introduce the notation explicitly only for M but the quantities carry over to M̃ and M̂ with additional tildes or hats by replacing p with p̃t or p̂.\n\n1 Previous works [5] have shown that the complexity of learning state transitions usually dominates learning reward functions. We therefore follow existing sample complexity analyses [6, 7] and assume known rewards for simplicity. The algorithm and PAC bound can be extended readily to the case of unknown reward functions.\n2 The best action will generally depend on the state and the number of remaining time steps. In the tutoring example, even if the student has the same state of knowledge, the optimal tutor decision may be to space practice if there are many days till the test and provide intensive short-term practice if the test is tomorrow.\n3 It is straightforward to have the reward depend on the state, or state/action or state/action/next state.\n\nThe (linear) operator P^π_i f(s) := E[f(s_{i+1}) | s_i = s] = Σ_{s' ∈ S} p(s'|s, πi(s)) f(s') takes any function f : S → R and returns the expected value of f with respect to the next time step.4 For convenience, we define the multi-step version as P^π_{i:j} f := P^π_i P^π_{i+1} ... P^π_j f. The value function from time i to time j is defined as V^π_{i:j}(s) := E[ Σ_{t=i}^j rt(st) | s_i = s ] = Σ_{t=i}^j (P^π_{i:t−1} rt)(s) = (P^π_i V^π_{i+1:j})(s) + ri(s), and V*_{i:j} is the optimal value function. When the policy is clear, we omit the superscript π.\nWe denote by S(s, a) ⊆ S the set of possible successor states of state s and action a. The maximum number of them is denoted by C = max_{s,a ∈ S×A} |S(s, a)|. In general, without making further assumptions, we have C = |S|, though in many practical domains (robotics, user modeling) each state can only transition to a subset of the full set of states (e.g. a robot can't teleport across the building, but can only take local moves). The notation Õ is similar to the usual O-notation but ignores log-terms. More precisely, f = Õ(g) if there are constants c1, c2 such that f ≤ c1 g (ln g)^{c2}, and analogously for Ω̃. The natural logarithm is ln and log = log2 is the base-2 logarithm.\n\n3 Upper PAC-Bound\n\nWe now introduce a new model-based algorithm, UCFH, for RL in finite-horizon episodic domains. We will later prove UCFH is PAC with an upper bound on its sample complexity that is smaller than prior approaches. Like many other PAC RL algorithms [3, 13, 14, 15], UCFH uses an optimism under uncertainty approach to balance exploration and exploitation. The algorithm generally works in phases comprised of optimistic planning, policy execution and model updating that take several episodes each. Phases are indexed by k. 
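Before continuing with the algorithm, the value-function recursion from Section 2, V^π_{i:j} = ri + P^π_i V^π_{i+1:j}, can be illustrated with a short backward-induction evaluator. The dictionary encoding is an illustrative sketch of ours:

```python
def policy_value(p, r, policy, states, horizon):
    """Evaluate a time-dependent policy by backward induction:
    V_t(s) = r_t(s) + sum_{s'} p(s'|s, pi_t(s)) * V_{t+1}(s').

    p: dict (s, a) -> {s': probability} (stationary kernel)
    r: r[t][s], mean reward at step t (0-indexed here)
    policy: policy[t][s] -> action, one map per step
    """
    V = {s: 0.0 for s in states}  # V_{H+1:H} = 0
    for t in reversed(range(horizon)):
        V = {s: r[t][s] + sum(q * V[s2]
                              for s2, q in p[(s, policy[t][s])].items())
             for s in states}
    return V  # V[s] = V^pi_{1:H}(s)
```

On a chain that moves deterministically from state 0 to an absorbing state 1 paying reward 1, the values come out as the remaining number of rewarding steps, as the recursion predicts.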
As the agent acts in the environment and observes (s, a, r, s') tuples, UCFH maintains a confidence set over the possible transition parameters for each state-action pair that are consistent with the observed transitions. Defining such a confidence set that holds with high probability can be achieved using concentration inequalities like the Hoeffding inequality. One innovation in our work is to use a particular new set of conditions to define the confidence set that enables us to obtain our tighter bounds. We will discuss the confidence sets further below. The collection of these confidence sets together forms a class of MDPs Mk that are consistent with the observed data. We define M̂k as the maximum likelihood estimate of the MDP given the previous observations.\nGiven Mk, UCFH computes a policy πk by performing optimistic planning. Specifically, we use a finite-horizon variant of extended value iteration (EVI) [5, 14]. EVI performs modified Bellman backups that are optimistic with respect to a given set of parameters. That is, given a confidence set of possible transition-model parameters, it selects in each time step the model within that set that maximizes the expected sum of future rewards. Appendix A provides more details about fixed-horizon EVI.\nUCFH then executes πk until there is a state-action pair (s, a) that has been visited often enough since its last update (defined precisely in the until-condition in UCFH). After updating the model statistics for this (s, a)-pair, a new policy πk+1 is obtained by optimistic planning again. We refer to each such iteration of planning-execution-update as a phase with index k. If there is no ambiguity, we omit the phase indices k to avoid cluttered notation.\nUCFH is inspired by the infinite-horizon UCRL-γ algorithm by Lattimore and Hutter [6] but has several important differences. 
First, the policy can only be updated at the end of an episode, so there is no need for explicit delay phases as in UCRL-γ. Second, the policies πk in UCFH are time-dependent. Finally, UCFH can directly deal with non-sparse transition probabilities, whereas UCRL-γ only directly allows two possible successor states for each (s, a)-pair (C = 2).\n\nConfidence sets. The class of MDPs Mk consists of fixed-horizon MDPs M' with the known true reward function r and where the transition probability p't(s'|s, a) from any (s, a) ∈ S × A to s' ∈ S(s, a) at any time t is in the confidence set induced by p̂(s'|s, a) of the empirical MDP M̂. Solely for the purpose of computationally more efficient optimistic planning, we allow time-dependent transitions (this allows choosing different transition models in different time steps to maximize reward), but this does not affect the theoretical guarantees as the true stationary MDP is still in Mk with high probability.\n\n4 The definition also works for time-dependent transition probabilities.\n\nAlgorithm 1: UCFH: Upper-Confidence Fixed-Horizon episodic reinforcement learning algorithm\nInput: desired accuracy ε ∈ (0, 1], failure tolerance δ ∈ (0, 1], fixed-horizon MDP M\nResult: with probability at least 1 − δ: ε-optimal policy\nk := 1, wmin := ε/(4H|S|), Umax := |S × A| log2(|S|H/wmin), δ1 := δ/(2 Umax C);\nm := 512 (log2 log2 H)² (CH²/ε²) log2(8H²|S|²/ε) ln(6|S × A|C log2²(4|S|²H²/ε)/δ);\nn(s, a) = v(s, a) = n(s, a, s') := 0 for all s ∈ S, a ∈ A, s' ∈ S(s, a);\nwhile true do\n  /* Optimistic planning */\n  p̂(s'|s, a) := n(s, a, s')/n(s, a) for all (s, a) with n(s, a) > 0 and s' ∈ S(s, a);\n  Mk := { M̃ ∈ Mnonst. : ∀(s, a) ∈ S × A, t = 1 ... H, s' ∈ S(s, a): p̃t(s'|s, a) ∈ ConfidenceSet(p̂(s'|s, a), n(s, a)) };\n  M̃k, πk := FixedHorizonEVI(Mk);\n  /* Execute policy */\n  repeat\n    SampleEpisode(πk); // from M using πk\n  until there is a (s, a) ∈ S × A with v(s, a) ≥ max{m wmin, n(s, a)} and n(s, a) < |S|mH;\n  /* Update model statistics for one (s, a)-pair with the condition above */\n  n(s, a) := n(s, a) + v(s, a);\n  n(s, a, s') := n(s, a, s') + v(s, a, s') for all s' ∈ S(s, a);\n  v(s, a) := v(s, a, s') := 0 for all s' ∈ S(s, a); k := k + 1\n\nProcedure SampleEpisode(π)\n  s0 ~ p0;\n  for t = 0 to H − 1 do\n    at := πt+1(st) and st+1 ~ p(·|st, at);\n    v(st, at) := v(st, at) + 1 and v(st, at, st+1) := v(st, at, st+1) + 1;\n\nFunction ConfidenceSet(p, n)\n  P := { p' ∈ [0, 1] :\n    if n > 1: |p'(1 − p') − p(1 − p)| ≤ sqrt(2 ln(6/δ1)/(n − 1)),   (1)\n    |p − p'| ≤ min( sqrt(ln(6/δ1)/(2n)), sqrt(2p(1 − p) ln(6/δ1)/n) + (2/(3n)) ln(6/δ1) ) }   (2)\n  return P\n\nUnlike the confidence intervals used by Lattimore and Hutter [6], we not only include conditions based on Hoeffding's inequality5 and Bernstein's inequality (Eq. 2), but also require that the variance p(1 − p) of the Bernoulli random variable associated with this transition is close to the empirical one (Eq. 1). This additional condition (Eq. 1) is key for making the algorithm directly applicable to generic MDPs (in which states can transition to any number of next states, e.g. C > 2) while only having a linear dependency on C in the PAC bound.\n\n3.1 PAC Analysis\n\nFor simplicity we assume that each episode starts in a fixed start state s0. 
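The ConfidenceSet function can be sketched as a membership test on a single transition probability. The constants below follow our reading of Equations (1) and (2), so treat the exact expressions as an assumption to be checked against the equations above rather than verified code:

```python
import math

def in_confidence_set(p_cand, p_hat, n, delta1):
    """Check whether candidate probability p_cand is consistent with the
    empirical estimate p_hat obtained from n observations (Eqs. (1)-(2))."""
    if n < 1:
        return True  # no data yet: every probability is plausible
    L = math.log(6.0 / delta1)
    # Eq. (1): the Bernoulli variance p'(1-p') must be close to the empirical one
    if n > 1 and abs(p_cand * (1 - p_cand) - p_hat * (1 - p_hat)) > math.sqrt(2 * L / (n - 1)):
        return False
    # Eq. (2): Hoeffding-style and Bernstein-style deviation bounds
    hoeffding = math.sqrt(L / (2 * n))
    bernstein = math.sqrt(2 * p_hat * (1 - p_hat) * L / n) + 2 * L / (3 * n)
    return abs(p_cand - p_hat) <= min(hoeffding, bernstein)
```

With many observations the set shrinks: an estimate of 0.5 from 10,000 samples rejects a candidate of 0.6, as the Bernstein term is already well below 0.1.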
This assumption is not crucial and can easily be removed with additional notational effort.\n\nTheorem 1. For any 0 < ε, δ ≤ 1, the following holds. With probability at least 1 − δ, UCFH produces a sequence of policies πk that yield at most\n\nÕ( (H²C|S × A|/ε²) ln(1/δ) )\n\nepisodes with R* − R^{πk} = V*_{1:H}(s0) − V^{πk}_{1:H}(s0) > ε. The maximum number of possible successor states is denoted by 1 < C ≤ |S|.\n\n5 The first condition in the min in Equation (2) is actually not necessary for the theoretical results to hold. It can be removed and all 6/δ1 can be replaced by 4/δ1.\n\nSimilarities to other analyses. The proof of Theorem 1 is quite long and involved, but builds on similar techniques for sample-complexity bounds in reinforcement learning (see e.g. Brafman and Tennenholtz [3], Strehl and Littman [16]). The general proof strategy is closest to the one of UCRL-γ [6] and the obtained bounds are similar if we replace the time horizon H with the equivalent in the discounted case, 1/(1 − γ). However, there are important differences that we highlight now briefly.\n• A central quantity in the analysis by Lattimore and Hutter [6] is the local variance of the value function. The exact definition for the fixed-horizon case will be given below. The key insight for the almost tight bounds of Lattimore and Hutter [6] and Azar et al. [8] is to leverage the fact that these local variances satisfy a Bellman equation [17] and so the discounted sum of local variances can be bounded by O((1 − γ)^{−2}) instead of O((1 − γ)^{−3}). We prove in Lemma 4 that local value function variances σ²_{i:j} also satisfy a Bellman equation for fixed-horizon MDPs even if transition probabilities and rewards are time-dependent. 
This allows us to bound the total sum of local variances by O(H²) and obtain similarly strong results in this setting.\n• Lattimore and Hutter [6] assumed there are only two possible successor states (i.e., C = 2), which allows them to easily relate the local variances σ²_{i:j} to the difference of the expected value of successor states in the true and optimistic MDP, (Pi − P̃i)Ṽ_{i+1:j}. For C > 2, the relation is less clear, but we address this by proving a bound with tight dependencies on C (Lemma C.6).\n• To avoid a super-linear dependency on C in the final PAC bound, we add the additional condition in Equation (1) to the confidence set. We show that this allows us to upper-bound the total reward difference R* − R^{πk} of policy πk with terms that either depend on σ²_{i:j} or decrease linearly in the number of samples. This gives the desired linear dependency on C in the final bound. We therefore avoid assuming C = 2, which makes UCFH directly applicable to generic MDPs with C > 2 without the impractical transformation argument used by Lattimore and Hutter [6].\n\nWe will now introduce the notion of knownness and importance of state-action pairs that is essential for the analysis of UCFH and subsequently present several lemmas necessary for the proof of Theorem 1. We only sketch proofs here but detailed proofs for all results are available in the appendix.\n\nFine-grained categorization of (s, a)-pairs. Many PAC RL sample complexity proofs [3, 4, 13, 14] only have a binary notion of "knownness", distinguishing between known (transition probability estimated sufficiently accurately) and unknown (s, a)-pairs. However, as recently shown by Lattimore and Hutter [6] for the infinite-horizon setting, it is possible to obtain much tighter sample complexity results by using a more fine-grained categorization. 
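The variance Bellman equation behind Lemma 4, V_{i:j} = Pi V_{i+1:j} + σ²_{i:j} (with V here denoting the variance of the episode return), can be verified numerically on a toy chain. The two-state MDP, rewards, and horizon below are illustrative choices of ours, not from the paper:

```python
from itertools import product

# Toy two-state chain under a fixed policy (action suppressed); illustrative only.
P = {0: {0: 0.5, 1: 0.5}, 1: {1: 1.0}}  # transition probabilities
r = {0: 0.0, 1: 1.0}                     # per-step mean rewards
H = 3

# Backward induction: V[t][s] = E[sum of rewards over steps t..H | s_t = s]
V = {H + 1: {s: 0.0 for s in P}}
for t in range(H, 0, -1):
    V[t] = {s: r[s] + sum(q * V[t + 1][s2] for s2, q in P[s].items()) for s in P}

# Variance Bellman recursion: W[t] = sigma2[t] + P W[t+1]
W = {H + 1: {s: 0.0 for s in P}}
for t in range(H, 0, -1):
    mean = {s: sum(q * V[t + 1][s2] for s2, q in P[s].items()) for s in P}
    sigma2 = {s: sum(q * V[t + 1][s2] ** 2 for s2, q in P[s].items()) - mean[s] ** 2
              for s in P}
    W[t] = {s: sigma2[s] + sum(q * W[t + 1][s2] for s2, q in P[s].items()) for s in P}

def direct_variance(s1):
    """Variance of the episode return from s1, by enumerating all length-H paths."""
    m1 = m2 = 0.0
    for tail in product(P, repeat=H - 1):
        states, prob = [s1] + list(tail), 1.0
        for a, b in zip(states, states[1:]):
            prob *= P[a].get(b, 0.0)
        if prob == 0.0:
            continue
        g = sum(r[s] for s in states)
        m1 += prob * g
        m2 += prob * g * g
    return m2 - m1 ** 2
```

Here W[1][s] matches the directly enumerated variance of the return for both start states, which is exactly the identity that lets the total sum of local variances be bounded by H² r²_max rather than H³.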
In particular, a key idea is that in order to obtain accurate estimates of the value function of a policy from a starting state, it is sufficient to have only a loose estimate of the parameters of (s, a)-pairs that are unlikely to be visited under this policy.\nLet the weight of a (s, a)-pair given policy πk be its expected frequency in an episode,\n\nwk(s, a) := Σ_{t=1}^H P(st = s, πk_t(st) = a) = Σ_{t=1}^H P_{1:t−1} I{s = ·, a = πk_t(s)}(s0).\n\nThe importance ιk of (s, a) is its relative weight compared to wmin := ε/(4H|S|) on a log-scale,\n\nιk(s, a) := min{ zi : zi ≥ wk(s, a)/wmin },   where z1 = 0 and zi = 2^{i−2} for all i = 2, 3, ...\n\nNote that ιk(s, a) ∈ {0, 1, 2, 4, 8, 16, ...} is an integer indicating the influence of the state-action pair on the value function of πk. Similarly, we define the knownness\n\nκk(s, a) := max{ zi : zi ≤ nk(s, a)/(m wk(s, a)) } ∈ {0, 1, 2, 4, ...},\n\nwhich indicates how often (s, a) has been observed relative to its importance. The constant m is defined in Algorithm 1. We can now categorize (s, a)-pairs into subsets\n\nXk,κ,ι := {(s, a) ∈ Xk : κk(s, a) = κ, ιk(s, a) = ι}   and   X̄k = S × A \\ Xk,\n\nwhere Xk = {(s, a) ∈ S × A : ιk(s, a) > 0} is the active set and X̄k the set of state-action pairs that are very unlikely under the current policy. Intuitively, the model of UCFH is accurate if only few (s, a) are in categories with low knownness, that is, important under the current policy but not observed often so far. Recall that over time observations are generated under many policies (as the policy is recomputed), so this condition does not always hold. We will therefore distinguish between phases k where |Xk,κ,ι| ≤ κ for all κ and ι and phases where this condition is violated. The condition essentially allows for only a few (s, a) in categories that are less known and more and more (s, a) in categories that are more well known. In fact, we will show that the policy is ε-optimal with high probability in phases that satisfy this condition.\nWe first show the validity of the confidence sets Mk.\n\nLemma 1 (Capturing the true MDP whp.). M ∈ Mk for all k with probability at least 1 − δ/2.\nProof Sketch. By combining Hoeffding's inequality, Bernstein's inequality and the concentration result on empirical variances by Maurer and Pontil [18] with the union bound, we get that p(s'|s, a) ∈ P with probability at least 1 − δ1 for a single phase k, fixed s, a ∈ S × A and fixed s' ∈ S(s, a). We then show that the number of model updates is bounded by Umax and apply the union bound.\n\nThe following lemma bounds the number of episodes in which the condition ∀κ, ι : |Xk,κ,ι| ≤ κ is violated, with high probability.\nLemma 2. Let E be the number of episodes k for which there are κ and ι with |Xk,κ,ι| > κ, i.e. E = Σ_{k=1}^∞ I{∃(κ, ι) : |Xk,κ,ι| > κ}, and assume that m ≥ (6H²/ε) ln(2Emax/δ). Then P(E ≤ 6N Emax) ≥ 1 − δ/2, where N = |S × A| m and Emax = log2(H/wmin) log2 |S|.\nProof Sketch. We first bound the total number of times a fixed pair (s, a) can be observed while being in a particular category Xk,κ,ι in all phases k for 1 ≤ κ < |S|. 
We then show that for a particular (κ, ι), the number of episodes where |Xk,κ,ι| > κ is bounded with high probability, as the value of ι implies a minimum probability of observing each (s, a) pair in Xk,κ,ι in an episode. Since the observations are not independent we use martingale concentration results to show the statement for a fixed (κ, ι). The desired result follows with the union bound over all relevant κ and ι.\n\nThe next lemma states that in episodes where the condition ∀κ, ι : |Xk,κ,ι| ≤ κ is satisfied and the true MDP is in the confidence set, the expected optimistic policy value is close to the true value. This lemma is the technically most involved part of the proof.\nLemma 3 (Bound mismatch in total reward). Assume M ∈ Mk. If |Xk,κ,ι| ≤ κ for all (κ, ι), 0 < ε ≤ 1 and m ≥ 512 (CH²/ε²) (log2 log2 H)² log2(8H²|S|²/ε) ln(6/δ1), then |Ṽ^{πk}_{1:H}(s0) − V^{πk}_{1:H}(s0)| ≤ ε.\nProof Sketch. Using basic algebraic transformations, we show that |p − p̃| ≤ sqrt(p̃(1 − p̃)) · O(sqrt(ln(1/δ1)/n)) + O((1/n) ln(1/δ1)) for each p̃, p ∈ P in the confidence set as defined in Eq. 2. Since we assume M ∈ Mk, we know that p(s'|s, a) and p̃(s'|s, a) satisfy this bound with n(s, a) for all s, a and s'. We use that to bound the difference of the expected value function of the successor state in M and M̃, proving that |(Pi − P̃i)Ṽ_{i+1:j}(s)| ≤ O( sqrt((C/n(s, π(s))) ln(1/δ1)) ) σ̃_{i:j}(s) + O( (CH/n(s, π(s))) ln(1/δ1) ), where the local variance of the value function is defined as σ²_{i:j}(s, a) := E[ (V^π_{i+1:j}(s_{i+1}) − P^π_i V^π_{i+1:j}(s_i))² | s_i = s, a_i = a ] and σ²_{i:j}(s) := σ²_{i:j}(s, πi(s)).\nThis bound is then applied to |Ṽ_{1:H}(s0) − V_{1:H}(s0)| ≤ Σ_{t=0}^{H−1} P_{1:t} |(Pt − P̃t)Ṽ_{t+1:H}(s)|. The basic idea is to split the bound into a sum of two parts by partitioning the (s, a) space by knownness, i.e., (st, at) ∈ Xκ,ι for all κ and ι, and (st, at) ∈ X̄. Using the fact that w(st, at) and n(st, at) are tightly coupled for each (κ, ι), we can bound the expression eventually by ε. The final key ingredient in the remainder of the proof is to bound Σ_{t=1}^H P_{1:t−1} σ²_{t:H}(s) by O(H²) instead of the trivial bound O(H³). To this end, we show the lemma below.\nLemma 4. The variance of the value function, defined as V^π_{i:j}(s) := E[ (Σ_{t=i}^j rt(st) − V^π_{i:j}(s_i))² | s_i = s ], satisfies a Bellman equation V_{i:j} = Pi V_{i+1:j} + σ²_{i:j}, which gives V_{i:j} = Σ_{t=i}^j P_{i:t−1} σ²_{t:j}. Since 0 ≤ V_{1:H} ≤ H² r²_max, it follows that 0 ≤ Σ_{t=i}^j P_{i:t−1} σ²_{t:j}(s) ≤ H² r²_max for all s ∈ S.\n\n[Figure 1: Class of hard-to-learn finite-horizon MDPs. From the start state 0, each state i ∈ {1, ..., n} is reached with probability p(i|0, a) = 1/n. From state i, the agent moves to the absorbing good state + (reward r(+) = 1) with probability p(+|i, a) = 1/2 + ε'_i(a) and to the absorbing bad state − (reward r(−) = 0) with probability p(−|i, a) = 1/2 − ε'_i(a). The function ε' is defined as ε'(a1) = ε/2, ε'(a*_i) = ε and otherwise ε'(a) = 0, where a*_i is an unknown action per state i and ε is a parameter.]\n\nProof Sketch. 
The proof works by induction and uses the fact that the value function satisfies the Bellman equation and the tower property of conditional expectations.\n\nProof Sketch for Theorem 1. The proof of Theorem 1 consists of the following major parts:\n1. The true MDP is in the set of MDPs Mk for all phases k with probability at least 1 − δ/2 (Lemma 1).\n2. The FixedHorizonEVI algorithm computes a value function whose optimistic value is higher than the optimal reward in the true MDP with probability at least 1 − δ/2 (Lemma A.1).\n3. The number of episodes with |Xk,κ,ι| > κ for some κ and ι is bounded with probability at least 1 − δ/2 by Õ(|S × A| m) if m = Ω̃( (H²/ε) ln(|S|/δ) ) (Lemma 2).\n4. If |Xk,κ,ι| ≤ κ for all κ, ι, i.e., relevant state-action pairs are sufficiently known, and m = Ω̃( (CH²/ε²) ln(1/δ1) ), then the optimistic value computed is ε-close to the true MDP value. Together with part 2, we get that with high probability, the policy πk is ε-optimal in this case.\n5. From parts 3 and 4, with probability 1 − δ, there are at most Õ( (C|S × A|H²/ε²) ln(1/δ) ) episodes that are not ε-optimal.\n\n4 Lower PAC Bound\n\nTheorem 2. 
There exist positive constants c1, c2, c3, δ0, ε0 such that for every δ ∈ (0, δ0) and ε ∈ (0, ε0) and for every algorithm A that satisfies a PAC guarantee for (ε, δ) and outputs a deterministic policy, there is a fixed-horizon episodic MDP Mhard with\n\nE[nA] ≥ c1 (H − 2)² (|A| − 1)(|S| − 3) / ε² · ln( c2/(δ + c3) ) = Ω( (|S × A|H²/ε²) ln( c2/(δ + c3) ) ),   (3)\n\nwhere nA is the number of episodes until the algorithm's policy is (ε, δ)-accurate. The constants can be set to δ0 = e^{−4}/80 ≈ 1/5000, ε0 = (H − 2)/(640 e⁴) ≈ H/35000, c2 = 4 and c3 = e^{−4}/80.\nThe ranges of possible δ and ε are of similar order to those in other state-of-the-art lower bounds for multi-armed bandits [19] and discounted MDPs [14, 6]. They are mostly determined by the bandit result by Mannor and Tsitsiklis [19] we build on. Increasing the parameter limits δ0 and ε0 for bandits would immediately result in larger ranges in our lower bound, but this was not the focus of our analysis.\nProof Sketch. The basic idea is to show that the class of MDPs shown in Figure 1 requires at least a number of observed episodes of the order of Equation (3). From the start state 0, the agent ends up in states 1 to n with equal probability, independent of the action. From each such state i, the agent transitions to either a good state + with reward 1 or a bad state − with reward 0 and stays there for the rest of the episode. Therefore, each state i = 1, ..., n is essentially a multi-armed bandit with binary rewards of either 0 or H − 2. 
For each bandit, the probability of ending up in + or − is equal except for the first action a1 with p(s_{t+1} = + | s_t = i, a_t = a1) = 1/2 + ε/2 and possibly an unknown optimal action a*_i (different for each state i) with p(s_{t+1} = + | s_t = i, a_t = a*_i) = 1/2 + ε.
In the episodic fixed-horizon setting we are considering, taking a suboptimal action in one of the bandits does not necessarily yield a suboptimal episode. We have to consider the average over all bandits instead. In an ε-optimal episode, the agent therefore needs to follow a policy that would solve at least a certain portion of all n multi-armed bandits with probability at least 1 − δ. We show that the best strategy for the agent to achieve this is to try to solve all bandits with equal probability. The number of samples required to do so then results in the lower bound in Equation (3).
Similar MDPs that essentially solve multiple such multi-armed bandits have been used to prove lower sample-complexity bounds for discounted MDPs [14, 6]. However, the analysis in the infinite-horizon case as well as for the sliding-window fixed-horizon optimality criterion considered by Kakade [4] is significantly simpler. For these criteria, every time step in which the agent follows a policy that is not ε-optimal counts as a "mistake". Therefore, every time the agent does not pick the optimal arm in any of the multi-armed bandits counts as a mistake. This contrasts with our fixed-horizon setting, where we must instead consider the average over all bandits.
5 Related Work on Fixed-Horizon Sample Complexity Bounds
We are not aware of any lower sample complexity bounds beyond multi-armed bandit results that directly apply to our setting. Our upper bound in Theorem 1 improves upon existing results by at least a factor of H. We briefly review those existing results in the following.
Timestep bounds.
Kakade [4, Chapter 8] proves upper and lower PAC bounds for a similar setting where the agent interacts indefinitely with the environment but the interactions are divided into segments of equal length and the agent is evaluated by the expected sum of rewards until the end of each segment. The bound states that there are not more than Õ((|S|²|A|H⁶/ε³) ln(1/δ)) time steps⁶ in which the agent acts ε-suboptimally. Strehl et al. [1] improve the state-dependency of these bounds for their delayed Q-learning algorithm to Õ((|S||A|H⁵/ε⁴) ln(1/δ)). However, in episodic MDPs it is more natural to consider performance on the entire episode, since suboptimality near the end of the episode is no issue as long as the total reward of the entire episode is sufficiently high. Kolter and Ng [9] use an interesting sliding-window criterion, but prove bounds for a Bayesian setting instead of PAC. Timestep-based bounds can be applied to the episodic case by augmenting the original state space with a time index per episode to allow resets after H steps. This adds an H-dependency for each |S| in the original bound, which results in a horizon-dependency of at least H⁶ of these existing bounds. Translating the regret bounds of UCRL2 in Corollary 3 by Jaksch et al. [20] yields a PAC bound on the number of episodes of at least Õ((|S|²|A|H³/ε²) ln(1/δ)) even if one ignores the reset after H time steps. Timestep-based lower PAC bounds cannot be applied directly to the episodic reward criterion.
Episode bounds. Similar to us, Fiechter [10] uses the value of initial states as optimality criterion, but defines the value w.r.t. the γ-discounted infinite horizon.
His results of order Õ((|S|²|A|H⁷/ε²) ln(1/δ)) episodes of length Õ(1/(1 − γ)) ≈ Õ(H) are therefore not directly applicable to our setting. Auer and Ortner [5] investigate the same setting as we do and propose a UCB-type algorithm that has no regret, which translates into a basic PAC bound of order Õ((|S|¹⁰|A|H⁷/ε³) ln(1/δ)) episodes. We improve on this bound substantially in terms of its dependency on H, |S| and ε. Reveliotis and Bountourelis [12] also consider the episodic undiscounted fixed-horizon setting and present an efficient algorithm for cases where the transition graph is acyclic and the agent knows for each state a policy that visits this state with a known minimum probability q. These assumptions are quite limiting and rarely hold in practice, and their bound of order Õ((|S||A|H⁴/(ε²q)) ln(1/δ)) explicitly depends on 1/q.
6 Conclusion
We have shown upper and lower bounds on the sample complexity of episodic fixed-horizon RL that are tight up to log-factors in the time horizon H, the accuracy ε, the number of actions |A| and up to an additive constant in the failure probability δ. These bounds improve upon existing results by a factor of at least H. One might hope to reduce the dependency of the upper bound on |S| to be linear by an analysis similar to Mormax [7] for discounted MDPs, which has sample complexity linear in |S| at the penalty of additional dependencies on H. Our proposed UCFH algorithm that achieves our PAC bound can be applied directly to a wide range of fixed-horizon episodic MDPs with known rewards and does not require additional structure such as sparse or acyclic state transitions assumed in previous work. The empirical evaluation of UCFH is an interesting direction for future work.
Acknowledgments: We thank Tor Lattimore for the helpful suggestions and comments.
This work was supported by an NSF CAREER award and the ONR Young Investigator Program.

⁶For comparison we adapt existing bounds to our setting. While the original bound stated by Kakade [4] only has H³, an additional H³ comes in through ε⁻³ due to different normalization of rewards.

References
[1] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC Model-Free Reinforcement Learning. In International Conference on Machine Learning, 2006.
[2] Michael J. Kearns and Satinder P. Singh. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms. In Advances in Neural Information Processing Systems, 1999.
[3] Ronen I. Brafman and Moshe Tennenholtz. R-MAX – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 3:213–231, 2002.
[4] Sham M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
[5] Peter Auer and Ronald Ortner. Online Regret Bounds for a New Reinforcement Learning Algorithm. In Proceedings 1st Austrian Cognitive Vision Workshop, 2005.
[6] Tor Lattimore and Marcus Hutter. PAC Bounds for Discounted MDPs. In International Conference on Algorithmic Learning Theory, 2012.
[7] István Szita and Csaba Szepesvári. Model-Based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds. In International Conference on Machine Learning, 2010.
[8] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. On the Sample Complexity of Reinforcement Learning with a Generative Model. In International Conference on Machine Learning, 2012.
[9] J. Zico Kolter and Andrew Y. Ng. Near-Bayesian Exploration in Polynomial Time. In International Conference on Machine Learning, 2009.
[10] Claude-Nicolas Fiechter.
Efficient Reinforcement Learning. In Conference on Learning Theory, 1994.
[11] Claude-Nicolas Fiechter. Expected Mistake Bound Model for On-Line Reinforcement Learning. In International Conference on Machine Learning, 1997.
[12] Spyros Reveliotis and Theologos Bountourelis. Efficient PAC Learning for Episodic Tasks with Acyclic State Spaces. Discrete Event Dynamic Systems: Theory and Applications, 17(3):307–327, 2007.
[13] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Incremental Model-Based Learners with Formal Learning-Time Guarantees. In Conference on Uncertainty in Artificial Intelligence, 2006.
[14] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.
[15] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal Regret Bounds for Reinforcement Learning. In Advances in Neural Information Processing Systems, 2010.
[16] Alexander L. Strehl and Michael L. Littman. An Analysis of Model-Based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
[17] Matthew J. Sobel. The Variance of Markov Decision Processes. Journal of Applied Probability, 19(4):794–802, 1982.
[18] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein Bounds and Sample-Variance Penalization. In Conference on Learning Theory, 2009.
[19] Shie Mannor and John N. Tsitsiklis. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. Journal of Machine Learning Research, 5:623–648, 2004.
[20] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal Regret Bounds for Reinforcement Learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
[21] Fan Chung and Linyuan Lu.
Concentration Inequalities and Martingale Inequalities: A Survey. Internet Mathematics, 3(1):79–127, 2006.