{"title": "Improving PAC Exploration Using the Median Of Means", "book": "Advances in Neural Information Processing Systems", "page_first": 3898, "page_last": 3906, "abstract": "We present the first application of the median of means in a PAC exploration algorithm for MDPs. Using the median of means allows us to significantly reduce the dependence of our bounds on the range of values that the value function can take, while introducing a dependence on the (potentially much smaller) variance of the Bellman operator. Additionally, our algorithm is the first algorithm with PAC bounds that can be applied to MDPs with unbounded rewards.", "full_text": "Improving PAC Exploration Using the Median of Means\n\nJason Pazis\nLaboratory for Information and Decision Systems\nMassachusetts Institute of Technology\nCambridge, MA 02139, USA\njpazis@mit.edu\n\nRonald Parr\nDepartment of Computer Science\nDuke University\nDurham, NC 27708\nparr@cs.duke.edu\n\nJonathan P. How\nAerospace Controls Laboratory\nDepartment of Aeronautics and Astronautics\nMassachusetts Institute of Technology\nCambridge, MA 02139, USA\njhow@mit.edu\n\nAbstract\n\nWe present the first application of the median of means in a PAC exploration algorithm for MDPs. Using the median of means allows us to significantly reduce the dependence of our bounds on the range of values that the value function can take, while introducing a dependence on the (potentially much smaller) variance of the Bellman operator. Additionally, our algorithm is the first algorithm with PAC bounds that can be applied to MDPs with unbounded rewards.\n\n1 Introduction\n\nAs the reinforcement learning community has shifted its focus from heuristic methods to methods that have performance guarantees, PAC exploration algorithms have received significant attention. Thus far, even the best published PAC exploration bounds are too pessimistic to be useful in practical applications. 
Even worse, lower bound results [14, 7] indicate that there is little room for improvement.\n\nWhile these lower bounds prove that there exist pathological examples for which PAC exploration can be prohibitively expensive, they leave the door open for the existence of \u201cwell-behaved\u201d classes of problems in which exploration can be performed at a significantly lower cost. The challenge, of course, is to identify classes of problems that are general enough to include problems of real-world interest, while at the same time restricted enough to have a meaningfully lower cost of exploration than pathological instances.\n\nThe approach presented in this paper exploits the fact that while the square of the maximum value that the value function can take ($Q_{\\max}^2$) is typically quite large, the variance of the Bellman operator is rather small in many domains of practical interest. For example, this is true in many control tasks: it is not very often that an action takes the system to the best possible state with 50% probability and to the worst possible state with 50% probability.\n\nMost PAC exploration algorithms take an average over samples. By contrast, the algorithm presented in this paper splits samples into sets, takes the average over each set, and returns the median of the averages. This seemingly simple trick (known as the median trick [1]) allows us to derive sample complexity bounds that depend on the variance of the Bellman operator rather than $Q_{\\max}^2$. 
Additionally, our algorithm (Median-PAC) is the first reinforcement learning algorithm with theoretical guarantees that allows for unbounded rewards.\u00b9\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nNot only does Median-PAC offer significant sample complexity savings in the case when the variance of the Bellman operator is low, but even in the worst case (the variance of the Bellman operator is bounded above by $Q_{\\max}^2/4$) our bounds match the best published PAC bounds. Note that Median-PAC does not require the variance of the Bellman operator to be known in advance. Our bounds show that there is an inverse relationship between the (possibly unknown) variance of the Bellman operator and Median-PAC\u2019s performance. This is, to the best of our knowledge, not only the first application of the median of means in PAC exploration, but also the first application of the median of means in reinforcement learning in general.\n\nContrary to recent work which has exploited variance in Markov decision processes to improve PAC bounds [7, 3], Median-PAC makes no assumptions about the number of possible next-states from every state-action (it does not even require the number of possible next states to be finite), and as a result it is easily extensible to the continuous state, concurrent MDP, and delayed update settings [12].\n\n2 Background, notation, and definitions\n\nIn the following, important symbols and terms will appear in bold when first introduced. Let $X$ be the domain of $x$. Throughout this paper, $\\forall x$ will serve as a shorthand for $\\forall x \\in X$. 
In the following $s, \\bar{s}, \\tilde{s}, s'$ are used to denote various states, and $a, \\bar{a}, \\tilde{a}, a'$ are used to denote actions.\n\nA Markov Decision Process (MDP) [13] is a 5-tuple $(S, A, P, R, \\gamma)$, where $S$ is the state space of the process, $A$ is the action space,\u00b2 $P$ is a Markovian transition model ($p(s'|s,a)$ denotes the probability of a transition to state $s'$ when taking action $a$ in state $s$), $R$ is a reward function ($R(s,a,s')$ is the reward for taking action $a$ in state $s$ and transitioning to state $s'$), and $\\gamma \\in [0, 1)$ is a discount factor for future rewards. A deterministic policy $\\pi$ is a mapping $\\pi : S \\mapsto A$ from states to actions; $\\pi(s)$ denotes the action choice in state $s$. The value $V^{\\pi}(s)$ of state $s$ under policy $\\pi$ is defined as the expected, accumulated, discounted reward when the process begins in state $s$ and all decisions are made according to policy $\\pi$. There exists an optimal policy $\\pi^*$ for choosing actions which yields the optimal value function $V^*(s)$, defined recursively via the Bellman optimality equation $V^*(s) = \\max_a \\{\\sum_{s'} p(s'|s,a) (R(s,a,s') + \\gamma V^*(s'))\\}$. Similarly, the value $Q^{\\pi}(s,a)$ of a state-action $(s,a)$ under policy $\\pi$ is defined as the expected, accumulated, discounted reward when the process begins in state $s$ by taking action $a$ and all decisions thereafter are made according to policy $\\pi$. The Bellman optimality equation for $Q$ becomes $Q^*(s,a) = \\sum_{s'} p(s'|s,a) (R(s,a,s') + \\gamma \\max_{a'} \\{Q^*(s',a')\\})$. For a fixed policy $\\pi$ the Bellman operator for $Q$ is defined as $B^{\\pi}Q(s,a) = \\sum_{s'} p(s'|s,a) (R(s,a,s') + \\gamma Q(s', \\pi(s')))$. 
\n\nIn reinforcement learning (RL) [15], a learner interacts with a stochastic process modeled as an MDP and typically observes the state and immediate reward at every step; however, the transition model $P$ and reward function $R$ are not known. The goal is to learn a near optimal policy using experience collected through interaction with the process. At each step of interaction, the learner observes the current state $s$, chooses an action $a$, and observes the reward received $r$ and the resulting next state $s'$, essentially sampling the transition model and reward function of the process. Thus experience comes in the form of $(s, a, r, s')$ samples.\n\nWe assume that all value functions $Q$ live in a complete metric space.\n\nDefinition 2.1. $Q_{\\max}$ denotes an upper bound on the expected, accumulated, discounted reward from any state-action under any policy.\n\nWe require that $Q_{\\min}$, the minimum expected, accumulated, discounted reward from any state-action under any policy, is bounded, and in order to simplify notation we also assume without loss of generality that it is bounded below by 0. If $Q_{\\min} < 0$, this assumption is easy to satisfy in all MDPs for which $Q_{\\min}$ is bounded by simply shifting the reward space by $(\\gamma - 1)Q_{\\min}$.\n\n\u00b9 Even though domains with truly unbounded rewards are not common, many domains exist for which infrequent events with extremely high (winning the lottery) or extremely low (nuclear power-plant meltdown) rewards exist. Algorithms whose sample complexity scales with the highest magnitude event are not well suited to such domains.\n\n\u00b2 For simplicity of exposition we assume that the same set of actions is available at every state. Our results readily extend to the case where the action set can differ from state to state.\n\nThere have been many definitions of sample complexity in RL. In this paper we will be using the following [12]:\n\nDefinition 2.2. Let ($s_1, s_2, s_3, \\ldots$
) be the random path generated on some execution of $\\pi$, where $\\pi$ is an arbitrarily complex, possibly non-stationary, possibly history dependent policy (such as the policy followed by an exploration algorithm). Let $\\epsilon$ be a positive constant, $T$ the (possibly infinite) set of time steps for which $V^{\\pi}(s_t) < V^*(s_t) - \\epsilon$, and define\u00b3\n\n$\\epsilon_e(t) = V^*(s_t) - V^{\\pi}(s_t) - \\epsilon, \\forall t \\in T.$\n$\\epsilon_e(t) = 0, \\forall t \\notin T.$\n\nThe Total Cost of Exploration (TCE) is defined as the undiscounted infinite sum $\\sum_{t=0}^{\\infty} \\epsilon_e(t)$.\n\n\u00b3 Note that $V^{\\pi}(s_t)$ denotes the expected, discounted, accumulated reward of the arbitrarily complex policy $\\pi$ from state $s_t$ at time $t$, rather than the expectation of some stationary snapshot of $\\pi$.\n\n\u201cNumber of suboptimal steps\u201d bounds follow as a simple corollary of TCE bounds.\n\nWe will be using the following definition of efficient PAC exploration [14]:\n\nDefinition 2.3. An algorithm is said to be efficient PAC-MDP (Probably Approximately Correct in Markov Decision Processes) if, for any $\\epsilon > 0$ and $0 < \\delta < 1$, its sample complexity, its per-timestep computational complexity, and its space complexity are less than some polynomial in the relevant quantities $(S, A, \\frac{1}{\\epsilon}, \\frac{1}{\\delta}, \\frac{1}{1-\\gamma})$, with probability at least $1 - \\delta$.\n\n3 The median of means\n\nBefore we present Median-PAC we will demonstrate the usefulness of the median of means with a simple example. Suppose we are given $n$ independent samples from a random variable $X$ and we want to estimate its mean. The types of guarantees that we can provide about how close that estimate will be to the expectation will depend on what knowledge we have about the variable, and on the method we use to compute the estimate. 
The main question of interest in our work is how many samples are needed until our estimate is $\\epsilon$-close to the expectation with probability at least $1 - \\delta$.\n\nLet the expectation of $X$ be $E[X] = \\mu$ and its variance $\\mathrm{var}[X] = \\sigma^2$. Cantelli\u2019s inequality tells us that $P(X - \\mu \\geq \\epsilon) \\leq \\frac{\\sigma^2}{\\sigma^2 + \\epsilon^2}$ and $P(X - \\mu \\leq -\\epsilon) \\leq \\frac{\\sigma^2}{\\sigma^2 + \\epsilon^2}$. Let $X_i$ be a random variable describing the value of the $i$-th sample, and define $X' = \\frac{X_1 + X_2 + \\cdots + X_n}{n}$. We have that $E[X'] = \\mu$ and $\\mathrm{var}[X'] = \\frac{\\sigma^2}{n}$. From Cantelli\u2019s inequality we have that $P(X' - \\mu \\geq \\epsilon) \\leq \\frac{\\sigma^2}{\\sigma^2 + n\\epsilon^2}$ and $P(X' - \\mu \\leq -\\epsilon) \\leq \\frac{\\sigma^2}{\\sigma^2 + n\\epsilon^2}$. Solving for $n$ we have that we need at most $n = \\frac{(1-\\delta)\\sigma^2}{\\delta\\epsilon^2} = O(\\frac{\\sigma^2}{\\delta\\epsilon^2})$ samples until our estimate is $\\epsilon$-close to the expectation with probability at least $1 - \\delta$. In RL, it is common to apply a union bound over the entire state-action space in order to prove uniformly good approximation. This means that $\\delta$ has to be small enough that even when multiplied with the number of state-actions, it yields an acceptably low probability of failure. The most significant drawback of the bound above is that it grows very quickly as $\\delta$ becomes smaller. Without further assumptions one can show that the bound above is tight for the average estimator.\n\nIf we know that $X$ can only take values in a bounded range $a \\leq X \\leq b$, Hoeffding\u2019s inequality tells us that $P(X' - \\mu \\geq \\epsilon) \\leq e^{-\\frac{2n\\epsilon^2}{(b-a)^2}}$ and $P(X' - \\mu \\leq -\\epsilon) \\leq e^{-\\frac{2n\\epsilon^2}{(b-a)^2}}$. Solving for $n$ we have that $n = \\frac{(b-a)^2 \\ln\\frac{1}{\\delta}}{2\\epsilon^2}$ samples suffice to guarantee that our estimate is $\\epsilon$-close to the expectation with probability at least $1 - \\delta$. Hoeffding\u2019s inequality yields a much better bound with respect to $\\delta$, but introduces a quadratic dependence on the range of values that the variable can take. For long planning horizons (discount factor close to 1) and/or large reward magnitudes, the range of possible Q-values can be very large, much larger than the variance of individual state-actions.\n\nWe can get the best of both worlds by using a more sophisticated estimator. 
Instead of taking the average over $n$ samples, we will split them into $k_m = \\frac{n\\epsilon^2}{4\\sigma^2}$ sets of $\\frac{4\\sigma^2}{\\epsilon^2}$ samples each,\u2074 compute the average over each set, and then take the median of the averages. From Cantelli\u2019s inequality we have that with probability at least $\\frac{4}{5}$, each one of the sets will not underestimate, or overestimate, the mean $\\mu$ by more than $\\epsilon$. Let $f^-$ be the function that counts the number of sets that underestimate the mean by more than $\\epsilon$, and $f^+$ the function that counts the number of sets that overestimate the mean by more than $\\epsilon$. From McDiarmid\u2019s inequality [9] we have that $P(f^- \\geq \\frac{k_m}{2}) \\leq e^{-\\frac{2(\\frac{3k_m}{10})^2}{k_m}}$ and $P(f^+ \\geq \\frac{k_m}{2}) \\leq e^{-\\frac{2(\\frac{3k_m}{10})^2}{k_m}}$. 
Solving for $n$ we have that $n = \\frac{200}{9} \\frac{\\sigma^2 \\ln(\\frac{1}{\\delta})}{\\epsilon^2} \\approx 22.222 \\frac{\\sigma^2 \\ln(\\frac{1}{\\delta})}{\\epsilon^2}$ samples suffice to guarantee that our estimate is $\\epsilon$-close to the expectation with probability at least $1 - \\delta$.\n\n\u2074 The number of samples per set was chosen so as to minimize the constants in the final bound.\n\nThe median of means offers logarithmic dependence on $\\frac{1}{\\delta}$, independence from the range of values that the variables in question can take (even allowing for them to be infinite), and can be computed efficiently. The median of means estimator only requires a finite variance and the existence of a mean. No assumptions (including boundedness) are made on higher moments.\n\n4 Median PAC exploration\n\nAlgorithm 1 Median-PAC\n1: Inputs: start state $s$, discount factor $\\gamma$, max number of samples $k$, number of sets $k_m$, and acceptable error $\\epsilon_a$.\n2: Initialize sample sets $u_{new}(s,a) = \\emptyset$, $u(s,a) = \\emptyset$ $\\forall (s,a)$. ($|u(s,a)|$ denotes the number of samples in $u(s,a)$)\n3: Set $\\epsilon_b = \\epsilon_a\\sqrt{k}$, and initialize value function $\\tilde{Q}(s,a) = Q_{\\max}$ $\\forall (s,a)$.\n4: loop\n5:   Perform action $a = \\arg\\max_{\\tilde{a}} \\tilde{Q}(s, \\tilde{a})$\n6:   Receive reward $r$, and transition to state $s'$.\n7:   if $|u(s,a)| < k$ then\n8:     Add $(s, a, r, s')$ to $u_{new}(s,a)$.\n9:     if $|u_{new}(s,a)| > |u(s,a)|$ and $|u_{new}(s,a)| = 2^i k_m$, where $i \\geq 0$ is an integer then\n10:       $u(s,a) = u_{new}(s,a)$\n11:       $u_{new}(s,a) = \\emptyset$\n12:     end if\n13:     while $\\max_{(s,a)}(\\tilde{B}\\tilde{Q}(s,a) - \\tilde{Q}(s,a)) > \\epsilon_a$ or $\\max_{(s,a)}(\\tilde{Q}(s,a) - \\tilde{B}\\tilde{Q}(s,a)) > \\epsilon_a$ do\n14:       Set $\\tilde{Q}(s,a) = \\tilde{B}\\tilde{Q}(s,a)$ $\\forall (s,a)$.\n15:     end while\n16:   end if\n17: end loop\n18: function $\\tilde{B}\\tilde{Q}(s,a)$\n19:   if $|u(s,a)| \\geq k_m$ then\n20:     Let $(s, a, r_i, s'_i)$ be the $i$-th sample in $u(s,a)$.\n21:     for $j = 1$ to $k_m$ do\n22:       $g(j) = \\sum_{i=1+(j-1)\\frac{|u(s,a)|}{k_m}}^{j\\frac{|u(s,a)|}{k_m}} (r_i + \\gamma \\max_{\\bar{a}} \\tilde{Q}(s'_i, \\bar{a}))$\n23:     end for\n24:     return $\\min\\{Q_{\\max}, \\frac{\\epsilon_b}{\\sqrt{|u(s,a)|}} + \\frac{k_m \\mathrm{median}\\{g(1), \\ldots, g(k_m)\\}}{|u(s,a)|}\\}$\n25:   else\n26:     return $Q_{\\max}$.\n27:   end if\n28: end function\n\nAlgorithm 1 has three parameters that can be set by the user:\n\n\u2022 $k$ is the maximum number of samples per state-action. As we will show, higher values for $k$ lead to increased sample complexity but better approximation.\n\u2022 $\\epsilon_a$ is an \u201cacceptable error\u201d term. Since Median-PAC is based on value iteration (lines 13 through 15) we specify a threshold after which value iteration should terminate. 
Value iteration is suspended when the max-norm of the difference between Bellman backups is no larger than $\\epsilon_a$.\n\u2022 Due to the stochasticity of Markov decision processes, Median-PAC is only guaranteed to achieve a particular approximation quality with some probability. $k_m$ offers a trade-off between approximation quality and the probability that this approximation quality is achieved. For a fixed $k$, smaller values of $k_m$ offer potentially improved approximation quality, while larger values offer a higher probability of success. For simplicity of exposition our analysis requires that $k = 2^i k_m$ for some integer $i$. If $k_m \\geq \\lceil \\frac{50}{9} \\ln \\frac{4 \\log_2 \\frac{4Q_{\\max}^2}{\\epsilon_a^2} |SA|^2}{\\delta} \\rceil$ the probability of failure is bounded above by $\\delta$.\n\nLike most modern PAC exploration algorithms, Median-PAC is based on the principle of optimism in the face of uncertainty. At every step, the algorithm selects an action greedily based on the current estimate of the Q-value function $\\tilde{Q}$. The value function is optimistically initialized to $Q_{\\max}$, the highest value that any state-action can take. If $k$ is set appropriately (see theorem 5.4), the value function is guaranteed to remain approximately optimistic (approximately represent the most optimistic world consistent with the algorithm\u2019s observations) with high probability.\n\nWe would like to draw the reader\u2019s attention to two aspects of Median-PAC, both in the way Bellman backups are computed: 1) Instead of taking a simple average over sample values, Median-PAC divides them into $k_m$ sets, computes the mean over each set, and takes the median of means. 2) Instead of using all the samples available for every state-action, Median-PAC uses samples in batches of a power of 2 times $k_m$ (line 9). 
The reasoning behind the first choice follows from the discussion above: using the median of means will allow us to show that Median-PAC\u2019s complexity scales with the variance of the Bellman operator (see definition 5.1) rather than $Q_{\\max}^2$. The reasoning behind using samples in batches of increasing powers of 2 is more subtle. A key requirement in the analysis of our algorithm is that samples belonging to the same state-action are independent. While the outcome of sample $i$ does not provide information about the outcome of sample $j$ if $i < j$ (from the Markov property), the fact that $j$ samples exist can reveal information about the outcome of $i$. If the first $i$ samples led to a severe underestimation of the value of the state-action in question, it is likely that $j$ samples would never have been collected. The fact that they were gives us some information about the outcome of the first $i$ samples. Using samples in batches, and discarding the old batch when a new batch becomes available, ensures that the outcomes of samples within each batch are independent from one another.\n\n5 Analysis\n\nDefinition 5.1. $\\sigma$ is the minimal constant satisfying\n\n$\\forall (s, a, \\pi^{\\tilde{Q}}, \\tilde{Q}), \\sqrt{\\sum_{s'} p(s'|s,a) (R(s,a,s') + \\gamma \\tilde{Q}(s', \\pi^{\\tilde{Q}}(s')) - B^{\\pi^{\\tilde{Q}}}\\tilde{Q}(s,a))^2} \\leq \\sigma$,\n\nwhere $\\forall \\tilde{Q}$ refers to any value function produced by Median-PAC, rather than any conceivable value function (similarly $\\pi^{\\tilde{Q}}$ refers to any greedy policy over $\\tilde{Q}$ followed during the execution of Median-PAC rather than any conceivable policy).\n\nIn the following we will call $\\sigma^2$ the variance of the Bellman operator. 
Note that the variance of the Bellman operator is not the same as the variance, or stochasticity, in the transition model of an MDP. A state-action can be highly stochastic (lead to many possible next states), yet if all the states it transitions to have similar values, the variance of its Bellman operator will be small.\n\nFrom lemmas 5.2, 5.3, and theorem 5.4 below, we have that Median-PAC is efficient PAC-MDP.\n\nLemma 5.2. The space complexity of algorithm 1 is $O(k|S||A|)$.\n\nProof. Follows directly from the fact that at most $k$ samples are stored per state-action.\n\nLemma 5.3. The per step computational complexity of algorithm 1 is bounded above by\n\n$O(\\frac{k|S||A|^2}{1-\\gamma} \\ln \\frac{Q_{\\max}}{\\epsilon_a})$.\n\nProof. The proof of this lemma is deferred to the appendix.\n\nTheorem 5.4 below is the main theorem of this paper. It decomposes errors into the following three sources:\n\n1. $\\epsilon_a$ is the error caused by the fact that we are only finding an $\\epsilon_a$-approximation, rather than the true fixed point of the approximate Bellman operator $\\tilde{B}$, and the fact that we are using only a finite set of samples (at most $k$) to compute the median of the means, thus we only have an estimate.\n\n2. $\\epsilon_u$ is the error caused by underestimating the variance of the MDP. When $k$ is too small and Median-PAC fails to be optimistic, $\\epsilon_u$ will be non-zero. $\\epsilon_u$ is a measure of how far Median-PAC is from being optimistic (following the greedy policy over the value function of the most optimistic world consistent with its observations).\n\n3. Finally, $\\epsilon_e(t)$ is the error caused by the fact that at time $t$ there may exist state-actions that do not yet have $k$ samples.\n\nTheorem 5.4. Let $(s_1, s_2, s_3, \\ldots)$ be the random path generated on some execution of Median-PAC, and $\\tilde{\\pi}$ be the (non-stationary) policy followed by Median-PAC. 
Let $\\epsilon_u = \\max\\{0, \\sigma\\sqrt{4k_m} - \\epsilon_a\\sqrt{k}\\}$, and $\\epsilon_a$ be defined as in algorithm 1. If $k_m = \\lceil \\frac{50}{9} \\ln \\frac{4 \\log_2 \\frac{4Q_{\\max}^2}{\\epsilon_a^2} |SA|^2}{\\delta} \\rceil$,\n\n\u270fa \uf8ff \u270fbpk\n1 ln (1)Qmax\nkm|SA|+1\n, 2d 1\ne2 ln\nlog2\n2k\nkm\n\u270fa\n\n< 1, and $k = 2^i k_m$ for some integer $i$, then with probability at least $1 - \\delta$, for all $t$\n\n$V^*(s_t) - V^{\\tilde{\\pi}}(s_t) \\leq \\frac{2\\epsilon_u + 5\\epsilon_a}{1-\\gamma} + \\epsilon_e(t)$, (1)\n\nwhere\n\n$\\sum_{t=0}^{\\infty} \\epsilon_e(t) < c_0((2k_m + \\log_2 \\frac{2k}{k_m}) Q_{\\max} + \\epsilon_a k (8 + \\frac{8}{\\sqrt{2}}))$, (2)\n\nand\n\nc0 =\n(|SA| + 1)\u21e31 + log2l 1\n1 s 2l 1\nm\u2318l 1\n1 ln (1)Qmax\nm2 ln\n1 ln (1)Qmax\nkm|SA|+1\nlog2\n2k\nkm\n\u270fa\n\u270fa\n1 ln (1)Qmax\n\u270fa\nm\n.\n\nIf $k = 2^i k_m$ where $i$ is the smallest integer such that $2^i \\geq \\frac{4\\sigma^2}{\\epsilon_a^2}$, and $\\epsilon_0 = (1-\\gamma)\\epsilon_a$, then with probability at least $1 - \\delta$, for all $t$\n\n$V^*(s_t) - V^{\\tilde{\\pi}}(s_t) \\leq \\epsilon_0 + \\epsilon_e(t)$, (3)\n\nwhere\u2075\n\n$\\sum_{t=0}^{\\infty} \\epsilon_e(t) \\approx \\tilde{O}((\\frac{\\sigma^2}{\\epsilon_0(1-\\gamma)^2} + \\frac{Q_{\\max}}{1-\\gamma}) |SA|)$. (4)\n\nNote that the probability of success holds for all timesteps simultaneously, and $\\sum_{t=0}^{\\infty} \\epsilon_e(t)$ is an undiscounted infinite sum.\n\nProof. The detailed proof of this theorem is deferred to the appendix. Here we provide a proof sketch:\n\nThe non-stationary policy of the algorithm can be broken up into fixed policy (and fixed approximate value function) segments. The first step in proving theorem 5.4 is to show that the Bellman error of each state-action at a particular fixed approximate value function segment is acceptable with respect to the number of samples currently available for that state-action with high probability. We use Cantelli\u2019s and McDiarmid\u2019s inequalities to prove this point. 
This is where the median of means becomes useful, and it is the main difference between our work and earlier work. We then combine the result from the median of means, the fact that there are only a small number of possible policy and approximate value function changes that can happen during the lifetime of the algorithm, and the union bound, to prove that the Bellman error of all state-actions during all timesteps is acceptable with high probability. We subsequently prove that due to the optimistic nature of Median-PAC, at every time-step it will either perform well, or learn something new about the environment with high probability. Since there is only a finite number of things it can learn, the total cost of exploration for Median-PAC will be small with high probability.\n\n\u2075 $f(n) = \\tilde{O}(g(n))$ is a shorthand for $f(n) = O(g(n) \\log^c g(n))$ for some constant $c$.\n\nA typical \u201cnumber of suboptimal steps\u201d sample complexity bound follows as a simple corollary of theorem 5.4. If the total cost of exploration is $\\sum_{t=0}^{\\infty} \\epsilon_e(t)$ for an $\\epsilon_0$-optimal policy, there can be no more than $\\frac{\\sum_{t=0}^{\\infty} \\epsilon_e(t)}{\\epsilon_1}$ steps that are more than $(\\epsilon_0 + \\epsilon_1)$-suboptimal.\n\nNote that the sample complexity of Median-PAC depends log-linearly on $Q_{\\max}$, which can be finite even if $R_{\\max}$ is infinite. Consider for example an MDP for which the reward at every state-action follows a Gaussian distribution (for discrete MDPs this example requires rewards to be stochastic, while for continuous MDPs rewards can be a deterministic function of state-action-nextstate since there can be an infinite number of possible nextstates for every state-action). 
If the mean of the reward for every state-action is bounded above by $c$, $Q_{\\max}$ is bounded above by $\\frac{c}{1-\\gamma}$, even though $R_{\\max}$ is infinite.\n\nAs we can see from theorem 5.4, apart from being the first PAC exploration algorithm that can be applied to MDPs with unbounded rewards, Median-PAC offers significant advantages over the current state of the art for MDPs with bounded rewards. Until recently, the algorithm with the best known sample complexity for the discrete state-action setting was MORMAX, an algorithm by Szita and Szepesv\u00e1ri [16]. Theorem 5.4 offers an improvement of $\\frac{1}{(1-\\gamma)^2}$ even in the worst case, and trades a factor of $Q_{\\max}^2$ for a (potentially much smaller) factor of $\\sigma^2$. A recent algorithm by Pazis and Parr [12] currently offers the best known bounds for PAC exploration without additional assumptions on the number of states that each action can transition to. Compared to that work we trade a factor of $Q_{\\max}^2$ for a factor of $\\sigma^2$.\n\n5.1 Using Median-PAC when $\\sigma$ is not known\n\nIn many practical situations $\\sigma$ will not be known. Instead the user will have a fixed exploration cost budget, a desired maximum probability of failure $\\delta$, and a desired maximum error $\\epsilon_a$. Given $\\delta$ we can solve for the number of sets as $k_m = \\lceil \\frac{50}{9} \\ln \\frac{4 \\log_2 \\frac{4Q_{\\max}^2}{\\epsilon_a^2} |SA|^2}{\\delta} \\rceil$, at which point all variables in equation 2 except for $k$ are known, and we can solve for $k$. When the sampling budget is large enough such that $k \\geq \\frac{4\\sigma^2 k_m}{\\epsilon_a^2}$, then $\\epsilon_u$ in equation 1 will be zero. Otherwise $\\epsilon_u = \\sigma\\sqrt{4k_m} - \\epsilon_a\\sqrt{k}$.\n\n5.2 Beyond the discrete state-action setting\n\nRecent work has extended PAC exploration to the continuous state [11], concurrent exploration [4], and delayed update [12] settings. The goal in the concurrent exploration setting is to explore in multiple identical or similar MDPs and incur low aggregate exploration cost over all MDPs. 
For a concurrent algorithm to offer an improvement over non-concurrent exploration, the aggregate cost must be lower than the cost of non-concurrent exploration times the number of tasks. The delayed update setting takes into account the fact that in real world domains, reaching a fixed point after collecting a new sample can take longer than the time between actions. Contrary to other work that has exploited the variance of MDPs to improve bounds on PAC exploration [7, 3], our analysis does not make assumptions about the number of possible next states from a given action. As such, Median-PAC and its bounds are easily extensible to the continuous state, concurrent exploration, delayed update setting. Replacing the average over samples in an approximation unit with the median of means over samples in an approximation unit in the algorithm of Pazis and Parr [12] improves their bounds (which are the best published bounds for PAC exploration in these settings) by $(R_{\\max} + Q_{\\max})^2$ while introducing a factor of $\\sigma^2$.\n\n6 Experimental evaluation\n\nWe compared Median-PAC against the algorithm of Pazis and Parr [12] on a simple 5 by 5 gridworld (see appendix for more details). The agent has four actions: move one square up, down, left, or right. All actions have a 1% probability of self-transition with a reward of 100. Otherwise the agent moves in the chosen direction and receives a reward of 0, unless its action causes it to land on the top-right corner, in which case it receives a reward of 1. The world wraps around and the agent always starts at the center. 
The optimal policy for this domain is to take the shortest path to the top-right corner if at a state other than the top-right corner, and take any action while at the top-right corner.\n\nWhile the probability of any individual sample being a self-transition is small, unless the number of samples per state-action is very large, the probability that there will exist at least one state-action with significantly more than $\\frac{1}{100}$ sampled self-transitions is high. As a result, the naive average algorithm frequently produced a policy that maximized the probability of encountering state-actions with more than $\\frac{1}{100}$ sampled self-transitions. By contrast, it is far less likely that there will exist a state-action for which at least half of the sets used by the median of means have more than $\\frac{1}{100}$ sampled self-transitions. Median-PAC was able to consistently find the optimal policy.\n\n7 Related Work\n\nMaillard, Mann, and Mannor [8] present the distribution norm, a measure of hardness of an MDP. Similarly to our definition of the variance of the Bellman operator, the distribution norm does not directly depend on the stochasticity of the underlying transition model. It would be interesting to see if the distribution norm (or a similar concept) can be used to improve PAC exploration bounds for \u201ceasy\u201d MDPs.\n\nWhile to the best of our knowledge our work is the first in PAC exploration for MDPs that introduces a measure of hardness for MDPs (the variance of the Bellman operator), measures of hardness have been previously used in regret analysis [6]. Such measures include the diameter of an MDP [6], the one way diameter [2], as well as the span [2]. These measures express how hard it is to reach any state of an MDP from any other state. A major advantage of sample complexity over regret is that finite diameter is not required to prove PAC bounds. 
Nevertheless, if introducing a requirement for a finite diameter could offer drastically improved PAC bounds, it may be worth the trade-off for certain classes of problems. Note that variance and diameter of an MDP appear to be orthogonal. One can construct examples of arbitrary diameter and then manipulate the variance by changing the reward function and/or discount factor.\n\nAnother measure of hardness which was recently introduced in regret analysis is the Eluder dimension. Osband and Van Roy [10] show that if an MDP can be parameterized within some known function class, regret bounds that scale with the dimensionality, rather than cardinality, of the underlying MDP can be obtained. Like the diameter, the Eluder dimension appears to be orthogonal to the variance of the Bellman operator, potentially allowing for the two concepts to be combined.\n\nLattimore and Hutter [7] have presented an algorithm that can match the best known lower bounds for PAC exploration up to logarithmic factors for the case of discrete MDPs where every state-action can transition to at most two next states.\n\nTo the best of our knowledge there has been no work in learning with unbounded rewards. Harrison [5] has examined the feasibility of planning with unbounded rewards.\n\nAcknowledgments\n\nWe would like to thank Emma Brunskill, Tor Lattimore, and Christoph Dann for spotting an error in an earlier version of this paper, as well as the anonymous reviewers for helpful comments and suggestions. This material is based upon work supported in part by The Boeing Company, by ONR MURI Grant N000141110688, and by the National Science Foundation under Grant No. IIS-1218931. Opinions, findings, conclusions or recommendations herein are those of the authors and not necessarily those of the NSF.\n\nReferences\n\n[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. 
Journal of Computer and System Sciences - JCSS (special issue of selected papers\nfrom STOC\u201996), 58:137\u2013147, 1999.\n\n[2] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement\nlearning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty\nin Artificial Intelligence (UAI2009), pages 35\u201342, June 2009.\n\n[3] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement\nlearning. Advances in Neural Information Processing Systems, 2015.\n\n[4] Zhaohan Guo and Emma Brunskill. Concurrent PAC RL. In AAAI Conference on Artificial\nIntelligence, pages 2624\u20132630, 2015.\n\n[5] J. Michael Harrison. Discrete dynamic programming with unbounded rewards. The Annals of\nMathematical Statistics, 43(2):636\u2013644, April 1972.\n\n[6] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement\nlearning. Journal of Machine Learning Research, 11:1563\u20131600, August 2010.\n\n[7] Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In Proceedings of the\n23rd International Conference on Algorithmic Learning Theory, volume 7568 of Lecture Notes\nin Computer Science, pages 320\u2013334. Springer Berlin / Heidelberg, 2012.\n\n[8] Odalric-Ambrym Maillard, Timothy A Mann, and Shie Mannor. \u201cHow hard is my MDP?\u201d The\ndistribution-norm to the rescue. In Advances in Neural Information Processing Systems 27,\npages 1835\u20131843. 2014.\n\n[9] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, number 141\nin London Mathematical Society Lecture Note Series, pages 148\u2013188. Cambridge University\nPress, August 1989.\n\n[10] Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder\ndimension. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q.
Weinberger,\neditors, Advances in Neural Information Processing Systems 27, pages 1466\u20131474. 2014.\n\n[11] Jason Pazis and Ronald Parr. PAC optimal exploration in continuous space Markov decision\nprocesses. In AAAI Conference on Artificial Intelligence, pages 774\u2013781, July 2013.\n\n[12] Jason Pazis and Ronald Parr. Efficient PAC-optimal exploration in concurrent, continuous state\nMDPs with delayed updates. In AAAI Conference on Artificial Intelligence, February 2016.\n\n[13] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.\nWiley-Interscience, April 1994.\n\n[14] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite\nMDPs: PAC analysis. Journal of Machine Learning Research, 10:2413\u20132444, December 2009.\n\n[15] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. The MIT Press,\nCambridge, Massachusetts, 1998.\n\n[16] Istvan Szita and Csaba Szepesv\u00e1ri. Model-based reinforcement learning with nearly tight\nexploration complexity bounds. In International Conference on Machine Learning, pages\n1031\u20131038, 2010.\n", "award": [], "sourceid": 1929, "authors": [{"given_name": "Jason", "family_name": "Pazis", "institution": "MIT"}, {"given_name": "Ronald", "family_name": "Parr", "institution": "Duke University"}, {"given_name": "Jonathan", "family_name": "How", "institution": "MIT"}]}