{"title": "Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model", "book": "Advances in Neural Information Processing Systems", "page_first": 5186, "page_last": 5196, "abstract": "In this paper we consider the problem of computing an $\\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that given any state-action pair samples from the transition function in $O(1)$ time. Given such a DMDP with states $\\states$, actions $\\actions$, discount factor $\\gamma\\in(0,1)$, and rewards in range $[0, 1]$ we provide an algorithm which computes an $\\epsilon$-optimal policy with probability $1 - \\delta$ where {\\it both} the run time spent and number of sample taken is upper bounded by \n\\[\nO\\left[\\frac{|\\cS||\\cA|}{(1-\\gamma)^3 \\epsilon^2} \\log \\left(\\frac{|\\cS||\\cA|}{(1-\\gamma)\\delta \\epsilon}\n\t\t\\right) \n\t\t\\log\\left(\\frac{1}{(1-\\gamma)\\epsilon}\\right)\\right] ~.\n\\]\nFor fixed values of $\\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \\gamma)^{-1}$ and matches the sample complexity lower bounds proved in \\cite{azar2013minimax} up to logarithmic factors. \nWe also extend our method to computing $\\epsilon$-optimal policies for finite-horizon MDP with a generative model and provide a nearly matching sample complexity lower bound.", "full_text": "Near-Optimal Time and Sample Complexities for\n\nSolving Markov Decision Processes with a Generative\n\nModel\n\nAaron Sidford\n\nStanford University\n\nMengdi Wang\n\nPrinceton University\n\nXian Wu\n\nStanford University\n\nsidford@stanford.edu\n\nmengdiw@princeton.edu\n\nxwu20@stanford.edu\n\nLin F. 
Yang\n\nPrinceton University\n\nlin.yang@princeton.edu\n\nYinyu Ye\n\nStanford University\nyyye@stanford.edu\n\nAbstract\n\nIn this paper we consider the problem of computing an \u0001-optimal policy of a dis-\ncounted Markov Decision Process (DMDP) provided we can only access its tran-\nsition function through a generative sampling model that given any state-action\npair samples from the transition function in O(1) time. Given such a DMDP with\nstates S, actions A, discount factor \u03b3 \u2208 (0, 1), and rewards in range [0, 1] we\nprovide an algorithm which computes an \u0001-optimal policy with probability 1 \u2212 \u03b4\nwhere both the time spent and number of sample taken are upper bounded by\n\nO\n\n(1 \u2212 \u03b3)3\u00012 log\n\n(1 \u2212 \u03b3)\u03b4\u0001\n\nlog\n\n1\n\n(1 \u2212 \u03b3)\u0001\n\n(cid:20) |S||A|\n\n(cid:18) |S||A|\n\n(cid:19)\n\n(cid:18)\n\n(cid:19)(cid:21)\n\n.\n\nFor \ufb01xed values of \u0001, this improves upon the previous best known bounds by a\nfactor of (1 \u2212 \u03b3)\u22121 and matches the sample complexity lower bounds proved in\n[AMK13] up to logarithmic factors. We also extend our method to computing\n\u0001-optimal policies for \ufb01nite-horizon MDP with a generative model and provide a\nnearly matching sample complexity lower bound.\n\n1\n\nIntroduction\n\nMarkov decision processes (MDPs) are a fundamental mathematical abstraction used to model se-\nquential decision making under uncertainty and are a basic model of discrete-time stochastic control\nand reinforcement learning (RL). Particularly central to RL is the case of computing or learning an\napproximately optimal policy when the MDP itself is not fully known beforehand. 
One of the simplest such settings is when the states, rewards, and actions are all known but the transition between states when an action is taken is probabilistic, unknown, and can only be sampled from.

Computing an approximately optimal policy with high probability in this case is known as PAC RL with a generative model. It is a well studied problem, with multiple existing results providing algorithms with improved sample complexity (number of sample transitions taken) and running time (the total time of the algorithm) under various MDP reward structures, e.g. discounted infinite-horizon, finite-horizon, etc. (See Section 5 for a detailed review of the literature.)

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work, we consider this well studied problem of computing approximately optimal policies of discounted infinite-horizon Markov Decision Processes (DMDPs) under the assumption that we can only access the DMDP by sampling state transitions. Formally, we suppose that we have a DMDP with a known set of states, S, a known set of actions that can be taken at each state, A, a known reward r_{s,a} ∈ [0, 1] for taking action a ∈ A at state s ∈ S, and a discount factor γ ∈ (0, 1). We assume that taking action a at state s probabilistically transitions an agent to a new state based on a fixed, but unknown, probability vector P_{s,a}. The objective is to maximize the expected cumulative sum of discounted rewards. Throughout this paper, we assume that we have a generative model, a notion introduced by [Kak03], which allows us to draw random state transitions of the DMDP. In particular, we assume that we can sample from the distribution defined by P_{s,a} for all (s, a) ∈ S × A in O(1) time.
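A generative model of this kind is straightforward to emulate once P is explicit. The following sketch (all names are ours, not the paper's) preprocesses each row P_{s,a} into a cumulative table in linear time; a query then takes O(log |S|) by bisection, while the O(1)-per-query oracle assumed above could be obtained with the alias method instead.

```python
import bisect
import random

class GenerativeModel:
    """Sampling oracle for a tabular DMDP: draws s' ~ P(. | s, a).

    Each row P_{s,a} is preprocessed into a cumulative table once (linear
    time); a query then costs O(log |S|) via bisection.  The O(1)-per-sample
    oracle the paper assumes can be obtained with, e.g., the alias method.
    """

    def __init__(self, P, seed=0):
        # P[s][a] is a probability vector over next states.
        self.rng = random.Random(seed)
        self.cdf = {}
        for s, row in enumerate(P):
            for a, probs in enumerate(row):
                cum, total = [], 0.0
                for q in probs:
                    total += q
                    cum.append(total)
                self.cdf[(s, a)] = cum

    def sample(self, s, a):
        u = self.rng.random()  # uniform in [0, 1)
        cum = self.cdf[(s, a)]
        # first index whose cumulative mass exceeds u (clamped for rounding)
        return min(bisect.bisect_right(cum, u), len(cum) - 1)
```

Usage is a single call, `model.sample(s, a)`, so algorithms below can treat the transition dynamics as a black box.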
This is a natural assumption and can be achieved in expectation in certain computational models with linear time preprocessing of the DMDP.¹

The main result of this paper is that we provide the first algorithm that is sample-optimal and runtime-optimal (up to polylogarithmic factors) for computing an ε-optimal policy of a DMDP with a generative model (in the regime of ε ≥ 1/√((1 − γ)|S|)). In particular, we develop a randomized Variance-Reduced Q-Value Iteration (vQVI) based algorithm that computes an ε-optimal policy with probability 1 − δ with a number of samples, i.e. queries to the generative model, bounded by

O[ |S||A| / ((1 − γ)³ε²) · log(|S||A| / ((1 − γ)δε)) · log(1 / ((1 − γ)ε)) ].

This result matches (up to polylogarithmic factors) the following sample complexity lower bound established in [AMK13] for finding ε-optimal policies with probability 1 − δ (see Appendix D):

Ω[ |S||A| / ((1 − γ)³ε²) · log(|S||A| / δ) ].

Furthermore, we show that the algorithm can be implemented using sparse updates such that the overall run-time complexity is equal to its sample complexity up to constant factors, as long as each sample transition can be generated in O(1) time. Consequently, up to logarithmic factors, our run time complexity is optimal as well. In addition, the algorithm's space complexity is Θ(|S||A|).

Our method and analysis build upon a number of prior works. (See Section 5 for an in-depth comparison.) The paper [AMK13] provided the first algorithm that achieves the optimal sample complexity for finding ε-optimal value functions (rather than ε-optimal policies), as well as the matching lower bound.
Unfortunately, an ε-optimal value function does not imply an ε-optimal policy, and if we directly use the method of [AMK13] to get an ε-optimal policy for constant ε, the best known sample complexity is Õ(|S||A|(1 − γ)⁻⁵ε⁻²).² This bound is known to be improvable through related work of [SWWY18], which provides a method for computing an ε-optimal policy using Õ(|S||A|(1 − γ)⁻⁴ε⁻²) samples and total runtime, and through the work of [AMK13], which in the regime of small approximation error, i.e. where ε = O((1 − γ)⁻¹/²|S|⁻¹/²), already provides a method that achieves the optimal sample complexity. However, when the approximation error takes fixed values, e.g. ε ≥ Ω((1 − γ)⁻¹/²|S|⁻¹/²), there remains a gap between the best known runtime and sample complexity for computing an ε-optimal policy and the theoretical lower bounds. For fixed values of ε, which mostly occur in real applications, our algorithm improves upon the previous best sample and time complexity bounds by a factor of (1 − γ)⁻¹, where γ ∈ (0, 1), the discount factor, is typically close to 1.

We achieve our results by combining and strengthening techniques from both [AMK13] and [SWWY18]. On the one hand, in [AMK13] the authors showed that simply constructing a "sparsified" MDP model by taking samples and then solving this model to high precision yields a sample-optimal algorithm in our setting for computing the approximate value of every state. On the other hand, [SWWY18] provided faster algorithms for solving explicit DMDPs and improved sample and time complexities given a sampling oracle. In fact, as we show in Appendix B.1, simply combining these two results yields the first nearly optimal runtime for approximately learning the value function with a generative model.
Unfortunately, it is known that an approximately optimal value function does not immediately yield an approximately optimal policy of comparable quality (see e.g. [Ber13]), and it was previously unclear how to combine these methods to improve upon previously known bounds for computing an approximate policy. To achieve our policy computation algorithm we therefore open up both the algorithms and the analysis in [AMK13] and [SWWY18], combining them in non-trivial ways. Our proofs leverage techniques ranging from standard probabilistic analysis tools, such as the Hoeffding and Bernstein inequalities, to optimization techniques such as variance reduction, to properties specific to MDPs, such as the Bellman fixed-point recursion for the expectation and variance of the optimal value vector, and the monotonicity of value iteration.

Finally, we extend our method to finite-horizon MDPs, which also occur frequently in real applications. We show that the number of samples needed by this algorithm is Õ(H³|S||A|ε⁻²) in order to obtain an ε-optimal policy for H-horizon MDPs (see Appendix F). We also show that the preceding sample complexity is optimal up to logarithmic factors by providing a matching lower bound.

¹ If instead the oracle needed time τ, every running time result in this paper should be multiplied by τ.
² [AMK13] showed that one can obtain an ε-optimal value v (instead of an ε-optimal policy) using sample size ∝ (1 − γ)⁻³ε⁻². By using this ε-optimal value v, one can get a greedy policy that is [(1 − γ)⁻¹ε]-optimal. By setting ε → (1 − γ)ε, one can obtain an ε-optimal policy using a number of samples ∝ (1 − γ)⁻⁵ε⁻².
We hope this work ultimately opens the door for future practical and theoretical work on solving MDPs and efficient RL more broadly.

2 Preliminaries

We use calligraphic upper case letters for sets or operators, e.g., S, A and T. We use bold lower case letters for vectors, e.g., v, r. We denote by v_s or v(s) the s-th entry of vector v. We denote matrices by bold upper case letters, e.g., P. We denote constants by normal upper case letters, e.g., M. For a vector v ∈ R^N with index set N, we denote by √v, |v|, and v² the vectors in R^N obtained by applying √·, |·|, and (·)² coordinate-wise. For two vectors v, u ∈ R^N, we denote by v ≤ u coordinate-wise comparison, i.e., ∀i ∈ N: v(i) ≤ u(i); the relations ≥, < and > are defined analogously.

We describe a DMDP by the tuple (S, A, P, r, γ), where S is a finite state space, A is a finite action space, P ∈ R^{S×A×S} is the state-action-state transition matrix, r ∈ R^{S×A} is the state-action reward vector, and γ ∈ (0, 1) is a discount factor. We use P_{s,a}(s′) to denote the probability of going to state s′ from state s when taking action a. We also identify each P_{s,a} as a vector in R^S. We use r_{s,a} to denote the reward obtained from taking action a ∈ A at state s ∈ S and assume r ∈ [0, 1]^{S×A}.³ For a vector v ∈ R^S, we denote by P v ∈ R^{S×A} the vector with (P v)_{s,a} = P⊤_{s,a} v. A policy π : S → A maps each state to an action. The objective of the MDP is to find the optimal policy π* that maximizes the expectation of the cumulative sum of discounted rewards.

In the remainder of this section we give definitions for several prominent concepts in MDP analysis that we use throughout the paper.

Definition 2.1 (Bellman Value Operator).
For a given DMDP the value operator T : R^S → R^S is defined for all u ∈ R^S and s ∈ S by T(u)_s = max_{a∈A}[r_{s,a} + γ · P⊤_{s,a} u], and we let v* denote the value of the optimal policy π*, which is the unique vector such that T(v*) = v*.

Definition 2.2 (Policy). We call any vector π ∈ A^S a policy and say that the action prescribed by policy π to be taken at state s ∈ S is π_s. We let T_π : R^S → R^S denote the value operator associated with π, defined for all u ∈ R^S and s ∈ S by T_π(u)_s = r_{s,π(s)} + γ · P⊤_{s,π(s)} u, and we let v^π denote the values of policy π, which is the unique vector such that T_π(v^π) = v^π.

Note that T_π can be viewed as the value operator for the modified MDP where the only available action from each state is given by the policy π. Note that this modified MDP is essentially just an uncontrolled Markov chain, i.e. there are no action choices that can be made.

Definition 2.3 (ε-optimal value and policy). We say values u ∈ R^S are ε-optimal if ‖v* − u‖∞ ≤ ε and policy π ∈ A^S is ε-optimal if ‖v* − v^π‖∞ ≤ ε, i.e. the values of π are ε-optimal.

Definition 2.4 (Q-function). For any policy π, we define the Q-function of an MDP with respect to π as a vector Q^π ∈ R^{S×A} such that Q^π(s, a) = r(s, a) + γ P⊤_{s,a} v^π. The optimal Q-function is defined as Q* = Q^{π*}. We call any vector Q ∈ R^{S×A} a Q-function even though it may not relate to a policy or a value vector, and define v(Q) ∈ R^S and π(Q) ∈ A^S as the value and policy implied by Q, by

∀s ∈ S: v(Q)(s) = max_{a∈A} Q(s, a)  and  π(Q)(s) = argmax_{a∈A} Q(s, a).

For a policy π, let P^π Q ∈ R^{S×A} be defined as (P^π Q)(s, a) = Σ_{s′∈S} P_{s,a}(s′) Q(s′, π(s′)).

³ A general r ∈ R^{S×A} can always be reduced to this case by shifting and scaling.

3 Technique Overview

In this section we provide a more detailed and technical overview of our approach. At a high level, our algorithm shares a similar framework with the variance reduction algorithm presented in [SWWY18]. That algorithm used two crucial algorithmic techniques, which are also critical in this paper; we refer to these techniques as the monotonicity technique and the variance reduction technique. Our algorithm and the results of this paper can be viewed as an advanced, non-trivial integration of these two methods, augmented with a third technique, which we refer to as the total-variance technique, that was discovered in several papers [MM99, LH12, AMK13].
In the remainder of this section we give an overview of these techniques and, through this, explain our algorithm.

The Monotonicity Technique  Recall that the classic value iteration algorithm for solving an MDP repeatedly applies the rule

v(i)(s) ← max_a (r(s, a) + γ P⊤_{s,a} v(i−1)).    (3.1)

A greedy policy π(i) can be obtained at each iteration i by

∀s: π(i)(s) ← argmax_a (r(s, a) + γ P⊤_{s,a} v(i)).    (3.2)

For any u > 0, it can be shown that if one can approximate v(i)(s) by v̂(i)(s) such that ‖v̂(i) − v(i)‖∞ ≤ (1 − γ)u and run the above value iteration algorithm using these approximated values, then after Θ((1 − γ)⁻¹ log[u⁻¹(1 − γ)⁻¹]) iterations the final iterate is a u-optimal value function ([Ber13]). However, a u-optimal value function only yields a u/(1 − γ)-optimal greedy policy (in the worst case), even if (3.2) is computed exactly. To get around this additional loss, a monotone-VI algorithm was proposed in [SWWY18] as follows. At each iteration, this algorithm maintains not only an approximate value v(i) but also a policy π(i).
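For reference, the exact update (3.1) together with greedy-policy extraction can be written in a few lines. This is only a baseline sketch assuming full access to P; the paper's algorithms replace P⊤_{s,a}v with sampled, shifted estimates.

```python
def value_iteration(P, r, gamma, iters):
    """Exact value iteration, eq. (3.1): v(s) <- max_a [ r(s,a) + gamma * P_{s,a}^T v ].

    P[s][a] is the transition row and r[s][a] the reward; full access to P
    is assumed purely for illustration.  Returns the final values and the
    greedy policy with respect to them.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        # apply the Bellman optimality operator to the previous iterate
        v = [max(r[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a]))
                 for a in range(len(P[s])))
             for s in range(n)]
    # greedy policy extraction, the rule labeled (3.2) above
    pi = [max(range(len(P[s])),
              key=lambda a: r[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a])))
          for s in range(n)]
    return v, pi
```

On a toy two-state chain where state 1 is absorbing with reward 1, this recovers v* = (γ/(1 − γ), 1/(1 − γ)) and the policy that moves to state 1.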
The key to the improvement is to keep the values as lower bounds on the value of the maintained policy, on a set of sample paths, with high probability. In particular, the following monotonicity condition is maintained with high probability:

v(i) ≤ T_{π(i)}(v(i)).

By the monotonicity of the Bellman operator, this condition guarantees that v(i) ≤ v^{π(i)}. If the condition is satisfied and, after R iterations of approximate value iteration, we obtain a value v̂(R) that is u-optimal, then we also obtain a policy π(R) which, by the monotonicity condition and the monotonicity of the Bellman operator T_{π(R)}, yields

v(R) ≤ T_{π(R)}(v(R)) ≤ T²_{π(R)}(v(R)) ≤ ... ≤ T^∞_{π(R)}(v(R)) = v^{π(R)} ≤ v*,

and therefore this π(R) is a u-optimal policy. Ultimately, this technique avoids the standard loss of a (1 − γ)⁻¹ factor when converting values to policies.

The Variance Reduction Technique  Suppose now that we provide an algorithm that maintains the monotonicity condition using random samples from P_{s,a} to approximately compute (3.1). Further, suppose we want to obtain a new value function and policy that is at least (u/2)-optimal. In order to obtain the desired accuracy, we need to approximate P⊤_{s,a} v(i) up to error at most (1 − γ)u/2. Since ‖v(i)‖∞ ≤ (1 − γ)⁻¹, by the Hoeffding bound, Õ((1 − γ)⁻⁴u⁻²) samples suffice. Note that the number of samples also determines the computation time, and therefore each iteration takes Õ((1 − γ)⁻⁴u⁻²|S||A|) samples/computation time, with Õ((1 − γ)⁻¹) iterations needed for value iteration to converge.
Overall, this yields a sample/computation complexity of Õ((1 − γ)⁻⁵u⁻²|S||A|).

To reduce the (1 − γ)⁻⁵ dependence, [SWWY18] uses properties of the input (and initialization) vectors, namely ‖v(0) − v*‖∞ ≤ u, and rewrites value iteration (3.1) as

v(i)(s) ← max_a [r(s, a) + γ P⊤_{s,a}(v(i−1) − v(0)) + γ P⊤_{s,a} v(0)].    (3.3)

Notice that P⊤_{s,a} v(0) is shared over all iterations, and we can approximate it up to error (1 − γ)u/4 using only Õ((1 − γ)⁻⁴u⁻²) samples. For every iteration, we have ‖v(i−1) − v(0)‖∞ ≤ u (recall that we demand that monotonicity be satisfied at each iteration). Hence P⊤_{s,a}(v(i−1) − v(0)) can be approximated up to error (1 − γ)u/4 using only Õ((1 − γ)⁻²) samples (note that there is no u-dependence here). By this technique, over Õ((1 − γ)⁻¹) iterations, only Õ((1 − γ)⁻⁴u⁻² + (1 − γ)⁻³) samples/computation per state-action pair are needed, i.e. there is a (1 − γ) improvement.

The Total-Variance Technique  By combining the monotonicity technique and the variance reduction technique, one can obtain an Õ((1 − γ)⁻⁴) sample/running-time complexity (per state-action pair) for computing a policy; this was one of the results of [SWWY18]. However, there is a gap between this bound and the best known lower bound of Ω̃[|S||A|ε⁻²(1 − γ)⁻³] [AMK13]. Here we show how to remove the last (1 − γ) factor by better exploiting the structure of the MDP.
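The variance-reduced rewrite (3.3) amounts to reusing one expensive estimate and re-estimating only a small residual. A minimal sketch, with function and argument names that are ours rather than the paper's:

```python
def vr_estimate(sample, s, a, v0, v_i, w0, m2):
    """Variance-reduced estimate of P_{s,a}^T v_i, following the rewrite (3.3).

    w0 is a precomputed high-accuracy estimate of P_{s,a}^T v0 (built once
    from many samples and reused in every iteration); only the residual
    P_{s,a}^T (v_i - v0) is re-estimated, from m2 fresh draws, which is
    cheap because ||v_i - v0||_inf is small.  `sample(s, a)` draws s' ~ P_{s,a}.
    """
    acc = 0.0
    for _ in range(m2):
        sp = sample(s, a)          # fresh sample s' ~ P_{s,a}
        acc += v_i[sp] - v0[sp]    # residual term, bounded by ||v_i - v0||_inf
    return w0 + acc / m2
```

The design point is that m2 depends on ‖v_i − v0‖∞ rather than on ‖v_i‖∞, which is where the (1 − γ) factor is saved.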
In [SWWY18] the update error in each iteration was set to be at most (1 − γ)u/2 to compensate for error accumulation over a horizon of length (1 − γ)⁻¹ (i.e., the accumulated error is the sum of the estimation errors of all iterations). To improve upon this, we show that the true error accumulation is much smaller. To see this, let us now switch to the Bernstein inequality. Suppose we would like to estimate the value function of some policy π. The estimation error vector of the value function is upper bounded by Õ(√(σ_π/m)), where σ_π(s) = Var_{s′∼P_{s,π(s)}}(v^π(s′)) denotes the variance of the value of the next state when starting from state s and playing policy π, and m is the number of samples collected per state-action pair. The accumulated error due to estimating value functions can be shown to obey the following inequality (up to logarithmic factors):

accumulated error ∝ Σ_{i=0}^{∞} γ^i P_π^i √(σ_π/m) ≤ c₁ [ (1/(1 − γ)) Σ_{i=0}^{∞} γ^{2i} P_π^i σ_π/m ]^{1/2},

where c₁ is a constant and the inequality follows from a Cauchy–Schwarz-like inequality. According to the law of total variance, for any given policy π (in particular, the optimal policy π*) and initial state s, the expected sum of variances of the tail sums of rewards, Σ_i γ^{2i} P_π^i σ_π, is exactly the variance of the total return of playing the policy π. This observation was previously used in the analysis of [MM99, LH12, AMK13].
Since the upper bound on the total return is (1 − γ)⁻¹, it can be shown that Σ_i γ^{2i} P_π^i σ_π ≤ (1 − γ)⁻² · 1, and therefore the total error accumulation is at most √((1 − γ)⁻³/m). Thus picking m ≈ (1 − γ)⁻³ε⁻² suffices to control the accumulated error (instead of (1 − γ)⁻⁴). To analyze our algorithm, we apply the above inequality to the optimal policy π* to obtain our final error bound.

Putting it All Together  In the next section we show how to combine these three techniques into one algorithm and make them work seamlessly. In particular, we provide and analyze Algorithm 1, which can be used to at least halve the error of a current policy; applying this routine a logarithmic number of times then yields our desired bounds. In the input of the algorithm, we demand that the input value v(0) and policy π(0) satisfy the required monotonicity condition, i.e., v(0) ≤ T_{π(0)}(v(0)) (in the first iteration, the zero vector 0 and an arbitrary policy π satisfy the requirement). We then pick a set of samples to estimate P v(0) accurately, using Õ((1 − γ)⁻³ε⁻²) samples per state-action pair. The same set of samples is used to estimate the variance vector σ_{v*}. These estimates serve as the initialization of the algorithm. In each iteration i, we draw fresh new samples to compute an estimate of P(v(i) − v(0)). The sum of the estimates of P v(0) and P(v(i) − v(0)) gives an estimate of P v(i). We then make these estimates have one-sided error by shifting them according to their estimation errors (which are estimated via the Bernstein inequality).
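The one-sided shifting step can be sketched directly from the empirical Bernstein bound: subtract the confidence width from the empirical mean so that, with high probability, the shifted estimate never exceeds the true expectation. The constants below are the generic Bernstein ones, not the paper's tuned c₁, c₂, c₃.

```python
import math

def one_sided_estimate(next_states, v, L):
    """Empirical mean of v(s') shifted down by a Bernstein-style width.

    The width  sqrt(2 * var * L / m) + 2 * L * ||v||_inf / (3 m)  makes the
    shifted estimate a lower bound on P_{s,a}^T v with probability roughly
    1 - e^{-L} (up to constants).  `next_states` holds m sampled indices
    s' ~ P_{s,a}; these are generic constants, not the paper's tuned ones.
    """
    m = len(next_states)
    vals = [v[sp] for sp in next_states]
    mean = sum(vals) / m
    var = sum((x - mean) ** 2 for x in vals) / m      # empirical variance
    v_max = max(abs(x) for x in v)                    # ||v||_inf
    width = math.sqrt(2.0 * var * L / m) + 2.0 * L * v_max / (3.0 * m)
    return mean - width
```

Because the shift scales with the empirical standard deviation, low-variance estimates lose very little accuracy while still becoming one-sided.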
These one-sided error estimates allow us to preserve monotonicity, i.e., they guarantee that the new value is always improving on the entire sample path with high probability. The estimate of P v(i) is plugged into the Bellman operator and gives us a new value function v(i+1) and policy π(i+1), satisfying monotonicity and advancing accuracy. Repeating the above procedure for the desired number of iterations completes the algorithm.

4 Algorithm and Analysis

In this section we provide and analyze our near sample/time-optimal ε-policy computation algorithm. As discussed in Section 3, our algorithm combines three main ideas: variance reduction, monotone value/policy iteration, and the reduction of accumulated error via the Bernstein inequality. These ingredients are used in Algorithm 1 to provide a routine which halves the error of a given policy. We analyze this procedure in Section 4.1 and use it to obtain our main result in Section 4.2.

Algorithm 1 Variance-Reduced QVI
1: Input: A sampling oracle for DMDP M = (S, A, r, P, γ)
2: Input: Upper bound on error u ∈ [0, (1 − γ)⁻¹] and error probability δ ∈ (0, 1)
3: Input: Initial values v(0) and policy π(0) such that v(0) ≤ T_{π(0)} v(0), and v* − v(0) ≤ u1;
4: Output: v, π such that v ≤ T_π(v) and v* − v ≤ (u/2) · 1.
5:
6: INITIALIZATION:
7: Let β ← (1 − γ)⁻¹, and R ← ⌈c₁β ln[βu⁻¹]⌉ for constant c₁;
8: Let m₁ ← c₂β³u⁻² log(8|S||A|δ⁻¹) for constant c₂;
9: Let m₂ ← c₃β² log[2R|S||A|δ⁻¹] for constant c₃;
10: Let α₁ ← m₁⁻¹ log(8|S||A|δ⁻¹);
11: For each (s, a) ∈ S × A, sample independent samples s^(1)_{s,a}, s^(2)_{s,a}, ..., s^(m₁)_{s,a} from P_{s,a};
12: Initialize w = w̃ = σ̂ = Q(0) ← 0_{S×A}, and i ← 0;
13: for each (s, a) ∈ S × A do
14:   \\ Compute empirical estimates of P⊤_{s,a} v(0) and σ_{v(0)}(s, a)
15:   Let w̃(s, a) ← (1/m₁) Σ_{j=1}^{m₁} v(0)(s^(j)_{s,a});
16:   Let σ̂(s, a) ← (1/m₁) Σ_{j=1}^{m₁} (v(0))²(s^(j)_{s,a}) − w̃²(s, a);
17:   \\ Shift the empirical estimate to have one-sided error and guarantee monotonicity
18:   w(s, a) ← w̃(s, a) − √(2α₁σ̂(s, a)) − 4α₁^{3/4}‖v(0)‖∞ − (2/3)α₁‖v(0)‖∞;
19:   \\ Compute coarse estimate of the Q-function
20:   Q(0)(s, a) ← r(s, a) + γ w(s, a);
21: end for
24: REPEAT: \\ successively improve
25: for i = 1 to R do
26:   \\ Compute the value and policy implied by Q(i−1)
27:   Let v(i) ← v(Q(i−1)), π(i) ← π(Q(i−1)); \\ let ṽ(i) ← v(i), π̃(i) ← π(i) (for analysis);
28:   For each s ∈ S, if v(i)(s) ≤ v(i−1)(s), then v(i)(s) ← v(i−1)(s) and π(i)(s) ← π(i−1)(s);
29:   For each (s, a) ∈ S × A, draw independent samples s̃^(1)_{s,a}, s̃^(2)_{s,a}, ..., s̃^(m₂)_{s,a} from P_{s,a}; \\ to estimate P[v(i) − v(0)] with one-sided error
30:   Let g(i)(s, a) ← (1/m₂) Σ_{j=1}^{m₂} [v(i)(s̃^(j)_{s,a}) − v(0)(s̃^(j)_{s,a})] − (1 − γ)u/8;
31:   \\ Improve Q(i)
32:   for each (s, a) ∈ S × A do
33:     Q(i)(s, a) ← r(s, a) + γ · [w(s, a) + g(i)(s, a)];
34: return v(R), π(R).

4.1 The Analysis of the Variance Reduced Algorithm

In this section we analyze Algorithm 1, showing that each iteration of the algorithm approximately contracts towards the optimal value and policy and that ultimately the algorithm halves the error of the input value and policy with high probability. All proofs in this section are deferred to Appendix E.1.

We start by bounding the errors of w̃ and σ̂ defined in Lines 15 and 16 of Algorithm 1. Notice that these are the empirical estimates of P⊤_{s,a} v(0) and σ_{v(0)}(s, a).

Lemma 4.1 (Empirical Estimation Error). Let w̃ and σ̂ be computed in Lines 15 and 16 of Algorithm 1. Recall that w̃ and σ̂ are empirical estimates of P v and σ_v = P v² − (P v)² using m₁ samples per (s, a) pair. With probability at least 1 − δ, for L := log(8|S||A|δ⁻¹), we have, for all (s, a) ∈ S × A,

|w̃ − P⊤ v(0)| ≤ √(2m₁⁻¹ σ_{v(0)} · L) + 2(3m₁)⁻¹‖v(0)‖∞ L    (4.1)

and

|σ̂(s, a) − σ_{v(0)}(s, a)| ≤ 4‖v(0)‖²∞ · √(2m₁⁻¹ L).    (4.2)

The proof is a straightforward application of Bernstein's inequality and Hoeffding's inequality. Next we show that the difference between σ_{v(0)} and σ_{v*} is also bounded.

Lemma 4.2. Suppose ‖v − v*‖∞ ≤ ε for some ε > 0; then √σ_v ≤ √σ_{v*} + ε · 1.

Next we show that g(i), computed in Line 30, concentrates to, and is a one-sided underestimate of, P[v(i) − v(0)] with high probability.

Lemma 4.3. Let g(i) be the estimate of P[v(i) − v(0)] defined in Line 30 of Algorithm 1. Then, conditioned on the event that ‖v(i) − v(0)‖∞ ≤ 2u, with probability at least 1 − δ/R,

P[v(i) − v(0)] − [(1 − γ)u/4] · 1 ≤ g(i) ≤ P[v(i) − v(0)],

provided appropriately chosen constants c₁, c₂, and c₃ in Algorithm 1.

Now we present the key contraction lemma, in which we set the constants c₁, c₂, c₃ in Algorithm 1 to be sufficiently large (e.g., c₁ ≥ 4, c₂ ≥ 8192, c₃ ≥ 128). Note that these constants only need to be sufficiently large so that the concentration inequalities hold.

Lemma 4.4. Let Q(i) be the estimated Q-function of v(i) in Line 33 of Algorithm 1.
Let π(i) and v(i) be estimated in iteration i, as defined in Lines 27 and 28. Then, with probability at least 1 − 2δ, for all 1 ≤ i ≤ R,

v(i−1) ≤ v(i) ≤ T_{π(i)}[v(i)],  Q(i) ≤ r + γ P v(i),  and  Q* − Q(i) ≤ γ P^{π*}[Q* − Q(i−1)] + ξ,

where, for α₁ = m₁⁻¹L < 1, the error vector ξ satisfies

0 ≤ ξ ≤ C√(α₁σ_{v*}) + [(1 − γ)u/C + Cα₁^{3/4}‖v(0)‖∞] · 1

for some sufficiently large constant C ≥ 8.

Using the previous lemmas we can prove the guarantees of Algorithm 1.

Proposition 4.5. On an input value vector v(0), policy π(0), and parameters u ∈ (0, (1 − γ)⁻¹], δ ∈ (0, 1) such that v(0) ≤ T_{π(0)}[v(0)] and v* − v(0) ≤ u1, Algorithm 1 halts in time O((1 − γ)⁻³u⁻² · |S||A| · log(|S||A|δ⁻¹(1 − γ)⁻¹u⁻¹)) and outputs values v and a policy π such that v ≤ T_π(v) and v* − v ≤ (u/2)1 with probability at least 1 − δ, provided appropriately chosen constants c₁, c₂, c₃.

We prove this proposition by iteratively applying Lemma 4.4. Suppose v(R) is the output of the algorithm after R iterations. We show that v* − v(R) ≤ γ^{R−1} P^{π*}[Q* − Q(0)] + (I − γP^{π*})⁻¹ξ. Notice that (I − γP^{π*})⁻¹ξ is related to (I − γP^{π*})⁻¹√σ_{v*}. We then apply the variance analytical tools presented in Section C to show that (I − γP^{π*})⁻¹ξ ≤ (u/4)1 when the constants in Algorithm 1 are set properly. We refer to this as the total-variance technique, since ‖(I − γP^{π*})⁻¹√σ_{v*}‖²∞ ≤ O[(1 − γ)⁻³] instead of the naïve bound of (1 − γ)⁻⁴. We complete the proof by choosing R = Θ̃((1 − γ)⁻¹ log(u⁻¹)) and showing that γ^{R−1} P^{π*}[Q* − Q(0)] ≤ (u/4)1.

4.2 From Halving the Error to Arbitrary Precision

In the previous section we provided an algorithm that, given an input policy, outputs a policy whose value vector is at most half as far from the optimal value vector in ℓ∞ distance. In this section we give a complete policy computation algorithm by showing that it is possible to apply this error "halving" procedure iteratively. We summarize our meta algorithm in Algorithm 2. Note that each call of HALFERR draws new samples from the sampling oracle. We refer in this section to Algorithm 1 as a subroutine HALFERR which, given an input MDP M with a sampling oracle, an input value function v(i), and an input policy π(i), outputs a value function v(i+1) and a policy π(i+1).

Combining Algorithm 2 and Algorithm 1, we are ready to present the main result.

Theorem 4.6. Let M = (S, A, P, r, γ) be a DMDP with a generative model. Suppose we can sample a state from each probability vector P_{s,a} within time O(1).
Then for any \u0001, \u03b4 \u2208 (0, 1), there\nexists an algorithm that halts in time\n\n(cid:20) |S||A|\n\n(cid:18) |S||A|\n\n(cid:19)\n\n(cid:18)\n\n(cid:19)(cid:21)\n\nT := O\n\n(1 \u2212 \u03b3)3\u00012 log\n\n(1 \u2212 \u03b3)\u03b4\u0001\n\nlog\n\n1\n\n(1 \u2212 \u03b3)\u0001\n\n7\n\n\fAlgorithm 2 Meta Algorithm\n1: Input: A sampling oracle of some M = (S,A, r, P , \u03b3), \u0001 > 0, \u03b4 \u2208 (0, 1)\n2: Initialize: v(0) \u2190 0, \u03c0(0) \u2190 arbitrary policy, R \u2190 \u0398[log(\u0001\u22121(1 \u2212 \u03b3)\u22121)]\n3: for i = {1, 2, . . . , R} do\n4:\n5:\n6: Output: v(R), \u03c0(R).\n\n//HALFERR is initialized with QVI(u = 2\u2212i+1(1 \u2212 \u03b3)\u22121, \u03b4, v(0) = v(i\u22121), \u03c0(0) = \u03c0(i\u22121))\nv(i), \u03c0(i) \u2190 HALFERR \u2190 v(i\u22121), \u03c0(i\u22121)\n\nand obtains a policy \u03c0 such that v\u2217 \u2212 \u00011 \u2264 v\u03c0 \u2264 v\u2217, with probability at least 1 \u2212 \u03b4 where v\u2217 is the\noptimal value of M. The algorithm uses space O(|S||A|) and queries the generative model for at\nmost O(T ) fresh samples.\nRemark 4.7. The full analysis of the halving algorithm is presented in Section E.2. Our algorithm\ncan be implemented in space O(|S||A|) since in Algorithm 1, the initialization phase can be done\nupdates can be computed in space O(|S||A|) as well.\n\nfor each (s, a) and compute w(s, a),(cid:101)w(s, a),(cid:98)\u03c3(s, a), Q(0)(s, a) without storing the samples. 
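The structure of Algorithm 2 can be made concrete with a short sketch (our own illustration, not the paper's code; `half_err` is a stand-in for the HALFERR subroutine, and the toy version below simply moves a scalar value halfway toward a known optimum rather than running QVI):

```python
import math

def meta_algorithm(half_err, eps, gamma, v0, pi0):
    """Sketch of Algorithm 2: run the error-halving subroutine
    R = ceil(log2(1 / ((1 - gamma) * eps))) times, so the initial error bound
    u0 = 1/(1 - gamma) decays to u0 * 2^(-R) <= eps."""
    R = max(1, math.ceil(math.log2(1.0 / ((1.0 - gamma) * eps))))
    v, pi = v0, pi0
    for i in range(1, R + 1):
        u = 2.0 ** (-i + 1) / (1.0 - gamma)  # error bound passed to HALFERR (QVI)
        v, pi = half_err(v, pi, u)
    return v, pi

# Toy stand-in for HALFERR: each call halves the distance to a known optimum.
V_STAR = 10.0  # equals 1/(1 - gamma) for gamma = 0.9, the largest possible value
halve = lambda v, pi, u: ((v + V_STAR) / 2.0, pi)
v_final, _ = meta_algorithm(halve, eps=0.01, gamma=0.9, v0=0.0, pi0=None)
# final error: 10 * 2^(-10) = 0.009765625 <= eps
```

In the real algorithm each HALFERR call draws fresh samples from the generative model, which is why the total sample count in Theorem 4.6 is the per-call cost times only a logarithmic number of rounds.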
5 Comparison to Previous Work

Table 1: Sample Complexity to Compute $\epsilon$-Approximate Policies Using the Generative Sampling Model. Here $|S|$ is the number of states, $|A|$ is the number of actions per state, $\gamma \in (0, 1)$ is the discount factor, and $C$ is an upper bound on the ergodicity. Rewards are bounded between 0 and 1.

Algorithm                               Sample Complexity                                                              References
Phased Q-Learning                       $\widetilde{O}(C|S||A|(1-\gamma)^{-7}\epsilon^{-2})$                           [KS99]
Empirical QVI                           $\widetilde{O}(|S||A|(1-\gamma)^{-5}\epsilon^{-2})$ [footnote 4]               [AMK13]
Empirical QVI                           $\widetilde{O}(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ if $\epsilon = \widetilde{O}(1/\sqrt{(1-\gamma)|S|})$   [AMK13]
Randomized Primal-Dual Method           $\widetilde{O}(C|S||A|(1-\gamma)^{-4}\epsilon^{-2})$                           [Wan17]
Sublinear Randomized Value Iteration    $\widetilde{O}(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$                            [SWWY18]
Sublinear Randomized QVI                $\widetilde{O}(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$                            This Paper

There exists a large body of literature on MDPs and RL (see e.g. [Kak03, SLL09, KBJ14, DB15] and references therein). The classical MDP problem is to compute an optimal policy exactly or approximately when the full MDP model is given as input. For a survey of existing complexity results when the full MDP model is given, see Appendix A.

Despite the aforementioned results of [Kak03, AMK13, SWWY18], there exist only a handful of additional RL methods that achieve both a small sample complexity and a small run-time complexity for computing an $\epsilon$-optimal policy. A classical result is the phased Q-learning method of [KS99], which takes samples from the generative model and runs a randomized value iteration.
The phased Q-learning method finds an $\epsilon$-optimal policy using $O(|S||A|\epsilon^{-2}/\mathrm{poly}(1-\gamma))$ samples/updates, where each update uses $\widetilde{O}(1)$ run time.[footnote 5] Another work, [Wan17], gave a randomized mirror-prox method that applies to a special Bellman saddle-point formulation of the DMDP. They achieve a total runtime of $\widetilde{O}(|S|^{3}|A|\epsilon^{-2}(1-\gamma)^{-6})$ for the general DMDP and $\widetilde{O}(C|S||A|\epsilon^{-2}(1-\gamma)^{-4})$ for DMDPs that are ergodic under all possible policies, where $C$ is a problem-specific ergodicity measure. A recent closely related work is [SWWY18], which gave a variance-reduced randomized value iteration that works with the generative model and finds an $\epsilon$-approximate policy in sample size/run time $\widetilde{O}(|S||A|\epsilon^{-2}(1-\gamma)^{-4})$, without requiring any ergodicity assumption.

Footnote 4: Although not explicitly stated, an immediate derivation shows that obtaining an $\epsilon$-optimal policy in [AMK13] requires $O(|S||A|(1-\gamma)^{-5}\epsilon^{-2})$ samples.
Footnote 5: The dependence on $(1-\gamma)$ in [KS99] is not stated explicitly, but we believe basic calculations yield $O(1/(1-\gamma)^{7})$.

Finally, in the case where $\epsilon = O(1/\sqrt{(1-\gamma)|S|})$, [AMK13] showed that the solution obtained by performing exact PI on the empirical MDP model provides not only an $\epsilon$-optimal value but also an $\epsilon$-optimal policy. In this case, the number of samples is $\widetilde{O}(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ and matches the sample complexity lower bound.
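The improvement over the previous best sublinear bound is easy to quantify; the following toy calculation (our own illustration, dropping all constants and logarithmic factors) recovers the $1/(1-\gamma)$ factor separating this paper's bound from the $(1-\gamma)^{-4}$ bound of [SWWY18]:

```python
def this_paper_bound(S, A, gamma, eps):
    """Leading term |S||A| / ((1-gamma)^3 eps^2) of Theorem 4.6 (logs dropped)."""
    return S * A / ((1.0 - gamma) ** 3 * eps ** 2)

def swwy18_bound(S, A, gamma, eps):
    """Leading term |S||A| / ((1-gamma)^4 eps^2) of the prior sublinear bound."""
    return S * A / ((1.0 - gamma) ** 4 * eps ** 2)

# The ratio is exactly 1/(1-gamma), independent of |S|, |A|, and eps.
ratio = swwy18_bound(100, 10, 0.99, 0.1) / this_paper_bound(100, 10, 0.99, 0.1)
```

For an effective horizon of 100 (i.e. $\gamma = 0.99$), the savings is a factor of 100 in both samples and time.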
Although this sample complexity is optimal, it requires solving the empirical MDP exactly (see Appendix B), and is no longer sublinear in the size of the MDP model because of the very small approximation error $\epsilon = O(1/\sqrt{(1-\gamma)|S|})$. See Table 1 for a list of comparable sample complexity results for solving MDPs based on the generative model.

6 Concluding Remark

In summary, for a discounted Markov Decision Process (DMDP) $\mathcal{M} = (S, A, P, r, \gamma)$, provided we can only access the transition function of the DMDP through a generative sampling model, we provide an algorithm which computes an $\epsilon$-approximate policy with probability $1 - \delta$, where both the time spent and the number of samples taken are upper bounded by $\widetilde{O}((1-\gamma)^{-3}\epsilon^{-2}|S||A|)$. This improves upon the previous best known bounds by a factor of $1/(1-\gamma)$ and matches the lower bounds proved in [AMK13] up to logarithmic factors.

The appendix is structured as follows. Section A surveys the existing runtime results for solving the DMDP when a full model is given. Section B provides a runtime-optimal algorithm for computing approximate value functions (by directly combining [AMK13] and [SWWY18]). Section C gives technical analysis and variance upper bounds for the total-variance technique. Section D discusses sample complexity lower bounds for obtaining approximate policies with a generative sampling model. Section E provides proofs of lemmas, propositions, and theorems in the main text of the paper. Section F extends our method and results to the finite-horizon MDP and provides a nearly matching sample complexity lower bound.

References

[AMK13] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.

[Bel57] Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[Ber13] Dimitri P Bertsekas. Abstract Dynamic Programming.
Athena Scientific, Belmont, MA, 2013.

[Dan16] George Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, NJ, 2016.

[DB15] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.

[d'E63] F. d'Epenoux. A probabilistic production and inventory problem. Management Science, 10(1):98–108, 1963.

[DG60] Guy De Ghellinck. Les problèmes de décisions séquentielles. Cahiers du Centre d'Études de Recherche Opérationnelle, 2(2):161–179, 1960.

[HMZ13] Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM, 60(1):1:1–1:16, February 2013.

[How60] Ronald A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.

[Kak03] Sham M Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, London, England, 2003.

[KBJ14] Dileep Kalathil, Vivek S Borkar, and Rahul Jain. Empirical Q-value iteration. arXiv preprint arXiv:1412.0180, 2014.

[KS99] Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pages 996–1002, 1999.

[LDK95] Michael L Littman, Thomas L Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 394–402. Morgan Kaufmann Publishers Inc., 1995.

[LH12] Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pages 320–334. Springer, 2012.

[LS14] Yin Tat Lee and Aaron Sidford.
Path finding methods for linear programming: Solving linear programs in $\widetilde{O}(\sqrt{\mathrm{rank}})$ iterations and faster algorithms for maximum flow. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 424–433. IEEE, 2014.

[LS15] Yin Tat Lee and Aaron Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 230–249. IEEE, 2015.

[MM99] Rémi Munos and Andrew W Moore. Variable resolution discretization for high-accuracy solutions of optimal control problems. Robotics Institute, page 256, 1999.

[MS99] Yishay Mansour and Satinder Singh. On the complexity of policy iteration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 401–408. Morgan Kaufmann Publishers Inc., 1999.

[Sch13] Bruno Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. In Advances in Neural Information Processing Systems, pages 386–394, 2013.

[SLL09] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

[SWWY18] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. SIAM, 2018.

[Tse90] Paul Tseng. Solving H-horizon, stationary Markov decision problems in time proportional to log(H). Operations Research Letters, 9(5):287–297, 1990.

[Wan17] Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. arXiv preprint arXiv:1704.01869, 2017.

[Ye05] Yinyu Ye. A new complexity result on solving the Markov decision problem.
Mathematics of Operations Research, 30(3):733–749, 2005.

[Ye11] Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.