{"title": "Multiple-Step Greedy Policies in Approximate and Online Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5238, "page_last": 5247, "abstract": "Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work (Efroni et al., 2018), multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.", "full_text": "Multiple-Step Greedy Policies in Online and\n\nApproximate Reinforcement Learning\n\nYonathan Efroni\u2217\n\nGal Dalal\u2217\n\njonathan.efroni@gmail.com\n\ngald@campus.technion.ac.il\n\nBruno Scherrer\u2020\n\nShie Mannor\u2217\n\nbruno.scherrer@inria.fr\n\nshie@ee.technion.ac.il\n\nAbstract\n\nMultiple-step lookahead policies have demonstrated high empirical competence\nin Reinforcement Learning, via the use of Monte Carlo Tree Search or Model\nPredictive Control. In a recent work [5], multiple-step greedy policies and their use\nin vanilla Policy Iteration algorithms were proposed and analyzed. In this work,\nwe study multiple-step greedy algorithms in more practical setups. 
We begin by\nhighlighting a counter-intuitive dif\ufb01culty, arising with soft-policy updates: even in\nthe absence of approximations, and contrary to the 1-step-greedy case, monotonic\npolicy improvement is not guaranteed unless the update stepsize is suf\ufb01ciently\nlarge. Taking particular care about this dif\ufb01culty, we formulate and analyze online\nand approximate algorithms that use such a multi-step greedy operator.\n\n1\n\nIntroduction\n\nThe use of the 1-step policy improvement in Reinforcement Learning (RL) was theoretically inves-\ntigated under several frameworks, e.g., Policy Iteration (PI) [18], approximate PI [2, 9, 13], and\nActor-Critic [10]; its practical uses are abundant [22, 12, 25]. However, single-step based improve-\nment is not necessarily the optimal choice. It was, in fact, empirically demonstrated that multiple-step\ngreedy policies can perform conspicuously better. Notable examples arise from the integration of RL\nand Monte Carlo Tree Search [4, 28, 23, 3, 25, 24] or Model Predictive Control [15, 6, 27].\nRecent work [5] provided guarantees on the performance of the multiple-step greedy policy and\ngeneralizations of it in PI. Here, we establish it in the two practical contexts of online and approximate\nPI. With this objective in mind, we begin by highlighting a speci\ufb01c dif\ufb01culty: softly updating a policy\nwith respect to (w.r.t.) a multiple-step greedy policy does not necessarily result in improvement of\nthe policy (Section 4). We \ufb01nd this property intriguing since monotonic improvement is guaranteed\nin the case of soft updates w.r.t. the 1-step greedy policy, and is central to the analysis of many\nRL algorithms [10, 9, 22]. We thus engineer several algorithms to circumvent this dif\ufb01culty and\nprovide some non-trivial performance guarantees, that support the interest of using multi-step greedy\noperators. 
These algorithms assume access to a generative model (Section 5) or to an approximate multiple-step greedy policy (Section 6).

∗Department of Electrical Engineering, Technion, Israel Institute of Technology
†INRIA, Villers les Nancy, France

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Preliminaries

Our framework is the infinite-horizon discounted Markov Decision Process (MDP). An MDP is defined as the 5-tuple (S, A, P, R, γ) [18], where S is a finite state space, A is a finite action space, P ≡ P(s′|s, a) is a transition kernel, R ≡ r(s, a) is a reward function, and γ ∈ (0, 1) is a discount factor. Let π : S → P(A) be a stationary policy, where P(A) is a probability distribution on A. Let v^π ∈ R^{|S|} be the value of a policy π, defined in state s as v^π(s) ≡ E^π[Σ_{t=0}^∞ γ^t r(s_t, π(s_t)) | s_0 = s]. For brevity, we respectively denote the reward and value at time t by r_t ≡ r(s_t, π_t(s_t)) and v_t ≡ v(s_t). It is known that v^π = Σ_{t=0}^∞ γ^t (P^π)^t r^π = (I − γP^π)^{−1} r^π, with the component-wise values [P^π]_{s,s′} := P(s′ | s, π(s)) and [r^π]_s := r(s, π(s)).
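To make the preliminaries concrete, here is a minimal sketch that evaluates a fixed policy via v^π = (I − γP^π)^{−1} r^π; the 2-state transition matrix and rewards are our own toy values, not from the paper:

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state MDP under a fixed policy pi:
# P_pi[s, s'] = P(s' | s, pi(s)), r_pi[s] = r(s, pi(s))
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
r_pi = np.array([1.0, 0.0])

# v^pi = (I - gamma * P_pi)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: v^pi is the fixed point of T^pi v = r^pi + gamma * P^pi v
assert np.allclose(r_pi + gamma * P_pi @ v_pi, v_pi)
```

Solving the linear system directly is only viable for small tabular MDPs; it is used here purely to illustrate the definitions.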
Lastly, let

q^π(s, a) = E^π[ Σ_{t=0}^∞ γ^t r(s_t, π(s_t)) | s_0 = s, a_0 = a ].    (1)

Our goal is to find a policy π* yielding the optimal value v* such that

v* = max_π (I − γP^π)^{−1} r^π = (I − γP^{π*})^{−1} r^{π*}.    (2)

This goal can be achieved using the three classical operators (equalities hold component-wise):

∀v, π: T^π v = r^π + γP^π v,
∀v: T v = max_π T^π v,
∀v: G(v) = {π : T^π v = T v},

where T^π is a linear operator, T is the optimal Bellman operator, and both T^π and T are γ-contraction mappings w.r.t. the max norm. It is known that the unique fixed points of T^π and T are v^π and v*, respectively. The set G(v) is the standard set of 1-step greedy policies w.r.t. v.

3 The h- and κ-Greedy Policies

In this section, we bring forward necessary definitions and results on two classes of multiple-step greedy policies: h- and κ-greedy [5]. Let h ∈ ℕ\{0}. The h-greedy policy π_h outputs the first optimal action out of the sequence of actions solving a non-stationary, h-horizon control problem as follows:

∀s ∈ S, π_h(s) ∈ argmax_{π_0} max_{π_1,…,π_{h−1}} E^{π_0…π_{h−1}}[ Σ_{t=0}^{h−1} γ^t r(s_t, π_t(s_t)) + γ^h v(s_h) | s_0 = s ].

Since the h-greedy policy can be represented as the 1-step greedy policy w.r.t. T^{h−1} v, the set of h-greedy policies w.r.t. v, G_h(v), can be formally defined as follows:

∀v, π: T^π_h v = T^π T^{h−1} v,
∀v: G_h(v) = {π : T^π_h v = T^h v}.

Let κ ∈ [0, 1]. The set of κ-greedy policies w.r.t.
a value function v, G_κ(v), is defined using the following operators:

∀v, π: T^π_κ v = (I − κγP^π)^{−1}(r^π + (1 − κ)γP^π v),
∀v: T_κ v = max_π T^π_κ v = max_π (I − κγP^π)^{−1}(r^π + (1 − κ)γP^π v),    (3)
∀v: G_κ(v) = {π : T^π_κ v = T_κ v}.

Remark 1. A comparison of (2) and (3) reveals that finding the κ-greedy policy is equivalent to solving a κγ-discounted MDP with shaped reward r^π_{v,κ} := r^π + (1 − κ)γP^π v.

In [5, Proposition 11], the κ-greedy policy was shown to interpolate over all geometrically κ-weighted h-greedy policies. It was also shown that for κ = 0, the 1-step greedy policy is recovered, while for κ = 1, the κ-greedy policy is the optimal policy.

Both T^π_κ and T_κ are ξ_κ-contraction mappings, where ξ_κ = γ(1 − κ)/(1 − γκ) ∈ [0, γ]. Their respective fixed points are v^π and v*. For brevity, where there is no risk of confusion, we shall denote ξ_κ by ξ. Moreover, in [5] it was shown that both the h- and κ-greedy policies w.r.t. v^π are strictly better than π, unless π = π*.

Figure 1: The Tightrope Walking MDP used in the counterexample of Theorem 1.

Next, let

q^π_κ(s, a) = max_{π′} E^{π′}[ Σ_{t=0}^∞ (κγ)^t (r(s_t, π′(s_t)) + γ(1 − κ)v^π(s_{t+1})) | s_0 = s, a_0 = a ].    (4)

The latter is the optimal q-function of the surrogate, γκ-discounted MDP with v^π-shaped reward (see Remark 1). 
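The operator T^π_κ in (3) is straightforward to implement for tabular MDPs. The sketch below (toy values are ours) also checks the two limiting cases noted above: κ = 0 recovers the 1-step operator T^π, and κ = 1 returns v^π regardless of the input v:

```python
import numpy as np

def T_pi_kappa(v, P_pi, r_pi, kappa, gamma):
    """kappa-weighted operator: (I - kappa*gamma*P^pi)^{-1} (r^pi + (1-kappa)*gamma*P^pi v)."""
    n = len(v)
    return np.linalg.solve(np.eye(n) - kappa * gamma * P_pi,
                           r_pi + (1 - kappa) * gamma * P_pi @ v)

# Hypothetical 2-state policy-induced chain
gamma = 0.9
P_pi = np.array([[0.5, 0.5], [0.2, 0.8]])
r_pi = np.array([1.0, -1.0])
v = np.zeros(2)

# kappa = 0 recovers the 1-step operator: T^pi v = r^pi + gamma * P^pi v
assert np.allclose(T_pi_kappa(v, P_pi, r_pi, 0.0, gamma), r_pi + gamma * P_pi @ v)

# kappa = 1 returns v^pi = (I - gamma * P^pi)^{-1} r^pi for any input v
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
assert np.allclose(T_pi_kappa(v, P_pi, r_pi, 1.0, gamma), v_pi)
```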
Thus, we can obtain a κ-greedy policy, π_κ ∈ G_κ(v^π), directly from q^π_κ:

π_κ(s) ∈ argmax_a q^π_κ(s, a), ∀s ∈ S.

See that the greedy policy w.r.t. q^π_{κ=0}(s, a) is the 1-step greedy policy, since q^π_{κ=0}(s, a) = q^π(s, a).

4 Multi-step Policy Improvement and Soft Updates

In this section, we focus on policy improvement of multiple-step greedy policies, performed with soft updates. Soft updates of the 1-step greedy policy have proved necessary and beneficial in prominent algorithms [10, 9, 22]. Here, we begin by describing an intrinsic difficulty in selecting the stepsize parameter α ∈ (0, 1] when updating with multiple-step greedy policies. Specifically, denote by π′ such a multiple-step greedy policy w.r.t. v^π. Then, π_new = (1 − α)π + απ′ is not necessarily better than π.

Theorem 1. For any MDP, let π be a policy and v^π its value. Let π_κ ∈ G_κ(v^π) and π_h ∈ G_h(v^π) with κ ∈ [0, 1] and h > 1. Consider the mixture policies with α ∈ (0, 1],

π(α, κ) := (1 − α)π + απ_κ,
π(α, h) := (1 − α)π + απ_h.

Then we have the following equivalences:

1. The inequality v^{π(α,κ)} ≥ v^π holds for all MDPs if and only if α ∈ [κ, 1].
2. The inequality v^{π(α,h)} ≥ v^π holds for all MDPs if and only if α = 1.

The above inequalities hold entry-wise, with strict inequality in at least one entry unless v^π = v*.

Proof sketch. See Appendix A for the full proof. Here, we only provide a counterexample demonstrating the potential non-monotonicity of π(α, κ) when the stepsize α is not sufficiently large. 
One can show the same for π(α, h) with the same example.
Consider the Tightrope Walking MDP in Fig. 1. It describes the act of walking on a rope: in the initial state s0 the agent approaches the rope, in s1 the walking attempt occurs, s2 is the goal state, and s3 is repeatedly met if the agent falls from the rope, resulting in negative reward.
First, notice that by definition, ∀v, π* ∈ G_{κ=1}(v). We call this policy the "confident" policy. Obviously, for any discount factor γ ∈ (0, 1), π*(s0) = a1 and π*(s1) = a1. Instead, consider the "hesitant" policy π0(s) ≡ a0 ∀s. We now claim that for any α ∈ (0, 1) and

c > α / (1 − α),    (5)

the mixture policy π(α, κ = 1) = (1 − α)π0 + απ* is not strictly better than π0. To see this, notice that v^{π0}(s1) < 0 and v^{π0}(s0) = 0; i.e., the agent accumulates zero reward if she does not climb the rope. Thus, while v^{π0}(s0) = 0, taking any mixture of the confident and hesitant policies can result in v^{π(α,κ=1)}(s0) < 0, due to the probability of transitioning to s1 and the negative contribution from there. Based on this construction, let κ ∈ [0, 1]. To ensure π* ∈ G_κ(v^π), we find it is necessary that

c ≤ κ / (1 − κ).    (6)

To conclude, if both (5) and (6) are satisfied, the mixture policy does not improve over π0. Due to the monotonicity of x/(1 − x), such a choice of c is indeed possible for α < κ.
Theorem 1 guarantees monotonic improvement for the 1-step greedy policy as a special case when κ = 0. Hence, we get that for any α ∈ (0, 1], the mixture of any policy π and the 1-step greedy policy w.r.t. v^π is monotonically better than π. 
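The counterexample above can also be verified numerically. Below is a sketch of the Tightrope Walking MDP; the exact state/action encoding is our reading of Fig. 1, and the constants are chosen to satisfy (5):

```python
import numpy as np

gamma, alpha, c = 0.9, 0.5, 2.0   # c > alpha / (1 - alpha) = 1, as in (5)

# Tightrope Walking MDP, states [s0, s1, s2, s3], actions [a0, a1].
# Our encoding of Fig. 1: a1 moves s0 -> s1 -> s2 (reward 1 on success);
# a0 self-loops at s0, and at s1 it falls to the absorbing s3 with reward -c.
P = np.zeros((2, 4, 4))
P[0, 0, 0] = P[0, 1, 3] = P[0, 2, 2] = P[0, 3, 3] = 1.0   # action a0
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = P[1, 3, 3] = 1.0   # action a1
R = np.zeros((4, 2))
R[1, 0], R[1, 1] = -c, 1.0

def evaluate(pi):
    """v^pi for a stochastic policy pi[s, a], via (I - gamma * P^pi)^{-1} r^pi."""
    P_pi = np.einsum('sa,ast->st', pi, P)
    r_pi = (pi * R).sum(axis=1)
    return np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

hesitant = np.tile([1.0, 0.0], (4, 1))    # pi0: always a0
confident = np.tile([0.0, 1.0], (4, 1))   # pi*: always a1
mixture = (1 - alpha) * hesitant + alpha * confident

v0, v_mix = evaluate(hesitant), evaluate(mixture)
assert abs(v0[0]) < 1e-10       # hesitant policy earns exactly 0 from s0
assert v_mix[0] < v0[0]         # the soft update made the policy strictly worse at s0
assert evaluate(confident)[0] > 0   # while the full (alpha = 1) update improves
```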
To the best of our knowledge, this result was not explicitly stated anywhere. Instead, it appeared within proofs of several famous results, e.g., [10, Lemma 5.4], [9, Corollary 4.2], and [21, Theorem 1].
In the rest of the paper, we shall focus on the κ-greedy policy and extend it to the online and the approximate cases. The discovery that the κ-greedy policy w.r.t. v^π is not necessarily strictly better than π will guide us in appropriately devising algorithms.

5 Online κ-Policy Iteration with Cautious Soft Updates

In [5], it was shown that using the κ-greedy policy in the improvement stage leads to a convergent PI procedure – the κ-PI algorithm. This algorithm repeats i) finding the optimal policy of a small-horizon surrogate MDP with shaped reward, and ii) calculating the value of that optimal policy and using it to shape the reward of the next iteration. Here, we devise a practical version of κ-PI, which is model-free, online, and runs in two timescales; i.e., it performs i) and ii) simultaneously.
The method is depicted in Algorithm 1. It is similar to the asynchronous PI analyzed in [16], except for two major differences. First, the fast timescale tracks both q^π and q^π_κ, not just q^π. Thus, it enables access to both the 1-step greedy and κ-greedy policies. The 1-step greedy policy is attained via the q^π estimate, which is plugged into a q-learning [29] update rule for obtaining the κ-greedy policy. The latter essentially solves the surrogate κγ-discounted MDP (see Remark 1). The second difference is in the slow timescale, in which the policy is updated using a new operator, bs, as defined below. To better understand this operator, first notice that in Stochastic Approximation methods such as Algorithm 1, the policy is improved using soft updates with decaying stepsizes. 
However, as Theorem 1 states, monotonic improvement is not guaranteed below a certain stepsize value. Hence, for q, q_κ ∈ R^{|S×A|} and a policy π, we set bs(q, q_κ, π) to be the κ-greedy policy only when improvement is assured:

bs(q, q_κ, π) = a_κ(s) if q(s, a_κ) ≥ v^π(s), and a_{1-step}(s) otherwise,

where a_κ(s) := argmax_a q_κ(s, a), a_{1-step}(s) := argmax_a q(s, a), and v^π(s) = Σ_a π(a | s) q(s, a).

We respectively denote the state and state-action-pair visitation counters after the n-th time-step by ν_n(s) := Σ_{k=1}^n 1_{s=s_k} and φ_n(s, a) := Σ_{k=1}^n 1_{(s,a)=(s_k,a_k)}. The stepsize sequences μ_f(·), μ_s(·) satisfy the common assumption (B2) in [16], among which lim_{n→∞} μ_s(n)/μ_f(n) = 0. The second moments of {r_n} are assumed to be bounded. Furthermore, let ν be some measure over the state space s.t. ∀s ∈ S, ν(s) > 0. We then assume access to a generative model G(ν, π), from which we sample a state s ∼ ν, sample an action a ∼ π(s), apply action a, and receive a reward r and the next state s′. The fast-timescale update rules in lines 6 and 8 can be jointly written as the sum of H^π_κ(q, q_κ) (defined below) and a martingale difference noise.

Algorithm 1 Two-Timescale Online κ-Policy-Iteration
1: initialize: π_0, q_0, q_{κ,0}
2: for n = 0, 1, . . . do
3:   s_n, a_n, r_n, s′_n ∼ G(ν, π_n)
4:   # Fast-timescale updates
5:   δ_n = r_n + γ v^{π_n}(s′_n) − q_n(s_n, a_n)
6:   q_{n+1}(s_n, a_n) ← q_n(s_n, a_n) + μ_f(φ_{n+1}(s_n, a_n)) δ_n
7:   δ_{κ,n} = r_n + γ(1 − κ) v^{π_n}(s′_n) + κγ max_{a′} q_{κ,n}(s′_n, a′) − q_{κ,n}(s_n, a_n)
8:   q_{κ,n+1}(s_n, a_n) ← q_{κ,n}(s_n, a_n) + μ_f(φ_{n+1}(s_n, a_n)) δ_{κ,n}
9:   # Slow-timescale updates
10:  π_{n+1}(s_n) ← π_n(s_n) + μ_s(ν_{n+1}(s_n)) (bs_{s_n}(q_{n+1}, q_{κ,n+1}, π_n) − π_n(s_n))
11: end for
12: return: π

Definition 1. Let q, q_κ ∈ R^{|S||A|}. The mapping H^π_κ : R^{2|S||A|} → R^{2|S||A|} is defined, ∀(s, a) ∈ S × A, as

H^π_κ(q, q_κ)(s, a) := [ r(s, a) + γ E_{s′,a_π} q(s′, a_π) ;  r(s, a) + γ(1 − κ) E_{s′,a_π} q(s′, a_π) + κγ E_{s′} max_{a′} q_κ(s′, a′) ],

where s′ ∼ P(· | s, a), a_π ∼ π(s′), and the two components act on q and q_κ, respectively.
The following lemma shows that, given a fixed π, H^π_κ is a contraction, equivalently to [16, Lemma 5.3] (see Appendix B for the proof).
Lemma 2. H^π_κ is a γ-contraction in the max-norm. Its fixed point is [q^π, q^π_κ]^T, as defined in (1), (4).
Finally, based on several intermediate results given in Appendix C and relying on Lemma 2, we establish the convergence of Algorithm 1.
Theorem 3. The coupled process (q_n, q_{κ,n}, π_n) in Algorithm 1 converges to the limit (q*, q*, π*), where q* is the optimal q-function and π* is the optimal policy.
For κ = 1, the fast-timescale update rule in line 8 corresponds to that of q-learning [29]. For that κ, Algorithm 1 uses an estimated optimal q-function to update the current policy when improvement is assured. 
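For illustration, here is a minimal tabular sketch of the two fast-timescale updates (lines 5-8 of Algorithm 1) and the cautious operator bs; the function names and toy dimensions are ours, and the stepsizes are taken as constants rather than the decaying sequences the analysis requires:

```python
import numpy as np

def fast_updates(q, q_kappa, pi, s, a, r, s_next, kappa, gamma, mu_f):
    """One tabular step of the two fast-timescale TD updates (a sketch of lines 5-8)."""
    v_next = pi[s_next] @ q[s_next]            # v^pi(s') = sum_a pi(a|s') q(s', a)
    delta = r + gamma * v_next - q[s, a]
    q[s, a] += mu_f * delta
    delta_k = (r + gamma * (1 - kappa) * v_next
               + kappa * gamma * q_kappa[s_next].max() - q_kappa[s, a])
    q_kappa[s, a] += mu_f * delta_k

def bs(q, q_kappa, pi, s):
    """'Cautious' improvement: the kappa-greedy action only if it does not decrease v^pi(s)."""
    a_kappa = q_kappa[s].argmax()
    v_pi_s = pi[s] @ q[s]
    return a_kappa if q[s, a_kappa] >= v_pi_s else q[s].argmax()

# Toy demo: 2 states, 2 actions, uniform policy; one sampled transition (0, 1, r=1, 1).
q = np.zeros((2, 2))
q_kappa = np.zeros((2, 2))
pi = np.full((2, 2), 0.5)
fast_updates(q, q_kappa, pi, s=0, a=1, r=1.0, s_next=1, kappa=0.5, gamma=0.9, mu_f=0.1)
assert bs(q, q_kappa, pi, 0) in (0, 1)   # bs returns a valid action index
```

In Algorithm 1, the slow timescale would then mix the action returned by bs into π_n(s_n); that part is omitted here.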
For κ < 1, the estimated κ-dependent optimal q-function (see (4)) is used, again with the 'cautious' policy update. Moreover, Algorithm 1 combines an off-policy algorithm, i.e., q-learning, with an on-policy Actor-Critic algorithm. To the best of our knowledge, this is the first appearance of these two approaches combined in a single algorithm.

6 Approximate κ-Policy Iteration with Hard Updates

Theorem 1 establishes the conditions required for guaranteed monotonic improvement of softly-updated multiple-step greedy policies. The algorithm in Section 5 then accounts for these conditions to ensure convergence. In contrast, in this section, we derive and study algorithms that perform hard policy updates. Specifically, we generalize the prominent Approximate Policy Iteration (API) [13, 7, 11] and Policy Search by Dynamic Programming (PSDP) [1, 19]. For both, we obtain performance guarantees that exhibit a tradeoff in the choice of κ, with the optimal performance bound achieved with κ > 0. That is, our approximate κ-generalized PI methods outperform the 1-step greedy approximate PI methods in terms of best known guarantees.
For the algorithms here we assume an oracle that returns a κ-greedy policy with some error. Formally, we denote by G_{κ,δ,ν}(v) the set of approximate κ-greedy policies w.r.t. v, with δ approximation error under some measure ν.

Definition 2 (Approximate κ-greedy policy). Let v : S → R be a value function, δ ≥ 0 a real number, and ν a distribution over S. A policy π ∈ G_{κ,δ,ν}(v) if νT^π_κ v ≥ νT_κ v − δ.

Such a device can be implemented using existing approximate methods, e.g., Conservative Policy Iteration (CPI) [9], approximate PI or VI [7], Policy Search [21], or by having access to an approximate model of the environment. The approximate κ-greedy oracle assumed here is less restrictive than the one assumed in [5]. There, a uniform error over states was assumed, whereas here, the error is defined w.r.t. a specific measure, ν. For practical purposes, ν can be thought of as the initial sampling distribution to which the MDP is initialized. Lastly, notice that the larger κ is, the harder it is to solve the surrogate κγ-discounted MDP since the discount factor is bigger [17, 26, 8]; i.e., the computational cost of each call to the oracle increases.
Using the concept of concentrability coefficients introduced in [13] (there, they were originally termed "diffusion coefficients"), we follow the line of work in [13, 14, 7, 19, 11] to prove our performance bounds. This allows a direct comparison of the algorithms proposed here with previously studied approximate 1-step greedy algorithms. Namely, our bounds consist of the concentrability coefficients C^(1), C^(2), C^(2,k) and C^{π*(1)} from [19, 11], as well as two new coefficients, C^{π*}_κ and C^{π*(1)}_κ.

Definition 3 (Concentrability coefficients [19, 11]). Let μ, ν be some measures over S. Let {c(i)}_{i=0}^∞ be the sequence of the smallest values in [1, ∞) ∪ {∞} such that for every i, for all sequences of deterministic stationary policies π_1, π_2, . . . , π_i, μ Π_{j=1}^i P^{π_j} ≤ c(i)ν. Let C^(1)(μ, ν) = (1 − γ) Σ_{i=0}^∞ γ^i c(i) and C^(2,k)(μ, ν) = (1 − γ)² Σ_{i,j=0}^∞ γ^{i+j} c(i + j + k). For brevity, we denote C^(2,0)(μ, ν) as C^(2)(μ, ν). Similarly, let {c^{π*}(i)}_{i=0}^∞ be the sequence of the smallest values in [1, ∞) ∪ {∞} such that for every i, μ(P^{π*})^i ≤ c^{π*}(i)ν. Let C^{π*(1)}(μ, ν) = (1 − γ) Σ_{i=0}^∞ γ^i c^{π*}(i).

We now introduce two new concentrability coefficients suitable for bounding the worst-case performance of PI algorithms with approximate κ-greedy policies.

Definition 4 (κ-Concentrability coefficients). Let C^{π*(1)}_κ(μ, ν) = (ξ/γ) C^{π*(1)}(μ, ν) + (1 − ξ)κc(0). Also, let C^{π*}_κ(μ, ν) ∈ [1, ∞) ∪ {∞} be the smallest value s.t. d^{π*}_{κ,μ} ≤ C^{π*}_κ(μ, ν)ν, where d^{π*}_{κ,μ} = (1 − ξ)μ(I − ξD^{π*}_κ P^{π*})^{−1} is a probability measure and D^π_κ = (1 − κγ)(I − κγP^π)^{−1} is a stochastic matrix.

In the definitions above, ν is the measure according to which the approximate improvement is guaranteed, while μ specifies the distribution on which one measures the loss E_{s∼μ}[v*(s) − v^{π_k}(s)] = μ(v* − v^{π_k}) that we wish to bound. From Definition 4 it holds that C^{π*}_{κ=0}(μ, ν) = C^{π*}(μ, ν); the latter was previously defined in, e.g., [19, Definition 1].
Before giving our performance bounds, we first study the behavior of the coefficients appearing in them. The following lemma sheds light on the behavior of C^{π*}_κ(μ, ν). Specifically, it shows that under certain constructions, C^{π*}_κ(μ, ν) decreases³ as κ increases (see proof in Appendix D).
Lemma 4. Let ν(α) = (1 − α)ν + αμ. 
Then, for all κ′ > κ, there exists α* ∈ (0, 1) such that C^{π*}_{κ′}(μ, ν(α*)) ≤ C^{π*}_κ(μ, ν). The inequality is strict for C^{π*}_κ(μ, ν) > 1. For μ = ν, this implies that C^{π*}_κ(ν, ν) is a decreasing function of κ.
Definition 4 introduces two coefficients with which we shall derive our bounds. Though traditional arithmetic relations between them do not exist, they do comply with some notion of ordering.
Remark 2 (Order of concentrability coefficients). In [19], an order between the concentrability coefficients was introduced: a coefficient A is said to be strictly better than B — a relation we denote with A ≺ B — if and only if i) B < ∞ implies A < ∞ and ii) there exists an MDP for which A < ∞ and B = ∞. Particularly, it was argued that

C^{π*}(μ, ν) ≺ C^{π*(1)}(μ, ν) ≺ C^(1)(μ, ν) ≺ C^(2)(μ, ν), and
C^(2,k1)(μ, ν) ≺ C^(2,k2)(μ, ν) if k2 < k1.

In this sense, C^{π*(1)}_κ(μ, ν) is analogous to C^{π*(1)}(μ, ν), while its definition might suggest improvement as κ increases. Moreover, combined with the fact that C^{π*}_κ(μ, ν) improves as κ increases, as Lemma 4 suggests, C^{π*}_κ(μ, ν) is better than all previously defined concentrability coefficients.

6.1 κ-Approximate Policy Iteration

A natural generalization of API [13, 19, 11] to the multiple-step greedy policy is κ-API, as given in Algorithm 2. 
In each of its iterations, the policy is updated to the approximate \u03ba-greedy policy w.r.t.\nv\u03c0k\u22121; i.e, a policy from the set G\u03ba,\u03b4,\u03bd(v\u03c0k\u22121 ).\n\n3A smaller coef\ufb01cient is obviously better. The best value for any concentrability coef\ufb01cient is 1.\n\n6\n\n\fAlgorithm 2 \u03ba-API\n\ninitialize \u03ba \u2208 [0, 1], \u03bd, \u03b4, v\u03c00\nv \u2190 v\u03c00\nfor k = 1, .. do\n\u03c0k \u2190 G\u03ba,\u03b4,\u03bd(v)\nv \u2190 v\u03c0k\n\nend for\nreturn \u03c0\n\nAlgorithm 3 \u03ba-PSDP\n\ninitialize \u03ba \u2208 [0, 1], \u03bd, \u03b4, v\u03c00, \u03a0 = [ ]\nv \u2190 v\u03c00\nfor k = 1, .. do\n\u03c0k \u2190 G\u03ba,\u03b4,\u03bd(v)\nv \u2190 T \u03c0k\n\u03ba v\n\u03a0 \u2190Append(\u03a0, \u03c0k)\n\nend for\nreturn \u03a0\n\n(cid:16)\n\n(cid:17)\n\n,\n\n(cid:16)\n(cid:16)\n\n(1 \u2212 \u03ba)C (1)(\u00b5, \u03bd) + (1 \u2212 \u03b3\u03ba)C \u03c0\u2217(1)\n\nThe following theorem gives a performance bound for \u03ba-API (see proof in Appendix E), with\nC\u03ba\u2212API(\u00b5, \u03bd) = (1 \u2212 \u03ba)2C (2)(\u00b5, \u03bd) + (1 \u2212 \u03b3)\u03ba\n\u03ba\u2212API(\u00b5, \u03bd) = (1 \u2212 \u03ba\u03b3)\nC (k,1)\n\u03ba\u2212API(\u00b5, \u03bd) = (1 \u2212 \u03ba)\u03ba\nC (k,2)\nwhere g(\u03ba) is a bounded function for \u03ba \u2208 [0, 1].\nTheorem 5. Let \u03c0k be the policy at the k-th iteration of \u03ba-API and \u03b4 be the error as de\ufb01ned in\nDe\ufb01nition 2. Then\n\n\u03ba(1 \u2212 \u03ba\u03b3)C \u03c0\u2217\n,\n(1 \u2212 \u03b3)C (1)(\u00b5, \u03bd) + g(\u03ba)(1 \u2212 \u03ba)\u03b3kC (2,k)(\u00b5, \u03bd)\n\n\u03ba (\u00b5, \u03bd) + (1 \u2212 \u03ba)2C (1)(\u00b5, \u03bd))\n\n(\u00b5, \u03bd)\n\n(cid:17)\n\n(cid:17)\n\n\u03ba\n\n,\n\n(cid:24) log Rmax\n\n\u03b4(1\u2212\u03b3)\n1\u2212\u03be\n\n(cid:25)\n\nAlso, let k =\n\n\u00b5(v\u2217 \u2212 v\u03c0k ) \u2264 C\u03ba\u2212API(\u00b5, \u03bd)\n\n(1 \u2212 \u03b3)2\n. 
Then \u00b5(v\u2217 \u2212 v\u03c0k ) \u2264 C(k,1)\n\n\u03ba\u2212API(\u00b5,\u03bd)\n(1\u2212\u03b3)2\n\nlog\n\n\u03b4 + \u03bek Rmax\n1 \u2212 \u03b3\n\n.\n\n(cid:16) Rmax\n\n(1\u2212\u03b3)\u03b4\n\n(cid:17)\n\n\u03b4 +\n\nC(k,2)\n\n\u03ba\u2212API(\u00b5,\u03bd)\n(1\u2212\u03b3)2\n\n\u03b4 + \u03b4.\n\n(1\u2212\u03b3)2 \u03b4 + \u03b3kRmax\n1\u2212\u03b3\n\nFor brevity, we now discuss the \ufb01rst part of the statement; the same insights are true for the second\nas well. The bound for the original API is restored for the 1-step greedy case of \u03ba = 0, i.e,\n\u00b5(v\u2217 \u2212 v\u03c0k ) \u2264 C(2)(\u00b5,\u03bd)\n[19, 11]. As in the case of API, our bound consists of a \ufb01xed\napproximation error term and a geometrically decaying term. As for the other extreme, \u03ba = 1,\nwe \ufb01rst remind that in the non-approximate case, applying T\u03ba=1 amounts to solving the original\n\u03b3-discounted MDP in a single step [5, Remark 4]. In the approximate setup we investigate here,\nthis results in the vanishing of the second, geometrically decaying term, since \u03be = 0 for \u03ba = 1. We\nare then left with a single constant approximation error: \u00b5(v\u2217 \u2212 v\u03c0k ) \u2264 c(0)\u03b4. Notice that c(0) is\nindependent of \u03c0\u2217 (see De\ufb01nition 3). It represents the mismatch between \u00b5 and \u03bd [9].\nNext, notice that, by de\ufb01nition (see De\ufb01nition 3), C (2)(\u00b5, \u03bd) > (1\u2212\u03b3)2c(0); i.e., C(2)(\u00b5,\u03bd)\n(1\u2212\u03b3)2 \u03b4 > c(0)\u03b4.\nGiven the discussion above, we have that the \u03ba-API performance bound is strictly smaller with \u03ba = 1\nthan with \u03ba = 0. Hence, the bound suggests that \u03ba-API is strictly better than the original API for\n\u03ba = 1. Since all expressions there are continuous, this behavior does not solely hold point-wise.\nRemark 3 (Performance tradeoff). 
Naively, the above observation would lead to the choice of κ = 1. However, it is reasonable to assume that δ, the error of the κ-greedy step, itself depends on κ, i.e., δ ≡ δ(κ). The general form of such dependence is expected to be monotonically increasing: as the effective horizon of the surrogate κγ-discounted MDP becomes larger, its solution is harder to obtain (see Remark 1). Thus, Theorem 5 reveals a performance tradeoff as a function of κ.

6.2 κ-Policy Search by Dynamic Programming

We continue by generalizing another approximate PI method – PSDP [1, 19]. We name it κ-PSDP and introduce it in Algorithm 3. This algorithm updates the policy differently from κ-API. However, similarly to κ-API, it uses hard updates. We will show that this algorithm exhibits better performance than any other previously analyzed approximate PI method [19].
The κ-PSDP algorithm, unlike κ-API, returns a sequence of deterministic policies, Π. Given this sequence, we build a single, non-stationary policy by successively running N_k steps of Π[k], followed by N_{k−1} steps of Π[k − 1], etc., where {N_i}_{i=1}^k are i.i.d. geometric random variables with parameter 1 − κ. Once this process reaches π_0, it runs π_0 indefinitely. We shall refer to this non-stationary policy as σ_{κ,k}. Its value v^{σ_{κ,k}} can be seen to satisfy

v^{σ_{κ,k}} = T^{Π[k]}_κ T^{Π[k−1]}_κ · · · T^{Π[1]}_κ v^{π_0}.

This algorithm follows PSDP from [19]. Differently from it, the 1-step improvement is generalized to the κ-greedy improvement and the non-stationary policy behaves randomly. The following theorem gives a performance bound for it (see proof in Appendix F).
Theorem 6. 
Let \u03c3\u03ba,k be the policy at the k-th iteration of \u03ba-PSDP and \u03b4 be the error as de\ufb01ned in\nDe\ufb01nition 2. Then\n\n(cid:24) log Rmax\n\n\u03b4(1\u2212\u03b3)\n1\u2212\u03be\n\n(cid:25)\n\nAlso, let k =\n\n\u00b5(v\u2217 \u2212 v\u03c3\u03ba,k ) \u2264 C \u03c0\u2217(1)\n1 \u2212 \u03be\n. Then \u00b5(v\u2217 \u2212 v\u03c3\u03ba,k ) \u2264 C\u03c0\u2217\n\n\u03ba\n\n(\u00b5, \u03bd)\n\n\u03ba (\u00b5,\u03bd)\n(1\u2212\u03be)2\n\n\u03b4 + \u03bek Rmax\n1 \u2212 \u03b3\n\n(cid:16) Rmax\n\nlog\n\n(1\u2212\u03b3)\u03b4\n\n.\n\n(cid:17)\n\n\u03b4 + \u03b4.\n\n\u03ba\n\nCompared to \u03ba-API from the previous section, the \u03ba-PSDP bound consists of a different \ufb01xed\napproximation error and a shared geometrically decaying term. Regarding the former, notice that\nC \u03c0\u2217(1)\n(\u00b5, \u03bd) \u227a C\u03ba\u2212API(\u00b5, \u03bd), using the notation from Remark 2. This suggests that \u03ba-PSDP is\nstrictly better than \u03ba-API in the metrics we consider, and is in line with the comparison of the original\nAPI to the original PSDP given in [19].\nSimilarly to the previous section, we again see that substituting \u03ba = 1 gives a tighter bound than\n\u03ba = 0. The reason is that C\u03c0\u2217 (1)(\u00b5,\u03bd)\n\u03b4 > c(0)\u03b4, by de\ufb01nition (see De\ufb01nition 3); i.e., we have that\n\u03ba-PSDP is generally better than PSDP. Also, contrarily to \u03ba-API, here we directly see the performance\nimprovement as \u03ba increases due to the decrease of C \u03c0\u2217\n\u03ba prescribed in Lemma 4, for the construction\ngiven there. Moreover, the \u03ba tradeoff discussion in Remark 3 applies here as well.\nAn additional advantage of this new algorithm over PSDP is reduced space complexity. This can be\nseen from the 1 \u2212 \u03be in the denominator in the choice of k in the second part of Theorem 6. 
Since ξ is a strictly decreasing function of κ, Theorem 6 guarantees the same performance with significantly fewer iterations as κ increases. Since the size of the stored policy sequence Π grows linearly with the number of iterations, a larger κ improves space efficiency.

7 Discussion and Future Work

In this work, we introduced and analyzed online and approximate PI methods generalized to the κ-greedy policy, an instance of a multiple-step greedy policy. In doing so, we discovered two intriguing properties compared to the well-studied 1-step greedy policy, which we believe can be impactful in designing state-of-the-art algorithms. First, successive application of multiple-step greedy policies with a soft, stepsize-based update does not guarantee improvement; see Theorem 1. To mitigate this caveat, we designed an online PI algorithm with a 'cautious' improvement operator; see Section 5.

The second property we find intriguing stemmed from analyzing κ generalizations of known approximate hard-update PI methods. In Section 6, we revealed a performance tradeoff in κ, which can be interpreted as a tradeoff between short-horizon bootstrap bias and long-rollout variance. This corresponds to the known λ tradeoff in the famous TD(λ).

The two characteristics above lead to new compelling questions. The first regards improvement operators: would a non-monotonically improving PI scheme necessarily fail to converge to the optimal policy? Our attempts to generalize existing proof techniques to show convergence in such cases have fallen short. Specifically, in the online case, Lemma 5.4 in [10] does not hold with multiple-step greedy policies. Similar issues arise when trying to form a κ-CPI algorithm via, e.g., an attempt to generalize Corollary 4.2 in [9]. Another research question regards the choice of the parameter κ given the tradeoff it poses.
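The tradeoff can be illustrated with a toy computation that evaluates a Theorem-6-style bound as a function of κ. This is purely illustrative: the contraction coefficient ξ(κ) = γ(1 − κ)/(1 − γκ), the hypothetical error model δ(κ) ∝ (1 − κγ)^{-2}, the dropped concentrability coefficient, and all constants are assumptions made for this sketch, not quantities derived here.

```python
# Toy evaluation of a bound of the form delta(kappa)/(1 - xi) + xi^k * Rmax/(1 - gamma).
# All modeling choices (xi formula, error model, constants) are assumptions.
GAMMA, RMAX, K_ITERS, DELTA0 = 0.95, 1.0, 20, 0.01

def xi(kappa):
    # Assumed contraction coefficient: equals gamma at kappa = 0 and 0 at kappa = 1.
    return GAMMA * (1.0 - kappa) / (1.0 - GAMMA * kappa)

def delta(kappa):
    # Hypothetical error model: the kappa*gamma-discounted surrogate MDP has a
    # longer effective horizon as kappa grows, so its greedy step is assumed harder.
    return DELTA0 / (1.0 - kappa * GAMMA) ** 2

def bound(kappa):
    # Fixed approximation-error term plus geometrically decaying term.
    x = xi(kappa)
    return delta(kappa) / (1.0 - x) + x ** K_ITERS * RMAX / (1.0 - GAMMA)

for kappa in (0.0, 0.5, 0.9, 1.0):
    print(f"kappa={kappa:.1f}  bound={bound(kappa):.3f}")
```

Under this assumed error model, the fixed term grows with κ while the geometric term shrinks, so an intermediate κ minimizes the bound; with a flatter δ(κ), κ = 1 would dominate. This is exactly the kind of κ-selection question raised below.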
One possible direction for answering this question could be investigating the concentrability coefficients further and attempting to characterize them for specific MDPs, either theoretically or via estimation. Lastly, an indisputable next step would be to empirically evaluate implementations of the algorithms presented here.

Acknowledgments

This work was partially funded by the Israel Science Foundation under contract 1380/16.

References

[1] J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, pages 831–838, 2004.

[2] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pages 560–564. IEEE, 1995.

[3] Bruno Bouzy and Bernard Helmstetter. Monte-Carlo Go developments. In Advances in Computer Games, pages 159–174. Springer, 2004.

[4] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

[5] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Beyond the one-step greedy approach in reinforcement learning. arXiv preprint arXiv:1802.03654, 2018.

[6] Damien Ernst, Mevludin Glavic, Florin Capitanescu, and Louis Wehenkel. Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):517–529, 2009.

[7] Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate policy and value iteration.
In Advances in Neural Information Processing Systems, pages 568–576, 2010.

[8] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189, 2015.

[9] S. M. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002.

[10] Vijaymohan R Konda and Vivek S Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.

[11] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17(1):583–612, 2016.

[12] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[13] Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the Twentieth International Conference on Machine Learning, pages 560–567. AAAI Press, 2003.

[14] Rémi Munos. Performance bounds in L_p-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.

[15] Rudy R Negenborn, Bart De Schutter, Marco A Wiering, and Hans Hellendoorn. Learning-based model predictive control for Markov decision processes. IFAC Proceedings Volumes, 38(1):354–359, 2005.

[16] Steven Perkins and David S Leslie. Asynchronous stochastic approximation with differential inclusions.
Stochastic Systems, 2(2):409–446, 2013.

[17] Marek Petrik and Bruno Scherrer. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pages 1265–1272, 2009.

[18] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.

[19] Bruno Scherrer. Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322, 2014.

[20] Bruno Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. INFORMS, February 2016.

[21] Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2014.

[22] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[23] Brian Sheppard. World-championship-caliber Scrabble. Artificial Intelligence, 134(1-2):241–275, 2002.

[24] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

[25] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[26] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite MDPs: PAC analysis.
Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

[27] Aviv Tamar, Garrett Thomas, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. Learning from the hindsight plan: episodic MPC improvement. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 336–343. IEEE, 2017.

[28] Gerald Tesauro and Gregory R Galperin. On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing Systems, pages 1068–1074, 1997.

[29] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.