{"title": "Aggregating Optimistic Planning Trees for Solving Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2382, "page_last": 2390, "abstract": "This paper addresses the problem of online planning in Markov Decision Processes using only a generative model. We propose a new algorithm which is based on the construction of a forest of single successor state planning trees. For every explored state-action, such a tree contains exactly one successor state, drawn from the generative model. The trees are built using a planning algorithm which follows the optimism in the face of uncertainty principle, in assuming the most favorable outcome in the absence of further information. In the decision making step of the algorithm, the individual trees are combined. We discuss the approach, prove that our proposed algorithm is consistent, and empirically show that it performs better than a related algorithm which additionally assumes the knowledge of all transition distributions.", "full_text": "Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

Gunnar Kedenburg (INRIA Lille - Nord Europe / idalab GmbH), gunnar.kedenburg@inria.fr
Raphaël Fonteneau (University of Liège / INRIA Lille - Nord Europe), raphael.fonteneau@ulg.ac.be
Rémi Munos (INRIA Lille - Nord Europe / Microsoft Research New England), remi.munos@inria.fr

Abstract

This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment.
The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem. Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions.

1 Introduction

Adaptive decision-making algorithms have been used increasingly in recent years, and have attracted researchers from many application areas, such as artificial intelligence [16], financial engineering [10], medicine [14] and robotics [15]. These algorithms realize an adaptive control strategy through interaction with their environment, so as to maximize an a priori performance criterion.
A new generation of algorithms based on look-ahead tree search techniques has brought a breakthrough in practical performance on planning problems with large state spaces. Techniques based on planning trees such as Monte Carlo tree search [4, 13], and in particular the UCT algorithm (UCB applied to Trees, see [12]), have made it possible to tackle large-scale problems such as the game of Go [7]. These methods exploit the fact that, in order to decide on an action at a given state, it is not necessary to build an estimate of the value function everywhere. Instead, they search locally in the space of policies, around the current state.
We propose a new algorithm for planning in Markov Decision Processes (MDPs).
We assume that a limited budget of calls to a randomized simulator for the MDP (the generative model in [11]) is available for exploring the consequences of actions before making a decision. The intuition behind our algorithm is to achieve a high exploration depth in the look-ahead trees by planning in fixed realizations of the MDP, and to achieve the necessary exploration width by aggregating a forest of planning trees (forming an approximation of the MDP from many realizations). Each of the trees is developed around the state for which a decision has to be made, according to the principle of optimism in the face of uncertainty [13] combined with a safety principle.
We provide a finite-sample analysis depending on the budget, split into the number of trees and the number of node expansions in each tree. We show that our algorithm is consistent and that it identifies the optimal action when given a sufficiently large budget. We also give numerical results which demonstrate good performance on a benchmark problem. In particular, we show that our algorithm achieves much better performance on this problem than OP-MDP [2] when both algorithms generate the same number of successor states, despite the fact that OP-MDP assumes knowledge of all successor state probabilities in the MDP, whereas our algorithm only samples states from a simulator.
The paper is organized as follows: first, we discuss related work in section 2. In section 3, the problem addressed in this paper is formalized, before we describe our algorithm in section 4. Its finite-sample analysis is given in section 5. We provide numerical results on the inverted pendulum benchmark in section 6. In section 7, we discuss and conclude this work.

2 Related work

The optimism in the face of uncertainty paradigm has already led to several successful results for solving decision-making problems.
Specifically, it has been applied in the following contexts: multi-armed bandit problems [1] (which can be seen as single-state MDPs), planning algorithms for deterministic and stochastic systems [8, 9, 17], and global optimization of stochastic functions that are only accessible through sampling. See [13] for a detailed review of the optimistic principle applied to planning and optimization.
The algorithm presented in this paper is particularly closely related to two recently developed online planning algorithms for solving MDPs, namely the OPD algorithm [9] for MDPs with deterministic transitions, and the OP-MDP algorithm [2] which addresses stochastic MDPs where all transition probabilities are known. A Bayesian adaptation of OP-MDP has also been proposed [6] for planning in the context where the MDP is unknown.
Our contribution is also related to [5], where random ensembles of state-action independent disturbance scenarios are built, the planning problem is solved for each scenario, and a decision is made based on majority voting. Finally, since our algorithm proceeds by sequentially applying the first decision of a longer plan over a receding horizon, it can also be seen as a Model Predictive Control [3] technique.

3 Formalization

Let (S, A, p, r, γ) be a Markov decision process (MDP), where S and A respectively denote the state space and the finite action space, with |A| > 1, of the MDP. When an action a ∈ A is selected in state s ∈ S of the MDP, it transitions to a successor state s′ ∈ S(s, a) with probability p(s′|s, a). We further assume that every successor state set S(s, a) is finite and that their cardinalities are bounded by K ∈ N.
Associated with each transition is a deterministic instantaneous reward r(s, a, s′) ∈ [0, 1].
While the transition probabilities may be unknown, it is assumed that a randomized simulator is available, which, given a state-action pair (s, a), outputs a successor state s′ ∼ p(·|s, a). The ability to sample is a weaker assumption than the knowledge of all transition probabilities. In this paper we consider the problem of planning under a budget constraint: only a limited number of samples may be drawn using the simulator. Afterwards, a single decision has to be made.
Let π : S → A denote a deterministic policy. Define the value function of the policy π in a state s as the discounted sum of expected rewards:

v^π : S → R, v^π : s ↦ E[ Σ_{t=0}^∞ γ^t r(s_t, π(s_t), s_{t+1}) | s_0 = s ],   (1)

where the constant γ ∈ (0, 1) is called the discount factor. Let π* be an optimal policy (i.e. a policy that maximizes v^π in all states). It is well known that the optimal value function v* := v^{π*} is the solution to the Bellman equation

∀s ∈ S : v*(s) = max_{a∈A} Σ_{s′∈S(s,a)} p(s′|s, a) (r(s, a, s′) + γ v*(s′)).

Given the action-value function Q* : (s, a) ↦ Σ_{s′∈S(s,a)} p(s′|s, a) (r(s, a, s′) + γ v*(s′)), an optimal policy can be derived as π* : s ↦ argmax_{a∈A} Q*(s, a).

4 Algorithm

We name our algorithm ASOP (for “Aggregated Safe Optimistic Planning”).
The main idea behind it is to use a simulator to obtain a series of deterministic “realizations” of the stochastic MDP, to plan in each of them individually, and to then aggregate all the information gathered in the deterministic MDPs into an empirical approximation of the original MDP, on the basis of which a decision is made.
We refer to the planning trees used here as single successor state trees (S3-trees), in order to distinguish them from other planning trees used for the same problem (e.g. the OP-MDP tree, where all possible successor states are considered). Every node of an S3-tree represents a state s ∈ S, and has at most one child node per action a ∈ A, representing a successor state s′ ∈ S. The successor state is drawn using the simulator during the construction of the S3-tree.
The planning tree construction, using the SOP algorithm (for “Safe Optimistic Planning”), is described in section 4.1. The ASOP algorithm, which integrates building the forest and deciding on an action by aggregating the information in the forest, is described in section 4.2.

4.1 Safe optimistic planning in S3-trees: the SOP algorithm

SOP is an algorithm for sequentially constructing an S3-tree. It can be seen as a variant of the OPD algorithm [9] for planning in deterministic MDPs. SOP expands up to two leaves of the planning tree per iteration. The first leaf (the optimistic one) is a maximizer of an upper bound (called b-value) on the value function of the (deterministic) realization of the MDP explored in the S3-tree. The b-value of a node x is defined as

b(x) := Σ_{i=0}^{d(x)−1} γ^i r_i + γ^{d(x)} / (1 − γ),   (2)

where (r_i) is the sequence of rewards obtained along the path to x, and d(x) is the depth of the node (the length of the path from the root to x). Only expanding the optimistic leaf would not be enough to make ASOP consistent; this is shown in the appendix.
Therefore, a second leaf (the safe one), defined as the shallowest leaf in the current tree, is also expanded in each iteration. Pseudo-code is given as algorithm 1.

Algorithm 1: SOP
Data: The initial state s0 ∈ S and a budget n ∈ N
Result: A planning tree T
Let T denote a tree consisting only of a leaf, representing s0.
Initialize the cost counter c := 0.
while c < n do
    Form a subset L of the leaves of T, containing a leaf of minimal depth and a leaf of maximal b-value (computed according to (2); the two leaves can be identical).
    foreach l ∈ L do
        Let s denote the state represented by l.
        foreach a ∈ A do
            if c < n then
                Use the simulator to draw a successor state s′ ∼ p(·|s, a).
                Create an edge in T from l to a new leaf representing s′.
                Let c := c + 1.
return T

4.2 Aggregation of S3-trees: the ASOP algorithm

ASOP consists of three steps. In the first step, it runs independent instances of SOP to collect information about the MDP, in the form of a forest of S3-trees. It then computes action-values ˆQ*(s0, a) of a single “empirical” MDP based on the collected information, in which states are represented by forests: on a transition, the forest is partitioned into groups by successor state, and the corresponding frequencies are taken as the transition probabilities. Leaves are interpreted as absorbing states with zero reward on every action, yielding a trivial lower bound. Pseudo-code for this computation is given as algorithm 2. ASOP then outputs the action

ˆπ(s0) ∈ argmax_{a∈A} ˆQ*(s0, a).

The optimal policy of the empirical MDP has the property that the empirical lower bound of its value, computed from the information collected by planning in the individual realizations, is maximal over the set of all policies.
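As an illustration, the aggregation just described can be sketched in Python. This is our own reconstruction under stated simplifications, not the authors' implementation: the S3Node container, the grouping of sampled successors by (assumed hashable) state, and all function names are ours.

```python
from collections import defaultdict

class S3Node:
    # Node of a single successor state tree (S3-tree): at most one sampled
    # successor per action, stored as children[a] = (reward, successor node).
    def __init__(self, state):
        self.state = state
        self.children = {}

def action_value(forest, a, gamma):
    # Empirical lower bound for action a at the state represented by all roots
    # in `forest` (the ActionValue recursion of algorithm 2).
    edges = [node.children[a] for node in forest if a in node.children]
    if not edges:
        return 0.0  # leaves act as absorbing states with zero reward
    groups, rewards = defaultdict(list), {}
    for reward, succ in edges:  # group successors by state
        groups[succ.state].append(succ)
        rewards[succ.state] = reward
    # Frequencies over the grouped trees are the empirical transition probabilities.
    return sum(len(sub) / len(edges) * (rewards[s] + gamma * best_value(sub, gamma))
               for s, sub in groups.items())

def best_value(forest, gamma):
    # Max over explored actions of the empirical action value; 0.0 at leaves.
    acts = {a for node in forest for a in node.children}
    return max((action_value(forest, a, gamma) for a in acts), default=0.0)

def asop_decision(forest, actions, gamma):
    # Decision step: recommend the action with maximal aggregated value.
    return max(actions, key=lambda a: action_value(forest, a, gamma))
```

Here `asop_decision` plays the role of the final line of algorithm 3, once a forest {T1, . . . , Tm} of S3-trees rooted at s0 has been built by SOP.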
We give pseudo-code for the ASOP algorithm as algorithm 3.

Algorithm 2: ActionValue
Data: A forest F and an action a, with each tree in F representing the same state s
Result: An empirical lower bound for the value of a in s
Let E denote the edges representing action a at any of the root nodes of F.
if E = ∅ then
    return 0
else
    Let F′ be the set of trees pointed to by the edges in E.
    Enumerate the states represented by any tree in F′ by {s′_i : i ∈ I} for some finite I.
    foreach i ∈ I do
        Denote the set of trees in F′ which represent s′_i by F_i.
        Let ˆν_i := max_{a′∈A} ActionValue(F_i, a′).
        Let ˆp_i := |F_i| / |F′|.
    return Σ_{i∈I} ˆp_i (r(s, a, s′_i) + γ ˆν_i)

Algorithm 3: ASOP
Data: The initial state s0, a per-tree budget b ∈ N and the forest size m ∈ N
Result: An action to take
for i = 1, . . . , m do
    Let T_i := SOP(s0, b).
return argmax_{a∈A} ActionValue({T_1, . . . , T_m}, a)

5 Finite-sample analysis

In this section, we provide a finite-sample analysis of ASOP in terms of the number of planning trees m and the per-tree budget n. An immediate consequence of this analysis is that ASOP is consistent: the action returned by ASOP converges to the optimal action when both n and m tend to infinity.
Our loss measure is the “simple” regret, corresponding to the expected value of first playing the action ˆπ(s0) returned by the algorithm at the initial state s0 and acting optimally from then on, compared to acting optimally from the beginning:

R_{n,m}(s0) = Q*(s0, π*(s0)) − Q*(s0, ˆπ(s0)).

First, let us use the “safe” part of SOP to show that each S3-tree is fully explored up to a certain depth d when given a sufficiently large per-tree budget n.
Lemma 1.
For any d ∈ N, once a budget of n ≥ 2|A| (|A|^{d+1} − 1) / (|A| − 1) has been spent by SOP on an S3-tree, the state-actions of all nodes up to and including those at depth d have all been sampled exactly once.

Proof. A complete |A|-ary tree contains |A|^l nodes in level l, so it contains Σ_{l=0}^d |A|^l = (|A|^{d+1} − 1) / (|A| − 1) nodes up to and including level d. In each of these nodes, |A| actions need to be explored. We complete the proof by noticing that SOP spends at least half of its budget on shallowest leaves.

Let v^π_ω and v^π_{ω,n} denote the value functions for a policy π in the infinite, completely explored S3-tree defined by a random realization ω and in the finite S3-tree constructed by SOP for a budget of n in the same realization ω, respectively. From Lemma 1 we deduce that if the per-tree budget is at least

n ≥ 2 (|A| / (|A| − 1)) [ε(1 − γ)]^{−log |A| / log(1/γ)},   (3)

we obtain |v^π_ω(s0) − v^π_{ω,n}(s0)| ≤ |Σ_{i=d+1}^∞ γ^i r_i| ≤ γ^{d+1} / (1 − γ) ≤ ε for any policy π.
ASOP aggregates the trees and computes the optimal policy ˆπ of the resulting empirical MDP, whose transition probabilities are defined by the frequencies (over the m S3-trees) of transitions from state-action pairs to successor states. Therefore, ˆπ is actually a policy maximizing the function

π ↦ (1/m) Σ_{i=1}^m v^π_{ω_i,n}(s0).   (4)

If the number m of S3-trees and the per-tree budget n are large, we therefore expect the optimal policy ˆπ of the empirical MDP to be close to the optimal policy π* of the true MDP. This is the result stated in the following theorem.
Theorem 1.
For any δ ∈ (0, 1) and ε ∈ (0, 1), if the number of S3-trees is at least

m ≥ (8 / (ε^2 (1 − γ)^2)) ( (K / (K − 1)) [(ε/4)(1 − γ)]^{−log K / log(1/γ)} log |A| + log(4/δ) )   (5)

and the per-tree budget is at least

n ≥ 2 (|A| / (|A| − 1)) [(ε/4)(1 − γ)]^{−log |A| / log(1/γ)},   (6)

then P(R_{n,m}(s0) < ε) ≥ 1 − δ.

Proof. Let δ ∈ (0, 1) and ε ∈ (0, 1), and fix realizations {ω_1, . . . , ω_m} of the stochastic MDP, for some m satisfying (5). Each realization ω_i corresponds to an infinite, completely explored S3-tree. Let n denote some per-tree budget satisfying (6).
Analogously to (3), we know from Lemma 1 that, given our choice of n, SOP constructs trees which are completely explored up to depth d := ⌊log(ε(1 − γ)/4) / log γ⌋, fulfilling γ^{d+1} / (1 − γ) ≤ ε/4.
Consider the following truncated value functions: let ν^π_d(s0) denote the sum of expected discounted rewards obtained in the original MDP when following policy π for d steps and then receiving reward zero from there on, and let ν^π_{ω_i,d}(s0) denote the analogous quantity in the MDP corresponding to realization ω_i.
Define, for all policies π, the quantities ˆv^π_{m,n} := (1/m) Σ_{i=1}^m v^π_{ω_i,n}(s0) and ˆν^π_{m,d} := (1/m) Σ_{i=1}^m ν^π_{ω_i,d}(s0). Since the trees are complete up to level d and the rewards are non-negative, we deduce that 0 ≤ v^π_{ω_i,n} − ν^π_{ω_i,d} ≤ ε/4 for each i and each policy π, and thus the same is true for the averages:

0 ≤ ˆv^π_{m,n} − ˆν^π_{m,d} ≤ ε/4   ∀π.   (7)

Notice that ν^π_d(s0) = E_ω[ν^π_{ω,d}(s0)]. From the Chernoff-Hoeffding inequality, we have that for any fixed policy π (since the truncated values lie in [0, 1/(1 − γ)]),

P( |ˆν^π_{m,d} − ν^π_d(s0)| ≥ ε/4 ) ≤ 2 exp(−m ε^2 (1 − γ)^2 / 8).

Now we need a uniform bound over the set of all possible policies. The number of distinct policies is |A| · |A|^K · . . . · |A|^{K^d} (at each level l, there are at most K^l states that can be reached by following a policy at previous levels, so there are |A|^{K^l} different choices that policies can make at level l). Thus, since m ≥ (8 / (ε^2 (1 − γ)^2)) ( (K^{d+1} / (K − 1)) log |A| + log(4/δ) ), we have

P( max_π |ˆν^π_{m,d} − ν^π_d(s0)| ≥ ε/4 ) ≤ δ/2.   (8)

The action returned by ASOP is ˆπ(s0), where ˆπ := argmax_π ˆv^π_{m,n}. Finally, it follows that with probability at least 1 − δ:

R_{n,m}(s0) = Q*(s0, π*(s0)) − Q*(s0, ˆπ(s0)) ≤ v*(s0) − v^{ˆπ}(s0)
= (v*(s0) − ν^{π*}_d(s0)) + (ν^{π*}_d(s0) − ˆν^{π*}_{m,d}) + (ˆν^{π*}_{m,d} − ˆv^{π*}_{m,n}) + (ˆv^{π*}_{m,n} − ˆv^{ˆπ}_{m,n}) + (ˆv^{ˆπ}_{m,n} − ˆν^{ˆπ}_{m,d}) + (ˆν^{ˆπ}_{m,d} − ν^{ˆπ}_d(s0)) + (ν^{ˆπ}_d(s0) − v^{ˆπ}(s0))
≤ ε/4 + ε/4 + 0 + 0 + ε/4 + ε/4 + 0 = ε,

where the first and last terms are bounded by truncation (using the non-negativity of the rewards), the second and sixth by (8), the third and fifth by (7), and the fourth is non-positive by the definition of ˆπ.

Remark 1. The total budget (nm) required to return an ε-optimal action with high probability is thus of order ε^{−2 − log(K|A|) / log(1/γ)}. Notice that this rate is poorer (by an ε^{−2} factor) than the rate obtained for uniform planning in [2]; this is a direct consequence of the fact that we are only drawing samples, whereas a full model of the transition probabilities is assumed in [2].
Remark 2. Since there is a finite number of actions, by denoting ∆ > 0 the optimality gap between the best and the second-best action values, we have that the optimal arm is identified (in high probability) (i.e. the simple regret is 0) after a total budget of order ∆^{−2 − log(K|A|) / log(1/γ)}.
Remark 3. The optimistic part of the algorithm allows a deep exploration of the MDP. At the same time, it biases the expression maximized by ˆπ in (4) towards near-optimal actions of the deterministic realizations. Under the assumptions of theorem 1, the bias becomes insignificant.
Remark 4. Notice that we do not use the optimistic properties of the algorithm in the analysis. The analysis only uses the “safe” part of SOP planning, i.e. the fact that one sample out of two is devoted to expanding the shallowest nodes. An analysis of the benefit of the optimistic part of the algorithm, similar to the analyses carried out in [9, 2], would be much more involved and is deferred to future work.
However, the impact of the optimistic part of the algorithm is essential in practice, as shown in the numerical results.

6 Numerical results

In this section, we compare the performance of ASOP to OP-MDP [2], UCT [12], and FSSS [17]. We use the (noisy) inverted pendulum benchmark problem from [2], which consists of swinging up and stabilizing a weight attached to an actuated link that rotates in a vertical plane. Since the available power is too low to push the pendulum up in a single rotation from the initial state, the pendulum has to be swung back and forth to gather energy, prior to being pushed up and stabilized.
The inverted pendulum is described by the state variables (α, α̇) ∈ [−π, π] × [−15, 15] and the differential equation α̈ = (mgl sin(α) − bα̇ − K(Kα̇ + u)/R) / J, where J = 1.91 · 10^-4 kg·m^2, m = 0.055 kg, g = 9.81 m/s^2, l = 0.042 m, b = 3 · 10^-6 Nm·s/rad, K = 0.0536 Nm/A, and R = 9.5 Ω. The state variable α̇ is constrained to [−15, 15] by saturation. The discrete-time problem is obtained by mapping actions from A = {−3 V, 0 V, 3 V} to segments of a piecewise constant control signal u, each 0.05 s in duration, and then numerically integrating the differential equation on the constant segments using RK4. The actions are applied stochastically: with probability 0.6, the intended voltage is applied in the control signal, whereas with probability 0.4, the smaller voltage 0.7a is applied. The goal is to stabilize the pendulum in the unstable equilibrium s* = (0, 0) (pointing up, at rest) when starting from state (−π, 0) (pointing down, at rest).
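As an illustration, the generative model just described can be sketched as follows. This is our own reconstruction, not the code used in the experiments: the constants are taken from the text, but the number of RK4 sub-steps per control segment is not specified there and is an assumption, as are all function names.

```python
import math
import random

# Constants from the text (SI units).
J, m, g, l = 1.91e-4, 0.055, 9.81, 0.042
b, K, R = 3e-6, 0.0536, 9.5
DT = 0.05       # duration of one constant control segment (from the text)
SUBSTEPS = 10   # RK4 sub-steps per segment; not given in the text (assumption)

def accel(alpha, dalpha, u):
    # Angular acceleration from the differential equation in the text.
    return (m * g * l * math.sin(alpha) - b * dalpha - K * (K * dalpha + u) / R) / J

def rk4_step(alpha, dalpha, u, h):
    # One classical RK4 step for the system (alpha' = dalpha, dalpha' = accel).
    k1 = (dalpha, accel(alpha, dalpha, u))
    k2 = (dalpha + h / 2 * k1[1], accel(alpha + h / 2 * k1[0], dalpha + h / 2 * k1[1], u))
    k3 = (dalpha + h / 2 * k2[1], accel(alpha + h / 2 * k2[0], dalpha + h / 2 * k2[1], u))
    k4 = (dalpha + h * k3[1], accel(alpha + h * k3[0], dalpha + h * k3[1], u))
    return (alpha + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            dalpha + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))

def simulate(state, a, rng=random):
    # Generative model: with probability 0.6 the intended voltage a is applied,
    # otherwise the attenuated voltage 0.7 * a.
    u = a if rng.random() < 0.6 else 0.7 * a
    alpha, dalpha = state
    for _ in range(SUBSTEPS):
        alpha, dalpha = rk4_step(alpha, dalpha, u, DT / SUBSTEPS)
    alpha = (alpha + math.pi) % (2 * math.pi) - math.pi  # keep alpha in [-pi, pi)
    dalpha = max(-15.0, min(15.0, dalpha))               # saturation (from the text)
    return (alpha, dalpha)
```

A planner such as SOP would only interact with `simulate`, never with the transition probabilities themselves.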
This goal is expressed by the penalty function (s, a, s′) ↦ −5α′^2 − 0.1α̇′^2 − a^2, where s′ = (α′, α̇′). The reward function r is obtained by scaling and translating the values of the penalty function so that it maps to the interval [0, 1], with r(s, 0, s*) = 1. The discount factor is set to γ = 0.95.

Figure 1: Comparison of ASOP to OP-MDP, UCT, and FSSS on the inverted pendulum benchmark problem, showing the sum of discounted rewards for simulations of 50 time steps, as a function of the number of calls to the simulator per step.

The algorithms are compared for several budgets. In the cases of ASOP, UCT, and FSSS, the budget is in terms of calls to the simulator. OP-MDP does not use a simulator. Instead, every possible successor state is incorporated into the planning tree, together with its precise probability mass, and each of these states is counted against the budget. As the benchmark problem is stochastic, and internal randomization (for the simulator) is used in all algorithms except OP-MDP, the performance is averaged over 50 repetitions. The algorithm parameters have been selected manually to achieve good performance. For ASOP, we show results for forest sizes of two and three. For UCT, the Chernoff-Hoeffding term multiplier is set to 0.2 (the results are not very sensitive to this value, therefore only one result is shown). For FSSS, we use one to three samples per state-action. For both UCT and FSSS, a rollout depth of seven is used. OP-MDP does not have any parameters. The results are shown in figure 1.
We observe that on this problem, ASOP performs much better than\nOP-MDP for every value of the budget, and also performs well in comparison to the other sampling\nbased methods, UCT and FSSS.\nFigure 2 shows the impact of optimistic planning on the performance of our aggregation method. For\nforest sizes of both one and three, optimistic planning leads to considerably increased performance.\nThis is due to the greater planning depth in the lookahead tree when using optimistic exploration.\nFor the case of a single tree, performance decreases (presumably due to over\ufb01tting) on the stochastic\nproblem for increasing budget. The effect disappears when more than one tree is used.\n\n7 Conclusion\n\nWe introduced ASOP, a novel algorithm for solving online planning problems using a (randomized)\nsimulator for the MDP, under a budget constraint. The algorithm works by constructing a forest\nof single successor state trees, each corresponding to a random realization of the MDP transitions.\nEach tree is constructed using a combination of safe and optimistic planning. An empirical MDP\nis de\ufb01ned, based on the forest, and the \ufb01rst action of the optimal policy of this empirical MDP is\nreturned. In short, our algorithm targets structured problems (where the value function possesses\nsome smoothness property around the optimal policies of the deterministic realizations of the MDP, in\na sense de\ufb01ned e.g. in [13]) by using the optimistic principle to focus rapidly on the most promising\narea(s) of the search space. It can also \ufb01nd a reasonable solution in unstructured problems, since some\nof the budget is allocated for uniform exploration. ASOP shows good performance on the inverted\npendulum benchmark. 
Finally, our algorithm is also appealing in that the numerically heavy part of constructing the planning trees, in which the simulator is used, can be performed in a distributed way.

Figure 2: Comparison of different planning strategies (on the same problem as in figure 1), showing the sum of discounted rewards as a function of the number of calls to the simulator per step. The “Safe” strategy is to use uniform planning in the individual trees; the “Optimistic” strategy is to use OPD. ASOP corresponds to the “Safe+Optimistic” strategy.

Acknowledgements

We acknowledge the support of the BMBF project ALICE (01IB10003B), the European Community's Seventh Framework Programme FP7/2007-2013 under grant no. 270327 CompLACS and the Belgian PAI DYSCO. Raphaël Fonteneau is a post-doctoral fellow of the F.R.S. - FNRS. We also thank Lucian Busoniu for sharing his implementation of OP-MDP.

Appendix: Counterexample to consistency when using purely optimistic planning in S3-trees

Consider the MDP in figure 3 with k zero-reward transitions in the middle branch, where γ ∈ (0, 1) and k ∈ N are chosen such that 1/2 > γ^k > 1/3 (e.g. γ = 0.95 and k = 14). The trees are constructed iteratively, and every iteration consists of exploring a leaf of maximal b-value, where exploring a leaf means introducing a single successor state per action at the selected leaf. The state-action values are Q*(x, a) = (1/3) · 1/(1 − γ) + (2/3) · γ^k/(1 − γ) and Q*(x, b) = (1/2) · 1/(1 − γ). There are two possible outcomes when sampling the action a, which occur with probabilities 1/3 and 2/3, respectively:
Outcome I: The upper branch of action a is sampled. In this case, the contribution to the forest is an arbitrarily long reward-1 path for action a, and a finite reward-1/2 path for action b.
Outcome II: The lower branch of action a is sampled. Because γ^k/(1 − γ) < (1/2) · 1/(1 − γ), the lower branch will be explored only up to k times, as its b-value is then lower than the value (and therefore any b-value) of action b. The contribution of this case to the forest is a finite reward-0 path for action a and an arbitrarily long (depending on the budget) reward-1/2 path for action b.
For an increasing exploration budget per tree and an increasing number of trees, the approximate action values of actions a and b obtained by aggregation converge to (1/3) · 1/(1 − γ) and (1/2) · 1/(1 − γ), respectively. Therefore, the decision rule will select action b for a sufficiently large budget, even though a is the optimal action. Using γ^k > 1/3, this leads to a simple regret of R(x) = Q*(x, a) − Q*(x, b) > (1/3 + 2/9) · 1/(1 − γ) − (1/2) · 1/(1 − γ) = (1/18) · 1/(1 − γ).

Figure 3: The middle branch (II) of this MDP is never explored deep enough if only the node with the largest b-value is sampled in each iteration. Transition probabilities are given in gray where not equal to one.

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of multiarmed bandit problems. Machine Learning, 47:235–256, 2002.
[2] L. Busoniu and R. Munos. Optimistic planning for Markov decision processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 22, pages 182–189, 2012.
[3] E. F. Camacho and C. Bordons. Model Predictive Control. Springer, 2004.
[4] R. Coulom.
Ef\ufb01cient selectivity and backup operators in Monte-Carlo tree search. Computers\n\nand Games, pages 72\u201383, 2007.\n\n[5] B. Defourny, D. Ernst, and L. Wehenkel. Lazy planning under uncertainty by optimizing\ndecisions on an ensemble of incomplete disturbance trees. In Recent Advances in Reinforcement\nLearning - European Workshop on Reinforcement Learning (EWRL), pages 1\u201314, 2008.\n\n[6] R. Fonteneau, L. Busoniu, and R. Munos. Optimistic planning for belief-augmented Markov\ndecision processes. In IEEE International Symposium on Adaptive Dynamic Programming and\nReinforcement Learning (ADPRL), 2013.\n\n[7] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modi\ufb01cation of UCT with patterns in Monte-\n\nCarlo go. Technical report, INRIA RR-6062, 2006.\n\n[8] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of\nminimum cost paths. Systems Science and Cybernetics, IEEE Transactions on, 4(2):100\u2013107,\n1968.\n\n[9] J. F. Hren and R. Munos. Optimistic planning of deterministic systems. Recent Advances in\n\nReinforcement Learning, pages 151\u2013164, 2008.\n\n[10] J. E. Ingersoll. Theory of Financial Decision Making. Rowman and Little\ufb01eld Publishers, Inc.,\n\n1987.\n\n[11] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning\n\nin large Markov decision processes. Machine Learning, 49(2-3):193\u2013208, 2002.\n\n[12] L. Kocsis and C. Szepesv\u00e1ri. Bandit based Monte-Carlo planning. Machine Learning: ECML\n\n2006, pages 282\u2013293, 2006.\n\n[13] R. Munos. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to\noptimization and planning. To appear in Foundations and Trends in Machine Learning, 2013.\n[14] S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society,\n\nSeries B, 65(2):331\u2013366, 2003.\n\n[15] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. 
In IEEE-RAS International Conference on Humanoid Robots, pages 1–20, 2003.
[16] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.
[17] T. J. Walsh, S. Goschin, and M. L. Littman. Integrating sample-based planning and model-based reinforcement learning. In AAAI Conference on Artificial Intelligence, 2010.
", "award": [], "sourceid": 1131, "authors": [{"given_name": "Gunnar", "family_name": "Kedenburg", "institution": "INRIA"}, {"given_name": "Raphael", "family_name": "Fonteneau", "institution": "Universit\u00e9 de Li\u00e8ge"}, {"given_name": "Remi", "family_name": "Munos", "institution": "INRIA"}]}