{"title": "Trajectory-Based Short-Sighted Probabilistic Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 3248, "page_last": 3256, "abstract": "Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of Short-Sighted Probabilistic Planner (SSiPP) [ref] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectory-based short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately $10^{70}$ states.", "full_text": "Trajectory-Based Short-Sighted Probabilistic\n\nPlanning\n\nFelipe W. Trevizan\n\nMachine Learning Department\n\nManuela M. 
Veloso

Computer Science Department

Carnegie Mellon University - Pittsburgh, PA

{fwt,mmv}@cs.cmu.edu

Abstract

Probabilistic planning captures the uncertainty of plan execution by probabilistically modeling the effects of actions in the environment, and therefore the probability of reaching different states from a given state and action. In order to compute a solution for a probabilistic planning problem, planners need to manage the uncertainty associated with the different paths from the initial state to a goal state. Several approaches to manage uncertainty were proposed, e.g., consider all paths at once, perform determinization of actions, and sampling. In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel approach to manage uncertainty for probabilistic planning problems in which states reachable with low probability are substituted by artificial goals that heuristically estimate their cost to reach a goal state. We also extend the theoretical results of the Short-Sighted Probabilistic Planner (SSiPP) [1] by proving that SSiPP always finishes and is asymptotically optimal under sufficient conditions on the structure of short-sighted SSPs. We empirically compare SSiPP using trajectory-based short-sighted SSPs with the winners of the previous probabilistic planning competitions and other state-of-the-art planners in the triangle tireworld problems. Trajectory-based SSiPP outperforms all the competitors and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states.

1 Introduction

The uncertainty of plan execution can be modeled by using probabilistic effects in actions, and therefore the probability of reaching different states from a given state and action.
This search space, defined by the probabilistic paths from the initial state to a goal state, challenges the scalability of planners. Planners manage the uncertainty by choosing a search strategy to explore the space. In this work, we present a novel approach to manage uncertainty for probabilistic planning problems that improves scalability while still being optimal.

One approach to manage uncertainty while searching for the solution of probabilistic planning problems is to consider the complete search space at once. Examples of such algorithms are value iteration and policy iteration [2]. Planners based on these algorithms return a closed policy, i.e., a universal mapping function from every state to the optimal action that leads to a goal state. Assuming the model correctly captures the cost and uncertainty of the actions in the environment, closed policies are extremely powerful, as their execution never "fails" and the planner never needs to be re-invoked. Unfortunately, the computation of such policies becomes prohibitively expensive as problems scale up. Value iteration based probabilistic planners can be improved by combining asynchronous updates and heuristic search [3-7]. Although these techniques allow planners to compute compact policies, in the worst case these policies are still linear in the size of the state space, which itself can be exponential in the size of the state or goals.

Another approach to manage uncertainty is basically to ignore uncertainty during planning, i.e., to approximate the probabilistic actions as deterministic actions.
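One common form of determinization, the most-likely-outcome relaxation, keeps only the most probable outcome of each probabilistic action. A minimal sketch of this idea follows; the data layout and function name are ours for illustration, not taken from any of the planners discussed below:

```python
def determinize_most_likely(actions):
    """Most-likely-outcome determinization: replace each probabilistic
    action (a map from outcome label to probability) by a deterministic
    action that always produces its most probable outcome."""
    return {name: max(outcomes, key=outcomes.get)
            for name, outcomes in actions.items()}

# A move action that succeeds with probability 0.6 and causes a flat
# tire with probability 0.4 becomes an always-successful move.
actions = {"move-car": {"arrived": 0.6, "flat-tire": 0.4}}
print(determinize_most_likely(actions))  # {'move-car': 'arrived'}
```

Note that this is exactly the simplification that discards probability and dead-end information: the "flat-tire" outcome disappears from the relaxed model entirely.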
Examples of replanners based on determinization are FF-Replan [8], the winner of the first International Probabilistic Planning Competition (IPPC) [9], Robust FF [10], the winner of the third IPPC [11], and FF-Hindsight [12, 13]. Despite the major success of determinization, this simplification of the action space results in algorithms oblivious to probabilities and dead-ends, leading to poor performance in specific problems, e.g., the triangle tireworld [14].

Besides the action space simplification, uncertainty management can be performed by simplifying the problem horizon, i.e., look-ahead search [15]. Based on sampling, the Upper Confidence bound for Trees (UCT) algorithm [16] approximates the look-ahead search by focusing the search on the most promising nodes.

The state space can also be simplified to manage uncertainty in probabilistic planning. One example of such an approach is Envelope Propagation (EP) [17]. EP computes an initial partial policy π and then prunes all the states not considered by π; the pruned states are represented by a special meta-state. EP then iteratively improves its approximation of the state space. Previously, we introduced short-sighted planning [1], a new approach to manage uncertainty in planning problems: given a state s, only the uncertainty structure of the problem in the neighborhood of s is taken into account, and the remaining states are approximated by artificial goals that heuristically estimate their cost to reach a goal state.

In this paper, we introduce trajectory-based short-sighted Stochastic Shortest Path Problems (SSPs), a novel model to manage uncertainty in probabilistic planning problems. Trajectory-based short-sighted SSPs manage uncertainty by pruning the state space based on the most likely trajectory between states and defining artificial goal states that guide the solution towards the original goal.
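The most-likely-trajectory pruning just described hinges on Pmax(s, s′), the maximum probability of any trajectory from s to s′ (defined formally in Section 2). Since every transition probability lies in [0, 1], extending a trajectory can only shrink its probability, so Pmax can be computed for all states at once with a Dijkstra-style search that maximizes products instead of minimizing sums. A sketch, using our own (hypothetical) encoding of the transition function:

```python
import heapq

def max_trajectory_prob(P, s):
    """Return a dict mapping each reachable state s' to Pmax(s, s').
    P: dict state -> list of (action, [(next_state, prob), ...]).
    Dijkstra's algorithm applies because appending a transition can only
    multiply a trajectory's probability by a factor <= 1, so the first
    time a state is popped its probability is maximal."""
    pmax = {s: 1.0}
    frontier = [(-1.0, s)]  # max-heap simulated via negated probabilities
    while frontier:
        neg_p, u = heapq.heappop(frontier)
        p = -neg_p
        if p < pmax.get(u, 0.0):
            continue  # stale queue entry
        for _action, outcomes in P.get(u, []):
            for v, q in outcomes:
                if p * q > pmax.get(v, 0.0):
                    pmax[v] = p * q
                    heapq.heappush(frontier, (-p * q, v))
    return pmax

# Two moves that each succeed with probability 0.5 (as in the triangle
# tireworld): the best trajectory from s0 to sG has probability 0.25.
P = {"s0": [("move", [("s1", 0.5), ("dead", 0.5)])],
     "s1": [("move", [("sG", 0.5), ("dead", 0.5)])]}
print(max_trajectory_prob(P, "s0")["sG"])  # 0.25
```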
We also contribute by defining a class of short-sighted models and proving that the Short-Sighted Probabilistic Planner (SSiPP) [1] always terminates and is asymptotically optimal for all models in this class.

The remainder of this paper is organized as follows: Section 2 introduces the basic concepts and notation. Section 3 formally defines trajectory-based short-sighted SSPs. Section 4 presents our new theoretical results for SSiPP. Section 5 empirically compares SSiPP using trajectory-based short-sighted SSPs with the winners of the previous IPPCs and other state-of-the-art planners. Section 6 concludes the paper.

2 Background

A Stochastic Shortest Path Problem (SSP) is defined by the tuple S = ⟨S, s0, G, A, P, C⟩, in which [1, 18]: S is the finite set of states; s0 ∈ S is the initial state; G ⊆ S is the set of goal states; A is the finite set of actions; P(s′|s, a) represents the probability that s′ ∈ S is reached after applying action a ∈ A in state s ∈ S; and C(s, a, s′) ∈ (0, +∞) is the cost incurred when state s′ is reached after applying action a in state s. The cost function is required to be defined for all s ∈ S, a ∈ A and s′ ∈ S such that P(s′|s, a) > 0.

A solution to an SSP is a policy π, i.e., a mapping from S to A. If π is defined over the entire space S, then π is a closed policy. A policy π defined only for the states reachable from s0 when following π is a closed policy w.r.t. s0, and S(π, s0) denotes this set of reachable states. For instance, in the SSP depicted in Figure 1(a), the policy π0 = {(s0, a0), (s′1, a0), (s′2, a0), (s′3, a0)} is a closed policy w.r.t. s0 and S(π0, s0) = {s0, s′1, s′2, s′3, sG}.

Given a policy π, we define a trajectory as a sequence Tπ = ⟨s(0), . . ., s(k)⟩ such that, for all i ∈ {0, . . ., k − 1}, π(s(i)) is defined and P(s(i+1)|s(i), π(s(i))) > 0. The probability of a trajectory Tπ is defined as P(Tπ) = ∏_{i<|Tπ|} P(s(i+1)|s(i), π(s(i))), and the maximum probability of a trajectory between two states, Pmax(s, s′), is defined as max_π P(Tπ = ⟨s, . . ., s′⟩).

An optimal policy π* for an SSP is any policy that always reaches a goal state when followed from s0 and also minimizes the expected cost of Tπ*. For a given SSP, π* might not be unique; however, the optimal value function V*, i.e., the mapping from states to the minimum expected cost to reach a goal state, is unique. V* is the fixed point of the set of equations defined by (1) for all s ∈ S \ G, with V*(s) = 0 for all s ∈ G. Notice that under the optimality criterion given by (1), SSPs are

Figure 1: (a) Example of an SSP. The initial state is s0, the goal state is sG, and C(s, a, s′) = 1 ∀s ∈ S, a ∈ A and s′ ∈ S.
(b) State-space partition of (a) according to depth-based short-sighted SSPs: Gs0,t contains all the states in the dotted regions whose conditions hold for the given value of t. (c) State-space partition of (a) according to trajectory-based short-sighted SSPs: Gs0,ρ contains all the states in the dotted regions whose conditions hold for the given value of ρ.

more general than Markov Decision Processes (MDPs) [19]; therefore all the work presented here is directly applicable to MDPs.

V*(s) = min_{a∈A} Σ_{s′∈S} [C(s, a, s′) + P(s′|s, a) V*(s′)]    (1)

Definition 1 (reachability assumption). An SSP satisfies the reachability assumption if, for all s ∈ S, there exists sG ∈ G such that Pmax(s, sG) > 0.

Given an SSP S, if a goal state can be reached with positive probability from every state s ∈ S, then the reachability assumption (Definition 1) holds for S and 0 ≤ V*(s) < ∞ [19]. Once V* is known, any optimal policy π* can be extracted from V* by substituting the operator min by argmin in equation (1).

A possible approach to compute V* is the value iteration algorithm: define V^{i+1}(s) as in (1) with V^i on the right-hand side instead of V*; the sequence ⟨V^0, V^1, . . ., V^k⟩ then converges to V* as k → ∞ [19]. The process of computing V^{i+1} from V^i is known as a Bellman update, and V^0(s) can be initialized with an admissible heuristic H(s), i.e., a lower bound for V*. In practice we are interested in reaching ε-convergence, that is, given ε, finding V such that max_s |V(s) − min_{a∈A} Σ_{s′∈S} [C(s, a, s′) + P(s′|s, a) V(s′)]| ≤ ε. The following well-known result is necessary in most of our proofs [2, Assumption 2.2 and Lemma 2.1]:

Theorem 1.
Given an SSP S, if the reachability assumption holds for S, then the admissibility and monotonicity of V are preserved through the Bellman updates.

3 Trajectory-Based Short-Sighted SSPs

Short-sighted Stochastic Shortest Path Problems (short-sighted SSPs) [1] are a special case of SSPs in which the original problem is transformed into a smaller one by: (i) pruning the state space; and (ii) adding artificial goal states to heuristically guide the search towards the goals of the original problem. Depth-based short-sighted SSPs are defined based on the action-distance between states [1]:

Definition 2 (action-distance). The non-symmetric action-distance δ(s, s′) between two states s and s′ is min_k {k | ∃π such that Tπ = ⟨s, s(1), . . ., s(k−1), s′⟩ is a trajectory}.

Definition 3 (Depth-Based Short-Sighted SSP). Given an SSP S = ⟨S, s0, G, A, P, C⟩, a state s ∈ S, t > 0 and a heuristic H, the (s, t)-depth-based short-sighted SSP Ss,t = ⟨Ss,t, s, Gs,t, A, P, Cs,t⟩ associated with S is defined as:

• Ss,t = {s′ ∈ S | δ(s, s′) ≤ t};
• Gs,t = {s′ ∈ S | δ(s, s′) = t} ∪ (G ∩ Ss,t);
• Cs,t(s′, a, s″) = C(s′, a, s″) + H(s″) if s″ ∈ Gs,t, and Cs,t(s′, a, s″) = C(s′, a, s″) otherwise, ∀s′ ∈ Ss,t, a ∈ A, s″ ∈ Ss,t.

Figure 1(b) shows, for different values of t, Ss0,t for the SSP in Figure 1(a); for instance, if t = 2 then Ss0,2 = {s0, s1, s′1, s2, s′2} and Gs0,2 = {s2, s′2}. In the example shown in Figure 1(b), we can see that the generation of Ss0,t is independent of the trajectory probabilities: for t = 2, s2 ∈ Ss0,2 and s′3 ∉ Ss0,2, even though Pmax(s0, s2) = 0.16 < Pmax(s0, s′3) = 0.75^3 ≈ 0.42.

Definition 4 (Trajectory-Based Short-Sighted SSP).
Given an SSP S = ⟨S, s0, G, A, P, C⟩, a state s ∈ S, ρ ∈ [0, 1] and a heuristic H, the (s, ρ)-trajectory-based short-sighted SSP Ss,ρ = ⟨Ss,ρ, s, Gs,ρ, A, P, Cs,ρ⟩ associated with S is defined as:

• Ss,ρ = {s′ ∈ S | ∃ŝ ∈ S and a ∈ A s.t. Pmax(s, ŝ) ≥ ρ and P(s′|ŝ, a) > 0};
• Gs,ρ = (G ∩ Ss,ρ) ∪ (Ss,ρ ∩ {s′ ∈ S | Pmax(s, s′) < ρ});
• Cs,ρ(s′, a, s″) = C(s′, a, s″) + H(s″) if s″ ∈ Gs,ρ, and Cs,ρ(s′, a, s″) = C(s′, a, s″) otherwise, ∀s′ ∈ Ss,ρ, a ∈ A, s″ ∈ Ss,ρ.

For simplicity, when H is neither explicit nor clear from context, H(s) = 0 for all s ∈ S.

Our novel model, the trajectory-based short-sighted SSP (Definition 4), addresses the issue of states with low trajectory probability by explicitly defining its state space Ss,ρ based on the maximum probability of a trajectory between s and each candidate state s′, i.e., Pmax(s, s′). Figure 1(c) shows, for all values of ρ ∈ [0, 1], the trajectory-based Ss0,ρ for the SSP in Figure 1(a): for instance, if ρ = 0.75^3 then Ss0,0.75^3 = {s0, s1, s′1, s′2, s′3, sG} and Gs0,0.75^3 = {s1, sG}. This example shows how trajectory-based short-sighted SSPs can manage uncertainty efficiently: for ρ = 0.75^3, |Ss0,ρ| = 6 and the goal sG of the original SSP is already included in Ss0,ρ, while for depth-based short-sighted SSPs, sG ∈ Ss0,t only for t ≥ 4, in which case |Ss0,t| = |S| = 8.

Notice that the definition of Ss,ρ cannot be simplified to {ŝ ∈ S | Pmax(s, ŝ) ≥ ρ}, since not all the resulting states of actions would be included in Ss,ρ.
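As an illustration, the state and goal sets of Definition 4 can be built directly from a table of Pmax values. The sketch below uses our own (hypothetical) encoding of the transition function and, by convention, also keeps every state ŝ with Pmax(s, ŝ) ≥ ρ itself in the state set (in particular the root state s):

```python
def trajectory_based_sets(S, G, P, pmax, rho):
    """Build (S_rho, G_rho) of an (s, rho)-trajectory-based short-sighted
    SSP following Definition 4.
    pmax: dict s' -> Pmax(s, s') for the root state s (missing = 0.0).
    P: dict state -> list of (action, [(next_state, prob), ...])."""
    s_rho = set()
    for s_hat in S:
        if pmax.get(s_hat, 0.0) >= rho:
            s_rho.add(s_hat)  # includes the root state s, since Pmax(s, s) = 1
            # every positive-probability outcome of s_hat is kept, so all
            # actions applicable in s_hat remain well defined
            for _action, outcomes in P.get(s_hat, []):
                s_rho.update(v for v, q in outcomes if q > 0)
    # artificial goals: states of S_rho whose best trajectory from s has
    # probability below rho, plus the original goals inside S_rho
    g_rho = (G & s_rho) | {v for v in s_rho if pmax.get(v, 0.0) < rho}
    return s_rho, g_rho

# The invalid-simplification example discussed next (s' and s'' written
# as s1 and s2): with rho = 0.5, both outcomes of action a are kept, and
# the low-probability outcome s2 becomes an artificial goal.
P = {"s": [("a", [("s1", 0.9), ("s2", 0.1)])]}
pmax = {"s": 1.0, "s1": 0.9, "s2": 0.1}
s_rho, g_rho = trajectory_based_sets({"s", "s1", "s2"}, set(), P, pmax, 0.5)
print(sorted(s_rho), sorted(g_rho))  # ['s', 's1', 's2'] ['s2']
```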
For example, consider S = {s, s′, s″} with P(s′|s, a) = 0.9 and P(s″|s, a) = 0.1; then, for ρ ∈ (0.1, 0.9], {ŝ ∈ S | Pmax(s, ŝ) ≥ ρ} = {s, s′}, generating an invalid SSP since not all the resulting states of a would be contained in the model.

4 Short-Sighted Probabilistic Planner

The Short-Sighted Probabilistic Planner (SSiPP) is an algorithm that solves SSPs based on short-sighted SSPs [1]. SSiPP is reviewed in Algorithm 1 and consists of iteratively generating and solving short-sighted SSPs of the given SSP. Due to the reduced size of the short-sighted problems, SSiPP solves each of them by computing a closed policy w.r.t. their initial state. We therefore obtain a "fail-proof" solution for each short-sighted SSP: if this solution is directly executed in the environment, then replanning is not needed until an artificial goal is reached. Alternatively, an anytime behavior is obtained if the execution of the computed closed policy for the short-sighted SSP is simulated (Algorithm 1, line 4) until an artificial goal sa is reached, and this procedure is repeated, starting from sa, until convergence or an interruption.

In [1], we proved that SSiPP always terminates and is asymptotically optimal for depth-based short-sighted SSPs. We generalize these results by: (i) providing sufficient conditions on the generation of short-sighted problems (Algorithm 1, line 1) in Definition 5; and (ii) proving that SSiPP always terminates (Theorem 3) and is asymptotically optimal (Corollary 4) when the short-sighted SSP generator respects Definition 5. Notice that, by definition, both depth-based and trajectory-based short-sighted SSPs meet the sufficient conditions presented in Definition 5.

Definition 5.
Given an SSP ⟨S, s0, G, A, P, C⟩, the sufficient conditions on the short-sighted SSPs ⟨S′, ŝ, G′, A, P′, C′⟩ returned by the generator in Algorithm 1, line 1 are:

1. G ∩ S′ ⊆ G′;
2. ŝ ∉ G → ŝ ∉ G′; and
3. for all s ∈ S, a ∈ A and s′ ∈ S′ \ G′, if P(s|s′, a) > 0 then s ∈ S′ and P′(s|s′, a) = P(s|s′, a).

Lemma 2. SSiPP performs Bellman updates on the original SSP S.

SSIPP(SSP S = ⟨S, s0, G, A, P, C⟩, H a heuristic for V*, and params the parameters to generate short-sighted SSPs)
begin
    V ← value function for S initialized by H
    s ← s0
    while s ∉ G do
1:      ⟨S′, s, G′, A, P, C′⟩ ← GENERATE-SHORT-SIGHTED-SSP(S, s, V, params)
        (π̂*, V̂*) ← OPTIMAL-SSP-SOLVER(⟨S′, s, G′, A, P, C′⟩, V)
2:      forall s′ ∈ S′(π̂*, s) do
            V(s′) ← V̂*(s′)
3:      while s ∉ G′ do
4:          s ← execute-action(π̂*(s))
    return V
end

Algorithm 1: SSiPP algorithm [1]. GENERATE-SHORT-SIGHTED-SSP represents a procedure to generate short-sighted SSPs, either depth-based or trajectory-based; params = t in the former case and params = ρ in the latter. OPTIMAL-SSP-SOLVER returns an optimal policy π* w.r.t. s0 for S and the V* associated with π*, i.e., V* needs to be defined only for s ∈ S(π*, s0).

Proof.
In order to show that SSiPP performs Bellman updates implicitly, consider the loop in line 2 of Algorithm 1. Since OPTIMAL-SSP-SOLVER computes V̂*, by the definition of short-sighted SSPs: (i) V̂*(sG) equals V(sG) for all sG ∈ G′, therefore the value of V(sG) remains the same; and (ii) min_{a∈A} Σ_{s′∈S} [C(s, a, s′) + P(s′|s, a) V(s′)] ≤ V̂*(s) for s ∈ S′ \ G′, i.e., the assignment V(s) ← V̂*(s) is equivalent to at least one Bellman update on V(s), because V is a lower bound on V̂* and Theorem 1 applies. Because s ∉ G′ and by Definition 5, min_{a∈A} Σ_{s′∈S} [C(s, a, s′) + P(s′|s, a) V(s′)] ≤ V̂*(s) is equivalent to one Bellman update in the original SSP S.

Theorem 3. Given an SSP S = ⟨S, s0, G, A, P, C⟩ such that the reachability assumption holds, an admissible heuristic H and a short-sighted problem generator that respects Definition 5, SSiPP always terminates.

Proof. Since OPTIMAL-SSP-SOLVER always finishes and the short-sighted SSP is an SSP by definition, a goal state sG of the short-sighted SSP is always reached; therefore the loop in line 3 of Algorithm 1 always finishes. If sG ∈ G, then SSiPP terminates in this iteration. Otherwise, sG is an artificial goal and sG ≠ s (Definition 5), i.e., sG differs from the state s used as the initial state for the short-sighted SSP generation. Thus another iteration of SSiPP is performed using sG as s. Suppose, for the purpose of contradiction, that every goal state reached during the execution of SSiPP is an artificial goal, i.e., SSiPP does not terminate. Then infinitely many short-sighted SSPs are solved. Since S is finite, there exists s ∈ S that is updated infinitely often, therefore V(s) → ∞. However, V*(s) < ∞ by the reachability assumption.
Since SSiPP performs Bellman updates (Lemma 2), V(s) ≤ V*(s) by the monotonicity of Bellman updates (Theorem 1) and the admissibility of H, a contradiction. Thus every execution of SSiPP reaches a goal state s′G ∈ G and therefore terminates.

Corollary 4. Under the same assumptions as Theorem 3, the sequence ⟨V^0, V^1, · · ·, V^t⟩, where V^0 = H and V^t = SSiPP(S, t, V^{t−1}), converges to V* as t → ∞ for all s ∈ S(π*, s0).

Proof. Let S* ⊆ S be the set of states visited infinitely many times. Clearly, S(π*, s0) ⊆ S*, since a partial policy cannot be executed ad infinitum without reaching a state in which it is not defined. Since SSiPP performs Bellman updates in the original SSP space (Lemma 2) and every execution of SSiPP terminates (Theorem 3), we can view the sequence of lower bounds ⟨V^0, V^1, · · ·, V^t⟩ generated by SSiPP as asynchronous value iteration. The convergence of V^{t−1}(s) to V*(s) as t → ∞ for all s ∈ S(π*, s0) ⊆ S* follows from [2, Proposition 2.2, p. 27] and guarantees the convergence of SSiPP.

Figure 2: (a) Map of the triangle tireworld for sizes 1, 2 and 3. Circles (squares) represent locations in which there is one (no) spare tire. The shades of gray represent, for each location l, max_π P(car reaches l and the tire is not flat when following the policy π from s0).
(b) Log-lin plot of the state space size (|S|) and the number of states reachable from s0 when following the optimal policy π* (|S(π*, s0)|) versus the triangle tireworld problem number.

5 Experiments

We present two sets of experiments using the triangle tireworld problems [9, 11, 20], a series of probabilistically interesting problems [14] in which a car has to travel between locations in order to reach a goal location from its initial location. The roads are represented as a directed graph in the shape of a triangle and, every time the car moves between locations, a flat tire happens with probability 0.5. Some locations have a spare tire, and in these locations the car can deterministically replace its flat tire with a new one. When the car has a flat tire, it cannot change its location; therefore the car can get stuck in locations that do not have a spare tire (dead-ends). Figure 2(a) depicts the map of the triangle tireworld problems 1, 2 and 3, and Figure 2(b) shows the size of S and S(π*, s0) for problems up to size 60. For example, problem number 3 has 28 locations, i.e., 28 nodes in the corresponding graph in Figure 2(a); its state space has 19562 states and its optimal policy reaches 8190 states.

Every triangle tireworld problem is probabilistically interesting [14]: there is only one policy that reaches the goal with probability 1, and all the other policies have probability at most 0.5 of reaching the goal. Moreover, the solution based on the shortest path has probability 0.5^(2n−1) of reaching the goal, where n is the problem number.
This property is illustrated by the shades of gray in Figure 2(a), which represent, for each location l, max_π P(car reaches l and the tire is not flat when following the policy π from s0).

For the experiments in this section, we use the zero heuristic for all the planners, i.e., V(s) = 0 for all s ∈ S, and LRTDP [4] as OPTIMAL-SSP-SOLVER for SSiPP. For all planners, the parameter ε (for ε-convergence) is set to 10^−4. For UCT, we disabled the random rollouts because the probability of any policy other than the optimal policy reaching a dead-end is at least 0.5; therefore, with high probability, UCT would assign ∞ (the cost of a dead-end) as the cost of all the states, including the initial state.

The experiments are conducted on a Linux machine with 4 cores running at 3.07GHz, using MDPSIM [9] as the environment simulator. The following terminology is used for describing the experiments: a round is the computation of a solution for the given SSP, and a run is a set of rounds in which learning is allowed between rounds, i.e., the knowledge obtained from one round can be used to solve subsequent rounds. The solution computed during one round is simulated by MDPSIM in a client-server loop: MDPSIM sends a state s and requests an action from the planner, and the planner replies by sending the action a to be executed in s. The evaluation is done by the number of rounds simulated by MDPSIM that reached a goal state.
The maximum number of actions allowed per round is 2000, and rounds that exceed this limit are stopped by MDPSIM and declared failures, i.e., the goal is not reached.

Triangle Tireworld Problem Number

Planner          |    5    10    15    20    25    30    35    40    45    50    55    60
SSiPP depth=8    | 50.0  40.7  41.2  40.8  41.1  41.0  40.9  40.0  40.6  40.8  40.3  40.4
UCT              | 50.0  50.0  50.0  50.0  50.0  43.1  15.7  12.1   8.2   6.8   5.0   4.0
SSiPP trajectory | 50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0

Table 1: Number of rounds solved out of 50 for the experiment in Section 5.1. Results are averaged over 10 runs and the 95% confidence interval is always less than 1.0. In all the problems, SSiPP using trajectory-based short-sighted SSPs solves all 50 rounds in all 10 runs, therefore its 95% confidence interval is 0.0 for all the problems. Best results shown in bold font.

Triangle Tireworld Problem Number

Planner          |    5    10    15    20    25    30    35    40    45    50    55    60
SSiPP depth=8    | 50.0  45.4  41.2  42.3  41.2  44.1  42.4  32.7  20.6  14.1   9.9   7.0
LRTDP            | 50.0  23.0  14.1   0.3     -     -     -     -     -     -     -     -
UCT (4, 100)     | 50.0  50.0  50.0  48.8  24.0  12.3   6.5   4.0   2.5   1.3   1.0   0.7
UCT (8, 100)     | 50.0  50.0  50.0  46.3  24.0  12.3   6.7   3.7   2.2   1.2   1.0   0.6
UCT (2, 100)     | 50.0  50.0  50.0  49.5  23.2  12.0   7.5   3.5   2.2   1.2   1.0   0.6
SSiPP ρ = 1.0    | 50.0  27.9  29.1  26.8  26.0  26.6  28.6  27.2  26.6  27.6  26.2  26.9
SSiPP ρ = 0.50   | 50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0
SSiPP ρ = 0.25   | 50.0  50.0  50.0  50.0  47.6  45.0  41.1  42.7  41.9  40.7  40.1  40.4
SSiPP ρ = 0.125  | 50.0  50.0  50.0  50.0  50.0  50.0  50.0  50.0  49.8  37.4  26.4  18.9

Table 2: Number of rounds solved out of 50 for the experiment in Section 5.2. Results are averaged over 10 runs and the 95% confidence interval is always less than 2.6.
UCT (c, w) represents UCT using c as the bias parameter and w samples per decision. In all the problems, trajectory-based SSiPP with ρ = 0.5 solves all 50 rounds in all 10 runs, therefore its 95% confidence interval is 0.0 for all the problems. Best results shown in bold font.

5.1 Fixed number of search nodes per decision

In this experiment, we compare the performance of UCT, depth-based SSiPP, and trajectory-based SSiPP with respect to the number of nodes explored by depth-based SSiPP. Formally, to decide what action to apply in a given state s, each planner is allowed to use at most B = |Ss,t| search nodes, i.e., the size of the search space is bounded by the equivalent (s, t)-short-sighted SSP. We choose t = 8, since it obtains the best performance in the triangle tireworld problems [1]. Given the search-node budget B, for UCT we sample the environment until the search tree contains B nodes, and for trajectory-based SSiPP we use ρ = argmax_ρ {|Ss,ρ| s.t. B ≥ |Ss,ρ|}.

The methodology for this experiment is as follows: for each problem, 10 runs of 50 rounds are performed for each planner using the search-node budget B. The results, averaged over the 10 runs, are presented in Table 1. We set the time and memory cut-offs to 8 hours and 8 Gb, respectively, and UCT on problems 35 to 60 was the only planner preempted by the time cut-off. Trajectory-based SSiPP outperforms both depth-based SSiPP and UCT, solving all 50 rounds in all 10 runs for all the problems.

5.2 Fixed maximum planning time

In this experiment, we compare planners by limiting the maximum planning time. The methodology used in this experiment is similar to the one in IPPC'04 and IPPC'06: for each problem, planners need to solve 1 run of 50 rounds in 20 minutes.
For this experiment, the planners are allowed to perform internal simulations; for instance, a planner can spend 15 minutes solving rounds using internal simulations and then use the computed policy to solve the required 50 rounds through MDPSIM in the remaining 5 minutes. The memory cut-off is 3Gb.

For this experiment, we consider the following planners: depth-based SSiPP with t = 8 [1], trajectory-based SSiPP with ρ ∈ {1.0, 0.5, 0.25, 0.125}, LRTDP using 3-look-ahead [1], and 12 different parametrizations of UCT obtained by using the bias parameter c ∈ {1, 2, 4, 8} and the number of samples per decision w ∈ {10, 100, 1000}. The winners of IPPC'04, IPPC'06 and IPPC'08 are omitted since their performance on the triangle tireworld problems is strictly dominated by depth-based SSiPP with t = 8. Table 2 shows the results of this experiment; due to space limitations, we show only the top 3 parametrizations of UCT: 1st (c = 4, w = 100), 2nd (c = 8, w = 100), and 3rd (c = 2, w = 100).

All four parametrizations of trajectory-based SSiPP outperform the other planners for problems of size 45 or greater. Trajectory-based SSiPP using ρ = 0.5 is especially noteworthy because it achieves a perfect score in all problems, i.e., it reaches a goal state in all 50 rounds in all 10 runs for all the problems. The same happens for ρ = 0.125 on problems up to size 40. For larger problems, trajectory-based SSiPP using ρ = 0.125 reaches the 20-minute time cut-off before solving 50 rounds; however, all the solved rounds successfully reach the goal. This interesting behavior of trajectory-based SSiPP in the triangle tireworld can be explained by the following theorem:

Theorem 5. For the triangle tireworld, trajectory-based SSiPP using an admissible heuristic never falls into a dead-end for ρ ∈ (0.5^(i+1), 0.5^i] with i ∈ {1, 3, 5, . . . }.

Proof Sketch.
The optimal policy for the triangle tireworld is to follow the longest path: move from the initial location l0 to the goal location lG passing through location lc, where l0, lc and lG are the vertices of the triangle formed by the problem's map. The path from lc to lG is unique, i.e., there is only one applicable move-car action for all the locations in this path. Therefore, all the decision making needed to find the optimal policy happens between the locations l0 and lc. Each location l′ in the path from l0 to lc has either two or three applicable move-car actions, and we refer to the set of locations l′ with three applicable move-car actions as N. Every location l′ ∈ N is reachable from l0 by applying an even number of move-car actions (Figure 2(a)), and the three applicable move-car actions in l′ are: (i) the optimal action ac, i.e., move the car towards lc; (ii) the action aG that moves the car towards lG; and (iii) the action ap that moves the car parallel to the shortest path from l0 to lG. The location reached by ap does not have a spare tire, therefore ap is never selected by a greedy choice over any admissible heuristic since it reaches a dead-end with probability 0.5. The locations reached by applying either ac or aG have a spare tire, and the greedy choice between them depends on the admissible heuristic used, thus aG might be selected instead of ac. However, after applying aG, only one move-car action a is available and it reaches a location that does not have a spare tire. Therefore, the greedy choice between ac and aG considering two or more move-car actions is optimal under any admissible heuristic: every sequence of actions ⟨aG, a, . . .⟩ reaches a dead-end with probability at least 0.5, and at least one sequence of actions starting with ac has probability 0 of reaching a dead-end, e.g., the optimal solution.

Given ρ, we denote by Ls,ρ the set of all locations corresponding to states in Ss,ρ and by ls the location corresponding to the state s. Thus, Ls,ρ contains all the locations reachable from ls using up to m = ⌊log_0.5 ρ⌋ + 1 move-car actions. If m is even and ls ∈ N, then every location in Ls,ρ ∩ N represents a state either in Gs,ρ or at least two move-car actions away from any state in Gs,ρ. Therefore, the solution of the (s, ρ)-trajectory-based short-sighted SSP only chooses the action ac to move the car. Also, since m is even, every state s used by SSiPP for generating (s, ρ)-trajectory-based short-sighted SSPs has ls ∈ N. Therefore, for even values of m, i.e., for ρ ∈ (0.5^(i+1), 0.5^i] and i ∈ {1, 3, 5, . . .}, trajectory-based SSiPP always chooses the action ac to move the car to lc, thus avoiding all dead-ends.

6 Conclusion

In this paper, we introduced trajectory-based short-sighted SSPs, a new model to manage uncertainty in probabilistic planning problems. This approach consists of pruning the state space based on the most likely trajectory between states and defining artificial goal states that guide the solution towards the original goals. We also defined a class of short-sighted models that includes depth-based and trajectory-based short-sighted SSPs and proved that SSiPP always terminates and is asymptotically optimal for short-sighted models in this class.

We empirically compared trajectory-based SSiPP with depth-based SSiPP and other state-of-the-art planners in the triangle tireworld.
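The depth-parity condition in Theorem 5 can be checked numerically. The sketch below is only an illustration (not part of the paper's implementation); it uses the identity log_0.5 ρ = -log2 ρ, which is exact in floating point for the powers of two used in the experiments.

```python
import math

def lookahead_depth(rho):
    """Number of move-car actions covered by Ls,rho: m = floor(log_0.5 rho) + 1."""
    return math.floor(-math.log2(rho)) + 1  # log_0.5(rho) == -log2(rho)

def avoids_dead_ends(rho):
    """Theorem 5 condition: m is even, i.e., rho lies in (0.5^(i+1), 0.5^i]
    for some odd i."""
    return lookahead_depth(rho) % 2 == 0

# The four rho values used in the experiments of Section 5.2:
for rho in (1.0, 0.5, 0.25, 0.125):
    print(rho, lookahead_depth(rho), avoids_dead_ends(rho))
```

Only ρ = 0.5 (m = 2) and ρ = 0.125 (m = 4) satisfy the even-depth condition, matching the perfect goal-reaching rates observed for exactly these two parametrizations.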
Trajectory-based SSiPP outperforms all the other planners and is the only planner able to scale up to problem number 60, a problem in which the optimal solution contains approximately 10^70 states, under the IPPC evaluation methodology.

References

[1] F. W. Trevizan and M. M. Veloso. Short-sighted stochastic shortest path problems. In Proc. of the 22nd International Conference on Automated Planning and Scheduling (ICAPS), 2012.

[2] D. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[3] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.

[4] B. Bonet and H. Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proc. of the 13th International Conference on Automated Planning and Scheduling (ICAPS), 2003.

[5] H. B. McMahan, M. Likhachev, and G. J. Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In Proc. of the 22nd International Conference on Machine Learning (ICML), 2005.

[6] T. Smith and R. G. Simmons. Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In Proc. of the 21st National Conference on Artificial Intelligence (AAAI), 2006.

[7] S. Sanner, R. Goetschalckx, K. Driessens, and G. Shani. Bayesian real-time dynamic programming. In Proc. of the 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009.

[8] S. Yoon, A. Fern, and R. Givan. FF-Replan: A baseline for probabilistic planning. In Proc. of the 17th International Conference on Automated Planning and Scheduling (ICAPS), 2007.

[9] H. L. S. Younes, M. L. Littman, D. Weissman, and J. Asmuth. The first probabilistic track of the international planning competition. Journal of Artificial Intelligence Research, 24(1):851–887, 2005.

[10] F. Teichteil-Koenigsbuch, G. Infantes, and U. Kuter. RFF: A robust, FF-based MDP planning algorithm for generating policies with low probability of failure. In 3rd International Probabilistic Planning Competition (IPPC-ICAPS), 2008.

[11] D. Bryce and O. Buffet. 6th International Planning Competition: Uncertainty track. In 3rd International Probabilistic Planning Competition (IPPC-ICAPS), 2008.

[12] S. Yoon, A. Fern, R. Givan, and S. Kambhampati. Probabilistic planning via determinization in hindsight. In Proc. of the 23rd National Conference on Artificial Intelligence (AAAI), 2008.

[13] S. Yoon, W. Ruml, J. Benton, and M. B. Do. Improving determinization in hindsight for online probabilistic planning. In Proc. of the 20th International Conference on Automated Planning and Scheduling (ICAPS), 2010.

[14] I. Little and S. Thiébaux. Probabilistic planning vs replanning. In Proc. of ICAPS Workshop on IPC: Past, Present and Future, 2007.

[15] J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Menlo Park, California, 1985.

[16] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proc. of the European Conference on Machine Learning (ECML), 2006.

[17] T. Dean, L. P. Kaelbling, J. Kirman, and A. Nicholson. Planning under time constraints in stochastic domains. Artificial Intelligence, 76(1-2):35–74, 1995.

[18] D. P. Bertsekas and J. N. Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.

[19] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.

[20] B. Bonet and R. Givan. 2nd International Probabilistic Planning Competition (IPPC-ICAPS). http://www.ldc.usb.ve/~bonet/ipc5/ (accessed on Dec 13, 2011), 2007.