{"title": "Monte Carlo Value Iteration with Macro-Actions", "book": "Advances in Neural Information Processing Systems", "page_first": 1287, "page_last": 1295, "abstract": "POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions.", "full_text": "Monte Carlo Value Iteration with Macro-Actions\n\nZhan Wei Lim\n\nDavid Hsu\n\nWee Sun Lee\n\nDepartment of Computer Science, National University of Singapore\n\nSingapore, 117417, Singapore\n\nAbstract\n\nPOMDP planning faces two major computational challenges: large state spaces\nand long planning horizons. The recently introduced Monte Carlo Value Itera-\ntion (MCVI) can tackle POMDPs with very large discrete state spaces or contin-\nuous state spaces, but its performance degrades when faced with long planning\nhorizons. This paper presents Macro-MCVI, which extends MCVI by exploit-\ning macro-actions for temporal abstraction. We provide suf\ufb01cient conditions for\nMacro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI\ndoes not require explicit construction of probabilistic models for macro-actions\nand is thus easy to apply in practice. 
Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions.

1 Introduction

A partially observable Markov decision process (POMDP) provides a principled general framework for planning with imperfect state information. In POMDP planning, we represent an agent's possible states probabilistically as a belief and systematically reason over the space of all beliefs in order to derive a policy that is robust under uncertainty. POMDP planning, however, faces two major computational challenges. The first is the "curse of dimensionality": a complex planning task involves a large number of states, which results in a high-dimensional belief space. The second obstacle is the "curse of history": in applications such as robot motion planning, an agent often takes many actions before reaching the goal, resulting in a long planning horizon. The complexity of the planning task grows very fast with the horizon.

Point-based approximate algorithms [10, 14, 9] have brought dramatic progress to POMDP planning. Some of the fastest ones, such as HSVI [14] and SARSOP [9], can solve moderately complex POMDPs with hundreds of thousands of states in reasonable time. The recently introduced Monte Carlo Value Iteration (MCVI) [2] goes one step further: it can tackle POMDPs with very large discrete state spaces or continuous state spaces. The main idea of MCVI is to sample both an agent's state space and the corresponding belief space simultaneously, thus avoiding the prohibitive computational cost of unnecessarily processing these spaces in their entirety. It uses Monte Carlo sampling in conjunction with dynamic programming to compute a policy represented as a finite state controller.
Both theoretical analysis and experiments on several robotic motion planning tasks indicate that MCVI is a promising approach for planning under uncertainty with very large state spaces, and it has already been applied successfully to compute the threat resolution logic for aircraft collision avoidance systems in 3-D space [1].

However, the performance of MCVI degrades as the planning horizon increases. Temporal abstraction using macro-actions is effective in mitigating this effect and has achieved good results in earlier work on Markov decision processes (MDPs) and POMDPs (see Section 2). In this work, we show that macro-actions can be seamlessly integrated into MCVI, leading to the Macro-MCVI algorithm. Unfortunately, the theoretical properties of MCVI, such as the approximation error bounds [2], do not carry over to Macro-MCVI automatically if arbitrary mappings from beliefs to actions are allowed as macro-actions. We give sufficient conditions for the good theoretical properties to be retained, transforming POMDPs into a particular type of partially observable semi-Markov decision process (POSMDP) in which the lengths of macro-actions are not observable.

A major advantage of the new algorithm is its ability to abstract away the lengths of macro-actions in planning and thus reduce the effect of long planning horizons. Furthermore, it does not require explicit probabilistic models for macro-actions and treats them just like primitive actions in MCVI. This simplifies macro-action construction and is a major benefit in practice. Macro-MCVI can also be used to construct a hierarchy of macro-actions for planning in large spaces. Experiments show that the algorithm is effective with suitably designed macro-actions.

2 Related Work

Macro-actions have long been used to speed up planning and learning algorithms for MDPs (see, e.g., [6, 15, 3]).
Similarly, they have been used in offline policy computation for POMDPs [16, 8]. Macro-actions can be composed hierarchically to further improve scalability [4, 11]. These earlier works rely on vector representations for beliefs and value functions, making it difficult to scale up to large state spaces. Macro-actions have also been used in online search algorithms for POMDPs [7].

Macro-MCVI is related to Hansen and Zhou's work [5]. The earlier work uses finite state controllers for policy representation and policy iteration for policy computation, but it has not yet been shown to work on large state spaces. Expectation-maximization (EM) can be used to train finite state controllers [17] and potentially handle large state spaces, but it often gets stuck in local optima.

3 Planning with Macro-actions

We would like to generalize POMDPs to handle macro-actions. Ideally, the generalization should retain properties of POMDPs such as piecewise-linear and convex finite-horizon value functions. We would also like the approximation bounds for MCVI [2] to hold with macro-actions.

We would like our macro-actions to be as powerful as possible. The most powerful representation would allow a macro-action to be an arbitrary mapping from beliefs to actions that runs until some termination condition is met. Unfortunately, the value function of a process with such macro-actions need not even be continuous. Consider the following simple finite-horizon example, with horizon one. Assume that there are two primitive actions, both with constant rewards regardless of state. Consider two macro-actions: one always selects the poorer primitive action, while the other selects the better primitive action for some beliefs. Clearly, the second macro-action dominates the first over the entire belief space.
The reward for the second macro-action takes two possible values, depending on which primitive action is selected for the belief. This reward function also forms the optimal value function of the process, and it need not even be continuous, since the macro-action can be an arbitrary mapping from beliefs to actions.

Next, we give sufficient conditions for the process to retain piecewise linearity and convexity of the value function. We do this by constructing a type of partially observable semi-Markov decision process (POSMDP) with the desired property. The POSMDP does not need to have the length of the macro-action observed, a property that can be practically very useful as it allows the branching factor for search to be significantly smaller. Furthermore, the process is a strict generalization of a POMDP, as it reduces to a POMDP when all the macro-actions have length one.

3.1 Partially Observable Semi-Markov Decision Process

Finite-horizon (undiscounted) POSMDPs were studied in [18]. Here, we focus on a type of infinite-horizon discounted POSMDP whose transition intervals are not observable. Our POSMDP is formally defined as a tuple (S, A, O, T, R, γ), where S is a state space, A is a macro-action space, O is a macro-observation space, T is a joint transition and observation function, R is a reward function, and γ ∈ (0, 1) is a discount factor. If we apply a macro-action a with start state s_i, then T = p(s_j, o, k | s_i, a) encodes the joint conditional probability of the end state s_j, the macro-observation o, and the number of time steps k that it takes for a to reach s_j from s_i. We could decompose T into a state-transition function and an observation function, but avoid doing so here to remain general and to simplify the notation. The reward function R gives the discounted cumulative reward for a macro-action a that starts at state s:

R(s, a) = \sum_{t=0}^{\infty} \gamma^t E(r_t \mid s, a),

where E(r_t | s, a) is the expected reward at step t.
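Because R(s, a) is a discounted expectation over simulated runs of the macro-action, it can be estimated by plain Monte Carlo simulation, which is all that is needed later in the paper. The following is a minimal sketch under a made-up one-macro-action model; `run_macro`, `estimate_R`, and the termination probability are illustrative assumptions, not part of the paper:

```python
import random

# Sketch: estimate R(s, a) = sum_t gamma^t E(r_t | s, a) by simulation,
# with reward 0 after the macro-action terminates. The macro-action model
# below (unit cost per step, termination probability 0.5) is hypothetical.
GAMMA = 0.9

def run_macro(s, rng):
    """Hypothetical macro-action: per-step rewards until it terminates."""
    rewards = []
    while True:
        rewards.append(-1.0)          # unit cost for every step taken
        if rng.random() < 0.5:        # terminates with prob. 0.5 each step
            return rewards

def estimate_R(s, n, rng):
    """Average the discounted reward sums of n simulated macro-action runs."""
    total = 0.0
    for _ in range(n):
        total += sum(GAMMA ** t * r for t, r in enumerate(run_macro(s, rng)))
    return total / n

rng = random.Random(0)
est = estimate_R(0, 20000, rng)
# Closed form for this toy model: -sum_t (0.5 * gamma)^t = -1 / (1 - 0.45)
assert abs(est - (-1 / (1 - 0.5 * GAMMA))) < 0.1
```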
Here we assume that the reward is 0 once a macro-action terminates.

For convenience, we will work with reweighted beliefs instead of beliefs. Assuming that the number of states is n, a reweighted belief (like a belief) is a vector of n non-negative numbers that sums to one. By assuming that the POSMDP process will stop with probability 1 − γ at each time step, we can interpret the reweighted belief as the conditional probability of a state given that the process has not stopped. This gives an interpretation of the reweighted belief in terms of the discount factor. Given a reweighted belief b, we compute the next reweighted belief under macro-action a and observation o, b' = τ(b, a, o), as follows:

b'(s) = \frac{\sum_{k=1}^{\infty} \gamma^{k-1} \sum_{i=1}^{n} p(s, o, k \mid s_i, a)\, b(s_i)}{\sum_{j=1}^{n} \sum_{k=1}^{\infty} \gamma^{k-1} \sum_{i=1}^{n} p(s_j, o, k \mid s_i, a)\, b(s_i)}.   (1)

We will simply refer to the reweighted belief as a belief from here on. We denote the denominator \sum_{j=1}^{n} \sum_{k=1}^{\infty} \gamma^{k-1} \sum_{i=1}^{n} p(s_j, o, k \mid s_i, a)\, b(s_i) by p_γ(o | a, b). The value γ p_γ(o | a, b) can be interpreted as the probability that observation o is received and the POSMDP has not stopped. Note that \sum_o p_γ(o | a, b) may sum to less than 1 due to discounting.

A policy π is a mapping from a belief to a macro-action. Let R(b, a) = \sum_s b(s) R(s, a). The value of a policy π can be defined recursively as

V_π(b) = R(b, π(b)) + γ \sum_o p_γ(o | π(b), b) V_π(τ(b, π(b), o)).

Note that the policy operates on the belief and may not know the number of steps taken by the macro-actions.
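To make the reweighted-belief update in equation (1) concrete, here is a small numerical sketch for a hypothetical two-state POSMDP with one macro-action and one observation; the transition model `T` and the truncation of the sum over k are illustrative assumptions (the truncation is exact here because the toy macro-action always terminates within two steps):

```python
# Sketch of equation (1) on a toy model (not from the paper).
# T[k][(a, o)][i][j] = p(s_j, o, k | s_i, a); each row of T[1] + T[2] sums to 1.
GAMMA = 0.95
T = {
    1: {(0, 0): [[0.0, 0.5], [0.0, 0.5]]},  # terminate at k=1: go to state 1
    2: {(0, 0): [[0.5, 0.0], [0.0, 0.5]]},  # terminate at k=2: stay in place
}

def belief_update(b, a, o, gamma=GAMMA):
    """Return (b', p_gamma(o | a, b)) for macro-action a and observation o."""
    n = len(b)
    # numer[j] = sum_k gamma^(k-1) sum_i p(s_j, o, k | s_i, a) b(s_i)
    numer = [0.0] * n
    for k in T:
        for j in range(n):
            numer[j] += gamma ** (k - 1) * sum(
                T[k][(a, o)][i][j] * b[i] for i in range(n))
    p_gamma = sum(numer)                 # denominator of equation (1)
    return [x / p_gamma for x in numer], p_gamma

b_next, p = belief_update([0.6, 0.4], a=0, o=0)
assert abs(sum(b_next) - 1.0) < 1e-12   # reweighted beliefs are normalized
assert p < 1.0                          # discounting: sum over o can be < 1
```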
If knowledge of the number of steps is important, it can be added to the observation function in the modeling process.

We now define the backup operator H, which operates on a value function V_m and returns V_{m+1}:

HV(b) = \max_{a \in A} \Big( R(b, a) + \gamma \sum_{o \in O} p_\gamma(o \mid a, b)\, V(\tau(b, a, o)) \Big).   (2)

The backup operator is a contractive mapping.¹

Lemma 1 Given value functions U and V, ||HU − HV||_∞ ≤ γ ||U − V||_∞.

Let the value of an optimal policy π* be V*. The following theorem is a consequence of the Banach fixed point theorem and Lemma 1.

Theorem 1 V* is the unique fixed point of H and satisfies the Bellman equation V* = HV*.

We call a policy an m-step policy if the number of times a macro-action is applied is m. For m-step policies, V* can be approximated by a finite set of linear functions; the weight vectors of these linear functions are called the α-vectors.

Theorem 2 The value function of an m-step policy is piecewise linear and convex and can be represented as

V_m(b) = \max_{\alpha \in \Gamma_m} \sum_{s \in S} \alpha(s)\, b(s),   (3)

where Γ_m is a finite collection of α-vectors.

As V_m is convex and converges to V*, V* is also convex.

3.2 Macro-action Construction

We would like to construct macro-actions from the primitive actions of a POMDP in order to use temporal abstraction to help solve difficult POMDP problems.
A partially observable Markov decision process (POMDP) is defined by a finite state space S, a finite action space A, a state-transition function, a reward function R(s, a), an observation space O, an observation function, and a discount factor γ ∈ (0, 1).

In our POSMDP, the probability function p(s_j, o, k | s_i, a) for a macro-action must be independent of the history given the current state s_i; hence the selection of primitive actions and the termination condition within the macro-action cannot depend on the belief. We examine some allowable dependencies here. Due to partial observability, it is often not possible to allow the primitive actions and the termination condition to be functions of the initial state. Dependence on the portion of history that occurs after the macro-action has started is, however, allowed. In some POMDPs, a subset of the state variables is always observed and can be used to decide the next action. In fact, we may sometimes explicitly construct observed variables to remember relevant parts of the history prior to the start of the macro-action (see Section 5); these can be considered parameters that are passed to the macro-action. Hence, one way to construct the next action in a macro-action is to make it a function of the history since the macro-action started, x_k, a_k, o_{k+1}, ..., x_{t−1}, a_{t−1}, o_t, x_t, where x_i is the fully observable subset of state variables at time i and k is the starting time of the macro-action. Similarly, when the termination criterion and the observation function of the macro-action depend only on the history x_k, a_k, o_{k+1}, ..., x_{t−1}, a_{t−1}, o_t, x_t, the macro-action retains a transition function that is independent of the history given the initial state.

¹ Proofs of the results in this section are included in the supplementary material.
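As an illustration of such a construction, the sketch below implements a hypothetical macro-action whose primitive-action choice and termination condition depend only on the fully observable variable x_t observed since the macro-action started; the class, the "wall detection" rule, and the corridor example are invented for illustration and are not from the paper:

```python
# Sketch (hypothetical): a macro-action whose action selection and
# termination depend only on the history since the macro-action started,
# here on the fully observable position x_t.
class MoveUntilWall:
    """Repeat one primitive action until the observed position stops changing."""

    def __init__(self, primitive_action):
        self.primitive_action = primitive_action
        self.history = []        # (x_t, a_t) pairs since the macro started

    def next_action(self, x_t):
        """Choose a primitive action, or None to terminate the macro-action."""
        if self.history and self.history[-1][0] == x_t:
            return None          # position unchanged: we hit a wall
        self.history.append((x_t, self.primitive_action))
        return self.primitive_action

# Usage on a hypothetical corridor: observed positions 0, 1, 2, then 2 (wall).
macro = MoveUntilWall("east")
actions = [macro.next_action(x) for x in [0, 1, 2, 2]]
assert actions == ["east", "east", "east", None]
```

Because the rule consults only observations gathered after the macro-action started, the induced p(s_j, o, k | s_i, a) remains independent of the earlier history given s_i, as required above.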
Note that the observation to be passed to the POSMDP, which defines the POSMDP observation space O, is part of the design trade-off: usually it is desirable to reduce the number of observations in order to reduce complexity, without degrading the value of the POSMDP too much. In particular, we may not wish to include the execution length of the macro-action if it does not contribute much towards obtaining a good policy.

4 Monte Carlo Value Iteration with Macro-Actions

We have shown that if the action space A and the observation space O of a POSMDP are discrete, then the optimal value function V* can be approximated arbitrarily closely by a piecewise-linear, convex function. Unfortunately, when S is very high-dimensional (or continuous), a vector representation is no longer effective. In this section, we show how the Monte Carlo Value Iteration (MCVI) algorithm [2], which was designed for POMDPs with very large or infinite state spaces, can be extended to POSMDPs.

Instead of α-vectors, MCVI uses an alternative policy representation called a policy graph G. A policy graph is a directed graph with labeled nodes and edges. Each node of G is labeled with a macro-action a, and each edge of G is labeled with an observation o. To execute a policy π_G, the graph is treated as a finite state controller whose states are the nodes of G. Given an initial belief b, a starting node v of G is selected and its associated macro-action a_v is performed. The controller then transitions from v to a new node v' by following the edge (v, v') labeled with the observation received, o. The process then repeats with the new controller node v'.

Let π_{G,v} denote the policy represented by G when the controller always starts in node v of G. We define the value α_v(s) to be the expected total reward of executing π_{G,v} with initial state s.
Hence

V_G(b) = \max_{v \in G} \sum_{s \in S} \alpha_v(s)\, b(s).   (4)

V_G is completely determined by the α-functions associated with the nodes of G.

4.1 MC-Backup

One way to approximate the value function is to apply the backup operator H repeatedly, starting from an arbitrary value function, until it is close to convergence. This algorithm is called value iteration (VI). Value iteration can be carried out on policy graphs as well, since a policy graph provides an implicit representation of a value function. Let V_G be the value function of a policy graph G. Substituting (4) into (2), we get

HV_G(b) = \max_{a \in A} \Big\{ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o \in O} p_\gamma(o \mid a, b) \max_{v \in G} \sum_{s \in S} \alpha_v(s)\, b'(s) \Big\}.   (5)

The right-hand side of (5) can then be evaluated at a belief b via sampling and Monte Carlo simulation. The outcome is a new policy graph G' with value function Ĥ_b V_G. This is called the MC-backup of G at b (Algorithm 1) [2].

There are |A||G|^{|O|} possible ways to generate a new policy graph G' that has one new node added to the old policy graph. Algorithm 1 computes an estimate of the best new policy graph at b using only N|A||G| samples. Furthermore, we can show that MC-backup approximates the standard VI backup (equation (5)) well at b, with the error decreasing at the rate O(1/√N). Let R_max be the largest absolute value of the reward, |r_t|, at any time step.

Algorithm 1 MC-Backup of a policy graph G at a belief b ∈ ℬ with N samples.
MC-BACKUP(G, b, N)
1: For each action a ∈ A, R_a ← 0.
2: For each action a ∈ A, each observation o ∈ O, and each node v ∈ G, V_{a,o,v} ← 0.
3: for each action a ∈ A do
4:   for i = 1 to N do
5:     Sample a state s_i with probability b(s_i).
6:     Simulate taking macro-action a in state s_i. Generate a new state s'_i, observation o_i, and discounted reward R'(s_i, a) by sampling from p(s_j, o, k | s_i, a).
7:     R_a ← R_a + R'(s_i, a).
8:     for each node v ∈ G do
9:       Set V' to be the expected total reward of simulating the policy represented by G, with initial controller state v and initial state s'_i.
10:      V_{a,o_i,v} ← V_{a,o_i,v} + V'.
11:   for each observation o ∈ O do
12:     V_{a,o} ← max_{v∈G} V_{a,o,v}.
13:     v_{a,o} ← argmax_{v∈G} V_{a,o,v}.
14:   V_a ← (R_a + γ \sum_{o∈O} V_{a,o}) / N.
15: V* ← max_{a∈A} V_a.
16: a* ← argmax_{a∈A} V_a.
17: Create a new policy graph G' by adding a new node u to G. Label u with a*. For each o ∈ O, add the edge (u, v_{a*,o}) and label it with o.
18: return G'.

Theorem 3 Given a policy graph G and a point b ∈ ℬ, MC-BACKUP(G, b, N) produces an improved policy graph such that

|Ĥ_b V_G(b) − HV_G(b)| ≤ \frac{2 R_{max}}{1 - \gamma} \sqrt{\frac{2\big(|O| \ln |G| + \ln(2|A|) + \ln(1/\tau)\big)}{N}},

with probability at least 1 − τ.

The proof uses the Hoeffding bound together with the union bound. Details can be found in [2].

MC-backup can be combined with point-based POMDP planning, which samples the belief space ℬ. Point-based POMDP algorithms use a set B of points sampled from ℬ as an approximate representation of ℬ. In contrast to the standard VI backup operator H, which performs a backup at every point in ℬ, the operator Ĥ_B applies MC-BACKUP(G_m, b, N) to a policy graph G_m at every point in B. This results in |B| new policy graph nodes. Ĥ_B then produces a new policy graph G_{m+1} by adding the new policy graph nodes to the previous policy graph G_m.

Let δ_B = sup_{b ∈ ℬ} min_{b' ∈ B} ||b − b'||_1 be the maximum L1 distance from any point in the belief space ℬ to the closest sampled point in B.
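Algorithm 1 only needs a generative model that can be simulated. The sketch below runs MC-BACKUP on a toy two-state, tiger-like problem in which every "macro-action" happens to last one step (so the POSMDP reduces to a POMDP and discounting is simply γ per step); the model, rollout depth, and sample size are illustrative assumptions, not the paper's benchmarks:

```python
import random

# Sketch of MC-BACKUP (Algorithm 1) on a toy generative model (assumed, not
# from the paper). A policy graph is a list of (action, {obs: node}) tuples.
GAMMA = 0.95
STATES, ACTIONS, OBS = [0, 1], ["listen", "open0", "open1"], [0, 1]

def simulate(s, a, rng):
    """Generative model: returns (next state, observation, reward)."""
    if a == "listen":
        o = s if rng.random() < 0.85 else 1 - s   # noisy hint of the state
        return s, o, -1.0
    reward = 10.0 if a == f"open{1 - s}" else -100.0
    return rng.choice(STATES), rng.choice(OBS), reward  # problem resets

def rollout(graph, v, s, rng, depth=30):
    """Sampled total reward of executing the controller `graph` from node v."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        a, edges = graph[v]
        s, o, r = simulate(s, a, rng)
        total += discount * r
        discount *= GAMMA
        v = edges[o]
    return total

def mc_backup(graph, belief, n, rng):
    """One MC-backup of `graph` at `belief`; returns the extended graph."""
    best = None
    for a in ACTIONS:                                  # line 3 of Algorithm 1
        r_a = 0.0
        v_aov = {(o, v): 0.0 for o in OBS for v in range(len(graph))}
        for _ in range(n):                             # lines 4-10
            s = rng.choices(STATES, weights=belief)[0]
            s2, o, r = simulate(s, a, rng)
            r_a += r
            for v in range(len(graph)):
                v_aov[(o, v)] += rollout(graph, v, s2, rng)
        edges = {o: max(range(len(graph)), key=lambda v: v_aov[(o, v)])
                 for o in OBS}                         # lines 11-13
        value = (r_a + GAMMA * sum(v_aov[(o, edges[o])] for o in OBS)) / n
        if best is None or value > best[0]:            # lines 14-16
            best = (value, a, edges)
    return graph + [(best[1], best[2])]                # lines 17-18: new node

rng = random.Random(0)
graph = [("listen", {0: 0, 1: 0})]                     # one-node controller
graph = mc_backup(graph, [0.5, 0.5], n=200, rng=rng)
assert len(graph) == 2                                 # one node was added
```

Note that, exactly as in the text, only simulation is used: no transition or observation matrices are ever stored.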
Let V_0 be the value function of some initial policy graph and V_{m+1} = Ĥ_B V_m. The theorem below bounds the approximation error between V_m and the optimal value function V*.

Theorem 4 For every b ∈ B,

|V^*(b) − V_m(b)| ≤ \frac{2 R_{max}}{(1 - \gamma)^2} \sqrt{\frac{2\big(|O| \ln(|B|m) + \ln(2|A|) + \ln(|B|m/\tau)\big)}{N}} + \frac{2 R_{max}}{(1 - \gamma)^2}\, \delta_B + \frac{2 \gamma^m R_{max}}{1 - \gamma},

with probability at least 1 − τ.

The proof requires the contraction property and a Lipschitz property that can be derived from the piecewise linearity of the value function. Having established those results in Section 3.1, the rest of the proof follows from the proof in [2]. The first term in the bound in Theorem 4 comes from Theorem 3, showing that the error from sampling decays at the rate O(1/√N); it can be reduced by taking a large enough sample size. The second term depends on how well the set B covers the belief space and can be reduced by sampling a larger number of beliefs. The last term depends on the number of MC-backup iterations and decays exponentially with m.

Figure 1: (a) Underwater navigation: a reduced map with an 11 × 12 grid is shown, with "S" marking the possible initial positions, "D" the destinations, "R" the rocks, and "O" the locations where the robot can localize completely. (b) Collaborative search and capture: two robotic agents catching 12 escaped crocodiles in a 21 × 21 grid. (c) Vehicular ad-hoc networking: a UAV maintains an ad-hoc network over four ground vehicles in a 10 × 10 grid, with "B" marking the base and "D" the destinations.

4.2 Algorithm

Theorem 4 bounds the performance of the algorithm when given a set of beliefs.
Macro-MCVI, like MCVI, samples beliefs incrementally in practice and performs backups at the sampled beliefs. Branch and bound is used to avoid sampling unimportant parts of the belief space. See [2] for details.

The other important component of a practical algorithm is the generation of the next belief; Macro-MCVI uses a particle filter for that. Given the macro-action construction described in Section 3.2, a simple particle filter is easily implemented to approximate the next-belief function in equation (1): sample a set of states from the current belief; from each sampled state, simulate the current macro-action until termination, keeping track of its path length t; if the observation at termination matches the desired observation, keep the particle; the particles that are kept are weighted by γ^t and then renormalized to form the next belief.² Similarly, MC-backup is performed by simply running simulations of the macro-actions; there is no need to store additional transition and observation matrices, allowing the method to run on very large state spaces.

5 Experiments

We now illustrate the use of macro-actions for temporal abstraction in three POMDPs of varying complexity. Their state spaces range from relatively small to very large. Correspondingly, the macro-actions range from relatively simple ones to much more complex ones forming a hierarchy.

Underwater Navigation: The underwater navigation task was introduced in [9]. In this task, an autonomous underwater vehicle (AUV) navigates in an environment modeled as a 51 × 52 grid map. The AUV needs to move from the left border to the right border while avoiding the rocks scattered near its destination. The AUV has six actions: move north, move south, move east, move north-east, move south-east, or stay in the same location.

² More sophisticated approximations of the belief can be constructed but may require more knowledge of the underlying POMDP and more computation.
Due to poor visibility, the AUV can only localize itself along the top or bottom borders, where there are beacon signals.

This problem has several interesting characteristics. First, the relatively small state space of 2653 states means that solvers that use α-vectors, such as SARSOP [9], can be applied. Second, the dynamics of the robot are actually noiseless, so the main difficulty is localization from the robot's initially unknown location.

We use 5 macro-actions that move in a direction (north, south, east, north-east, or south-east) until either a beacon signal or the destination is reached. We also define an additional macro-action that navigates to the nearest goal location if the AUV position is known, or simply stays in the same location if the AUV position is not known. To enable proper behaviour of the last macro-action, we augment the state space with a fully observable state variable that indicates the current AUV location. The variable is initialized to a value denoting "unknown" but takes the value of the current AUV location after a beacon signal is received. This gives a simple example where the original state space is augmented with a fully observable state variable to allow more sophisticated macro-action behaviour.

Collaborative Search and Capture: In this problem, a group of crocodiles has escaped from its enclosure into the environment, and two robotic agents have to collaborate to hunt down and capture the crocodiles (see Figure 1). Both agents are centrally controlled, and at each time instance each agent can make a one-step move in one of the four directions (north, south, east, and west) or stay still. There are twelve crocodiles in the environment.
At every time instance, each crocodile moves to a location furthest from the agent nearest to it with probability 1 − p (p = 0.05 in the experiments). With probability p, the crocodile moves randomly. A crocodile is captured when it is at the same location as an agent. The agents do not know the exact locations of the crocodiles, but each agent knows the number of crocodiles in the top-left, top-right, bottom-left, and bottom-right quadrants around itself from the noise made by the crocodiles. Each captured crocodile gives a reward of 10, while movement is free.

We define twenty-five macro-actions in which each agent moves (north, south, east, west, or stay) along a passageway until one of the agents reaches an intersection. In addition, the macro-actions only return the observation made at the point when the macro-action terminates, reducing the complexity of the problem, possibly at the cost of some sub-optimality. In this problem, the macro-actions are simple, but the state space is extremely large (approximately 17914).

Vehicular Ad-hoc Network: In a post-disaster search-and-rescue scenario, a group of rescue vehicles is deployed for operation work in an area where the communication infrastructure has been destroyed. The rescue units need a high-bandwidth network to relay images of ground situations. An unmanned aerial vehicle (UAV) can be deployed to maintain WiFi network communication between the ground units. The UAV needs to visit each vehicle as often as possible to pick up and deliver data packets [13].

In this task, 4 rescue vehicles and 1 UAV navigate in a terrain modeled as a 10 × 10 grid map. There are obstacles on the terrain that are impassable to the ground vehicles but passable to the UAV. The UAV can move in one of the four directions (north, south, east, and west) or stay in the same location at every time step.
The vehicles set off from the same base and move along some predefined path towards their pre-assigned destinations, where they will start their operations, stopping randomly along the way. Upon reaching its destination, a vehicle may roam around the environment randomly while carrying out its mission. The UAV knows its own location on the map and can observe the location of a vehicle if they are in the same grid square. To elicit a policy with low network latency, there is, at each time step and for each vehicle, a penalty of −0.1 × the number of time steps since that vehicle was last visited. There is a reward of 10 each time a vehicle is visited by the UAV. The state space consists of the vehicles' locations, the UAV's location in the grid map, and the number of time steps since each vehicle was last seen (needed to compute the reward).

We abstract the movements of the UAV to search and visit a single vehicle as macro-actions. There are two kinds of search macro-actions for each vehicle: search for the vehicle along its predefined path, and search for the vehicle after it has started to roam randomly. To enable the macro-actions to work effectively, the state space is also augmented with the previously seen location of each vehicle. Each macro-action is in turn hierarchically constructed by solving the simplified POMDP task of searching for a single vehicle on the same map, using basic actions and some simple macro-actions that move along the paths. This problem has both complex, hierarchically constructed macro-actions and a very large state space.

5.1 Experimental setup

We applied Macro-MCVI to the above tasks and compared its performance with the original MCVI algorithm. We also compared with a state-of-the-art offline POMDP solver, SARSOP [9], on the underwater navigation task. SARSOP could not run on the other two tasks due to their large state space sizes. For each task, we ran Macro-MCVI until the average total reward stabilized.
We then ran the competing algorithms for at least the same amount of time. The exact running times are difficult to control precisely because of limitations of our implementation. To confirm the comparison results, we also ran the competing algorithms 100 times longer when possible. All experiments were conducted on a 16-core Intel Xeon 2.4 GHz server.

Neither MCVI nor SARSOP uses macro-actions. We are not aware of other efficient offline macro-action POMDP solvers that have been demonstrated on very large state space problems. Some online search algorithms, such as PUMA [7], use macro-actions and have shown strong results. Online search algorithms do not generate a policy, making a fair comparison difficult. Despite that, they are useful as baseline references; we implement a variant of PUMA as one such reference. In our experiments, we simply gave the online search algorithms as much or more time than Macro-MCVI and report the results here. PUMA uses open-loop macro-actions. As a baseline reference for online solvers with closed-loop macro-actions, we also created an online search variant of Macro-MCVI by removing the MC-backup component. We refer to this variant as Online-Macro.

Figure 2: Performance comparison.

Underwater Navigation           Reward              Time(s)
Macro-MCVI                      749.30 ± 0.28       1
MCVI                            678.05 ± 0.48       4
MCVI                            725.28 ± 0.38       100
SARSOP                          710.71 ± 4.52       1
SARSOP                          730.83 ± 0.75       100
PUMA                            697.47 ± 4.58       1
Online-Macro                    746.10 ± 2.37       1

Collaborative Search & Capture
Macro-MCVI                      17.04 ± 0.03        120
MCVI                            13.14 ± 0.04        120
MCVI                            16.38 ± 0.05        12000
PUMA                            1.04 ± 0.91         144
Online-Macro                    0                   3657

Vehicular Ad-hoc Network
Macro-MCVI                      −323.55 ± 3.79      29255
MCVI                            −1232.57 ± 2.24     29300
Greedy                          −422.26 ± 3.98      28800
It is similar to other recent online POMDP algorithms [12], but uses the same closed-loop macro-actions as MCVI does.

5.2 Results

The performance of the different algorithms is shown in Figure 2, with 95% confidence intervals.

The underwater navigation task consists of two phases: the localization phase and the navigate-to-goal phase. Macro-MCVI's policy takes one macro-action, "moving northeast until reaching the border", to localize, and another macro-action, "navigating to the goal", to reach the goal. In contrast, both MCVI and SARSOP fail to match the performance of Macro-MCVI even when they are run 100 times longer. Online-Macro does well, as the planning horizon is short with the use of macro-actions. PUMA, however, does not do as well, as it uses the less powerful open-loop macro-actions, which move in the same direction for a fixed number of time steps.

For the collaborative search & capture task, MCVI fails to match the performance of Macro-MCVI even when it is run 100 times longer. PUMA and Online-Macro do badly, as they fail to search deep enough and do not have the benefit of reusing sub-policies obtained from the backup operation. To confirm that it is the backup operation, and not the shorter per-macro-action time, that is responsible for the performance difference, we ran Online-Macro for a much longer time and found the result unchanged.

The vehicular ad-hoc network task was solved hierarchically in two stages. We first used Macro-MCVI to solve for the policy that finds a single vehicle. This stage took roughly 8 hours of computation time. We then used the single-vehicle policy as a macro-action and solved for the higher-level policy that plans over the macro-actions. Although it took substantial computation time, Macro-MCVI generated a reasonable policy in the end. In contrast, MCVI, without macro-actions, fails badly on this task.
Due to the long running time involved, we did not run MCVI 100 times longer. To confirm that the policy computed by Macro-MCVI at the higher level of the hierarchy is also effective, we manually crafted a greedy policy over the single-vehicle macro-actions. This greedy policy always searches for the vehicle that has not been visited for the longest duration. The experimental results indicate that the higher-level policy computed by Macro-MCVI is more effective than the greedy policy. We did not apply online algorithms to this task, as we are not aware of any simple way to hierarchically construct macro-actions online.

6 Conclusions

We have successfully extended MCVI, an algorithm for solving very large state space POMDPs, to include macro-actions. This allows MCVI to use temporal abstraction to help solve difficult POMDP problems. The method inherits the good theoretical properties of MCVI and is easy to apply in practice. Experiments show that it can substantially improve the performance of MCVI when used with appropriately chosen macro-actions.

Acknowledgement  We thank Tomás Lozano-Pérez and Leslie Kaelbling from MIT for many insightful discussions. This work is supported in part by MoE grant MOE2010-T2-2-071 and MDA GAMBIT grant R-252-000-398-490.

References

[1] H. Bai, D. Hsu, M. J. Kochenderfer, and W. S. Lee. Unmanned aircraft collision avoidance using continuous-state POMDPs. In Proc. Robotics: Science & Systems, 2011.

[2] H. Bai, D. Hsu, W. S. Lee, and V. Ngo. Monte Carlo value iteration for continuous-state POMDPs. In Algorithmic Foundations of Robotics IX: Proc. Int. Workshop on the Algorithmic Foundations of Robotics (WAFR), pages 175–191. Springer, 2011.

[3] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning.
Discrete Event Dynamic Systems, 13, 2003.

[4] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artificial Intelligence Research, 13:227–303, 2000.

[5] E. Hansen and R. Zhou. Synthesis of hierarchical finite-state controllers for POMDPs. In Proc. Int. Conf. on Automated Planning and Scheduling, 2003.

[6] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier. Hierarchical solution of Markov decision processes using macro-actions. In Proc. Conf. on Uncertainty in Artificial Intelligence, pages 220–229, 1998.

[7] R. He, E. Brunskill, and N. Roy. PUMA: Planning under uncertainty with macro-actions. In Proc. AAAI Conf. on Artificial Intelligence, 2010.

[8] H. Kurniawati, Y. Du, D. Hsu, and W. S. Lee. Motion planning under uncertainty for robotic tasks with long time horizons. Int. J. Robotics Research, 30(3):308–323, 2010.

[9] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proc. Robotics: Science & Systems, 2008.

[10] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proc. Int. Jnt. Conf. on Artificial Intelligence, volume 18, pages 1025–1032, 2003.

[11] J. Pineau, N. Roy, and S. Thrun. A hierarchical approach to POMDP planning and execution. In Workshop on Hierarchy & Memory in Reinforcement Learning (ICML), volume 156, 2001.

[12] S. Ross, J. Pineau, S. Paquet, and B. Chaib-Draa. Online planning algorithms for POMDPs. J. Artificial Intelligence Research, 32(1):663–704, 2008.

[13] A. Sivakumar and C. K. Y. Tan. UAV swarm coordination using cooperative control for establishing a wireless communications backbone. In Proc. Int. Conf. on Autonomous Agents & Multiagent Systems, pages 1157–1164, 2010.

[14] T. Smith and R. Simmons.
Heuristic search value iteration for POMDPs. In Proc. Conf. on Uncertainty in Artificial Intelligence, pages 520–527. AUAI Press, 2004.

[15] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

[16] G. Theocharous and L. P. Kaelbling. Approximate planning in POMDPs with macro-actions. In Advances in Neural Information Processing Systems, 17, 2003.

[17] M. Toussaint, L. Charlin, and P. Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Proc. Conf. on Uncertainty in Artificial Intelligence, 2008.

[18] C. C. White. Procedures for the solution of a finite-horizon, partially observed, semi-Markov optimization problem. Operations Research, 24(2):348–358, 1976.