{"title": "Approximate Planning in Large POMDPs via Reusable Trajectories", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1007, "abstract": null, "full_text": "Approximate Planning in Large POMDPs \n\nvia Reusable Trajectories \n\nMichael Kearns \n\nAT&T Labs \n\nmkearns@research.att.com \n\nYishay Mansour \nTel Aviv University \n\nmansour@math.tau.ac.il \n\nAndrewY. Ng \nUC Berkeley \n\nang@cs.berkeley.edu \n\nAbstract \n\nWe consider the problem of reliably choosing a near-best strategy from \na restricted class of strategies TI in a partially observable Markov deci(cid:173)\nsion process (POMDP). We assume we are given the ability to simulate \nthe POMDP, and study what might be called the sample complexity -\nthat is, the amount of data one must generate in the POMDP in order \nto choose a good strategy. We prove upper bounds on the sample com(cid:173)\nplexity showing that, even for infinitely large and arbitrarily complex \nPOMDPs, the amount of data needed can be finite, and depends only \nlinearly on the complexity of the restricted strategy class TI, and expo(cid:173)\nnentially on the horizon time. This latter dependence can be eased in a \nvariety of ways, including the application of gradient and local search \nalgorithms. Our measure of complexity generalizes the classical super(cid:173)\nvised learning notion of VC dimension to the settings of reinforcement \nlearning and planning. \n\n1 Introduction \n\nMuch recent attention has been focused on partially observable Markov decision processes \n(POMDPs) which have exponentially or even infinitely large state spaces. For such do(cid:173)\nmains, a number of interesting basic issues arise. As the state space becomes large, the \nclassical way of specifying a POMDP by tables of transition probabilities clearly becomes \ninfeasible. 
To intelligently discuss the problem of planning - that is, computing a good strategy¹ in a given POMDP - compact or implicit representations of both POMDPs, and of strategies in POMDPs, must be developed. Examples include factored next-state distributions [2, 3, 7], and strategies derived from function approximation schemes [8]. The trend towards such compact representations, as well as algorithms for planning and learning using them, is reminiscent of supervised learning, where researchers have long emphasized parametric models (such as decision trees and neural networks) that can capture only limited structure, but which enjoy a number of computational and information-theoretic benefits.

¹Throughout, we use the word strategy to mean any mapping from observable histories to actions, which generalizes the notion of policy in a fully observable MDP.

Motivated by these issues, we consider a setting where we are given a generative model, or simulator, for a POMDP, and wish to find a good strategy π from some restricted class of strategies Π. A generative model is a \"black box\" that allows us to generate experience (trajectories) from different states of our choosing. Generative models are an abstract notion of compact POMDP representations, in the sense that the compact representations typically considered (such as factored next-state distributions) already provide efficient generative models. Here we are imagining that the strategy class Π is given by some compact representation or by some natural limitation on strategies (such as bounded memory).
Thus, the view we are adopting is that even though the world (POMDP) may be extremely complex, we assume that we can at least simulate or sample experience in the world (via the generative model), and we try to use this experience to choose a strategy from some \"simple\" class Π.

We study the following question: How many calls to a generative model are needed to have enough data to choose a near-best strategy in the given class? This is analogous to the question of sample complexity in supervised learning - but harder. The added difficulty lies in the reuse of data. In supervised learning, every sample (x, f(x)) provides feedback about every hypothesis function h(x) (namely, how close h(x) is to f(x)). If h is restricted to lie in some hypothesis class H, this reuse permits sample complexity bounds that are far smaller than the size of H. For instance, only O(log(|H|)) samples are needed to choose a near-best model from a finite class H. If H is infinite, then sample sizes are obtained that depend only on some measure of the complexity of H (such as VC dimension [9]), but which have no dependence on the complexity of the target function or the size of the input domain.

In the POMDP setting, we would like analogous sample complexity bounds in terms of the \"complexity\" of the strategy class Π - bounds that have no dependence on the size or complexity of the POMDP. But unlike the supervised learning setting, experience \"reuse\" is not immediate in POMDPs. To see this, consider the \"straw man\" algorithm that, starting with some π ∈ Π, uses the generative model to generate many trajectories under π, and thus forms a Monte Carlo estimate of V^π(s_0). It is not clear that these trajectories under π are of much use in evaluating a different π' ∈ Π, since π and π' may quickly disagree on which actions to take.
The naive Monte Carlo method thus gives O(|Π|) bounds on the \"sample complexity,\" rather than O(log(|Π|)), for the finite case.

In this paper, we shall describe the trajectory tree method of generating \"reusable\" trajectories, which requires generating only a (relatively) small number of trajectories - a number that is independent of the state-space size of the POMDP, depends only linearly on a general measure of the complexity of the strategy class Π, and depends exponentially on the horizon time. This latter dependence can be eased via gradient algorithms such as Williams' REINFORCE [10] and Baird and Moore's more recent VAPS [1], and by local search techniques. Our measure of strategy class complexity generalizes the notion of VC dimension in supervised learning to the settings of reinforcement learning and planning, and we give bounds that recover for these settings the most powerful analogous results in supervised learning - bounds for arbitrary, infinite strategy classes that depend only on the dimension of the class rather than the size of the state space.

2 Preliminaries

We begin with some standard definitions. A Markov decision process (MDP) is a tuple (S, s_0, A, {P(·|s, a)}, R), where: S is a (possibly infinite) state set; s_0 ∈ S is a start state; A = {a_1, ..., a_k} are actions; P(·|s, a) gives the next-state distribution upon taking action a from state s; and the reward function R(s, a) gives the corresponding rewards. We assume for simplicity that rewards are deterministic, and further that they are bounded in absolute value by R_max. A partially observable Markov decision process (POMDP) consists of an underlying MDP and observation distributions Q(o|s) for each state s, where o is the random observation made at s.

We have adopted the common assumption of a fixed start state,² because once we limit the class of strategies we entertain, there may not be a single \"best\" strategy in the class - different start states may have different best strategies in Π. We also assume that we are given a POMDP M in the form of a generative model for M that, when given as input any state-action pair (s, a), will output a state s' drawn according to P(·|s, a), an observation o drawn according to Q(·|s), and the reward R(s, a). This gives us the ability to sample the POMDP M in a random-access way. This definition may initially seem unreasonably generous: the generative model is giving us a fully observable simulation of a partially observable process. However, the key point is that we must still find a strategy that performs well in the partially observable setting. As a concrete example, in designing an elevator control system, we may have access to a simulator that generates random rider arrival times, and keeps track of the waiting time of each rider, the number of riders waiting at every floor at every time of day, and so on. However helpful this information might be in designing the controller, this controller must only use information about which floors currently have had their call button pushed (the observables). In any case, readers uncomfortable with the power provided by our generative models are referred to Section 5, where we briefly describe results requiring only an extremely weak form of partially observable simulation.

At any time t, the agent will have seen some sequence of observations, o_0, ..., o_t, and will have chosen actions and received rewards for each of the t time steps prior to the current one. We write its observable history as h = ((o_0, a_0, r_0), ..., (o_{t-1}, a_{t-1}, r_{t-1}), (o_t, _, _)). Such observable histories, also called trajectories, are the inputs to strategies.
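The generative model described above can be sketched as a minimal simulator interface. This is an illustration only: the class name, method name, and the two-state dynamics below are invented for the sketch, not taken from the paper.

```python
import random

class GenerativeModel:
    """Black box for a POMDP M: given any state-action pair (s, a), return
    a next state s' drawn from P(.|s, a), an observation o, and the
    deterministic reward R(s, a).  The dynamics here are invented purely
    for illustration: two states, where action 1 tends to advance."""

    start_state = 0  # the fixed start state s_0

    def sample(self, s, a):
        # Action 1 reaches state 1 with probability 0.8; action 0 stays at 0.
        s_next = 1 if (a == 1 and random.random() < 0.8) else 0
        o = s_next                    # noiseless observation, for simplicity
        r = 1.0 if s == 1 else 0.0    # deterministic reward R(s, a)
        return s_next, o, r

model = GenerativeModel()
s_next, o, r = model.sample(model.start_state, 1)  # one random-access call
```

Note the asymmetry this interface makes concrete: the caller of sample sees states, but a strategy π ∈ Π will be fed only the observation stream, so the simulator is strictly more informative than what the final controller may use.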
More formally, a strategy π is any (stochastic) mapping from observable histories to actions. (For example, this includes approaches which use the observable history to track the belief state [5].) A strategy class Π is any set of strategies.

We will restrict our attention to the case of discounted return,³ and we let γ ∈ [0, 1) be the discount factor. We define the ε-horizon time to be H_ε = log_γ(ε(1 - γ)/2R_max). Note that returns beyond the first H_ε steps can contribute at most ε/2 to the total discounted return. Also, let V_max = R_max/(1 - γ) bound the value function. Finally, for a POMDP M and a strategy class Π, we define opt(M, Π) = sup_{π∈Π} V^π(s_0) to be the best expected return achievable from s_0 using Π.

Our problem is thus the following: Given a generative model for a POMDP M and a strategy class Π, how many calls to the generative model must we make, in order to have enough data to choose a π ∈ Π whose performance V^π(s_0) approaches opt(M, Π)? Also, which calls should we make to the generative model to achieve this?

²An equivalent definition is to assume a fixed distribution D over start states, since s_0 can be a \"dummy\" state whose next-state distribution under any action is D.

³The results in this paper can be extended without difficulty to the undiscounted finite-horizon setting [6].

3 The Trajectory Tree Method

We now describe how we can use a generative model to create \"reusable\" trajectories. For ease of exposition, we assume there are only two actions a_1 and a_2, but our results generalize easily to any finite number of actions. (See the full paper [6].)

A trajectory tree is a binary tree in which each node is labeled by a state and observation pair, and has a child for each of the two actions.
Additionally, each link to a child is labeled by a reward, and the tree's depth will be H_ε, so it will have about 2^{H_ε} nodes. (In Section 4, we will discuss settings where this exponential dependence on H_ε can be eased.) Each trajectory tree is built as follows: The root is labeled by s_0 and the observation there, o_0. Its two children are then created by calling the generative model on (s_0, a_1) and (s_0, a_2), which gives us the two next-states reached (say s'_1 and s'_2 respectively), the two observations made (say o'_1 and o'_2), and the two rewards received (r'_1 = R(s_0, a_1) and r'_2 = R(s_0, a_2)). Then (s'_1, o'_1) and (s'_2, o'_2) label the root's a_1-child and a_2-child, and the links to these children are labeled r'_1 and r'_2. Recursively, we generate two children and rewards this way for each node down to depth H_ε.

Now for any deterministic strategy π and any trajectory tree T, π defines a path through T: π starts at the root, and inductively, if π is at some internal node in T, then we feed to π the observable history along the path from the root to that node, and π selects and moves to a child of the current node. This continues until a leaf node is reached, and we define R(π, T) to be the discounted sum of rewards along the path taken. In the case that π is stochastic, π defines a distribution on paths in T, and R(π, T) is the expected return according to this distribution. (We will later also describe another method for treating stochastic strategies.) Hence, given m trajectory trees T_1, ..., T_m, a natural estimate for V^π(s_0) is V̂^π(s_0) = (1/m) Σ_{i=1}^{m} R(π, T_i). Note that each tree can be used to evaluate any strategy, much the way a single labeled example (x, f(x)) can be used to evaluate any hypothesis h(x) in supervised learning. Thus in this sense, trajectory trees are reusable.
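The construction and evaluation just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, dictionary tree representation, and toy model are our own; a generative-model interface sample(s, a) -> (s', o, r) is assumed; and histories are abbreviated to observation sequences.

```python
import math

def horizon(eps, gamma, r_max):
    """The eps-horizon time H_eps = log_gamma(eps * (1 - gamma) / (2 * r_max))."""
    return math.ceil(math.log(eps * (1 - gamma) / (2 * r_max), gamma))

def build_tree(model, s, o, depth):
    """Build a trajectory tree of the given depth: each node stores its
    observation label and, for each action a in {0, 1}, the reward on the
    link to its a-child, obtained by one generative-model call per action."""
    if depth == 0:
        return {"o": o, "children": None}
    children = {}
    for a in (0, 1):
        s_next, o_next, r = model.sample(s, a)
        children[a] = (r, build_tree(model, s_next, o_next, depth - 1))
    return {"o": o, "children": children}

def evaluate(pi, tree, gamma, history=()):
    """R(pi, T): the discounted sum of rewards along the path that the
    deterministic strategy pi (a map from observable histories to actions,
    with histories abbreviated here to observation tuples) traces through T."""
    if tree["children"] is None:
        return 0.0
    history = history + (tree["o"],)
    r, child = tree["children"][pi(history)]
    return r + gamma * evaluate(pi, child, gamma, history)

# Toy deterministic model (invented): the state counts time steps, the
# observation reveals the state, and the reward equals the action taken.
class ToyModel:
    def sample(self, s, a):
        return s + 1, s + 1, float(a)

# m reusable trees: every strategy in the class is scored on the same trees.
trees = [build_tree(ToyModel(), 0, 0, 3) for _ in range(5)]
pi = lambda h: 1  # a deterministic strategy: always choose action 1
v_hat = sum(evaluate(pi, t, 0.5) for t in trees) / len(trees)
```

The point of the last two lines is the reuse: the same list of trees estimates V̂^π(s_0) for any π we care to plug in, so the number of generative-model calls is paid once per tree, not once per strategy.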

Our goal now is to establish uniform convergence results that bound the error of the estimates V̂^π(s_0) as a function of the \"sample size\" (number of trees) m. Section 3.1 first treats the easier case of deterministic classes Π; Section 3.2 extends the result to stochastic classes.

3.1 The Case of Deterministic Π

Let us begin by stating a result for the special case of finite classes of deterministic strategies, which will serve to demonstrate the kind of bound we seek.

Theorem 3.1 Let Π be any finite class of deterministic strategies for an arbitrary two-action POMDP M. Let m trajectory trees be created using a generative model for M, and V̂^π(s_0) be the resulting estimates. If m = O((V_max/ε)² (log(|Π|) + log(1/δ))), then with probability 1 - δ, |V̂^π(s_0) - V^π(s_0)| ≤ ε holds simultaneously for all π ∈ Π.

Due to space limitations, detailed proofs of the results of this section are left to the full paper [6], but we will try to convey the intuition behind the ideas. Observe that for any fixed deterministic π, the estimates R(π, T_i) that are generated by the m different trajectory trees T_i are independent. Moreover, each R(π, T_i) is an unbiased estimate of the expected discounted H_ε-step return of π, which is in turn ε/2-close to V^π(s_0). These observations, combined with a simple Chernoff and union bound argument, are sufficient to establish Theorem 3.1. Rather than developing this argument here, we instead move straight on to the harder case of infinite Π.

When addressing sample complexity in supervised learning, perhaps the most important insight is that even though a class H may be infinite, the number of possible behaviors of H on a finite set of points is often not exhaustive. More precisely, for boolean functions, we say that the set x_1, ...
, x_d is shattered by H if every one of the 2^d possible labelings of these points is realized by some h ∈ H. The VC dimension of H is then defined as the size of the largest shattered set [9]. It is known that if the VC dimension of H is d, then the number