{"title": "VDCBPI: an Approximate Scalable Algorithm for Large POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1081, "page_last": 1088, "abstract": null, "full_text": " VDCBPI: an Approximate Scalable Algorithm\n for Large POMDPs\n\n\n\n Pascal Poupart Craig Boutilier\n Department of Computer Science Department of Computer Science\n University of Toronto University of Toronto\n Toronto, ON M5S 3H5 Toronto, ON M5S 3H5\n ppoupart@cs.toronto.edu cebly@cs.toronto.edu\n\n\n\n\n Abstract\n\n Existing algorithms for discrete partially observable Markov decision\n processes can at best solve problems of a few thousand states due to\n two important sources of intractability: the curse of dimensionality and\n the policy space complexity. This paper describes a new algorithm\n (VDCBPI) that mitigates both sources of intractability by combining the\n Value Directed Compression (VDC) technique [13] with Bounded Pol-\n icy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on\n synthetic network management problems with up to 33 million states.\n\n\n\n1 Introduction\n\nPartially observable Markov decision processes (POMDPs) provide a natural and expres-\nsive framework for decision making, but their use in practice has been limited by the lack\nof scalable solution algorithms. Two important sources of intractability plague discrete\nmodel-based POMDPs: high dimensionality of belief space, and the complexity of policy\nor value function (VF) space. Classic solution algorithms [4, 10, 7], for example, compute\nvalue functions represented by exponentially many value vectors, each of exponential size.\nAs a result, they can only solve POMDPs with on the order of 100 states. 
Consequently, much research has been devoted to mitigating these two sources of intractability.\n\nThe complexity of policy/VF space has been addressed by observing that there are often very good policies whose value functions are representable by a small number of vectors. Various algorithms such as approximate vector pruning [9], point-based value iteration (PBVI) [12, 16], bounded policy iteration (BPI) [14], gradient ascent (GA) [11, 1] and stochastic local search (SLS) [3] exploit this fact to produce (often near-optimal) policies of low complexity (i.e., few vectors), allowing larger POMDPs to be solved. Still, these scale to problems of only roughly 1000 states, since each value vector may still have exponential dimensionality. Conversely, it has been observed that belief states often carry more information than necessary. Hence, one can often reduce vector dimensionality by using compact representations such as decision trees (DTs) [2], algebraic decision diagrams (ADDs) [8, 9], or linear combinations of small basis functions (LCBFs) [6], or by indirectly compressing the belief space into a small subspace by a value-directed compression (VDC) [13] or exponential PCA [15]. Once compressed, classic solution methods can be used. However, since none of these approaches address the exponential complexity of policy/VF space, they can only solve slightly larger POMDPs (up to 8250 states [15]).\n\nScalable POMDP algorithms can only be realized when both sources of intractability are tackled simultaneously. While Hansen and Feng [9] implemented such an algorithm by combining approximate state abstraction with approximate vector pruning, they did not demonstrate the scalability of the approach on large problems. 
In this paper, we describe how to combine value-directed compression (VDC) with bounded policy iteration (BPI) and demonstrate the scalability of the resulting algorithm (VDCBPI) on synthetic network management problems of up to 33 million states. Among the techniques that deal with the curse of dimensionality, VDC offers the advantage that the compressed POMDP can be directly fed into existing POMDP algorithms with no (or only slight) adjustments. This is not the case for exponential-PCA, nor compact representations (DTs, ADDs, LCBFs). Among algorithms that mitigate policy space complexity, BPI distinguishes itself by its ability to avoid local optima (cf. GA), its efficiency (cf. SLS) and the fact that belief state monitoring is not required (cf. PBVI, approximate vector pruning). Beyond the combination of VDC with BPI, we offer two other contributions. We propose a new simple heuristic to compute good lossy value-directed compressions. We also augment BPI with the ability to bias its policy search to reachable belief states. As a result, BPI can often find a much smaller policy of similar quality for a given initial belief state.\n\n\n2 POMDP Background\n\n\nA POMDP is defined by: states S; actions A; observations Z; transition function T, where T(s, a, s') denotes Pr(s'|s, a); observation function Z, where Z(s, z) is the probability Pr(z|s, a) of observation z in state s after executing a; and reward function R, where R(s, a) is the immediate reward associated with s when executing a. We assume discrete state, action and observation sets and focus on discounted, infinite horizon POMDPs with discount factor γ, 0 ≤ γ < 1.\n\nPolicies and value functions for POMDPs are typically defined over belief space B, where a belief state b is a distribution over S capturing an agent's knowledge about the current state of the world. Belief state b can be updated in response to a specific action-observation pair a, z using Bayes rule. 
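The Bayes rule update just mentioned is easy to state concretely. Below is a minimal NumPy sketch of it; the array layout (T[a, s, s'] and Z[a, s', z]) and the function name are our own illustrative conventions, not notation from the paper.

```python
import numpy as np

def belief_update(b, a, z, T, Z):
    """Bayes-rule belief update for a discrete POMDP.

    b: belief over states, shape (|S|,)
    T: transition probabilities, T[a, s, s'] = Pr(s'|s, a)
    Z: observation probabilities, Z[a, s', z] = Pr(z|s', a)
    Returns the normalized posterior belief after doing a and seeing z.
    """
    # Unnormalized update: b'(s') = sum_s b(s) Pr(s'|s, a) Pr(z|s', a)
    unnorm = (b @ T[a]) * Z[a, :, z]
    return unnorm / unnorm.sum()  # normalize by Pr(z|b, a)
```

The unnormalized vector computed before the division is exactly the image of b under the (unnormalized) belief update mapping the text denotes T^{a,z}.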
We denote the (unnormalized) belief update mapping by T^{a,z}, where T^{a,z}_{ij} = Pr(s_j|a, s_i) Pr(z|s_j). A factored POMDP, with exponentially many states, thus gives rise to a belief space of exponential dimensionality.\n\nPolicies represented by finite state controllers (FSCs) are defined by a (possibly cyclic) directed graph ⟨N, E⟩, where nodes n ∈ N correspond to stochastic action choices and edges e ∈ E to stochastic transitions. An FSC can be viewed as a policy π = ⟨α, ω⟩, where action strategy α associates each node n with a distribution over actions α(n) = Pr(a|n), and observation strategy ω associates each node n and observation z with a distribution over successor nodes ω(n, z) = Pr(n'|n, z) (corresponding to the edge from n labeled with z). The value function V of FSC π is given by:\n\n V(n, s) = Σ_a Pr(a|n) [R(s, a) + γ Σ_{s'} Pr(s'|s, a) Σ_z Pr(z|s', a) Σ_{n'} Pr(n'|n, z) V(n', s')]   (1)\n\nThe value V(n, b) of each node n is thus linear w.r.t. the belief state; hence the value function of the controller is piecewise-linear and convex. The optimal value function V* often has a large (if not infinite) number of vectors, each corresponding to a different node. The optimal value function V* satisfies Bellman's equation:\n\n V*(b) = max_a [R(b, a) + γ Σ_z Pr(z|b, a) V*(b^{a,z})]   (2)\n\n max ε\n s.t. V(n, s) + ε ≤ Σ_a [Pr(a|n) R(s, a) + γ Σ_{s',z,n'} Pr(s'|s, a) Pr(z|s', a) Pr(a, n'|n, z) V(n', s')], ∀s\n      Σ_a Pr(a|n) = 1; Σ_{n'} Pr(a, n'|n, z) = Pr(a|n), ∀a, z\n      Pr(a|n) ≥ 0, ∀a; Pr(a, n'|n, z) ≥ 0, ∀a, n', z\n\n Table 1: LP to uniformly improve the value function of a node.\n\n max Σ_{s,n} o(s, n) ε_{s,n}\n s.t. V(n, s) + ε_{s,n} ≤ Σ_a [Pr(a|n) R(s, a) + γ Σ_{s',z,n'} Pr(s'|s, a) Pr(z|s', a) Pr(a, n'|n, z) V(n', s')], ∀s\n      Σ_a Pr(a|n) = 1; Σ_{n'} Pr(a, n'|n, z) = Pr(a|n), ∀a, z\n      Pr(a|n) ≥ 0, ∀a; Pr(a, n'|n, z) ≥ 0, ∀a, n', z\n\nTable 2: LP to improve the value function of a node in a non-uniform way according to the steady state occupancy o(s, n).\n\n\n3 Bounded Policy Iteration\n\nWe briefly review the bounded policy iteration (BPI) algorithm (see [14] for details) and describe a simple extension to bias its search toward reachable belief states. BPI incrementally constructs an FSC by alternating policy improvement and policy evaluation. Unlike policy iteration [7], this is done by slowly increasing the number of nodes (and value vectors). The policy improvement step greedily improves each node n by optimizing its action and observation strategies by solving the linear program (LP) in Table 1. This LP uniformly maximizes the improvement in the value function by optimizing n's distributions Pr(a, n'|n, z). The policy evaluation step computes the value function of the current controller by solving Eq. 1. The algorithm monotonically improves the policy until convergence to a local optimum, at which point new nodes are introduced to escape the local optimum. BPI is guaranteed to converge to a policy that is optimal at the \"tangent\" belief states while slowly growing the size of the controller [14].\n\nIn practice, we often wish to find a policy suitable for a given initial belief state. Since only a small subset of belief space is often reachable, it is generally possible to construct much smaller policies tailored to the reachable region. We now describe a simple way to bias BPI's efforts toward the reachable region. Recall that the LP in Table 1 optimizes the parameters of a node to uniformly improve its value at all belief states. 
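To make the policy improvement step concrete, here is a small sketch of the Table 1 LP using scipy.optimize.linprog. The variable layout, the function name and the dense matrix construction are our own illustrative choices; a real implementation would exploit sparsity and factored structure.

```python
import numpy as np
from scipy.optimize import linprog

def improve_node(n, V, T, Zo, R, gamma):
    """One BPI improvement step for node n (sketch of the Table 1 LP).

    V:  current FSC values, V[m, s]
    T:  T[a, s, s'] = Pr(s'|s, a);  Zo: Zo[a, s', z] = Pr(z|s', a)
    R:  R[s, a]
    Returns (eps, Pr(a|n)) maximizing the uniform improvement eps.
    """
    N, S = V.shape
    A = R.shape[1]
    Z = Zo.shape[2]
    # variable vector: [eps, c_a (A of them), d_{a,m,z} (A*N*Z of them)]
    nv = 1 + A + A * N * Z
    idx_c = lambda a: 1 + a
    idx_d = lambda a, m, z: 1 + A + (a * N + m) * Z + z

    # for each s:  eps - sum_a c_a R(s,a)
    #   - gamma * sum_{a,m,z} (sum_s' T[a,s,s'] Zo[a,s',z] V[m,s']) d_{a,m,z}
    #   <= -V[n,s]
    A_ub = np.zeros((S, nv))
    b_ub = -V[n]
    A_ub[:, 0] = 1.0
    for a in range(A):
        A_ub[:, idx_c(a)] = -R[:, a]
        for m in range(N):
            for z in range(Z):
                coef = T[a] @ (Zo[a, :, z] * V[m])   # shape (S,)
                A_ub[:, idx_d(a, m, z)] = -gamma * coef
    # equalities: sum_a c_a = 1;  sum_m d_{a,m,z} = c_a for all a, z
    A_eq = np.zeros((1 + A * Z, nv))
    b_eq = np.zeros(1 + A * Z)
    A_eq[0, 1:1 + A] = 1.0
    b_eq[0] = 1.0
    row = 1
    for a in range(A):
        for z in range(Z):
            for m in range(N):
                A_eq[row, idx_d(a, m, z)] = 1.0
            A_eq[row, idx_c(a)] = -1.0
            row += 1
    bounds = [(None, None)] + [(0, None)] * (nv - 1)  # eps may be negative
    c = np.zeros(nv)
    c[0] = -1.0                                       # maximize eps
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds)
    return -res.fun, res.x[1:1 + A]
```

Setting eps = 0 with the node's current strategies is always feasible, so the optimum is nonnegative whenever V is the exact value of the current controller.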
We propose a new LP (Table 2) that weighs the improvement by the (unnormalized) discounted occupancy distribution induced by the current policy. This accounts for the belief states reachable at each node by aggregating them together. The (unnormalized) discounted occupancy distribution is given by:\n\n o(s', n') = b_0(s', n') + γ Σ_{s,a,z,n} o(s, n) Pr(a|n) Pr(s'|s, a) Pr(z|s', a) Pr(n'|n, z), ∀s', n'\n\nThe LP in Table 2 is obtained by introducing variables ε_{s,n} for each s, replacing the objective by Σ_{s,n} o(s, n) ε_{s,n} and replacing ε in each constraint by the corresponding ε_{s,n}. When using the modified LP, BPI naturally tries to improve the policy at the reachable belief states before the others. Since the modification ensures that the value function doesn't decrease at any belief state, focusing the efforts on reachable belief states won't decrease policy value at other belief states. Furthermore, though the policy is initially biased toward reachable states, BPI will eventually improve the policy for all belief states.\n\n[Figure 1 diagram omitted: belief states b, b' and rewards r, r' are related by T and R (dotted arrows), while their compressed counterparts ~b, ~b' are related by ~T and ~R (solid arrows) through the compression function f.]\n\nFigure 1: Functional flow of a POMDP (dotted arrows) and a compressed POMDP (solid arrows).\n\n\n4 Value-Directed Compression\n\nWe briefly review the sufficient conditions for a lossless compression of POMDPs [13] and describe a simple new algorithm to obtain good lossy compressions. Belief states constitute a sufficient statistic summarizing all information available to the decision maker (i.e., past actions and observations). However, as long as enough information is available to evaluate the value of each policy, one can still choose the best policy. Since belief states often contain information irrelevant to the estimation of future rewards, one can often compress belief states into some lower-dimensional representation. Let f be a compression function that maps each belief state b into some lower dimensional compressed belief state ~b (see Figure 1). 
Here ~b can be viewed as a bottleneck that filters the information contained in b before it is used to estimate future rewards. We desire a compression f such that ~b corresponds to the smallest statistic sufficient for accurately predicting the current reward r as well as the next compressed belief state ~b' (since it captures all the information in b necessary to accurately predict subsequent rewards). Such a compression f exists if we can also find compressed transition dynamics ~T^{a,z} and a compressed reward function ~R such that:\n\n R = ~R ∘ f and f ∘ T^{a,z} = ~T^{a,z} ∘ f, ∀a ∈ A, z ∈ Z   (3)\n\nGiven an f, ~R and ~T^{a,z} satisfying Eq. 3, we can evaluate any policy using the compressed POMDP dynamics to obtain ~V. Since V = ~V ∘ f, the compressed POMDP is equivalent to the original.\n\nWhen restricting f to be linear (represented by matrix F), we can rewrite Eq. 3 as:\n\n R = F ~R and T^{a,z} F = F ~T^{a,z}, ∀a ∈ A, z ∈ Z   (4)\n\nThat is, the column space of F spans R and is invariant w.r.t. each T^{a,z}. Hence, the columns of the best linear lossless compression mapping F form a basis for the smallest invariant subspace (w.r.t. each T^{a,z}) that spans R, i.e., the Krylov subspace. We can find the columns of F by Krylov iteration: multiplying R by each T^{a,z} until the newly generated vectors are linear combinations of previous ones.1 The dimensionality of the compressed space is equal to the number of columns of F, which is necessarily smaller than or equal to the dimensionality of the original belief space. Once F is found, we can compute ~R and each ~T^{a,z} by solving the system in Eq. 4.\n\nSince linear lossless compressions are not always possible, we can extend the technique of [13] to find good lossy compressions with early stopping of the Krylov iteration. We retain only the vectors that are \"far\" from being linear combinations of prior vectors. For instance, if v is a linear combination of v1, v2, . . . 
, vn, then there are coefficients c1, c2, . . . , cn s.t. the error ||v - Σ_i c_i v_i||_2 is zero. Given a threshold δ or some upper bound k on the desired number of columns in F, we run Krylov iteration, retaining only the vectors with an error greater than δ, or the k vectors with largest error. When F is computed by approximate Krylov iteration, we cannot compute ~R and ~T^{a,z} by solving the linear system in Eq. 4: due to the lossy nature of the compression, the system is overconstrained. But we can find suitable ~R and ~T^{a,z} by computing a least squares approximation, solving:\n\n F^T R = F^T F ~R and F^T T^{a,z} F = F^T F ~T^{a,z}, ∀a ∈ A, z ∈ Z\n\nWhile compression is required when the dimensionality of belief space is too large, unfortunately, the columns of F have the same dimensionality. Factored POMDPs of exponential dimension can, however, admit practical Krylov iteration if carried out using a compact representation (e.g., DTs or ADDs) to efficiently compute F, ~R and each ~T^{a,z}.\n\n 1For numerical stability, one must orthogonalize each vector before multiplying by T^{a,z}.\n\n\n5 Bounded Policy Iteration with Value-Directed Compression\n\nIn principle, any POMDP algorithm can be used to solve the compressed POMDPs produced by VDC. If the compression is lossless and the POMDP algorithm exact, the computed policy will be optimal for the original POMDP. In practice, POMDP algorithms are usually approximate and lossless compressions are not always possible, so care must be taken to ensure numerical stability and a policy of high quality for the original POMDP. We now discuss some of the integration issues that arise when combining VDC with BPI.\n\nSince V = F ~V, maximizing the compressed value vector ~V of some node n automatically maximizes the value V of n w.r.t. the original POMDP when F is nonnegative; hence it is essential that F be nonnegative. 
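The truncated Krylov iteration and the least squares recovery of ~R and ~T^{a,z} described above can be sketched as follows. This is an assumption-laden sketch, not the authors' implementation: it works with dense matrices (a factored implementation would use DTs or ADDs), and the function and variable names are ours.

```python
import numpy as np

def vdc_compress(R, Ts, delta=1e-8, k_max=None):
    """Lossy value-directed compression by truncated Krylov iteration.

    R:  reward vectors as columns, shape (S, A) (or (S, 1))
    Ts: list of belief-update matrices T^{a,z}, each (S, S), applied
        to column vectors here for simplicity.
    A Krylov vector is kept only if its residual after projection on
    the span of the previously kept columns exceeds delta.
    """
    basis = []                              # orthonormal basis for the test
    cols = []                               # kept (original) columns of F
    queue = [R[:, i] for i in range(R.shape[1])]
    while queue and (k_max is None or len(cols) < k_max):
        v = queue.pop(0)
        r = v - sum((b @ v) * b for b in basis)   # Gram-Schmidt residual
        if np.linalg.norm(r) > delta:             # "far" from span: keep it
            basis.append(r / np.linalg.norm(r))
            cols.append(v)
            queue.extend(T @ v for T in Ts)       # expand Krylov frontier
    F = np.stack(cols, axis=1)
    # least squares dynamics:  F^T F ~T = F^T T F  and  F^T F ~R = F^T R
    Tt = [np.linalg.lstsq(F.T @ F, F.T @ T @ F, rcond=None)[0] for T in Ts]
    Rt = np.linalg.lstsq(F.T @ F, F.T @ R, rcond=None)[0]
    return F, Rt, Tt
```

When the kept columns happen to span an invariant subspace containing R, the compression is lossless and the recovered ~R, ~T^{a,z} satisfy Eq. 4 exactly.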
Otherwise (i.e., if F had negative entries), the optimal policy of the compressed POMDP may not be optimal for the original POMDP. Fortunately, when R is nonnegative, F is guaranteed to be nonnegative by the nature of Krylov iteration. If some rewards are negative, we can add a sufficiently large constant to R to make it nonnegative without changing the decision problem.\n\nSince most algorithms, including BPI, compute approximately optimal policies, it is also critical to normalize the columns of F. Suppose F has two columns f_1 and f_2 with L1-lengths 1 and 100, respectively. Since V = F ~V = ~v_1 f_1 + ~v_2 f_2, changes in ~v_2 have a much greater impact on V than changes in ~v_1. Such a difference in sensitivity may bias the search for a good policy to an undesirable region of the belief space, or may even cause the algorithm to return a policy that is far from optimal for the original POMDP despite the fact that it is ε-optimal for the compressed POMDP.\n\nWe note that it is \"safer\" to evaluate policies iteratively by successive approximation rather than solving the system in Eq. 1. By definition, the transition matrices T^{a,z} have eigenvalues with magnitude at most 1. In contrast, lossy compressed transition matrices ~T^{a,z} are not guaranteed to have this property. Hence, solving the system in Eq. 1 may not correspond to policy evaluation. It is thus safer to evaluate policies by successive approximation for lossy compressions.\n\nFinally, several algorithms including BPI compute witness belief states to verify the dominance of a value vector. Since the compressed belief space ~B is different from the original belief space B, this must be approached with care. B is a simplex corresponding to the convex hull of the state points. In contrast, since each row vector of F is the compressed version of some state point, ~B corresponds to the convex hull of the row vectors of F. When F is nonnegative, it is often possible to ignore this difference. 
For instance, when verifying the dominance of a value vector, if there is a compressed witness ~b, there is always an uncompressed witness b, but not vice-versa. This means that we can properly identify all dominating value vectors, but we may erroneously classify a dominated vector as dominating. In practice, this doesn't impact the correctness of algorithms such as policy iteration, bounded policy iteration, incremental pruning, the witness algorithm, etc., but it will slow them down since they won't be able to prune as many value vectors as possible.\n\n[Figure 2 surface plots omitted: expected rewards as a function of the number of BPI nodes and VDC basis functions for the cycle16, cycle19, cycle22, cycle25, 3legs16, 3legs19, 3legs22 and 3legs25 problems, plus a running-time surface for cycle25.]\n\nFigure 2: Experimental results for cycle and 3legs network configurations of 16, 19, 22 and 25 machines. 
The bottom right graph shows the running time of BPI on compressed versions of a cycle network of 25 machines.\n\n\n              3legs                          cycle\n              16     19     22     25       16     19     22     25\n VDCBPI      120.9  137.0  151.0  164.8    103.9  121.3  134.3  151.4\n heuristic   100.6  118.3  138.3  152.3    102.5  117.9  130.2  152.3\n doNothing    98.4  112.9  133.5  147.1     91.6  105.4  122.0  140.1\n\nTable 3: Comparison of the best policies achieved by VDCBPI to the doNothing and heuristic policies.\n\nThe above tips work well when VDC is integrated with BPI. We believe they are sufficient to ensure proper integration of VDC with other POMDP algorithms, though we haven't verified this empirically.\n\n\n6 Experiments\n\nWe report on experiments with VDCBPI on some synthetic network management problems similar to those introduced in [5]. A system administrator (SA) maintains a network of machines. Each machine has a 0.1 probability of failing at any stage, but this increases to 0.333 when a neighboring machine is down. The SA receives a reward of 1 per working machine and 2 per working server. At each stage, she can either reboot a machine, ping a machine or do nothing. She only observes the status of a machine (with 0.95 accuracy) if she reboots or pings it. Costs are 2.5 (rebooting), 0.1 (pinging), and 0 (doing nothing). An n-machine network induces a POMDP with 2^n states, 2n + 1 actions and 2 observations.\n\nWe experimented with networks of 16, 19, 22 and 25 machines organized in two configurations: cycle (a ring) and 3legs (a tree of 3 branches joined at the root). Figure 2 shows the average expected reward earned by policies computed by BPI after the POMDP has been compressed by VDC. Results are averaged over 500 runs of 60 steps, starting with a belief state where all machines are working.2 As expected, decision quality increases as we increase the number of nodes used in BPI and basis functions used in VDC. 
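For concreteness, the failure dynamics just described can be rendered as a tiny Monte-Carlo rollout of the cycle network under the doNothing policy. This is a simplified, assumption-laden sketch: servers are ignored (every machine is worth reward 1), the discount factor is our guess, and we assume the 0.333 failure probability applies whenever at least one neighbor is down. Its numbers are therefore not comparable to Table 3.

```python
import numpy as np

def simulate_cycle_donothing(n=16, steps=60, gamma=0.95, seed=0):
    """Discounted reward of an n-machine ring under doNothing (sketch)."""
    rng = np.random.default_rng(seed)
    up = np.ones(n, dtype=bool)          # all machines start working
    total = 0.0
    for t in range(steps):
        total += gamma ** t * up.sum()   # reward 1 per working machine
        left, right = np.roll(up, 1), np.roll(up, -1)
        # failure prob 0.1, rising to 0.333 if a neighbor is down
        p_fail = np.where(left & right, 0.1, 0.333)
        up = up & (rng.random(n) >= p_fail)  # no reboots: failures persist
    return total
```

Averaging such rollouts over many seeds is exactly how the policy values in Figure 2 and Table 3 are estimated, with the simulator's action fixed to "do nothing".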
Also interesting are some of the jumps in the reward surface of some graphs, suggesting phase transitions w.r.t. the dimensionality of the compression. The bottom right graph in Fig. 2 shows the time taken by BPI on a cycle network of 25 machines (other problems exhibit similar behavior). VDC takes from 4902s to 12408s (depending on size and configuration) to compress POMDPs to 250 dimensions.3\n\nIn Table 3 we compare the value of the best policy with fewer than 120 nodes found by VDCBPI to two other simple policies. The doNothing policy lets the network evolve without any rebooting or pinging. The heuristic policy estimates at each stage the probability of failure4 of each machine and reboots the machine most likely to be down if its failure probability is greater than threshold p1, or pings it if greater than threshold p2. Settings of p1 = 0.8 and p2 = 0.15 were used.5 This heuristic policy performs very well and therefore offers a strong competitor to VDCBPI. But it is possible to do better than the heuristic policy by optimizing the choice of the machine that the SA may reboot or ping. Since a machine is more likely to fail when neighboring machines are down, it is sometimes better to choose (for reboot) a machine surrounded by working machines. However, since the SA doesn't know exactly which machines are up or down due to partial observability, such a tradeoff is difficult to evaluate and sometimes not worthwhile. With a sufficient number of nodes and basis functions, VDCBPI outperforms the heuristic policy on the 3legs networks and matches it on the cycle networks. This is quite remarkable given the fact that belief states were compressed to 250 dimensions or less, compared to the original dimensionality ranging from 65,536 to 33,554,432.\n\n\n7 Conclusion\n\nWe have described a new POMDP algorithm that mitigates both high belief space dimensionality and policy/VF complexity. 
By integrating value-directed compression with bounded policy iteration, we can solve synthetic network management POMDPs of 33 million states (3 orders of magnitude larger than previously solved discrete POMDPs). Note that the scalability of VDCBPI is problem-dependent; however, we hope that new, scalable, approximate POMDP algorithms such as VDCBPI will allow POMDPs to be used to model real-world problems, with the expectation that they can be solved effectively. We also described several improvements to the existing VDC and BPI algorithms.\n\nAlthough VDC offers the advantage that any existing solution algorithm can be used to solve compressed POMDPs, it would be interesting to combine BPI or PBVI with a factored representation such as DTs or ADDs, allowing one to directly solve large scale POMDPs without recourse to an initial compression. Beyond policy space complexity and high dimensional belief spaces, further research will be necessary to deal with exponentially large action and observation spaces.\n\n 2The ruggedness of the graphs is mainly due to the variance in the reward samples.\n 3Reported running times are the cputime measured on 3GHz linux machines.\n 4Due to the large state space, approximate monitoring was performed by factoring the joint.\n 5These values were determined through enumeration of all threshold combinations in increments of 0.05, choosing the best for 25-machine problems.\n\n\nReferences\n\n [1] D. Aberdeen and J. Baxter. Scaling internal-state policy-gradient methods for POMDPs. Proc. of the Nineteenth Intl. Conf. on Machine Learning, pp. 3-10, Sydney, Australia, 2002.\n [2] C. Boutilier and D. Poole. Computing optimal policies for partially observable decision processes using compact representations. Proc. AAAI-96, pp. 1168-1175, Portland, OR, 1996.\n [3] D. Braziunas and C. Boutilier. Stochastic local search for POMDP controllers. Proc. AAAI-04, to appear, San Jose, CA, 2004.\n [4] A. R. Cassandra, M. 
L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for POMDPs. Proc. UAI-97, pp. 54-61, Providence, RI, 1997.\n [5] C. Guestrin, D. Koller, and R. Parr. Max-norm projections for factored MDPs. Proc. IJCAI-01, pp. 673-680, Seattle, WA, 2001.\n [6] C. Guestrin, D. Koller, and R. Parr. Solving factored POMDPs with linear value functions. IJCAI-01 Wkshp. on Planning under Uncertainty and Incomplete Information, Seattle, WA, 2001.\n [7] E. A. Hansen. Solving POMDPs by searching in policy space. Proc. UAI-98, pp. 211-219, Madison, WI, 1998.\n [8] E. A. Hansen and Z. Feng. Dynamic programming for POMDPs using a factored state representation. Proc. AIPS-2000, pp. 130-139, Breckenridge, CO, 2000.\n [9] E. A. Hansen and Z. Feng. Approximate planning for factored POMDPs. Proc. ECP-2001, Toledo, Spain, 2001.\n[10] L. P. Kaelbling, M. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artif. Intel., 101:99-134, 1998.\n[11] N. Meuleau, L. Peshkin, K. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. Proc. UAI-99, pp. 427-436, Stockholm, Sweden, 1999.\n[12] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: an anytime algorithm for POMDPs. Proc. IJCAI-03, Acapulco, Mexico, 2003.\n[13] P. Poupart and C. Boutilier. Value-directed compressions of POMDPs. Advances in Neural Information Processing Systems, pp. 1547-1554, Vancouver, Canada, 2002.\n[14] P. Poupart and C. Boutilier. Bounded finite state controllers. Advances in Neural Information Processing Systems, Vancouver, Canada, 2003.\n[15] N. Roy and G. Gordon. Exponential family PCA for belief compression in POMDPs. Advances in Neural Information Processing Systems, pp. 1635-1642, Vancouver, Canada, 2002.\n[16] M. T. J. Spaan and N. Vlassis. A point-based POMDP algorithm for robot planning. IEEE Intl. Conf. 
on Robotics and Automation, to appear, New Orleans, 2004.\n", "award": [], "sourceid": 2704, "authors": [{"given_name": "Pascal", "family_name": "Poupart", "institution": null}, {"given_name": "Craig", "family_name": "Boutilier", "institution": null}]}