VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

Poupart, Pascal; Boutilier, Craig

Advances in Neural Information Processing Systems 17 (NIPS 2004)

Abstract

Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

1 Introduction

Partially observable Markov decision processes (POMDPs) provide a natural and expressive framework for decision making, but their use in practice has been limited by the lack of scalable solution algorithms. Two important sources of intractability plague discrete model-based POMDPs: high dimensionality of belief space, and the complexity of policy or value function (VF) space. Classic solution algorithms [4, 10, 7], for example, compute value functions represented by exponentially many value vectors, each of exponential size. As a result, they can only solve POMDPs with on the order of 100 states. Consequently, much research has been devoted to mitigating these two sources of intractability.

The complexity of policy/VF space has been addressed by observing that there are often very good policies whose value functions are representable by a small number of vectors. Various algorithms such as approximate vector pruning [9], point-based value iteration (PBVI) [12, 16], bounded policy iteration (BPI) [14], gradient ascent (GA) [11, 1] and stochastic local search (SLS) [3] exploit this fact to produce (often near-optimal) policies of low complexity (i.e., few vectors), allowing larger POMDPs to be solved. Still, these scale to problems of only roughly 1000 states, since each value vector may still have exponential dimensionality. Conversely, it has been observed that belief states often carry more information than necessary. Hence, one can often reduce vector dimensionality by using compact representations such as decision trees (DTs) [2], algebraic decision diagrams (ADDs) [8, 9], or linear combinations of small basis functions (LCBFs) [6], or by indirectly compressing the belief space into a small subspace by a value-directed compression (VDC) [13] or exponential PCA [15]. Once compressed, classic solution methods can be used. However, since none of these approaches address the exponential complexity of policy/VF space, they can only solve slightly larger POMDPs (up to 8250 states [15]).

Scalable POMDP algorithms can only be realized when both sources of intractability are tackled simultaneously. While Hansen and Feng [9] implemented such an algorithm by combining approximate state abstraction with approximate vector pruning, they didn't demonstrate the scalability of the approach on large problems. In this paper, we describe how to combine value directed compression (VDC) with bounded policy iteration (BPI) and demonstrate the scalability of the resulting algorithm (VDCBPI) on synthetic network management problems of up to 33 million states. Among the techniques that deal with the curse of dimensionality, VDC offers the advantage that the compressed POMDP can be directly fed into existing POMDP algorithms with no (or only slight) adjustments. This is not the case for exponential-PCA, nor compact representations (DTs, ADDs, LCBFs). Among algorithms that mitigate policy space complexity, BPI distinguishes itself by its ability to avoid local optima (cf. GA), its efficiency (cf. SLS) and the fact that belief state monitoring is not required (cf. PBVI, approximate vector pruning). Beyond the combination of VDC with BPI, we offer two other contributions. We propose a new simple heuristic to compute good lossy value directed compressions. We also augment BPI with the ability to bias its policy search to reachable belief states. As a result, BPI can often find a much smaller policy of similar quality for a given initial belief state.

2 POMDP Background

A POMDP is defined by: states $S$; actions $A$; observations $Z$; transition function $T$, where $T(s, a, s')$ denotes $\Pr(s' \mid s, a)$; observation function $Z$, where $Z(s, a, z)$ is the probability $\Pr(z \mid s, a)$ of observation $z$ in state $s$ after executing $a$; and reward function $R$, where $R(s, a)$ is the immediate reward associated with $s$ when executing $a$. We assume discrete state, action and observation sets and focus on discounted, infinite horizon POMDPs with discount factor $0 \le \gamma < 1$.

Policies and value functions for POMDPs are typically defined over belief space $B$, where a belief state $b$ is a distribution over $S$ capturing an agent's knowledge about the current state of the world. Belief state $b$ can be updated in response to a specific action-observation pair $\langle a, z \rangle$ using Bayes rule. We denote the (unnormalized) belief update mapping by $T^{a,z}$, where $T^{a,z}_{ij} = \Pr(s_j \mid a, s_i)\,\Pr(z \mid s_j)$. A factored POMDP, with exponentially many states, thus gives rise to a belief space of exponential dimensionality.
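As a minimal sketch of the belief update described above, assuming dense NumPy arrays and a made-up two-state model, the unnormalized mapping and its normalization might look as follows. The names `belief_update_matrix`, `update_belief`, `P_a` and `O_az` are illustrative and do not come from the paper.

```python
import numpy as np

def belief_update_matrix(P_a, O_az):
    """Build T^{a,z} with entries Pr(s_j | a, s_i) * Pr(z | s_j).

    P_a  : (|S|, |S|) array, P_a[i, j] = Pr(s_j | a, s_i)  (assumed layout)
    O_az : (|S|,) array, O_az[j] = Pr(z | s_j) after executing a
    """
    return P_a * O_az[np.newaxis, :]

def update_belief(b, T_az):
    """Return the normalized successor belief and Pr(z | b, a)."""
    unnormalized = b @ T_az        # row vector times T^{a,z}
    prob_z = unnormalized.sum()    # normalizing constant, i.e. Pr(z | b, a)
    if prob_z <= 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / prob_z, prob_z

# Toy two-state example (numbers are made up for illustration).
P_a = np.array([[0.9, 0.1],
                [0.2, 0.8]])
O_az = np.array([0.7, 0.1])
b0 = np.array([0.5, 0.5])

T_az = belief_update_matrix(P_a, O_az)
b1, prob_z = update_belief(b0, T_az)
print(b1, prob_z)
```

Keeping the mapping unnormalized, as in the text, keeps the update linear in $b$; the normalization is applied only when an actual distribution is needed.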

Policies represented by finite state controllers (FSCs) are defined by a (possibly cyclic) directed graph $\pi = \langle N, E \rangle$, where nodes $n \in N$ correspond to stochastic action choices and edges $e \in E$ to stochastic transitions. An FSC can be viewed as a policy $\pi = \langle \alpha, \beta \rangle$, where action strategy $\alpha$ associates each node $n$ with a distribution over actions $\alpha(n) = \Pr(a \mid n)$, and observation strategy $\beta$ associates each node $n$ and observation $z$ with a distribution over successor nodes $\beta(n, z) = \Pr(n' \mid n, z)$ (corresponding to the edge from $n$ labeled with $z$). The value function $V^\pi$ of FSC $\pi$ is given by:

$$V^\pi(n, s) = \sum_a \Pr(a \mid n) \Big[ R(s, a) + \gamma \sum_{s', z} \Pr(s' \mid s, a)\, \Pr(z \mid s', a) \sum_{n'} \Pr(n' \mid n, z)\, V^\pi(n', s') \Big] \qquad (1)$$

The value $V^\pi(n, b)$ of each node $n$ is thus linear w.r.t. the belief state; hence the value function of the controller is piecewise-linear and convex. The optimal value function $V^*$ often has a large (if not infinite) number of vectors, each corresponding to a different node. The optimal value function $V^*$ satisfies Bellman's equation:

$$V^*(b) = \max_a \Big[ R(b, a) + \gamma \sum_z \Pr(z \mid b, a)\, V^*(b^a_z) \Big] \qquad (2)$$



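$$\begin{aligned}
\max\ & \epsilon \\
\text{s.t.}\ & V(n, s) + \epsilon \le \sum_a \Big[ \Pr(a \mid n) R(s, a) + \gamma \sum_{s', z, n'} \Pr(s' \mid s, a)\, \Pr(z \mid s', a)\, \Pr(a, n' \mid n, z)\, V(n', s') \Big], \quad \forall s \\
& \sum_a \Pr(a \mid n) = 1; \qquad \sum_{n'} \Pr(a, n' \mid n, z) = \Pr(a \mid n), \quad \forall a, z \\
& \Pr(a \mid n) \ge 0, \quad \forall a; \qquad \Pr(a, n' \mid n, z) \ge 0, \quad \forall a, n', z
\end{aligned}$$

Table 1: LP to uniformly improve the value function of a node.

$$\begin{aligned}
\max\ & \sum_{s, n} o(s, n)\, \epsilon_{s,n} \\
\text{s.t.}\ & V(n, s) + \epsilon_{s,n} \le \sum_a \Big[ \Pr(a \mid n) R(s, a) + \gamma \sum_{s', z, n'} \Pr(s' \mid s, a)\, \Pr(z \mid s', a)\, \Pr(a, n' \mid n, z)\, V(n', s') \Big], \quad \forall s \\
& \sum_a \Pr(a \mid n) = 1; \qquad \sum_{n'} \Pr(a, n' \mid n, z) = \Pr(a \mid n), \quad \forall a, z \\
& \Pr(a \mid n) \ge 0, \quad \forall a; \qquad \Pr(a, n' \mid n, z) \ge 0, \quad \forall a, n', z
\end{aligned}$$

Table 2: LP to improve the value function of a node in a non-uniform way according to the steady state occupancy $o(s, n)$.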

3 Bounded Policy Iteration

We briefly review the bounded policy iteration (BPI) algorithm (see [14] for details) and describe a simple extension to bias its search toward reachable belief states. BPI incrementally constructs an FSC by alternating policy improvement and policy evaluation. Unlike policy iteration [7], this is done by slowly increasing the number of nodes (and value vectors). The policy improvement step greedily improves each node $n$ by optimizing its action and observation strategies by solving the linear program (LP) in Table 1. This LP uniformly maximizes the improvement $\epsilon$ in the value function by optimizing $n$'s distributions $\Pr(a, n' \mid n, z)$. The policy evaluation step computes the value function of the current controller by solving Eq. 1. The algorithm monotonically improves the policy until convergence to a local optimum, at which point new nodes are introduced to escape the local optimum. BPI is guaranteed to converge to a policy that is optimal at the "tangent" belief states while slowly growing the size of the controller [14].
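The policy evaluation step can be sketched as a single linear solve of Eq. 1 over all (node, state) pairs. The following is only an illustration under assumed array layouts; `evaluate_fsc` and the toy two-node controller are hypothetical and not taken from the paper or from [14].

```python
import numpy as np

def evaluate_fsc(alpha, beta, P, O, R, gamma):
    """Solve Eq. 1 for V[n, s] as one linear system.

    Assumed array layouts (illustrative, not from the paper):
      alpha[n, a]   = Pr(a | n)
      beta[n, z, m] = Pr(m | n, z), with m the successor node
      P[a, s, t]    = Pr(t | s, a), with t the successor state
      O[a, t, z]    = Pr(z | t, a)
      R[s, a]       = immediate reward
    """
    n_nodes, n_states = alpha.shape[0], R.shape[0]
    # M[n, s, m, t] = sum_{a,z} Pr(a|n) Pr(t|s,a) Pr(z|t,a) Pr(m|n,z)
    M = np.einsum("na,ast,atz,nzm->nsmt", alpha, P, O, beta)
    M = M.reshape(n_nodes * n_states, n_nodes * n_states)
    # r[n, s] = sum_a Pr(a|n) R(s, a)
    r = np.einsum("na,sa->ns", alpha, R).reshape(-1)
    V = np.linalg.solve(np.eye(M.shape[0]) - gamma * M, r)
    return V.reshape(n_nodes, n_states)

# Toy controller: node 0 plays action 0, node 1 plays action 1;
# observation z moves the controller to node z (all numbers made up).
alpha = np.array([[1.0, 0.0], [0.0, 1.0]])
beta = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[1.0, 0.0], [0.0, 1.0]]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.8, 0.2], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.95

V = evaluate_fsc(alpha, beta, P, O, R, gamma)
print(V)  # one value vector per node
```

Each row of `V` is the value vector of one node, so the controller's value at a belief $b$ is the maximum over nodes of the inner product of $b$ with that row.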

In practice, we often wish to find a policy suitable for a given initial belief state. Since only a small subset of belief space is often reachable, it is generally possible to construct much smaller policies tailored to the reachable region. We now describe a simple way to bias BPI's efforts toward the reachable region. Recall that the LP in Table 1 optimizes the parameters of a node to uniformly improve its value at all belief states. We propose a new LP (Table 2) that weighs the improvement by the (unnormalized) discounted occupancy distribution induced by the current policy. This accounts for belief states reachable for the node by aggregating them together. The (unnormalized) discounted occupancy distribution is given by:

$$o(s', n') = b_0(s', n') + \gamma \sum_{s, a, z, n} o(s, n)\, \Pr(a \mid n)\, \Pr(s' \mid s, a)\, \Pr(z \mid a, s')\, \Pr(n' \mid n, z) \qquad \forall s', n'$$

The LP in Table 2 is obtained by introducing variables $\epsilon_{s,n}$ for each $s$, replacing the objective $\epsilon$ by $\sum_{s,n} o(s, n)\, \epsilon_{s,n}$ and replacing $\epsilon$ in each constraint by the corresponding $\epsilon_{s,n}$. When using the modified LP, BPI naturally tries to improve the policy at the reachable belief states before the others. Since the modification ensures that the value function doesn't decrease at any belief state, focusing the efforts on reachable belief states won't decrease policy value at other belief states. Furthermore, though the policy is initially biased toward reachable states, BPI will eventually improve the policy for all belief states.
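The occupancy weights used in the objective of Table 2 can be obtained from one linear solve. The sketch below assumes the same model dynamics as Eq. 1 (transition, then observation, then node change) and dense arrays; the function name, the array layouts and the toy numbers are illustrative assumptions. Note that it stores the occupancy as `o[n, s]` rather than $o(s, n)$.

```python
import numpy as np

def discounted_occupancy(b0, alpha, beta, P, O, gamma):
    """Solve o = b0 + gamma * M^T o for the unnormalized occupancy o[n, s].

    Assumed layouts (illustrative):
      b0[n, s]      = initial distribution over (node, state) pairs
      alpha[n, a]   = Pr(a | n)
      beta[n, z, m] = Pr(m | n, z)
      P[a, s, t]    = Pr(t | s, a)
      O[a, t, z]    = Pr(z | t, a)
    """
    n_nodes, n_states = b0.shape
    # M[(n, s), (m, t)] = probability of moving from (n, s) to (m, t)
    M = np.einsum("na,ast,atz,nzm->nsmt", alpha, P, O, beta)
    M = M.reshape(n_nodes * n_states, n_nodes * n_states)
    o = np.linalg.solve(np.eye(M.shape[0]) - gamma * M.T, b0.reshape(-1))
    return o.reshape(n_nodes, n_states)

# Toy two-node, two-state controller (all numbers made up).
alpha = np.array([[1.0, 0.0], [0.0, 1.0]])
beta = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[1.0, 0.0], [0.0, 1.0]]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.8, 0.2], [0.3, 0.7]]])
gamma = 0.95
b0 = np.array([[0.5, 0.5],   # start in node 0 with a uniform belief
               [0.0, 0.0]])

o = discounted_occupancy(b0, alpha, beta, P, O, gamma)
print(o, o.sum())
```

Because the joint (node, state) process is a Markov chain, the entries of `o` sum to $1/(1-\gamma)$ whenever `b0` sums to one, which gives a quick sanity check on the solve.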

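[Diagram: belief states $b$, $b'$ and compressed belief states $\tilde{b}$, $\tilde{b}'$, linked by $f$, $T$ and $\tilde{T}$, with rewards $r$, $r'$ produced by $R$ and $\tilde{R}$.]

Figure 1: Functional flow of a POMDP (dotted arrows) and a compressed POMDP (solid arrows).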

4 Value-Directed Compression

We briefly review the sufficient conditions for a lossless compression of POMDPs [13] and describe a simple new algorithm to obtain good lossy compressions. Belief states constitute a sufficient statistic summarizing all information available to the decision maker (i.e., past actions and observations). However, as long as enough information is available to evaluate the value of each policy, one can still choose the best policy. Since belief states often contain information irrelevant to the estimation of future rewards, one can often compress belief states into some lower-dimensional representation. Let $f$ be a compression function that maps each belief state $b$ into some lower dimensional compressed belief state $\tilde{b}$ (see Figure 1). Here $\tilde{b}$ can be viewed as a bottleneck that filters the information contained in $b$ before it is used to estimate future rewards. We desire a compression $f$ such that $\tilde{b}$ corresponds to the smallest statistic sufficient for accurately predicting the current reward $r$ as well as the next compressed belief state $\tilde{b}'$ (since it captures all the information in $b'$ necessary to accurately predict subsequent rewards). Such a compression $f$ exists if we can also find compressed transition dynamics $\tilde{T}^{a,z}$ and a compressed reward function $\tilde{R}$ such that:

$$R = \tilde{R} \circ f \quad \text{and} \quad f \circ T^{a,z} = \tilde{T}^{a,z} \circ f \qquad \forall a \in A,\ z \in Z \qquad (3)$$

Given an $f$, $\tilde{R}$ and $\tilde{T}^{a,z}$ satisfying Eq. 3, we can evaluate any policy using the compressed POMDP dynamics to obtain $\tilde{V}$. Since $V = \tilde{V} \circ f$, the compressed POMDP is equivalent to the original.

When restricting $f$ to be linear (represented by matrix $F$), we can rewrite Eq. 3 as:

$$R = F \tilde{R} \quad \text{and} \quad T^{a,z} F = F \tilde{T}^{a,z} \qquad \forall a \in A,\ z \in Z \qquad (4)$$

That is, the column space of $F$ spans $R$ and is invariant w.r.t. each $T^{a,z}$. Hence, the columns of the best linear lossless compression mapping $F$ form a basis for the smallest invariant subspace (w.r.t. each $T^{a,z}$) that spans $R$, i.e., the Krylov subspace. We can find the columns of $F$ by Krylov iteration: multiplying $R$ by each $T^{a,z}$ until the newly generated vectors are linear combinations of previous ones.¹ The dimensionality of the compressed space is equal to the number of columns of $F$, which is necessarily smaller than or equal to the dimensionality of the original belief space. Once $F$ is found, we can compute $\tilde{R}$ and each $\tilde{T}^{a,z}$ by solving the system in Eq. 4.
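A minimal sketch of lossless Krylov iteration, assuming small dense matrices, is given below. The function names, the tolerance `tol` and the four-state toy model are illustrative assumptions. The sketch orthonormalizes each new vector, so it recovers the invariant subspace but its columns may contain negative entries.

```python
import numpy as np

def krylov_basis(R, T_list, tol=1e-9):
    """Orthonormal basis of the smallest subspace that contains the columns
    of R and is invariant under every matrix in T_list.
    Assumes R has at least one nonzero column."""
    basis = []
    queue = [R[:, j] for j in range(R.shape[1])]
    while queue:
        v = queue.pop(0)
        scale = np.linalg.norm(v)
        if scale == 0.0:
            continue
        for q in basis:                    # Gram-Schmidt against kept vectors
            v = v - (q @ v) * q
        residual = np.linalg.norm(v)
        if residual > tol * scale:         # not a combination of kept vectors
            q = v / residual
            basis.append(q)
            queue.extend(T @ q for T in T_list)
    return np.column_stack(basis)

def compress_model(F, R, T_list):
    """Solve Eq. 4 for the compressed reward and transition matrices."""
    R_tilde = np.linalg.lstsq(F, R, rcond=None)[0]
    T_tilde = [np.linalg.lstsq(F, T @ F, rcond=None)[0] for T in T_list]
    return R_tilde, T_tilde

# Toy model: four states that behave as two aggregated blocks (made up).
A = np.array([[0.5, 0.3],
              [0.2, 0.6]])
T = np.kron(A, np.full((2, 2), 0.5))       # a single T^{a,z}
R = np.array([[1.0], [1.0], [0.0], [0.0]])

F = krylov_basis(R, [T])
R_tilde, T_tilde = compress_model(F, R, [T])
print(F.shape)
```

In this toy model the reward and dynamics depend only on which block a state belongs to, so the iteration stops after two vectors and Eq. 4 holds exactly.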

Since linear lossless compressions are not always possible, we can extend the technique of [13] to find good lossy compressions with early stopping of the Krylov iteration. We retain only the vectors that are "far" from being linear combinations of prior vectors. For instance, if $v$ is a linear combination of $v_1, v_2, \ldots, v_n$, then there are coefficients $c_1, c_2, \ldots, c_n$ s.t. the error $\|v - \sum_i c_i v_i\|_2$ is zero. Given a threshold $\epsilon$ or some upper bound $k$ on the desired number of columns in $F$, we run Krylov iteration, retaining only the vectors with an error greater than $\epsilon$, or the $k$ vectors with largest error. When $F$ is computed by approximate Krylov iteration, we cannot compute $\tilde{R}$ and $\tilde{T}^{a,z}$ by solving the linear system in Eq. 4: due to the lossy nature of the compression, the system is overconstrained. But we can find suitable $\tilde{R}$ and $\tilde{T}^{a,z}$ by computing a least square approximation, solving:

¹ For numerical stability, one must orthogonalize each vector before multiplying by $T^{a,z}$.

$$F^\top R = F^\top F \tilde{R} \quad \text{and} \quad F^\top T^{a,z} F = F^\top F \tilde{T}^{a,z} \qquad \forall a \in A,\ z \in Z$$
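The lossy variant can be sketched by stopping Krylov iteration early and then solving the normal equations above. In this illustration the cap `k` is applied greedily in generation order, which is a simplification of keeping the $k$ vectors with largest error; the function names, `eps`, `k` and the three-state model are assumptions made for the example.

```python
import numpy as np

def truncated_krylov(R, T_list, eps=1e-3, k=None):
    """Krylov iteration that keeps a vector only if its distance to the span
    of the kept vectors exceeds eps, and stops once k vectors are kept."""
    basis = []
    queue = [R[:, j] for j in range(R.shape[1])]
    while queue and (k is None or len(basis) < k):
        v = queue.pop(0)
        for q in basis:                # remove components along kept vectors
            v = v - (q @ v) * q
        error = np.linalg.norm(v)      # ||v - sum_i c_i v_i||_2 for the best c
        if error > eps:
            q = v / error
            basis.append(q)
            queue.extend(T @ q for T in T_list)
    return np.column_stack(basis)

def least_squares_model(F, R, T_list):
    """Solve F^T R = F^T F R~ and F^T T F = F^T F T~ (normal equations)."""
    gram = F.T @ F
    R_tilde = np.linalg.solve(gram, F.T @ R)
    T_tilde = [np.linalg.solve(gram, F.T @ T @ F) for T in T_list]
    return R_tilde, T_tilde

# Toy three-state model whose lossless compression needs three columns.
T = np.array([[0.5, 0.3, 0.1],
              [0.2, 0.6, 0.1],
              [0.1, 0.1, 0.7]])
R = np.array([[1.0], [0.0], [0.0]])

F = truncated_krylov(R, [T], eps=1e-6, k=2)   # keep only two columns
R_tilde, T_tilde = least_squares_model(F, R, [T])
residual = T @ F - F @ T_tilde[0]             # nonzero: compression is lossy
print(F.shape, np.linalg.norm(residual))
```

The residual of Eq. 4 is nonzero here, but it is orthogonal to the columns of $F$, which is exactly what the least-squares solution guarantees.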

While compression is required when the dimensionality of belief space is too large, unfortunately, the columns of $F$ have the same dimensionality. Factored POMDPs of exponential dimension can, however, admit practical Krylov iteration if carried out using a compact representation (e.g., DTs or ADDs) to efficiently compute $F$, $\tilde{R}$ and each $\tilde{T}^{a,z}$.

5 Bounded Policy Iteration with Value-Directed Compression

In principle, any POMDP algorithm can be used to solve the compressed POMDPs produced by VDC. If the compression is lossless and the POMDP algorithm exact, the computed policy will be optimal for the original POMDP. In practice, POMDP algorithms are usually approximate and lossless compressions are not always possible, so care must be taken to ensure numerical stability and a policy of high quality for the original POMDP. We now discuss some of the integration issues that arise when combining VDC with BPI.

Since $V = F \tilde{V}$, maximizing the compressed value vector $\tilde{V}$ of some node $n$ automatically maximizes the value $V$ of $n$ w.r.t. the original POMDP when $F$ is nonnegative; hence it is essential that $F$ be nonnegative. Otherwise, the optimal policy of the compressed POMDP may not be optimal for the original POMDP. Fortunately, when $R$ is nonnegative then $F$ is guaranteed to be nonnegative by the nature of Krylov iteration. If some rewards are negative, we can add a sufficiently large constant to $R$ to make it nonnegative without changing the decision problem.
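The claim that a constant shift of $R$ leaves the decision problem unchanged can be checked numerically: for any fixed policy, adding $c$ to every reward adds $c/(1-\gamma)$ to every value, so the ranking of policies is preserved. The sketch below assumes a fixed policy summarized by a row-stochastic matrix `P_pi` and a reward vector `R_pi`; both names and all numbers are hypothetical.

```python
import numpy as np

def policy_value(P_pi, R_pi, gamma):
    """Value of a fixed policy: V = (I - gamma * P_pi)^{-1} R_pi."""
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
R_pi = np.array([-2.0, 1.0])

shift = -R_pi.min()                 # smallest constant making R nonnegative
V = policy_value(P_pi, R_pi, gamma)
V_shifted = policy_value(P_pi, R_pi + shift, gamma)
offset = shift / (1.0 - gamma)      # same offset in every state
print(V_shifted - V, offset)
```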

Since most algorithms, including BPI, compute approximately optimal policies, it is also critical to normalize the columns of $F$. Suppose $F$ has two columns $f_1$ and $f_2$ with $L_1$-lengths 1 and 100, respectively. Since $V = F \tilde{V} = \tilde{v}_1 f_1 + \tilde{v}_2 f_2$, changes in $\tilde{v}_2$ have a much greater impact on $V$ than changes in $\tilde{v}_1$. Such a difference in sensitivity may bias the search for a good policy to an undesirable region of the belief space, or may even cause the algorithm to return a policy that is far from optimal for the original POMDP despite the fact that it is $\epsilon$-optimal for the compressed POMDP.
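One way to normalize the columns without changing the model that $F$ represents is to rescale $\tilde{R}$ and $\tilde{T}^{a,z}$ by the same diagonal matrix. This is a sketch under that assumption, not necessarily the procedure used by the authors; `normalize_columns` and the toy numbers are illustrative, and every column of $F$ is assumed to be nonzero.

```python
import numpy as np

def normalize_columns(F, R_tilde, T_tilde_list):
    """Rescale F to unit L1 column length and adjust the compressed model.

    With D = diag(L1 lengths): F' = F D^{-1}, R~' = D R~, T~' = D T~ D^{-1},
    so that F' R~' = F R~ and F' T~' = (F T~) D^{-1}.
    """
    scale = np.abs(F).sum(axis=0)          # L1 length of each column
    D = np.diag(scale)
    D_inv = np.diag(1.0 / scale)
    F_n = F @ D_inv
    R_n = D @ R_tilde
    T_n = [D @ T @ D_inv for T in T_tilde_list]
    return F_n, R_n, T_n

# Two columns with L1 lengths 1 and 100, as in the example in the text.
F = np.array([[0.5, 60.0],
              [0.5, 40.0]])
R_tilde = np.array([2.0, 0.03])
T_tilde = np.array([[0.5, 0.002],
                    [10.0, 0.4]])

F_n, R_n, T_n = normalize_columns(F, R_tilde, [T_tilde])
print(np.abs(F_n).sum(axis=0))
```

After rescaling, a unit change in either compressed coordinate moves $V$ by a vector of the same $L_1$ length, which removes the sensitivity imbalance described above.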

We note that it is "safer" to evaluate policies iteratively by successive approximation rather than solving the system in Eq. 1. By definition, the transition matrices $T^{a,z}$ have eigenvalues with magnitude $\le 1$. In contrast, lossy compressed transition matrices $\tilde{T}^{a,z}$ are not guaranteed to have this property. Hence, solving the system in Eq. 1 may not correspond to policy evaluation. It is thus safer to evaluate policies by successive approximation for lossy compressions.
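A deliberately extreme one-dimensional illustration of this point: when a lossy compressed matrix has an eigenvalue above 1, the linear solve can return a value that no finite sequence of backups would produce. The matrix `M_lossy`, the helper names and the numbers are made up for the example.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue magnitude of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def evaluate_by_successive_approximation(r, M, gamma, n_sweeps):
    """Apply V <- r + gamma * M V for a fixed number of sweeps from V = 0."""
    V = np.zeros_like(r, dtype=float)
    for _ in range(n_sweeps):
        V = r + gamma * (M @ V)
    return V

gamma = 0.95
M_lossy = np.array([[1.2]])   # hypothetical lossy compressed dynamics
r = np.array([1.0])           # nonnegative compressed reward

rho = spectral_radius(M_lossy)
V_direct = np.linalg.solve(np.eye(1) - gamma * M_lossy, r)
V_iter = evaluate_by_successive_approximation(r, M_lossy, gamma, n_sweeps=10)
print(rho, V_direct, V_iter)
```

Here the direct solve yields a negative value even though every reward is nonnegative, whereas each sweep of successive approximation remains a meaningful finite-horizon estimate. Checking `gamma * rho` before trusting a linear solve is one inexpensive safeguard.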

Finally, several algorithms including BPI compute witness belief states to verify the dominance of a value vector. Since the compressed belief space $\tilde{B}$ is different from the original belief space $B$, this must be approached with care. $B$ is a simplex corresponding to the convex hull of the state points. In contrast, since each row vector of $F$ is the compressed version of some state point, $\tilde{B}$ corresponds to the convex hull of the row vectors of $F$. When $F$ is non-negative, it is often possible to ignore this difference. For instance, when verifying the dominance of a value vector, if there is a compressed witness $\tilde{b}$, there is always an uncompressed witness $b$, but not vice-versa. This means that we can properly identify all dominating value vectors, but we may erroneously classify a dominated vector as dominating. In practice, this doesn't impact the correctness of algorithms such as policy iteration, bounded policy iteration, incremental pruning, witness algorithm, etc., but it will slow them down since they won't be able to prune as many value vectors as possible.

[Figure: plots of expected rewards as a function of the number of basis functions and the number of nodes for the problems cycle16, cycle19, cycle22, cycle25, 3legs16, 3legs19, 3legs22 and 3legs25, and a plot of running time (in 1000 seconds) for cycle25.]
                                                                100                                                                                                                                                       60                                                                                                                                 60                                                                                                               40                                                                                                 100                     40                                                                                                     100                        40                                                                                                        20                                                                                              50                                                                                                                                   20                                                                                                                              20                                                                     # of basis fns                                                                                                                                      50                                                                                                                              50                                                                                                                     # of nodes                                                                 # of basis fns                                  # of nodes                                                                   # of basis fns                                        # of nodes

Figure 2: Experimental results for cycle and 3legs network configurations of 16, 19, 22 and 25 machines. The bottom right graph shows the running time of BPI on compressed versions of a cycle network of 25 machines.

Network     3legs   3legs   3legs   3legs   cycle   cycle   cycle   cycle
Machines    16      19      22      25      16      19      22      25
VDCBPI      120.9   137.0   151.0   164.8   103.9   121.3   134.3   151.4
heuristic   100.6   118.3   138.3   152.3   102.5   117.9   130.2   152.3
doNothing   98.4    112.9   133.5   147.1   91.6    105.4   122.0   140.1

Table 3: Comparison of the best policies achieved by VDCBPI to the doNothing and heuristic policies.
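
To make the comparison in Table 3 easier to read, the following minimal sketch computes the difference in expected reward between VDCBPI and each baseline policy. The values are transcribed from Table 3; the names SIZES, TABLE3 and gain_over are illustrative assumptions and do not come from the original implementation.

```python
# Expected rewards transcribed from Table 3.
# Each tuple lists values for networks of 16, 19, 22 and 25 machines.
SIZES = (16, 19, 22, 25)
TABLE3 = {
    "3legs": {
        "VDCBPI": (120.9, 137.0, 151.0, 164.8),
        "heuristic": (100.6, 118.3, 138.3, 152.3),
        "doNothing": (98.4, 112.9, 133.5, 147.1),
    },
    "cycle": {
        "VDCBPI": (103.9, 121.3, 134.3, 151.4),
        "heuristic": (102.5, 117.9, 130.2, 152.3),
        "doNothing": (91.6, 105.4, 122.0, 140.1),
    },
}


def gain_over(baseline, topology):
    """Return {machines: VDCBPI reward minus baseline reward} for one topology.

    Hypothetical helper for reading Table 3; positive values mean that
    VDCBPI achieved a higher expected reward than the baseline.
    """
    rows = TABLE3[topology]
    return {
        n: round(v - b, 1)  # round to the one-decimal precision of the table
        for n, v, b in zip(SIZES, rows["VDCBPI"], rows[baseline])
    }


if __name__ == "__main__":
    for topology in TABLE3:
        for baseline in ("heuristic", "doNothing"):
            print(topology, baseline, gain_over(baseline, topology))
```

Under this reading of the table, VDCBPI exceeds doNothing in every configuration and exceeds the heuristic policy in every configuration except the cycle network of 25 machines, where the difference is slightly negative.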

The above tips work well when VDC is integrated with BPI. We believe they are sufficient to ensure proper integration of VDC with other POMDP algorithms, though we have not verified this empirically.
