{"title": "APRICODD: Approximate Policy Construction Using Decision Diagrams", "book": "Advances in Neural Information Processing Systems", "page_first": 1089, "page_last": 1095, "abstract": null, "full_text": "APRICODD: Approximate Policy Construction \n\nusing Decision Diagrams \n\nRobert St-Aubin \n\nJesse Hoey \n\nCraig Boutilier \n\nDept. of Computer Science \n\nDept. of Computer Science \n\nDept. of Computer Science \n\nUniversity of British Columbia \n\nUniversity of British Columbia \n\nVancouver, BC V6T lZA \n\nstaubin@cs.ubc.ca \n\nVancouver, BC V6T lZA \n\njhoey@cs.ubc.ca \n\nUniversity of Toronto \nToronto, ON M5S 3H5 \ncebly@cs.toronto.edu \n\nAbstract \n\nWe propose a method of approximate dynamic programming for Markov \ndecision processes (MDPs) using algebraic decision diagrams (ADDs). \nWe produce near-optimal value functions and policies with much lower \ntime and space requirements than exact dynamic programming. Our \nmethod reduces the sizes of the intermediate value functions generated \nduring value iteration by replacing the values at the terminals of the ADD \nwith ranges of values. Our method is demonstrated on a class of large \nMDPs (with up to 34 billion states), and we compare the results with the \noptimal value functions. \n\n1 Introduction \n\nThe last decade has seen much interest in structured approaches to solving planning prob(cid:173)\nlems under uncertainty formulated as Markov decision processes (MDPs). Structured algo(cid:173)\nrithms allow problems to be solved without explicit state-space enumeration by aggregating \nstates of identical value. Structured approaches using decision trees have been applied to \nclassical dynamic programming (DP) algorithms such as value iteration and policy itera(cid:173)\ntion [7, 3]. Recently, Hoey et.al. [8] have shown that significant computational advantages \ncan be obtained by using an Algebraic Decision Diagram (ADD) representation [1, 4, 5]. 
Notwithstanding such advances, large MDPs must often be solved approximately. This can be accomplished by reducing the "level of detail" in the representation and aggregating states with similar (rather than identical) value. Approximations of this kind have been examined in the context of tree-structured approaches [2]; this paper extends this research by applying them to ADDs. Specifically, the terminals of an ADD will be labeled with the range of values taken by the corresponding set of states. As we will see, ADDs have a number of advantages over trees.

We develop two approximation methods for ADD-structured value functions, and apply them to the value diagrams generated during dynamic programming. The result is a near-optimal value function and policy. We examine the tradeoff between computation time and decision quality, and consider several variable reordering strategies that facilitate approximate aggregation.

2 Solving MDPs using Algebraic Decision Diagrams

We assume a fully-observable MDP [10] with finite sets of states S and actions A, transition function Pr(s, a, t), reward function R, and a discounted infinite-horizon optimality criterion with discount factor beta. Value iteration can be used to compute an optimal stationary policy pi : S -> A by constructing a series of n-stage-to-go value functions, where:

    V^{n+1}(s) = R(s) + max_{a in A} { beta * sum_{t in S} Pr(s, a, t) * V^n(t) }    (1)

The sequence of value functions V^n produced by value iteration converges linearly to the optimal value function V*. For some finite n, the actions that maximize Equation 1 form an optimal policy, and V^n approximates its value.

ADDs [1, 4, 5] are a compact, efficiently manipulable data structure for representing real-valued functions over boolean variables, B^n -> R.
They generalize a tree-structured representation by allowing nodes to have multiple parents, leading to the recombination of isomorphic subgraphs and hence to a possible reduction in the representation size. A more precise definition of the semantics of ADDs can be found in [9].

Recently, we applied ADDs to the solution of large MDPs [8], yielding significant space/time savings over related tree-structured approaches. We assume the state of an MDP is characterized by a set of variables X = {X_1, ..., X_n}. Values of variable X_i will be denoted in lowercase (e.g., x_i). We assume each X_i is boolean.^1 Actions are described using dynamic Bayesian networks (DBNs) [6, 3] with ADDs representing their conditional probability tables. Specifically, a DBN for action a requires two sets of variables, one set X = {X_1, ..., X_n} referring to the state of the system before action a has been executed, and X' = {X'_1, ..., X'_n} denoting the state after a has been executed. Directed arcs from variables in X to variables in X' indicate direct causal influence. The conditional probability table (CPT) for each post-action variable X'_i defines a conditional distribution P^a_{X'_i} over X'_i (i.e., a's effect on X_i) for each instantiation of its parents. This can be viewed as a function P^a_{X'_i}(X_1, ..., X_n), but where the function value (distribution) depends only on those X_j that are parents of X'_i. We represent this function using an ADD. Reward functions can also be represented using ADDs. Figure 1(a) shows a simple example of a single action represented as a DBN as well as a reward function.

We use the method of Hoey et al. [8] to perform value iteration using ADDs. We refer to that paper for full details on the algorithm, and present only a brief outline here.
The ADD representations of the CPTs for each action, P^a_{X'_i}(X), are referred to as action diagrams, as shown in Figure 1(b), where X represents the set of pre-action variables, {X_1, ..., X_n}. These action diagrams can be combined into a complete action diagram (Figure 1(c)):

    P^a(X', X) = prod_{i=1}^{n} [ X'_i * P^a_{X'_i}(X) + (1 - X'_i) * (1 - P^a_{X'_i}(X)) ]    (2)

The complete action diagram represents all the effects of pre-action variables on post-action variables for a given action. The immediate reward function R(X') is also represented as an ADD, as are the n-stage-to-go value functions V^n(X). Given the complete action diagrams for each action, and the immediate reward function, value iteration can be performed by setting V^0 = R, and applying Eq. 1,

    V^{n+1}(X) = R(X) + max_{a in A} { beta * sum_{X'} P^a(X', X) * V^n(X') }    (3)

followed by swapping all unprimed variables with primed ones. All operations in Equation 3 are well defined in terms of ADDs [8, 12]. The value iteration loop is continued until some stopping criterion is met. Various optimizations are applied to make this calculation as efficient as possible in both space and time.

^1 An extension to multi-valued variables would be straightforward.

Figure 1: ADD representation of an MDP: (a) action network for a single action (top) and the immediate reward network (bottom); (b) matrix and ADD representation of CPTs (action diagrams); (c) complete action diagram.

Figure 2: Approximation of original value diagram (a) with errors of 0.1 (b) and 0.5 (c).
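Stripped of the ADD machinery, the backup in Eq. 3 is an ordinary value iteration sweep. The flat-state sketch below is ours for illustration only: the paper never enumerates states explicitly, and the function name and toy setup are assumptions, not the authors' code.

```python
import numpy as np

def value_iteration(P, R, beta=0.9, tol=1e-6):
    """Flat-state sketch of Eqs. (1)/(3): P[a][s, t] = Pr(s, a, t), R[s] = reward.

    Returns the converged value function and a greedy policy. The paper
    performs this same computation symbolically on ADDs.
    """
    V = R.copy()                                    # V^0 = R, as in Section 2
    while True:
        # Q[a, s] = R(s) + beta * sum_t Pr(s, a, t) * V(t)
        Q = np.stack([R + beta * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)                       # maximize over actions
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)          # value function, policy
        V = V_new
```

On a two-state toy MDP with R = [0, 1], a "stay" action, a "swap" action and beta = 0.9, this converges to V approximately [9, 10], with the greedy policy swapping out of the rewardless state.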
3 Approximating Value Functions

While structured solution techniques offer many advantages, the exact solution of MDPs in this way can only work if there are "few" distinct values in a value function. Even if a DBN representation shows little dependence among variables from one stage to another, the influence of variables tends to "bleed" through a DBN over time, and many variables become relevant to predicting value. Thus, even using structured methods, we must often relax the optimality constraint and generate only approximate value functions, from which near-optimal policies will hopefully arise. It is generally the case that many of the values distinguished by DP are similar. Replacing such values with a single approximate value leads to a size reduction, while not significantly affecting the precision of the value diagrams.

3.1 Decision Diagrams and Approximation

Consider the value diagram shown in Figure 2(a), which has eight distinct values as shown. The value of each state s is represented as a pair [l, u], where the lower, l, and upper, u, bounds on the values are both represented. The span of a state s is given by span(s) = u - l. Point values are represented by setting u = l, and have zero span. Now suppose that the diagram in Figure 2(a) exceeds resource limits, and a reduction in size is necessary to continue the value iteration process. If we choose to no longer distinguish values which are within 0.1 or 0.5 of each other, the diagrams in Figure 2(b) or (c) result, respectively. The states which had proximal values have been merged, where merging a set of states s_1, s_2, ..., s_n with values [l_1, u_1], ..., [l_n, u_n] results in an aggregate state, t, with a ranged value [min(l_1, ..., l_n), max(u_1, ..., u_n)]. The midpoint of the range estimates the true value of the states with minimal error, namely, span(t)/2.
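The merge just described is simple enough to sketch in a few lines. The helpers below are our own illustration operating on explicit [l, u] pairs, not the authors' ADD-leaf implementation:

```python
def merge(intervals):
    """Merge states with ranged values [l_i, u_i] into one aggregate range."""
    lo = min(l for l, _ in intervals)
    hi = max(u for _, u in intervals)
    return (lo, hi)

def midpoint_estimate(interval):
    """Midpoint estimate of a ranged leaf and its worst-case error, span/2."""
    lo, hi = interval
    return (lo + hi) / 2.0, (hi - lo) / 2.0
```

Merging the point values 9.7 and 9.8 from Figure 2, for example, gives the ranged leaf [9.7, 9.8], whose midpoint 9.75 is off by at most 0.05.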
The span of V is the maximum of all spans in the value diagram, and therefore the maximum error in V is simply span(V)/2 [2]. The combined span of a set of states is the span of the pair that would result from merging them all. The extent of a value diagram V is the combined span of the portion of the state space which it represents. The span of the diagram in Figure 2(c) is 0.5, but its extent is 8.7.

ADD-structured value functions lend themselves to approximation because approximations can always be performed directly, without pre-processing techniques such as variable reordering. Of course, variable reordering can still play an important computational role in ADD-structured methods, but it is not needed for discovering approximations.

3.2 Value Iteration with Approximate Value Functions

Approximate value iteration simply means applying an approximation technique to the n-stage-to-go value function generated at each iteration of Eq. 3. Available resources might dictate that ADDs be kept below some fixed size. In contrast, decision quality might require errors below some fixed value, referred to as the pruning strength, delta. The remainder of this paper will focus on the latter, although we have examined the former as well [9].

Thus, the objective of a single approximation step is a reduction in the size of a ranged value ADD, achieved by replacing all leaves which have combined spans less than the specified error bound by a single leaf. Given a leaf [l, u] in V, the set of all leaves [l_i, u_i] such that the combined span of [l_i, u_i] with [l, u] is less than the specified error are merged. Repeating this process until no more merges are possible gives the desired result. We have also examined a quicker, but less exact, method for approximation, which exploits the fact that simply reducing the precision of the values at the leaves of an ADD merges similar values.
We defer explanations to the longer version of this paper [9].

The sequence of ranged value functions, V^n, converges after n' iterations to an approximate (non-ranged) value function, V̄, obtained by taking the mid-points of each ranged terminal node in V^{n'}. The pruning strength, delta, then gives the percentage difference between V̄ and the optimal n'-stage-to-go value function V^{n'}. The value function V̄ induces a policy, π̄, the value of which is V^{π̄}. In general, however, V^{π̄} != V̄ [11].^2

3.3 Variable Reordering

As previously mentioned, variable reordering can have a significant effect on the size of an ADD, but finding the variable ordering which gives rise to the smallest ADD for a boolean function is co-NP-complete [4]. We examine three reordering methods. The first two are standard for reordering variables in BDDs: Rudell's sifting algorithm and random reordering [12]. The last reordering method we consider arises in the decision tree induction literature, and is related to the information gain criterion. Given a value diagram V with extent delta, each variable x is considered in turn. The value diagram is restricted first with x = true, and the extent delta_t and the number of leaves n_t are calculated for the restricted ADD. Similar values delta_f and n_f are found for the x = false restriction. If we collapsed the entire ADD into a single node, assuming a uniform distribution over values in the resulting range gives us the entropy for the entire ADD:

    E = - integral p(v) log(p(v)) dv = log(delta),    (4)

which represents our degree of uncertainty about the values in the diagram. Splitting the values with the variable x results in two new value diagrams, for each of which the entropy is calculated. The gain in information (decrease in entropy) values are used to rank the variables, and the resulting order is applied to the diagram.

^2 In fact, the equality arises if and only if V̄ = V*, where V* is the optimal value function.
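Under the uniform-distribution assumption, the entropy of a diagram is just the log of its extent, so the ranking can be sketched as below. This is our illustrative reading only: the paper defers details to [9], so the leaf-count weighting of the two restrictions is a guess, and the explicit state/value lists stand in for ADD operations.

```python
import math

def extent(values):
    """Extent of a set of values: the width of the interval they cover."""
    return max(values) - min(values) if len(values) > 1 else 0.0

def entropy(values):
    """Entropy of a uniform distribution over the extent: E = log(extent)."""
    e = extent(values)
    return math.log(e) if e > 0 else 0.0

def gain(states, values, var):
    """Information gain from splitting the diagram on boolean variable var."""
    vt = [v for s, v in zip(states, values) if s[var]]
    vf = [v for s, v in zip(states, values) if not s[var]]
    n = len(values)
    split = (len(vt) / n) * entropy(vt) + (len(vf) / n) * entropy(vf)
    return entropy(values) - split

def rank_variables(states, values):
    """Order variables by decreasing information gain."""
    return sorted(states[0], key=lambda x: gain(states, values, x), reverse=True)
```

A variable that cleanly separates high values from low values (small extents on both restrictions) gets a large gain and is placed early in the order.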
This method will be referred to as the minimum span method.

4 Results

The procedures described above were implemented using a modified version of the CUDD package [12], a library of C routines which provides support for manipulation of ADDs. Experimental results in this section were all obtained using one processor on a dual-processor Pentium II PC running at 400MHz with 0.5Gb of RAM. Our approximation methods were tested on various adaptations of a process planning problem taken from [7, 8].^3

4.1 Approximation

All experiments in this section were performed on problem domains where the variable ordering was the one selected implicitly by the constructors of the domains.^4

Value Function  delta (%)  time (s)  iter  nodes (int)  leaves  |V^π̄ - V*| (%)
Optimal              0      270.91    44      22170       527        0.0
Approximate          1      562.35    44      17108       117        0.13
                     2      547.00    44      15960        77        0.14
                     3      112.7     15      15230        58        5.45
                     4       68.53    12      14510        48        1.20
                     5       38.06    10      11208        38        2.48
                    10        6.24     6       3739        15       11.33
                    15        0.70     4        580         9       14.11
                    20        0.57     4        299         6       16.66
                    30        0.05     2         50         3       25.98
                    40        0.07     2         10         2       30.28
                    50        0.04     1          0         1       31.25

Table 1: Comparing optimal with approximate value iteration on a domain with 28 boolean variables.

In Table 1 we compare optimal value iteration using ADDs (SPUDD as presented in [8]) with approximate value iteration using different pruning strengths delta. In order to avoid overly aggressive pruning in the early stages of value iteration, we need to take into account the size of the value function at every iteration. Therefore, we use a sliding pruning strength specified as delta * sum_{i=0}^{n} beta^i * extent(R), where R is the initial reward diagram, beta is the discount factor introduced earlier, and n is the iteration number.
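For concreteness, the sliding pruning strength is the nominal strength scaled by the accumulated discounted extent of the reward diagram. A minimal sketch, with illustrative names and delta given as a fraction rather than a percentage:

```python
def sliding_threshold(delta, beta, extent_R, n):
    """Absolute pruning threshold at iteration n: delta * sum_{i=0}^n beta^i * extent(R).

    The geometric sum grows toward delta * extent(R) / (1 - beta), matching
    the growth of the accumulated discounted values, so early iterations
    are pruned gently and later ones proportionally to the value scale.
    """
    return delta * sum(beta ** i for i in range(n + 1)) * extent_R
```

With delta = 0.03, beta = 0.9 and extent(R) = 10, the threshold is 0.3 at iteration 0 and 0.57 at iteration 1.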
We illustrate running time, value function size (internal nodes and leaf nodes), number of iterations, and the average sum of squared differences between the optimal value function, V*, and the value of the approximate policy, V^π̄.

It is important to note that the pruning strength is an upper bound on the approximation error. That is, the optimal values are guaranteed to lie within the ranges of the approximate ranged value function. However, as noted earlier, this bound does not hold for the value of an induced policy, as can be seen at 3% pruning in the last column of Table 1.

The effects of approximation on the performance of the value iteration algorithm are threefold. First, the approximation itself introduces an overhead which depends on the size of the value function being approximated. This effect can be seen in Table 1 at low pruning strengths (1-2%), where the running time is increased from that taken by optimal value iteration. Second, the ranges in the value function reduce the number of iterations needed to attain convergence, as can be seen in Table 1 for pruning strengths greater than 2%. However, for the lower pruning strengths, this effect is not observed. This can be explained by the fact that a small number of states with values much greater (or much lower) than those of the rest of the state space may never be approximated. Therefore, to converge, this portion of the state space requires the same number of iterations as in the optimal case.^5

The third effect of approximation is to reduce the size of the value functions, thus reducing the per-iteration computation time during value iteration. This effect is clearly seen at pruning strengths greater than 2%, where it overtakes the cost of approximation, and generates significant time and space savings.

^3 See [9] for details.
^4 Experiments showed that conclusions in this section are independent of variable order.
Speedups of 2- and 4-fold are obtained for pruning strengths of 3% and 4% respectively. Furthermore, fewer than 60 leaf nodes represent the entire state space, while value errors in the policy do not exceed 6%. This confirms our initial hypothesis that many values within a given domain are very similar, and thus replacing such values with ranges drastically reduces the size of the resulting diagram without significantly affecting the quality of the resulting policy. Pruning above 5% produces larger errors, but takes a very short time to converge. Pruning strengths of more than 40% generate policies which are close to trivial, where a single action is always taken.

4.2 Variable reordering

Figure 3: Sizes of final value diagrams plotted as a function of the problem domain size, for the intuitive (unshuffled) order with no reordering, and for shuffled orders with no reordering, minimum span reordering, random reordering, and sifting.

Results in the previous section were all generated using the "intuitive" variable ordering for the problem at hand. It is probable that such an ordering is close to optimal, but such orderings may not always be obvious, and the effects of a poor ordering on the resources required for policy generation can be extreme. Therefore, to characterize the reordering methods discussed in Section 3.3, we start with initially randomly shuffled orders and compare the sizes of the final value diagrams with those found using the intuitive order.

^5 We are currently looking into alleviating this effect in order to increase convergence speed for low pruning strengths.

In Figure 3 we present results obtained from approximate value iteration with a pruning strength of 3% applied to a range of problem domain sizes.
\n\nIn the absence of any reordering, diagrams produced with randomly shuffled variable orders \nare up to 3 times larger than those produced with the intuitive (unshuffled) order. The \nminimum span reordering method, starting from a randomly shuffled order, finds orders \nwhich are equivalent to the intuitive one, producing value diagrams with nearly identical \nsize. The sifting and random reordering methods find orders which reduce the sizes further \nby up to a factor of 7. \n\nReordering attempts take time, but on the other hand, DP is faster with smaller diagrams. \nValue iteration with the sifting reordering method (starting with shuffled orders) was found \nto run in time similar to that of value iteration with the intuitive ordering, while the other \nreordering methods took slightly longer. All reordering methods, however, reduced running \ntimes and diagram sizes from that using no reordering, by factors of 3 to 5. \n\n5 Concluding Remarks \n\nWe examined a method for approximate dynamic programming for MDPs using ADDs. \nADDs are found to be ideally suited to this task. The results we present have clearly shown \ntheir applicability on a range of MDPs with up to 34 billion states. Investigations into the \nuse of variable reordering during value iteration have also proved fruitful, and yield large \nimprovements in the sizes of value diagrams. Results show that our policy generator is \nrobust to the variable order, and so this is no longer a constraint for problem specification. \n\nReferences \n\n[1] R. Iris Bahar, Erica A. Frohm, Charles M. Gaona, Gary D. Hachtel, Enrico Macii, Abelardo \nPardo, and Fabio Somenzi. Algebraic decision diagrams and their applications. In International \nConference on Computer-Aided Design, pages 188- 191. IEEE, 1993. \n\n[2] Craig Boutilier and Richard Dearden. Approximating value trees in structured dynamic pro(cid:173)\n\ngramming. In Proceedings ICML-96, Bari, Italy, 1996. 
\n\n[3] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Exploiting structure in policy con(cid:173)\n\nstruction. In Proceedings Fourteenth Inter. Conf on AI (IJCAI-95), 1995. \n\n[4] Randal E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transac(cid:173)\n\ntions on Computers, C-35(8):677--691, 1986. \n\n[5] E. M. Clarke, K. L. McMillan, X. Zhao, M. Fujita, and J. Yang. Spectral transforms for large \nboolean functions with applications to technology mapping. In DAC, 54-60. ACMIIEEE, 1993. \n[6] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. \n\nComputational Intelligence, 5(3):142- 150, 1989. \n\n[7] Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. \n\nArtificial Intelligence, 89:219- 283, 1997. \n\n[8] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic planning using \n\ndecision diagrams. In Proceedings of UAI99, Stockholm, 1999. \n\n[9] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. Optimal and approximate planning \n\nusing decision diagrams. Technical Report TR-OO-05 , UBC, June 2000. \n\n[10] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. \n\nWiley, New York, NY., 1994. \n\n[11] Satinder P. Singh and Richard C. Yee. An upper bound on the loss from approximate optimal(cid:173)\n\nvalue function. Machine Learning, 16:227- 233, 1994. \n\n[12] Fabio Somenzi. \n\nCUDD: CU decision diagram package. \n\nAvailable \n\nfrom \n\nft p : / /vl s i . c o l o r ado. edu/pub /, 1998. \n\n\f", "award": [], "sourceid": 1840, "authors": [{"given_name": "Robert", "family_name": "St-Aubin", "institution": null}, {"given_name": "Jesse", "family_name": "Hoey", "institution": null}, {"given_name": "Craig", "family_name": "Boutilier", "institution": null}]}