{"title": "Bounded Finite State Controllers", "book": "Advances in Neural Information Processing Systems", "page_first": 823, "page_last": 830, "abstract": "", "full_text": "Bounded Finite State Controllers\n\nPascal Poupart\nDepartment of Computer Science\nUniversity of Toronto\nToronto, ON M5S 3H5\nppoupart@cs.toronto.edu\n\nCraig Boutilier\nDepartment of Computer Science\nUniversity of Toronto\nToronto, ON M5S 3H5\ncebly@cs.toronto.edu\n\nAbstract\n\nWe describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).\n\n1 Introduction\n\nFinite state controllers (FSCs) provide a simple, convenient way of representing policies for partially observable Markov decision processes (POMDPs). Two general approaches are often used to construct good controllers: policy iteration (PI) [7] and gradient ascent (GA) [10, 11, 1]. The former is guaranteed to converge to an optimal policy; however, the size of the controller often grows intractably. In contrast, the latter restricts its search to controllers of a bounded size, but may get trapped in a local optimum.\n\nWhile locally optimal solutions are often acceptable, for many planning problems with a combinatorial flavor, GA can easily get trapped by simple policies that are far from optimal. Consider a system engaged in preference elicitation, charged with discovering an optimal query policy to determine relevant aspects of a user's utility function. Often no single question yields information of much value, while a sequence of queries does. 
If each question has a cost, a system that locally optimizes the policy by GA may determine that the best course of action is to ask no questions (i.e., minimize cost given no information gain). When an optimal policy consists of a sequence of actions any small perturbation of which results in a bad policy, there is little hope of finding this sequence using methods that greedily perform local perturbations, such as those employed by GA.\n\nIn general, we would like the best of both worlds: bounded controller size and convergence to a global optimum. While achieving both is NP-hard for the class of deterministic controllers [10], one can hope for a tractable algorithm that at least avoids obvious local optima. We propose a new anytime algorithm, bounded policy iteration (BPI), that improves a policy much like Hansen's PI [7] while keeping the size of the controller fixed. Whenever the algorithm gets stuck in a local optimum, the controller is allowed to grow slightly by introducing one (or a few) node(s) to escape the local optimum.\n\nFollowing a brief review of FSCs (Sec. 2), we extend PI to stochastic controllers (Sec. 3), thus admitting smaller, high quality controllers. We then derive the BPI algorithm by ensuring that the number of nodes remains unchanged (Sec. 4). We analyze the structure of local optima for BPI (Sec. 5), relate this analysis to GA, and use it to justify a new method to escape local optima. Finally, we report some preliminary experiments (Sec. 6).\n\n2 Finite State Controllers for POMDPs\n\nA POMDP is defined by a set of states S; a set of actions A; a set of observations Z; a transition function T, where T(s,a,s') denotes the transition probability Pr(s'|s,a); an observation function O, where O(s',a,z) denotes the probability Pr(z|s',a) of making observation z in state s' after taking action a; and a reward function R, where R(s,a) denotes the immediate reward associated with state s when executing action a. We assume discrete state, action and observation sets, and we focus on discounted, infinite horizon POMDPs with discount factor 0 <= γ < 1. Since states are not directly observable in POMDPs, we define a belief state b, with b(s) = Pr(s), to be a distribution over states. Belief state b can be updated in response to an action-observation pair <a,z> using Bayes rule.\n\nPolicies represented by FSCs are defined by a (possibly cyclic) directed graph G = <N,E>, where each node n in N is labeled by an action a and each edge e in E by an observation z. Each node has one outward edge per observation. The FSC can be viewed as a policy <α,η>, where action strategy α associates each node n with an action α(n) in A, and observation strategy η associates each node n and observation z with a successor node η(n,z) in N (i.e., the node reached by the edge labeled with z). A policy is executed by taking the action associated with the current node and, upon receiving an observation, moving to the successor node indicated by the observation strategy.\n\n3 Stochastic Controllers\n\nWe extend PI to stochastic controllers, in which the action strategy α(n,a) gives the probability of executing action a in node n, and the observation strategy η(n,a,z,n') gives the probability of reaching successor node n' after executing a in node n and observing z. Note that we now use probabilistic action strategies and have extended probabilistic observation strategies to depend on the action executed.\n\n4 Bounded Policy Iteration\n\nIn the policy improvement step, BPI solves a linear program (LP) for each node, whose variables set the action and observation strategies for the improved node. Each c_a variable indicates the probability of executing action a (i.e., α(n,a)), and each o_{a,z,n'} variable indicates the (unnormalized) probability of reaching node n' after executing a and observing z (i.e., η(n,a,z,n')).(1) The LP seeks a convex combination of backed up vectors that improves the node's value function at every state by some amount ε, and it maximizes ε.\n\nTo summarize, BPI alternates between policy evaluation and improvement as in regular PI, but the policy improvement step simply tries to improve each node by solving the LP in Table 4. The c_a and o_{a,z,n'} variables are used to set the probabilistic action and observation strategies of the new improved node.\n\n5 Local Optima\n\nBPI is a simple, efficient alternative to standard PI that monotonically improves an FSC while keeping its size constant. Unfortunately, it is only guaranteed to converge to a local optimum. We now characterize BPI's local optima and propose a method to escape them.\n\n5.1 Characterization\n\nThm. 2 gives a necessary and sufficient condition characterizing BPI's local optima. Intuitively, a controller is a local optimum when each linear segment touches from below, or is tangent to, the controller's backed up value function (see Fig. 
1(b)).\n\nTheorem 2 BPI has converged to a local optimum if and only if each node's value function is tangent to the backed up value function.\n\nProof: Since the objective function of the LP in Table 4 seeks to maximize the improvement ε, the resulting convex combination must be tangent to the upper surface of the backed up value function. Conversely, the only time when the LP won't be able to improve a node is when its vector is already tangent to the backed up value function. □\n\n(1) Actually, we don't need the c_a variables, since they can be derived from the o_{a,z,n'} variables by summing out n' (c_a = Σ_{n'} o_{a,z,n'} for any z), so the number of variables can be reduced.\n\nInterestingly, tangency is a necessary (but not sufficient) condition for GA's local optima.\n\nCorollary 1 If GA has converged to a local optimum, then the value function of each node reachable from the initial belief state is tangent to the backed up value function.\n\nProof: GA seeks to monotonically improve a controller in the direction of steepest ascent. The LP of Table 4 also seeks a monotonically improving direction. 
Thus if BPI can improve a controller by finding a direction of improvement using the LP of Table 4, then GA will also find it, or will find a steeper one. Conversely, when a controller is a local optimum for GA, there is no monotonic improvement possible in any direction. Since BPI can only improve a controller by following a direction of monotonic improvement, GA's local optima are a subset of BPI's local optima. Thus, tangency is a necessary, but not sufficient, condition of GA's local optima. □\n\nIn the proof of Corollary 1, we argued that GA's local optima are a subset of BPI's local optima. This suggests that BPI is inferior to GA, since it can be trapped by more local optima than GA. However, we describe in the next section a simple technique that allows BPI to easily escape from local optima.\n\n5.2 Escape Technique\n\nThe tangency condition characterizing local optima can be used to design an effective escape method for BPI. It essentially tells us that such tangent belief states are "bottlenecks" for further policy improvement. If we could improve the value at the tangent belief state(s) of some node, then we could break out of the local optimum. A simple method for doing so consists of a one-step lookahead search from the tangent belief states. Figure 1(b) illustrates how belief state b' can be reached in one step from tangent belief state b, and how the backed up value function improves the current value of b'. Thus, if we add a node to the controller that maximizes the value of b', its improved value can subsequently be backed up to the tangent belief state b, breaking out of the local optimum.\n\nOur algorithm is summarized as follows: perform a one-step lookahead search from each tangent belief state; when a reachable belief state can be improved, add a new node to the controller that maximizes that belief state's value. 
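The escape procedure just described can be sketched in code. The following is a minimal illustration under our own assumptions (dense numpy arrays T[s,a,s'], O[s',a,z], R[s,a], and the controller's value function represented as a set of alpha-vectors, one per node); it is not the authors' implementation.

```python
import numpy as np

def belief_update(T, O, b, a, z):
    # Bayes rule: b'(s') ∝ O(s',a,z) * sum_s T(s,a,s') b(s)
    bp = O[:, a, z] * (b @ T[:, a, :])
    return bp / bp.sum()

def backed_up_value(T, O, R, alphas, b, gamma):
    # One-step lookahead (Bellman backup) of the controller's value at belief b.
    best = -np.inf
    _, nA, nZ = O.shape
    for a in range(nA):
        v = b @ R[:, a]
        for z in range(nZ):
            pz = (b @ T[:, a, :]) @ O[:, a, z]  # Pr(z | b, a)
            if pz > 1e-12:
                bp = belief_update(T, O, b, a, z)
                v += gamma * pz * max(bp @ alpha for alpha in alphas)
        best = max(best, v)
    return best

def try_escape(T, O, R, alphas, tangent_beliefs, gamma, eps=1e-8):
    # One-step lookahead from each tangent belief state; return a reachable
    # belief whose backed-up value exceeds its current value, if any.  A new
    # node maximizing that belief's value would then be added to the FSC.
    current = lambda b: max(b @ alpha for alpha in alphas)
    _, nA, nZ = O.shape
    for b in tangent_beliefs:
        for a in range(nA):
            for z in range(nZ):
                if (b @ T[:, a, :]) @ O[:, a, z] > 1e-12:
                    bp = belief_update(T, O, b, a, z)
                    if backed_up_value(T, O, R, alphas, bp, gamma) > current(bp) + eps:
                        return bp
    return None
```

When `try_escape` returns None, no reachable belief state can be improved, which is exactly the optimality condition of Thm. 3 below.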
Interestingly, when no reachable belief state can be improved, the policy must be optimal at the tangent belief states.\n\nTheorem 3 If the backed up value function does not improve the value of any belief state reachable in one step from any tangent belief state, then the policy is optimal at the tangent belief states.\n\nProof: By definition, belief states for which the backed up value function provides no improvement are tangent belief states. Hence, when all belief states reachable in one step are themselves tangent belief states, the set of tangent belief states is closed under every policy. Since there is no possibility of improvement, the current policy must be optimal at the tangent belief states. □\n\nAlthough Thm. 3 guarantees an optimal solution only at the tangent belief states, in practice they rarely form a proper subset of the belief space (when none of the reachable belief states can be improved). Note also that the escape algorithm assumes knowledge of the tangent belief states. Fortunately, the solution to the dual of the LP in Table 4 is a tangent belief state. Since most commercial LP solvers return both the solution of the primal and the dual, a tangent belief state is readily available for each node. 
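The node-improvement LP and the extraction of a tangent belief state from its dual can be sketched as follows. This is an illustrative reconstruction rather than the paper's Table 4: the variable layout (ε, c_a, o_{a,z,n'}), the use of scipy's HiGHS solver, and the sign handling of the dual values are our assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def improve_node(V, T, O, R, gamma, n):
    """Try to improve node n of a stochastic FSC.

    V: (nN, nS) node value functions; T: (nS, nA, nS); O: (nS, nA, nZ);
    R: (nS, nA).  Returns (eps, belief); if eps is ~0, the node cannot be
    improved and `belief` is a tangent belief state (from the LP dual).
    """
    nN, nS = V.shape
    _, nA, nZ = O.shape
    nvar = 1 + nA + nA * nZ * nN  # x = [eps, c_a ..., o_{a,z,n'} ...]
    oi = lambda a, z, m: 1 + nA + (a * nZ + z) * nN + m

    # M[s,a,z,n'] = sum_s' T[s,a,s'] O(s',a,z) V[n',s']
    M = np.einsum('sat,taz,mt->sazm', T, O, V)

    # One <= constraint per state s:
    # eps + V[n,s] <= sum_a c_a R(s,a) + gamma sum_{a,z,n'} o_{a,z,n'} M[s,a,z,n']
    A_ub = np.zeros((nS, nvar)); b_ub = np.zeros(nS)
    for s in range(nS):
        A_ub[s, 0] = 1.0
        A_ub[s, 1:1 + nA] = -R[s, :]
        for a in range(nA):
            for z in range(nZ):
                for m in range(nN):
                    A_ub[s, oi(a, z, m)] = -gamma * M[s, a, z, m]
        b_ub[s] = -V[n, s]

    # Equalities: sum_a c_a = 1, and sum_{n'} o_{a,z,n'} = c_a for each (a,z).
    A_eq = np.zeros((1 + nA * nZ, nvar)); b_eq = np.zeros(1 + nA * nZ)
    A_eq[0, 1:1 + nA] = 1.0; b_eq[0] = 1.0
    row = 1
    for a in range(nA):
        for z in range(nZ):
            A_eq[row, 1 + a] = -1.0
            for m in range(nN):
                A_eq[row, oi(a, z, m)] = 1.0
            row += 1

    c = np.zeros(nvar); c[0] = -1.0  # maximize eps
    bounds = [(None, None)] + [(0, None)] * (nvar - 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    # Dual values of the per-state constraints give the tangent belief state.
    belief = np.abs(res.ineqlin.marginals)
    if belief.sum() > 0:
        belief = belief / belief.sum()
    return res.x[0], belief
```

As a design note, the duals come for free with the primal solve, so BPI's escape procedure can collect one tangent belief state per node without any extra computation.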
(2)\n\n(2) A node may have more than one tangent belief state when an interval of its linear segment is tangent to the backed up value function, indicating that it is identical to some backed up node.\n\nFigure 3: Experimental results for the maze and tag-avoid problems (expected reward as a function of the number of nodes, and as a function of running time, for the Maze400 and Tag-Avoid problems).\n\n6 Experiments\n\nWe report some preliminary experiments with BPI and the escape method to assess their robustness against local optima, as well as their scalability to relatively large POMDPs. In a first experiment, we ran BPI with escape on a preference elicitation problem and a modified version of the Heaven-and-Hell problem described in [3]. It consistently found the optimal policy, whereas GA settles for a local optimum on both problems.\n\nIn a second experiment, we report the running time and decision quality of the controllers found for two large grid-world problems. The first is a 400-state extension of Hauskrecht's [8] maze problem, and the second is Pineau et al.'s [12] 870-state tag-avoid problem. In Figure 3, we report the expected return achieved w.r.t. time and number of nodes. For the maze problem, the expected return is averaged over all 400 states, since BPI tries to optimize the policy for all belief states simultaneously. For comparison purposes, the expected return for the tag-avoid problem is measured at the same initial belief state used in [12], even though BPI doesn't tailor its policy exclusively to that belief state. In contrast, many point-based algorithms, including PBVI [12] (which is perhaps the best such algorithm), optimize the policy for a single initial belief state, capitalizing on a hopefully small reachable belief region. On tag-avoid, BPI found a controller achieving the same expected return as that reported for PBVI. This suggests that most of the belief space is reachable in tag-avoid. We also ran BPI on the tiger-grid, hallway and hallway2 benchmark problems [12]; the expected returns obtained are measured at the same initial belief states used in [12], but without using them to tailor the policy. In contrast, PBVI achieved its expected returns on these problems with policies of linear segments tailored to those initial belief states. This suggests that only a small portion of the belief space is reachable in these benchmarks.
7 Conclusion\n\nWe have introduced the BPI algorithm, which guarantees monotonic improvement of the value function while keeping controller size fixed. While quite efficient, the algorithm may get trapped in local optima. An analysis of such local optima reveals that the value function of each node is tangent to the backed up value function. This property can be successfully exploited in an algorithm that escapes local optima quite robustly.\n\nThis research can be extended in a number of directions. State aggregation [2] and belief compression [13] techniques could easily be integrated with BPI to scale to problems with large state spaces. Also, since stochastic GA [11, 1] can tackle model-free problems (which BPI cannot), it would be interesting to see if tangent belief states could be computed for stochastic GA and used to design a heuristic to escape local optima similar to the one proposed for BPI.\n\nAcknowledgements We thank Darius Braziunas for his help with the implementation and the anonymous reviewers for their helpful comments.\n\nReferences\n\n[1] D. Aberdeen and J. Baxter. Scaling internal-state policy-gradient methods for POMDPs. Proc. ICML-02, pp. 3-10, Sydney, Australia, 2002.\n\n[2] C. Boutilier and D. Poole. Computing optimal policies for partially observable decision processes using compact representations. Proc. AAAI-96, pp. 1168-1175, Portland, OR, 1996.\n\n[3] D. Braziunas. Stochastic local search for POMDP controllers. Master's thesis, University of Toronto, Toronto, 2003.\n\n[4] A. R. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for POMDPs. Proc. UAI-97, pp. 54-61, Providence, RI, 1997.\n\n[5] H.-T. Cheng. Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British Columbia, Vancouver, 1988.\n\n[6] Z. Feng and E. A. Hansen. Approximate planning for factored POMDPs. Proc. ECP-01, Toledo, Spain, 2001.\n\n[7] E. A. Hansen. Solving POMDPs by searching in policy space. Proc. UAI-98, pp. 211-219, Madison, Wisconsin, 1998.\n\n[8] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33-94, 2000.\n\n[9] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.\n\n[10] N. Meuleau, K.-E. Kim, L. P. Kaelbling, and A. R. Cassandra. Solving POMDPs by searching the space of finite policies. Proc. UAI-99, pp. 417-426, Stockholm, 1999.\n\n[11] N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. Proc. UAI-99, pp. 427-436, Stockholm, 1999.\n\n[12] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: an anytime algorithm for POMDPs. Proc. IJCAI-03, Acapulco, Mexico, 2003.\n\n[13] P. Poupart and C. Boutilier. Value-directed compression of POMDPs. Proc. NIPS-02, pp. 1547-1554, Vancouver, Canada, 2002.\n\n[14] N. L. Zhang and W. Zhang. Speeding up the convergence of value-iteration in partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14:29-51, 2001.\n", "award": [], "sourceid": 2372, "authors": [{"given_name": "Pascal", "family_name": "Poupart", "institution": null}, {"given_name": "Craig", "family_name": "Boutilier", "institution": null}]}