{"title": "Symbolic Dynamic Programming for Continuous State and Observation POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1394, "page_last": 1402, "abstract": "Partially-observable Markov decision processes (POMDPs) provide a powerful model for real-world sequential decision-making problems. In recent years, point-based value iteration methods have proven to be extremely effective techniques for finding (approximately) optimal dynamic programming solutions to POMDPs when an initial set of belief states is known. However, no point-based work has provided exact point-based backups for both continuous state and observation spaces, which we tackle in this paper. Our key insight is that while there may be an infinite number of possible observations, there are only a finite number of observation partitionings that are relevant for optimal decision-making when a finite, fixed set of reachable belief states is known. To this end, we make two important contributions: (1) we show how previous exact symbolic dynamic programming solutions for continuous state MDPs can be generalized to continuous state POMDPs with discrete observations, and (2) we show how this solution can be further extended via recently developed symbolic methods to continuous state and observations to derive the minimal relevant observation partitioning for potentially correlated, multivariate observation spaces. We demonstrate proof-of-concept results on uni- and multi-variate state and observation steam plant control.", "full_text": "Symbolic Dynamic Programming for Continuous State and Observation POMDPs

Zahra Zamani (ANU & NICTA, Canberra, Australia) zahra.zamani@anu.edu.au
Scott Sanner (NICTA & ANU, Canberra, Australia) scott.sanner@nicta.com.au
Pascal Poupart (U. of Waterloo, Waterloo, Canada) ppoupart@uwaterloo.ca
Kristian Kersting (Fraunhofer IAIS & U. of Bonn, Bonn, Germany) kristian.kersting@iais.fraunhofer.de

Abstract

Point-based value iteration (PBVI) methods have proven extremely effective for finding (approximately) optimal dynamic programming solutions to partially-observable Markov decision processes (POMDPs) when a set of initial belief states is known. However, no PBVI work has provided exact point-based backups for both continuous state and observation spaces, which we tackle in this paper. Our key insight is that while there may be an infinite number of observations, there are only a finite number of continuous observation partitionings that are relevant for optimal decision-making when a finite, fixed set of reachable belief states is considered. To this end, we make two important contributions: (1) we show how previous exact symbolic dynamic programming solutions for continuous state MDPs can be generalized to continuous state POMDPs with discrete observations, and (2) we show how recently developed symbolic integration methods allow this solution to be extended to PBVI for continuous state and observation POMDPs with potentially correlated, multivariate continuous observation spaces.

1 Introduction

Partially-observable Markov decision processes (POMDPs) are a powerful modeling formalism for real-world sequential decision-making problems [3]. In recent years, point-based value iteration methods (PBVI) [5, 10, 11, 7] have proved extremely successful at scaling (approximately) optimal POMDP solutions to large state spaces when a set of initial belief states is known.
While PBVI has been extended to both continuous state and continuous observation spaces, no prior work has tackled both jointly without sampling.
[6] provides exact point-based backups for continuous state and discrete observation problems (with approximate sample-based extensions to continuous actions and observations), while [2] provides exact point-based backups (PBBs) for discrete state and continuous observation problems (where multivariate observations must be conditionally independent). While restricted to discrete states, [2] provides an important insight that we exploit in this work: only a finite number of partitionings of the observation space are required to distinguish between the optimal conditional policies over a finite set of belief states.
We propose two major contributions. First, we extend symbolic dynamic programming for continuous state MDPs [9] to POMDPs with discrete observations, arbitrary continuous reward, and transitions with discrete noise (i.e., a finite mixture of deterministic transitions). Second, we extend this symbolic dynamic programming algorithm to PBVI and the case of continuous observations (while restricting transition dynamics to be piecewise linear with discrete noise, rewards to be piecewise constant, and observation probabilities and beliefs to be uniform) by building on [2] to derive relevant observation partitions for potentially correlated, multivariate continuous observations.

2 Hybrid POMDP Model

A hybrid (discrete and continuous) partially observable MDP (H-POMDP) is a tuple ⟨S, A, O, T, R, Z, γ, h⟩. States S are given by the vector (ds, xs) = (ds1, ..., dsn, xs1, ..., xsm) where each dsi ∈ {0, 1} (1 ≤ i ≤ n) is boolean and each xsj ∈ R (1 ≤ j ≤ m) is continuous. We assume a finite, discrete action space A = {a1, ..., ar}. Observations O are given by the vector (do, xo) = (do1, ..., dop, xo1, ..., xoq) where each doi ∈ {0, 1} (1 ≤ i ≤ p) is boolean and each xoj ∈ R (1 ≤ j ≤ q) is continuous.
Three functions are required for modeling H-POMDPs: (1) T : S × A × S → [0, 1], a Markovian transition model defined as the probability of the next state given the action and previous state; (2) R : S × A → R, a reward function which returns the immediate reward of taking an action in some state; and (3) an observation function Z : S × A × O → [0, 1], which gives the probability of an observation given the outcome of a state after executing an action. A discount factor γ, 0 ≤ γ ≤ 1, is used to discount rewards t time steps into the future by γ^t.
We use a dynamic Bayes net (DBN)¹ to compactly represent the transition model T over the factored state variables and we use a two-layer Bayes net to represent the observation model Z:

T : p(x′s, d′s | xs, ds, a) = ∏_{i=1}^{n} p(d′si | xs, ds, a) · ∏_{j=1}^{m} p(x′sj | xs, ds, d′s, a).   (1)

Z : p(xo, do | x′s, d′s, a) = ∏_{i=1}^{p} p(doi | x′s, d′s, a) · ∏_{j=1}^{q} p(xoj | x′s, d′s, a).   (2)

Probabilities over discrete variables p(d′si | xs, ds, a) and p(doi | x′s, d′s, a) may condition on both discrete variables and (nonlinear) inequalities of continuous variables; this is further restricted to linear inequalities in the case of continuous observations. Transitions over continuous variables p(x′sj | xs, ds, d′s, a) must be deterministic (but arbitrarily nonlinear) piecewise functions; in the case of continuous observations they are further restricted to be piecewise linear. This permits discrete noise in the continuous transitions since they may condition on stochastically sampled discrete next-state variables d′s. Observation probabilities over continuous variables p(xoj | x′s, d′s, a) only occur in the case of continuous observations and are required to be piecewise constant (a mixture of uniform distributions); the same restriction holds for belief state representations. The reward R(d, x, a) may be an arbitrary (nonlinear) piecewise function in the case of discrete observations and a piecewise constant function in the case of continuous observations. We now provide concrete examples.
Example (Power Plant) [1] The steam generation system of a power plant evaporates feed-water under restricted pressure and temperature conditions to turn a steam turbine. A reward is obtained when electricity is generated from the turbine and the steam pressure and temperature are within safe ranges. Mixing water and steam makes the respective pressure and temperature observations po ∈ R and to ∈ R on the underlying state ps ∈ R and ts ∈ R highly uncertain. Actions A = {open, close} control temperature and pressure by means of a pressure valve.
We initially present two H-POMDP variants labeled 1D-Power Plant using a single temperature state variable ts.
The transition and reward are common to both: temperature increments (decrements) with a closed (opened) valve, a large negative reward is given for a closed valve with ts exceeding the critical threshold 15, and positive reward is given for a safe, electricity-producing state:

p(t′s | ts, a) = δ( t′s - [ (a = open) : ts - 5 ; (a = close) : ts + 7 ] )
R(ts, a) = { (a = open) : -1 ; (a = close) ∧ (ts > 15) : -1000 ; (a = close) ∧ ¬(ts > 15) : 100 }   (3)

Next we introduce the Discrete Obs. 1D-Power Plant variant where we define an observation space with a single discrete binary variable o ∈ O = {high, low}:

p(o = high | t′s, a = open) = { t′s ≤ 15 : 0.9 ; t′s > 15 : 0.1 }
p(o = high | t′s, a = close) = { t′s ≤ 15 : 0.7 ; t′s > 15 : 0.3 }   (4)

¹We disallow general synchronic arcs for simplicity of exposition but note their inclusion only places restrictions on the variable elimination ordering used during the dynamic programming backup operation.

Figure 1: (left) Example conditional plan βh for discrete observations; (right) example α-function for βh over state b ∈ {0, 1}, x ∈ R in decision diagram form: the true (1) branch is solid, the false (0) branch is dashed.

Finally we introduce the Cont. Obs.
1D-Power Plant variant where we define an observation space with a single continuous variable to, uniformly distributed on an interval of 10 units centered at t′s:

p(to | t′s, a = open) = U(to; t′s - 5, t′s + 5) = { (to > t′s - 5) ∧ (to < t′s + 5) : 0.1 ; (to ≤ t′s - 5) ∨ (to ≥ t′s + 5) : 0 }   (5)

While simple, we note no prior method could perform exact point-based backups for either problem.

3 Value Iteration for Hybrid POMDPs

In an H-POMDP, the agent does not directly observe the states and thus must maintain a belief state b(xs, ds) = p(xs, ds). For a given belief state b = b(xs, ds), a POMDP policy π can be represented by a tree corresponding to a conditional plan β. An h-step conditional plan βh can be defined recursively in terms of (h - 1)-step conditional plans as shown in Fig. 1 (left). Our goal is to find a policy π that maximizes the value function, defined as the sum of expected discounted rewards over horizon h starting from initial belief state b:

V^h_π(b) = E_π[ Σ_{t=0}^{h} γ^t · r_t | b_0 = b ]   (6)

where r_t is the reward obtained at time t and b_0 is the belief state at t = 0. For finite h and belief state b, the optimal policy π is given by an h-step conditional plan βh.
For h = ∞, the optimal discounted (γ < 1) value can be approximated arbitrarily closely by a sufficiently large h [3]. Even when the state is continuous (but the actions and observations are discrete), the optimal POMDP value function for finite horizon h is a piecewise linear and convex function of the belief state b [6]; hence V^h is given by a maximum over a finite set of "α-functions" α^h_i:

V^h(b) = max_{α^h_i ∈ Γ^h} ⟨α^h_i, b⟩ = max_{α^h_i ∈ Γ^h} ∫_{xs} Σ_{ds} α^h_i(xs, ds) · b(xs, ds) dxs   (7)

Later on, when we tackle continuous state and observations, we note that we will dynamically derive an optimal, finite partitioning of the observation space for a given belief state and hence reduce the continuous observation problem back to a discrete observation problem at every horizon.
The Γ^h in this optimal h-stage-to-go value function can be computed via Monahan's dynamic programming approach to value iteration (VI) [4].
Initializing α^0_1 = 0 and Γ^0 = {α^0_1}, and assuming discrete observations o ∈ O^h, Γ^h is obtained from Γ^{h-1} as follows:²

g^h_{a,o,j}(xs, ds) = ∫_{x′s} Σ_{d′s} p(o | x′s, d′s, a) p(x′s, d′s | xs, ds, a) α^{h-1}_j(x′s, d′s) dx′s ;  ∀ α^{h-1}_j ∈ Γ^{h-1}   (8)

Γ^h_a = R(xs, ds, a) ⊕ γ ⊕_{o∈O} { g^h_{a,o,j}(xs, ds) }_j   (9)

Γ^h = ∪_a Γ^h_a   (10)

²The ⊕ of sets is defined as ⊕_{j∈{1,...,n}} Sj = S1 ⊕ ··· ⊕ Sn where the pairwise cross-sum P ⊕ Q = {p + q | p ∈ P, q ∈ Q}.

Algorithm 1: PBVI(H-POMDP, H, B = {bi}) → ⟨V^h⟩
 1  begin
 2    V^0 := 0, h := 0, Γ^0_PBVI := {α^0_1}
 3    while h < H do
 4      h := h + 1, Γ^h := ∅, Γ^h_PBVI := ∅
 5      foreach bi ∈ B do
 6        foreach a ∈ A do
 7          Γ^h_a := ∅
 8          if (continuous observations: q > 0) then
 9            // Derive relevant observation partitions O^h_i for belief bi
10            ⟨O^h_i, p(O^h_i | x′s, d′s, a)⟩ := GenRelObs(Γ^{h-1}_PBVI, a, bi)
11          else
12            // Discrete observations and model already known
13            O^h_i := {do}; p(O^h_i | x′s, d′s, a) := see Eq (2)
14          foreach o ∈ O^h_i do
15            foreach α^{h-1}_j ∈ Γ^{h-1}_PBVI do
16              α^{h-1}_j := Prime(α^{h-1}_j)  // ∀ di: di → d′i and ∀ xi: xi → x′i
17              g^h_{a,o,j} := see Eq (8)
18          Γ^h_a := see Eq (9)
19          Γ^h := Γ^h ∪ Γ^h_a
20
21
22      // Retain only α-functions optimal at each belief point
23      foreach bi ∈ B do
24        α^h_{bi} := arg max_{αj ∈ Γ^h} αj · bi
25        Γ^h_PBVI := Γ^h_PBVI ∪ {α^h_{bi}}
26
27      // Terminate if early convergence
28      if Γ^h_PBVI = Γ^{h-1}_PBVI then
29        break
30
31    return Γ_PBVI
32  end

Point-based value iteration (PBVI) [5, 11] computes the value function only for a set of belief states {bi} where bi := p(xs, ds). The idea is straightforward and the main modification needed to Monahan's VI approach in Algorithm 1 is the loop from lines 23-25, where only α-functions optimal at some belief state are retained for subsequent iterations. In the case of continuous observation variables (q > 0), we will need to derive a relevant set of observations on line 10, a key contribution of this work as described in Section 4.3. Otherwise, if the observations are only discrete (q = 0), then a finite set of observations is already known along with the observation function given in Eq (2).
We remark that Algorithm 1 is a generic framework that can be used for both PBVI and other variants of approximate VI. If used for PBVI, an efficient direct backup computation of the optimal α-function for belief state bi should be used in line 17 that is linear in the number of observations [5, 11] and which obviates the need for lines 23-25.
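The discrete-observation backup of Eqs (8)-(10) together with the point-based pruning of lines 23-25 can be sketched numerically. This is purely an illustration on a crude temperature grid for the 1D-Power Plant, not the closed-form symbolic computation this paper performs; the grid resolution, discount factor, and the Gaussian test belief are assumptions made here:

```python
# Illustrative sketch (not the paper's XADD implementation): Monahan-style
# backup with point-based pruning, using a crude grid discretization of the
# 1D-Power Plant state in place of symbolic case statements.
import numpy as np

XS = np.linspace(0, 30, 61)                        # discretized temperature grid
ACTIONS = {"open": -5, "close": +7}                # deterministic shifts (Eq 3)

def reward(a):
    if a == "open":
        return np.full_like(XS, -1.0)
    return np.where(XS > 15, -1000.0, 100.0)

def p_obs(o, a):                                   # discrete model of Eq (4)
    p_high = np.where(XS <= 15, 0.9, 0.1) if a == "open" else np.where(XS <= 15, 0.7, 0.3)
    return p_high if o == "high" else 1.0 - p_high

def shift_eval(f, shift):                          # f evaluated at next state x' = x + shift
    return np.interp(XS + shift, XS, f)

def backup(Gamma, gamma=0.95):
    new_Gamma = []
    for a, shift in ACTIONS.items():
        # g^h_{a,o,j} of Eq (8): p(o|x',a) * alpha_j(x') at the deterministic next state
        g = {o: [shift_eval(p_obs(o, a) * alpha, shift) for alpha in Gamma]
             for o in ("high", "low")}
        # Eq (9): cross-sum over observations of all choices of alpha_j
        for gh in g["high"]:
            for gl in g["low"]:
                new_Gamma.append(reward(a) + gamma * (gh + gl))
    return new_Gamma                               # Eq (10): union over actions

def prune_to_beliefs(Gamma, beliefs):              # lines 23-25: keep belief-optimal alphas
    keep = {max(range(len(Gamma)), key=lambda j: Gamma[j] @ b) for b in beliefs}
    return [Gamma[j] for j in sorted(keep)]

b1 = np.exp(-0.5 * ((XS - 10) / 2) ** 2)           # an example belief (assumed)
b1 /= b1.sum()
Gamma = [np.zeros_like(XS)]
for _ in range(3):
    Gamma = prune_to_beliefs(backup(Gamma), [b1])
print(len(Gamma), "alpha-function(s) retained")
```

Running the sketch retains a single α-function per iteration for the single test belief, mirroring how the pruning of lines 23-25 keeps only the belief-optimal elements of Γ^h.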
However, for an alternate version of approximate value iteration that will often produce more accurate values for belief states other than those in B, one may instead retain the full cross-sum backup of line 17, but omit lines 23-25; this yields an approximate VI approach (using discretized observations relevant only to a chosen set of belief states B if continuous observations are present) that is not restricted to α-functions only optimal at B, hence allowing greater flexibility in approximating the value function over all belief states.
Whereas PBVI is optimal if all reachable belief states within horizon H are enumerated in B, in the H-POMDP setting the generation of continuous observations will most often lead to an infinite number of reachable belief states, even with finite horizon; this makes it quite difficult to provide optimality guarantees in the general case of PBVI for continuous observation settings. Nonetheless, PBVI has been quite successful in practice without exhaustive enumeration of all reachable beliefs [5, 10, 11, 7], which motivates our use of PBVI in this work.

4 Symbolic Dynamic Programming

In this section we take a symbolic dynamic programming (SDP) approach to implementing VI and PBVI as defined in the last section. To do this, we need only show that all required operations can be computed efficiently and in closed form, which we do next, building on SDP for MDPs [9].

4.1 Case Representation and Extended ADDs

The previous Power Plant examples represented all functions in case form, generally defined as

f = { φ1 : f1 ; ... ; φk : fk }

and this is the form we use to represent all functions in an H-POMDP.
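Before turning to the operations on this representation, a minimal sketch of it may help; the sketch is illustrative only (intervals over a single continuous variable stand in for the general boolean/linear partitions handled by XADDs, and all names here are ours):

```python
# A minimal sketch of the case representation, assuming univariate interval
# partitions; a "case" is a list of (lo, hi, f) triples giving value f(x)
# on the interval (lo, hi).
from itertools import product

def binary_op(case1, case2, op):
    """Cross-product of the logical partitions of two case statements,
    combining paired values with `op` and discarding inconsistent
    (empty-intersection) partitions."""
    out = []
    for (lo1, hi1, f1), (lo2, hi2, f2) in product(case1, case2):
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo < hi:                        # consistent partition
            out.append((lo, hi, op(f1, f2)))
    return out

def cross_sum(c1, c2):                     # the (+) cross-sum operation
    return binary_op(c1, c2, lambda f, g: lambda x: f(x) + g(x))

def casemax(c1, c2):
    # Pointwise max per paired partition; the general symbolic casemax also
    # introduces the extra f > g / f <= g constraints shown in the text.
    return binary_op(c1, c2, lambda f, g: lambda x: max(f(x), g(x)))

f = [(0, 15, lambda x: 100.0), (15, 30, lambda x: -1000.0)]
g = [(0, 10, lambda x: 0.5 * x), (10, 30, lambda x: 5.0)]
h = cross_sum(f, g)
print([(lo, hi) for lo, hi, _ in h])       # refined partitioning of the cross-sum
```

Note how the cross-sum refines the partitioning: the two input cases with two partitions each yield three consistent partitions here, which is exactly why α-functions "grow" with the horizon.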
The φi are disjoint logical formulae defined over xs, ds and/or xo, do with logical (∧, ∨, ¬) combinations of boolean variables and inequalities (≥, >, ≤, <) over continuous variables. For discrete observation H-POMDPs, the fi and inequalities may use any function (e.g., sin(x1) > log(x2) · x3); for continuous observations, they are restricted to linear inequalities and linear or piecewise constant fi as described in Section 2.
A unary operation such as scalar multiplication c · f (for some constant c ∈ R) or negation -f on a case statement simply applies the operation to each case partition fi (1 ≤ i ≤ k). A binary operation on two case statements takes the cross-product of the logical partitions of each case statement and performs the corresponding operation on the resulting paired partitions. The cross-sum ⊕ of two cases is defined as the following:

{ φ1 : f1 ; φ2 : f2 } ⊕ { ψ1 : g1 ; ψ2 : g2 } = { φ1 ∧ ψ1 : f1 + g1 ; φ1 ∧ ψ2 : f1 + g2 ; φ2 ∧ ψ1 : f2 + g1 ; φ2 ∧ ψ2 : f2 + g2 }

Likewise ⊖ and ⊗ are defined by subtracting or multiplying partition values. Inconsistent partitions can be discarded when they are irrelevant to the function value. A symbolic case maximization is defined as below:

casemax( { φ1 : f1 ; φ2 : f2 } , { ψ1 : g1 ; ψ2 : g2 } ) = { φ1 ∧ ψ1 ∧ (f1 > g1) : f1 ; φ1 ∧ ψ1 ∧ (f1 ≤ g1) : g1 ; φ1 ∧ ψ2 ∧ (f1 > g2) : f1 ; φ1 ∧ ψ2 ∧ (f1 ≤ g2) : g2 ; ... }

The following SDP operations on case statements require more detail than can be provided here, hence we refer the reader to the relevant literature:

• Substitution f σ: Takes a set σ of variables and their substitutions (which may be case statements themselves), and carries out all variable substitutions [9].

• Integration ∫_{x1} f dx1: There are two forms: if x1 is involved in a δ-function (cf. the transition in Eq (3)) then the integral is equivalent to a symbolic substitution and can be applied to any case statement (cf. [9]).
Otherwise, if f is in linearly constrained polynomial case form, then the approach of [8] can be applied to yield a result in the same form.

Case operations yield a combinatorial explosion in size if naïvely implemented, hence we use the data structure of the extended algebraic decision diagram (XADD) [9], as shown in Figure 1 (right), to compactly represent case statements and efficiently support the above case operations with them.

4.2 VI for Hybrid State and Discrete Observations

For H-POMDPs with only discrete observations o ∈ O and observation function p(o|x′s, d′s, a) in the form of Eq (4), we introduce a symbolic version of Monahan's VI algorithm. In brief, we note that all VI operations needed in Section 3 apply directly to H-POMDPs, e.g., rewriting Eq (8):

g^h_{a,o,j}(xs, ds) = ∫_{x′s} ⊕_{d′s} [ p(o | x′s, d′s, a) ⊗ ( ⊗_{j=1}^{m} p(x′sj | xs, ds, d′s, a) ) ⊗ ( ⊗_{i=1}^{n} p(d′si | xs, ds, a) ) ⊗ α^{h-1}_j(x′s, d′s) ] dx′s   (11)

Algorithm 2: GenRelObs(Γ^{h-1}, a, bi) → ⟨O^h, p(O^h | x′s, d′s, a)⟩
 1  begin
 2    // Perform exact 1-step DP backup of α-functions at horizon h - 1
 3    foreach αj(x′s, d′s) ∈ Γ^{h-1} and a ∈ A do
 4      α^a_j(xs, ds, xo, do) := ∫_{x′s} ⊕_{d′s} p(xo, do | x′s, d′s, a) ⊗ p(x′s, d′s | xs, ds, a) ⊗ αj(x′s, d′s) dx′s
 5    // Generate value of each α-vector at belief point bi(xs, ds) as a function of observations
 6    foreach α^a_j(xs, ds, xo, do) do
 7      δ^a_j(xo, do) := ∫_{xs} ⊕_{ds} bi(xs, ds) ⊗ α^a_j(xs, ds, xo, do) dxs
 8    // Using casemax, generate observation partitions relevant to each policy (see text for details)
 9    O^h := extract-partition-constraints[ casemax( δ^{a1}_1(xo, do), δ^{a2}_1(xo, do), ..., δ^{ar}_j(xo, do) ) ]
10    foreach ok ∈ O^h do
11      // Let φ_ok be the partition constraints for observation ok ∈ O^h
12      p(O^h = ok | x′s, d′s, a) := ∫_{xo} ⊕_{do} p(xo, do | x′s, d′s, a) ⊗ I[φ_ok] dxo
13    return ⟨O^h, p(O^h | x′s, d′s, a)⟩
14  end

Figure 2: (left) Beliefs b1, b2 for Cont. 1D-Power Plant; (right) derived observation partitions for b2 at h = 2.

Crucially, we note that since the continuous transition cpfs p(x′sj | xs, ds, d′s, a) are deterministic and hence defined with Dirac δ's (e.g., Eq 3) as described in Section 2, the integral ∫_{x′s} can always be computed in closed case form as discussed in Section 4.1. In short, nothing additional is required for PBVI on H-POMDPs in this case: the key insight is simply that α-functions are now represented by case statements and can "grow" with the horizon as they partition the state space more and more finely.

4.3 PBVI for Hybrid State and Hybrid Observations

In general, it would be impossible to apply standard VI to H-POMDPs with continuous observations since the number of observations is infinite. However, building on ideas in [2], in the case of PBVI it is possible to derive a finite set of continuous observation partitions that permit exact point-based backups at a belief point.
This additional operation (GenRelObs) appears on line 10 of PBVI in Algorithm 1 in the case of continuous observations and is formally defined in Algorithm 2.
To demonstrate the generation of relevant continuous observation partitions, we use the second iteration of the Cont. Obs. 1D-Power Plant along with two belief points represented as uniform distributions: b1 : U(ts; 2, 6) and b2 : U(ts; 6, 11), as shown in Figure 2 (left). Letting h = 2, we will assume simply for expository purposes that |Γ^1| = 1 (i.e., it contains only one α-function) and that in lines 2-4 of Algorithm 2 we have computed the following two α-functions for a ∈ {open, close}:

α^close_1(ts, to) = { (ts < 15) ∧ (ts - 10 < to < ts) : 10 ; (ts ≥ 15) ∧ (ts - 10 < to < ts) : -100 ; ¬(ts - 10 < to < ts) : 0 }

α^open_1(ts, to) = { (ts - 10 < to < ts) : 0.1 ; ¬(ts - 10 < to < ts) : 0 }

We now need the α-vectors as a function of the observation space for a particular belief state, thus next we marginalize out xs, ds in lines 5-7. The resulting δ-functions are shown as follows, where for brevity from this point forward, 0 partitions are suppressed in the cases:

δ^close_1(to) = { (14 < to < 18) : 0.025to - 0.45 ; (8 < to < 14) : -0.1 ; (4 < to < 8) : -0.025to - 0.1 }

δ^open_1(to) = { (15 < to < 18) : 25to - 450 ; (14 < to < 15) : -2.5to - 37.5 ; (8 < to < 14) : -72.5 ; (5 < to < 8) : -25to + 127.5 ; (4 < to < 5) : 2.5to - 10 }

Both δ^close_1(to) and δ^open_1(to) are drawn graphically in Figure 2 (right). These observation-dependent δ's divide the observation space into regions which can yield the optimal policy according to the belief state b2. Following [2], we need to find the optimal boundaries or partitions of the observation space; in their work, numerical solutions are proposed to find these boundaries in one dimension (multiple observations are handled through an independence assumption). Instead, here we leverage the symbolic power of the casemax operator defined in Section 4.1 to find all the partitions where each potentially correlated, multivariate observation δ is optimal. For the two δ's above, the following partitions of the observation space are derived by the casemax operator in line 9:

casemax( δ^close_1(to), δ^open_1(to) ) = { o1 : (14 < to ≤ 18) : 0.025to - 0.45 ; o1 : (8 < to ≤ 14) : -0.1 ; o1 : (5.1 < to ≤ 8) : -0.025to - 0.1 ; o2 : (5 < to ≤ 5.1) : -25to + 127.5 ; o2 : (4 < to ≤ 5) : 2.5to - 10 }

Here we have labeled with o1 the observations where δ^close_1 is maximal and with o2 the observations where δ^open_1 is maximal. What we really care about, though, are just the constraints identifying o1 and o2, and this is the task of extract-partition-constraints in line 9.
This would associate with o1 the partition constraint φ_o1 ≡ (5.1 < to ≤ 8) ∨ (8 < to ≤ 14) ∨ (14 < to ≤ 18) and with o2 the partition constraint φ_o2 ≡ (4 < to ≤ 5) ∨ (5 < to ≤ 5.1); taking into account the 0 partitions and the 1D nature of this example, we can further simplify φ_o1 ≡ (to > 5.1) and φ_o2 ≡ (to ≤ 5.1).
Given these relevant observation partitions, our final task in lines 10-12 is to compute the probabilities of each observation partition φ_ok. This is simply done by marginalizing over the observation function p(O^h | x′s, d′s, a) within each region defined by φ_ok (achieved by multiplying by an indicator function I[φ_ok] over these constraints). To better understand what is computed here, we can compute the probability p(ok|bi, a) of each observation for a particular belief, calculated as follows:

p(ok | bi, a) := ∫_{x′s} ⊕_{d′s} ∫_{xs} ⊕_{ds} p(ok | x′s, d′s, a) ⊗ p(x′s, d′s | xs, ds, a) ⊗ bi(xs, ds) dx′s dxs   (12)

Specifically, for b2, we obtain p(o1|b2, a = close) = 0.0127 and p(o2|b2, a = close) = 0.933, as shown in Figure 2 (right).
In summary, in this section we have shown how we can extend the exact dynamic programming algorithm for the continuous state, discrete observation POMDP setting from Section 4.2 to compute exact 1-step point-based backups in the continuous observation setting; this was accomplished through the crucial insight that despite the infinite number of observations, using Algorithm 2 we can symbolically derive a set of relevant observations for each belief point that distinguish the optimal policy and hence value, as graphically illustrated in Figure 2 (right).
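The restriction-and-integration step of lines 10-12 can be made concrete for this 1D example; the sketch below assumes the expository uniform observation model U(to; t′s - 10, t′s) used for the δ-functions above and the simplified constraint φ_o2 ≡ (to ≤ 5.1), and is an illustration rather than the paper's symbolic integration:

```python
# Sketch of lines 10-12 of Algorithm 2 for the 1D example: restrict the
# assumed uniform observation density U(t_o; t'_s - 10, t'_s) to the
# partition phi_o2 = (t_o <= 5.1) and integrate out t_o, giving
# p(o2 | t'_s) as a piecewise-linear function of the next state t'_s.
def p_o2_given_ts_next(ts_next, theta=5.1, width=10.0):
    """Closed-form integral of the uniform density over (t_o <= theta)."""
    lo = ts_next - width                  # support is (ts_next - width, ts_next)
    overlap = min(ts_next, theta) - lo    # length of support below theta
    return min(max(overlap / width, 0.0), 1.0)

# p(o1 | t'_s) is the complementary partition (t_o > 5.1)
for ts_next in (4.0, 10.0, 16.0):
    p2 = p_o2_given_ts_next(ts_next)
    print(ts_next, round(p2, 3), round(1 - p2, 3))
```

The result is piecewise linear in t′s, which is exactly the form p(O^h = ok | x′s, a) takes when the indicator I[φ_ok] is multiplied into a piecewise-constant (uniform) density and the observation variable is integrated out.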
Next we present some empirical results for 1- and 2-dimensional continuous state and observation spaces.

5 Empirical Results

We evaluated our continuous POMDP solution using XADDs on the 1D-Power Plant example and another variant of this problem with two variables, described below.³
2D-Power Plant: We consider the more complex model of the power plant similar to [1], where the pressure inside the water tank must be controlled to avoid mixing water into the steam (leading to explosion of the tank). We model an observable pressure reading po as a function of the underlying pressure state ps. Again we have two actions for opening and closing a pressure valve. The close action has transition

p(t′s | ts, a = close) = δ[ t′s - (ts + 10) ]
p(p′s | ps, a = close) = δ( p′s - [ (ps + 10 > 20) : 20 ; ¬(ps + 10 > 20) : ps + 10 ] )

and yields high reward for staying within the safe temperature and pressure range:

R(ts, ps, a = close) = { (5 ≤ ps ≤ 15) ∧ (95 ≤ ts ≤ 105) : 50 ; (5 ≤ ps ≤ 15) ∧ (ts ≤ 95) : -1 ; (ps ≥ 15) : -5 ; else : -3 }

Alternately, for the open action, the transition functions reduce the temperature by 5 units and the pressure by 10 units as long as the pressure stays above zero. For the open reward function, we assume that there is always a small constant penalty (-1) since no electricity is produced.
Observations are distributed uniformly within a region depending on their underlying state:

p(to | t′s) = { (t′s + 80 < to < t′s + 105) : 0.04 ; ¬(t′s + 80 < to < t′s + 105) : 0 }
p(po | p′s) = { (p′s < po < p′s + 10) : 0.1 ; ¬(p′s < po < p′s + 10) : 0 }

Finally, for PBVI, we define two uniform beliefs as follows: b1 : U[ts; 90, 100] * U[ps; 0, 10] and b2 : U[ts; 90, 130] * U[ps; 10, 30].

³Full problem specifications and Java code to reproduce these experiments are available online on Google Code: http://code.google.com/p/cpomdp .

Figure 3: (left) time vs. horizon, and (right) space (total # XADD nodes in α-functions) vs. horizon.

In Figure 3, a time and space analysis of the two versions of Power Plant has been performed for up to horizon h = 6. This experimental evaluation relies on one additional approximation over the PBVI approach of Algorithm 1 in that it substitutes p(O^h|b, a) in place of p(O^h|x′s, d′s, a); while this yields correct observation probabilities for a point-based backup at a particular belief state b, the resulting α-functions represent an approximation for other belief states. In general, the PBVI framework in this paper does not require this approximation, although when appropriate, using it should increase computational efficiency.
Figure 3 shows that the computation time required per iteration generally increases, since more complex α-functions lead to a larger number of observation partitions and thus a more expensive backup operation.
While an order of magnitude more time is required when the number of state and observation variables is doubled, one can see that the PBVI approach leads to a fairly constant amount of computation time per horizon, which indicates that long horizons should be computable for any problem for which at least one horizon can be computed in an acceptable amount of time.

6 Conclusion

We presented the first exact symbolic operations for PBVI in an expressive subset of H-POMDPs with continuous state and observations. Unlike related work that has extended to the continuous state and observation setting [6], we do not approach the problem by sampling. Rather, following [2], the key contribution of this work was to define a discrete set of observation partitions on the multivariate continuous observation space via symbolic maximization techniques and to derive the related probabilities using symbolic integration. An important avenue for future work is to extend these techniques to the case of continuous state, observation, and action H-POMDPs.

Acknowledgments

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the ARC through the ICT Centre of Excellence program. This work was supported by the Fraunhofer ATTRACT fellowship STREAM and by the EC, FP7-248258-First-MM.

References

[1] Mario Agueda and Pablo Ibarguengoytia. An architecture for planning in uncertain domains. In Proceedings of the ICTAI 2002 Conference, Dallas, Texas, 2002.

[2] Jesse Hoey and Pascal Poupart. Solving POMDPs with continuous or large discrete observation spaces.
In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005.

[3] Leslie P. Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[4] G. E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.

[5] Joelle Pineau, Geoffrey J. Gordon, and Sebastian Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research (JAIR), 27:335–380, 2006.

[6] J. M. Porta, N. Vlassis, M. T. J. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329–2367, 2006.

[7] Pascal Poupart, Kee-Eung Kim, and Dongho Kim. Closing the gap: Improved bounds on optimal POMDP solutions. In Proceedings of the 21st International Conference on Automated Planning and Scheduling (ICAPS-11), 2011.

[8] Scott Sanner and Ehsan Abbasnejad. Symbolic variable elimination for discrete and continuous graphical models. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, 2012.

[9] Scott Sanner, Karina Valdivia Delgado, and Leliane Nunes de Barros. Symbolic dynamic programming for discrete and continuous state MDPs. In Proceedings of the 27th Conference on Uncertainty in AI (UAI-2011), Barcelona, 2011.

[10] Trey Smith and Reid G. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI), 2005.

[11] M. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs.
Journal of Artificial Intelligence Research (JAIR), 24:195–220, 2005.