{"title": "A POMDP Extension with Belief-dependent Rewards", "book": "Advances in Neural Information Processing Systems", "page_first": 64, "page_last": 72, "abstract": "Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce rho-POMDPs, an extension of POMDPs where the reward function rho depends on the belief state. We show that, under the common assumption that rho is convex, the value function is also convex, which makes it possible to (1) approximate rho arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes.", "full_text": "A POMDP Extension with Belief-dependent Rewards\n\nMauricio Araya-L\u00f3pez\n\nOlivier Buffet\n\nVincent Thomas\n\nFran\u00e7ois Charpillet\n\nLORIA \u2013 Campus Scientifique \u2013 BP 239\n\n54506 Vandoeuvre-l\u00e8s-Nancy Cedex \u2013 France\n\nNancy Universit\u00e9 / INRIA\n\nfirstname.lastname@loria.fr\n\nAbstract\n\nPartially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce \u03c1POMDPs, an extension of POMDPs where the reward function \u03c1 depends on the belief state. 
We show that, under the common assumption that \u03c1 is convex, the value function is also convex, which makes it possible to (1) approximate \u03c1 arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes.\n\n1 Introduction\n\nSequential decision-making problems under uncertainty and partial observability are typically modeled using Partially Observable Markov Decision Processes (POMDPs) [1], where the objective is to decide how to act so that the sequence of visited states optimizes some performance criterion. However, this formalism is not expressive enough to model problems with arbitrary objective functions.\nLet us consider active sensing problems, where the objective is to act so as to acquire knowledge about certain state variables. Medical diagnosis, for example, is about asking the right questions and performing the appropriate exams so as to diagnose a patient at low cost and with high certainty. This can be formalized as a POMDP by rewarding\u2014if successful\u2014a final action consisting of expressing the diagnoser\u2019s \u201cbest guess\u201d. Indeed, a large body of work formalizes active sensing with POMDPs [2, 3, 4].\nAn issue is that, in some problems, the objective needs to be directly expressed in terms of the uncertainty/information on the state, e.g., to minimize the entropy over a given state variable. In such cases, POMDPs are not appropriate because the reward function depends on the state and the action, not on the knowledge of the agent. Instead, we need a model where the instant reward depends on the current belief state. The belief MDP formalism provides the needed expressiveness for these problems. Yet, there is little research on specific algorithms to solve belief MDPs directly, so such problems are usually forced to fit in the POMDP framework, which means changing the original problem definition. 
One can argue that acquiring information is always a means, not an end, and thus a \u201cwell-defined\u201d sequential decision-making problem with partial observability must always be modeled as a normal POMDP. However, in a number of cases the problem designer has decided to separate the task of looking for information from that of exploiting information. Let us mention two examples: (i) the surveillance [5] and (ii) the exploration [2] of a given area, in both cases when one does not know what to expect from these tasks\u2014and thus how to react to the discoveries.\nAfter reviewing some background knowledge on POMDPs in Section 2, Section 3 introduces \u03c1POMDPs\u2014an extension of POMDPs where the reward is a (typically convex) function of the belief state\u2014and proves that the convexity of the value function is preserved. Then we show how classical solving algorithms can be adapted depending on whether the reward function is piecewise linear (Sec. 3.3) or not (Sec. 4).\n\n2 Partially Observable MDPs\n\nThe general problem that POMDPs address is for the agent to find a decision policy \u03c0 choosing, at each time step, the best action based on its past observations and actions in order to maximize its future gain (which can be measured, for example, through the total accumulated reward or the average reward per time step). 
Compared to classical deterministic planning, the agent faces the difficulty of accounting for a system not only with uncertain dynamics but also whose current state is imperfectly known.\n\n2.1 POMDP Description\n\nFormally, POMDPs are defined by a tuple \u27e8S, A, \u2126, T, O, r, b0\u27e9 where, at any time step, the system is in some state s \u2208 S (the state space) and the agent performs an action a \u2208 A (the action space), which results in (1) a transition to a state s\u2032 according to the transition function T(s, a, s\u2032) = Pr(s\u2032|s, a), (2) an observation o \u2208 \u2126 (the observation space) according to the observation function O(s\u2032, a, o) = Pr(o|s\u2032, a), and (3) a scalar reward r(s, a). b0 is the initial probability distribution over states. Unless stated otherwise, the state, action and observation sets are finite [6].\nThe agent can typically reason about the state of the system by computing a belief state b \u2208 \u2206 = \u03a0(S) (the set of probability distributions over S),1 using the following update formula (based on Bayes\u2019 rule) when performing action a and observing o:\n\nb^{a,o}(s\u2032) = [O(s\u2032, a, o) / Pr(o|a, b)] \u03a3_{s\u2208S} T(s, a, s\u2032) b(s),\n\nwhere Pr(o|a, b) = \u03a3_{s,s\u2032\u2032\u2208S} O(s\u2032\u2032, a, o) T(s, a, s\u2032\u2032) b(s). Using belief states, a POMDP can be rewritten as an MDP over the belief space, or belief MDP, \u27e8\u2206, A, \u03c4, \u03c1\u27e9, where the new transition function \u03c4 and reward function \u03c1 are defined respectively over \u2206 \u00d7 A \u00d7 \u2206 and \u2206 \u00d7 A. With this reformulation, a number of theoretical results about MDPs can be extended, such as the existence of a deterministic policy that is optimal. 
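As a concrete illustration, the belief update above can be sketched with NumPy; the function name and array layout are our own assumptions, not from the paper:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes update: b^{a,o}(s') = O(s',a,o) * sum_s T(s,a,s') b(s) / Pr(o|a,b).

    b : (|S|,) belief vector; T : (|S|,|A|,|S|) with T[s,a,s'] = Pr(s'|s,a);
    O : (|S|,|A|,|Omega|) with O[s',a,o] = Pr(o|s',a).
    Returns the updated belief and the normalizer Pr(o|a,b).
    """
    predicted = b @ T[:, a, :]        # sum_s T(s,a,s') b(s), shape (|S|,)
    unnorm = O[:, a, o] * predicted   # weight by the observation likelihood
    p_o = unnorm.sum()                # Pr(o|a,b)
    return unnorm / p_o, p_o
```

For instance, with static two-state dynamics and a noisy sensor, observing a given symbol shifts the belief toward the state that is more likely to emit it.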
An issue is that, even if a POMDP has a finite number of states, the corresponding belief MDP is defined over a continuous\u2014and thus infinite\u2014belief space.\nIn this continuous MDP, the objective is to maximize the cumulative reward by looking for a policy taking the current belief state as input. More formally, we are searching for a policy verifying \u03c0\u2217 = argmax_{\u03c0\u2208A^\u2206} J^\u03c0(b0), where J^\u03c0(b0) = E[\u03a3_{t=0}^{\u221e} \u03b3^t \u03c1_t | b0, \u03c0], \u03c1_t being the expected immediate reward obtained at time step t, and \u03b3 a discount factor. Bellman\u2019s principle of optimality [7] lets us compute the function J^{\u03c0\u2217} recursively through the value function\n\nVn(b) = max_{a\u2208A} [\u03c1(b, a) + \u03b3 \u222b_{b\u2032\u2208\u2206} \u03c4(b, a, b\u2032) Vn\u22121(b\u2032) db\u2032] = max_{a\u2208A} [\u03c1(b, a) + \u03b3 \u03a3_o Pr(o|a, b) Vn\u22121(b^{a,o})], (1)\n\nwhere, for all b \u2208 \u2206, V0(b) = 0, and J^{\u03c0\u2217}(b) = V_{n=H}(b) (where H is the\u2014possibly infinite\u2014horizon of the problem).\n\n1\u03a0(S) forms a simplex because \u2016b\u20161 = 1; that is why we use \u2206 as the set of all possible b.\n\nThe POMDP framework presents a reward function r(s, a) based on the state and action. On the other hand, the belief MDP presents a reward function \u03c1(b, a) based on beliefs. This belief-based reward function is derived as the expectation of the POMDP rewards:\n\n\u03c1(b, a) = \u03a3_s b(s) r(s, a). (2)\n\nAn important consequence of Equation 2 is that the recursive computation described in Eq. 
1 has the property of generating piecewise-linear and convex (PWLC) value functions for each horizon [1], i.e., each function is determined by a set of hyperplanes (each represented by a vector), the value at a given belief point being that of the highest hyperplane. For example, if \u0393n is the set of vectors representing the value function for horizon n, then Vn(b) = max_{\u03b1\u2208\u0393n} \u03a3_s b(s)\u03b1(s).\n\n2.2 Solving POMDPs with Exact Updates\n\nUsing the PWLC property, one can perform the Bellman update using the following factorization of Eq. 1:\n\nVn(b) = max_{a\u2208A} \u03a3_o \u03a3_s b(s) [ r(s, a)/|\u2126| + \u03b3 \u03a3_{s\u2032} T(s, a, s\u2032)O(s\u2032, a, o)\u03c7n\u22121(b^{a,o}, s\u2032) ], (3)\n\nwith2 \u03c7n(b) = argmax_{\u03b1\u2208\u0393n} b \u00b7 \u03b1. If we consider the term in brackets in Eq. 3, this generates |\u2126| \u00d7 |A| \u0393-sets, each one of size |\u0393n\u22121|. These sets are defined as\n\n\u0393n^{a,o} = { ra/|\u2126| + \u03b3 P^{a,o} \u00b7 \u03b1n\u22121 | \u03b1n\u22121 \u2208 \u0393n\u22121 }, (4)\n\nwhere P^{a,o}(s, s\u2032) = T(s, a, s\u2032)O(s\u2032, a, o) and ra(s) = r(s, a). Therefore, for obtaining an exact representation of the value function, one can compute (\u2295 being the cross-sum between two sets):\n\n\u0393n = \u222a_a \u2295_o \u0393n^{a,o}.\n\nYet, these \u0393n^{a,o} sets\u2014and also the final \u0393n\u2014are non-parsimonious: some \u03b1-vectors may be useless because the corresponding hyperplanes are below the value function. Pruning phases are then required to remove dominated vectors. 
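The exact update of Eqs. 3 and 4 can be sketched directly on sets of \u03b1-vectors. This is a hedged sketch: the names are ours, and the naive pointwise prune below is a cheap stand-in for the LP-based pruning used by real solvers:

```python
import itertools
import numpy as np

def exact_backup(Gamma, T, O, r, gamma):
    """One exact Bellman backup on alpha-vector sets:
    build Gamma^{a,o} = { r_a/|Omega| + gamma * P^{a,o} @ alpha : alpha in Gamma },
    then cross-sum over observations and take the union over actions."""
    n_s, n_a = r.shape
    n_o = O.shape[2]
    out = []
    for a in range(n_a):
        sets = []
        for o in range(n_o):
            P = T[:, a, :] * O[:, a, o][None, :]   # P^{a,o}(s, s')
            sets.append([r[:, a] / n_o + gamma * (P @ al) for al in Gamma])
        for combo in itertools.product(*sets):     # cross-sum over o
            out.append(np.sum(combo, axis=0))
    return out

def prune_pointwise(vectors):
    """Drop vectors pointwise-dominated by another one (cheap, incomplete prune)."""
    return [v for i, v in enumerate(vectors)
            if not any(i != j and np.all(w >= v) and np.any(w > v)
                       for j, w in enumerate(vectors))]
```

With |\u0393| input vectors, the cross-sum yields |A| \u00b7 |\u0393|^{|\u2126|} candidates, which is why pruning is essential in practice.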
There are several algorithms based on pruning techniques, like Batch Enumeration [8], or more efficient algorithms such as Witness or Incremental Pruning [6].\n\n2.3 Solving POMDPs with Approximate Updates\n\nThe value function updating processes presented above are exact and provide value functions that can be used regardless of the initial belief state b0. A number of approximate POMDP solutions have been proposed to reduce the complexity of these computations, using for example heuristic estimates of the value function, or applying the value update only on selected belief points [9]. We focus here on the latter point-based (PB) approximations, which have largely contributed to the recent progress in solving POMDPs, and whose relevant literature goes from Lovejoy\u2019s early work [10] via Pineau et al.\u2019s PBVI [11], Spaan and Vlassis\u2019 Perseus [12], and Smith and Simmons\u2019 HSVI2 [13], through to Kurniawati et al.\u2019s SARSOP [14].\nAt each iteration n until convergence, a typical PB algorithm:\n\n1. selects a new set of belief points Bn based on Bn\u22121 and the current approximation Vn\u22121;\n2. performs a Bellman backup at each belief point b \u2208 Bn, resulting in one \u03b1-vector per point;\n3. prunes points whose associated hyperplanes are dominated or considered negligible.\n\nThe various PB algorithms differ mainly in how belief points are selected, and in how the update is performed. 
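Step 2 above, the backup at a single belief point, can be sketched as follows (a PBVI-style back-projection; the function name and array layout are our own assumptions):

```python
import numpy as np

def point_based_backup(b, Gamma, T, O, r, gamma):
    """Backup at belief b: back-project every alpha-vector through each
    (a, o) pair, keep the best combination for b; returns one alpha-vector."""
    n_s, n_a = r.shape
    n_o = O.shape[2]
    best_alpha, best_val = None, -np.inf
    for a in range(n_a):
        alpha = r[:, a].astype(float)
        for o in range(n_o):
            P = T[:, a, :] * O[:, a, o][None, :]      # P^{a,o}(s, s')
            # pick the back-projected vector that is best at b
            alpha = alpha + gamma * max((P @ al for al in Gamma),
                                        key=lambda g: b @ g)
        val = b @ alpha
        if val > best_val:
            best_alpha, best_val = alpha, val
    return best_alpha, best_val
```

Running this at every b in Bn yields one hyperplane per point, which is exactly what the PB iteration stores.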
Existing belief point selection methods have exploited ideas like using a regular discretization or a random sampling of the belief simplex, picking reachable points (by simulating action sequences starting from b0), adding points that reduce the approximation error, or looking in particular at regions relevant to the optimal policy [15].\n\n2The \u03c7 function returns a vector, so \u03c7n(b, s) = (\u03c7n(b))(s).\n\n3 POMDP Extension for Active Sensing\n\n3.1 Introducing \u03c1POMDPs\n\nAll problems with partial observability confront the issue of getting more information to achieve some goal. This problem is usually implicitly addressed in the resolution process, where acquiring information is only a means for optimizing an expected reward based on the system state. Some active sensing problems can be modeled this way (e.g. active classification), but not all of them. A special kind of problem arises when the performance criterion incorporates an explicit measure of the agent\u2019s knowledge about the system, which is based on beliefs rather than states. Surveillance, for example, is a never-ending task that does not seem to allow for a modeling with state-dependent rewards. Indeed, if we consider the simple problem of knowing the position of a hidden object, it is possible to solve it without even having seen the object (for instance if all the locations but one have been visited). However, the reward of a POMDP cannot model this since it is only based on the current state and action. One solution would be to include the whole history in the state, leading to a combinatorial explosion. We prefer to consider a new way of defining rewards based on the acquired knowledge represented by belief states. The rest of the paper explores the fact that belief MDPs can be used outside the specific definition of \u03c1(b, a) in Eq. 2, and therefore discusses how to solve this special type of active sensing problem.\nAs Eq. 
2 is no longer valid, the direct link with POMDPs is broken. We can however still use all the other components of POMDPs, such as states, observations, etc. A way of fixing this is to generalize the POMDP framework to a \u03c1-based POMDP (\u03c1POMDP), where the reward is not defined as a function r(s, a), but directly as a function \u03c1(b, a). The nature of the \u03c1(b, a) function depends on the problem, but it is usually related to some uncertainty or error measure [3, 2, 4]. The most common methods are those based on Shannon\u2019s information theory, in particular Shannon\u2019s entropy or the Kullback-Leibler distance [16]. In order to present these functions as rewards, they have to measure information rather than uncertainty, so the negative entropy function \u03c1_ent(b) = log2(|S|) + \u03a3_{s\u2208S} b(s) log2(b(s))\u2014which is maximal in the corners of the simplex and minimal in the center\u2014is used rather than Shannon\u2019s original entropy. Also, other simpler functions based on the same idea can be used, such as the distance from the simplex center (DSC), \u03c1_dsc(b) = \u2016b \u2212 c\u2016_m, where c is the center of the simplex and m a positive integer that denotes the order of the metric space. Please note that \u03c1(b, a) is not restricted to being only an uncertainty measurement, but can be a combination of the expected state-action rewards\u2014as in Eq. 2\u2014and an uncertainty or error measurement. For example, Mihaylova et al.\u2019s work [3] defines the active sensing problem as optimizing a weighted sum of uncertainty measurements and costs, where the former depend on the belief and the latter on the system state.\nIn the remainder of this paper, we show how to apply classical POMDP algorithms to \u03c1POMDPs. 
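The two reward functions just mentioned are straightforward to write down; this is an illustrative sketch with our own function names:

```python
import numpy as np

def rho_ent(b):
    """Negative-entropy reward log2|S| + sum_s b(s) log2 b(s):
    zero at the simplex center (uniform belief), maximal at the corners."""
    b = np.asarray(b, dtype=float)
    nz = b[b > 0]                                # convention: 0 * log2(0) = 0
    return np.log2(b.size) + float(np.sum(nz * np.log2(nz)))

def rho_dsc(b, m=2):
    """Distance from the simplex center c = (1/|S|, ..., 1/|S|) in m-norm."""
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(b - 1.0 / b.size, ord=m))
```

Both measures are convex and vanish at the uniform belief, matching the convexity requirement discussed in Section 3.2.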
To that end, we discuss the convexity of the value function, which permits extending these algorithms using PWLC approximations.\n\n3.2 Convexity Property\n\nAn important property used to solve normal POMDPs is the result that a belief-based value function is convex, because r(s, a) is linear with respect to the belief, and the expectation, sum and max operators preserve this property [1]. For \u03c1POMDPs, this property also holds if the reward function \u03c1(b, a) is convex, as shown in Theorem 3.1.\nTheorem 3.1. If \u03c1 and V0 are convex functions over \u2206, then the value function Vn of the belief MDP is convex over \u2206 at any time step n. [Proof in [17, Appendix]]\n\nThis last theorem is based on \u03c1(b, a) being a convex function of b, which is a natural property for uncertainty (or information) measures, because the objective is to avoid belief distributions that do not give much information on which state the system is in, and to assign higher rewards to those beliefs that give higher probabilities of being in a specific state. Thus, a reward function meant to reduce the uncertainty must provide high payoffs near the corners of the simplex, and low payoffs near its center. For that reason, we will focus only on reward functions that comply with convexity in the rest of the paper.\nThe initial value function V0 might be any convex function for infinite-horizon problems, but by definition V0 = 0 for finite-horizon problems. We will use the latter case for the rest of the paper, to provide fairly general results for both kinds of problems. 
Plus, starting with V0 = 0, it is also easy to prove by induction that, if \u03c1 is continuous (respectively differentiable), then Vn is continuous (respectively piecewise differentiable).\n\n3.3 Piecewise Linear Reward Functions\n\nThis section focuses on the case where \u03c1 is a PWLC function and shows that only a small adaptation of the exact and approximate updates in the POMDP case is necessary to compute the optimal value function. The complex case where \u03c1 is not PWLC is left for Sec. 4.\n\n3.3.1 Exact Updates\n\nFrom now on, \u03c1(b, a), being a PWLC function, can be represented as several \u0393-sets, one \u0393^a_\u03c1 for each a. The reward is computed as:\n\n\u03c1(b, a) = max_{\u03b1\u2208\u0393^a_\u03c1} \u03a3_s b(s)\u03b1(s).\n\nUsing this definition leads to the following changes in Eq. 3:\n\nVn(b) = max_{a\u2208A} \u03a3_o \u03a3_s b(s) [ \u03c7^a_\u03c1(b, s)/|\u2126| + \u03b3 \u03a3_{s\u2032} T(s, a, s\u2032)O(s\u2032, a, o)\u03c7n\u22121(b^{a,o}, s\u2032) ],\n\nwhere \u03c7^a_\u03c1(b) = argmax_{\u03b1\u2208\u0393^a_\u03c1} (b \u00b7 \u03b1), with the component notation of footnote 2. This uses the \u0393-set \u0393^a_\u03c1 and generates |\u2126| \u00d7 |A| \u0393-sets:\n\n\u0393n^{a,o} = { \u03b3 P^{a,o} \u00b7 \u03b1n\u22121 | \u03b1n\u22121 \u2208 \u0393n\u22121 },\n\nwhere P^{a,o}(s, s\u2032) = T(s, a, s\u2032)O(s\u2032, a, o).\nExact algorithms like Value Iteration or Incremental Pruning can then be applied to this POMDP extension in a similar way as for POMDPs. 
The difference is that the cross-sum includes not only one \u03b1^{a,o} for each observation \u0393-set \u0393n^{a,o}, but also one \u03b1_\u03c1 from the \u0393-set \u0393^a_\u03c1 corresponding to the reward:\n\n\u0393n = \u222a_a [ (\u2295_o \u0393n^{a,o}) \u2295 \u0393^a_\u03c1 ].\n\nThus, the cross-sum generates |R| times more vectors than with a classic POMDP, |R| being the number of \u03b1-vectors specifying the \u03c1(b, a) function3.\n\n3.3.2 Approximate Updates\n\nPoint-based approximations can be applied in the same way as PBVI or SARSOP do to the original POMDP update. The only difference is again the reward function representation as an envelope of hyperplanes. PB algorithms select the hyperplane that maximizes the value function at each belief point, so the same simplification can be applied to the set \u0393^a_\u03c1.\n\n4 Generalizing to Other Reward Functions\n\nUncertainty measurements such as the negative entropy or the DSC (with m > 1 and m \u2260 \u221e) are not piecewise linear functions. In theory, each step of value iteration can be analytically computed using these functions, but the expressions are not closed-form as in the linear case, growing in complexity and becoming unmanageable after a few steps. Moreover, pruning techniques cannot be applied directly to the resulting hypersurfaces, and even second-order measures do not exhibit standard quadratic forms amenable to quadratic programming. 
However, convex functions can be efficiently approximated by piecewise linear functions, making it possible to apply the techniques described in Section 3.3 with a bounded error, as long as the approximation of \u03c1 is bounded.\n\n3More precisely, the number |R| depends on the considered action.\n\n4.1 Approximating \u03c1\n\nConsider a continuous, convex and piecewise differentiable reward function \u03c1(b),4 and an arbitrary (and finite) set of points B \u2282 \u2206 where the gradient is well defined. A lower PWLC approximation of \u03c1(b) can be obtained by using each element b\u2032 \u2208 B as a base point for constructing a tangent hyperplane, which is always a lower bound of \u03c1(b). Concretely, \u03c9_{b\u2032}(b) = \u03c1(b\u2032) + (b \u2212 b\u2032) \u00b7 \u2207\u03c1(b\u2032) is the linear function that represents the tangent hyperplane. Then, the approximation of \u03c1(b) using a set B is defined as \u03c9_B(b) = max_{b\u2032\u2208B} \u03c9_{b\u2032}(b).\nAt any point b \u2208 \u2206 the error of the approximation can be written as\n\n\u03f5_B(b) = |\u03c1(b) \u2212 \u03c9_B(b)|, (5)\n\nand if we specifically pick b as the point where \u03f5_B(b) is maximal (worst error), then we can try to bound this error depending on the nature of \u03c1.\nIt is well known that a piecewise linear approximation of a Lipschitz function is bounded because the gradient \u2207\u03c1(b\u2032) that is used to construct the hyperplane \u03c9_{b\u2032}(b) has bounded norm [18]. Unfortunately, the negative entropy is not Lipschitz (f(x) = x log2(x) has an infinite slope when x \u2192 0), so this result is not generic enough to cover a wide range of active sensing problems. Yet, under certain mild assumptions a proper error bound can still be found.\nThe aim of the rest of this section is to find an error bound in three steps. First, we will introduce some basic results over the simplex and the convexity of \u03c1. 
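The tangent-hyperplane construction of \u03c9_B can be sketched as follows, here instantiated with the negative entropy (the code and names are our illustration; base points are assumed to lie in the simplex interior, where the gradient is finite):

```python
import numpy as np

def make_omega_B(B, rho, grad_rho):
    """omega_B(b) = max_{b' in B} [rho(b') + (b - b') . grad_rho(b')],
    a PWLC lower bound of a convex rho built from tangent hyperplanes."""
    planes = [(rho(bp) - bp @ grad_rho(bp), grad_rho(bp)) for bp in B]
    return lambda b: max(c + b @ g for c, g in planes)

def rho_ent(b):
    # negative entropy; requires b > 0 componentwise
    return np.log2(b.size) + np.sum(b * np.log2(b))

def grad_ent(b):
    # d/db_s [b_s log2 b_s] = log2 b_s + log2 e
    return np.log2(b) + np.log2(np.e)

# base points strictly inside the simplex, as the analysis requires
B = [np.array(p) for p in ([0.5, 0.5], [0.2, 0.8], [0.8, 0.2])]
omega = make_omega_B(B, rho_ent, grad_ent)
```

By convexity, every tangent lies below \u03c1, so \u03c9_B is a lower bound that is tight at each base point; the rest of this section bounds its worst-case gap.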
Informally, Lemma 4.1 will show that, for each b, it is possible to find a belief point in B far enough from the boundary of the simplex but within a bounded distance to b. Then, in a second step, we will assume the function \u03c1(b) verifies the \u03b1-H\u00f6lder condition to be able to bound the norm of the gradient in Lemma 4.2. In the end, Theorem 4.3 will use both lemmas to bound the error of \u03c1\u2019s approximation under these assumptions.\n\nFigure 1: Simplices \u2206 and \u2206\u03b5, and the points b, b\u2032 and b\u2032\u2032.\n\nFor each point b \u2208 \u2206, it is possible to associate a point b\u2217 = argmax_{x\u2208B} \u03c9_x(b) corresponding to the point in B whose tangent hyperplane gives the best approximation of \u03c1 at b. Consider the point b \u2208 \u2206 where \u03f5_B(b) is maximum: this error can be easily computed using the gradient \u2207\u03c1(b\u2217). Unfortunately, some partial derivatives of \u03c1 may diverge to infinity on the boundary of the simplex in the non-Lipschitz case, making the error hard to analyze. Therefore, to ensure that this error can be bounded, instead of b\u2217 we will take a safe b\u2032\u2032 \u2208 B (far enough from the boundary) by using an intermediate point b\u2032 in an inner simplex \u2206\u03b5, where \u2206\u03b5 = {b \u2208 [\u03b5, 1]^N | \u03a3_i b_i = 1} with N = |S|.\nThus, for a given b \u2208 \u2206 and \u03b5 \u2208 (0, 1/N], we define the point b\u2032 = argmin_{x\u2208\u2206\u03b5} \u2016x \u2212 b\u20161 as the closest point to b in \u2206\u03b5 and b\u2032\u2032 = argmin_{x\u2208B} \u2016x \u2212 b\u2032\u20161 as the closest point to b\u2032 in B (see Figure 1). These two points will be used to find an upper bound for the distance \u2016b \u2212 b\u2032\u2032\u20161 based on the density of B, defined as \u03b4_B = max_{b\u2208\u2206} min_{b\u2032\u2208B} \u2016b \u2212 b\u2032\u20161.\nLemma 4.1. 
The distance (1-norm) between the maximum error point b \u2208 \u2206 and the selected b\u2032\u2032 \u2208 B is bounded by \u2016b \u2212 b\u2032\u2032\u20161 \u2264 2(N \u2212 1)\u03b5 + \u03b4_B. [Proof in [17, Appendix]]\n\n4For convenience\u2014and without loss of generality\u2014we only consider the case where \u03c1(b, a) = \u03c1(b).\n\nIf we pick \u03b5 > \u03b4_B, then we are sure that b\u2032\u2032 is not on the boundary of the simplex \u2206, with a minimum distance from the boundary of \u03b7 = \u03b5 \u2212 \u03b4_B. This will allow finding bounds for the PWLC approximation of convex \u03b1-H\u00f6lder functions, which is a broader family of functions including the negative entropy, convex Lipschitz functions and others. The \u03b1-H\u00f6lder condition is a generalization of the Lipschitz condition. In our setting it means, for a function f : D \u21a6 R with D \u2282 R^n, that it complies with\n\n\u2203\u03b1 \u2208 (0, 1], \u2203K_\u03b1 > 0, s.t. |f(x) \u2212 f(y)| \u2264 K_\u03b1 \u2016x \u2212 y\u2016_1^\u03b1.\n\nThe limit case, where a convex \u03b1-H\u00f6lder function has an infinite-valued gradient norm, is always on the boundary of the simplex \u2206 (due to the convexity), and therefore the point b\u2032\u2032 will be free of this predicament because of \u03b7. More precisely, an \u03b1-H\u00f6lder function on \u2206 with constant K_\u03b1 in 1-norm complies with the Lipschitz condition on \u2206_\u03b7 with constant K_\u03b1\u03b7^{\u03b1\u22121} (see [17, Appendix]).\nMoreover, the norm of the gradient \u2016\u2207f(b\u2032\u2032)\u20161 is also bounded, as stated by Lemma 4.2.\nLemma 4.2. Let \u03b7 > 0 and f be an \u03b1-H\u00f6lder (with constant K_\u03b1), bounded and convex function from \u2206 to R, f being differentiable everywhere in \u2206\u00b0 (the interior of \u2206). 
Then, for all b \u2208 \u2206_\u03b7, \u2016\u2207f(b)\u20161 \u2264 K_\u03b1\u03b7^{\u03b1\u22121}. [Proof in [17, Appendix]]\nUnder these conditions, we can show that the PWLC approximation is bounded.\nTheorem 4.3. Let \u03c1 be a continuous and convex function over \u2206, differentiable everywhere in \u2206\u00b0 (the interior of \u2206), and satisfying the \u03b1-H\u00f6lder condition with constant K_\u03b1. The error of an approximation \u03c9_B can be bounded by C\u03b4_B^\u03b1, where C is a scalar constant. [Proof in [17, Appendix]]\n\n4.2 Exact Updates\n\nKnowing that the approximation of \u03c1 is bounded for a wide family of functions, the techniques described in Sec. 3.3.1 can be directly applied using \u03c9_B(b) as the PWLC reward function. These algorithms can be safely used because the propagation of the error due to exact updates is bounded. This can be proven using a similar methodology as in [11, 10]. Let Vt be the value function using the PWLC approximation described above and V\u2217t the optimal value function, both at time t, H being the exact update operator and \u02c6H the same operator with the PWLC approximation. 
Then, the error from the real value function is\n\n\u2016Vt \u2212 V\u2217t\u2016\u221e = \u2016\u02c6HVt\u22121 \u2212 HV\u2217t\u22121\u2016\u221e (by definition)\n\u2264 \u2016\u02c6HVt\u22121 \u2212 HVt\u22121\u2016\u221e + \u2016HVt\u22121 \u2212 HV\u2217t\u22121\u2016\u221e (by the triangle inequality)\n\u2264 |\u03c9_{b\u2217}(b) + \u03b1_{b\u2217} \u00b7 b \u2212 \u03c1(b) \u2212 \u03b1_{b\u2217} \u00b7 b| + \u2016HVt\u22121 \u2212 HV\u2217t\u22121\u2016\u221e (maximum error at b)\n\u2264 C\u03b4_B^\u03b1 + \u2016HVt\u22121 \u2212 HV\u2217t\u22121\u2016\u221e (by Theorem 4.3)\n\u2264 C\u03b4_B^\u03b1 + \u03b3\u2016Vt\u22121 \u2212 V\u2217t\u22121\u2016\u221e (by contraction)\n\u2264 C\u03b4_B^\u03b1/(1 \u2212 \u03b3) (by sum of a geometric series).\n\nFor these algorithms, the selection of the set B remains open, raising similar issues as the selection of belief points in PB algorithms.\n\n4.3 Approximate Updates\n\nIn the case of PB algorithms, the extension is also straightforward, and the algorithms described in Sec. 3.3.2 can be used with a bounded error. The selection of B, the set of points for the PWLC approximation, and the set of points for the algorithm, can be shared5. This simplifies the study of the bound when using both approximation techniques at the same time. Let \u02c6Vt be the value function at time t calculated using the PWLC approximation and a PB algorithm. Then the error between \u02c6Vt and V\u2217t is \u2016\u02c6Vt \u2212 V\u2217t\u2016\u221e \u2264 \u2016\u02c6Vt \u2212 Vt\u2016\u221e + \u2016Vt \u2212 V\u2217t\u2016\u221e. The second term is the same as in Sec. 4.2, so it is bounded by C\u03b4_B^\u03b1/(1 \u2212 \u03b3). 
The first term can be bounded by the same reasoning as in [11], where \u2016\u02c6Vt \u2212 Vt\u2016\u221e \u2264 (Rmax \u2212 Rmin + C\u03b4_B^\u03b1)\u03b4_B/(1 \u2212 \u03b3), with Rmin and Rmax the minimum and maximum values of \u03c1(b) respectively. This is because the worst case for an \u03b1-vector is (Rmin \u2212 \u03f5)/(1 \u2212 \u03b3), while the best case is only Rmax/(1 \u2212 \u03b3) because the approximation is always a lower bound.\n\n5Points from \u2206\u2019s boundary can be removed where the gradient is not defined, as the proofs only rely on interior points.\n\n5 Conclusions\n\nWe have introduced \u03c1POMDPs, an extension of POMDPs that allows for expressing sequential decision-making problems where reducing the uncertainty on some state variables is an explicit objective. In this model, the reward \u03c1 is typically a convex function of the belief state.\nUsing the convexity of \u03c1, a first important result that we prove is that a Bellman backup Vn = HVn\u22121 preserves convexity. In particular, if \u03c1 is PWLC and the value function V0 is equal to 0, then Vn is also PWLC and it is straightforward to adapt many state-of-the-art POMDP algorithms. Yet, if \u03c1 is not PWLC, performing exact updates is much more complex. We therefore propose employing PWLC approximations of the convex reward function at hand to come back to a simple case, and show that the resulting algorithms converge to the optimal value function in the limit.\nPrevious work has already introduced belief-dependent rewards, such as Spaan\u2019s discussion about POMDPs and Active Perception [19], or Hero et al.\u2019s work in sensor management using POMDPs [5]. 
Yet, the first one only presents the problem of non-PWLC value functions without giving a specific solution, while the second solves the problem using Monte-Carlo techniques that do not rely on the PWLC property. In the robotics field, uncertainty measurements within POMDPs have been widely used as heuristics [2], with very good results but no convergence guarantees. These techniques use only state-dependent rewards, but uncertainty measurements are employed to speed up the solving process, at the cost of losing some basic properties (e.g. the Markov property). Our work paves the way for solving problems with belief-dependent rewards, using new algorithms approximating the value function (e.g. point-based ones) in a theoretically sound manner.\nAn important point is that the time complexity of the new algorithms only changes due to the size of the approximation of \u03c1. Future work includes conducting experiments to measure the increase in complexity. A more complex task is to evaluate the quality of the resulting approximations, due to the lack of other algorithms for \u03c1POMDPs. An option is to look at online Monte-Carlo algorithms [20], as they should require few changes.\n\nAcknowledgements\n\nThis research was supported by the CONICYT-Ambassade de France doctoral grant and the COMAC project. We would also like to thank Bruno Scherrer for the insightful discussions and the anonymous reviewers for their helpful comments and suggestions.\n\nReferences\n\n[1] R. Smallwood and E. Sondik. The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21:1071\u20131088, 1973.\n[2] S. Thrun. Probabilistic algorithms in robotics. AI Magazine, 21(4):93\u2013109, 2000.\n[3] L. Mihaylova, T. Lefebvre, H. Bruyninckx, K. Gadeyne, and J. De Schutter. Active sensing for robotics \u2013 a survey. In Proc. 5th Intl. Conf. on Numerical Methods and Applications, 2002.\n[4] S. Ji and L. 
Carin. Cost-sensitive feature acquisition and classification. Pattern Recognition, 40(5):1474\u20131485, 2007.\n[5] A. Hero, D. Casta\u00f1\u00f3n, D. Cochran, and K. Kastella. Foundations and Applications of Sensor Management. Springer, 2007.\n[6] A. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, Brown University, Providence, RI, USA, 1998.\n[7] R. Bellman. The theory of dynamic programming. Bull. Amer. Math. Soc., 60:503\u2013516, 1954.\n[8] G. Monahan. A survey of partially observable Markov decision processes. Management Science, 28:1\u201316, 1982.\n[9] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33\u201394, 2000.\n[10] W. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162\u2013175, 1991.\n[11] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research (JAIR), 27:335\u2013380, 2006.\n[12] M. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195\u2013220, 2005.\n[13] T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proc. of the Int. Conf. on Uncertainty in Artificial Intelligence (UAI), 2005.\n[14] H. Kurniawati, D. Hsu, and W. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems IV, 2008.\n[15] R. Kaplow. Point-based POMDP solvers: Survey and comparative analysis. Master\u2019s thesis, McGill University, Montreal, Quebec, Canada, 2010.\n[16] T. Cover and J. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.\n[17] M. Araya-L\u00f3pez, O. Buffet, V. Thomas, and F. Charpillet. 
A POMDP extension with belief-dependent rewards \u2013 extended version. Technical Report RR-7433, INRIA, Oct 2010. (See also NIPS supplementary material.)\n[18] R. Saigal. On piecewise linear approximations to smooth mappings. Mathematics of Operations Research, 4(2):153\u2013161, 1979.\n[19] M. Spaan. Cooperative active perception using POMDPs. In AAAI 2008 Workshop on Advancements in POMDP Solvers, July 2008.\n[20] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research (JAIR), 32:663\u2013704, 2008.\n", "award": [], "sourceid": 789, "authors": [{"given_name": "Mauricio", "family_name": "Araya", "institution": null}, {"given_name": "Olivier", "family_name": "Buffet", "institution": null}, {"given_name": "Vincent", "family_name": "Thomas", "institution": null}, {"given_name": "Fran\u00e7ois", "family_name": "Charpillet", "institution": null}]}