{"title": "rho-POMDPs have Lipschitz-Continuous epsilon-Optimal Value Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 6933, "page_last": 6943, "abstract": "Many state-of-the-art algorithms for solving Partially Observable Markov Decision Processes (POMDPs) rely on turning the problem into a \u201cfully observable\u201d problem\u2014a belief MDP\u2014and exploiting the piece-wise linearity and convexity (PWLC) of the optimal value function in this new state space (the belief simplex \u2206). This approach has been extended to solving \u03c1-POMDPs\u2014i.e., for information-oriented criteria\u2014when the reward \u03c1 is convex in \u2206. General \u03c1-POMDPs can also be turned into \u201cfully observable\u201d problems, but with no means to exploit the PWLC property. In this paper, we focus on POMDPs and \u03c1-POMDPs with \u03bb \u03c1 -Lipschitz reward function, and demonstrate that, for finite horizons, the optimal value function is Lipschitz-continuous. Then, value function approximators are proposed for both upper- and lower-bounding the optimal value function, which are shown to provide uniformly improvable bounds. 
This allows proposing two algorithms derived from HSVI which are empirically evaluated on various benchmark problems.", "full_text": "ρ-POMDPs have Lipschitz-Continuous ε-Optimal Value Functions

Mathieu Fehr1, Olivier Buffet2, Vincent Thomas2, Jilles Dibangoye3
1 École Normale Supérieure de la rue d’Ulm, Paris, France
2 Université de Lorraine, CNRS, Inria, LORIA, Nancy, France
3 Université de Lyon, INSA Lyon, Inria, CITI, Lyon, France
mathieu.fehr@ens.fr, olivier.buffet@loria.fr, vincent.thomas@loria.fr, jilles.dibangoye@inria.fr

Abstract

Many state-of-the-art algorithms for solving Partially Observable Markov Decision Processes (POMDPs) rely on turning the problem into a “fully observable” problem—a belief MDP—and exploiting the piece-wise linearity and convexity (PWLC) of the optimal value function in this new state space (the belief simplex Δ). This approach has been extended to solving ρ-POMDPs—i.e., for information-oriented criteria—when the reward ρ is convex in Δ. General ρ-POMDPs can also be turned into “fully observable” problems, but with no means to exploit the PWLC property. In this paper, we focus on POMDPs and ρ-POMDPs with λρ-Lipschitz reward function, and demonstrate that, for finite horizons, the optimal value function is Lipschitz-continuous. Then, value function approximators are proposed for both upper- and lower-bounding the optimal value function, which are shown to provide uniformly improvable bounds.
This allows proposing two algorithms derived from HSVI which are empirically evaluated on various benchmark problems.

1 Introduction

Many state-of-the-art algorithms for solving Partially Observable Markov Decision Processes (POMDPs) rely on turning the problem into a “fully observable” problem—namely a belief MDP—and exploiting the piece-wise linearity and convexity (PWLC) of the optimal value function [Sondik, 1971, Smallwood and Sondik, 1973] in this new problem’s state space (here the belief space Δ). State of the art off-line algorithms [Pineau et al., 2006, Smith and Simmons, 2004] maintain approximators that (i) are upper or lower bounds, and (ii) have generalization capabilities: a local update at b improves the bound in a surrounding region of b. This approach has been extended to solving ρ-POMDPs as belief MDPs—i.e., problems whose performance criterion depends on the belief (e.g., active information gathering)—when the reward ρ is convex in Δ [Araya-López et al., 2010].1 Yet, it does not extend to problems with non-convex ρ—e.g., (i) if a museum monitoring system is rewarded for each visitor located with “enough certainty” (i.e., using a threshold function), or (ii) if a system collecting data about patients is rewarded for preserving their privacy by discarding information that could harm anonymity.
Value function approximators with generalization capabilities are also an important topic in (fully observable, mono-agent) reinforcement learning, as illustrated recently by Deep RL [Mnih et al., 2013].
To allow for error-bounded approximations in continuous settings, some works have built on the hypothesis that the dynamics and the reward function were Lipschitz-continuous (LC), which leads to Lipschitz-continuous value functions [Laraki and Sudderth, 2004, Hinderer, 2005, Fonteneau et al., 2009, Rachelson and Lagoudakis, 2010, Dufour and Prieto-Rumeau, 2012].

1And also to solving Decentralized POMDPs (Dec-POMDPs) as occupancy MDPs (oMDPs)—i.e., when designing multiple collaborating controllers—[Dibangoye et al., 2013, 2016].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Ieong et al. [2007] also considered exploiting the LC property in heuristic search settings. These approaches cannot be applied to the aforementioned partially observable problems as the dynamics of the induced MDPs are a priori not LC.
This paper shows that, for ρ-POMDPs with λρ-Lipschitz reward function (and thus for any POMDP) and for finite horizons, the optimal value function is still LC, a property that shall replace the PWLC property. Yet, to allow for better approximators and tighter theoretical bounds, we use an extended definition of Lipschitz-continuity where (i) the Lipschitz constant is a vector rather than a scalar, and (ii) we consider local—rather than uniform—LC. From there, value function approximators are proposed for both upper- and lower-bounding the optimal value function. Following Smith [2007], these approximators are shown to provide uniformly improvable bounds [Zhang and Zhang, 2001] for use with point-based algorithms like HSVI, which is then guaranteed to converge to an ε-optimal solution.
This allows proposing two algorithms derived from HSVI: (i) one that uses guaranteed/safe Lipschitz constants, but at the cost of overly pessimistic error bounds, and (ii) one that searches for good Lipschitz constants, but loses optimality guarantees. This work is also a step towards solving partially observable stochastic games as continuous-space SGs with LC approximators.
The paper is organized as follows. Section 2 discusses related work on information-oriented control. Sec. 3 presents background on POMDPs, ρ-POMDPs and Lipschitz continuity. Sec. 4 demonstrates that, for finite horizons, the optimal value function is Lipschitz-continuous. Sec. 5 proposes value function approximators and two point-based algorithms (based on HSVI). Sec. 6 evaluates them empirically. Proofs are provided as supplementary material.

2 Related Work

Early research on information-oriented control (IOC) involved problems formalized either (i) as POMDPs (as Egorov et al. [2016] did recently, since an observation-dependent reward can be trivially recast as a state-dependent reward), or (ii) with belief-dependent rewards (and mostly ad-hoc solution techniques). ρ-POMDPs allow easily formalizing many—if not most—IOC problems. Araya-López et al. [2010] show that a ρ-POMDP with convex belief-dependent reward ρ can be solved with modified point-based POMDP solvers exploiting the PWLC property (with error bounds that depend on the quality of the PWLC-approximation of ρ).
The POMDP-IR framework [Spaan et al., 2015] allows describing IOC problems with linear rewards—thus, a subclass of “PWLC” ρ-POMDPs (i.e., when ρ is PWLC). Yet, as Satsangi et al. [2015] showed that a PWLC ρ-POMDP can be turned into a POMDP-IR, both classes are in fact equivalent.
In both\ncases the proposed solution techniques are modi\ufb01ed POMDP solvers, and it seems (to us) that an\nalgorithm proposed in one framework should apply with limited changes in the other framework.\nFor its part, the general \u03c1-POMDP framework allows formalizing more problems\u2014e.g., directly\nspecifying an entropy-based criterion. While Spaan et al. [2015] obtain better empirical results\nwith their POMDP-IR-based method than with a \u03c1-POMDP-based method, this probably says more\nabout particular solutions applied on a particular problem than about the frameworks themselves (as\ndiscussed above).\nThe case of non-convex \u03c1 (including information-averse scenarios) may have been mostly avoided up\nto now because no satisfying solution technique existed. The present work analyzes the optimal value\nfunction\u2019s properties when \u03c1 is Lipschitz-continuous, which leads to a prototype solution algorithm.\nThis is a \ufb01rst step towards proposing new tools for solving a wider class of information-oriented\nPOMDPs than currently feasible. Future work will thus be more oriented towards practical applica-\ntions, possibly with evaluations on surveillance problems\u2014which are only motivating scenarios in\nthe present paper. Note that Egorov et al. [2016] propose solutions dedicated to surveillance (with\nan adversarial setting), not for general IOC problems. 
Regarding adversarial settings, another very promising direction is exploiting the Lipschitz continuity in a similar manner to solve (zero-sum) Partially Observable Stochastic Games.

3 Background

Notations: We denote: x̂ = x/‖x‖1 the normalization of a vector x; |x| a component-by-component (CbC) absolute value operator; vmax_x f(x) a CbC maximum operator for vector-valued functions f(x); and 1 a row vector of 1s.

3.1 POMDPs

A POMDP [Astrom, 1965] is defined by a tuple ⟨S, A, Z, P, r, γ, b0⟩, where S, A and Z are finite sets of states, actions and observations; P_{a,z}(s, s′) gives the probability of transiting to state s′ and observing observation z when applying action a in state s (P_{a,z} is an S × S matrix); r(s, a) ∈ R is the reward associated to performing action a in state s; γ ∈ [0; 1) is a discount factor; and b0 is the initial belief state—i.e., the initial probability distribution over possible states. The objective is then to find a policy π that prescribes actions depending on past actions and observations so as to maximize the expected discounted sum of rewards (here with an infinite temporal horizon).
To that end, a POMDP is often turned into a belief MDP ⟨Δ, A, T, r, γ, b0⟩ where Δ is the simplex of possible belief states, A is the same action set, and T(b, a, b′) = P(b′ | b, a) and r(b, a) = Σ_s b(s) r(s, a) are the induced transition and reward functions. This setting allows considering policies π : Δ → A, each being associated to its value function V^π(b) ≐ E[Σ_{t=0}^∞ γ^t r(b_t, π(b_t)) | b0 = b]. Optimal policies maximize V^π in all belief states reachable from b0. Their value function V* is the fixed point of Bellman’s optimality operator (H) [Bellman, 1957], HV : b ↦ max_a [r(b, a) + γ Σ_z ‖P_{a,z} b‖1 V(b^{a,z})], and acting greedily with respect to V* provides such a policy.
V* being piece-wise linear and convex (PWLC)² for any finite horizon [Sondik, 1971, Smallwood and Sondik, 1973] allows approximating it from below by an upper envelope L of hyperplanes, and from above by a lower envelope U of points. A local update at belief state b then allows improving U or L not only at b but in its vicinity. This generalization allows for error-bounded approximations using a finite number of belief points [Pineau et al., 2006], and for more efficient branch pruning in heuristic search approaches [Smith, 2007]. All this led to current off-line point-based algorithms such as PBVI [Pineau et al., 2003, 2006], HSVI [Smith and Simmons, 2004, 2005, Smith, 2007], SARSOP [Kurniawati et al., 2008], GapMin [Poupart et al., 2011], and PGVI [Zhang et al., 2014]. We shall consider in particular HSVI (Heuristic Search Value Iteration, see Algorithm 1) as it is a prototypical algorithm maintaining both U and L, and providing performance guarantees by stopping when U(b0) − L(b0) is below an ε threshold. HSVI decides on where to perform updates by generating trajectories picking (i) actions greedily w.r.t. U and (ii) observations so as to reduce the gap between U and L. Importantly, the HSVI framework is based on uniformly improvable bounds (cf. Sec. 5.1) and is applicable beyond POMDPs with PWLC approximations.

3.2 ρ-POMDPs

ρ-POMDPs [Araya-López et al., 2010] differ from POMDPs in their reward function ρ(b, a)—rather than r(s, a)—that allows defining not only control-oriented criteria, but also information-oriented ones, thus generalizing POMDPs.
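To make the belief-MDP construction of Sec. 3.1 concrete, here is a minimal sketch on a toy two-state model of our own (the model and names such as belief_update are illustrative, not from the paper): it computes the belief reward r(b, a) = Σ_s b(s) r(s, a), the observation likelihood ‖P_{a,z} b‖1, and the next belief b^{a,z}.

```python
import numpy as np

# Toy 2-state, 2-action, 2-observation model (illustrative only, not from the paper).
# P[a][z][s, s2] = Pr(s2, z | s, a); summing P[a][z] over z gives a stochastic matrix.
P = {
    0: {0: np.array([[0.6, 0.1], [0.2, 0.1]]),
        1: np.array([[0.2, 0.1], [0.3, 0.4]])},
    1: {0: np.array([[0.5, 0.2], [0.1, 0.2]]),
        1: np.array([[0.2, 0.1], [0.4, 0.3]])},
}
r = np.array([[1.0, 0.0],   # r[s, a]
              [0.0, 2.0]])

def belief_reward(b, a):
    # r(b, a) = sum_s b(s) r(s, a)
    return float(b @ r[:, a])

def belief_update(b, a, z):
    # Returns (Pr(z | b, a), next belief b^{a,z}).
    unnorm = b @ P[a][z]   # unnorm(s2) = sum_s b(s) P_{a,z}(s, s2)
    p_z = unnorm.sum()     # the normalization ||P_{a,z} b||_1
    return p_z, unnorm / p_z

b0 = np.array([0.5, 0.5])
p_z, b1 = belief_update(b0, 0, 0)   # p_z = 0.5, b1 = [0.8, 0.2]
```

Acting greedily then amounts to maximizing belief_reward(b, a) plus the γ-discounted expectation of the value at the beliefs returned by belief_update.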
Such problems are met regularly, but often modeled and addressed with ad-hoc techniques [Fox et al., 1998, Mihaylova et al., 2006]. Araya-López et al. [2010] have shown that, (i) if ρ is PWLC, previously described techniques can still be applied with similar error bounds, and (ii) if ρ is convex and either Lipschitz-continuous or α-Hölder (as Shannon’s entropy), then a PWLC approximation of ρ can be used to obtain error-bounded solutions again.
While many problems can be modeled with convex ρ, this leaves us with a number of problems that cannot be solved with similar approximations. Here, we will exploit Lipschitz-continuous reward functions ρ to solve more general ρ-POMDPs with similar algorithmic schemes. As an example, in the museum monitoring scenario, with X the random variable for a visitor’s location and b_X the corresponding belief, ρ_X(b, a) ≐ σ(α(‖b_X‖∞ − β))—with σ(·) the sigmoid function—is a smooth threshold function (thus non-convex) whose Lipschitz constant depends on α > 0, preferentially rewarding distributions whose maximum probability is greater than β ∈ [0, 1].

²It is thus also Lipschitz-continuous.

Algorithm 1: Heuristic Search Value Iteration & Inc-lc-HSVI

Fct HSVI(ε):
    Initialize L and U
    while (U(b0) − L(b0)) > ε do
        RecursivelyTry(b0, d = 0)
    return L

Fct RecursivelyTry(b, d):
    if (U(b) − L(b)) > γ^(−d) ε then
        Update(b)
        a* ∈ argmax_{a ∈ A} {r(b, a) + γ Σ_z ‖P_{a,z} b‖1 U(b^{a,z})}
        z* ∈ argmax_{z ∈ Z} {‖P_{a*,z} b‖1 × (U(b^{a*,z}) − L(b^{a*,z}) − γ^(−d) ε)}
        RecursivelyTry(b^{a*,z*}, d + 1)
        Update(b)

Fct Update(b):
    L ← Update(L, b)
    U ← Update(U, b)

/* Note: Vanilla HSVI for POMDPs uses PWLC approximators. lc-HSVI is HSVI with LC approximators. */
/* Below: Main loop of Incremental lc-HSVI (see Sec. 5.2). */

Fct inc-lc-HSVI(ε, λ0):
    λ ← λ0
    while fails(lc-HSVI(ε, λ)) do
        λ ← 2λ
    return L

3.3 Lipschitz-Continuities (in normed spaces)

Let f : X → Y be a function, where X and Y are normed spaces. f is uniformly Lipschitz-continuous if there exists λ_f ∈ R+ such that, for all (x, x′) ∈ X², ‖f(x) − f(x′)‖ ≤ λ_f ‖x − x′‖. f is locally Lipschitz-continuous if, for each x, there exists λ_f(x) ∈ R+ such that, for all x′ ∈ X, ‖f(x) − f(x′)‖ ≤ λ_f(x) ‖x − x′‖. The former definition is more common and induces uniform continuity of f, but we will rely on the latter (omitting “locally”), which induces local continuity of f, to handle more problems and obtain tighter bounds. We propose another generalization using vector rather than scalar Lipschitz constants, again to allow for tighter bounds: f is Lipschitz-continuous if, for each x, there exists a row vector λ_f(x) ∈ (R+)^dim(X) such that, for all x′ ∈ X, ‖f(x) − f(x′)‖ ≤ λ_f(x) · |x − x′| (a scalar product, equivalent to a weighted L1-norm).
Note that Lipschitz-continuity is here always relative to the simplex Δ, not R^|S|.
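The gain from vector over scalar Lipschitz constants can be checked numerically. The sketch below (our own toy example, not from the paper) takes a linear f(b) = w · b on the simplex, for which |f(b) − f(b′)| ≤ |w| · |b − b′| with the vector constant |w|, and verifies that this bound is always at least as tight as the scalar bound max_s |w_s| · ‖b − b′‖1.

```python
import numpy as np

# Our own toy example: a linear function f(b) = w . b on the 3-simplex.
w = np.array([3.0, -1.0, 0.5])

def f(b):
    return float(w @ b)

lam_vec = np.abs(w)                # vector constant: |f(b)-f(b2)| <= lam_vec . |b-b2|
lam_scal = float(np.abs(w).max())  # scalar constant: |f(b)-f(b2)| <= lam_scal * ||b-b2||_1

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.random(3), rng.random(3)
    b, b2 = x / x.sum(), y / y.sum()   # two random beliefs
    gap = abs(f(b) - f(b2))
    vec_bound = float(lam_vec @ np.abs(b - b2))
    scal_bound = lam_scal * float(np.abs(b - b2).sum())
    assert gap <= vec_bound + 1e-12 and vec_bound <= scal_bound + 1e-12
```

The same comparison with a local (per-point) constant can only tighten the bound further, which is why the paper relies on the local, vector definition.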
Δ and A being both compact, properties that hold in the local and vector setting also hold in the uniform and/or scalar setting (but bounds are looser).

4 Lipschitz-continuity of V*

Assuming a ρ-POMDP with local and vector Lipschitz-continuous reward function with “constant” λ_ρ(b, a) in (b, a), this section first states that Bellman’s optimality operator preserves the LC property, which then allows proving that V* is LC for finite horizons, but not necessarily for infinite ones.

Proposition 1 (H preserves Lipschitz-Continuity). Given a ρ-POMDP with λ_ρ(·,·)-LC reward function, and a λ_V(·)-LC value function V, then HV is (at least) λ_{HV}(·)-LC with, in each belief b (vmax denoting the CbC maximum over actions),

λ_{HV}(b) = vmax_a [λ_ρ(b, a) + γ Σ_z ((|V(b^{a,z})| + λ_V(b^{a,z}) · b^{a,z}) 1 + λ_V(b^{a,z})) P_{a,z}].   (1)

Proof (sketch). A key point here is to show that κ(w) ≐ ‖w‖1 V(ŵ) is LC (see supplementary material), which relies on the triangle (in)equality. Then the rest consists essentially in some algebra using this property and other LC properties. □

As can be observed in Eq. (1), the resulting update formula of the value function’s Lipschitz constant exploits both the locality (dependence on the belief b) and the use of a vector rather than a scalar. The dependence of the update formula on |V(b^{a,z})| may seem surprising since adding a constant k_r ∈ R to ρ(b, a) should induce adding a related constant k_V to V without changing local Lipschitz constants. This dependence is due to approximations made in the proof. |V(b^{a,z})| can in fact be replaced by |V(b^{a,z}) + k_V| in Equation 1 with a tunable k_V.
This induces the multi-objective task of minimizing the components of the (vector) Lipschitz constant through k_V, a problem we address by setting k_V = −(max_z V(b^{a,z}) + min_z V(b^{a,z}))/2.

Theorem 1 (Local Lipschitz-continuity of V*_t for finite T). Given a ρ-POMDP with λ_ρ(b, a)-LC ρ, for any finite time horizon T, the optimal value function is (locally+vector) LC.

Proof. The value function for T = 0 is trivially 0-LC. By induction, as Bellman’s optimality operator preserves the LC property (Prop. 1), the optimal value function is LC for any finite T. □

Asymptotic Behavior Previous results do not tell whether the resulting Lipschitz constant tends to a limit value when T goes to infinity. This issue is considered here for a scalar and uniform constant λ: λ_ρ(b, a) and λ_t(b) (the Lipschitz constant for V_t, not indicating the horizon T) respectively become λ_ρ and λ_t.

Corollary 1. For a ρ-POMDP with (uniform) λ_ρ-LC ρ, the optimal value function verifies, for all t and all b1, b2, |V*_t(b1) − V*_t(b2)| ≤ λ_t ‖b1 − b2‖1, where

λ_t = (λ_ρ + γ V^lim) (1 − (2γ)^(T−t)) / (1 − 2γ)   if γ ≠ 1/2,
λ_t = (λ_ρ + γ V^lim) (T − t)   if γ = 1/2,

with V^lim ≐ (1/(1 − γ)) max_{b,a} |r(b, a)|.

In the common case γ ≥ 1/2, λ_0 diverges when T → ∞. Hence, V* may not be LC in the infinite horizon setting.
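The closed form of Corollary 1 can be sanity-checked against the backward recursion it solves, namely λ_T = 0 and λ_t = (λ_ρ + γ V^lim) + 2γ λ_{t+1} (our reading of the closed form; the script below is illustrative, not from the paper):

```python
# Our reading of Corollary 1: the closed form solves the backward recursion
#   lambda_T = 0,   lambda_t = (lam_rho + gamma * v_lim) + 2 * gamma * lambda_{t+1}.

def lam_closed(t, T, lam_rho, gamma, v_lim):
    c = lam_rho + gamma * v_lim
    if gamma == 0.5:
        return c * (T - t)
    return c * (1 - (2 * gamma) ** (T - t)) / (1 - 2 * gamma)

def lam_recursive(T, lam_rho, gamma, v_lim):
    lam = [0.0] * (T + 1)
    for t in range(T - 1, -1, -1):
        lam[t] = (lam_rho + gamma * v_lim) + 2 * gamma * lam[t + 1]
    return lam

T, lam_rho, gamma, v_lim = 10, 2.0, 0.95, 20.0
rec = lam_recursive(T, lam_rho, gamma, v_lim)
assert all(abs(rec[t] - lam_closed(t, T, lam_rho, gamma, v_lim)) < 1e-6 for t in range(T + 1))
# For gamma >= 1/2 the constant diverges with the horizon:
assert lam_closed(0, 100, lam_rho, gamma, v_lim) > lam_closed(0, 10, lam_rho, gamma, v_lim)
```

With γ = 0.95 the factor (2γ)^(T−t) grows geometrically, which is the divergence noted above.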
Yet, it will suffice to compute finite-horizon LC approximations of V* (as usual with PWLC approximations), as explained in the next section.

5 Approximating V*

This section shows how to define, initialize, update and prune LC upper- and lower-bounding approximators of V*, and then derive an ε-optimal variant of the HSVI algorithm.

5.1 Upper- and Lower-Bounding V*

The upper-bounding LC approximator is defined as a finite set of downward-pointing L1-cones (see Figure 1 (left)), where an upper-bounding cone c^U_β = ⟨β, u, λ⟩—located at belief β, with “summit” value u and “slope” vector λ—induces a function U_β(b) = u + λ · |β − b|. The upper bound is thus defined as the lower envelope of a set of cones C^U = {c^U_β}_{β ∈ B^U}—i.e., U(b) = min_{β ∈ B^U} U_β(b). Respectively, for the lower-bounding approximator: a lower-bounding (upward-pointing) cone c^L_β = ⟨β, l, λ⟩ induces a function L_β(b) = l − λ · |β − b|; and the lower bound is defined as the upper envelope of a set of cones C^L = {c^L_β}_{β ∈ B^L}—i.e., L(b) = max_{β ∈ B^L} L_β(b).
We now (i) show how the (pointwise) update of the upper- or lower-bound preserves this representation; (ii) verify that the properties required for HSVI to converge to an ε-optimal solution still hold; and (iii) discuss their initialization.

Updating (Upper and Lower) Bounds The following proposition and its counterpart state that, for both U and L, a pointwise update results in adding a new cone with its own Lipschitz constant.
Theorem 2 (Updating U). Let us assume that (i) ρ is λ_ρ(b, a)-LC for each (b, a), and (ii) the upper bound U is described by a set of upper cones C^U.
Then, for any b, an improved upper bound is obtained by adding a cone in b, with value and Lipschitz constant:

u(b) = [HU](b) = max_a (ρ(b, a) + γ Σ_z ‖P_{a,z} b‖1 U(b^{a,z})),
λ(b) = vmax_{a′} (λ_ρ(b, a′) + γ Σ_z [λ(β^{a′,z}) + (|u(β^{a′,z})| + λ(β^{a′,z}) · β^{a′,z}) 1] P_{a′,z}),

where β^{a,z} is the current point in B′ that best approximates U in b^{a,z} (see Fig. 1 (left)).

Figure 1: (left) An optimal V* surrounded by its upper and lower bounds (2 cones for U in red, and 3 cones for L in blue). The value of U at b^{a,z} is approximated by the cone located at β^{a,z}. (right) Grid environment of the grid-info problem (Sec. 6.1).

Figure 2: Bellman update of U (left) and L (right) at b resulting in 1 (dashed) cone per action. (left) The upper envelope of 3 cones is approximated by (solid) cone ⟨b, u2, λ3⟩. (right) The upper envelope of 3 cones is approximated by (solid) cone ⟨b, l2, λ2⟩.

The operator performing this update at b is noted K^U_b and the updated upper bound is thus K^U_b U. Intuitively (see Fig. 2 (left)), each action induces one cone, which may be preferred depending on the point b′ where the evaluation is done.
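Evaluating the cone envelopes of Sec. 5.1 takes only a few lines. The sketch below (our own; names like upper_bound are hypothetical) computes U(b) as the lower envelope of upper cones ⟨β, u, λ⟩ and L(b) as the upper envelope of lower cones ⟨β, l, λ⟩ on a 2-state simplex:

```python
import numpy as np

# Minimal sketch (our own) of the L1-cone bounds of Sec. 5.1.
# Upper cone (beta, u, lam): U_beta(b) = u + lam . |beta - b|
# Lower cone (beta, l, lam): L_beta(b) = l - lam . |beta - b|

def upper_bound(b, cones):
    # U(b): lower envelope (min) of the upper cones
    return min(u + float(lam @ np.abs(beta - b)) for beta, u, lam in cones)

def lower_bound(b, cones):
    # L(b): upper envelope (max) of the lower cones
    return max(l - float(lam @ np.abs(beta - b)) for beta, l, lam in cones)

up = [(np.array([1.0, 0.0]), 5.0, np.array([2.0, 2.0])),
      (np.array([0.0, 1.0]), 4.0, np.array([1.0, 1.0]))]
lo = [(np.array([0.5, 0.5]), 1.0, np.array([3.0, 3.0]))]

b = np.array([0.5, 0.5])
assert upper_bound(b, up) >= lower_bound(b, lo)   # U stays above L at b
```

A pointwise update then appends one new tuple to the relevant list, with u(b) or l(b) from the Bellman(-like) update above.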
Yet, rather than adding the upper envelope of this set of cones, a single upper-bounding cone is employed using the maximum u(b) and λ(b).
A pointwise update of U in b consists in computing (i) u(b) by Bellman update, in Θ(|A| × |Z| × (|S|² + |B^U| × |S|)) (memoizing points β^{a,z} that optimize U(b^{a,z})), and (ii) λ(b) by a Bellman-like update, in Θ(|A| × |Z| × |S|²). The latter computation searches for the worst Lipschitz constant component by component, thus harming the generalization capabilities. Better Lipschitz constants could be obtained by first pruning cones that are dominated by other cones.
For the lower bound L, an update requires adding at least one of the cones with maximum l, but we even add each cone induced by each action. The complexity is unchanged (replacing B^U by B^L).

Pruning Cones The LC setting requires a procedure for pruning cones. Without loss of generality, we consider U. A cone c^U_β must be maintained if there exists b ∈ Δ such that U_β(b) < min_{β′ ∈ B^U ∖ {β}} U_{β′}(b), which is equivalent to finding a strictly negative value of φ^U_β(b) ≐ U_β(b) − min_{β′ ∈ B^U ∖ {β}} U_{β′}(b). This could be done by applying a minimization procedure on φ^U_β until a negative value is found. A more pragmatic approach is to search, for cone c^U_β, if another cone c^U_{β′} exists that completely dominates it—i.e., (i) u(β) ≥ u(β′) + λ(β′) · |β − β′|, and (ii) λ(β) ≥ λ(β′). This can be improved by comparing not the Lipschitz constants, but the value at the corners of the simplex, to make sure that dominance is checked inside Δ.
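The pragmatic dominance test can be sketched as follows (our own reading of the procedure, with a hypothetical name dominates: an upper cone is dropped when another upper cone lies below it at its summit and at all simplex corners):

```python
import numpy as np

# Sketch (our own reading) of the pragmatic pruning test: upper cone c can be
# dropped if another upper cone c_other lies below it at c's summit and at all
# simplex corners (so dominance is checked inside the simplex only).

def dominates(c_other, c, corners):
    beta_o, u_o, lam_o = c_other
    beta, u, lam = c
    if u < u_o + float(lam_o @ np.abs(beta - beta_o)):
        return False   # c is strictly below c_other at its own summit
    return all(
        u + float(lam @ np.abs(beta - e)) >= u_o + float(lam_o @ np.abs(beta_o - e))
        for e in corners
    )

corners = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
c1 = (np.array([0.5, 0.5]), 3.0, np.array([2.0, 2.0]))
c2 = (np.array([0.5, 0.5]), 2.0, np.array([1.0, 1.0]))   # lower summit, flatter slope
assert dominates(c2, c1, corners) and not dominates(c1, c2, corners)
```

As the surrounding text notes, such a check is conservative: it may keep cones that a full minimization of φ^U_β would prune.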
The resulting process is\nconservative and cheap, but not complete.\n\nPreservation of HSVI\u2019s Convergence Properties Ensuring \ufb01nite time convergence of HSVI (even\nbeyond POMDPs) to an \u0001-optimal solution requires using (a) a uniformly improvable (UI) lower bound\nL, i.e., HL \u2265 L, where H is Bellman optimality operator; (b) respectively a uniformly improvable\n(UI) upper bound U; (c) a strong pointwise update operator for the lower bound L, K L\u00b7 , i.e., for each\nb where it is applied and any L, (i) (HL)(b) = (K L\nb L)(b(cid:48)) in any\nother point b(cid:48); and (d) resp. a strong pointwise update operator for U, K U\u00b7\nTrivially, our proposed operators are strong, thus conservative [Smith, 2007, Def. 3.24, 3.25]. Also,\nany conservative update operator preserves UI [Smith, 2007, Th. 3.29]. We thus essentially need to\nensure that initializations induce UI bounds (see next sub-section).\n\nb L)(b) and (ii) (HL)(b(cid:48)) \u2265 (K L\n\n.\n\nInitialization For a usual POMDP (\u03c1 = r), initializations described by Smith [2007] are UI by\nconstruction. For a \u03c1-POMDP, similar constructions seem dif\ufb01cult to obtain. Another option is to\ngo back to a POMDP with reward (linear in b) ru upper-bounding (resp. rl lower-bounding) \u03c1. We\ncan then (i) employ the associated POMDP initialization, or (ii) solve the resulting POMDPs. In\neach case, the resulting bounds can be used as UI LC (with in\ufb01nitely many cones). Going further, \u03c1\ncould even be better upper- (resp. lower-) bounded by a lower (resp. upper) envelope of linear reward\nfunctions, which would lead to better initializations of U (resp. L) by taking lower (resp. 
upper)\nenvelopes of independent bounds.\n\n5.2 Algorithms\n\nWe will distinguish HSVI variants depending on the approximators at hand: pwlc-HSVI, lc-HSVI\nand pw-HSVI respectively depend on the classical PWLC approximators, the LC approximators\npreviously described, and non-generalizing pointwise (PW) approximators (equivalent to cones with\nan in\ufb01nite Lipschitz constant). In each case, HSVI\u2019s convergence guarantees hold.\n\nIncremental Variant\nlc-HSVI computes (using Eq.(1)) upper bounds on the true (local and vector)\nLipschitz constants\u2014i.e., the smallest constants for which the Lipschitz property holds. Yet, these\nupper bounds are often very pessimistic, which leads to (i) a poor generalization capability of U\nand L, and, (ii) as a consequence, a very slow convergence. To circumvent the resulting pessimistic\nbounds and to assess how much is lost due to this pessimism, we also propose another algorithm\nthat incrementally searches for a valid (global and scalar) Lipschitz constant \u03bb. The intuition is that,\ndespite the search process, the resulting planning process could be more ef\ufb01cient due to (i) quickly\ndetecting when a constant is invalid, and (ii) quickly converging to a solution when a valid constant is\nfound. One issue is that the algorithm may terminate with an invalid solution.\nWe \ufb01rst need to de\ufb01ne lc-HSVI(\u03bb), a variant of lc-HSVI where the Lipschitz constant is uniformly\nconstrained to (scalar) value \u03bb. As a consequence, (i) adding a new cone at \u03b2 only requires computing\nu(\u03b2) or l(\u03b2), and (ii) the pragmatic pruning process is complete. 
If λ is not large enough, the algorithm may fail due to (LXU) L and U crossing each other at an update point b, (NUI) L or U being not uniformly improvable—i.e., an update leads to a worse value than expected—or (UR) unstable results—i.e., two consecutive runs with values λ_t and λ_{t+1} verify |L_t(b0) − L_{t+1}(b0)| > ε.
Then, lc-HSVI(λ) is incorporated in an incremental algorithm, inc-lc-HSVI (see Alg. 1, fct. inc-lc-HSVI), that starts with some initial λ and runs lc-HSVI(λ) with geometrically increasing values of λ until lc-HSVI(λ) returns with no (LXU/NUI/UR) failure. As already mentioned, this process does not guarantee that a large enough λ has been found for L and U to be proper bounds. In practice, we use λ_0 = 1, but problem-dependent values should be considered to avoid being sensitive to affine transforms of the reward function.

6 Experiments

6.1 Benchmark Problems

To evaluate the various algorithms at hand, we consider both POMDP and ρ-POMDP benchmark problems. The former problems—a diverse set taken from Cassandra’s POMDP page³—allow comparing the proposed algorithms against the standard pwlc-HSVI. The ρ-POMDP problems are all based on a grid environment as described below.

grid-info ρ-POMDP We consider an agent moving on a 3 × 3 toric grid with white and black cells (see Fig. 1 (right)). Each cell is indicated by its coordinates (x, y) (∈ {1, 2, 3}). The agent is initially placed uniformly at random. Moves (n, s, e, w) succeed with probability .8, otherwise the agent stays still. The current cell’s color is observed with probability 1. γ is set to 0.95.
Let b_x (resp. b_y) be the belief over the x (resp. y) coordinate. Then, ρ(b) = +‖b_x − (1/3) 1‖1 (resp. −‖b_x − (1/3) 1‖1) rewards knowing x (kx) (resp. not knowing x (¬kx)).
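The grid-info reward can be written directly from its definition (the sketch below is our own transcription; the cell indexing convention b[3·(x − 1) + (y − 1)] is an assumption, not from the paper):

```python
import numpy as np

# The grid-info reward, written from its definition (our own transcription;
# the indexing convention b[3 * (x - 1) + (y - 1)] is an assumption).

def x_marginal(b):
    return b.reshape(3, 3).sum(axis=1)   # marginal belief over the x coordinate

def rho_kx(b):
    # rho(b) = +||b_x - (1/3) 1||_1  (rewards knowing x)
    return float(np.abs(x_marginal(b) - 1.0 / 3.0).sum())

def rho_not_kx(b):
    return -rho_kx(b)                    # rewards NOT knowing x

uniform = np.full(9, 1.0 / 9.0)
certain = np.zeros(9)
certain[0] = 1.0                         # agent surely in cell (1, 1)
assert rho_kx(uniform) < 1e-12           # uniform x-marginal: no reward
assert abs(rho_kx(certain) - 4.0 / 3.0) < 1e-12
```

This ρ is 1-Lipschitz in the belief (a composition of linear maps and the L1-norm), so it fits the λ_ρ-LC assumption of the analysis.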
And replacing b_x by b_y allows rewarding knowing y (ky) and not knowing y (¬ky).

6.2 Experiments

We run x-HSVI (x ∈ {pwlc, pw, lc, inc-lc}) on all benchmark problems—with the exception of pwlc-HSVI not being run on ρ-POMDPs—setting ε = 0.1 and a timeout of 600 s. In inc-lc-HSVI, λ is initially set to 1. L and U are initialized (i) for POMDPs, using HSVI’s blind estimate and MDP estimate, and (ii) for ρ-POMDPs, using R_min/(1 − γ) and R_max/(1 − γ). The Java program⁴ is run on an i5 CPU M540 at 2.53 GHz. Experimental results are presented in Table 1. When convergence is not achieved, we look at the final L(b0) and U(b0) values to assess the efficiency of an algorithm. Note that, for inc-lc-HSVI, log2(λ) gives the number of restarts.

inc-lc-HSVI’s Restart Criteria We first look at the effect of the three restart criteria in inc-lc-HSVI through the top sub-table. The first two columns are similar, showing that not testing whether L and U cross each other (noLXU) has little influence. Looking at execution traces, the LXU criterion is in fact only triggered when not checking for uniform improvability (noNUI). The time to converge is notably sped up by not testing for unstable results (noUR), with only one case of convergence to bad values in the Tiger problem (tiger70). More speed improvement is obtained by not testing uniform improvability (noNUI), in which case the LXU rule is triggered more often. As a result, we take as our default configuration the “noNUI” setting—which only uses the LXU and UR stopping criteria.

Comparing Approximators and Algorithms We now compare the four algorithms at hand through the bottom sub-table. pwlc-HSVI (when applicable) dominates the experiments overall, except in a few cases where inc-lc-HSVI converges in less time.
As can be observed, the Lipschitz constants obtained by inc-lc-HSVI are of the same order of magnitude as those derived in pwlc-HSVI from the final lower bounds L. inc-lc-HSVI(noNUI) would be a satisfying solution on the benchmarks at hand (when not using the PWLC property) if it did not lack theoretical guarantees. For its part, lc-HSVI ends up with worst-case constants orders of magnitude larger in many cases, which suggests that its bounds have little generalization capability, as in pw-HSVI. pw-HSVI is obviously faster than lc-HSVI due to much cheaper updates.

7 Discussion

This work shows that, for finite horizons, the optimal value function of a ρ-POMDP with Lipschitz ρ is Lipschitz-continuous (LC). The Lipschitz continuity here is not uniform with a scalar constant, but local with a vector constant, which allows for more efficient updates. While the PWLC property (of V*) provides useful generalizing lower and upper bounds for POMDPs and ρ-POMDPs with convex ρ, the LC property provides similar bounds for POMDPs and ρ-POMDPs with LC ρ, where V* may not be convex.
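Concretely, such LC bounds over a set of stored belief points can be sketched as follows, using a uniform scalar λ for simplicity (the paper's bounds use local vector constants; all names are illustrative):

```python
import numpy as np

def lc_lower_bound(b, points, values, lam):
    """Illustrative LC lower bound: the envelope (max) of upward-pointing
    cones v_i - lam * ||b - b_i||_1 over stored pairs (b_i, v_i)."""
    b = np.asarray(b, dtype=float)
    return max(v - lam * np.abs(b - bi).sum() for bi, v in zip(points, values))

def lc_upper_bound(b, points, values, lam):
    """Dual construction: the envelope (min) of downward-pointing cones
    v_i + lam * ||b - b_i||_1 over stored pairs (b_i, v_i)."""
    b = np.asarray(b, dtype=float)
    return min(v + lam * np.abs(b - bi).sum() for bi, v in zip(points, values))
```

At a stored point, both envelopes recover the stored value exactly; away from it, they degrade at rate λ per unit of L1 distance in the simplex.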
These bounds are envelopes of either upward- or downward-pointing "cones" and, with appropriate initializations, are uniformly improvable. Two algorithms are proposed: HSVI used with these "LC bounds", which preserves HSVI's convergence properties, and an incremental algorithm that searches for a (uniform) scalar Lipschitz constant allowing for fast computations, with no guarantee that the bounds are valid.
The experiments show that lc-HSVI's pessimistic constants are far from inc-lc-HSVI's guesses. This encourages searching for better (safe) Lipschitz constants, possibly using a particular norm such that the dynamics of the bMDP are LC,5 as Platzman [1977] did for sub-rectangular bMDPs (a restrictive class of problems), but also improving the initialization of L and U, and possibly inc-lc-HSVI's restart and stopping criteria (ideally guaranteeing that a valid constant is found).
We also aim at exploiting Lipschitz continuity to solve partially observable stochastic games (POSGs) [Hansen et al., 2004]. Indeed, while the PWLC property allows efficiently solving not only POMDPs, but also Dec-POMDPs turned into occupancy MDPs [Dibangoye et al., 2013, 2016], the LC property may make it possible to provide generalizing bounds for POSGs turned into occupancy SGs, starting with 2-player 0-sum scenarios.

Table 1: Comparison of (top) inc-lc-HSVI with all 3 stopping criteria, or with 1 of them disabled, and (bottom) x-HSVI algorithms (for x ∈ {pwlc, pw, lc, inc-lc}), in terms of (i) CPU time (timeout 600 s), (ii) number of trajectories, (iii-iv) width (gap) at b0, and (v) Lipschitz constant λ. [Per-problem rows for 4x3.95, 4x4.95, cheese.95, cit, hallway, hallway2, milos-aaai97, mit, network, paint.95, pentagon, shuttle.95, tiger85, tiger-grid, and the four grid-info variants; the numeric entries are not reliably recoverable from this extraction.]

3 http://www.pomdp.org/examples/
4 Full code available here: https://gitlab.inria.fr/buffet/lc-hsvi-nips18
5 Previously cited works on LC MDPs rely on LC dynamics.

Acknowledgments

We thank David Reboullet for helping with some proofs (reminding us about the power of the triangle (in)equality), and the anonymous reviewers for their insightful comments.

References

M. Araya-López, O.
Buffet, V. Thomas, and F. Charpillet. A POMDP extension with belief-dependent rewards. In Advances in Neural Information Processing Systems 23 (NIPS-10), 2010.

K. Astrom. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1), 1965. ISSN 0022-247X.

R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 1957.

J. Dibangoye, C. Amato, O. Buffet, and F. Charpillet. Optimally solving Dec-POMDPs as continuous-state MDPs. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI-13), 2013.

J. Dibangoye, C. Amato, O. Buffet, and F. Charpillet. Optimally solving Dec-POMDPs as continuous-state MDPs. Journal of Artificial Intelligence Research, 55, 2016. URL http://www.jair.org/papers/paper4623.html.

F. Dufour and T. Prieto-Rumeau. Approximation of Markov decision processes with general state space. Journal of Mathematical Analysis and Applications, 388(2), 2012. ISSN 0022-247X. doi: https://doi.org/10.1016/j.jmaa.2011.11.015. URL http://www.sciencedirect.com/science/article/pii/S0022247X11010353.

M. Egorov, M. J. Kochenderfer, and J. J. Uudmae. Target surveillance in adversarial environments using POMDPs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 2016.

R. Fonteneau, S. Murphy, L. Wehenkel, and D. Ernst. Inferring bounds on the performance of a control policy from a sample of trajectories. In Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL-09), 2009.

D. Fox, W. Burgard, and S. Thrun. Active Markov localization for mobile robots. Robotics and Autonomous Systems, 25(3-4), 1998. doi: http://dx.doi.org/10.1016/S0921-8890(98)00049-9.

E. Hansen, D. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games.
In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), 2004.

K. Hinderer. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research, 62(1), Sep 2005. doi: 10.1007/s00186-005-0438-1. URL https://doi.org/10.1007/s00186-005-0438-1.

S. Ieong, N. Lambert, Y. Shoham, and R. Brafman. Near-optimal search in continuous domains. In Proceedings of the National Conference on Artificial Intelligence (AAAI-07), 2007.

H. Kurniawati, D. Hsu, and W. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems IV, 2008.

R. Laraki and W. D. Sudderth. The preservation of continuity and Lipschitz continuity by optimal reward operators. Mathematics of Operations Research, 29(3), 2004.

L. Mihaylova, T. Lefebvre, H. Bruyninckx, and J. D. Schutter. Active robotic sensing as decision making with statistical methods. In NATO Science Series on Data Fusion for Situation Monitoring, Incident Detection, Alert and Response Management, volume 198. 2006.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.

J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), 2003.

J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research (JAIR), 27, 2006.

L. K. Platzman. Finite Memory Estimation and Control of Finite Probabilistic Systems. PhD thesis, Massachusetts Institute of Technology (MIT), 1977.

P. Poupart, K.-E. Kim, and D. Kim. Closing the gap: Improved bounds on optimal POMDP solutions.
In Proceedings of the Twenty-First International Conference on Automated Planning and Scheduling (ICAPS-11), 2011.

E. Rachelson and M. Lagoudakis. On the locality of action domination in sequential decision making. In Proc. of the International Symposium on Artificial Intelligence and Mathematics (ISAIM-10), 2010.

Y. Satsangi, S. Whiteson, and M. T. J. Spaan. An analysis of piecewise-linear and convex value functions for active perception POMDPs. Technical Report IAS-UVA-15-01, IAS, Universiteit van Amsterdam, 2015.

R. Smallwood and E. Sondik. The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21, 1973.

T. Smith. Probabilistic Planning for Robotic Exploration. PhD thesis, The Robotics Institute, Carnegie Mellon University, 2007.

T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence (UAI-04), 2004.

T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI-05), 2005.

E. Sondik. The Optimal Control of Partially Observable Markov Decision Processes. PhD thesis, Stanford University, 1971.

M. T. Spaan, T. S. Veiga, and P. U. Lima. Decision-theoretic planning under uncertainty with information rewards for active cooperative perception. Autonomous Agents and Multi-Agent Systems, 29(6), 2015.

N. L. Zhang and W. Zhang. Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14, 2001. URL http://dx.doi.org/10.1613/jair.761.

Z. Zhang, D. Hsu, and W. S. Lee. Covering number for efficient heuristic-based POMDP planning. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
URL http://jmlr.org/proceedings/papers/v32/zhanga14.pdf.