{"title": "Cost-Sensitive Exploration in Bayesian Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3068, "page_last": 3076, "abstract": "In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected long-term total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems.", "full_text": "Cost-Sensitive Exploration in\n\nBayesian Reinforcement Learning\n\nDongho Kim\n\nDepartment of Engineering\nUniversity of Cambridge, UK\n\nKee-Eung Kim\n\nDept of Computer Science\n\nKAIST, Korea\n\ndk449@cam.ac.uk\n\nkekim@cs.kaist.ac.kr\n\nPascal Poupart\n\nSchool of Computer Science\n\nUniversity of Waterloo, Canada\nppoupart@cs.uwaterloo.ca\n\nAbstract\n\nIn this paper, we consider Bayesian reinforcement learning (BRL) where actions\nincur costs in addition to rewards, and thus exploration has to be constrained in\nterms of the expected total cost while learning to maximize the expected long-\nterm total reward.\nIn order to formalize cost-sensitive exploration, we use the\nconstrained Markov decision process (CMDP) as the model of the environment, in\nwhich we can naturally encode exploration requirements using the cost function.\nWe extend BEETLE, a model-based BRL method, for learning in the environment\nwith cost constraints. 
We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems.\n\n1 Introduction\n\nIn reinforcement learning (RL), the agent interacts with a (partially) unknown environment, classically assumed to be a Markov decision process (MDP), with the goal of maximizing its expected long-term total reward. The agent faces the exploration-exploitation dilemma: it must select actions that exploit its current knowledge about the environment to maximize reward, but it also needs to select actions that explore for more information so that it can act better. Bayesian RL (BRL) [1, 2, 3, 4] provides a principled framework for the exploration-exploitation dilemma.\n\nHowever, exploratory actions may have serious consequences. For example, a robot exploring unfamiliar terrain may reach a dangerous location and sustain heavy damage, or wander off from the recharging station to the point where a costly rescue mission is required. In a less mission-critical scenario, a route recommendation system that learns actual travel times should be aware of the toll fees associated with different routes. Therefore, the agent needs to carefully (if not completely) avoid critical situations while exploring to gain more information.\n\nThe constrained MDP (CMDP) extends the standard MDP to account for limited resources or multiple objectives [5]. The CMDP assumes that executing actions incurs costs in addition to rewards, and that the two should be optimized separately. Assuming the expected total reward and cost criterion, the goal is to \ufb01nd an optimal policy that maximizes the expected total reward while bounding the expected total cost. Since we can naturally encode undesirable behaviors into the cost function, we formulate the cost-sensitive exploration problem as RL in an environment modeled as a CMDP.\n\nNote that we can employ other criteria for the cost constraint in CMDPs. 
We can keep the actual total cost below the cost bound with probability one using sample-path cost constraints [6, 7], or with probability 1 \u2212 \u03b4 using percentile cost constraints [8]. In this paper, we restrict ourselves to the expected total cost constraint, mainly due to the computational ef\ufb01ciency of solving the constrained optimization problem. Extending our work to other cost criteria is left as future work. The main argument we make is that the CMDP provides a natural framework for representing various approaches to constrained exploration, such as safe exploration [9, 10].\n\n1\n\n\fIn order to perform cost-sensitive exploration in the Bayesian RL (BRL) setting, we cast the problem as a constrained partially observable MDP (CPOMDP) [11, 12] planning problem. Speci\ufb01cally, we take a model-based BRL approach and extend BEETLE [4] to solve the CPOMDP which models BRL with cost constraints.\n\n2 Background\n\nIn this section, we review the background for cost-sensitive exploration in BRL. As explained in the previous section, we assume that the environment is modeled as a CMDP, and formulate model-based BRL as a CPOMDP. We brie\ufb02y review the CMDP and CPOMDP before summarizing BEETLE, a model-based BRL method for environments without cost constraints.\n\n2.1 Constrained MDPs (CMDPs) and Constrained POMDPs (CPOMDPs)\n\nThe standard (in\ufb01nite-horizon discounted return) MDP is de\ufb01ned by the tuple \u27e8S, A, T, R, \u03b3, b0\u27e9 where: S is the set of states s; A is the set of actions a; T(s, a, s\u2032) is the transition function which denotes the probability Pr(s\u2032|s, a) of changing to state s\u2032 from s by executing action a; R(s, a) \u2208 \u211c is the reward function which denotes the immediate reward of executing action a in state s; \u03b3 \u2208 [0, 1) is the discount factor; b0(s) is the initial state probability for state s. 
b0 is optional, since an optimal policy \u03c0\u2217 : S \u2192 A that maps states to actions can be shown not to depend on b0.\nThe constrained MDP (CMDP) is de\ufb01ned by the tuple \u27e8S, A, T, R, C, \u02c6c, \u03b3, b0\u27e9 with the following additional components: C(s, a) \u2208 \u211c is the cost function which denotes the immediate cost incurred by executing action a in state s; \u02c6c is the bound on the expected total discounted cost.\nAn optimal policy of a CMDP maximizes the expected total discounted reward over the in\ufb01nite horizon, while not incurring more than \u02c6c total discounted cost in expectation. We can formalize this constrained optimization problem as:\n\nmax_\u03c0 V^\u03c0 s.t. C^\u03c0 \u2264 \u02c6c\n\nwhere V^\u03c0 = E_{\u03c0,b0}[\u2211_{t=0}^{\u221e} \u03b3^t R(st, at)] is the expected total discounted reward, and C^\u03c0 = E_{\u03c0,b0}[\u2211_{t=0}^{\u221e} \u03b3^t C(st, at)] is the expected total discounted cost. We will also use C^\u03c0(s) to denote the expected total cost starting from state s.\nIt has been shown that an optimal policy for a CMDP is generally a randomized stationary policy [5]. Hence, we de\ufb01ne a policy \u03c0 as a mapping from states to probability distributions over actions, where \u03c0(s, a) denotes the probability that the agent executes action a in state s. We can \ufb01nd an optimal policy by solving the following linear program (LP):\n\nmax_x \u2211_{s,a} R(s, a)x(s, a)\ns.t. \u2211_a x(s\u2032, a) \u2212 \u03b3 \u2211_{s,a} x(s, a)T(s, a, s\u2032) = b0(s\u2032) \u2200s\u2032\n\u2211_{s,a} C(s, a)x(s, a) \u2264 \u02c6c and x(s, a) \u2265 0 \u2200s, a\n\n(1)\n\nThe variables x are related to the occupancy measure of the optimal policy, where x(s, a) is the expected discounted number of times action a is executed in state s. If the above LP yields a feasible solution, an optimal policy can be obtained by \u03c0(s, a) = x(s, a)/\u2211_{a\u2032} x(s, a\u2032). 
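The LP (1) and the policy-extraction rule above can be sketched with an off-the-shelf solver. This is a minimal illustration, not the paper's implementation: `solve_cmdp` is a hypothetical helper name, and the tiny two-state CMDP in the usage example is our own, not one of the paper's domains.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp(T, R, C, gamma, b0, c_hat):
    """Solve the CMDP LP (1) over occupancy measures x(s, a):
    maximize expected discounted reward subject to an expected
    discounted cost bound, then extract a randomized policy."""
    S, A = R.shape
    n = S * A  # one variable x(s, a) per state-action pair, flattened as s*A + a
    # Flow constraints: sum_a x(s', a) - gamma * sum_{s,a} x(s, a) T(s, a, s') = b0(s')
    A_eq = np.zeros((S, n))
    for sp in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] = (1.0 if s == sp else 0.0) - gamma * T[s, a, sp]
    # Cost constraint: sum_{s,a} C(s, a) x(s, a) <= c_hat; linprog minimizes, so negate R
    res = linprog(-R.reshape(n), A_ub=C.reshape(1, n), b_ub=[c_hat],
                  A_eq=A_eq, b_eq=b0, bounds=[(0, None)] * n)
    x = res.x.reshape(S, A)
    occ = x.sum(axis=1, keepdims=True)
    # pi(s, a) = x(s, a) / sum_a' x(s, a'); uniform where the occupancy is ~0
    pi = np.where(occ > 1e-9, x / np.maximum(occ, 1e-9), 1.0 / A)
    return pi, -res.fun
```

If the LP is infeasible (no policy meets the cost bound), `res.x` is unavailable, mirroring the infeasibility caveat discussed below.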
Note that due to the introduction of cost constraints, the resulting optimal policy is contingent on the initial state distribution b0, in contrast to the standard MDP, for which an optimal policy can be independent of the initial state distribution. Note also that the above LP may be infeasible if there is no policy that can satisfy the cost constraint.\n\nThe constrained POMDP (CPOMDP) extends the standard POMDP in a similar manner. The standard POMDP is de\ufb01ned by the tuple \u27e8S, A, Z, T, O, R, \u03b3, b0\u27e9 with the following additional components: the set Z of observations z, and the observation probability O(s\u2032, a, z) representing the probability Pr(z|s\u2032, a) of observing z when executing action a and changing to state s\u2032. The states in the POMDP are hidden to the agent, and it has to act based on the observations instead. The CPOMDP is de\ufb01ned by adding the cost function C and the cost bound \u02c6c into the de\ufb01nition as in the CMDP. Although the CPOMDP is intractable to solve, as is the case with the POMDP, there exists an ef\ufb01cient point-based algorithm [12].\n\n2\n\n\fAlgorithm 1: Point-based backup of \u03b1-vector pairs with admissible cost\ninput : (b, d) with belief state b and admissible cost d; set \u0393 of \u03b1-vector pairs\noutput: set \u0393\u2032_{(b,d)} of \u03b1-vector pairs (contains at most 2 pairs for a single cost function)\n// regress\nforeach a \u2208 A do\n  \u03b1^{a,\u2217}_R = R(\u00b7, a), \u03b1^{a,\u2217}_C = C(\u00b7, a)\n  foreach (\u03b1_{i,R}, \u03b1_{i,C}) \u2208 \u0393, z \u2208 Z do\n    \u03b1^{a,z}_{i,R}(s) = \u2211_{s\u2032} T(s, a, s\u2032)O(s\u2032, a, z)\u03b1_{i,R}(s\u2032)\n    \u03b1^{a,z}_{i,C}(s) = \u2211_{s\u2032} T(s, a, s\u2032)O(s\u2032, a, z)\u03b1_{i,C}(s\u2032)\n// backup for each action\nforeach a \u2208 A do\n  Solve the following LP to obtain the best randomized action at the next time step:\n    max_{\u02dcw_{iz}, d_z} b \u00b7 \u2211_{i,z} \u02dcw_{iz}\u03b1^{a,z}_{i,R} subject to\n      b \u00b7 \u2211_i \u02dcw_{iz}\u03b1^{a,z}_{i,C} \u2264 d_z \u2200z\n      \u2211_i \u02dcw_{iz} = 1 \u2200z\n      \u02dcw_{iz} \u2265 0 \u2200i, z\n      \u2211_z d_z = (1/\u03b3)(d \u2212 C(b, a))\n  \u03b1^a_R = \u03b1^{a,\u2217}_R + \u03b3 \u2211_{i,z} \u02dcw_{iz}\u03b1^{a,z}_{i,R}\n  \u03b1^a_C = \u03b1^{a,\u2217}_C + \u03b3 \u2211_{i,z} \u02dcw_{iz}\u03b1^{a,z}_{i,C}\n// \ufb01nd the best randomized action for the current time step\nSolve the following LP:\n  max_{w_a} b \u00b7 \u2211_a w_a\u03b1^a_R subject to\n    b \u00b7 \u2211_a w_a\u03b1^a_C \u2264 d\n    \u2211_a w_a = 1\n    w_a \u2265 0 \u2200a\nreturn \u0393\u2032_{(b,d)} = {(\u03b1^a_R, \u03b1^a_C) | w_a > 0}\n\nThe Bellman backup operator for the CPOMDP generates pairs of \u03b1-vectors (\u03b1_R, \u03b1_C), each vector corresponding to the expected total reward and cost, respectively. In order to facilitate de\ufb01ning the Bellman backup operator at a belief state, we augment the belief state with a scalar quantity called the admissible cost [13], which represents the expected total cost that can additionally be incurred in future time steps without violating the cost constraint. Suppose that, at time step t, the agent has so far incurred a total cost of W_t, i.e., W_t = \u2211_{\u03c4=0}^{t} \u03b3^\u03c4 C(s_\u03c4, a_\u03c4). The admissible cost at time step t + 1 is de\ufb01ned as d_{t+1} = (1/\u03b3^{t+1})(\u02c6c \u2212 W_t). It can be computed recursively by the equation d_{t+1} = (1/\u03b3)(d_t \u2212 C(s_t, a_t)), which can be derived from W_t = W_{t\u22121} + \u03b3^t C(s_t, a_t), and d_0 = \u02c6c. 
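The admissible-cost recursion can be checked numerically against its definition; `admissible_cost_trace` is a hypothetical helper name we introduce for illustration:

```python
def admissible_cost_trace(costs, c_hat, gamma):
    """Track the admissible cost online via d_{t+1} = (d_t - c_t) / gamma, d_0 = c_hat,
    and cross-check each step against the definition d_{t+1} = (c_hat - W_t) / gamma^{t+1},
    where W_t is the discounted cost incurred through time t."""
    d, W, ds = c_hat, 0.0, [c_hat]
    for t, c in enumerate(costs):
        W += gamma ** t * c        # discounted cost incurred so far
        d = (d - c) / gamma        # recursive update
        assert abs(d - (c_hat - W) / gamma ** (t + 1)) < 1e-9
        ds.append(d)
    return ds
```

Note that when no cost is incurred (c = 0), the admissible cost grows by the factor 1/γ, which is the mechanism behind the battery-recharging model in the maze experiment.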
Given a pair of belief state and admissible cost (b, d) and the set of \u03b1-vector pairs \u0393 = {(\u03b1_{i,R}, \u03b1_{i,C})}, the best (randomized) action is obtained by solving the following LP:\n\nmax_{w_i} b \u00b7 \u2211_i w_i\u03b1_{i,R} subject to\n  b \u00b7 \u2211_i w_i\u03b1_{i,C} \u2264 d\n  \u2211_i w_i = 1\n  w_i \u2265 0 \u2200i\n\nwhere w_i corresponds to the probability of choosing the action associated with the pair (\u03b1_{i,R}, \u03b1_{i,C}). The point-based backup for the CPOMDP leveraging the above LP formulation is shown in Algorithm 1.1\n\n1Note that this algorithm is an improvement over the heuristic distribution of the admissible cost to each observation by the ratio Pr(z|b, a) in [12]. Instead, we optimize the cost distribution by solving an LP.\n\n3\n\n\f2.2 BEETLE\n\nBEETLE [4] is a model-based BRL algorithm, based on the idea that BRL can be formulated as a POMDP planning problem. Assuming that the environment is modeled as a discrete-state MDP P = \u27e8S, A, T, R, \u03b3\u27e9 where the transition function T is unknown, we treat each transition probability T(s, a, s\u2032) as an unknown parameter \u03b8^{s,s\u2032}_a and formulate BRL as a hyperstate POMDP \u27e8SP, AP, ZP, TP, OP, RP, \u03b3, b0\u27e9 where SP = S \u00d7 {\u03b8^{s,s\u2032}_a}, AP = A, ZP = S, TP(s, \u03b8, a, s\u2032, \u03b8\u2032) = \u03b8^{s,s\u2032}_a \u03b4_\u03b8(\u03b8\u2032), OP(s\u2032, \u03b8\u2032, a, z) = \u03b4_{s\u2032}(z), and RP(s, \u03b8, a) = R(s, a). In summary, the hyperstate POMDP augments the original state space with the set of unknown parameters {\u03b8^{s,s\u2032}_a}, since the agent has to take actions without exact information on the unknown parameters.\nThe belief state b in the hyperstate POMDP yields the posterior of \u03b8. Speci\ufb01cally, assume a product of Dirichlets for the belief state such that\n\nb(\u03b8) = \u220f_{s,a} Dir(\u03b8^{s,\u2217}_a ; n^{s,\u2217}_a)\n\nwhere \u03b8^{s,\u2217}_a is the parameter vector of the multinomial distribution de\ufb01ning the transition function for state s and action a, and n^{s,\u2217}_a is the hyperparameter vector of the corresponding Dirichlet distribution. Since the hyperparameter n^{s,s\u2032}_a can be viewed as a pseudocount, i.e., the number of times transition (s, a, s\u2032) has been observed, the updated belief after observing transition (\u02c6s, \u02c6a, \u02c6s\u2032) is also a product of Dirichlets:\n\nb^{\u02c6s,\u02c6s\u2032}_{\u02c6a}(\u03b8) = \u220f_{s,a} Dir(\u03b8^{s,\u2217}_a ; n^{s,\u2217}_a + \u03b4_{\u02c6s,\u02c6a,\u02c6s\u2032}(s, a, s\u2032))\n\nHence, belief states in the hyperstate POMDP can be represented by |S|^2|A| variables, one for each hyperparameter, and the belief update is ef\ufb01ciently performed by incrementing the hyperparameter corresponding to the observed transition.\n\nSolving the hyperstate POMDP is performed by dynamic programming with the Bellman backup operator [2]. Speci\ufb01cally, the value function is represented as a set \u0393 of \u03b1-functions for each state s, so that the value of the optimal policy is obtained by V^\u2217_s(b) = max_{\u03b1\u2208\u0393} \u03b1_s(b) where \u03b1_s(b) = \u222b_\u03b8 b(\u03b8)\u03b1_s(\u03b8)d\u03b8. Using the fact that \u03b1-functions are multivariate polynomials of \u03b8, we can obtain an exact solution to the Bellman backup.\n\nThere are two computational challenges with the hyperstate POMDP approach. First, being a POMDP, the Bellman backup has to be performed on all possible belief states in the probability simplex. 
BEETLE adopts Perseus [14], performing randomized point-based backups con\ufb01ned to the set of sampled (s, b) pairs obtained by simulating a default or random policy, and reducing the total number of value backups by improving the value of many belief points through a single backup. Second, the number of monomial terms in the \u03b1-function increases exponentially with the number of backups. BEETLE chooses a \ufb01xed set of basis functions and projects the \u03b1-function onto a linear combination of these basis functions. The set of basis functions is chosen to be the set of monomials extracted from the sampled belief states.\n\n3 Constrained BEETLE (CBEETLE)\n\nWe take an approach similar to BEETLE for cost-sensitive exploration in BRL. Speci\ufb01cally, we formulate cost-sensitive BRL as a hyperstate CPOMDP \u27e8SP, AP, ZP, TP, OP, RP, CP, \u02c6c, \u03b3, b0\u27e9 where SP = S \u00d7 {\u03b8^{s,s\u2032}_a}, AP = A, ZP = S, TP(s, \u03b8, a, s\u2032, \u03b8\u2032) = \u03b8^{s,s\u2032}_a \u03b4_\u03b8(\u03b8\u2032), OP(s\u2032, \u03b8\u2032, a, z) = \u03b4_{s\u2032}(z), RP(s, \u03b8, a) = R(s, a), and CP(s, \u03b8, a) = C(s, a).\nNote that using the cost function C and cost bound \u02c6c to encode the constraints on the exploration behaviour allows us to enjoy the same \ufb02exibility as using the reward function to de\ufb01ne the task objective in the standard MDP and POMDP. Although, for the sake of exposition, we use a single cost function and discount factor in our de\ufb01nition of the CMDP and CPOMDP, we can generalize the model to have multiple cost functions that capture different aspects of exploration behaviour that cannot be put together on the same scale, and different discount factors for rewards and costs. 
In addition, we can even completely eliminate the possibility of executing action a in state s by setting the discount factor to 1 for the cost constraint and imposing a suf\ufb01ciently low cost bound \u02c6c < C(s, a).\n\n4\n\n\fAlgorithm 2: Point-based backup of \u03b1-function pairs for the hyperstate CPOMDP2\ninput : (s, n, d) with state s, Dirichlet hyperparameter n representing belief state b, and admissible cost d; set \u0393_s of \u03b1-function pairs for each state s\noutput: set \u0393\u2032_{(s,n,d)} of \u03b1-function pairs (contains at most 2 pairs for a single cost function)\n// regress\nforeach a \u2208 A do\n  \u03b1^{a,\u2217}_R = R(s, a), \u03b1^{a,\u2217}_C = C(s, a)  // constant functions\n  foreach s\u2032 \u2208 S, (\u03b1_{i,R}, \u03b1_{i,C}) \u2208 \u0393_{s\u2032} do\n    \u03b1^{a,s\u2032}_{i,R} = \u03b8^{s,s\u2032}_a \u03b1_{i,R}, \u03b1^{a,s\u2032}_{i,C} = \u03b8^{s,s\u2032}_a \u03b1_{i,C}  // multiplied by variable \u03b8^{s,s\u2032}_a\n// backup for each action\nforeach a \u2208 A do\n  Solve the following LP to obtain the best randomized action at the next time step:\n    max_{\u02dcw_{is\u2032}, d_{s\u2032}} \u2211_{i,s\u2032} \u02dcw_{is\u2032} \u03b1^{a,s\u2032}_{i,R}(b) subject to\n      \u2211_i \u02dcw_{is\u2032} \u03b1^{a,s\u2032}_{i,C}(b) \u2264 d_{s\u2032} \u2200s\u2032\n      \u2211_i \u02dcw_{is\u2032} = 1 \u2200s\u2032\n      \u02dcw_{is\u2032} \u2265 0 \u2200i, s\u2032\n      \u2211_{s\u2032} d_{s\u2032} = (1/\u03b3)(d \u2212 C(s, a))\n  \u03b1^a_R = \u03b1^{a,\u2217}_R + \u03b3 \u2211_{i,s\u2032} \u02dcw_{is\u2032} \u03b1^{a,s\u2032}_{i,R}, \u03b1^a_C = \u03b1^{a,\u2217}_C + \u03b3 \u2211_{i,s\u2032} \u02dcw_{is\u2032} \u03b1^{a,s\u2032}_{i,C}\n// \ufb01nd the best randomized action for the current time step\nSolve the following LP:\n  max_{w_a} \u2211_a w_a\u03b1^a_R(b) subject to\n    \u2211_a w_a\u03b1^a_C(b) \u2264 d\n    \u2211_a w_a = 1\n    w_a \u2265 0 \u2200a\nreturn \u0393\u2032_{(s,n,d)} = {(\u03b1^a_R, \u03b1^a_C) | w_a > 0}\n\nWe call our algorithm CBEETLE, which solves the hyperstate CPOMDP planning problem. 
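Both backup algorithms repeatedly evaluate \u03b1-functions at Dirichlet beliefs, i.e., integrals of the form \u03b1(b) = \u222b b(\u03b8)\u03b1(\u03b8)d\u03b8 (Sec. 2.2). Since \u03b1-functions are multivariate polynomials of \u03b8, each monomial's expectation under a Dirichlet has a closed form via the standard rising-factorial moment identity. A small sketch; the helper name is ours:

```python
def dirichlet_monomial_mean(n, e):
    """E[prod_i theta_i**e_i] under Dirichlet(n), for integer exponents e:
    prod_i n_i (n_i + 1) ... (n_i + e_i - 1) / [N (N + 1) ... (N + E - 1)],
    where N = sum(n) and E = sum(e). Standard Dirichlet moment identity."""
    N, E = sum(n), sum(e)
    num = 1.0
    for ni, ei in zip(n, e):
        for k in range(ei):   # rising factorial n_i^(e_i)
            num *= ni + k
    den = 1.0
    for k in range(E):        # rising factorial N^(E)
        den *= N + k
    return num / den
```

Evaluating an \u03b1-function at a belief then amounts to summing these monomial means weighted by the polynomial's coefficients, which is why the LP sizes do not grow with the belief representation.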
As in BEETLE, \u03b1-vectors for the expected total reward and cost are represented as \u03b1-functions in terms of the unknown parameters. The point-based backup operator in Algorithm 1 naturally extends to \u03b1-functions without signi\ufb01cant increase in computational complexity: the size of the LP does not increase even though the belief states represent probability distributions over unknown parameters. Algorithm 2 shows the point-based backup of \u03b1-functions in the hyperstate CPOMDP. In addition, if we choose a \ufb01xed set of basis functions for representing \u03b1-functions, we can pre-compute the projections of \u03b1-functions ( \u02dcT, \u02dcR, and \u02dcC) in the same way as BEETLE. This technique is used in the point-based backup, although not explicitly described in the pseudocode due to the page limit.\n\nWe also implemented the randomized point-based backup to further improve performance. The key step in the randomized value update is to check whether a newly generated set of \u03b1-function pairs \u0393 = {(\u03b1_{i,R}, \u03b1_{i,C})} from a point-based backup yields an improved value at some other sampled belief state (s, n, d). We can obtain the value of \u0393 at the belief state by solving the following LP:\n\nmax_{w_i} \u2211_i w_i\u03b1_{i,R}(b) subject to\n  \u2211_i w_i\u03b1_{i,C}(b) \u2264 d\n  \u2211_i w_i = 1\n  w_i \u2265 0 \u2200i\n\n(2)\n\nIf we \ufb01nd an improved value, we skip the point-based backup at (s, n, d) in the current iteration. Algorithm 3 shows the randomized point-based value update.\n\nIn summary, the point-based value iteration algorithm for CPOMDPs and BEETLE readily provide all the essential computational tools to implement hyperstate CPOMDP planning for cost-sensitive BRL.\n\n2The \u03b1-functions in the pseudocode are functions of \u03b8 and \u03b1(b) is de\ufb01ned to be \u222b_\u03b8 b(\u03b8)\u03b1(\u03b8)d\u03b8 as explained in Sec. 
2.2.\n\n5\n\n\fAlgorithm 3: Randomized point-based value update for the hyperstate CPOMDP\ninput : set B of sampled belief points, and set \u0393_s of \u03b1-function pairs for each state s\noutput: set \u0393\u2032_s of \u03b1-function pairs (updated value function)\n// initialize\n\u02dcB = B  // belief points that still need to be improved\nforeach s \u2208 S do\n  \u0393\u2032_s = \u2205\n// randomized backup\nwhile \u02dcB \u2260 \u2205 do\n  Sample \u02dcb = (\u02dcs, \u02dcn, \u02dcd) \u2208 \u02dcB\n  Obtain \u0393\u2032_{\u02dcb} by point-based backup at \u02dcb with {\u0393_s | \u2200s \u2208 S} (Algorithm 2)\n  \u0393\u2032_{\u02dcs} = \u0393\u2032_{\u02dcs} \u222a \u0393\u2032_{\u02dcb}\n  foreach b \u2208 B do\n    Calculate V\u2032(b) by solving the LP Eqn. 2 with \u0393\u2032_{\u02dcb}\n  \u02dcB = {b \u2208 B : V\u2032(b) < V(b)}\nreturn {\u0393\u2032_s | \u2200s \u2208 S}\n\nFigure 1: (a) 5-state chain: each edge is labeled with the action, reward, and cost associated with the transition. (b) 6 \u00d7 7 maze: a 6 \u00d7 7 grid including the start location with recharging station (S), goal location (G), and 3 \ufb02ags to capture.\n\n4 Experiments\n\nWe used constrained versions of two standard BRL problems to demonstrate cost-sensitive exploration. The \ufb01rst is the 5-state chain [15, 16, 4], and the second is the 6 \u00d7 7 maze [16].\n\n4.1 Description of Problems\n\nThe 5-state chain problem is shown in Figure 1a, where the agent has two actions, 1 and 2. The agent receives a large reward of 10 by executing action 1 in state 5, or a small reward of 2 by executing action 2 in any state. With probability 0.2, the agent slips and makes the transition corresponding to the other action. 
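In the chain-tied setting used later in the experiments, only a single slip probability is unknown, so the product-of-Dirichlets posterior reduces to a Beta distribution over slip/no-slip counts. A minimal sketch of the pseudocount update, with a hypothetical helper name and an assumed uninformative Beta(1, 1) prior (not the paper's implementation):

```python
import random

def estimate_slip_posterior(steps, true_slip=0.2, seed=0):
    """Maintain a Beta posterior over the shared slip probability of the chain
    by counting slip / no-slip transitions as Dirichlet pseudocounts. Because
    the rest of the dynamics are known in chain-tied, each observed transition
    reveals whether a slip occurred."""
    rng = random.Random(seed)
    n_slip, n_ok = 1.0, 1.0  # Beta(1, 1) uninformative prior pseudocounts
    for _ in range(steps):
        if rng.random() < true_slip:
            n_slip += 1
        else:
            n_ok += 1
    return n_slip / (n_slip + n_ok)  # posterior mean of the slip probability
```

The posterior mean concentrates around the true slip probability (0.2 here) as transitions accumulate, which is the belief-update workload CBEETLE performs at every step.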
We de\ufb01ned the constrained version of the problem by assigning a cost of 1 to action 1 in every state, so that consecutive execution of action 1 can potentially violate the cost constraint.\n\nThe 6 \u00d7 7 maze problem is shown in Figure 1b, where the white cells are navigable locations and gray cells are walls that block navigation. There are 5 actions available to the agent: move left, right, up, down, or stay. Every \u201cmove\u201d action (except for the stay action) can fail with probability 0.1, resulting in a slip to one of the two nearby cells perpendicular to the intended direction. If the agent bumps into a wall, the action has no effect. The goal of this problem is to capture as many \ufb02ags as possible and reach the goal location. Upon reaching the goal, the agent obtains a reward equal to the number of \ufb02ags captured, and the agent is warped back to the start location. Since there are 33 reachable locations in the maze and 8 possible combinations for the status of captured \ufb02ags, there are a total of 264 states. We de\ufb01ned the constrained version of the problem by assuming that the agent is equipped with a battery and every action consumes energy except the stay action at the\n\n6\n\n\frecharging station. We modeled the power consumption by assigning a cost of 0 for executing the stay action at the recharging station, and a cost of 1 otherwise. Thus, battery recharging is done by executing the stay action at the recharging station, as the admissible cost increases by a factor of 1/\u03b3.3\n\n4.2 Results\n\nTable 1 summarizes the experimental results for the constrained chain and maze problems.\n\nIn the chain problem, we used two structural prior models, \u201ctied\u201d and \u201csemi\u201d, among the three priors experimented with in [4]. Both chain-tied and chain-semi assume that the transition dynamics are known to the agent except for the slip probabilities. 
In chain-tied, the slip probability is assumed to be independent of state and action, so there is only one unknown parameter in the transition dynamics. In chain-semi, the slip probability is assumed to be action dependent, so there are two unknown parameters since there are two actions. We used uninformative Dirichlet priors in both settings. We excluded experiments with the \u201cfull\u201d prior model (completely unknown transition dynamics) since even BEETLE was not able to learn a near-optimal policy, as reported in [4].\n\nWe report the average discounted total reward and cost as well as their 95% con\ufb01dence intervals for the \ufb01rst 1000 time steps using 200 simulated trials. We performed 60 Bellman iterations on 500 belief states, and used the \ufb01rst 50 belief states for choosing the set of basis functions. The discount factor was set to 0.99.\n\nWhen \u02c6c=100, which is the maximum expected total cost that can be incurred by any policy, CBEETLE found policies that are as good as the policy found by BEETLE since the cost constraint has no effect. As we impose tighter cost constraints with \u02c6c=75, 50, and 25, the policies start to trade off reward in order to meet the cost constraint. Note also that, although we use approximations in various stages of the algorithm, \u02c6c is within the con\ufb01dence intervals of the average total cost, meaning that the cost constraint is either met or violated by statistically insigni\ufb01cant amounts. Since chain-semi has more unknown parameters than chain-tied, it is natural that the performance of the CBEETLE policy is slightly degraded in chain-semi. Note also that as we impose tighter cost constraints, the running times generally increase. 
This is because the cost constraint in the LP tends to become active at more belief states, generating two \u03b1-function pairs instead of the single \u03b1-function pair generated when the cost constraint in the LP is not active.\n\nThe results for the maze problem were calculated for the \ufb01rst 2000 time steps using 100 simulated trials. We performed 30 Bellman iterations on 2000 belief states, and used 50 basis functions. Due to the computational requirements of solving the large hyperstate CPOMDP, we only experimented with the \u201ctied\u201d prior model, which assumes that the slip probability is shared by every state and action. Running CBEETLE with \u02c6c = 1/(1 \u2212 0.95) = 20 is equivalent to running BEETLE without cost constraints, as veri\ufb01ed in the table.\n\nWe further analyzed the cost-sensitive exploration behaviour in the maze problem. Figure 2 compares the policy behaviours of BEETLE and CBEETLE (\u02c6c=18) in the maze problem. The BEETLE policy generally captures the top \ufb02ag \ufb01rst (Figure 2a), then navigates straight to the goal (Figure 2b) or captures the right \ufb02ag and navigates to the goal (Figure 2c). If it captures the right \ufb02ag \ufb01rst, it then navigates to the goal (Figure 2d) or captures the top \ufb02ag and navigates to the goal (Figure 2e). We suspect that the third \ufb02ag on the left is not captured because of the relatively low discount rate, and is hence ignored due to numerical approximations. The CBEETLE policy shows a similar capture behaviour, but it stays at the recharging station for a number of time steps between the \ufb01rst and second \ufb02ag captures, which can be con\ufb01rmed by the high state visitation frequency for cell S in Figures 2g and 2i. This is because the policy cannot navigate to the other \ufb02ag position and move to the goal without recharging the battery in between. 
The agent also frequently visits the recharging station before the \ufb01rst \ufb02ag capture (Figure 2f) because it actively explores for the \ufb01rst \ufb02ag under high uncertainty in the dynamics.\n\n3It may seem odd that the battery recharges at an exponential rate. We can set \u03b3 = 1 and make the cost function assign, e.g., a cost of -1 for recharging and 1 for consuming, but our implementation currently assumes the same discount factor for the rewards and costs. Implementation for different discount factors is left as future work, but note that we can still obtain meaningful results with \u03b3 suf\ufb01ciently close to 1.\n\n7\n\n\fTable 1: Experimental results for the chain and maze problems.\n\nproblem | algorithm | \u02c6c | utopic value | avg discounted total reward | avg discounted total cost | time (minutes)\nchain-tied (|S|=5, |A|=2) | BEETLE | \u2212 | 354.77 | 351.11\u00b18.42 | \u2212 | 1.0\n | CBEETLE | 100 | 354.77 | 354.68\u00b18.57 | 100.00\u00b10 | 2.4\n | CBEETLE | 75 | 325.75 | 287.70\u00b18.17 | 75.05\u00b10.14 | 2.4\n | CBEETLE | 50 | 296.73 | 264.97\u00b17.06 | 49.96\u00b10.09 | 44.3\n | CBEETLE | 25 | 238.95 | 212.19\u00b14.98 | 25.12\u00b10.13 | 80.59\nchain-semi (|S|=5, |A|=2) | BEETLE | \u2212 | 354.77 | 351.11\u00b18.42 | \u2212 | 1.6\n | CBEETLE | 100 | 354.77 | 354.68\u00b18.57 | 100.00\u00b10 | 3.7\n | CBEETLE | 75 | 325.75 | 287.64\u00b18.16 | 75.05\u00b10.14 | 3.8\n | CBEETLE | 50 | 296.73 | 256.76\u00b17.23 | 50.09\u00b10.14 | 70.7\n | CBEETLE | 25 | 238.95 | 204.84\u00b14.51 | 25.01\u00b10.16 | 139.3\nmaze-tied (|S|=264, |A|=5) | BEETLE | \u2212 | 1.03 | 1.02\u00b10.02 | \u2212 | 159.8\n | CBEETLE | 20 | 1.03 | 1.02\u00b10.02 | 19.04\u00b10.02 | 242.5\n | CBEETLE | 18 | 0.97 | 0.93\u00b10.04 | 17.96\u00b10.46 | 733.1\n\nFigure 2: State visitation frequencies of each location in the maze problem over 100 runs. Brightness is proportional to the relative visitation frequency. 
(a-e) Behavior of BEETLE (a) before the \ufb01rst \ufb02ag capture, (b) after the top \ufb02ag is captured \ufb01rst, (c) after the top \ufb02ag is captured \ufb01rst and the right \ufb02ag second, (d) after the right \ufb02ag is captured \ufb01rst, and (e) after the right \ufb02ag is captured \ufb01rst and the top \ufb02ag second. (f-j) Behavior of CBEETLE (\u02c6c = 18). The yellow star represents the current location of the agent.\n\n5 Conclusion\n\nIn this paper, we proposed CBEETLE, a model-based BRL algorithm for cost-sensitive exploration, extending BEETLE to solve the hyperstate CPOMDP which models BRL with cost constraints. We showed that cost-sensitive BRL can be effectively solved by randomized point-based value iteration for CPOMDPs. Experimental results show that CBEETLE can learn reasonably good policies for the underlying CMDPs while exploring the unknown environment cost-sensitively.\n\nWhile our experiments show that the policies generally satisfy the cost constraints, they can still potentially violate the constraints since we approximate the alpha functions using a \ufb01nite number of basis functions. As future work, we plan to focus on making CBEETLE more robust to approximation errors by performing a constrained optimization when approximating alpha functions, to guarantee that we never violate the cost constraints.\n\nAcknowledgments\n\nThis work was supported by the National Research Foundation of Korea (Grant# 2012-007881), the Defense Acquisition Program Administration and Agency for Defense Development of Korea (Contract# UD080042AD), and the SW Computing R&D Program of KEIT (2011-10041313) funded by the Ministry of Knowledge Economy of Korea.\n\n8\n\n\fReferences\n\n[1] R. Howard. Dynamic programming. MIT Press, 1960.\n[2] M. Duff. Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts, Amherst, 2002.\n[3] S. Ross, J. Pineau, B. 
Chaib-draa, and P. Kreitmann. A Bayesian approach for learning and planning in partially observable Markov decision processes. Journal of Machine Learning Research, 12, 2011.\n[4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proc. of ICML, 2006.\n[5] E. Altman. Constrained Markov Decision Processes. Chapman & Hall/CRC, 1999.\n[6] K. W. Ross and R. Varadarajan. Markov decision processes with sample path constraints: the communicating case. Operations Research, 37(5):780\u2013790, 1989.\n[7] K. W. Ross and R. Varadarajan. Multichain Markov decision processes with a sample path constraint: a decomposition approach. Mathematics of Operations Research, 16(1):195\u2013207, 1991.\n[8] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1), 2010.\n[9] A. Hans, D. Schneega\u00df, A. M. Sch\u00e4fer, and S. Udluft. Safe exploration for reinforcement learning. In Proc. of the 16th European Symposium on Arti\ufb01cial Neural Networks, 2008.\n[10] T. M. Moldovan and P. Abbeel. Safe exploration in Markov decision processes. In Proc. of the NIPS Workshop on Bayesian Optimization, Experimental Design and Bandits, 2011.\n[11] J. D. Isom, S. P. Meyn, and R. D. Braatz. Piecewise linear dynamic programming for constrained POMDPs. In Proc. of AAAI, 2008.\n[12] D. Kim, J. Lee, K.-E. Kim, and P. Poupart. Point-based value iteration for constrained POMDPs. In Proc. of IJCAI, 2011.\n[13] A. B. Piunovskiy and X. Mao. Constrained Markovian decision processes: the dynamic programming approach. Operations Research Letters, 27(3):119\u2013126, 2000.\n[14] M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Arti\ufb01cial Intelligence Research, 24, 2005.\n[15] R. Dearden, N. Friedman, and D. Andre. Bayesian Q-learning. In Proc. 
of AAAI, 1998.\n[16] M. Strens. A Bayesian framework for reinforcement learning. In Proc. of ICML, 2000.\n\n9\n\n\f", "award": [], "sourceid": 1408, "authors": [{"given_name": "Dongho", "family_name": "Kim", "institution": null}, {"given_name": "Kee-eung", "family_name": "Kim", "institution": null}, {"given_name": "Pascal", "family_name": "Poupart", "institution": null}]}