{"title": "Monte-Carlo Tree Search for Constrained POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 7923, "page_last": 7932, "abstract": "Monte-Carlo Tree Search (MCTS) has been successfully applied to very large POMDPs, a standard model for stochastic sequential decision-making problems. However, many real-world problems inherently have multiple goals, where multi-objective formulations are more natural. The constrained POMDP (CPOMDP) is such a model that maximizes the reward while constraining the cost, extending the standard POMDP model. To date, solution methods for CPOMDPs assume an explicit model of the environment, and thus are hardly applicable to large-scale real-world problems. In this paper, we present CC-POMCP (Cost-Constrained POMCP), an online MCTS algorithm for large CPOMDPs that leverages the optimization of LP-induced parameters and only requires a black-box simulator of the environment. In the experiments, we demonstrate that CC-POMCP converges to the optimal stochastic action selection in CPOMDP and pushes the state-of-the-art by being able to scale to very large problems.", "full_text": "Monte-Carlo Tree Search for Constrained POMDPs\n\nJongmin Lee1, Geon-Hyeong Kim1, Pascal Poupart2, Kee-Eung Kim1,3\n\n1 School of Computing, KAIST, Republic of Korea\n\n2 University of Waterloo, Waterloo AI Institute and Vector Institute\n\n{jmlee,ghkim}@ai.kaist.ac.kr, ppoupart@uwaterloo.ca, kekim@cs.kaist.ac.kr\n\n3 PROWLER.io\n\nAbstract\n\nMonte-Carlo Tree Search (MCTS) has been successfully applied to very large\nPOMDPs, a standard model for stochastic sequential decision-making problems.\nHowever, many real-world problems inherently have multiple goals, where multi-\nobjective formulations are more natural. The constrained POMDP (CPOMDP) is\nsuch a model that maximizes the reward while constraining the cost, extending\nthe standard POMDP model. 
To date, solution methods for CPOMDPs assume an explicit model of the environment, and thus are hardly applicable to large-scale real-world problems. In this paper, we present CC-POMCP (Cost-Constrained POMCP), an online MCTS algorithm for large CPOMDPs that leverages the optimization of LP-induced parameters and only requires a black-box simulator of the environment. In the experiments, we demonstrate that CC-POMCP converges to the optimal stochastic action selection in CPOMDP and pushes the state-of-the-art by being able to scale to very large problems.

1 Introduction

Monte-Carlo Tree Search (MCTS) [4, 5, 12] is a generic online planning algorithm that effectively combines random sampling and tree search, and has shown great promise in many areas such as online Bayesian reinforcement learning [8, 10] and computer Go [7, 20]. MCTS efficiently explores the search space by investing more search effort in promising states and actions while balancing exploration and exploitation in the direction of maximizing the cumulative (scalar) rewards. Due to its outstanding performance without relying on any prior domain knowledge or heuristic function, MCTS has become the de-facto standard method for solving very large sequential decision-making problems, commonly formulated as Markov decision processes (MDPs) and partially observable MDPs (POMDPs).

However, in many situations it is not straightforward to formulate the objective with reward maximization alone, as in the following examples. For spoken dialogue systems [24], it is common to optimize the dialogue strategy towards minimizing the number of turns while keeping the success rate of dialogue tasks above a certain level. For UAVs on a search and rescue mission, the main goal would be to find as many targets as possible, while avoiding threats that may endanger the mission itself. 
The constrained POMDP (CPOMDP) [9] is an appealing framework for dealing with this kind of multi-objective sequential decision-making problem when the environment is partially observable. The model assumes that each action incurs not only a reward but also K different types of costs, and the goal is to find an optimal policy that maximizes the expected cumulative reward while bounding each of the K expected cumulative costs below a certain level.

Although the CPOMDP is a very expressive model, it is known to be very difficult to solve due to the PSPACE-complete nature of solving POMDPs [16], originating from two main challenges: the curse of dimensionality and the curse of history. The Partially Observable Monte-Carlo Planner (POMCP) [19] tames these two curses of POMDPs by using Monte-Carlo sampling both in the root belief state and in the black-box simulation of the history. In contrast, solution methods for CPOMDPs, e.g. dynamic programming [11], linear programming [18], and online uniform-cost search [23], are not yet advanced to this level, partly because they require an explicit model of the environment. This prevents the CPOMDP from being a practical approach for modeling real-world applications.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we present an MCTS algorithm for CPOMDPs, which directly addresses this scalability issue. To the best of our knowledge, extending MCTS to CPOMDPs (or even CMDPs) has remained unexplored since it is not straightforward to handle the constrained optimization in tree search. This challenge is compounded by the fact that optimal policies can be stochastic.1 In order to develop MCTS for CPOMDPs, we first show that solving CPOMDPs is essentially equivalent to jointly solving an unconstrained POMDP while optimizing its LP-induced parameters that control the trade-off between the reward and the costs. 
From this result, we present our algorithm, Cost-Constrained POMCP (CC-POMCP), for solving large CPOMDPs, which combines traditional MCTS with LP-induced parameter optimization. In the experiments section, we demonstrate that CC-POMCP converges to the optimal stochastic action selection on a synthetic domain and that it is able to handle very large problems, including constrained variants of Rocksample(15,15) and an Atari 2600 arcade game, pushing the state-of-the-art scalability of CPOMDP solvers.

2 Background

Partially observable Markov decision processes (POMDPs) [22] provide a principled framework for modeling sequential decision-making problems under stochastic transitions and noisy observations. A POMDP is formally defined by a tuple ⟨Sp, A, Op, Tp, Zp, Rp, γ, b0⟩, where Sp is the set of environment states s, A is the set of actions a, Op is the set of observations o, Tp(s′|s, a) = Pr(s_{t+1} = s′ | s_t = s, a_t = a) is the transition probability, Zp(o|s′, a) = Pr(o_{t+1} = o | s_{t+1} = s′, a_t = a) is the observation probability, Rp(s, a) ∈ R is the immediate reward for taking action a in state s, γ ∈ [0, 1) is the discount factor, and b0(s) = Pr(s0 = s) is the starting state distribution at time step 0. The histories h_t = [a0, o0, ..., a_t, o_t] and h_t a_{t+1} = [a0, o0, ..., a_t, o_t, a_{t+1}] denote sequences of actions and observations. The agent takes an action via a policy π(a|h) = Pr(a_t = a | h_t = h) that maps histories to probability distributions over actions. In POMDPs, the environment state is not directly observable, so the agent maintains a belief b_t(s) = Pr(s_t = s | h_t) that can be updated recursively using Bayes' rule: when taking action a in b and observing o, the updated belief is b_{ao}(s′) ∝ Zp(o|s′, a) Σ_s Tp(s′|s, a) b(s). Since the belief b_t is a sufficient statistic of the history h_t, the POMDP can be understood as the belief-state MDP ⟨B, A, T, R, γ, b0⟩, where b0 is the initial state, B is the set of reachable beliefs starting from b0, T(b′|b, a) = Σ_{o,s,s′} Zp(o|s′, a) Tp(s′|s, a) b(s) δ(b′, b_{ao}) is the transition probability, and R(b, a) = Σ_s b(s) Rp(s, a) is the immediate reward function. We shall use h and b = Pr(s|h) interchangeably as long as there is no confusion (e.g. QR(h, a) = QR(b, a), π(a|h) = π(a|b)). The goal is to find an optimal policy π* that maximizes the expected discounted return (i.e. cumulative discounted rewards):

    max_π V_R^π(b0) = E_π[ Σ_{t=0}^∞ γ^t R(b_t, a_t) | b0 ].

Constrained POMDPs (CPOMDPs) [9, 11, 18] generalize POMDPs to multi-objective problems. A CPOMDP is formally defined by a tuple ⟨Sp, A, Op, Tp, Zp, Rp, Cp, ĉ, γ, b0⟩, where Cp = {C_{p,k}}_{k=1..K} is a set of K non-negative cost functions with individual thresholds ĉ = {ĉ_k}_{k=1..K}. Similarly, a CPOMDP can be cast into an equivalent belief-state CMDP ⟨B, A, T, R, C, ĉ, γ⟩, where C_k(b, a) = Σ_s b(s) C_{p,k}(s, a). The goal is to compute an optimal policy that maximizes the expected cumulative reward while bounding the expected cumulative costs:

    max_π V_R^π(b0) = E_π[ Σ_{t=0}^∞ γ^t R(b_t, a_t) | b0 ]
    s.t. V_{C_k}^π(b0) = E_π[ Σ_{t=0}^∞ γ^t C_k(b_t, a_t) | b0 ] ≤ ĉ_k  ∀k

1 The stochastic nature of the optimal policy in CPOMDPs results from the stochasticity of optimal policies in CMDPs [1]. 
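As a concrete illustration of the recursive Bayes update b_{ao}(s′) ∝ Zp(o|s′, a) Σ_s Tp(s′|s, a) b(s) described above, here is a minimal sketch; the two-state "listen" dynamics and all numbers are illustrative assumptions, not taken from the paper:

```python
def belief_update(b, a, o, T, Z):
    """Bayes-rule belief update: b_ao(s') is proportional to Z(o|s',a) * sum_s T(s'|s,a) b(s).

    b: dict state -> probability; T: dict (s, a, s') -> prob; Z: dict (s', a, o) -> prob.
    """
    states = list(b)
    unnorm = {
        s2: Z.get((s2, a, o), 0.0) * sum(T.get((s, a, s2), 0.0) * b[s] for s in states)
        for s2 in states
    }
    total = sum(unnorm.values())
    if total == 0.0:
        raise ValueError("observation o has zero probability under belief b")
    return {s2: p / total for s2, p in unnorm.items()}

# Illustrative two-state example: a "listen" action leaves the hidden state
# unchanged and yields a noisy observation that is correct 85% of the time.
T = {("L", "listen", "L"): 1.0, ("R", "listen", "R"): 1.0}
Z = {("L", "listen", "hearL"): 0.85, ("L", "listen", "hearR"): 0.15,
     ("R", "listen", "hearL"): 0.15, ("R", "listen", "hearR"): 0.85}
b0 = {"L": 0.5, "R": 0.5}
b1 = belief_update(b0, "listen", "hearL", T, Z)  # belief shifts toward "L"
```

POMCP, discussed below, avoids this explicit summation by representing beliefs with particles, but the particle filter approximates exactly this update.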
A more formal treatment of this matter, pertinent to CPOMDPs, can be found in [6].

An optimal policy of the CPOMDP (or the equivalent belief-state CMDP) is generally stochastic and can be obtained by solving the following linear program (LP) [1]:

    max_{y(b,a) ≥ 0 ∀b,a}  Σ_{b,a} R(b, a) y(b, a)
    s.t.  Σ_{a′} y(b′, a′) = δ(b0, b′) + γ Σ_{b,a} T(b′|b, a) y(b, a)  ∀b′          (1)
          Σ_{b,a} C_k(b, a) y(b, a) ≤ ĉ_k  ∀k

where y(b, a) can be interpreted as the discounted occupancy measure of (b, a), and δ(x, y) is the Kronecker delta, which has the value 1 if x = y and 0 otherwise. Once the optimal solution y*(b, a) is obtained, an optimal stochastic policy and the corresponding optimal value are computed by π*(a|b) = Pr(a|b) = y*(b, a) / Σ_{a′} y*(b, a′) and V*_R(b0; ĉ) = Σ_{b,a} R(b, a) y*(b, a), respectively. It is usually intractable to solve LP (1) exactly since the cardinality of B can be infinite.

POMCP [19] is a highly scalable Monte-Carlo tree search (MCTS) algorithm for (unconstrained) POMDPs. The algorithm uses Monte-Carlo simulation for both tree search and belief update to simultaneously tackle the curse of history and the curse of dimensionality. In each simulation, a state particle is sampled from the root node's belief state, s ~ B(h) (called root sampling), and is used to sample a trajectory using a black-box simulator (s′, o, r) ~ G(s, a). It adopts UCB1 [2] as the tree policy, i.e. 
the action selection rule in the internal nodes of the search tree:

    argmax_a [ QR(h, a) + κ √( log N(h) / N(h, a) ) ]

where QR(h, a) is the average of the sampled returns, N(h) is the number of simulations performed through h, N(h, a) is the number of times action a has been selected in h, and κ is the exploration constant that adjusts the exploration-exploitation trade-off. POMCP expands the search tree non-uniformly, focusing more search effort on promising nodes. It can be formally shown that QR(h, a) asymptotically converges to the optimal value Q*_R(h, a) in POMDPs.

Unfortunately, it is not straightforward to use POMCP for CPOMDPs since the original UCB1 action selection rule has no notion of cost constraints. If we naively adopt the vanilla, reward-maximizing POMCP, we may obtain cost-violating action sequences. We could compute the average of the sampled cumulative costs QC during search, but it is not obvious how to leverage it in the tree policy: if we naively prune action branches that violate the cost constraint QC(h, a) ≤ ĉ, we may end up with policies that are too conservative and thus sub-optimal, i.e. a feasible policy may be rejected during search if its Monte-Carlo estimate violates the cost constraint.

3 Solving CPOMDP via a POMDP Solver

The derivation of our algorithm starts from the dual of (1):

    min_{V(b) ∀b, λ_k ≥ 0 ∀k}  Σ_b δ(b0, b) V(b) + Σ_k ĉ_k λ_k
    s.t.  V(b) ≥ R(b, a) − Σ_k C_k(b, a) λ_k + γ Σ_{b′} T(b′|b, a) V(b′)  ∀b, a     (2)

Observe that if we treat λ = [λ1, ..., λK]⊤ as a constant, the problem becomes an unconstrained belief-state MDP with the scalarized reward function R(b, a) − λ⊤C(b, a). Let V*_λ be the optimal value function of this unconstrained POMDP. 
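The key observation above is that fixing λ turns the constrained problem into an ordinary planning problem with reward R − λ⊤C. A toy sketch of this reduction on a tiny, fully observable MDP (for brevity); the domain and all numbers are made up for illustration:

```python
def solve_scalarized(S, A, T, R, C, lam, gamma=0.9, iters=500):
    """Value iteration for the unconstrained MDP with scalarized reward R(s,a) - lam * C(s,a).

    T: dict (s, a) -> dict s' -> prob; R, C: dict (s, a) -> float; lam: scalar (K = 1).
    Returns the optimal value function and a greedy deterministic policy.
    """
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {
            s: max(
                R[s, a] - lam * C[s, a]
                + gamma * sum(p * V[s2] for s2, p in T[s, a].items())
                for a in A
            )
            for s in S
        }
    pi = {
        s: max(
            A,
            key=lambda a: R[s, a] - lam * C[s, a]
            + gamma * sum(p * V[s2] for s2, p in T[s, a].items()),
        )
        for s in S
    }
    return V, pi

# Single-state example: "risky" earns reward 1 at cost 1, "safe" earns nothing.
S, A = ["s"], ["risky", "safe"]
T = {("s", a): {"s": 1.0} for a in A}
R = {("s", "risky"): 1.0, ("s", "safe"): 0.0}
C = {("s", "risky"): 1.0, ("s", "safe"): 0.0}
_, pi_low = solve_scalarized(S, A, T, R, C, lam=0.5)   # weak cost penalty
_, pi_high = solve_scalarized(S, A, T, R, C, lam=2.0)  # cost penalty dominates
```

The sketch makes the trade-off visible: as λ grows, the scalarized-optimal policy flips from the rewarding-but-costly action to the conservative one, which is exactly the knob the dual optimization over λ adjusts.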
Then, for any λ, there exists a corresponding unique V*_λ, and we can compute V*_λ with a POMDP solver. Thus, solving the dual LP (2) reduces to:

    min_{λ ≥ 0} [ V*_λ(b0) + λ⊤ĉ ]                                                  (3)

Moreover, if there is an optimal solution y* to the primal LP (1), then there exists a corresponding dual optimal solution V* and λ*, and the duality gap is zero, i.e.

    V*_R(b0; ĉ) = Σ_{b,a} R(b, a) y*(b, a) = V*_{λ*}(b0) + λ*⊤ĉ

by the strong duality theorem.

To compute the optimal λ in Eq. (3), we have to consider the trade-off between the first and second terms according to the cost constraint ĉ. For example, if the cost constraint ĉ is very large, the optimal solution λ* tends to be close to zero since the objective function is mostly affected by the second term λ⊤ĉ. On the other hand, if ĉ is sufficiently small, the first term becomes dominant and the optimal solution λ* tends to get larger in order to have a negative impact on the reward R(b, a) − λ⊤C(b, a). Thus, it may seem that Eq. (3) is a complex optimization problem. However, as we will see in the following proposition, the objective function in Eq. (3) is actually piecewise-linear and convex in λ, as depicted in Figure 3 in Appendix A.

Proposition 1. Let V*_λ be the optimal value function of the POMDP with scalarized reward function R(b, a) − λ⊤C(b, a). Then, V*_λ(b0) + λ⊤ĉ is a piecewise-linear and convex (PWLC) function of λ. (The proof is provided in Appendix A.)

In addition, we can show that the optimal solution λ* is bounded:

Proposition 2 (Lemma 4 in [14]). Suppose that the reward function is bounded in [Rmin, Rmax] and there exist τ > 0 and a (feasible) policy π such that V_C^π(b0) + τ·1 ≤ ĉ. Then, ||λ*||₁ ≤ (Rmax − Rmin) / (τ(1 − γ)).

Thus, from Propositions 1 and 2, we can obtain the optimal λ* by greedily optimizing (3) with each λ_k in the range [0, (Rmax − Rmin)/(τ(1 − γ))]. The remaining question is how to compute the direction for updating λ. We start with the following lemma to answer this question:

Lemma 1. Let M1 = ⟨B, A, T, R1, γ⟩ and M2 = ⟨B, A, T, R2, γ⟩ be two (belief-state) MDPs differing only in the reward function, and let V_1^π and V_2^π be the value functions of M1 and M2 under a fixed policy π. Then, the value function of the new MDP M = ⟨B, A, T, pR1 + qR2, γ⟩ under the policy π is V^π(b) = pV_1^π(b) + qV_2^π(b) for all b ∈ B. (The proof is provided in Appendix B.)

Lemma 1 implies that V*_λ(b0) can be decomposed as V*_λ(b0) = V_R^{π*_λ}(b0) − λ⊤V_C^{π*_λ}(b0), where π*_λ is the optimal policy with respect to the scalarized reward function R(b, a) − λ⊤C(b, a), and thus (3) becomes:

    min_{λ ≥ 0} [ V_R^{π*_λ}(b0) − λ⊤V_C^{π*_λ}(b0) + λ⊤ĉ ]                         (4)

One way to compute the descent direction for λ would be by taking the derivative of Eq. 
(4) with respect to λ while holding π*_λ constant, so that we use the direction V_C^{π*_λ}(b0) − ĉ. The following theorem shows that this is indeed a valid direction:

Theorem 1. For any λ, V_C^{π*_λ}(b0) − ĉ is a negative subgradient that decreases the objective in Eq. (3), where π*_λ is the optimal policy with respect to the scalarized reward function R(b, a) − λ⊤C(b, a). Also, if V_C^{π*_λ}(b0) − ĉ = 0, then λ is the optimal solution of Eq. (3). (The proof is provided in Appendix C.)

The direction V_C^{π*_λ}(b0) − ĉ has a natural interpretation: if the current policy violates the k-th cost constraint (i.e. V_{C_k}^{π*_λ}(b0) > ĉ_k), λ_k increases so that the cost is penalized more in the scalarized reward function R(b, a) − λ⊤C(b, a). On the other hand, if the current policy is too conservative for the k-th cost constraint (i.e. V_{C_k}^{π*_λ}(b0) < ĉ_k), λ_k decreases so that the cost is penalized less.

In summary, we can solve the dual LP of the belief-state CMDP by iterating through the following steps, starting from any λ:

1. π*_λ ← SolveBeliefMDP(⟨B, A, T, R − λ⊤C, γ⟩)
2. V_C^{π*_λ} ← PolicyEvaluation(⟨B, A, T, C, γ⟩, π*_λ)
3. λ ← λ + α_n (V_C^{π*_λ}(b0) − ĉ), clipping each λ_k to the range [0, (Rmax − Rmin)/(τ(1 − γ))] ∀k ∈ {1, 2, ..., K}

using a step-size sequence α_n such that Σ_n α_n = ∞ and Σ_n α_n² < ∞. By Theorem 1, this procedure is a subgradient method, guaranteed to converge to the optimal solution of Eq. (3).

Algorithm 1  Cost-Constrained POMCP (CC-POMCP)

function SEARCH(h0)
    λ is randomly initialized
    repeat
        if h0 = ∅ then s ~ b0 else s ~ B(h0) end if
        SIMULATE(s, h0, 0)
        a ~ GREEDYPOLICY(h0, 0, 0)
        λ ← λ + α_n [QC(h0, a) − ĉ]
        clip λ_k to range [0, (Rmax − Rmin)/(τ(1 − γ))] ∀k ∈ {1, ..., K}
    until TIMEOUT()
    return GREEDYPOLICY(h0, 0, ν)
end function

function SIMULATE(s, h, d)
    if d = (maximum depth) then return [0, 0] end if
    if h ∉ T then
        T(ha) ← (N_init, QR_init, QC_init, ∅) ∀a
        return ROLLOUT(s, h, d)
    end if
    a ~ GREEDYPOLICY(h, κ, ν)
    (s′, o, r, c) ~ G(s, a)
    [R, C] ← [r, c] + γ · SIMULATE(s′, hao, d + 1)
    B(h) ← B(h) ∪ {s}
    N(h) ← N(h) + 1
    N(h, a) ← N(h, a) + 1
    QR(h, a) ← QR(h, a) + (R − QR(h, a)) / N(h, a)
    QC(h, a) ← QC(h, a) + (C − QC(h, a)) / N(h, a)
    c̄(h, a) ← c̄(h, a) + (c − c̄(h, a)) / N(h, a)
    return [R, C]
end function

function ROLLOUT(s, h, d)
    if d = (maximum depth) then return [0, 0] end if
    a ~ π_rollout(·|h) and (s′, o, r, c) ~ G(s, a)
    return [r, c] + γ · ROLLOUT(s′, hao, d + 1)
end function

function GREEDYPOLICY(h, κ, ν)
    Q⊕_λ(h, a) := QR(h, a) − λ⊤QC(h, a) + κ √( log N(h) / N(h, a) )
    a* ← argmax_a Q⊕_λ(h, a)
    A* ← { a*_i : |Q_λ(h, a*_i) − Q_λ(h, a*)| ≤ ν ( √( log N(h, a*_i) / N(h, a*_i) ) + √( log N(h, a*) / N(h, a*) ) ) }
    solve LP (10) with A* to compute a policy π(a*_i|h) = w_i
    return π(·|h)
end function

function MAINLOOP()
    ĉ ← (cost constraint); s ← (initial state); h ← ∅
    while s is not terminal do
        π ← SEARCH(h)
        a ~ π(·|h)
        ĉ ← ( ĉ − π(a|h) c̄(h, a) − Σ_{a′≠a} π(a′|h) QC(h, a′) ) / ( γ π(a|h) )
        (s′, o, r, c) ~ G(s, a)
        s ← s′; h ← hao
    end while
end function

4 Cost-Constrained POMCP (CC-POMCP)

Although we have eliminated the cost constraints by introducing the simultaneous update of λ, the procedure above still relies on exactly solving a POMDP via SolveBeliefMDP in each iteration, which is impractical for large CPOMDPs. Fortunately, all we need in step 3 is the cost value at the initial belief state, V_C^{π*_λ}(b0), with respect to the optimal policy when the reward function is given by R − λ⊤C. This is exactly the situation where MCTS can be effectively applied: MCTS focuses on finding the optimal action selection at the root node using Monte-Carlo estimates of the long-term rewards (or costs). We are now ready to present our online algorithm for large CPOMDPs, which we refer to as Cost-Constrained POMCP (CC-POMCP), shown in Algorithm 1. The changes from the standard POMCP are highlighted in blue. 
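The threshold update in MAINLOOP of Algorithm 1 rescales the remaining cost budget after the agent executes an action. A small sketch of that bookkeeping step for scalar costs (K = 1); the function name and example numbers are ours:

```python
def update_threshold(c_hat, pi, a, c_bar, Q_C, gamma):
    """Cost-threshold update after executing action a (cf. MAINLOOP in Algorithm 1):

        c_hat <- (c_hat - pi(a)*c_bar[a] - sum_{a' != a} pi(a')*Q_C[a']) / (gamma * pi(a))

    pi: dict action -> probability at the current history; c_bar: immediate-cost
    estimates; Q_C: cost action values. Scalar costs (K = 1) for brevity.
    """
    other = sum(pi[ap] * Q_C[ap] for ap in pi if ap != a)
    return (c_hat - pi[a] * c_bar[a] - other) / (gamma * pi[a])

# With a deterministic policy the update reduces to (c_hat - c_bar[a]) / gamma:
# the budget drops by the immediate cost and is then un-discounted one step.
pi = {"left": 1.0, "right": 0.0}
c_hat1 = update_threshold(1.0, pi, "left",
                          {"left": 0.2, "right": 0.0},
                          {"left": 0.5, "right": 3.0}, gamma=0.95)
```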
CC-POMCP is an extension of POMCP with cost constraints and can be seen as an anytime approximation of policy iteration with the simultaneous optimization of λ: the policy is sequentially evaluated via the Monte-Carlo returns

    QR(h, a) ← QR(h, a) + (R − QR(h, a)) / N(h, a)  and  QC(h, a) ← QC(h, a) + (C − QC(h, a)) / N(h, a)     (5)

and the policy is implicitly improved by the UCB1 action selection rule based on the scalarized value Q_λ(h, a) = QR(h, a) − λ⊤QC(h, a):

    argmax_a Q⊕_λ(h, a) = argmax_a [ QR(h, a) − λ⊤QC(h, a) + κ √( log N(h) / N(h, a) ) ]     (6)

Finally, λ is updated simultaneously using the current estimate of V_C(b0) − ĉ, which is the descent direction of the convex objective function:

    λ ← λ + α_n (QC(h0, a) − ĉ)  where a ~ π(·|h0)                                           (7)

The following theorem states that CC-POMCP asymptotically converges to the optimal λ* under a mild assumption:

Theorem 2. Suppose that λ is updated with increasing simulation step t, and the search tree is reset at the end of each update of λ, as detailed in Appendix F. If the asymptotic bias of UCT holds for all types of cost values (i.e. ∃M > 0 such that ∀k, |V_{C_k}^{π*_λ}(h0) − V_{C_k}(h0)| ≤ M (log t / t)), then either sign(V_{C_k}^{π*_λ}(h0) − ĉ_k) = sign(V_{C_k}(h0) − ĉ_k) or |V_{C_k}^{π*_λ}(h0) − ĉ_k| ≤ M log t / t holds with probability 1 as t → ∞.

The theorem states that either λ is close to optimal or it is improved by the update towards the direction of the negative subgradient. 
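The λ update of Eq. (7), together with the clipping range from Proposition 2, can be sketched as follows; the function name is ours, and the step sizes α_n = 1/n satisfy the usual conditions Σ_n α_n = ∞, Σ_n α_n² < ∞:

```python
def update_lambda(lam, Q_C_root, c_hat, n, R_min, R_max, tau, gamma):
    """One subgradient step on lambda (Eq. (7)), clipped to the Proposition 2
    range [0, (R_max - R_min) / (tau * (1 - gamma))].

    lam, Q_C_root, c_hat: length-K lists (one entry per cost constraint)."""
    alpha = 1.0 / n  # step-size sequence: sum diverges, sum of squares converges
    upper = (R_max - R_min) / (tau * (1.0 - gamma))
    return [
        min(max(l + alpha * (q - c), 0.0), upper)
        for l, q, c in zip(lam, Q_C_root, c_hat)
    ]

# If the sampled root cost exceeds the threshold, lambda grows (cost penalized
# more); if the policy is conservative, lambda shrinks toward zero.
lam1 = update_lambda([0.5], Q_C_root=[1.4], c_hat=[1.0], n=1,
                     R_min=0.0, R_max=10.0, tau=0.1, gamma=0.95)
lam2 = update_lambda([0.05], Q_C_root=[0.2], c_hat=[1.0], n=1,
                     R_min=0.0, R_max=10.0, tau=0.1, gamma=0.95)
```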
Note that CC-POMCP inherits the scalability of POMCP and thus does not require an explicit model of the environment: all we need is a black-box simulator G of the CPOMDP, which generates a sample (s′, o, r, c) ~ G(s, a) of the next state s′, observation o, reward r, and cost vector c, given the current state s and action a.

4.1 Admissible Costs

After the agent executes an action, the cost constraint threshold ĉ must be updated at the next time step. For this, we reformulate the notion of admissible cost [17], originally formulated for dynamic programming. The admissible cost ĉ_{t+1} at time step t+1 denotes the expected total cost allowed to be incurred in future time steps {t+1, t+2, ...} without violating the cost constraints. Under dynamic programming, the update is given by [17]: ĉ_{t+1} = (ĉ_t − E[C(b_t, a_t) | b0, π]) / γ, where evaluating E[C(b_t, a_t) | b0, π] requires the probability of reaching (b_t, a_t) at time step t, which in turn requires marginalizing out the past history [a0, b1, a2, ..., b_{t−1}]. This is intractable for large state spaces. On the other hand, under forward search, the admissible cost at the next time step t+1 is simply ĉ_{t+1} = V_C^{π*}(b_{t+1}). We can access V_C^{π*}(b_{t+1}) by starting from the root node of the search tree h_t and sequentially following the action branch a_t and the next observation branch o_{t+1}. Here we are assuming that the exact optimal V_C^{π*} is obtained, which is certainly achievable after infinitely many simulations of CC-POMCP. Note also that even though ĉ_t > V_C^{π*}(b_t) is possible in general, assuming ĉ_t = V_C^{π*}(b_t) does not change the solution. If ĉ_t < V_C^{π*}(b_t), this means that no feasible policy exists.

4.2 Filling the Gap: Stochastic vs Deterministic Policies

Our approach relies on the POMDP with scalarized rewards, but care must be taken since the optimal policy of the CPOMDP is generally stochastic: given the optimal λ*, let π*_λ be the deterministic optimal policy for the POMDP with the scalarized reward function R − λ*⊤C. Then, by the duality between the primal (1) and the dual (2),

    V*_R(b0; ĉ) = V*_{λ*}(b0) + λ*⊤ĉ = V_R^{π*_λ}(b0) − λ*⊤( V_C^{π*_λ}(b0) − ĉ )     (8)

This implies that if λ*_k > 0 and V_{C_k}^{π*_λ}(b0) ≠ ĉ_k for some k, then π*_λ is not optimal for the original CPOMDP. This is exactly the situation where the optimal policy is stochastic. In order to make the policy computed by our algorithm stochastic, we make sure that the following optimality condition, derived from V*_R(b0; ĉ) = V_R^{π*_λ}(b0), is satisfied:

    Σ_{k=1}^K λ*_k ( V_{C_k}^{π*_λ}(b0) − ĉ_k ) = Σ_{k=1}^K λ*_k ( Σ_a π*_λ(a|b0) Q_{C_k}^{π*_λ}(b0, a) − ĉ_k ) = 0     (9)

That is, the actions a*_i with equally maximal scalarized action values Q_λ(b, a*_i) = QR(b, a*_i) − λ⊤QC(b, a*_i) participate as the support of the stochastic policy, and are selected with probabilities π(a*_i|b) that satisfy Σ_i π(a*_i|b) Q_{C_k}(b, a*_i) = ĉ_k for every k with λ*_k > 0.

(a) Toy   (b) Rocksample (5, 7)   (c) Rocksample (7, 8)   (d) Rocksample (11, 11)

Figure 1: Results on the CPOMDP Toy domain [11] and the constrained variants of Rocksample [21]. The result for Rocksample (15, 15) is presented in Appendix I. For each domain, the left figure shows the average discounted cumulative reward, and the right figure shows the average discounted cumulative cost. The wall-clock search time for CC-POMCP is shown on the top x-axis.

GREEDYPOLICY in Algorithm 1 computes the stochastic policy according to the above principle. In practice, due to the randomness in Monte-Carlo sampling, the action values in (9) are always subject to estimation error, so the condition is reformulated as a linear program with up to |A| + 2K variables:

    min_{w_i, ξ⁺_k, ξ⁻_k}  Σ_{k=1}^K λ_k (ξ⁺_k + ξ⁻_k)
    s.t.  Σ_{i: a*_i ∈ A*} w_i Q_{C_k}(h, a*_i) = ĉ_k + (ξ⁺_k − ξ⁻_k)  ∀k            (10)
          Σ_{i: a*_i ∈ A*} w_i = 1  and  w_i, ξ⁺_k, ξ⁻_k ≥ 0

where A* = {a*_i : Q_λ(h, a*_i) ≃ Q_λ(h, a*)} with a* = argmax_a Q⊕_λ(h, a),2 and the solutions are w_i = π(a*_i|h). When K = 1, there is a simple analytic solution to LP (10), which is described in Appendix G. Even when K > 1, note that the optimization problem arises only when more than one action attains the maximal scalarized action value and randomization is thus required. It is well known for CMDPs that an optimal policy requires at most K randomizations [1], which means that we expect to invoke the optimization on an extremely small part of the state space when the problem is very large.

5 Experiments

All the parameters for running CC-POMCP are provided in Appendix H.

Baseline agent  To the best of our knowledge, this work is the first attempt to solve constrained (PO)MDPs using Monte-Carlo Tree Search. Since there is no algorithm for direct performance 
Since there is no algorithm for direct performance\ncomparison for large problems, we implemented a simple baseline agent using MCTS. This agent\nworks as outlined in section 2: it chooses an action via reward-maximizing POMCP while preventing\naction branches that violate cost constraint QC(s, a) \u2264 \u02c6c. If all action branches violate the cost\nconstraints, the agent chooses action uniformly at random.\n\n2Exact condition for Q\u03bb(h, a\u2217\n\ni ) (cid:39) Q\u03bb(h, a\u2217) and its theoretical guarantee are provided in Appendix E.\n\n7\n\n101102103104105simulations0.800.850.900.95discounted cumulative reward0.0010.010.1 earch time of CCPOMCP ( ec )101102103104105 imulation 0.900.920.940.96di counted cumulative co t\u012ec\u0302 Optimal ( tocha tic)Optimal (determini tic)Ba elineCCPOMCP (our )0.0010.010.1 earch time of CCPOMCP ( ec )102104106simulations0.02.55.07.510.012.515.0discounted c m lative reward0.0010.010.11search time of CCPOMCP (secs)102104106sim lations0.02.55.07.510.012.515.0disco nted c m lative cost\u012ecC\u0302LPBaselineCCPOMCP (o rs)0.0010.010.11search time of CCPOMCP (secs)102104106simulations0246810discounted cumula ive reward0.0010.010.11search ime of CCPOMCP (secs)102104106simula ions0.02.55.07.510.012.515.0discoun ed cumula ive cos \u012ecCALPBaselineCCPOMCP (ours)0.0010.010.11search ime of CCPOMCP (secs)102104106simulations\u22121012345discounted cumulative re ard0.0010.010.1110search time of CCPOMCP (secs)102104106simulations051015discounted cumulative cost\u012ec\u0302aselineCCPOMCP (ours)0.0010.010.1110search time of CCPOMCP (secs)\fDomain\n\n|S|\n\nRocksample (5,7)\n\n3,201\n\nRocksample (7,8)\n\n12,544\n\nRocksample (11,11)\n\n247,808\n\nRocksample (15,15)\n\n7,372,800\n\n1\n\n1\n\n1\n\n1\n\nCALP\nBaseline\nCC-POMCP\nCALP\nBaseline\nCC-POMCP\nCALP\nBaseline\nCC-POMCP\nCALP\nBaseline\nCC-POMCP\n\n\u02c6c Algorithm\n\nCumulative 
reward\n\n12.77\u00b10\n1.09\u00b10.88\n11.36\u00b11.02\n3.67\u00b10\n\u22120.23\u00b10.44\n9.36\u00b10.76\n0.14\u00b10.33\n2.65\u00b10.73\n0.39\u00b10.58\n0.74\u00b10.33\n\nN/A\n\nN/A\n\nCumulative cost\n0.78\u00b10\n12.74\u00b10.50\n0.79\u00b10.06\n1.20\u00b10\n13.92\u00b10.33\n0.56\u00b10.06\n15.29\u00b10.25\n0.09\u00b10.04\n16.27\u00b10.27\n0.69\u00b10.08\n\nN/A\n\nN/A\n\nTable 1: Comparison of CC-POMDP with the state-of-the-art of\ufb02ine solver, CALP [18].\n\nCPOMDP: Toy and Rocksample We \ufb01rst tested CC-POMCP on the synthetic toy domain intro-\nduced in [11] to demonstrate convergence to stochastic optimal actions, where the cost constraint \u02c6c is\n0.95. Any deterministic policy is suboptimal or violates the cost constraint. As can be seen in Figure\n1a, CC-POMCP converges to optimal stochastic action selection (thus experimentally con\ufb01rms the\nsoundness of algorithm), while the baseline agent converges to the suboptimal policy (optimal policy\namong deterministic ones).\nWe also conducted experiments on cost-constrained variants of Rocksample [21]. Rocksample(n, k)\nsimulates a Mars rover in n \u00d7 n grid containing k rocks. The goal is to sort out good rocks, collect\nthem, and escape the map by moving to the rightmost part of the map. We augmented the single-\nobjective Rocksample with the cost function that assigns 1 to low reward state-action pairs (i.e.\nCp(s, a) = 1 if Rp(s, a) < 0), similarly to [18]. We also assigned the cost of 1 to actions detecting\nwhether a rock is good or bad. The cost constraint \u02c6c is set to 1. We compared CC-POMCP with\nthe state-of-the-art of\ufb02ine CPOMDP solver, CALP [18]. CALP was allowed 10 minutes for the\nof\ufb02ine computation, and we performed exact policy evaluation with respect to the resulting \ufb01nite state\ncontroller without simulation in the real environment. The results on Rocksample are summarized in\nTable 1 and Figure 1. 
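The baseline agent described above (reward-maximizing POMCP that prunes cost-violating action branches) can be sketched as a single selection step; the data layout and numbers here are illustrative assumptions:

```python
import math
import random

def baseline_action(h, actions, Q_R, Q_C, N, N_a, c_hat, kappa=1.0):
    """Reward-maximizing UCB1 over the actions whose estimated cost respects the
    threshold; falls back to a uniformly random action when every branch
    violates the constraint (the baseline agent of Section 5)."""
    feasible = [a for a in actions if Q_C[(h, a)] <= c_hat]
    if not feasible:
        return random.choice(actions)
    return max(
        feasible,
        key=lambda a: Q_R[(h, a)] + kappa * math.sqrt(math.log(N[h]) / N_a[(h, a)]),
    )

# Illustrative statistics: c_hat = 1 rules out a1 despite its high reward.
h, actions = "h0", ["a1", "a2", "a3"]
Q_R = {("h0", "a1"): 5.0, ("h0", "a2"): 3.0, ("h0", "a3"): 4.0}
Q_C = {("h0", "a1"): 2.0, ("h0", "a2"): 0.5, ("h0", "a3"): 0.9}
N = {"h0": 30}
N_a = {("h0", "a1"): 10, ("h0", "a2"): 10, ("h0", "a3"): 10}
choice = baseline_action(h, actions, Q_R, Q_C, N, N_a, c_hat=1.0)
```

The sketch also makes the baseline's weakness concrete: a single pessimistic cost estimate is enough to prune a genuinely feasible branch, which is why the baseline behaves so conservatively in the experiments.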
In Rocksample (5, 7), the reward performance of CC-POMCP is comparable to CALP when more than 2 seconds of search time is allowed, while at the same time satisfying the cost constraint. In contrast, the baseline agent exhibited essentially random behavior, since the early-stage Monte-Carlo returns mostly violate the cost constraint for all actions. On Rocksample (7, 8), CALP failed to compute a feasible policy, and CC-POMCP outperformed CALP in terms of reward while satisfying the cost constraint. Finally, CC-POMCP was able to scale to Rocksample (11, 11) and (15, 15): given a few seconds of search time, CC-POMCP was able to find actions satisfying the cost constraint, and tended to yield higher returns as we increased the number of simulations.

CMDP: Pong   We also conducted experiments on a multi-objective version of PONG, an arcade game running on the Arcade Learning Environment (ALE) [3], depicted in Figure 2a. In this domain, the left paddle is controlled by the default computer opponent and the right paddle by the agent. We use the RAM state features, i.e., the states are binary strings of length 1024, which yields |S| = 2^1024. The action space is {up, down, stay}. The agent receives a reward in {1, −1} for each round, depending on a win or a loss, and the episode terminates when the accumulated reward reaches 21 or −21. We assigned cost 0 to the center area (position ∈ [0.4, 0.6]), 1 to the neighboring areas (position ∈ [0.2, 0.4] ∪ [0.6, 0.8]), and 2 to the areas farthest from the center (position ∈ [0.0, 0.2] ∪ [0.8, 1.0]). This cost function is motivated by a scenario in which a human expert tries to constrain the RL agent to adhere to the advice that it should stay in the center; the advice is encoded as the cost function and its threshold.
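The position-based cost above can be written as a small function. This is a minimal sketch, assuming the paddle position is already normalized to [0, 1] (0: topmost, 1: bottommost); the name `paddle_cost` is ours, and the band boundaries are resolved toward the cheaper band.

```python
def paddle_cost(position):
    """Cost of the agent's normalized paddle position in constrained PONG:
    0 in the center band [0.4, 0.6], 1 in the neighboring bands
    [0.2, 0.4) and (0.6, 0.8], and 2 in the outermost bands."""
    distance = abs(position - 0.5)  # distance from the center of the screen
    if distance <= 0.1:
        return 0
    if distance <= 0.3:
        return 1
    return 2
```

The thresholds ĉ ∈ {200, 100, 50, 30, 20} then bound the discounted cumulative sum of these per-step costs.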
We experimented with various cost-constraint thresholds ĉ ∈ {200, 100, 50, 30, 20}, ranging from the unconstrained case ĉ = 200 (since C_max/(1 − γ) = 200) to the tightly constrained case ĉ = 20. The agent thus has two conflicting objectives: in order to achieve high rewards, it sometimes needs to move the paddle to positions far from the center, but if this happens too often, the cost constraint will be violated. It therefore needs to trade off reward and cost depending on the cost constraint threshold ĉ.

|S| = 2^1024, |A| = 3

 ĉ    Algorithm   Avg cumulative reward   Avg discounted cumulative cost   Avg score (FOE vs ALGO)
 200  CC-POMCP     21.00±0.00             133.00±4.97                       0.0 vs 21.0
      Baseline     21.00±0.00             136.66±4.45                       0.0 vs 21.0
 100  CC-POMCP     19.27±1.63              99.88±0.13                       1.4 vs 20.7
      Baseline    −15.05±3.83             110.88±3.86                      18.9 vs 3.9
 50   CC-POMCP     17.88±1.79              49.95±0.07                       2.8 vs 20.7
      Baseline    −20.45±0.26             130.40±4.99                      21.0 vs 0.6
 30   CC-POMCP     −0.07±5.23              30.40±0.46                      13.2 vs 13.2
      Baseline    −20.48±0.30             131.37±5.08                      21.0 vs 0.5
 20   CC-POMCP    −17.80±2.91              25.36±1.25                      20.1 vs 2.2
      Baseline    −20.48±0.30             131.37±5.08                      21.0 vs 0.5

Figure 2: (a) The multi-objective version of Atari 2600 PONG, visualizing the cost function. (b) Results on constrained PONG. Above: histograms of the CC-POMCP agent's position, where the horizontal axis denotes the position of the agent (0: topmost, 1: bottommost) and the vertical axis denotes the relative discounted visitation rate of each bin.

Figure 2b summarizes the experimental results of the CC-POMCP and baseline agents. When ĉ = 200 (the unconstrained case), both algorithms always win the game 21 to 0. As we lower ĉ, CC-POMCP tends to stay in the center in order to trade off reward against cost (shown in the histograms in Figure 2b). We can also see that the agent gradually performs worse in terms of score as ĉ decreases; this is natural, since it is forced to stay in the center and thus sacrifice the game score. Overall, CC-POMCP computes a good policy while generally respecting the cost constraint. The baseline, on the other hand, fails to produce a meaningful policy except when ĉ = 200, since the early-stage Monte-Carlo cost returns mostly violate the cost constraint, resulting in random behavior.

6 Conclusion

We presented CC-POMCP, an online MCTS algorithm for very large CPOMDPs. We showed that solving the dual LP of CPOMDPs is equivalent to jointly solving an unconstrained POMDP and optimizing its LP-induced parameters λ, and provided theoretical results that shed insight on the properties of λ and on how to optimize it. We then extended POMCP to maximize the scalarized value while simultaneously updating λ using the current action-value estimates Q_C. We also empirically showed that CC-POMCP converges to the optimal stochastic actions on a toy domain and easily scales to very large CPOMDPs, through the constrained variants of Rocksample and the multi-objective version of PONG.

Acknowledgements

This work was supported by the ICT R&D program of MSIT/IITP of Korea (No. 2017-0-01778) and DAPA/ADD of Korea (UD170018CD). J. Lee acknowledges the Global Ph.D. Fellowship Program by NRF of Korea (NRF-2018-Global Ph.D. Fellowship Program).

References

[1] Eitan Altman. Constrained Markov Decision Processes.
Chapman and Hall, 1999.

[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[3] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[4] Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4:1–49, 2012.

[5] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings of the 5th International Conference on Computers and Games, pages 72–83, 2006.

[6] Eugene A. Feinberg and Aleksey B. Piunovskiy. Nonatomic total rewards Markov decision processes with multiple criteria. Journal of Mathematical Analysis and Applications, 273:93–111, 2002.

[7] Sylvain Gelly and David Silver. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175(11):1856–1875, 2011.

[8] Arthur Guez, David Silver, and Peter Dayan. Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, pages 841–883, 2013.

[9] Joshua D. Isom, Sean P. Meyn, and Richard D. Braatz. Piecewise linear dynamic programming for constrained POMDPs. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pages 291–296, 2008.

[10] Sammie Katt, Frans A. Oliehoek, and Christopher Amato. Learning in POMDPs with Monte Carlo tree search.
In Proceedings of the 34th International Conference on Machine Learning, pages 1819–1827, 2017.

[11] Dongho Kim, Jaesong Lee, Kee-Eung Kim, and Pascal Poupart. Point-based value iteration for constrained POMDPs. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI-11), pages 1968–1974, 2011.

[12] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the Seventeenth European Conference on Machine Learning (ECML 2006), pages 282–293, 2006.

[13] Levente Kocsis, Csaba Szepesvári, and Jan Willemson. Improved Monte-Carlo Search. Technical Report 1, University of Tartu, Estonia, 2006.

[14] Jongmin Lee, Youngsoo Jang, Pascal Poupart, and Kee-Eung Kim. Constrained Bayesian reinforcement learning via approximate linear programming. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 2088–2095, 2017.

[15] Eunsoo Oh and Kee-Eung Kim. A geometric traversal algorithm for reward-uncertain MDPs. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 565–572, 2011.

[16] Christos Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.

[17] Alexei B. Piunovskiy and Xuerong Mao. Constrained Markovian decision processes: the dynamic programming approach. Operations Research Letters, 27(3):119–126, 2000.

[18] Pascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, and Michael Bowling. Approximate linear programming for constrained partially observable Markov decision processes. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 3342–3348, 2015.

[19] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs.
In Advances in Neural Information Processing Systems 23, pages 2164–2172, 2010.

[20] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, pages 484–489, 2016.

[21] Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI '04), pages 520–527, 2004.

[22] Edward J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.

[23] A. Undurti and J. P. How. An online algorithm for constrained POMDPs. In 2010 IEEE International Conference on Robotics and Automation, pages 3966–3973, 2010.

[24] Jason D. Williams and Steve Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422, 2007.