{"title": "Bayes-Adaptive Simulation-based Search with Value Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 459, "abstract": "Bayes-adaptive planning offers a principled solution to the exploration-exploitation trade-off under model uncertainty. It finds the optimal policy in belief space, which explicitly accounts for the expected effect on future rewards of reductions in uncertainty. However, the Bayes-adaptive solution is typically intractable in domains with large or continuous state spaces. We present a tractable method for approximating the Bayes-adaptive solution by combining simulation-based search with a novel value function approximation technique that generalises over belief space. Our method outperforms prior approaches in both discrete bandit tasks and simple continuous navigation and control tasks.", "full_text": "Bayes-Adaptive Simulation-based Search\n\nwith Value Function Approximation\n\nArthur Guez\u2217,1,2\n\nNicolas Heess2\n\nDavid Silver2\n\nPeter Dayan1\n\n\u2217aguez@google.com\n\n1Gatsby Unit, UCL\n\n2Google DeepMind\n\nAbstract\n\nBayes-adaptive planning offers a principled solution to the exploration-\nexploitation trade-off under model uncertainty. It \ufb01nds the optimal policy in be-\nlief space, which explicitly accounts for the expected effect on future rewards of\nreductions in uncertainty. However, the Bayes-adaptive solution is typically in-\ntractable in domains with large or continuous state spaces. We present a tractable\nmethod for approximating the Bayes-adaptive solution by combining simulation-\nbased search with a novel value function approximation technique that generalises\nappropriately over belief space. 
Our method outperforms prior approaches in both discrete bandit tasks and simple continuous navigation and control tasks.

1 Introduction

A fundamental problem in sequential decision making is controlling an agent when the environmental dynamics are only partially known. In such circumstances, probabilistic models of the environment are used to capture the uncertainty of current knowledge given past data; they thus imply how exploring the environment can be expected to lead to new, exploitable, information.

In the context of Bayesian model-based reinforcement learning (RL), Bayes-adaptive (BA) planning [8] solves the resulting exploration-exploitation trade-off by directly optimizing future expected discounted return in the joint space of states and beliefs about the environment (or, equivalently, interaction histories). Performing such optimization even approximately is computationally highly challenging; however, recent work has demonstrated that online planning by sample-based forward-search can be effective [22, 1, 12]. These algorithms estimate the value of future interactions by simulating trajectories while growing a search tree, taking model uncertainty into account. However, one major limitation of Monte Carlo search algorithms in general is that, naively applied, they fail to generalize values between related states. In the BA case, a separate value is stored for each distinct path of possible interactions. Thus, the algorithms fail not only to generalize values between related paths, but also to reflect the fact that different histories can correspond to the same belief about the environment. As a result, the number of required simulations grows exponentially with search depth. 
Worse yet, except in very restricted scenarios, the lack of generalization renders MC search algorithms effectively inapplicable to BAMDPs with continuous state or action spaces.

In this paper, we propose a class of efficient simulation-based algorithms for Bayes-adaptive model-based RL which use function approximation to estimate the value of interaction histories during search. This enables generalization between different beliefs, states, and actions during planning, and therefore also works for continuous state spaces. To our knowledge this is the first broadly applicable MC search algorithm for continuous BAMDPs.

Our algorithm builds on the success of a recent tree-based algorithm for discrete BAMDPs (BAMCP, [12]) and exploits value function approximation for generalization across interaction histories, as has been proposed for simulation-based search in MDPs [19]. As a crucial step towards this end we develop a suitable parametric form for the value function estimates that can generalize appropriately across histories, using the importance sampling weights of posterior samples to compress histories into a finite-dimensional feature vector. As in BAMCP we take advantage of root sampling [18, 12] to avoid expensive belief updates at every step of simulation, making the algorithm practical for a broad range of priors over environment dynamics. We also provide an interpretation of root sampling as an auxiliary variable sampling method. This leads to a new proof of its validity in general simulation-based settings, including BAMDPs with continuous state and action spaces, and a large class of algorithms that includes MC and TD updates.

Empirically, we show that our approach requires considerably fewer simulations to find good policies than BAMCP in a (discrete) bandit task and two continuous control tasks with a Gaussian process prior over the dynamics [5, 6]. 
In the well-known pendulum swing-up task, our algorithm learns how to balance after just a few seconds of interaction. Below, we first briefly review the Bayesian formulation of optimal decision making under model uncertainty (section 2; please see [8] for additional details). We then explain our algorithm (section 3) and present empirical evaluations in section 4. We conclude with a discussion, including related work (sections 5 and 6).

2 Background

A Markov Decision Process (MDP) is described as a tuple M = ⟨S, A, P, R, γ⟩ with S the set of states (which may be infinite), A the discrete set of actions, P : S × A × S → R the state transition probability kernel, R : S × A → R the reward function, and γ < 1 the discount factor. The agent starts with a prior P(P) over the dynamics, and maintains a posterior distribution b_t(P) = P(P | h_t) ∝ P(h_t | P) P(P), where h_t denotes the history of states, actions, and rewards up to time t.

The uncertainty about the dynamics of the model can be transformed into certainty about the current state inside an augmented state space S+ = H × S, where H is the set of possible histories (the current state also being the suffix of the current history). The dynamics and rewards associated with this augmented state space are described by

P+(⟨h, s⟩, a, ⟨has', s'⟩) = ∫_P P(s, a, s') P(P | h) dP,   R+(⟨h, s⟩, a) = R(s, a).   (1)

Together, the 5-tuple M+ = ⟨S+, A, P+, R+, γ⟩ forms the Bayes-Adaptive MDP (BAMDP) for the MDP problem M. Since the dynamics of the BAMDP are known, it can in principle be solved to obtain the optimal value function associated with each action:

Q*(h_t, s_t, a) = max_π̃ E_π̃ [ Σ_{t'=t}^∞ γ^(t'−t) r_{t'} | a_t = a ];   π̃*(h_t, s_t) = argmax_a Q*(h_t, s_t, a),   (2)

where π̃ : S+ × A → [0, 1] is a policy over the augmented state space, from which the optimal action for each belief-state π̃*(h_t, s_t) can readily be derived. Optimal actions in the BAMDP are executed greedily in the real MDP M, and constitute the best course of action (i.e., integrating exploration and exploitation) for a Bayesian agent with respect to its prior belief over P.

3 Bayes-Adaptive simulation-based search

Our simulation-based search algorithm for the Bayes-adaptive setup combines efficient MC search via root sampling with value function approximation. We first explain its underlying idea, assuming a suitable function approximator exists, and provide a novel proof justifying the use of root sampling that also applies in continuous state-action BAMDPs. Finally, we explain how to model Q-values as a function of interaction histories.

3.1 Algorithm

As in other forward-search planning algorithms for Bayesian model-based RL [22, 17, 1, 12], at each step t, which is associated with the current history h_t (or belief) and state s_t, we plan online to find π̃*(h_t, s_t) by constructing an action-value function Q(h, s, a). Such methods use simulation to build a search tree of belief states, each of whose nodes corresponds to a single (future) history, and estimate optimal values for these nodes. However, existing algorithms only update the nodes that are directly traversed in each simulation. 
This is inefficient, as it fails to generalize across multiple histories corresponding either to exactly the same, or similar, beliefs. Instead, each such history must be traversed and updated separately.

Here, we use a more general simulation-based search that relies on function approximation, rather than a tree, to represent the values for possible simulated histories and states. This approach was originally suggested in the context of planning in large MDPs [19]; we extend it to the case of Bayes-Adaptive planning. The Q-value of a particular history, state, and action is represented as Q(h, s, a; w), where w is a vector of learnable parameters. Fixed-length simulations are run from the current belief-state h_t, s_t, and the parameter w is updated online, during search, based on experience accumulated along these trajectories, using an incremental RL control algorithm (e.g., Monte-Carlo control, Q-learning). If the parametric form and features induce generalization between histories, then each forward simulation can affect the values of histories that are not directly experienced. This can considerably speed up planning, and enables continuous-state problems to be tackled. Note that a search tree would be a special case of the function approximation approach when the representation of states and histories is tabular.

In the context of Bayes-Adaptive planning, simulation-based search works by simulating a future trajectory h_{t+T} = s_t a_t r_t s_{t+1} ... a_{t+T−1} r_{t+T−1} s_{t+T} of T transitions (the planning horizon) starting from the current belief-state h_t, s_t. Actions are selected by following a fixed policy π̃, which is itself a function of the history, a ∼ π̃(h, ·). State transitions can be sampled according to the BAMDP dynamics, s_{t'} ∼ P+(⟨h_{t'−1}, s_{t'−1}⟩, a_{t'}, ⟨h_{t'−1}a_{t'}·, ·⟩). However, this can be computationally expensive since belief updates must be applied at every step of the simulation. As an alternative, we use root sampling [18], which only samples the dynamics P^k ∼ P(P | h_t) once at the root for each simulation k and then samples transitions according to s_{t'} ∼ P^k(s_{t'−1}, a_{t'−1}, ·); we provide justification for this approach in Section 3.2.¹

After the trajectory h_{t+T} has been simulated on a step, the Q-value is modified by updating w based on the data in h_{t+T}. Any incremental algorithm could be used, including SARSA, Q-learning, or gradient TD [20]; we use a simple scheme to minimize an appropriately weighted squared loss E[(Q(h_{t'}, s_{t'}, a_{t'}; w) − R_{t'})²]:

Δw = −α (Q(h_{t'}, s_{t'}, a_{t'}; w) − R_{t'}) ∇_w Q(h_{t'}, s_{t'}, a_{t'}; w),   (3)

where α is the learning rate and R_{t'} denotes the discounted return obtained from history h_{t'}.² Algorithm 1 provides pseudo-code for this scheme; here we suggest using as the fixed policy for a simulation the ε-greedy policy π̃_ε-greedy based on some given Q value.

Algorithm 1: Bayes-Adaptive simulation-based search with root sampling

procedure Search(h_t, s_t)
    repeat
        P ∼ P(P | h_t)
        Simulate(h_t, s_t, P, 0)
    until Timeout()
    return argmax_a Q(h_t, s_t, a; w)
end procedure

procedure Simulate(h, s, P, t)
    if t > T then return 0
    a ← π̃_ε-greedy(Q(h, s, ·; w))
    s' ∼ P(s, a, ·), r ← R(s, a)
    R ← r + γ Simulate(has', s', P, t+1)
    w ← w − α (Q(h, s, a; w) − R) ∇_w Q(h, s, a; w)
    return R
end procedure
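The search loop with root sampling and the update of Eq. (3) can be sketched in a few lines. The following is a minimal, hypothetical instance on a toy 2-state, 2-action MDP with an independent Dirichlet belief per (state, action) row; the sizes, reward table, and hyperparameters are illustrative assumptions, not the paper's experimental setup, and the tabular features make this the search-tree special case mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, T, gamma, alpha, eps = 2, 2, 10, 0.95, 0.05, 0.2
R = np.array([[0.0, 1.0], [0.0, 0.0]])   # known reward table R(s, a)
counts = np.ones((nS, nA, nS))           # Dirichlet counts (the prior, at an empty history)

def feats(s, a):
    # Tabular features over (state, action); richer features would induce
    # generalization between histories and states.
    x = np.zeros(nS * nA)
    x[s * nA + a] = 1.0
    return x

w = np.zeros(nS * nA)

def Q(s, a):
    return w @ feats(s, a)

def simulate(s, P, t):
    # One rollout with root sampling: P was drawn once at the root,
    # so no belief update is needed inside the simulation.
    global w
    if t >= T:
        return 0.0
    qs = [Q(s, a) for a in range(nA)]
    a = rng.integers(nA) if rng.random() < eps else int(np.argmax(qs))
    s2 = rng.choice(nS, p=P[s, a])
    ret = R[s, a] + gamma * simulate(s2, P, t + 1)
    w = w - alpha * (Q(s, a) - ret) * feats(s, a)  # the Eq. (3) update
    return ret

def search(s, n_sims=300):
    for _ in range(n_sims):
        # Root sampling: draw one dynamics model from the current belief.
        P = np.array([[rng.dirichlet(counts[s_, a_]) for a_ in range(nA)]
                      for s_ in range(nS)])
        simulate(s, P, 0)
    return int(np.argmax([Q(s, a) for a in range(nA)]))

best = search(0)
```

In this toy problem, action 1 in state 0 yields immediate reward under any sampled dynamics, so the Monte-Carlo backups quickly separate Q(0, 1) from Q(0, 0).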
Other policies could be considered (e.g., the UCT policy for search trees), but are not the main focus of this paper.

3.2 Analysis

In order to exploit general results on the convergence of classical RL algorithms for our simulation-based search, it is necessary to show that, starting from the current history, root sampling produces the appropriate distribution of rollouts. For the purpose of this section, a simulation-based search algorithm includes Algorithm 1 (with Monte-Carlo backups) but also incremental variants, as discussed above, or BAMCP.

Let D_t^π̃ be the rollout distribution function of forward-simulations that explicitly update the belief at each step (i.e., using P+): D_t^π̃(h_{t+T}) is the probability density that history h_{t+T} is generated when running that simulation from h_t, s_t, with T the horizon of the simulation, and π̃ an arbitrary history policy. Similarly define the quantity D̃_t^π̃(h_{t+T}) as the probability density that history h_{t+T} is generated when running forward-simulations with root sampling, as in Algorithm 1. The following lemma shows that these two rollout distributions are the same.

¹For comparison, a version of the algorithm without root sampling is listed in the supplementary material.
²The loss is weighted according to the distribution of belief-states visited from the current state by executing π̃.

Lemma 1. D_t^π̃(h_{t+T}) = D̃_t^π̃(h_{t+T}) for all policies π̃ : H × A → [0, 1] and for all h_{t+T} ∈ H of length t + T.

Proof. A similar result has been obtained for discrete state-action spaces as Lemma 1 in [12] using an induction step on the history length. Here we provide a more intuitive interpretation of root sampling as an auxiliary variable sampling scheme which also applies directly to continuous spaces. 
We show the equivalence by rewriting the distribution of rollouts. The usual way of sampling histories in simulation-based search, with belief updates, is justified by factoring the density as follows:

p(h_{t+T} | h_t, π̃) = p(a_t s_{t+1} a_{t+1} s_{t+2} ... s_{t+T} | h_t, π̃)
  = p(a_t | h_t, π̃) p(s_{t+1} | h_t, π̃, a_t) p(a_{t+1} | h_{t+1}, π̃) ... p(s_{t+T} | h_{t+T−1}, a_{t+T−1}, π̃)
  = ∏_{t≤t'<t+T} π̃(h_{t'}, a_{t'}) ∏_{t<t'≤t+T} p(s_{t'} | h_{t'−1}, π̃, a_{t'−1})
  = ∏_{t≤t'<t+T} π̃(h_{t'}, a_{t'}) ∏_{t<t'≤t+T} ∫_P P(s_{t'−1}, a_{t'−1}, s_{t'}) P(P | h_{t'−1}) dP.

[Figure: fraction of runs balancing the pendulum as a function of time (s), for the BAFA, BAMCP, FA, and THOMP algorithms.]

…is due to a lack of structure in the problem and test this with a more challenging, albeit artificial, version of the pendulum problem that requires non-myopic planning over longer horizons. In this modified version, balancing the pendulum (i.e., being in the region |θ| < π/4) is either rewarding (R(s) = 1) with probability 0.5, or costly (R(s) = −1) with probability 0.5; all other states have an associated reward of 0. This can be modeled formally by introducing another binary latent variable in the model. These latent dynamics are observed with certainty if the pendulum reaches any state where |θ| ≥ 3π/4. The rest of the problem is the same. To approximate correctly the Bayes-optimal solution in this setting, the planning algorithm must optimize the belief-state policy after it simulates observing whether balancing is rewarding or not. We run this version of the problem with the same algorithms as above and report the results in Figure 3-b. This hard planning problem highlights more clearly the benefits of Bayes-adaptive planning and value generalization. 
Our approach manages to balance the pendulum more than 80% of the time, compared to about 35% for BAMCP, while THOMP and FA fail to balance for almost all runs. In the Suppl. material, Figure S2 illustrates the influence of the number of particles M on the performance of BAFA.

5 Related Work

Simulation-based search with value function approximation has been investigated in large and also continuous MDPs, in combination with TD-learning [19] or Monte-Carlo control [3]. However, this has not been in a Bayes-adaptive setting. By contrast, existing online Bayes-Adaptive algorithms [22, 17, 1, 12, 9] rely on a tree structure to build a map from histories to value. This cannot benefit from generalization in a straightforward manner, leading to the inefficiencies demonstrated above and hindering their application to the continuous case. Continuous Bayes-Adaptive (PO)MDPs have been considered using an online Monte-Carlo algorithm [4]; however this tree-based planning algorithm expands nodes uniformly, and does not admit generalization between beliefs. This severely limits the possible depth of tree search ([4] use a depth of 3).

In the POMDP literature, a key idea to represent beliefs is to sample a finite set of (possibly approximate) belief points [21, 16] from the set of possible beliefs in order to obtain a small number of (belief-)states for which to back up values offline or via forward search [13]. In contrast, our sampling approach to belief representation does not restrict the number of (approximate) belief points since our belief features (z(h)) can take an infinite number of values, but it instead restricts their dimension, thus avoiding infinite-dimensional belief spaces. Wang et al. [23] also use importance sampling to compute the weights of a finite set of particles. However, they use these particles to discretize the model space and thus create an approximate, discrete POMDP. 
They solve this offline with no (further) generalization between beliefs, and thus no opportunity to re-adjust the belief representation based on past experience. A function approximation scheme in the context of BA planning has been considered by Duff [7], in an offline actor-critic paradigm. However, this was in a discrete setting where counts could be used as features for the belief.

6 Discussion

We have introduced a tractable approach to Bayes-adaptive planning in large or continuous state spaces. Our method is quite general, subsuming Monte Carlo tree search methods, while allowing for arbitrary generalizations over interaction histories using value function approximation. Each simulation is no longer an isolated path in an exponentially growing tree; instead, value backups can impact many non-visited beliefs and states. We proposed a particular parametric form for the action-value function based on a Monte-Carlo approximation of the belief. To reduce the computational complexity of each simulation, we adopt a root sampling method which avoids expensive belief updates during a simulation and hence poses very few restrictions on the possible form of the prior over environment dynamics.

Our experiments demonstrated that the BA solution can be effectively approximated, and that the resulting generalization can lead to substantial gains in efficiency in discrete tasks with large trees. We also showed that our approach can be used to solve continuous BA problems with non-trivial planning horizons without discretization, something which had not previously been possible. Using a widely used GP framework to model continuous system dynamics (for the case of a swing-up pendulum task), we achieved state-of-the-art performance.

Our general framework can be applied with more powerful methods for learning the parameters of the value function approximation, and it can also be adapted to be used with continuous actions. 
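The Monte-Carlo approximation of the belief referred to above compresses a history into a finite-dimensional feature vector via the importance-sampling weights of posterior samples. A minimal sketch of that idea follows; the particular M, the Dirichlet prior, and the toy transition tuples are illustrative assumptions, not the paper's full construction.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, M = 3, 2, 8

# M candidate dynamics models drawn once at the root from a Dirichlet prior.
models = [np.array([[rng.dirichlet(np.ones(nS)) for _ in range(nA)]
                    for _ in range(nS)]) for _ in range(M)]

def z(history):
    # history: list of (s, a, s2) transitions. Each sampled model P^k is
    # weighted by its likelihood of the history; the normalized weights give
    # a fixed-dimensional belief feature z(h) living in the M-simplex.
    logw = np.zeros(M)
    for k, P in enumerate(models):
        for (s, a, s2) in history:
            logw[k] += np.log(P[s, a, s2])
    wts = np.exp(logw - logw.max())   # stabilized exponentiation
    return wts / wts.sum()

empty = z([])                               # uniform weights before any data
after = z([(0, 1, 2), (2, 0, 1), (0, 1, 2)])
```

Because the weights depend on the history only through the product of transition likelihoods, two histories containing the same transitions in a different order map to the same feature vector, which gives exactly the kind of generalization across equivalent histories that a search tree cannot provide.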
We expect that further gains will be possible, e.g. from the use of bootstrapping in the weight updates, alternative rollout policies, and reusing values and policies between (real) steps.

References

[1] J. Asmuth and M. Littman. Approaching Bayes-optimality using Monte-Carlo tree search. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 19–26, 2011.
[2] Dimitri P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.
[3] S. R. K. Branavan, D. Silver, and R. Barzilay. Learning to win by reading manuals in a Monte-Carlo framework. Journal of Artificial Intelligence Research, 43:661–704, 2012.
[4] P. Dallaire, C. Besse, S. Ross, and B. Chaib-draa. Bayesian reinforcement learning in continuous POMDPs with Gaussian processes. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2009), pages 2604–2609. IEEE, 2009.
[5] Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters. Gaussian process dynamic programming. Neurocomputing, 72(7):1508–1524, 2009.
[6] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, pages 465–473. International Machine Learning Society, 2011.
[7] M. Duff. Design for an optimal probe. In Proceedings of the 20th International Conference on Machine Learning, pages 131–138, 2003.
[8] M. O. G. Duff. Optimal Learning: Computational Procedures For Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst, 2002.
[9] Raphael Fonteneau, Lucian Busoniu, and Rémi Munos. Optimistic planning for belief-augmented Markov decision processes. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2013), 2013.
[10] J. C. Gittins, R. Weber, and K. D. Glazebrook. Multi-armed Bandit Allocation Indices. Wiley, 1989.
[11] Neil J. Gordon, David J. Salmond, and Adrian F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113, 1993.
[12] A. Guez, D. Silver, and P. Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems (NIPS), pages 1034–1042, 2012.
[13] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, pages 65–72, 2008.
[14] H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 719–726, 2010.
[15] Teodor Mihai Moldovan, Michael I. Jordan, and Pieter Abbeel. Dirichlet process reinforcement learning. In Reinforcement Learning and Decision Making Meeting, 2013.
[16] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, volume 18, pages 1025–1032, 2003.
[17] S. Ross and J. Pineau. Model-based Bayesian reinforcement learning in large structured domains. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI 2008), pages 476–483, 2008.
[18] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (NIPS), pages 2164–2172, 2010.
[19] David Silver, Richard S. Sutton, and Martin Müller. Temporal-difference search in computer Go. Machine Learning, 87(2):183–219, 2012.
[20] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), volume 382, page 125, 2009.
[21] Sebastian Thrun. Monte Carlo POMDPs. In Advances in Neural Information Processing Systems (NIPS), volume 12, pages 1064–1070, 1999.
[22] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In Proceedings of the 22nd International Conference on Machine Learning, pages 956–963, 2005.
[23] Y. Wang, K. S. Won, D. Hsu, and W. S. Lee. Monte Carlo Bayesian reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.
", "award": [], "sourceid": 289, "authors": [{"given_name": "Arthur", "family_name": "Guez", "institution": "Google DeepMind"}, {"given_name": "Nicolas", "family_name": "Heess", "institution": "Gatsby Unit"}, {"given_name": "David", "family_name": "Silver", "institution": "UCL"}, {"given_name": "Peter", "family_name": "Dayan", "institution": "Gatsby Unit, UCL"}]}