{"title": "Bayes-Adaptive Simulation-based Search with Value Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 459, "abstract": "Bayes-adaptive planning offers a principled solution to the exploration-exploitation trade-off under model uncertainty. It finds the optimal policy in belief space, which explicitly accounts for the expected effect on future rewards of reductions in uncertainty. However, the Bayes-adaptive solution is typically intractable in domains with large or continuous state spaces. We present a tractable method for approximating the Bayes-adaptive solution by combining simulation-based search with a novel value function approximation technique that generalises over belief space. Our method outperforms prior approaches in both discrete bandit tasks and simple continuous navigation and control tasks.", "full_text": "Bayes-Adaptive Simulation-based Search\n\nwith Value Function Approximation\n\nArthur Guez\u2217,1,2\n\nNicolas Heess2\n\nDavid Silver2\n\nPeter Dayan1\n\n\u2217aguez@google.com\n\n1Gatsby Unit, UCL\n\n2Google DeepMind\n\nAbstract\n\nBayes-adaptive planning offers a principled solution to the exploration-\nexploitation trade-off under model uncertainty. It \ufb01nds the optimal policy in be-\nlief space, which explicitly accounts for the expected effect on future rewards of\nreductions in uncertainty. However, the Bayes-adaptive solution is typically in-\ntractable in domains with large or continuous state spaces. We present a tractable\nmethod for approximating the Bayes-adaptive solution by combining simulation-\nbased search with a novel value function approximation technique that generalises\nappropriately over belief space. 
Our method outperforms prior approaches in both\ndiscrete bandit tasks and simple continuous navigation and control tasks.\n\n1 Introduction\nA fundamental problem in sequential decision making is controlling an agent when the environmental\ndynamics are only partially known. In such circumstances, probabilistic models of the environment\nare used to capture the uncertainty of current knowledge given past data; they thus imply how\nexploring the environment can be expected to lead to new, exploitable, information.\nIn the context of Bayesian model-based reinforcement learning (RL), Bayes-adaptive (BA) planning\n[8] solves the resulting exploration-exploitation trade-off by directly optimizing future expected\ndiscounted return in the joint space of states and beliefs about the environment (or, equivalently,\ninteraction histories). Performing such optimization even approximately is computationally highly\nchallenging; however, recent work has demonstrated that online planning by sample-based forward-search\ncan be effective [22, 1, 12]. These algorithms estimate the value of future interactions by\nsimulating trajectories while growing a search tree, taking model uncertainty into account. However,\none major limitation of Monte Carlo search algorithms in general is that, na\u00efvely applied, they fail to\ngeneralize values between related states. In the BA case, a separate value is stored for each distinct\npath of possible interactions. Thus, the algorithms fail not only to generalize values between related\npaths, but also to reflect the fact that different histories can correspond to the same belief about\nthe environment. As a result, the number of required simulations grows exponentially with search\ndepth. 
Worse yet, except in very restricted scenarios, the lack of generalization renders MC search\nalgorithms effectively inapplicable to BAMDPs with continuous state or action spaces.\nIn this paper, we propose a class of efficient simulation-based algorithms for Bayes-adaptive model-based\nRL which use function approximation to estimate the value of interaction histories during\nsearch. This enables generalization between different beliefs, states, and actions during planning,\nand therefore also works for continuous state spaces. To our knowledge this is the first broadly\napplicable MC search algorithm for continuous BAMDPs.\nOur algorithm builds on the success of a recent tree-based algorithm for discrete BAMDPs (BAMCP,\n[12]) and exploits value function approximation for generalization across interaction histories, as\nhas been proposed for simulation-based search in MDPs [19]. As a crucial step towards this end we\ndevelop a suitable parametric form for the value function estimates that can generalize appropriately\nacross histories, using the importance sampling weights of posterior samples to compress histories\ninto a finite-dimensional feature vector. As in BAMCP we take advantage of root sampling [18, 12] to\navoid expensive belief updates at every step of simulation, making the algorithm practical for a broad\nrange of priors over environment dynamics. We also provide an interpretation of root sampling as an\nauxiliary variable sampling method. This leads to a new proof of its validity in general simulation-based\nsettings, including BAMDPs with continuous state and action spaces, and a large class of\nalgorithms that includes MC and TD updates.\nEmpirically, we show that our approach requires considerably fewer simulations to find good policies\nthan BAMCP in a (discrete) bandit task and two continuous control tasks with a Gaussian process\nprior over the dynamics [5, 6]. 
In the well-known pendulum swing-up task, our algorithm learns how\nto balance after just a few seconds of interaction. Below, we first briefly review the Bayesian formulation\nof optimal decision making under model uncertainty (section 2; please see [8] for additional\ndetails). We then explain our algorithm (section 3) and present empirical evaluations in section 4.\nWe conclude with a discussion, including related work (sections 5 and 6).\n2 Background\nA Markov Decision Process (MDP) is described as a tuple M = \u27e8S, A, P, R, \u03b3\u27e9 with S the\nset of states (which may be infinite), A the discrete set of actions, P : S \u00d7 A \u00d7 S \u2192 R the\nstate transition probability kernel, R : S \u00d7 A \u2192 R the reward function, and \u03b3 < 1 the discount\nfactor. The agent starts with a prior P (P) over the dynamics, and maintains a posterior distribution\nbt(P) = P (P |ht) \u221d P (ht|P)P (P), where ht denotes the history of states, actions, and rewards\nup to time t.\nThe uncertainty about the dynamics of the model can be transformed into certainty about the current\nstate inside an augmented state space S+ = H \u00d7 S, where H is the set of possible histories (the\ncurrent state also being the suffix of the current history). The dynamics and rewards associated with\nthis augmented state space are described by\n\n$$P^+(h, s, a, has', s') = \\int_{\\mathcal{P}} \\mathcal{P}(s, a, s')\\, P(\\mathcal{P} \\mid h)\\, d\\mathcal{P}, \\qquad R^+(h, s, a) = R(s, a). \\tag{1}$$\n\nTogether, the 5-tuple M+ = \u27e8S+, A, P+, R+, \u03b3\u27e9 forms the Bayes-Adaptive MDP (BAMDP) for the\nMDP problem M. Since the dynamics of the BAMDP are known, it can in principle be solved to\nobtain the optimal value function associated with each action:\n\n$$Q^*(h_t, s_t, a) = \\max_{\\tilde\\pi} \\mathbb{E}_{\\tilde\\pi}\\left[ \\sum_{t'=t}^{\\infty} \\gamma^{t'-t} r_{t'} \\,\\middle|\\, a_t = a \\right]; \\qquad \\tilde\\pi^*(h_t, s_t) = \\operatorname{argmax}_a Q^*(h_t, s_t, a), \\tag{2}$$\n\nwhere \u02dc\u03c0 : S+\u00d7A \u2192 [0, 1] is a policy over the augmented state space, from which the optimal action\nfor each belief-state \u02dc\u03c0\u2217(ht, st) can readily be derived. Optimal actions in the BAMDP are executed\ngreedily in the real MDP M, and constitute the best course of action (i.e., integrating exploration and\nexploitation) for a Bayesian agent with respect to its prior belief over P.\n3 Bayes-Adaptive simulation-based search\nOur simulation-based search algorithm for the Bayes-adaptive setup combines efficient MC search\nvia root-sampling with value function approximation. We first explain its underlying idea, assuming\na suitable function approximator exists, and provide a novel proof justifying the use of root sampling\nthat also applies in continuous state-action BAMDPs. Finally, we explain how to model Q-values as\na function of interaction histories.\n\n3.1 Algorithm\nAs in other forward-search planning algorithms for Bayesian model-based RL [22, 17, 1, 12], at\neach step t, which is associated with the current history ht (or belief) and state st, we plan online to\nfind \u02dc\u03c0\u2217(ht, st) by constructing an action-value function Q(h, s, a). Such methods use simulation to\nbuild a search tree of belief states, each of whose nodes corresponds to a single (future) history, and\nestimate optimal values for these nodes. However, existing algorithms only update the nodes that\nare directly traversed in each simulation. 
This is inefficient, as it fails to generalize across multiple\nhistories corresponding either to exactly the same, or similar, beliefs. Instead, each such history\nmust be traversed and updated separately.\nHere, we use a more general simulation-based search that relies on function approximation, rather\nthan a tree, to represent the values for possible simulated histories and states. This approach was\noriginally suggested in the context of planning in large MDPs [19]; we extend it to the case of\nBayes-Adaptive planning. The Q-value of a particular history, state, and action is represented\nas Q(h, s, a; w), where w is a vector of learnable parameters. Fixed-length simulations are run\nfrom the current belief-state ht, st, and the parameter w is updated online, during search, based on\nexperience accumulated along these trajectories, using an incremental RL control algorithm (e.g.,\nMonte-Carlo control, Q-learning). If the parametric form and features induce generalization between\nhistories, then each forward simulation can affect the values of histories that are not directly\nexperienced. This can considerably speed up planning, and enables continuous-state problems to\nbe tackled. Note that a search tree would be a special case of the function approximation approach\nwhen the representation of states and histories is tabular.\n\nAlgorithm 1: Bayes-Adaptive simulation-based search with root sampling\nprocedure Search( ht, st )\n  repeat\n    P \u223c P (P |ht)\n    Simulate(ht, st, P, 0)\n  until Timeout()\n  return argmaxa Q(ht, st, a; w)\nend procedure\nprocedure Simulate( h, s, P, t )\n  if t > T then return 0\n  a \u2190 \u02dc\u03c0\u03b5\u2212greedy(Q(h, s,\u00b7; w))\n  s' \u223c P(s, a,\u00b7), r \u2190 R(s, a)\n  R \u2190 r + \u03b3 Simulate(has', s', P, t+1)\n  w \u2190 w \u2212 \u03b1 (Q(h, s, a; w) \u2212 R) \u2207wQ(h, s, a; w)\n  return R\nend procedure\n\nIn the context of Bayes-Adaptive planning, simulation-based search works by simulating a future trajectory\nht+T = statrtst+1 . . . at+T\u22121rt+T\u22121st+T of T transitions (the planning horizon) starting\nfrom the current belief-state ht, st. Actions are selected by following a fixed policy \u02dc\u03c0,\nwhich is itself a function of the history, a \u223c \u02dc\u03c0(h,\u00b7). State transitions can be sampled\naccording to the BAMDP dynamics, st' \u223c P+(ht'\u22121, st'\u22121, at'\u22121, ht'\u22121at'\u22121\u00b7,\u00b7). However,\nthis can be computationally expensive since belief updates must be applied at every\nstep of the simulation. As an alternative, we use root sampling [18], which only samples the\ndynamics P k \u223c P (P |ht) once at the root for each simulation k and then samples transitions\naccording to st' \u223c P k(st'\u22121, at'\u22121,\u00b7); we provide justification for this approach in Section 3.2 (see footnote 1).\nAfter the trajectory ht+T has been simulated on a step, the Q-value is modified by updating w based\non the data in ht+T. Any incremental algorithm could be used, including SARSA, Q-learning, or\ngradient TD [20]; we use a simple scheme to minimize an appropriately weighted squared loss\nE[(Q(ht', st', at'; w) \u2212 Rt')^2]:\n\n$$\\Delta w = -\\alpha\\, \\big(Q(h_{t'}, s_{t'}, a_{t'}; w) - R_{t'}\\big)\\, \\nabla_w Q(h_{t'}, s_{t'}, a_{t'}; w), \\tag{3}$$\n\nwhere \u03b1 is the learning rate and Rt' denotes the discounted return obtained from history ht' (see footnote 2). Algorithm 1\nprovides pseudo-code for this scheme; here we suggest using as the fixed policy for a\nsimulation the \u03b5\u2212greedy \u02dc\u03c0\u03b5\u2212greedy based on some given Q value. 
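
As a rough illustration of Algorithm 1, the following Python sketch runs root-sampled Monte-Carlo search with a linear Q on a toy two-armed Bernoulli bandit. It is a sketch under stated assumptions, not the authors' implementation: the Beta(1, 1) prior, the count-based history (successes, failures), the feature map and all hyperparameter defaults are hypothetical choices made only for this example, and the state is folded into the history.

```python
import random

P0 = 0.2  # known success probability of arm 0 (value used in the bandit experiment)

def sample_posterior(h, rng):
    # h = (successes, failures) of the uncertain arm; the Beta(1, 1) prior is an assumption
    return rng.betavariate(1 + h[0], 1 + h[1])

def step(p1, h, a, rng):
    # simulate one pull under the sampled model p1; returns (reward, next history)
    if a == 1:
        r = 1.0 if rng.random() < p1 else 0.0
        return r, (h[0] + int(r), h[1] + 1 - int(r))
    return (1.0 if rng.random() < P0 else 0.0), h

def features(h, a):
    # hypothetical history-action features: bias and posterior mean, one block per arm
    mean = (1 + h[0]) / (2 + h[0] + h[1])
    block = [1.0, mean]
    return block + [0.0, 0.0] if a == 0 else [0.0, 0.0] + block

def search(h_root, n_actions=2, n_sims=500, horizon=5,
           gamma=0.95, alpha=0.05, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(features(h_root, 0))

    def q(h, a):
        return sum(wi * fi for wi, fi in zip(w, features(h, a)))

    def simulate(h, model, t):
        if t > horizon:
            return 0.0
        if rng.random() < eps:                      # epsilon-greedy on the current Q
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: q(h, b))
        r, h_next = step(model, h, a, rng)          # transition drawn from the sampled model
        ret = r + gamma * simulate(h_next, model, t + 1)
        err = q(h, a) - ret                         # Monte-Carlo error, as in eq. (3)
        for i, fi in enumerate(features(h, a)):
            w[i] -= alpha * err * fi
        return ret

    for _ in range(n_sims):
        model = sample_posterior(h_root, rng)       # root sampling: one draw per simulation
        simulate(h_root, model, 0)
    best = max(range(n_actions), key=lambda a: q(h_root, a))
    return best, w
```

Calling `search((3, 1))` returns the greedy arm at the root together with the learned weight vector. The posterior is sampled once per simulation and never updated inside a rollout, which is the point of root sampling.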
Other policies could be considered\n(e.g., the UCT policy for search trees), but are not the main focus of this paper.\n\n3.2 Analysis\nIn order to exploit general results on the convergence of classical RL algorithms for our simulation-based\nsearch, it is necessary to show that starting from the current history, root sampling produces\nthe appropriate distribution of rollouts. For the purpose of this section, a simulation-based search\nalgorithm includes Algorithm 1 (with Monte-Carlo backups) but also incremental variants, as discussed\nabove, or BAMCP.\nLet $D_t^{\\tilde\\pi}$ be the rollout distribution function of forward-simulations that explicitly update the belief\nat each step (i.e., using P+): $D_t^{\\tilde\\pi}(h_{t+T})$ is the probability density that history ht+T is generated\nwhen running that simulation from ht, st, with T the horizon of the simulation, and \u02dc\u03c0 an arbitrary\nhistory policy. Similarly define the quantity $\\tilde{D}_t^{\\tilde\\pi}(h_{t+T})$ as the probability density that history ht+T\nis generated when running forward-simulations with root sampling, as in Algorithm 1. The following\nlemma shows that these two rollout distributions are the same.\n\nFootnote 1: For comparison, a version of the algorithm without root sampling is listed in the supplementary material.\nFootnote 2: The loss is weighted according to the distribution of belief-states visited from the current state by executing \u02dc\u03c0.\n\nLemma 1. $D_t^{\\tilde\\pi}(h_{t+T}) = \\tilde{D}_t^{\\tilde\\pi}(h_{t+T})$ for all policies \u02dc\u03c0 : H \u00d7 A \u2192 [0, 1] and for all ht+T \u2208 H of\nlength t + T.\nProof. A similar result has been obtained for discrete state-action spaces as Lemma 1 in [12] using\nan induction step on the history length. Here we provide a more intuitive interpretation of root sampling\nas an auxiliary variable sampling scheme which also applies directly to continuous spaces. 
We show the equivalence by rewriting the distribution of rollouts. The usual way of sampling histories\nin simulation-based search, with belief updates, is justified by factoring the density as follows:\n\n$$p(h_{t+T} \\mid h_t, \\tilde\\pi) = p(a_t s_{t+1} a_{t+1} s_{t+2} \\ldots s_{t+T} \\mid h_t, \\tilde\\pi) \\tag{4}$$\n$$= p(a_t \\mid h_t, \\tilde\\pi)\\, p(s_{t+1} \\mid h_t, \\tilde\\pi, a_t)\\, p(a_{t+1} \\mid h_{t+1}, \\tilde\\pi) \\ldots p(s_{t+T} \\mid h_{t+T-1}, a_{t+T-1}, \\tilde\\pi) \\tag{5}$$\n$$= \\prod_{t \\le t' < t+T} \\tilde\\pi(h_{t'}, a_{t'}) \\prod_{t < t' \\le t+T} p(s_{t'} \\mid h_{t'-1}, \\tilde\\pi, a_{t'-1}) \\tag{6}$$\n$$= \\prod_{t \\le t' < t+T} \\tilde\\pi(h_{t'}, a_{t'}) \\prod_{t < t' \\le t+T} \\int_{\\mathcal{P}} P(\\mathcal{P} \\mid h_{t'-1})\\, \\mathcal{P}(s_{t'-1}, a_{t'-1}, s_{t'})\\, d\\mathcal{P}, \\tag{7}$$\n\nwhich makes clear how each simulation step involves a belief update in order to compute (or sample)\nthe integrals. Instead, one may write the history density as the marginalization of the joint over\nhistory and the dynamics P, and then notice that a history is generated in a Markovian way if\nconditioned on the dynamics:\n\n$$p(h_{t+T} \\mid h_t, \\tilde\\pi) = \\int_{\\mathcal{P}} p(h_{t+T} \\mid \\mathcal{P}, h_t, \\tilde\\pi)\\, p(\\mathcal{P} \\mid h_t, \\tilde\\pi)\\, d\\mathcal{P} = \\int_{\\mathcal{P}} p(h_{t+T} \\mid \\mathcal{P}, \\tilde\\pi)\\, p(\\mathcal{P} \\mid h_t)\\, d\\mathcal{P} \\tag{8}$$\n$$= \\int_{\\mathcal{P}} \\prod_{t \\le t' < t+T} \\tilde\\pi(h_{t'}, a_{t'}) \\prod_{t < t' \\le t+T} \\mathcal{P}(s_{t'-1}, a_{t'-1}, s_{t'})\\, p(\\mathcal{P} \\mid h_t)\\, d\\mathcal{P}, \\tag{9}$$\n\nwhere eq. (9) makes use of the Markov assumption in the MDP. This makes clear the validity of\nsampling only from p(P |ht), as in root sampling. From these derivations, it is immediately clear\nthat $D_t^{\\tilde\\pi}(h_{t+T}) = \\tilde{D}_t^{\\tilde\\pi}(h_{t+T})$.\nThe result in Lemma 1 does not depend on the way we update the value Q, or on its representation,\nsince the policy is fixed for a given simulation (see footnote 3). Furthermore, the result guarantees that simulation-based\nsearches will be identical in distribution with and without root sampling. Thus, we have:\nCorollary 1. Define a Bayes-adaptive simulation-based planning algorithm as a procedure that\nrepeatedly samples future trajectories $h_{t+T} \\sim D_t^{\\tilde\\pi}$ from the current history ht (simulation phase),\nand updates the Q value after each simulation based on the experience ht+T (special cases are\nAlgorithm 1 and BAMCP). Then such a simulation-based algorithm has the same distribution of\nparameter updates with or without root sampling. This also implies that the two variants share the\nsame fixed-points, since the updates match in distribution.\n\nFor example, for a discrete environment we can choose a tabular representation of the value function\nin history space. Applying the MC updates in eq. (3) results in a MC control algorithm applied to the\nsub-BAMDP from the root state. This is exactly the (BA version of the) MC tree search algorithm\n[12]. The same principle can also be applied to MC control with function approximation with\nconvergence results under appropriate conditions [2]. Finally, more general updates such as gradient\nQ-learning could be applied with corresponding convergence guarantees [14].\n\n3.3 History Features and Parametric Form for the Q-value\nThe quality of a history policy obtained using simulation-based search with a parametric representation\nQ(h, s, a; w) crucially depends on the features associated with the arguments of Q, i.e., the\nhistory, state and action. These features should arrange for histories that lead to the same, or similar,\nbeliefs to have the same, or similar, representations, to enable appropriate generalization. 
This is challenging since beliefs can be infinite-dimensional objects with non-compact sufficient statistics\nthat are therefore hard to express or manipulate. Learning good representations from histories is also\ntough, for instance because of hidden symmetries (e.g., the irrelevance of the order of the experience\ntuples that lead to a particular belief).\n\nFootnote 3: Note that, in Algorithm 1, Q is only updated after the simulation is complete.\n\nWe propose a parametric representation of the belief at a particular planning step based on sampling.\nThat is, we draw a set of M independent MDP samples or particles U = {P 1,P 2, . . . ,P M} from\nthe current belief bt = P (P |ht), and associate each with a weight $z^U_m(h)$, such that the vector\nzU (h) is a finite-dimensional approximate representation of the belief based on the set U. We will\nalso refer to zU as a function zU : H \u2192 RM that maps histories to a feature vector.\nThere are various ways one could design the zU function. It is computationally convenient to compute\nzU (h) recursively as importance weights, just as in a sequential importance sampling particle\nfilter [11]; this only assumes we have access to the likelihood of the observations (i.e., state\ntransitions). In other words, the weights are initialized as $z^U_m(h_t) = \\frac{1}{M}\\ \\forall m$ and are then updated\nrecursively using the likelihood of the dynamics model for that particle of observations as\n\n$$z^U_m(has') \\propto z^U_m(h)\\, P(s' \\mid a, s, \\mathcal{P}^m) = z^U_m(h)\\, \\mathcal{P}^m(s, a, s').$$\n\nOne advantage of this definition is that it enforces a correspondence between the history and belief\nrepresentations in the finite-dimensional space, in the sense that zU (h') = zU (h) if belief(h) =\nbelief(h'). That is, we can work in history space during planning, alleviating the need for complete\nbelief updates, but via a finite and well-behaved representation of the actual belief, since different\nhistories corresponding to the same belief are mapped to the same representation.\nThis feature vector can be combined with any function approximator. In our experiments, we combine\nit with features of the current state and action, \u03c6(s, a), in a simple bilinear form:\n\n$$Q(h, s, a; W) = z^U(h)^T\\, W\\, \\phi(s, a), \\tag{10}$$\n\nwhere W is the matrix of learnable parameters adjusted during the search (eq. 3). Here \u03c6(s, a)\nis a domain-dependent state-action feature vector as is standard in fully observable settings with\nfunction approximation. Special cases include tabular representations or forms of tile coding. We\ndiscuss the relation of this parametric form to the true value function in the Supp. material.\nIn the next section, we investigate empirically in three varied domains the combination of this parametric\nform, simulation-based search and Monte-Carlo backups, collectively known as BAFA (for\nBayes Adaptive planning with Function Approximation).\n4 Experimental results\nThe discrete Bernoulli bandit domain (section 4.1) demonstrates\ndramatic efficiency gains due to generalization with\nconvergence to a near Bayes-optimal solution. The navigation\ntask (section 4.2) and the pendulum (section 4.3)\ndemonstrate the ability of BAFA to handle non-trivial planning\nhorizons for large BAMDPs with continuous states.\nWe provide comparisons to a state of the art BA tree-search\nalgorithm (BAMCP, [12]), choosing a suitable discretization\nof the state space for the continuous problems. 
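
Looking back at the history features and bilinear form of Section 3.3 (eq. 10), a minimal Python sketch of the weight recursion and the Q-value might read as follows. The function names, the explicit normalization (one way to realise the proportionality in the recursion) and the fallback when no particle explains a transition are assumptions made for illustration only.

```python
def init_weights(num_particles):
    # uniform initial weights 1/M, one per sampled model (particle)
    return [1.0 / num_particles] * num_particles

def update_weights(z, particles, s, a, s_next):
    # z_m(has') is proportional to z_m(h) * P^m(s, a, s');
    # each particle is a callable returning a transition likelihood
    unnorm = [zm * pm(s, a, s_next) for zm, pm in zip(z, particles)]
    total = sum(unnorm)
    if total == 0.0:
        # assumption: keep the old weights if no particle explains the transition
        return list(z)
    return [u / total for u in unnorm]

def q_value(z, W, phi):
    # bilinear form of eq. (10): z^T W phi, with W a list of rows (one per particle)
    return sum(z[m] * sum(W[m][j] * phi[j] for j in range(len(phi)))
               for m in range(len(z)))
```

Because each update multiplies by a likelihood, two histories that contain the same transitions in a different order yield the same weight vector, which matches the order-invariance of the underlying belief.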
For the pendulum\nwe also compare to two Bayesian, but not Bayes\nadaptive, approaches.\n\n4.1 Bernoulli Bandit\nBandits have simple dynamics, yet they are still challenging\nfor a generic Bayes-Adaptive planner. Importantly, ground\ntruth is sometimes available [10], so we can evaluate how\nfar the approximations are from Bayes-optimality.\nWe consider a 2-armed Bernoulli bandit problem. We oppose\nan uncertain arm with prior success probability p1 \u223c\nBeta(\u03b1, \u03b2) against an arm with known success probability\np0. We consider the scenario \u03b3 = 0.99, p0 = 0.2 for which\nthe optimal decision and the posterior mean decision frequently differ. Decision errors for different\nvalues of \u03b1, \u03b2 do not have the same consequence, so we weight each scenario according to the\ndifference between their associated Gittins indices. Define the weight as m\u03b1,\u03b2 = |g\u03b1,\u03b2 \u2212 p0| where\ng\u03b1,\u03b2 is the Gittins index for \u03b1, \u03b2; this is an upper-bound (up to a scaling factor) on the difference\nbetween the value of the arms. 
The weights are shown in Figure 1-a.\n\nFigure 1: (a) The weights m\u03b1,\u03b2 as a function of \u03b1 and \u03b2. (b) Averaged (weighted) decision errors for the\ndifferent methods (BAFA with M=2, M=5 and M=25, BAMCP tree-search, and Posterior Mean) as a function of the\nnumber of simulations.\n\nWe compute the weighted errors over 20 runs for a particular method as E\u03b1,\u03b2 = m\u03b1,\u03b2 \u00b7\nP (Wrong decision for (\u03b1, \u03b2)), and report the sum of these terms across the range 1 \u2264 \u03b1 \u2264 10\nand 1 \u2264 \u03b2 \u2264 19 in Figure 1-b as a function of the number of simulations.\nThough this is a discrete problem, these results show that the value function approximation approach,\neven with a limited number of particles (M) for the history features, learns considerably\nmore quickly than BAMCP. This is because BAFA generalizes between similar beliefs.\n\n4.2 Height map navigation\nWe next consider a 2-D navigation problem on an unknown continuous height map. The agent\u2019s\nstate is (x, y, z, \u03b8); it moves on a bounded region of the (x, y) \u2208 8 \u00d7 8m plane according to\n(known) noisy dynamics. The agent chooses between 5 different actions; the dynamics for (x, y)\nare (xt+1, yt+1) = (xt, yt) + l(cos(\u03b8a), sin(\u03b8a)) + \u03b5, where \u03b8a corresponds to the action from this\nset \u03b8a \u2208 \u03b8 + {\u2212\u03c0/3, \u2212\u03c0/6, 0, \u03c0/6, \u03c0/3}, \u03b5 is small isotropic Gaussian noise (\u03c3 = 0.05), and l = 1/3 m is\nthe step size. Within the bounded region, the reward function is the value of a latent height map\nz = f (x, y) which is only observed at a single point by the agent. The height map is a draw from\na Gaussian process (GP), f \u223c GP (0,K), using a multi-scale squared exponential kernel for the\ncovariance matrix and zero mean. 
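
The (x, y) dynamics described above can be sketched in a few lines of Python. This is only an illustration under assumptions the text does not state: positions are clipped to the 8 x 8 region, the heading after a move is taken to be the chosen direction, and the names `nav_step`, `OFFSETS`, `STEP`, `SIGMA` and `SIZE` are hypothetical. The reward (the latent height f(x, y)) is left out because it requires a GP draw.

```python
import math
import random

STEP = 1.0 / 3.0      # step size l, in metres
SIGMA = 0.05          # standard deviation of the isotropic Gaussian noise
SIZE = 8.0            # side length of the bounded (x, y) region, in metres
OFFSETS = (-math.pi / 3, -math.pi / 6, 0.0, math.pi / 6, math.pi / 3)

def nav_step(x, y, theta, action, rng):
    # direction selected by the action, relative to the current heading theta
    theta_a = theta + OFFSETS[action]
    nx = x + STEP * math.cos(theta_a) + rng.gauss(0.0, SIGMA)
    ny = y + STEP * math.sin(theta_a) + rng.gauss(0.0, SIGMA)
    # assumption: the agent is kept inside the region by clipping
    nx = min(max(nx, 0.0), SIZE)
    ny = min(max(ny, 0.0), SIZE)
    # assumption: the new heading is the direction just travelled
    return nx, ny, theta_a
```

For example, `nav_step(4.0, 4.0, 0.0, 2, random.Random(0))` moves the agent roughly one third of a metre along the x-axis, up to noise.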
In order to test long-horizon planning, we downplay situations\nwhere the agent can simply follow the expected gradient locally to reach high reward regions by\nstarting the agent on a small local maximum. To achieve this we simply condition the GP draw on a\nfew pseudo-observations with small negative z around the agent and a small positive z at the starting\nposition, which creates a small bump (on average). The domain is illustrated in Figure 2-a with an\nexample map.\nWe compare BAMCP against BAFA on this domain, planning over 75 steps with a discount of 0.98.\nSince BAMCP works with discrete states, we uniformly discretize the height observations. For the\nstate-features in BAFA, we use a regular tile coding of the space; an RBF network leads to similar\nresults. We use a common set of 100 ground truth maps drawn from the prior for each\nalgorithm/setting, and we average the discounted return over 200 runs (2 runs/map) and report that result\nin Figure 2-b as a function of the planning horizon (T ). This result illustrates the ability of BAFA\nto cope with non-trivial planning horizons in belief space. Despite the discretization, BAMCP is\nvery efficient with short planning horizons, but has trouble optimizing the history policy with long\nhorizons because of the huge tree induced by the discretization of the observations.\n\nFigure 2: (a) Example map with the height color-coded from white (negative reward z) to black\n(positive reward z). The black dots denote the location of the initial pseudo-observations used to obtain the\nground truth map. The white squares show the past trajectory of the agent, starting at the cross and ending\nat the current position in green. The green trajectory is one particular forward simulation of BAFA from that\nposition. 
(b) Averaged discounted return (higher is better) in the navigation domain for discretized BAMCP and\nBAFA as a function of the number of simulations (K = 2000, 5000, 15000), and as a function of the planning horizon (x-axis).\n\n4.3 Under-actuated Pendulum Swing-up\nFinally, we consider the classic RL problem in which an agent must swing a pendulum from hanging\nvertically down to balancing vertically up, but given only limited torque. This requires the agent to\nbuild up momentum by swinging, before being able to balance. Note that although a wide variety\nof methods can successfully learn this task given enough experience, it is a challenging domain for\nBayes-adaptive algorithms, which have duly not been tried.\nWe use conventional parameter settings for the pendulum [5]: a mass of 1kg, a length of 1m, a\nmaximum torque of 5Nm, and a coefficient of friction of 0.05 kg m2 / s. The state of the pendulum\nis s = (\u03b8, \u02d9\u03b8). Each time-step corresponds to 0.05s, \u03b3 = 0.98, and the reward function is R(s) =\ncos(\u03b8). In the initial state, the pendulum is pointing down with no velocity, s0 = (\u03c0, 0). Three\nactions are available to the agent, to apply a torque of either {\u22125, 0, 5}Nm. The agent does not\ninitially know the dynamics of the pendulum. As in [5], we assume it employs independent Gaussian\nprocesses to capture the state change in each dimension for a given action. That is,\n$s^i_{t+1} - s^i_t \\sim GP(m^i_a, K^i_a)$ for each state dimension i and each action a (where $K^i_a$ are Squared Exponential\nkernels). 
Since there are 2 dimensions and 3 actions, we maintain 6 Gaussian processes, and plan\nin the joint space of (\u03b8, \u02d9\u03b8) together with the possible future GP posteriors to decide which action to\ntake at any given step.\nWe compare four approaches on this problem to understand the contributions of both generalization\nand Bayes-Adaptive planning to the performance of the agent. BAFA includes both; we also consider\ntwo non-Bayes-adaptive variants using the same simulation-based approach with value generalization.\nIn a Thompson Sampling variant (THOMP), we only consider a single posterior sample of the\ndynamics at each step and greedily solve using simulation-based search. In an exploit-only variant\n(FA), we run a simulation-based search that optimizes a state-only policy over the uncertainty in the\ndynamics; this is achieved by running BAFA with no history feature (see footnote 4). For BAFA, FA, and THOMP,\nwe use the same RBF network for the state-action features, consisting of 900 nodes. In addition,\nwe consider the BAMCP planner with a uniform discretization of the \u03b8, \u02d9\u03b8 space that worked\nbest in a coarse initial search; this method performs Bayes-adaptive planning but with no value\ngeneralization.\n\nFigure 3: Histogram of delay until the agent reaches its first balance state (|\u03b8| < \u03c0/4 for \u2265 3s) for different\nmethods in the pendulum domain. (a) A standard version of the pendulum problem with a cosine cost function.\n(b) A more difficult version of the problem with uncertain cost for balancing (see text). There is a 20s time limit,\nso all runs which do not achieve balancing within that time window are reported in the red bar. The histogram\nis computed with 100 runs with (a) K = 10000, or (b) K = 15000, simulations for each algorithm, horizon\nT = 50 and (for BAFA) M = 50 particles. 
The black dashed line represents the median of the distribution. Each panel (BAFA, BAMCP, FA, THOMP) plots the fraction of runs against time (s), with a final bin for runs longer than 20s.\n\nWe allow each algorithm a maximum of 20s of interaction with the pendulum, and consider as up-state\nany configuration of the pendulum for which |\u03b8| \u2264 \u03c0/4, and we consider the pendulum balanced\nif it stays in an up-state for more than 3s. We report in Figure 3-a the time it takes for each method to\nreach a balanced state for the first time. We observe that Bayes-adaptive planning (BAFA or BAMCP)\noutperforms more heuristic exploration methods, with most runs balancing before 8.5s. In the Suppl.\nmaterial, Figure S1 shows traces of example runs. With the same parametrization of the pendulum,\nDeisenroth et al. reported balancing the pole after between 15 and 60 seconds of interaction when\nassuming access to a restart distribution [5]. More recently, Moldovan et al. reported balancing after\n12-18s of interaction using a method tailored for locally linear dynamics [15].\n\nFootnote 4: The approximate value function for FA and THOMP thus takes the form Q(s, a) = wT \u03c6(s, a).\n\nHowever, the pendulum problem also illustrates that BA planning for this particular task is not\nhugely advantageous compared to more myopic approaches to exploration. We speculate that this\nis due to a lack of structure in the problem and test this with a more challenging, albeit artificial,\nversion of the pendulum problem that requires non-myopic planning over longer horizons. In this\nmodified version, balancing the pendulum (i.e., being in the region |\u03b8| < \u03c0/4) is either rewarding\n(R(s) = 1) with probability 0.5, or costly (R(s) = \u22121) with probability 0.5; all other states have an\nassociated reward of 0. 
This can be modeled formally by introducing another binary latent variable\nin the model. These latent dynamics are observed with certainty if the pendulum reaches any state\nwhere |\u03b8| \u2265 3\u03c0/4. The rest of the problem is the same. To correctly approximate the Bayes-optimal\nsolution in this setting, the planning algorithm must optimize the belief-state policy after it simulates\nobserving whether balancing is rewarding or not. We run this version of the problem with the same\nalgorithms as above and report the results in Figure 3-b. This hard planning problem highlights more\nclearly the benefits of Bayes-adaptive planning and value generalization. Our approach manages to\nbalance the pendulum more than 80% of the time, compared to about 35% for BAMCP, while THOMP\nand FA fail to balance for almost all runs. In the Suppl. material, Figure S2 illustrates the influence\nof the number of particles M on the performance of BAFA.\n5 Related Work\nSimulation-based search with value function approximation has been investigated in large and also\ncontinuous MDPs, in combination with TD-learning [19] or Monte-Carlo control [3]. However, this\nhas not been in a Bayes-adaptive setting. By contrast, existing online Bayes-Adaptive algorithms\n[22, 17, 1, 12, 9] rely on a tree structure to build a map from histories to value. This cannot benefit\nfrom generalization in a straightforward manner, leading to the inefficiencies demonstrated above\nand hindering their application to the continuous case. Continuous Bayes-Adaptive (PO)MDPs have\nbeen considered using an online Monte-Carlo algorithm [4]; however, this tree-based planning\nalgorithm expands nodes uniformly, and does not admit generalization between beliefs. 
This severely\nlimits the possible depth of tree search ([4] use a depth of 3).\nIn the POMDP literature, a key idea to represent beliefs is to sample a finite set of (possibly\napproximate) belief points [21, 16] from the set of possible beliefs in order to obtain a small number of\n(belief-)states for which to back up values offline or via forward search [13]. In contrast, our\nsampling approach to belief representation does not restrict the number of (approximate) belief points\nsince our belief features (z(h)) can take an infinite number of values, but it instead restricts their\ndimension, thus avoiding infinite-dimensional belief spaces. Wang et al. [23] also use importance\nsampling to compute the weights of a finite set of particles. However, they use these particles to\ndiscretize the model space and thus create an approximate, discrete POMDP. They solve this\noffline with no (further) generalization between beliefs, and thus no opportunity to re-adjust the belief\nrepresentation based on past experience. A function approximation scheme in the context of BA\nplanning has been considered by Duff [7], in an offline actor-critic paradigm. However, this was in\na discrete setting where counts could be used as features for the belief.\n6 Discussion\nWe have introduced a tractable approach to Bayes-adaptive planning in large or continuous state\nspaces. Our method is quite general, subsuming Monte Carlo tree search methods, while allowing\nfor arbitrary generalizations over interaction histories using value function approximation. Each\nsimulation is no longer an isolated path in an exponentially growing tree, but instead value backups\ncan impact many non-visited beliefs and states. We proposed a particular parametric form for the\naction-value function based on a Monte-Carlo approximation of the belief. 
To reduce the\ncomputational complexity of each simulation, we adopt a root sampling method which avoids expensive\nbelief updates during a simulation and hence imposes very few restrictions on the possible form of the\nprior over environment dynamics.\nOur experiments demonstrated that the BA solution can be effectively approximated, and that the\nresulting generalization can lead to substantial gains in efficiency in discrete tasks with large trees.\nWe also showed that our approach can be used to solve continuous BA problems with non-trivial\nplanning horizons without discretization, something which had not previously been possible. Using\na widely used GP framework to model continuous system dynamics (for the case of a swing-up\npendulum task), we achieved state-of-the-art performance.\nOur general framework can be applied with more powerful methods for learning the parameters of\nthe value function approximation, and it can also be adapted to be used with continuous actions. We\nexpect that further gains will be possible, e.g. from the use of bootstrapping in the weight updates,\nalternative rollout policies, and reusing values and policies between (real) steps.\n\nReferences\n[1] J. Asmuth and M. Littman. Approaching Bayes-optimality using Monte-Carlo tree search. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 19\u201326, 2011.\n\n[2] Dimitri P Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310\u2013335, 2011.\n\n[3] SRK Branavan, D. Silver, and R. Barzilay. Learning to win by reading manuals in a Monte-Carlo framework. Journal of Artificial Intelligence Research, 43:661\u2013704, 2012.\n\n[4] P. Dallaire, C. Besse, S. Ross, and B. Chaib-draa. Bayesian reinforcement learning in continuous POMDPs with Gaussian processes. In Intelligent Robots and Systems, 2009. IROS 2009. 
IEEE/RSJ International Conference on, pages 2604\u20132609. IEEE, 2009.\n\n[5] Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters. Gaussian process dynamic programming. Neurocomputing, 72(7):1508\u20131524, 2009.\n\n[6] MP Deisenroth and CE Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, pages 465\u2013473. International Machine Learning Society, 2011.\n\n[7] M. Duff. Design for an optimal probe. In Proceedings of the 20th International Conference on Machine Learning, pages 131\u2013138, 2003.\n\n[8] M.O.G. Duff. Optimal Learning: Computational Procedures For Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst, 2002.\n\n[9] Raphael Fonteneau, Lucian Busoniu, and R\u00e9mi Munos. Optimistic planning for belief-augmented Markov decision processes. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2013), 2013.\n\n[10] J.C. Gittins, R. Weber, and K.D. Glazebrook. Multi-armed bandit allocation indices. Wiley Online Library, 1989.\n\n[11] Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107\u2013113, 1993.\n\n[12] A. Guez, D. Silver, and P. Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems (NIPS), pages 1034\u20131042, 2012.\n\n[13] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, pages 65\u201372, 2008.\n\n[14] H.R. Maei, C. Szepesv\u00e1ri, S. Bhatnagar, and R.S. Sutton. Toward off-policy learning control with function approximation. Proc. 
ICML 2010, pages 719\u2013726, 2010.\n\n[15] Teodor Mihai Moldovan, Michael I Jordan, and Pieter Abbeel. Dirichlet Process reinforcement learning. In Reinforcement Learning and Decision Making Meeting, 2013.\n\n[16] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, volume 18, pages 1025\u20131032, 2003.\n\n[17] S. Ross and J. Pineau. Model-based Bayesian reinforcement learning in large structured domains. In Proc. 24th Conference in Uncertainty in Artificial Intelligence (UAI08), pages 476\u2013483, 2008.\n\n[18] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (NIPS), pages 2164\u20132172, 2010.\n\n[19] David Silver, Richard S Sutton, and Martin M\u00fcller. Temporal-difference search in computer Go. Machine Learning, 87(2):183\u2013219, 2012.\n\n[20] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesv\u00e1ri, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, volume 382, page 125, 2009.\n\n[21] Sebastian Thrun. Monte Carlo POMDPs. In NIPS, volume 12, pages 1064\u20131070, 1999.\n\n[22] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In Proceedings of the 22nd International Conference on Machine learning, pages 956\u2013963, 2005.\n\n[23] Y. Wang, K.S. Won, D. Hsu, and W.S. Lee. Monte Carlo Bayesian reinforcement learning. 
In Proceedings of the 29th International Conference on Machine Learning, 2012.\n", "award": [], "sourceid": 289, "authors": [{"given_name": "Arthur", "family_name": "Guez", "institution": "Google DeepMind"}, {"given_name": "Nicolas", "family_name": "Heess", "institution": "Gatsby Unit"}, {"given_name": "David", "family_name": "Silver", "institution": "UCL"}, {"given_name": "Peter", "family_name": "Dayan", "institution": "Gatsby Unit, UCL"}]}