{"title": "Reinforcement Learning with Multiple Experts: A Bayesian Model Combination Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 9528, "page_last": 9538, "abstract": "Potential based reward shaping is a powerful technique for accelerating convergence of reinforcement learning algorithms. Typically, such information includes an estimate of the optimal value function and is often provided by a human expert or other sources of domain knowledge. However, this information is often biased or inaccurate and can mislead many reinforcement learning algorithms. In this paper, we apply Bayesian Model Combination with multiple experts in a way that learns to trust a good combination of experts as training progresses. This approach is both computationally efficient and general, and is shown numerically to improve convergence across discrete and continuous domains and different reinforcement learning algorithms.", "full_text": "Reinforcement Learning with Multiple Experts:\n\nA Bayesian Model Combination Approach\n\nMichael Gimelfarb\n\nScott Sanner\n\nMechanical and Industrial Engineering\n\nMechanical and Industrial Engineering\n\nUniversity of Toronto\n\nmike.gimelfarb@mail.utoronto.ca\n\nUniversity of Toronto\n\nssanner@mie.utoronto.ca\n\nChi-Guhn Lee\n\nMechanical and Industrial Engineering\n\nUniversity of Toronto\n\ncglee@mie.utoronto.ca\n\nAbstract\n\nPotential based reward shaping is a powerful technique for accelerating convergence\nof reinforcement learning algorithms. Typically, such information includes an\nestimate of the optimal value function and is often provided by a human expert or\nother sources of domain knowledge. However, this information is often biased or\ninaccurate and can mislead many reinforcement learning algorithms. In this paper,\nwe apply Bayesian Model Combination with multiple experts in a way that learns\nto trust a good combination of experts as training progresses. 
This approach is both computationally efficient and general, and is shown numerically to improve convergence across discrete and continuous domains and different reinforcement learning algorithms.

1 Introduction

Potential-based reward shaping incorporates prior domain knowledge in the form of additional rewards provided during training to speed up convergence of reinforcement learning algorithms, without changing the optimal policies (Ng et al. [1999]). While much of the existing theory and applications assume that advice comes from a single source throughout training (Grześ [2017], Harutyunyan et al. [2015], Tenorio-Gonzalez et al. [2010]), there is much less work on learning from multiple sources of advice as training progresses. One motivation is that expert demonstrations or advice can often be biased or incomplete, so being able to distinguish good advice from bad is critical to guarantee robustness of convergence.

In this paper, the decision maker is presented with multiple sources of expert advice in the form of potential-based reward functions, some of which can be misleading and should not be trusted. The decision maker does not know a priori which expert(s) to trust, but rather learns this from experience in a Bayesian framework. More specifically, the decision maker starts with a prior distribution over the probability simplex, and updates the belief to a posterior distribution as new training rewards are observed. Because our proposed algorithm follows the potential-based reward shaping framework, it preserves the theoretical guarantees for policy invariance established in Ng et al. [1999].

This paper proceeds as follows. Section 2 introduces the key definitions used throughout the paper. In Section 3, we apply Bayesian model combination, which allows the decision maker to asymptotically learn the best combination of experts, all with reduced variance as compared to similar approaches.
In\nSection 3.1, we show that the total return can be written as a linear combination of individual return\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fcontributions from each expert, weighted by the expected posterior belief that the expert is correct. In\nSection 3.2, we show that the exact posterior updates are analytically intractable. Instead, we apply\nmoment matching to project the true posterior distribution onto the multivariate Dirichlet distribution,\nand show how accurate approximation and inference can be done in linear time in the number of\nexperts. In Section 3.3 we then show how our approach can be incorporated into any reinforcement\nlearning algorithm, preserving the asymptotically optimal policy without incurring additional runtime\ncomplexity. Finally, in Section 4, we demonstrate the effectiveness of this approach across various\nreinforcement learning methods and problem domains.\n\nRelated Work\n\nLearning from expert knowledge is not new. In transfer learning, for example, the decision maker\nuses prior knowledge obtained from training on task(s) to improve performance on future tasks\n(Konidaris and Barto [2006]). In inverse reinforcement learning, the agent recovers an unknown\nreward function that can then be used for shaping (Suay et al. [2016]). In many cases, a human\nexpert can directly provide the learning agent with training examples or preferences before or during\ntraining to guide exploration (Brys et al. [2015], Christiano et al. [2017]). All of these approaches\ntry to perturb the intermediate value functions to encourage more guided exploration of the state\nspace. A somewhat different approach, called policy shaping, instead reshapes the learned policies\n(Grif\ufb01th et al. [2013]). 
Grzes and Kudenko [2009] and Grze\u00b4s and Kudenko [2010] recently introduced\non-line methods to learn a reward shaping function, but only for model-based learning using R-Max\nor model-free learning with multi-grid discretization. Our approach can work in on-line settings, with\ngeneral algorithms under minimal assumptions, and with value function approximation.\nThe idea of combining multiple models/experts or learning algorithms to improve performance is\ncentral to ensemble learning (Dietterich [2000]), and has been applied in a variety of ways in the RL\nliterature. For example, Maclin et al. [2005] used kernel regression, Philipp and Rettinger [2017]\nused contextual bandits, and Downey and Sanner [2010] applied Bayesian model averaging. Asmuth\net al. [2009] applied a Bayesian method to sample multiple models for action selection. The only\nwork we are aware of that incorporated reward shaping advice in a Bayesian learning framework\nis the recent paper by Marom and Rosman [2018]. However, that paper exploited the structure\nof the transition model (belief clusters) in order to do ef\ufb01cient Bayesian inference, whereas our\npaper focuses on posterior approximation using variational ideas, and their analysis and results are\nconsiderably different from ours. More generally, Bayesian approaches have many advantages over\nfrequentist approaches, including prior speci\ufb01cation, clear and intuitive interpretation, ability to test\nhypotheses (O\u2019Hagan [2004]), and theoretically optimal exploration (Thompson [1933]).\n\n2 De\ufb01nitions\n\n2.1 Markov Decision Process\n\nThe decision-making framework used throughout this paper focuses on the Markov decision process\n(MDP) (Bertsekas and Shreve [2004]). 
Formally, an MDP is defined as a tuple (S, A, T, R, γ), where: S is a general set of states, A is a finite set of actions, T : S × A × S → R+ is a stationary Markovian transition function, where T(s, a, s') = P(s'|s, a), the probability of transitioning to state s' after taking action a in state s, R : S × A × S → R is a bounded reward function, where R(s, a, s') is the immediate reward received upon transitioning to state s' after taking action a in state s, and γ ∈ [0, 1] is a discount factor.

We define a random policy µ as a probability distribution µ(s, a) = P(a|s) over actions A given current state s. Given an MDP (S, A, T, R, γ), a policy µ, and initial state-action pair (s, a), we define the infinite-horizon expected discounted rewards as

Q^\mu(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s, a_0 = a \right],     (1)

where a_t ∼ P(·|s_t) = µ(s_t, ·) and s_{t+1} ∼ T(s_t, a_t, ·). The objective of the agent is to find an optimal policy µ* that maximizes (1). When the transition and reward functions are known, the existence of an optimal deterministic stationary policy is guaranteed, in which case value iteration or policy iteration can be used to find an optimal policy (Bertsekas and Shreve [2004]).

2

2.2 Reinforcement Learning

In the reinforcement learning (RL) setting, the transition probabilities and reward function are not explicitly known to the agent but rather learned from experience. In order to learn the optimal policies in this framework and to facilitate the development of the paper, we follow generalized policy iteration (GPI) (Sutton and Barto [2018]).
However, the Bayesian framework developed in this paper depends on neither the exploration policy used nor the value function representation, and can be applied with on-policy and off-policy learning, value function approximation, traces, deep RL (Li [2017]), and other approaches.

More specifically, GPI performs two steps in alternation: a policy evaluation step that estimates the value Q^{µ_i} of the current policy µ_i, and a policy improvement step that uses Q^{µ_i} to construct a new policy µ_{i+1}. In practice, these two steps are often interleaved. A simple yet effective way to implement GPI is to follow the ε-greedy policy, which encourages exploration by randomly and uniformly selecting an action in A at time t with probability ε_t, and otherwise selects the best action based on Q^{µ_i}(s, a); the parameter ε_t ∈ (0, 1), t ≥ 0, controls the trade-off between exploration and exploitation.

In order to estimate the value of policy µ = µ_i, we follow the temporal difference (TD) learning approach. Specifically, given a new estimate of the expected future returns R_t at time t after taking action a_t in state s_t according to some policy µ, Q(s_t, a_t) (dropping the dependence on µ) is updated as follows

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ R_t - Q_t(s_t, a_t) \right],     (2)

where α > 0 is a problem-dependent learning rate parameter.

Two popular approaches for estimating R_t are Q-learning and SARSA, given respectively as

R_t = r_t + \gamma \max_{a' \in A} Q_t(s_{t+1}, a'), \qquad R_t = r_t + \gamma Q_t(s_{t+1}, a_{t+1}),     (3)

where r_t = R(s_t, a_t, s_{t+1}) is the immediate reward, s_{t+1} ∼ T(s_t, a_t, ·) and a_{t+1} ∼ µ(s_{t+1}, ·). While both approaches compute R_t by bootstrapping from current Q-values, the key distinction between them is that SARSA is an on-policy algorithm whereas Q-learning is off-policy.
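The updates (2) and (3) are one-liners in practice. The following is a minimal tabular sketch of both targets together with the ε-greedy rule from Section 2.2; the array-indexed state/action representation is an illustrative assumption, not taken from the paper:

```python
import numpy as np

def td_update(Q, s, a, r, s_next, a_next=None, alpha=0.1, gamma=0.99):
    """One temporal-difference update of Q(s, a), as in (2).

    If a_next is given, the SARSA (on-policy) target from (3) is used;
    otherwise the off-policy Q-learning max target.
    """
    if a_next is None:
        target = r + gamma * np.max(Q[s_next])        # Q-learning target
    else:
        target = r + gamma * Q[s_next, a_next]        # SARSA target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon, rng):
    """ε-greedy policy: explore uniformly with probability ε, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```

In a full agent, `epsilon_greedy` selects a_t, the environment returns (r_t, s_{t+1}), and `td_update` is applied at every step, with ε_t annealed over episodes as in Section 4.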
n-step TD and TD(λ) are more sophisticated examples of TD-learning algorithms (Sutton and Barto [2018]).

2.3 Potential-Based Reward Shaping

In many domains, particularly when rewards are sparse and the agent cannot learn quickly, it is necessary to incorporate prior knowledge in order for TD-learning to converge faster. The idea of reward shaping is to incorporate prior knowledge about the domain in the form of additional rewards during training to speed up convergence towards the optimal policy. Formally, given an MDP (S, A, T, R, γ) and a reward shaping function F : S × A × S → R, we solve the MDP (S, A, T, R', γ) with reward function R' given by

R'(s, a, s') = R(s, a, s') + F(s, a, s').     (4)

While this approach has been applied successfully to many problems, an improper choice of shaping function can change the optimal policy (Randløv and Alstrøm [1998]).

In order to address this problem, potential-based reward shaping was proposed, in which F is restricted to functions of the form

F(s, a, s') = \gamma \Phi(s') - \Phi(s),     (5)

where Φ : S → R is called the potential function. It has been shown that this is the only class of reward shaping functions that preserves policy optimality (Ng et al. [1999]). Reward shaping has also been shown to be equivalent to Q-value initialization (Wiewiora [2003]). More recently, policy invariance has been extended to non-stationary time-dependent potential functions of the form

F(s, a, t, s', t') = \gamma \Phi(s', t') - \Phi(s, t)     (6)

(Devlin and Kudenko [2012]), to action-dependent potentials (Wiewiora et al. [2003]), as well as to partially-observed (Eck et al.
[2013]) and multi-agent systems (Devlin and Kudenko [2011]).\n\n3\n\n\f3 Bayesian Reward Shaping\nThe decision maker is presented with advice from N \u2265 1 experts in the form of potential functions\n\u03a61, \u03a62, . . . \u03a6N . The advice could come from heuristics or guesses (Harutyunyan et al. [2015]), from\nsimilar solved tasks (Taylor and Stone [2009]), from demonstrations (Brys et al. [2015]), and in\ngeneral can be analytic or computational. One concrete example that our proposed setup can be\napplied to is transfer learning (Taylor and Stone [2009]). Here, models are \ufb01rst trained on a number\nof tasks to obtain corresponding value functions. By de\ufb01ning suitable inter-task mappings (Taylor\net al. [2007]), these value functions can be incorporated into a target task as reward shaping advice.\nUnfortunately in practice, the advice available to the learning agent is often contradictory or contains\nnumerical errors, in which case it could hurt convergence. In order to make optimal use of the expert\nadvice during the learning process, the agent should ideally learn which expert(s) to trust as more\ninformation becomes available, and act on this knowledge by applying the techniques in Section 2.2.\nTo do this, the agent assigns weights w to the experts and updates them on-line during training.\nThe two main approaches to incorporating multiple models in a Bayesian framework are Bayesian\nmodel averaging (BMA) and Bayesian model combination (BMC). Roughly speaking, taking experts\nas hypotheses, BMA converges asymptotically toward the optimal hypothesis, while BMC converges\ntoward the optimal ensemble. The model combination approach has two clear advantages over\nmodel averaging: (1) when two or more potential functions are optimal, it will converge to a linear\ncombination of them, and (2) it provides an estimator with reduced variance (Minka [2000]). 
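Before developing the method, a quick numerical check of the mechanism behind policy invariance for potential-based shaping (5) from Section 2.3: the discounted shaping rewards along any trajectory telescope, so every return is shifted by a quantity that depends only on the trajectory endpoints. A minimal sketch, where the trajectory and the potential Φ are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

states = rng.integers(0, 10, size=8)   # an arbitrary trajectory s_0, ..., s_7
phi = rng.normal(size=10)              # an arbitrary potential Φ(s) per state

# Discounted sum of shaping rewards F(s_t, s_{t+1}) = γΦ(s_{t+1}) − Φ(s_t).
shaped = sum(
    gamma**t * (gamma * phi[states[t + 1]] - phi[states[t]])
    for t in range(len(states) - 1)
)

# The sum telescopes: only the endpoints matter, so relative returns of
# policies are unchanged and the optimal policy is preserved.
T = len(states) - 1
assert np.isclose(shaped, gamma**T * phi[states[T]] - phi[states[0]])
```

This endpoint-only shift is exactly why the pooled potential introduced in Section 3 can change during training without biasing the asymptotic policy, provided care is taken as noted in the remarks after Algorithm 2.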
In this section, we show how BMC can be used to incorporate imperfect advice from multiple experts into reinforcement learning problems, all with the same space and time complexity as TD-learning.

3.1 Bayesian Model Combination

In the general setting of Bayesian model combination, we interpret Q-values for each state-action pair q_{s,a} as random variables, and maintain a set of past return observations D and a multivariate posterior probability distribution P(q|D) over Q-values. We also maintain a posterior probability distribution π : S^{N−1} → R+ over the (N−1)-dimensional probability simplex S^{N−1} = \{ w \in \mathbb{R}^N : \sum_{i=1}^N w_i = 1, \; w_i \geq 0 \}. Here, weight vectors w are interpreted as categorical distributions over experts; such a mechanism will allow us to learn the optimal distribution over experts, rather than a single expert. In the following subsections, we show how to maintain each of these distributions over time, but here we show how to use them for the general RL problem.

Given a state s = s_t and action a = a_t at time t, the return under model combination ρ_t(s, a) is

\rho_t(s, a) = \mathbb{E}[q_{s,a} \mid D] = \int_{\mathbb{R}} q \, P(q \mid D) \, dq
= \int_{\mathbb{R}} q \int_{S^{N-1}} P(q \mid D, w) \, P(w \mid D) \, dw \, dq
= \int_{\mathbb{R}} q \int_{S^{N-1}} \sum_{i=1}^N P(q \mid i) \, w_i \, \pi_t(w) \, dw \, dq     (7)

= \sum_{i=1}^N \int_{\mathbb{R}} q \, P(q \mid i) \int_{S^{N-1}} w_i \, \pi_t(w) \, dw \, dq
= \sum_{i=1}^N \int_{\mathbb{R}} q \, P(q \mid i) \, \mathbb{E}_{\pi_t}[w_i] \, dq     (8)

= \sum_{i=1}^N \mathbb{E}_{\pi_t}[w_i] \int_{\mathbb{R}} q \, P(q \mid i) \, dq
= \sum_{i=1}^N \mathbb{E}_{\pi_t}[w_i] \, \mathbb{E}[q_{s,a} \mid i],     (9)

where: the first equality in (7) follows from the law of total probability applied to P(q|D), whereas the second equality follows from conditioning on the expert i ∈ {1, 2, ..., N}, using the facts that q_{s,a} is independent of w given i and P(i|w) = w_i; the first equality in (8) follows from interchange of summation and integration, while the second from the definition of expectation over w_i; finally, (9) follows from the definition of expectation of q_{s,a} given i.

This result is intuitively and computationally pleasing, and shows that the total return can be written as a linear combination of individual return "contributions" from each expert model, weighted by the expected posterior belief that the expert is correct. We now show how each of these two expectations can be computed.

4

3.2 Posterior Approximation using Moment Matching

Starting with prior distribution π_t at time t over the simplex S^{N−1}, and given new data point d, we would like to perform a posterior update using Bayes' theorem

\pi_{t+1}(w) = P(w \mid D, d) \propto P(d \mid w) \, \pi_t(w) = \sum_{i=1}^N P(d \mid i) \, P(i \mid w) \, \pi_t(w) = \sum_{i=1}^N e_i \, w_i \, \pi_t(w),     (10)

where we denote evidence e_i = P(d|i), and C_{t+1} is the normalizing constant for π_{t+1} determined as

C_{t+1} = \int_{S^{N-1}} \sum_{i=1}^N e_i \, w_i \, \pi_t(w) \, dw = \sum_{i=1}^N e_i \int_{S^{N-1}} w_i \, \pi_t(w) \, dw = \sum_{i=1}^N e_i \, \mathbb{E}_{\pi_t}[w_i].     (11)

Unfortunately the exact posterior update is computationally intractable for general evidence e_i, and so an approximate posterior update is required.

Assumed density filtering, or moment matching, projects the true posterior distribution π_{t+1} onto an exponential subfamily of proposal distributions by minimizing the KL-divergence between π_{t+1} and the proposal distribution.
We note that an excellent exponential family proposal distribution for our posterior in (10) is the multivariate Dirichlet distribution with parameters α ∈ R^N_+, density function

f(w; \alpha) = \frac{\Gamma\left(\sum_{i=1}^N \alpha_i\right)}{\prod_{i=1}^N \Gamma(\alpha_i)} \prod_{i=1}^N w_i^{\alpha_i - 1}, \quad w \in S^{N-1},     (12)

and generalized moments

\mathbb{E}_f\left[ \prod_{i=1}^N w_i^{n_i} \right] = \frac{\Gamma\left(\sum_{i=1}^N \alpha_i\right)}{\Gamma\left(\sum_{i=1}^N (\alpha_i + n_i)\right)} \prod_{i=1}^N \frac{\Gamma(\alpha_i + n_i)}{\Gamma(\alpha_i)}, \quad n_i \geq 0.     (13)

For the exponential family of proposal distributions, exact moment matching requires the moments over the sufficient statistics. Since this is not available for the Dirichlet family in closed form, it necessitates an iterative approach that is not computationally feasible in on-line RL. Instead, we follow Hsu and Poupart [2016] and Omar [2016] by matching the moments (13), leading to an efficient closed-form O(N) time update.

In particular, given means m_1, m_2, ..., m_{N−1} of marginals w_1, w_2, ..., w_{N−1} of π_{t+1} and second moment s_1 of w_1, we apply approximate moment matching with proposal Dir(α) by solving the system of equations

m_i = \frac{\alpha_i}{\alpha_0}, \quad i = 1, 2, \dots, N-1,     (14)

s_1 = \frac{\alpha_1(\alpha_1 + 1)}{\alpha_0(\alpha_0 + 1)},     (15)

where α_0 = \sum_{i=1}^N \alpha_i > 0. Please note that the second moment condition (15) is necessary here, since without it the system is under-determined. Also, we could use any of s_2, s_3, ..., s_N in place of s_1; in our experiments, we use the value of s_i which results in the largest value of s_i − m_i^2 to avoid underflow in the solution.
The unique positive solution of (14) and (15) is

\alpha_0 = \frac{m_1 - s_1}{s_1 - m_1^2}, \qquad \alpha_i = m_i \alpha_0 = m_i \left( \frac{m_1 - s_1}{s_1 - m_1^2} \right), \quad i = 1, 2, \dots, N-1.     (16)

In order to apply the moment matching solution (16) to approximate the posterior update (10), it remains to compute the moments m_1, m_2, ..., m_{N−1} and s_1 of π_{t+1}.

5

We proceed by induction on t. More specifically, we assume that the prior π_0 = Dir(α_0) was chosen arbitrarily and that the projection Dir(α_t) of π_t was already obtained. Given new evidence e ∈ R^N_+, we obtain C_{t+1} = \sum_{i=1}^N e_i \mathbb{E}_{\pi_t}[w_i] = \sum_{i=1}^N e_i \frac{\alpha_{t,i}}{\alpha_{t,0}} = \frac{e \cdot \alpha_t}{\alpha_{t,0}}, where α_{t,0} = \sum_{i=1}^N \alpha_{t,i} > 0. Using (10) and (13),

m_i = \mathbb{E}_{\pi_{t+1}}[w_i]
= \frac{\alpha_{t,0}}{e \cdot \alpha_t} \int_{S^{N-1}} \sum_{j=1}^N e_j \, w_j \, w_i \, \pi_t(w) \, dw
= \frac{\alpha_{t,0}}{e \cdot \alpha_t} \sum_{j=1}^N e_j \, \mathbb{E}_{\pi_t}[w_i w_j]

= \frac{\alpha_{t,0}}{e \cdot \alpha_t} \left( e_i \, \mathbb{E}_{\pi_t}[w_i^2] + \sum_{j \neq i} e_j \, \mathbb{E}_{\pi_t}[w_i w_j] \right)
= \frac{\alpha_{t,0}}{e \cdot \alpha_t} \left( e_i \, \frac{\alpha_{t,i}(\alpha_{t,i} + 1)}{\alpha_{t,0}(\alpha_{t,0} + 1)} + \sum_{j \neq i} e_j \, \frac{\alpha_{t,i} \alpha_{t,j}}{\alpha_{t,0}(\alpha_{t,0} + 1)} \right)
= \frac{\alpha_{t,i}(e_i + e \cdot \alpha_t)}{(e \cdot \alpha_t)(\alpha_{t,0} + 1)}.     (17)

Using the same technique, we can readily obtain the corresponding formula for s_1,

s_1 = \frac{\alpha_{t,1}(\alpha_{t,1} + 1)(2 e_1 + e \cdot \alpha_t)}{(e \cdot \alpha_t)(\alpha_{t,0} + 1)(\alpha_{t,0} + 2)}.     (18)

Combining (17) and (18) with the general solution to the moment matching problem (16) yields the new projected posterior Dir(α_{t+1}).
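The entire projection step, moments (17) and (18) followed by the inversion (16), costs O(N) per observation. A NumPy sketch of this update (a transcription of the equations above, not the authors' code):

```python
import numpy as np

def posterior_update(alpha, e):
    """Moment-matched Dirichlet posterior update over expert weights.

    Computes the true posterior's marginal means via (17) and the second
    moment of w_1 via (18), then inverts them with (16) to obtain the
    parameters of the projected Dirichlet posterior.
    """
    alpha = np.asarray(alpha, dtype=float)
    e = np.asarray(e, dtype=float)
    a0 = alpha.sum()                        # α_{t,0}
    ea = e @ alpha                          # e · α_t

    # Posterior marginal means m_i, Eq. (17); these sum to 1 by construction.
    m = alpha * (e + ea) / (ea * (a0 + 1.0))
    # Second moment s_1 of w_1, Eq. (18).
    s1 = alpha[0] * (alpha[0] + 1.0) * (2.0 * e[0] + ea) / (
        ea * (a0 + 1.0) * (a0 + 2.0))

    # Invert the moment equations, Eq. (16).
    new_a0 = (m[0] - s1) / (s1 - m[0] ** 2)
    new_alpha = m * new_a0
    # Set the last component by the residual, as in the paper's update,
    # so the parameters sum exactly to new_a0.
    new_alpha[-1] = new_a0 - new_alpha[:-1].sum()
    return new_alpha
```

A useful sanity check: with uninformative evidence (all e_i equal), (17) and (18) reduce to the prior moments of Dir(α_t), so the update is a fixed point; evidence favoring expert i shifts posterior mass toward α_{t,i}.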
This leads to a very efficient O(N) algorithm for posterior updates given in Algorithm 1.

Algorithm 1 PosteriorUpdate(α_t, e)
1: for i = 1, 2, ..., N−1 do     ▷ Compute posterior moments
2:     m_i ← α_{t,i}(e_i + e · α_t) / ((e · α_t)(α_{t,0} + 1))
3: s_1 ← α_{t,1}(α_{t,1} + 1)(2e_1 + e · α_t) / ((e · α_t)(α_{t,0} + 1)(α_{t,0} + 2))
4: α_{t+1,0} ← (m_1 − s_1) / (s_1 − m_1^2)     ▷ Compute α_{t+1}
5: for i = 1, 2, ..., N−1 do
6:     α_{t+1,i} ← m_i α_{t+1,0}
7: α_{t+1,N} ← α_{t+1,0} − \sum_{i=1}^{N-1} \alpha_{t+1,i}
8: return α_{t+1}

Finally, once we have obtained α_t, we can compute \mathbb{E}_{\pi_t}[w_i] = \alpha_{t,i} / \alpha_{t,0} = \alpha_{t,i} / \sum_{j=1}^N \alpha_{t,j}. It remains only to show how to compute E[q_{s,a}|i] and evidence e.

3.3 Algorithm

Following the Bayesian Q-learning framework (Dearden et al. [1998]), we model Q-values for each state-action pair as independent Gaussian distributed random variables. Since the best choice of Φ should be the optimal value function V*, we model Q-values q_{s,a} given the best expert Φ_i as

q_{s,a} \mid i \sim \mathcal{N}\left( \Phi_i(s), \, (\sigma^i_{s,a})^2 \right),     (19)

where i ∈ {1, 2, ..., N}. Since (σ^i_{s,a})^2 is not known, we need to maintain an estimator of it. However, maintaining an estimate for each expert and state-action pair would not be practical for large spaces, so we follow Downey and Sanner [2010] and replace (σ^i_{s,a})^2 by the sample variance σ̂^2 of D.
This permits constant-time updates per sample without any additional memory overhead, and this worked very well in our experiments.

6

Using these observations and the approximation π_t = Dir(α_t) from Section 3.2, (9) reduces to

\rho_t(s, a) = \sum_{i=1}^N \mathbb{E}[q_{s,a} \mid i] \, \mathbb{E}_{\pi_t}[w_i] = \frac{\sum_{i=1}^N \Phi_i(s) \, \alpha_{t,i}}{\sum_{i=1}^N \alpha_{t,i}},     (20)

and defines the reward shaping potential function Φ̂ used during training. Finally, given a return observation d ∈ D in state s, the evidence e_i for each i ∈ {1, 2, ..., N} is computed simply from the Gaussian probability distribution N(Φ_i(s), σ̂^2) in (19).

We note that all steps can be performed efficiently on-line and so this approach does not require storing D explicitly. Furthermore, it can be easily incorporated into general reinforcement learning algorithms without increasing the runtime complexity. Perhaps most importantly, since ρ_t in (20) is a potential-based reward shaping function, it would not change the asymptotically optimal policy. The complete algorithm is summarized in Algorithm 2. Here, TrainRL(F) is a general procedure for training on one state-action-reward sequence using the immediate reward function R + F.

Algorithm 2 RL with Bayesian Reward Shaping
1: initialize α ∈ R^N_+
2: for episode = 0, 1, ..., M do     ▷ Main loop
3:     Φ̂ ← \sum_{i=1}^N \Phi_i \alpha_i / \sum_{i=1}^N \alpha_i     ▷ Pool experts and compute shaped reward
4:     F(s, a, s') ← γΦ̂(s') − Φ̂(s)
5:     (R_t, s_t)_{t=1...T} ← TrainRL(F)     ▷ Perform one episode of training
6:     for all (R_t, s_t) do
7:         update σ̂^2 and compute e     ▷ Posterior update
8:         α ← PosteriorUpdate(α, e)

Remarks: Steps 3 and 4 in Algorithm 2 update the advice off-line on a sequence of cached observations. It is possible to make this algorithm on-line by performing steps 3 and 4 after each observation, but care must be taken to ensure consistency of the optimal policies (Devlin and Kudenko [2012]).

4 Experimental Results

In order to validate the effectiveness of our proposed algorithm, we apply it to a Gridworld problem with subgoals and the classical CartPole problem. We implement the exact tabular Q-learning and SARSA algorithms (2) and the off-policy Deep Q-Learning algorithm with experience replay (Mnih et al. [2013]). In all cases, we followed ε-greedy policies introduced in Section 2.2, and manually selected parameters that worked well for all experts. Policies are learned from scratch, with table entries initialized to zero and neural networks initialized randomly.

4.1 Gridworld

This is the 5-by-5 navigation problem with subgoals introduced in Ng et al. [1999]. We charge +1 points for every move, and one additional point whenever it is invalid (e.g. choosing "UP" when adjacent to the top edge, or an attempt is made to collect a flag in an incorrect order) to encourage the agent to choose valid moves. For all algorithms, we set the length of each episode to T = 200 steps, γ = 1.0, and ε_t = 0.98^t, where t ≥ 0 is the episode.

In the tabular case, we set α = 0.4 for Q-learning and α = 0.36 for SARSA.
The DQN is a dense\nnetwork with encoded state s as inputs and action-values {Q(s, a) : a \u2208 A} as outputs, and two\nfully-connected hidden layers with 25 neurons per layer. We use one-hot encoding for states (see,\ne.g. Lantz [2013]). Hidden neurons use leaky ReLU activations and outputs use linear to allow\nunbounded values. The learning rate is \ufb01xed at 0.001 throughout training that is done on-line using\nthe Adam optimizer in batches of size 16 sampled randomly from memory of size 10000 (we found\nthat doing 5 epochs of training per batch led to more stable convergence).\nWe consider the following \ufb01ve experts in our analysis: \u03a6opt(s) = V \u2217(s) the optimal value function,\n\u03a6good(x, y, c) = \u221222(5 \u2212 c \u2212 0.5)/5 is reasoned in Ng et al. [1999] assuming equidistant subgoals,\n\u03a6zero(x, y, c) = 0, \u03a6rand(x, y, c) = U where U \u223c U [\u221220, 20], and \u03a6neg(s) = \u2212V \u2217(s).\n\n7\n\n\fFigure 1: Test performance (number of steps required to reach the \ufb01nal goal) of the learned policy on\nGridworld for each potential versus the number of training episodes, averaged over 100 independent\nruns of tabular Q/SARSA and 20 runs of DQN. BMC corresponds to Algorithm 2 applied to all\npotential functions.\n\nFigure 2: Test performance (number of steps that the pole was balanced) of the learned policy on\nCartPole for each potential versus the number of training episodes, averaged over 100 runs of tabular\nQ/SARSA and 20 of DQN.\n\n4.2 CartPole\n\nThis is a classical control problem described in Geva and Sitte [1993] and implemented in OpenAI\nGym (Brockman et al. [2016]). In order to encourage the agent to balance the pole, we provide a\nreward of +1 at every step as long as the pole is upright. We set T = 500 frames, \u03b3 = 0.95, and\n\u03b5t = 0.98t. 
Finally, to prevent over-fitting, we stop training whenever the score attained on each of the last 5 episodes is 500.

In both tabular cases, we set α_t = max{0.01, 0.5 × 0.99^t} and ε_t = max{0.01, 0.98^t}, where t ≥ 0 is the episode. States (x, θ, ẋ, θ̇) are discretized into 3, 3, 6 and 3 bins, respectively, for a total of 162 states. The neural network takes continuous inputs in R^4 and has two fully-connected hidden layers with 12 neurons in each. Once again we use ReLU activations for hidden neurons and linear for output neurons. We set the learning rate to 0.0005 and train using the Adam optimizer. To further prevent over-fitting, we train off-line at the end of each episode on 100 batches of size 32 and use L2 regularization with penalty 1E-6.

We consider the following five experts: Φ_guess(s) = 20(1 − |θ|/0.2618) assigns a reward based on the proximity of the pole angle to the vertical, Φ_net is a pre-trained neural network with two hidden layers with 6 neurons per layer, Φ_zero(s) = 0, Φ_rand(s) = U where U ∼ U[−20, 20], and Φ_neg(s) = −Φ_net(s).

4.3 Summary

The performance obtained from each expert and the model combination approach are illustrated in Figure 1 for Gridworld and 2 for CartPole, and the learned expert weights are illustrated in Figure 3.

8

Figure 3: Posterior weights assigned to each potential as a function of the number of episodes of training, averaged over 100 independent runs using tabular Q and SARSA and 20 runs using DQN.

In Gridworld, it is interesting to see that Φ_good and Φ_opt are quantitatively very similar and have similar effects on the rate of convergence in the tabular case, yet Φ_opt considerably outperforms Φ_good in the deep learning case. As shown in Figure 3, our algorithm assigns most of its weight to Φ_opt and results in near-optimal performance in all three cases.

In CartPole, it is not immediately clear that Φ_guess is better than Φ_net, since both should be very close to V*. However, Φ_net is both a biased estimate of V* and noisy (due to the inexactness of gradient descent), whereas the simple expert Φ_guess is highly related to the goal (keeping the pole centered). Furthermore, Φ_net is even less accurate in the tabular case due to state discretization. Once again, Figure 3 clearly shows that our approach can handle both analytic and computational advice and is sensitive to approximation error and noise.

5 Conclusion and Future Work

In this paper, the decision maker is presented with multiple sources of expert advice in the form of potential-based reward functions, some of which can be misleading and should not be trusted. We assumed that the decision maker does not know a priori which expert(s) to trust, but rather learns this from experience in a Bayesian framework. More specifically, we followed the Bayesian model combination approach and assigned posterior probabilities to distributions over experts.
We showed that the total expected return is a linear combination of individual expert predictions, weighted by the posterior beliefs assigned to them. We solved the issue of tractability by projecting the true posterior distribution onto the Dirichlet family using moment matching, and then specialized our analysis to Bayesian Q-learning. Our approach follows the potential-based reward shaping framework and does not change the optimal policies. Finally, we showed that our proposed method accelerates the learning phase when solving discrete and continuous domains using different learning algorithms.
Further extensions and generalizations of this work could include a rigorous theoretical analysis of posterior convergence under certain conditions on the reward shaping functions. It is also possible to extend our analysis to state/action-dependent weightings of experts, at the cost of higher space complexity; this could be useful in situations where the most suitable potential function changes across different regions of the state space. It also remains to scale our work to large-scale and real-world problems, where imperfect advice and convergence issues could be more prevalent.

Acknowledgments

We would like to thank the NeurIPS reviewers for their feedback, which significantly improved this paper.

References

J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In UAI, pages 19–26. AUAI Press, 2009.

D. P. Bertsekas and S. Shreve. Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, 2004.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Nowé. Reinforcement learning from demonstration through shaping.
In IJCAI, pages 3352–3358, 2015.

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In NeurIPS, pages 4302–4310, 2017.

R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In AAAI/IAAI, pages 761–768, 1998.

S. Devlin and D. Kudenko. Theoretical considerations of potential-based reward shaping for multi-agent systems. In AAMAS, pages 225–232. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

S. Devlin and D. Kudenko. Dynamic potential-based reward shaping. In AAMAS, pages 433–440. International Foundation for Autonomous Agents and Multiagent Systems, 2012.

T. G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1–15. Springer, 2000.

C. Downey and S. Sanner. Temporal difference Bayesian model averaging: A Bayesian perspective on adapting lambda. In ICML, pages 311–318, 2010.

A. Eck, L.-K. Soh, S. Devlin, and D. Kudenko. Potential-based reward shaping for POMDPs. In AAMAS, pages 1123–1124. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

S. Geva and J. Sitte. A cartpole experiment benchmark for trainable controllers. IEEE Control Systems, 13(5):40–51, 1993.

S. Griffith, K. Subramanian, J. Scholz, C. L. Isbell, and A. L. Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In NeurIPS, pages 2625–2633, 2013.

M. Grześ. Reward shaping in episodic reinforcement learning. In AAMAS, pages 565–573. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

M. Grześ and D. Kudenko. Learning shaping rewards in model-based reinforcement learning. In Proc. AAMAS 2009 Workshop on Adaptive Learning Agents, volume 115, 2009.

M. Grześ and D. Kudenko. Online learning of shaping rewards in reinforcement learning. Neural Networks, 23(4):541–550, 2010.

A. Harutyunyan, T. Brys, P. Vrancx, and A. Nowé. Shaping Mario with human advice. In AAMAS, pages 1913–1914. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

W.-S. Hsu and P. Poupart. Online Bayesian moment matching for topic modeling with unknown number of topics. In NeurIPS, pages 4536–4544, 2016.

G. Konidaris and A. Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In ICML, pages 489–496. ACM, 2006.

B. Lantz. Machine Learning with R. Packt Publishing Ltd, 2013.

Y. Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

R. Maclin, J. Shavlik, L. Torrey, T. Walker, and E. Wild. Giving advice about preferred actions to reinforcement learners via knowledge-based kernel regression. In AAAI, pages 819–824, 2005.

O. Marom and B. S. Rosman. Belief reward shaping in reinforcement learning. In AAAI, 2018.

T. P. Minka. Bayesian model averaging is not model combination. Available electronically at http://www.stat.cmu.edu/minka/papers/bma.html, pages 1–2, 2000.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller.
Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.

F. Omar. Online Bayesian learning in probabilistic graphical models using moment matching with applications. 2016.

A. O'Hagan. Bayesian statistics: principles and benefits. Frontis, pages 31–45, 2004.

P. Philipp and A. Rettinger. Reinforcement learning for multi-step expert advice. In AAMAS, pages 962–971. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In ICML, volume 98, pages 463–471, 1998.

H. B. Suay, T. Brys, M. E. Taylor, and S. Chernova. Learning from demonstration for shaping through inverse reinforcement learning. In AAMAS, pages 429–437. International Foundation for Autonomous Agents and Multiagent Systems, 2016.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 2018.

M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. JMLR, 10(Jul):1633–1685, 2009.

M. E. Taylor, P. Stone, and Y. Liu. Transfer learning via inter-task mappings for temporal difference learning. JMLR, 8(Sep):2125–2167, 2007.

A. C. Tenorio-Gonzalez, E. F. Morales, and L. Villaseñor-Pineda. Dynamic reward shaping: training a robot by voice. In Ibero-American Conference on Artificial Intelligence, pages 483–492. Springer, 2010.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

E. Wiewiora. Potential-based shaping and Q-value initialization are equivalent. JAIR, 19:205–208, 2003.

E. Wiewiora, G. W. Cottrell, and C. Elkan. Principled methods for advising reinforcement learning agents. In ICML, pages 792–799, 2003.