{"title": "Propagating Uncertainty in Reinforcement Learning via Wasserstein Barycenters", "book": "Advances in Neural Information Processing Systems", "page_first": 4333, "page_last": 4345, "abstract": "How does the uncertainty of the value function propagate when performing temporal difference learning? In this paper, we address this question by proposing a Bayesian framework in which we employ approximate posterior distributions to model the uncertainty of the value function and Wasserstein barycenters to propagate it across state-action pairs. Leveraging on these tools, we present an algorithm, Wasserstein Q-Learning (WQL), starting in the tabular case and then, we show how it can be extended to deal with continuous domains. Furthermore, we prove that, under mild assumptions, a slight variation of WQL enjoys desirable theoretical properties in the tabular setting. Finally, we present an experimental campaign to show the effectiveness of WQL on finite problems, compared to several RL algorithms, some of which are specifically designed for exploration, along with some preliminary results on Atari games.", "full_text": "Propagating Uncertainty in Reinforcement Learning\n\nvia Wasserstein Barycenters\n\nAlberto Maria Metelli\u2217\n\nDEIB\n\nPolitecnico di Milano\n\nMilan, Italy\n\nAmarildo Likmeta\u2217\n\nDEIB\n\nPolitecnico di Milano\n\nMilan, Italy\n\nMarcello Restelli\n\nDEIB\n\nPolitecnico di Milano\n\nMilan, Italy\n\nalbertomaria.metelli@polimi.it\n\namarildo.likmeta@polimi.it\n\nmarcello.restelli@polimi.it\n\nAbstract\n\nHow does the uncertainty of the value function propagate when performing tem-\nporal difference learning? In this paper, we address this question by proposing a\nBayesian framework in which we employ approximate posterior distributions to\nmodel the uncertainty of the value function and Wasserstein barycenters to propa-\ngate it across state-action pairs. 
Leveraging these tools, we present an algorithm, Wasserstein Q-Learning (WQL), starting from the tabular case and then showing how it can be extended to deal with continuous domains. Furthermore, we prove that, under mild assumptions, a slight variation of WQL enjoys desirable theoretical properties in the tabular setting. Finally, we present an experimental campaign to show the effectiveness of WQL on finite problems, compared to several RL algorithms, some of which are specifically designed for exploration, along with some preliminary results on Atari games.

1 Introduction

Effectively balancing exploration and exploitation is a key challenge in Reinforcement Learning [RL, 43]. When an agent takes decisions under uncertainty, it faces the dilemma between exploiting the information collected so far to execute what is believed to be the best action or choosing a possibly suboptimal action to explore new portions of the environment and gather new information, leading to more profitable behaviors in the future. Traditional exploration strategies, such as ε-greedy and Boltzmann exploration [43], inject random noise into the action-selection process, i.e., the policy, to guarantee that each action is tried often enough. Although these methods allow RL algorithms to learn the optimal value function under mild assumptions [39], they are not efficient, since exploration is random and not driven by confidence in the value function estimate. Therefore, they might converge towards the optimal behavior after an exponential number of steps [24].
The exploration-exploitation dilemma has been extensively analyzed in the RL community, focusing on the definition of proper indices for provably-efficient exploration and devising algorithms with strong theoretical guarantees [25, 11, 21, 30]. 
Most of these algorithms are inherently model-based,\ni.e., they need to maintain and update estimates of the environment dynamics and the reward function\nduring the learning process. For this reason, model-based methods are rather unsuited to problems\nwith large state spaces and inapplicable to continuous environments. Apart from rare exceptions [41],\nthe RL community has only recently focused on devising ef\ufb01cient model-free exploration strategies.\nSome works have succeeded in obtaining provably-ef\ufb01cient algorithms [35, 31, 23]; whereas others\nare more empirically-oriented [30, 29, 6].\n\n\u2217Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fA fundamental step towards ef\ufb01cient exploration is the quanti\ufb01cation of the uncertainty of the value\nfunction. The notion of uncertainty is formalized in Bayesian statistics by means of a posterior\ndistribution. Bayesian Reinforcement Learning incorporates the Bayesian inference tools to provide a\nprincipled way to address the exploration-exploitation dilemma [20]. However, these methods rarely\nexploit the speci\ufb01c way in which the uncertainty propagates through the Bellman equation. Recently,\nin [28] a partial answer has been provided, proposing an uncertainty Bellman inequality; although no\nposterior distribution is explicitly considered.\nIn this paper, we propose a novel Bayesian framework to address the problem of exploration using\nposterior distributions over the value function. Speci\ufb01cally, we focus on how to model and propagate\nuncertainty when performing temporal-difference learning (Section 3). Moreover, we show how\nto use this uncertainty information to effectively explore the environment. Finally, we combine\nthese elements to build our algorithm: Wasserstein Q-Learning (WQL, Section 4). 
Similarly to Bayesian Q-Learning [15], we equip each state-action pair with an approximate posterior distribution (named Q-posterior), whose goal is to quantify the uncertainty of the value function. Whenever a transition occurs, we update our distribution in a temporal difference [TD, 43] fashion, in order to incorporate all sources of uncertainty: i) the one due to the sample estimate of the reward function and environment dynamics; ii) the uncertainty injected using the estimate of the next-state value function. Rather than employing a standard Bayesian update, we resort to a variational approach to approximate the posterior distribution, based on Wasserstein barycenters [2]. Recently, several works have embedded into RL algorithms notions coming from Optimal Transport [OT, 51], especially the Wasserstein metric, to improve the learning abilities of policy search algorithms [34] or in the field of robust RL [1]. Furthermore, we prove in Section 5 that a slight modification of WQL, in tabular domains, is PAC-MDP in the average loss setting [42]. After examining the related literature (Section 6), we present an experimental evaluation on tabular environments to show the effectiveness of WQL, compared to classic RL algorithms, some of which are specifically designed for exploration (Section 7.1). Finally, we provide some preliminary results on the application of WQL to deep architectures (Section 7.2). The proofs of all results are reported in Appendix B. The implementation of the proposed algorithms can be found at https://github.com/albertometelli/wql.

2 Preliminaries

In this section, we provide the notation and the basic notions we will use in the following. 
Given a set X, we denote with P(X) the set of all probability measures over X.

Markov Decision Processes A discrete-time Markov Decision Process [MDP, 36] is defined as a 5-tuple M = (S, A, P, R, γ), where S is the state space, A is the (finite) action space, P : S × A → P(S) is a Markovian transition model, R : S × A → P(ℝ) is a Markovian reward model, and γ ∈ [0, 1) is the discount factor. The behavior of an agent is defined by means of a Markovian policy π : S → P(A). Whenever the environment is in state s ∈ S, the agent performs action A ∼ π(·|s) and the environment transitions to the next state S′ ∼ P(·|s, A), providing the agent with the reward R ∼ R(·|s, A). We assume |R| ≤ rmax < +∞ almost surely. We indicate with r(s, a) = E_{R∼R(·|s,a)}[R] the expected reward obtained by taking action a ∈ A in state s ∈ S. Given a policy π, we define the state-value function, or V-function, as v^π(s) = E_{A∼π(·|s), S′∼P(·|s,A)}[r(s, A) + γ v^π(S′)]. The action-value function, or Q-function, is given by q^π(s, a) = r(s, a) + γ E_{S′∼P(·|s,a), A′∼π(·|S′)}[q^π(S′, A′)]. The optimal action-value function is defined as q*(s, a) = sup_{π∈Π}{q^π(s, a)} for all (s, a) ∈ S × A and it satisfies the optimal Bellman equation: q*(s, a) = r(s, a) + γ E_{S′∼P(·|s,a)}[max_{a′∈A}{q*(S′, a′)}]. The boundedness of the reward function implies that the Q-function is uniformly bounded, i.e., |q*(s, a)| ≤ qmax ≤ rmax/(1 − γ). Then, an optimal policy π* is any policy that plays only greedy actions w.r.t. 
q*, i.e., for all s ∈ S we have π*(·|s) ∈ P(arg max_{a∈A}{q*(s, a)}).

Temporal Difference Learning Temporal-difference methods update the estimate of the optimal Q-function using the estimates of the next states' V-functions [43]. For TD(0), we have that whenever a (St, At, St+1, Rt+1) tuple is collected, the temporal difference update rule is executed:

    qt+1(St, At) = (1 − αt) qt(St, At) + αt (Rt+1 + γ vt(St+1)),    (1)

where qt is the estimated Q-function at time t, αt ≥ 0 is a learning rate, and vt is an estimate of the V-function at time t. Different choices for vt generate different learning algorithms. If vt(St+1) = qt(St+1, At+1) we get the SARSA update [38], if vt(St+1) = E_{A∼πt(·|St+1)}[qt(St+1, A)] we get the Expected SARSA update [50], being πt the exploration policy played at time t, and if vt(St+1) = max_{a∈A}{qt(St+1, a)} we are performing Q-learning [52].

Wasserstein Barycenters Let (X, d) be a complete separable metric (Polish) space and x0 ∈ X be an arbitrary point. For each p ∈ [1, +∞) we define Pp(X) as the set of all probability measures μ over (X, F) such that E_{X∼μ}[d(X, x0)^p] < +∞. Let μ, ν ∈ Pp(X); the Lp-Wasserstein distance between μ and ν is defined as [51]:

    Wp(μ, ν) = ( inf_{ρ∈Γ(μ,ν)} E_{X,Y∼ρ}[d(X, Y)^p] )^{1/p},    (2)

where Γ(μ, ν) is the set of all probability measures on X × X (couplings) with marginals μ and ν. With little abuse of notation, we will indicate with Wp(X, Y) = Wp(μ, ν), whenever clear from the context. The Wasserstein distance comes from the optimal transport community. Intuitively, it represents the "cost" to move the probability mass to turn one distribution into the other. 
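For one-dimensional distributions, which is the relevant case here since Q-posteriors are distributions over ℝ, the infimum in Equation (2) admits a closed form: the optimal coupling matches the probability mass in sorted order. The following sketch (not part of the paper; it assumes two equally-weighted empirical distributions with the same number of samples) computes W2 this way:

```python
import numpy as np

def w2_empirical(x, y):
    """L2-Wasserstein distance between two one-dimensional empirical
    distributions with the same number of equally-weighted samples.

    In 1-D the optimal coupling pairs the samples in sorted order, so
    W2 is the root-mean-square distance between order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape, "equal sample sizes assumed"
    return float(np.sqrt(np.mean((x - y) ** 2)))
```

More generally, in 1-D, W2 is the L2 distance between the two quantile functions, which is what makes the barycenters used below cheap to compute.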
Given a set of probability measures {ν_i}_{i=1}^{n}, belonging to the class N, and a set of weights {ξ_i}_{i=1}^{n}, with Σ_{i=1}^{n} ξ_i = 1 and ξ_i ≥ 0, the L2-Wasserstein barycenter is defined as [2]:

    ν̄ ∈ arg inf_{ν∈N} { Σ_{i=1}^{n} ξ_i W2(ν, ν_i)² }.    (3)

3 How to Model and Propagate Uncertainty?

In this section, we introduce a unifying Bayesian framework for exploration in RL that employs (approximate) posterior distributions to model the uncertainty of value functions (Section 3.1) and Wasserstein barycenters to propagate uncertainty when performing TD updates (Section 3.2). Furthermore, we discuss how to leverage the Q-posteriors to estimate the action that attains the maximum return in each state (Section 3.3) and to effectively explore the environment (Section 3.4).

3.1 Modeling Uncertainty via Q-Posteriors

Taking inspiration from Bayesian approaches to RL [15, 20], for each state s ∈ S and action a ∈ A we maintain a probability distribution Q(s, a), which we call Q-posterior, representing a (possibly approximate) posterior distribution of the Q-function estimate. This distribution depends on the underlying MDP, in particular the environment dynamics P and reward model R, and on the updates of the Q-function estimates performed. As in a model-free scenario we cannot represent such a distribution exactly, we employ a class of approximating probability distributions Q ⊆ P(ℝ). Similarly to the usual value functions, we introduce the V-posterior V(s), which represents the (possibly approximate) posterior distribution of the V-function and combines the uncertainties modeled by the Q-posteriors Q(s, a). 
Furthermore, since the V-function is defined, in the usual framework, as the expectation of the Q-function over the action space, i.e., v^π(s) = E_{A∼π}[q^π(s, A)], it is natural to define, in our setting, the V-posterior V(s) as the Wasserstein barycenter of the Q-posteriors Q(s, a).2
Definition 3.1 (V-posterior). Given a policy π and a state s ∈ S, we define the V-posterior V(s) induced by the Q-posteriors Q(s, a), with a ∈ A, as the Wasserstein barycenter of the Q(s, a):

    V(s) ∈ arg inf_{V∈Q} { E_{A∼π(·|s)} [ W2(V, Q(s, A))² ] }.    (4)

When the policy π is known, the expectation over the action space can be computed, as we are assuming that A is finite. In a prediction problem, policy π is a fixed policy, whereas, in a control problem, π is a policy aimed at properly selecting the best action in state s, accounting for the uncertainty modeled by the Q-posteriors (see Section 3.3). Moreover, when the Q(s, a) are deterministic distributions, V(s) is a deterministic distribution too, centered in the mean of the Q(s, a). In this way, we obtain the usual V-function definition (see Proposition A.3).

2The Wasserstein barycenter can be regarded as a way of averaging distributions [2].

It is important to stress that our approach is rather different from Distributional Reinforcement Learning [9, 13, 12, 37]. Indeed, we employ a distribution to represent the uncertainty of the Q-function estimate and not the intrinsic randomness of the return. The two distributions are clearly related and both depend on the stochasticity of the reward and of the transition model. 
However, in\nour approach the stochasticity refers to the uncertainty on the Q-function estimate which reduces as\nthe number of updates increases, being a sample mean.3\n\n3.2 Propagating Uncertainty via Wasserstein Barycenters\n\nIn this section, we discuss the problem of uncertainty propagation, i.e., how to deal with the update\nof the Q-posteriors when experiencing a transition (St, At, St+1, Rt+1). Whenever a TD update\n(Equation (1)) is performed, there are two sources of uncertainty involved. First, we implicitly\nestimate the environment dynamics P(\u00b7|St, At) and the reward model R(\u00b7|St, At) using a set of\nsampled transitions (St, At, St+1, Rt+1). Second, when using the V-function estimates of the next\nstates vt(St+1) we bring into qt+1(St, At) part of the uncertainty of vt(St+1) and they become\ncorrelated. For this reason, the standard Bayesian posterior update, used for instance in Bayesian\nQ-learning [15], becomes rather inappropriate as it assumes that the samples are independent, which\nis clearly not true. We argue that, rather than using a Bayesian update, when we have a Q-posterior\nQt(St, At) and a V-posterior Vt(St+1) we can combine them using a notion of barycenter, which\ndoes not require the independence assumption. We formalize this idea in the following update rule.\nDe\ufb01nition 3.2 (Wasserstein Temporal Difference). Let Qt be the current Q-posterior, given a\ntransition (St, At, St+1, Rt+1), we de\ufb01ne the TD-target-posterior as Tt = Rt+1 + \u03b3Vt(St+1). 
Let αt ≥ 0 be the learning rate; we define the Wasserstein Temporal Difference (WTD) update rule as:

    Qt+1(St, At) ∈ arg inf_{Q∈Q} { (1 − αt) W2(Q, Qt(St, At))² + αt W2(Q, Tt)² }.    (5)

Therefore, the new Q-posterior Qt+1(St, At) is the Wasserstein barycenter between the current Q-posterior Qt(St, At) and the TD-target posterior Tt = Rt+1 + γVt(St+1), which in turn embeds information about the current transition (i.e., the reward Rt+1 and the next state St+1) and the next-state V-posterior Vt(St+1). It is worth noting that the two terms appearing in Equation (5) account for all sources of uncertainty. Indeed, the first term W2(Q, Qt(St, At)) avoids moving too far from the current estimate Qt(St, At), as we are performing the update after experiencing a single transition, whereas W2(Q, Tt) allows bringing into the new Q-posterior the V-posterior of the next state Vt(St+1) (including its uncertainty). We stress the analogy with the standard TD update in the following result.

Proposition 3.1. If Q is the set of deterministic distributions over ℝ, then the WTD update rule (Equation (5)) has a unique solution that corresponds to the TD update rule (Equation (1)).

Supporting deterministic distributions as Q-posteriors is fundamental for our method, which models a sample mean whose variance reduces as the number of samples increases, moving towards a deterministic distribution. This justifies the choice of the Wasserstein metric over other distributional distances (e.g., α-divergences). The choice of the prior Q0 plays an important role, along with the learning rate schedule αt. 
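As a concrete illustration (a minimal sketch, not the paper's implementation, anticipating the particle class introduced in Section 4, i.e., a uniform mixture of M Dirac deltas), the barycenter in Equation (5) reduces to a particle-wise convex combination of the sorted supports:

```python
import numpy as np

def wtd_particle_update(q_particles, v_next_particles, reward, gamma, alpha):
    """Wasserstein Temporal Difference (Equation (5)) for posteriors that
    are uniform mixtures of M Dirac deltas ("particles").

    The TD-target posterior T_t = R + gamma * V(s') shifts and scales the
    next-state particles; for sorted particle vectors, the W2 barycenter
    with weights (1 - alpha, alpha) is the particle-wise convex
    combination of the order statistics."""
    q = np.sort(np.asarray(q_particles, dtype=float))
    target = reward + gamma * np.sort(np.asarray(v_next_particles, dtype=float))
    return (1.0 - alpha) * q + alpha * target
```

When all particles of each posterior coincide, the posteriors are deterministic and the update collapses to the standard TD rule of Equation (1), consistently with Proposition 3.1.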
We will show in Section 5 that specific choices of Q0 and αt, for a particular class of distributions Q, allow achieving the PAC-MDP property in the average loss setting.

3.3 Estimating the Maximum Expected Value

The TD-target-posterior Tt = Rt+1 + γVt(St+1) is defined in terms of the next-state V-posterior Vt(St+1). In a control problem, we aim at learning the optimal Q-function q* and, thus, we are interested in propagating back to Qt+1(St, At) a V-posterior Vt(St+1) related to the optimal action to be taken in the next state.4 This can be performed by a suitable choice of the policy π, as in Definition 3.1. A straightforward approach consists in propagating the Q-posterior Q(St+1, a) of the action with the highest estimated mean, i.e., πM(·|s) ∈ P(arg max_{a∈A}{E_{Q∼Q(s,a)}[Q]}).

3A notable difference w.r.t. distributional RL is that the variance of our posterior distribution Var_{Q∼Q(s,a)}[Q] vanishes as the number of updates grows to infinity.
4We stress that we are not interested in modeling the distribution max_{a∈A}{Q(s, a)}, but rather in exploiting the uncertainty modeled by Q(s, a) to properly perform the computation of the optimal action.

Table 1: Probability density function (pdf), Wasserstein Temporal Difference (WTD) update rule and computation of the V-posterior for Gaussian and Particle posterior distributions.

Gaussian
  pdf: (1/√(2πσ²(s, a))) exp{ −(1/2) ((x − m(s, a))/σ(s, a))² }
  WTD: m_{t+1}(St, At) = (1 − αt) m_t(St, At) + αt (Rt+1 + γ m_t(St+1));
       σ_{t+1}(St, At) = (1 − αt) σ_t(St, At) + αt γ σ_t(St+1)
  V-posterior: m(s) = E_{A∼π(·|s)}[m(s, A)]; σ(s) = E_{A∼π(·|s)}[σ(s, A)]

Particle
  pdf: Σ_{j=1}^{M} w_j δ(x − x_j(s, a)), with x_1(s, a) ≤ ... ≤ x_M(s, a), Σ_{j=1}^{M} w_j = 1 and w_j ≥ 0
  WTD: x_{j,t+1}(St, At) = (1 − αt) x_{j,t}(St, At) + αt (Rt+1 + γ x_{j,t}(St+1)), j = 1, 2, ..., M
  V-posterior: x_j(s) = E_{A∼π(·|s)}[x_j(s, A)], j = 1, 2, ..., M

We refer to this approach as Mean Estimator (ME) for the maximum. However, when posterior distributions are available, we can use them to define a wiser way to estimate the V-posterior of the next state.5 A first method, based on Optimism in the Face of Uncertainty [OFU, 3], consists in selecting the action that maximizes a statistical upper bound u_δ(s, a) of the Q-posterior, i.e., πO(·|s) ∈ P(arg max_{a∈A}{u_δ(s, a)}). We will refer to this method as Optimistic Estimator (OE). However, if we want to make full usage of the Q-posteriors, we can resort to the Posterior Estimator (PE) of the maximum, based on Posterior Sampling [PS, 47]. In this case, each action contributes to the update rule weighted by the probability of being the optimal action, i.e., πP(a|s) = Pr_{Q_{s,a′}∼Q(s,a′)}(a ∈ arg max_{a′∈A}{Q_{s,a′}}).

3.4 Exploring using the Q-posteriors

In the previous section, we introduced two approaches that exploit the Q-posteriors to properly define the V-posterior of the next state, using specific policies π. These policies can also be used to implement effective exploration strategies aware of the uncertainty. Using the optimistic policy πO, in each state we play (deterministically) the action that maximizes the statistical upper bound on the estimated Q-function u_δ(s, a); we call this strategy Optimistic Exploration (OX). 
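The three maximum estimators can be sketched as follows for independent Gaussian Q-posteriors over a finite action set (not the paper's code: the confidence multiplier c is a hypothetical stand-in for the statistical upper bound u_δ(s, a), and the PE probabilities are approximated by Monte Carlo sampling):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_estimator(means):
    """ME: pick the action with the highest posterior mean."""
    return int(np.argmax(means))

def optimistic_estimator(means, sigmas, c=2.0):
    """OE: pick the action maximizing an upper bound m + c * sigma.
    c is a hypothetical confidence multiplier standing in for u_delta."""
    return int(np.argmax(np.asarray(means) + c * np.asarray(sigmas)))

def posterior_probabilities(means, sigmas, n_samples=10_000):
    """PE/PX: Monte Carlo estimate of the probability that each action
    attains the maximum, under independent Gaussian Q-posteriors."""
    samples = rng.normal(loc=means, scale=sigmas, size=(n_samples, len(means)))
    winners = np.argmax(samples, axis=1)
    return np.bincount(winners, minlength=len(means)) / n_samples
```

The vector returned by `posterior_probabilities` is exactly the exploration policy used by posterior sampling: drawing one joint sample per step and playing its argmax plays each action with its estimated probability of being optimal.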
Instead, we can directly use the posterior policy πP to sample the action from the Q-posterior Q(s, a). Thus, in Posterior Exploration (PX), each action is played with the probability of being optimal.

4 Wasserstein Q-Learning

The ideas presented so far can be combined in an algorithm, Wasserstein Q-Learning (WQL), whose pseudocode is reported in Algorithm 1. We developed our approach for a generic class of distributions Q; however, in practice, we focus on two specific classes: Gaussian posteriors (G-WQL) and Particle posteriors (P-WQL), i.e., a mixture of M > 1 Dirac deltas. For both classes the Wasserstein barycenter is unique and can be computed in closed form (see Appendix A.3).6 In Table 1, we summarize the main relevant features of these distribution classes. WQL simply needs to store the parameters of the Q-posterior for every state-action pair (m(s, a) and σ(s, a) for G-WQL and x_j(s, a) for P-WQL). Therefore, unlike the majority of provably-efficient algorithms, it can be extended straightforwardly to continuous state spaces, as long as we adopt a function approximator for the parameters of the posterior. For instance, we could approximate m(s, a) and σ(s, a), or the particles x_j(s, a), using a neural network with multiple heads. For this reason, our method easily applies to deep architectures by adopting a network that directly outputs the posterior parameters, instead of the value function (see Section 7.2).

Input: a prior distribution Q0, a step size schedule (αt)_{t≥0}, an exploration policy schedule (πt)_{t≥0}
1: Initialize Q(s, a) with the prior Q0
2: for t = 1, 2, ... do
3:    Take action At ∼ πt(·|St)
4:    Observe St+1 and Rt+1
5:    Compute Vt(St+1) using Equation (4)
6:    Update Qt+1(St, At) using Equation (5)
7: end for

Algorithm 1: Wasserstein Q-Learning.

5This problem was treated in RL, without distributions, proposing several estimators, such as the double estimator [48] and the weighted estimator [17, 16].
6It is worth noting that, even for the Gaussian case, using the standard Bayesian posterior update is inappropriate, as the independence of the Q-function estimates across state-action pairs cannot be assumed.

5 Theoretical Analysis

In this section, we show that WQL, with some modifications, enjoys desirable theoretical properties in the tabular setting. We start by providing a modification of the WTD update rule that will be used for the analysis; then we prove that, with such a modification, our algorithm, under certain assumptions, is PAC-MDP in the average loss setting [42].
Definition 5.1 (Modified Wasserstein Temporal Difference). Let Qt be the current Q-posterior and Qb be a zero-mean distribution; given a transition (St, At, St+1, Rt+1), we define the TD-target-posterior as Tt = Rt+1 + γVt(St+1). Let αt, βt ≥ 0 be the learning rates; we define the Modified Wasserstein Temporal Difference (MWTD) update rule as:

    Q̃t+1(St, At) ∈ arg inf_{Q∈Q} { (1 − αt) W2(Q, Q̃t(St, At))² + αt W2(Q, Tt)² },
    Qt+1(St, At) ∈ arg inf_{Q∈Q} { W2(Q, Q̃t+1(St, At) + βt Qb)² }.

We will denote the algorithm employing this update rule as Modified Wasserstein Q-Learning (MWQL). 
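For Gaussian posteriors, the two-step MWTD update can be sketched as below. This is an illustration, not the paper's exact construction: it assumes Qb is an independent zero-mean Gaussian N(0, s_b²), so that adding βtQb amounts to a convolution that only inflates the standard deviation:

```python
import numpy as np

def mwtd_gaussian_update(m, s, v_mean, v_std, reward, gamma, alpha, beta, s_b):
    """Modified WTD (Definition 5.1) sketched for Gaussian posteriors.

    Assumes (hypothetically) that the zero-mean bonus distribution Q_b is
    an independent Gaussian N(0, s_b^2). Returns (mean, std) of Q_{t+1}."""
    # Step 1: W2 barycenter of Q_t(S_t, A_t) and the TD-target posterior T_t
    m_tilde = (1.0 - alpha) * m + alpha * (reward + gamma * v_mean)
    s_tilde = (1.0 - alpha) * s + alpha * gamma * v_std
    # Step 2: add the exploration bonus beta_t * Q_b (sum of independent
    # Gaussians: variances add)
    return m_tilde, float(np.sqrt(s_tilde ** 2 + (beta * s_b) ** 2))
```

With βt = 0, the update reduces to the Gaussian WTD rule of Table 1, matching the "practical" version of the algorithm discussed at the end of Section 5.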
The reason why we need to change the WTD update lies in the fact that the uncertainty on the Q-function value (the Q-posterior) is, as already mentioned, the contribution of two terms: i) the uncertainty on the reward and transition model; ii) the uncertainty on the next-state Q-function. These terms need to be averaged into the Q-posterior at different speeds. If n_t(s, a) is the number of times (s, a) is visited up to time t, (i) has to reduce proportionally to 1/√n_t(s, a), being a sample mean, while (ii) is averaged with coefficients proportional to 1/n_t(s, a). Therefore, we should keep the two sources of uncertainty separated. To this end, we use an additional distribution Qb to prevent the uncertainty from reducing too fast.
The notion of PAC-MDP in the average loss setting [42] is a relaxation of the classical PAC-MDP notion introduced in [24], in which we consider the actual rewards received by the algorithm while learning, instead of the expected values over future policies. We recall the definitions given in [42].
Definition 5.2 (Definition 4 of [42]). Suppose a learning algorithm A is run for T steps. Consider the partial sequence S0, R1, ..., S_{T−1}, R_T, S_T visited by A. The instantaneous loss of the agent at time t is il_A(t) = v*(S_t) − Σ_{i=t}^{T} γ^{i−t} R_{i+1}. The quantity L_A = (1/T) Σ_{t=1}^{T} il_A(t) is called the average loss. Then, a learning algorithm A is PAC-MDP in the average loss setting if, for any ε ≥ 0 and δ ∈ [0, 1], we can choose a value T, polynomial in the relevant quantities (1/ε, 1/δ, |S|, |A|, 1/(1 − γ)), such that the average loss L_A of the agent (following the learning algorithm A) on a trial of T steps is guaranteed to be less than ε with probability at least 1 − δ.
In the following, we will restrict our attention to MWQL with Gaussian posterior, optimistic estimator (OE) and optimistic exploration policy (OX). 
We leave the analysis of posterior sampling exploration (PX) as future work. To prove the main result we need an intermediate result.
Theorem 5.1. Let S0, ..., S_{T−1}, S_T be the sequence of states and actions visited by MWQL with Gaussian posterior, OE and OX. Then, there exist a prior Q0, a zero-mean distribution Qb and a learning rate schedule (αt, βt)_{t≥0} (whose values are reported in Appendix B.1), such that for any δ ∈ [0, 1], with probability at least 1 − δ, it holds that:7

    Σ_{t=1}^{T} [v*(S_t) − v^A(S_t)] ≤ O( (qmax / (1 − γ)^{3/2}) √( |S||A|T log(|S||A|T/δ) ) ),    (6)

where v^A is the value function induced by the (non-stationary) policy played by algorithm A.

From this result, we can exploit an analysis similar to [42] to prove that MWQL with Gaussian posterior, OE and OX is PAC-MDP in the average loss setting.

7This performance index resembles the regret [21]. However, it is a weaker notion, being defined in terms of the trajectory generated by algorithm A, instead of the trajectories of an optimal policy.

Theorem 5.2. 
Under the hypotheses of Theorem 5.1, MWQL with Gaussian posterior, OE and OX is PAC-MDP in the average loss setting, i.e., for any ε ≥ 0 and δ ∈ [0, 1], after

    T = O( (qmax²|S||A| / (ε²(1 − γ)³)) log( qmax²|S|²|A|² / (δε²(1 − γ)³) ) )

steps we have that the average loss L_A ≤ ε with probability at least 1 − δ.
The per-step computational complexity of MWQL is O(log |A|), as we can maintain the upper bounds of the Q-function in a max-priority queue [40], and the space complexity is O(|S||A|).
Despite the theoretical guarantees, MWQL turns out to be often impractical for two main reasons. First, MWQL cannot be extended to continuous MDPs, as αt and βt are defined in terms of the number of visits n(s, a) (Equation (20)), which can only be computed for finite MDPs. Second, like many provably-efficient RL algorithms, MWQL is extremely conservative, leading to very slow convergence. This is why most provably-efficient RL algorithms, when used in practice, are run with non-theoretical values of the hyperparameters. In this sense, WQL can be seen as a "practical" version of MWQL in which αt is treated as a normal hyperparameter and βt = 0.

6 Related Works

A variety of approaches has been proposed in the RL literature to tackle the exploration-exploitation trade-off [44]. We consider only those that do not assume the availability of a simulator of the environment [26]. A first dimension of classification is the RL setting they consider: finite-horizon, discounted or undiscounted. Finite-horizon MDPs are a convenient framework to devise provably-efficient exploration algorithms with theoretical guarantees on the regret [32, 14, 5]. Recently, in [23] it was shown that Q-learning, in the finite-horizon setting, can be made efficient by resorting to suitable exploration bonuses. 
Similar results have been proposed in the in\ufb01nite-horizon undiscounted\ncase. The main challenge of this class of problems is the connection structure of the MDP [7].\nEarly approaches [25, 4, 46, 21] impose restrictive requirements on either mixing/hitting times or\ndiameter, which have been progressively relaxed [19]. A signi\ufb01cant part of the early provably-ef\ufb01cient\nalgorithms considers the discounted setting [25, 11, 41, 45, 27]. However, their theoretical guarantees\nare based on the notion of PAC-MDP [24] rather than on regret.\nAnother relevant dimension is the kind of policy used for exploration. Taking inspiration from the\nMulti Armed Bandit [MAB, 10] framework, two main approaches have been proposed: Optimism\nin the Face of Uncertainty [3] and Thompson Sampling [47]. Most exploration algorithms employ\nthe optimistic technique, selecting actions from the optimal policy of an optimistic approximation\nof the MDP [21] or of the value function directly [41, 23]. Some methods, instead, use a posterior\nsampling approach in which either the entire MDP or a value function is sampled from a (possibly\napproximate) posterior distribution.\nInspired by these methods, numerous practical variants have been devised. Exploration bonuses,\nbased on pseudo-counts [8, 33], mimicking optimism, have been applied with positive results to\ndeep architectures. Likewise, with the idea of approximating a posterior distribution, Bootstrapped\nDQN [30] and Bayesian DQN [6] succeeded in solving challenging Atari games. 
Recently, new\nresults of sample-ef\ufb01ciency beyond tabular domains have been derived [22].\n\n7 Experiments\n\nIn this section, we provide an experimental evaluation of WQL on tabular domains along with some\npreliminary results on Atari games (implementation details are reported in Appendix C).\n\n7.1 Tabular Domains\n\nWe evaluate WQL on a set of RL tasks designed to emphasize exploration: the Taxi problem [15], the\nChain [15], the River Swim [42], and the Six Arms [42]. We extensively test several WQL variants that\ndiffer on: i) the Q-posterior model (Gaussian G-WQL vs particle P-WQL); ii) the exploration strategy\n(optimistic OX vs posterior sampling PX), iii) the estimator of the maximum (ME, OE, and PE).\n\n7\n\n\fFigure 1: Online average return as a function of the number of samples, comparison of P-WQL and\nG-WQL with QL, BQL, Delayed-QL, and MBIE-EB. 10 runs, 95% c.i.\n\nWe compare these combinations with the classic Q-\nlearning [QL, 52] (Boltzmann exploration), Bootstrapped\nQ-learning [BQL, 30] both with the double estimator [48],\nDelayed Q-learning [Delayed-QL, 41] and MBIE-EB [42].8\nFigure 1 shows the online performance on the considered\ntabular tasks. While we tried all the WQL variants, due to\nspace constraints, we show the best combination of explo-\nration strategy and maximum estimator for both Gaussian\nand particle models (complete results are reported in Ap-\npendix C). We can see that WQL learns substantially faster\nthan classical approaches, like QL, in tasks that require sig-\nni\ufb01cant exploration, such as Taxi, Six Arms, or River Swim.\nOur algorithm also outperforms BQL in most tasks, except\nin the River Swim, where performances are not substan-\ntially different. Finally, we can see that across all the tasks\nWQL displays a faster learning curve w.r.t. to Delayed-QL.\nMBIE-EB outperforms WQL in small domains like Chain\nand RiverSwim, but not in SixArms. 
MBIE-EB was not tested on the Taxi domain, as the number of states (∼ 200) makes its computational demands prohibitive. We cross-validated the hyperparameters of Delayed Q-learning and MBIE-EB.

Among the variants of WQL, we discovered that the choice of the exploration strategy and the maximum estimator is highly task-dependent. However, we can see a general pattern across the tasks. As intuition suggests, since the exploration strategy and the maximum estimator are closely related, the best combinations are OX exploration with the OE estimator and PX exploration with the PE estimator. We illustrate in Figure 2 all the possible combinations of G-WQL on Six Arms, a domain in which exploration is essential. We can notice that the "hybrid" combinations, such as OX with PE and PX with OE, are significantly outperformed by the more "coherent" ones.

Figure 2: Online average return as a function of the number of samples for the different versions of the G-WQL algorithm. 10 runs, 95% c.i.

7.2 Atari Games

We adapted WQL with the particle model for use with deep architectures. For this purpose, similarly to Bootstrapped DQN [BDQN, 30], we use a network architecture with a head for each particle, while the convolutional layers are shared among them. We compare the resulting algorithm, which we call Particle DQN (PDQN), with Double DQN [DDQN, 49], a classic benchmark in Deep RL, and Bootstrapped DQN, which is specifically designed for deep exploration using Q-posteriors. To compare the algorithms, we consider offline scores, i.e., the scores collected using the current greedy policy. The goal of this experiment, conducted on three Atari games, is to show that WQL, although designed to work in finite environments, can easily be extended to deep networks with potentially good results.
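The shared-trunk, multi-head layout described above can be sketched in a few lines of pure Python. This is only an illustrative sketch: the feature map, head weights, and sizes below are hypothetical stand-ins for the shared convolutional layers and trained heads of the actual implementation; it shows how M particle heads read out of common features and how a greedy policy can act on the particle mean.

```python
import random

M = 10          # number of particles (heads); hypothetical value
N_ACTIONS = 4   # hypothetical action-space size
N_FEATURES = 3  # hypothetical feature dimension

def shared_features(state):
    # Stand-in for the shared convolutional trunk: any fixed
    # state -> feature-vector map works for this sketch.
    return [state * w for w in (0.5, -1.0, 2.0)]

# One linear read-out per particle; weights are random placeholders
# here, whereas in training each head would fit its own target.
random.seed(0)
heads = [[random.uniform(-1, 1) for _ in range(N_FEATURES * N_ACTIONS)]
         for _ in range(M)]

def head_q_values(head, feats):
    # Q_j(s, .) = linear read-out of the shared features.
    return [sum(f * head[a * N_FEATURES + i] for i, f in enumerate(feats))
            for a in range(N_ACTIONS)]

def greedy_action(state):
    # Greedy (offline-score) policy: act on the mean of the M particles.
    feats = shared_features(state)
    qs = [head_q_values(h, feats) for h in heads]          # M x N_ACTIONS
    mean_q = [sum(q[a] for q in qs) / M for a in range(N_ACTIONS)]
    return max(range(N_ACTIONS), key=mean_q.__getitem__)

a = greedy_action(1.0)
```

Sharing the trunk keeps the per-particle cost to one extra linear head, which is what makes maintaining M Q-estimates affordable in a deep architecture.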
In Figure 3, we can see that PDQN, compared to BDQN and DDQN, manages to achieve higher scores in Asterix and Enduro, where exploration is needed, while achieving similar scores in Breakout.

8We are considering a discounted setting; thus, several provably efficient algorithms, like UCRL2 [21], PSRL [32], RLSVI [30], optimistic Q-learning [23], and UCBVI [5], cannot be compared, as they consider either the average-reward or the finite-horizon setting.

Algorithm 2: Particle DQN.
Input: a prior distribution {x_i}_{i=1}^M, a step-size schedule (α_t)_{t≥0}, an exploration policy schedule (π_t)_{t≥0}
1: Initialize a Q-function network with M outputs {Q_j}_{j=1}^M and parameters θ, and the target network with parameters θ⁻ = θ
2: for t = 1, 2, ... do
3:   Take action A_t ∼ π_t(·|S_t; θ)
4:   Store the transition (S_t, A_t, S_{t+1}, R_{t+1}) in the replay buffer
5:   Sample a random batch of transitions (S_l, A_l, S_{l+1}, R_{l+1}) from the replay buffer
6:   Compute the targets y_j(S_{l+1}) = R_{l+1} + γ E_{A∼π(·|S_{l+1})}[Q_j(S_{l+1}, A; θ⁻)] for each output Q_j, where π ∈ {π_M, π_O, π_P} as in Section 3.3
7:   Perform a gradient descent step w.r.t. θ on the objective Σ_{j=1}^M (y_j(S_{l+1}) − Q_j(S_l, A_l; θ))², using the step size α_t
8:   Periodically update the target network: θ⁻ = θ
9: end for

Figure 3: Offline average return of the greedy policy as a function of the number of collected frames, comparing PDQN, DDQN, and BDQN on the Asterix, Enduro, and Breakout games. 5 runs, 95% c.i.
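The target and objective computations of Algorithm 2 (lines 6-7) can be made concrete with a minimal pure-Python sketch for a single transition with two actions. All numbers and the particular choice of π (here a "mean" policy, greedy w.r.t. the particle average, used as a stand-in for the policies of Section 3.3) are hypothetical, not taken from the paper's implementation.

```python
GAMMA = 0.99  # discount factor (hypothetical value)
M = 4         # number of particle heads

# Hypothetical Q-values for one transition (S_l, A_l, S_{l+1}, R_{l+1}):
q_next = [            # Q_j(S_{l+1}, a; theta^-), one row per head j
    [1.0, 2.0],
    [1.5, 1.5],
    [0.5, 3.0],
    [2.0, 1.0],
]
q_taken = [0.8, 1.2, 1.1, 0.9]  # Q_j(S_l, A_l; theta), one per head
reward = 1.0                     # R_{l+1}

# A stand-in exploration policy pi: greedy w.r.t. the particle mean.
mean_q = [sum(q[a] for q in q_next) / M for a in range(2)]
greedy = max(range(2), key=mean_q.__getitem__)
pi = [1.0 if a == greedy else 0.0 for a in range(2)]

# Line 6: per-head targets y_j = R + gamma * E_{A~pi}[Q_j(S_{l+1}, A)].
targets = [reward + GAMMA * sum(pi[a] * q[a] for a in range(2))
           for q in q_next]

# Line 7: the objective sum_j (y_j - Q_j(S_l, A_l))^2 that the
# gradient step on theta minimizes.
loss = sum((y - q) ** 2 for y, q in zip(targets, q_taken))
```

Note that each head gets its own target built from its own next-state estimate, so the spread among the M heads (the Q-posterior) is propagated through the TD update rather than collapsed into a single value.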
A relevant feature of PDQN is the particle initialization interval. Indeed, a narrower initial interval causes faster learning but might lead to premature convergence. In this sense, the initial interval becomes a hyperparameter of PDQN, which influences the amount of exploration and is likely task-dependent. The pseudocode of PDQN is shown in Algorithm 2.

8 Discussion and Conclusions

In this paper, we presented a novel RL algorithm, Wasserstein Q-Learning (WQL), which addresses several issues related to efficient exploration in model-free RL. We discussed how to model the uncertainty of the estimated Q-function by means of approximate posterior distributions (Q-posteriors). Then, we devised a variational method to propagate uncertainty across state-action pairs when performing TD learning, based on Wasserstein barycenters. The experimental evaluation allowed us to appreciate the properties of WQL. In tabular domains, whenever exploration is really necessary, our approach significantly outperforms TD methods, even those designed specifically for exploration (e.g., Bootstrapped Q-learning and Delayed Q-learning). Although preliminary, the results on the Atari games are promising and need to be further investigated in future work to make WQL scale to complex environments. We believe that our algorithm contributes to bridging the gap between theory and practice of exploration in RL. WQL is a theoretically grounded method, equipped with guarantees in the average loss setting, but, at the same time, it is a very simple algorithm, easily extensible to deal with continuous domains.

References

[1] Mohammed Amin Abdullah, Hang Ren, Haitham Bou Ammar, Vladimir Milenkovic, Rui Luo, Mingtian Zhang, and Jun Wang. Wasserstein robust reinforcement learning.
arXiv preprint arXiv:1907.13196, 2019.

[2] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904-924, 2011.

[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

[4] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49-56, 2007.

[5] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 263-272, 2017.

[6] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. arXiv preprint arXiv:1802.04412, 2018.

[7] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35-42. AUAI Press, 2009.

[8] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471-1479, 2016.

[9] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning.
In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 449-458. PMLR, 2017.

[10] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability). London: Chapman and Hall, 5:71-87, 1985.

[11] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2002.

[12] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1104-1113. PMLR, 2018.

[13] Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2892-2901. AAAI Press, 2018.

[14] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818-2826, 2015.

[15] Richard Dearden, Nir Friedman, and Stuart J. Russell. Bayesian Q-learning.
In Jack Mostow and Chuck Rich, editors, Proceedings of the Fifteenth National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, AAAI 98, IAAI 98, July 26-30, 1998, Madison, Wisconsin, USA, pages 761-768. AAAI Press / The MIT Press, 1998.

[16] Carlo D'Eramo, Alessandro Nuara, Matteo Pirotta, and Marcello Restelli. Estimating the maximum expected value in continuous reinforcement learning problems. In Satinder P. Singh and Shaul Markovitch, editors, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 1840-1846. AAAI Press, 2017.

[17] Carlo D'Eramo, Marcello Restelli, and Alessandro Nuara. Estimating maximum expected value through Gaussian approximation. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1032-1040. JMLR.org, 2016.

[18] D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450-455, 1982.

[19] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1573-1581. PMLR, 2018.

[20] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian reinforcement learning: A survey.
Foundations and Trends® in Machine Learning, 8(5-6):359-483, 2015.

[21] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.

[22] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1704-1713. JMLR.org, 2017.

[23] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4868-4878. Curran Associates, Inc., 2018.

[24] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, London, England, 2003.

[25] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

[26] Sven Koenig and Reid G. Simmons. Complexity analysis of real-time reinforcement learning applied to finding shortest paths in deterministic domains. Technical report, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, 1992.

[27] Tor Lattimore and Marcus Hutter. Near-optimal PAC bounds for discounted MDPs. Theoretical Computer Science, 558:125-143, 2014.

[28] Brendan O'Donoghue, Ian Osband, Rémi Munos, and Vlad Mnih. The uncertainty Bellman equation and exploration. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3839-3848, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.

[29] Ian Osband, John Aslanides, and Albin Cassirer.
Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 8626-8638, 2018.

[30] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4026-4034, 2016.

[31] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2377-2386. JMLR.org, 2016.

[32] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3003-3011, 2013.

[33] Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.

[34] Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Anna Choromanska, Krzysztof Choromanski, and Michael I. Jordan. Behavior-guided reinforcement learning. arXiv preprint arXiv:1906.04349, 2019.

[35] Jason Pazis, Ronald Parr, and Jonathan P. How. Improving PAC exploration using the median of means.
In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3891-3899, 2016.

[36] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[37] Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. In Amos J. Storkey and Fernando Pérez-Cruz, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pages 29-37. PMLR, 2018.

[38] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering, Cambridge, England, 1994.

[39] Satinder P. Singh, Tommi S. Jaakkola, Michael L. Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287-308, 2000.

[40] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413-2444, 2009.

[41] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 881-888, 2006.

[42] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309-1331, 2008.

[43] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[44] Csaba Szepesvári.
Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1-103, 2010.

[45] Istvan Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 1031-1038, 2010.

[46] Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems, pages 1505-1512, 2008.

[47] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

[48] Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613-2621, 2010.

[49] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Dale Schuurmans and Michael P. Wellman, editors, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2094-2100. AAAI Press, 2016.

[50] Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of Expected Sarsa. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL '09. IEEE Symposium on, pages 177-184. IEEE, 2009.

[51] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[52] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards.
PhD thesis, King's College, Cambridge, 1989.