{"title": "Better Exploration with Optimistic Actor Critic", "book": "Advances in Neural Information Processing Systems", "page_first": 1787, "page_last": 1798, "abstract": "Actor-critic methods, a type of model-free Reinforcement Learning, have been successfully applied to challenging tasks in continuous control, often achieving state-of-the-art performance. However, wide-scale adoption of these methods in real-world domains is made difficult by their poor sample efficiency. We address this problem both theoretically and empirically. On the theoretical side, we identify two phenomena preventing efficient exploration in existing state-of-the-art algorithms such as Soft Actor Critic. First, combining a greedy actor update with a pessimistic estimate of the critic leads to the avoidance of actions that the agent does not know about, a phenomenon we call pessimistic underexploration. Second, current algorithms are directionally uninformed, sampling actions with equal probability in opposite directions from the current mean. This is wasteful, since we typically need actions taken along certain directions much more than others. To address both of these phenomena, we introduce a new algorithm, Optimistic Actor Critic, which approximates a lower and upper confidence bound on the state-action value function. This allows us to apply the principle of optimism in the face of uncertainty to perform directed exploration using the upper bound while still using the lower bound to avoid overestimation. 
We evaluate OAC in several challenging continuous control tasks, achieving state-of-the-art sample efficiency.", "full_text": "Better Exploration with Optimistic Actor-Critic\n\nKamil Ciosek\n\nQuan Vuong*\n\nMicrosoft Research Cambridge, UK\nkamil.ciosek@microsoft.com\n\nUniversity of California San Diego\n\nqvuong@ucsd.edu\n\nRobert Loftin\n\nKatja Hofmann\n\nMicrosoft Research Cambridge, UK\n\nt-roloft@microsoft.com\n\nMicrosoft Research Cambridge, UK\nkatja.hofmann@microsoft.com\n\nAbstract\n\nActor-critic methods, a type of model-free Reinforcement Learning, have been successfully applied to challenging tasks in continuous control, often achieving state-of-the-art performance. However, wide-scale adoption of these methods in real-world domains is made difficult by their poor sample efficiency. We address this problem both theoretically and empirically. On the theoretical side, we identify two phenomena preventing efficient exploration in existing state-of-the-art algorithms such as Soft Actor Critic. First, combining a greedy actor update with a pessimistic estimate of the critic leads to the avoidance of actions that the agent does not know about, a phenomenon we call pessimistic underexploration. Second, current algorithms are directionally uninformed, sampling actions with equal probability in opposite directions from the current mean. This is wasteful, since we typically need actions taken along certain directions much more than others. To address both of these phenomena, we introduce a new algorithm, Optimistic Actor Critic, which approximates a lower and upper confidence bound on the state-action value function. This allows us to apply the principle of optimism in the face of uncertainty to perform directed exploration using the upper bound while still using the lower bound to avoid overestimation. 
We evaluate OAC in several challenging continuous control tasks, achieving state-of-the-art sample efficiency.\n\n1 Introduction\n\nA major obstacle that impedes a wider adoption of actor-critic methods [31, 40, 49, 44] for control tasks is their poor sample efficiency. In practice, despite impressive recent advances [24, 17], millions of environment interactions are needed to obtain a reasonably performant policy for control problems with moderate complexity. In systems where obtaining samples is expensive, this often makes the deployment of these algorithms prohibitively costly.\n\nThis paper aims at mitigating this problem by more efficient exploration. We begin by examining the exploration behavior of SAC [24] and TD3 [17], two recent model-free algorithms with state-of-the-art sample efficiency, and make two observations. First, in order to avoid overestimation [26, 46], SAC and TD3 use a critic that computes an approximate lower confidence bound2. The actor then adjusts the exploration policy to maximize this lower bound. This improves the stability of the updates and allows the use of larger learning rates. However, using the lower bound can also seriously inhibit exploration if it is far from the true Q-function. If the lower bound has a spurious maximum, the covariance of the policy will decrease, causing pessimistic underexploration, i.e. discouraging the algorithm from sampling actions that would lead to an improvement to the flawed estimate of the critic.\n\n*Work done while an intern at Microsoft Research, Cambridge.\n2See Appendix C for details.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nMoreover, Gaussian policies are directionally uninformed, sampling actions with equal probability in any two opposing directions from the mean. 
This is wasteful since some regions in the action space close to the current policy are likely to have already been explored by past policies and do not require more samples.\n\nWe formulate Optimistic Actor-Critic (OAC), an algorithm which explores more efficiently by applying the principle of optimism in the face of uncertainty [9]. OAC uses an off-policy exploration strategy that is adjusted to maximize an upper confidence bound to the critic, obtained from an epistemic uncertainty estimate on the Q-function computed with the bootstrap [35]. OAC avoids pessimistic underexploration because it uses an upper bound to determine exploration covariance. Because the exploration policy is not constrained to have the same mean as the target policy, OAC is directionally informed, reducing the waste arising from sampling parts of action space that have already been explored by past policies.\n\nOff-policy Reinforcement Learning is known to be prone to instability when combined with function approximation, a phenomenon known as the deadly triad [43, 47]. OAC achieves stability by enforcing a KL constraint between the exploration policy and the target policy. Moreover, similarly to SAC and TD3, OAC mitigates overestimation by updating its target policy using a lower confidence bound of the critic [26, 46].\n\nEmpirically, we evaluate Optimistic Actor Critic in several challenging continuous control tasks and achieve state-of-the-art sample efficiency on the Humanoid benchmark. We perform ablations and isolate the effect of bootstrapped uncertainty estimates on performance. Moreover, we perform hyperparameter ablations and demonstrate that OAC is stable in practice.\n\n2 Preliminaries\n\nReinforcement learning (RL) aims to learn optimal behavior policies for an agent acting in an environment with a scalar reward signal. Formally, we consider a Markov decision process [39], defined as a tuple (S, A, R, p, p0, γ). 
An agent observes an environmental state s ∈ S = R^n; takes a sequence of actions a_1, a_2, ..., where a_t ∈ A ⊆ R^d; transitions to the next state s' ∼ p(·|s, a) under the state transition distribution p(s'|s, a); and receives a scalar reward r ∈ R. The agent's initial state s_0 is distributed as s_0 ∼ p_0(·).\n\nA policy π can be used to generate actions a ∼ π(·|s). Using the policy to sequentially generate actions allows us to obtain a trajectory through the environment τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...). For any given policy, we define the action-value function as $Q^\pi(s, a) = \mathbb{E}_{\tau : s_0 = s, a_0 = a}\big[\sum_t \gamma^t r_t\big]$, where γ ∈ [0, 1) is a discount factor. We assume that Q^π(s, a) is differentiable with respect to the action.\n\nThe objective of Reinforcement Learning is to find a deployment policy π_eval which maximizes the total return $J = \mathbb{E}_{\tau : s_0 \sim p_0}\big[\sum_t \gamma^t r_t\big]$. In order to provide regularization and aid exploration, most actor-critic algorithms [24, 17, 31] do not adjust π_eval directly. Instead, they use a target policy π_T, trained to have high entropy in addition to maximizing the expected return J.3 The deployment policy π_eval is typically deterministic and set to the mean of the stochastic target policy π_T.\n\nActor-critic methods [44, 6, 8, 7] seek a locally optimal target policy π_T by maintaining a critic, learned using a value-based method, and an actor, adjusted using a policy gradient update. The critic is learned with a variant of SARSA [48, 43, 41]. In order to limit overestimation [26, 46], modern actor-critic methods learn an approximate lower confidence bound on the Q-function [24, 17], obtained by using two networks Q̂^1_LB and Q̂^2_LB, which have identical structure, but are initialized with different weights. 
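As a minimal Python sketch of this two-network construction (scalar inputs, hypothetical function names, not the authors' code): the pessimistic estimate is the pointwise minimum of the two critic bootstraps, and the same clipped minimum over the target networks appears in the shared TD target.

```python
def q_lower_bound(q1, q2):
    # Pessimistic estimate: pointwise minimum of the two critic bootstraps.
    return min(q1, q2)

def td_target(reward, gamma, q1_target_next, q2_target_next):
    # Shared bootstrapped target: reward plus the discounted clipped minimum
    # of the two target networks evaluated at the next state-action pair.
    return reward + gamma * min(q1_target_next, q2_target_next)
```

Both critics regress towards the same `td_target`, which is what makes the pair a cheap bootstrap rather than two independent estimators.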
In order to avoid cumbersome terminology, we refer to Q̂_LB simply as a lower bound in the remainder of the paper. Another set of target networks [33, 31] slowly tracks the values of Q̂_LB in order to improve stability.\n\n$$\hat{Q}_{LB}(s_t, a_t) = \min\big(\hat{Q}^1_{LB}(s_t, a_t), \hat{Q}^2_{LB}(s_t, a_t)\big) \quad (1)$$\n\n$$\hat{Q}^{\{1,2\}}_{LB}(s_t, a_t) \leftarrow R(s_t, a_t) + \gamma \min\big(\breve{Q}^1_{LB}(s_{t+1}, a), \breve{Q}^2_{LB}(s_{t+1}, a)\big), \text{ where } a \sim \pi_T(\cdot|s_{t+1}). \quad (2)$$\n\n3Policy improvement results can still be obtained with the entropy term present, in a certain idealized setting [24].\n\nFigure 1: Exploration inefficiencies in actor-critic methods. The state s is fixed. The graph shows Q^π, which is unknown to the algorithm, its known lower bound Q̂_LB (in red) and two policies π_current and π_past at different time-steps of the algorithm (in blue). (a) Pessimistic underexploration. (b) Directional uninformedness.\n\nMeanwhile, the actor adjusts the policy parameter vector θ of the policy π_T in order to maximize J by following its gradient. The gradient can be written in several forms [44, 40, 13, 27, 22, 23]. Recent actor-critic methods use a reparametrised policy gradient [27, 22, 23]. We denote a random variable sampled from a standard multivariate Gaussian as ε ∼ N(0, I) and denote the standard normal density as φ(ε). The re-parametrisation function f is defined such that the probability density of the random variable f_θ(s, ε) is the same as the density of π_T(a|s), where ε ∼ N(0, I). 
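As a concrete illustration of the reparametrisation just described (a one-dimensional Gaussian sketch with names of our choosing): f_θ(s, ε) = μ + σε has the same density as N(μ, σ²), so expectations under the policy can be estimated from standard-normal samples.

```python
import random

def f_theta(mu, sigma, eps):
    # Gaussian reparametrisation: mu + sigma * eps is distributed as N(mu, sigma^2),
    # so gradients w.r.t. mu and sigma can flow through the sample.
    return mu + sigma * eps

# Monte-Carlo estimate of E_{a ~ N(mu, sigma^2)}[a^2] via reparametrised samples.
random.seed(0)
mu, sigma = 1.0, 0.5
samples = [f_theta(mu, sigma, random.gauss(0.0, 1.0)) for _ in range(20000)]
estimate = sum(a * a for a in samples) / len(samples)  # close to mu^2 + sigma^2 = 1.25
```

The same trick, with a network producing μ and σ per state, is what lets the actor differentiate through the sampled action.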
The gradient of the return can then be written as:\n\n$$\nabla_\theta J = \int_s \rho(s) \int_\varepsilon \nabla_\theta \hat{Q}_{LB}\big(s, f_\theta(s, \varepsilon)\big) \phi(\varepsilon)\, d\varepsilon\, ds \quad (3)$$\n\nwhere $\rho(s) \triangleq \sum_{t=0}^{\infty} \gamma^t p(s_t = s | s_0)$ is the discounted-ergodic occupancy measure. In order to provide regularization and encourage exploration, it is common to use a gradient ∇_θ J^α that adds an additional entropy term ∇_θ H(π(·, s)):\n\n$$\nabla_\theta J^\alpha_{\hat{Q}_{LB}} = \int_s \rho(s) \Big( \int_\varepsilon \nabla_\theta \hat{Q}_{LB}\big(s, f_\theta(s, \varepsilon)\big) \phi(\varepsilon)\, d\varepsilon + \alpha \underbrace{\int_\varepsilon -\nabla_\theta \log f_\theta(s, \varepsilon) \phi(\varepsilon)\, d\varepsilon}_{\nabla_\theta H(\pi(\cdot, s))} \Big)\, ds \quad (4)$$\n\nDuring training, (4) is approximated with samples by replacing integration over ε with Monte-Carlo estimates and integration over the state space with a sum along the trajectory.\n\n$$\nabla_\theta J^\alpha_{\hat{Q}_{LB}} \approx \nabla_\theta \hat{J}^\alpha_{\hat{Q}_{LB}} = \sum_{t=0}^{N} \gamma^t \big( \nabla_\theta \hat{Q}_{LB}(s_t, f_\theta(s_t, \varepsilon_t)) + \alpha (-\nabla_\theta \log f_\theta(s_t, \varepsilon_t)) \big). \quad (5)$$\n\nIn the standard set-up, actions used in (1) and (5) are generated using π_T. In the table-lookup case, the update can be reliably applied off-policy, using an action generated with a separate exploration policy π_E. In the function approximation setting, this leads to updates that can be biased because of the changes to ρ(s). In this work, we address these issues by imposing a KL constraint between the exploration policy and the target policy. We give a more detailed account of addressing the associated stability issues in section 4.3.\n\n3 Existing Exploration Strategy is Inefficient\n\nAs mentioned earlier, modern actor-critic methods such as SAC [24] and TD3 [17] explore in an inefficient way. 
We now give more details about the phenomena that lead to this inefficiency.\n\nPessimistic underexploration. In order to improve sample efficiency by preventing the catastrophic overestimation of the critic [26, 46], SAC and TD3 [17, 25, 24] use a lower bound approximation to the critic, similar to (1). However, relying on this lower bound for exploration is inefficient. By greedily maximizing the lower bound, the policy becomes very concentrated near a maximum. When the critic is inaccurate and the maximum is spurious, this can be very harmful. This is illustrated in Figure 1a. At first, the agent explores with a broad policy, denoted π_past. Since Q̂_LB increases to the left, the policy gradually moves in that direction, becoming π_current. Because Q̂_LB (shown in red) has a maximum at the mean μ of π_current, the policy π_current has a small standard deviation. This is suboptimal since we need to sample actions far away from the mean to find out that the true critic Q^π does not have a maximum at μ. We include evidence that this problem actually happens in MuJoCo Ant in Appendix F.\n\nThe phenomenon of underexploration is specific to the lower as opposed to an upper bound. An upper bound which is too large in certain areas of the action space encourages the agent to explore them and correct the critic, akin to optimistic initialization in the tabular setting [42, 43]. We include more intuition about the difference between the upper and lower bound in Appendix I. Due to overestimation, we cannot address pessimistic underexploration by simply using the upper bound in the actor [17]. Instead, recent algorithms have used an entropy term (4) in the actor update. While this helps exploration somewhat by preventing the covariance from collapsing to zero, it does not address the core issue that we need to explore more around a spurious maximum. 
We propose a more effective solution in section 4.\n\nDirectional uninformedness. Actor-critic algorithms that use Gaussian policies, like SAC [25] and TD3 [17], sample actions in opposite directions from the mean with equal probability. However, in a policy gradient algorithm, the current policy will have been obtained by incremental updates, which means that it won't be very different from recent past policies. Therefore, exploration in both directions is wasteful, since the parts of the action space where past policies had high density are likely to have already been explored. This phenomenon is shown in Figure 1b. Since the policy π_current is Gaussian and symmetric around the mean, it is equally likely to sample actions to the left and to the right. However, while sampling to the left would be useful for learning an improved critic, sampling to the right is wasteful, since the critic estimate in that part of the action space is already good enough. In section 4, we address this issue by using an exploration policy shifted relative to the target policy.\n\n4 Better Exploration with Optimism\n\nOptimistic Actor Critic (OAC) is based on the principle of optimism in the face of uncertainty [50]. Inspired by recent theoretical results about efficient exploration in model-free RL [28], OAC obtains an exploration policy π_E which locally maximizes an approximate upper confidence bound of Q^π each time the agent enters a new state. The policy π_E is separate from the target policy π_T learned using (5) and is used only to sample actions in the environment. Formally, the exploration policy π_E = N(μ_E, Σ_E) is defined as\n\n$$\mu_E, \Sigma_E = \operatorname*{arg\,max}_{\mu, \Sigma :\, \mathrm{KL}(N(\mu, \Sigma), N(\mu_T, \Sigma_T)) \le \delta} \mathbb{E}_{a \sim N(\mu, \Sigma)}\big[ \bar{Q}_{UB}(s, a) \big]. \quad (6)$$\n\nBelow, we derive the OAC algorithm formally. 
We begin by obtaining the upper bound Q̄_UB(s, a) (section 4.1). We then motivate the optimization problem (6), in particular the use of the KL constraint (section 4.2). Finally, in section 4.3, we describe the OAC algorithm and outline how it mitigates pessimistic underexploration and directional uninformedness while still maintaining the stability of learning. In Section 4.4, we compare OAC to related work. In Appendix B, we derive an alternative variant of OAC that works with deterministic policies.\n\n4.1 Obtaining an Upper Bound\n\nThe approximate upper confidence bound Q̄_UB used by OAC is derived in three stages. First, we obtain an epistemic uncertainty estimate σ_Q about the true state-action value function Q. We then use it to define an upper bound Q̂_UB. Finally, we introduce its linear approximation Q̄_UB, which allows us to obtain a tractable algorithm.\n\nEpistemic uncertainty For computational efficiency, we use a Gaussian distribution to model epistemic uncertainty. We fit mean and standard deviation based on bootstraps [16] of the critic. The mean belief is defined as $\mu_Q(s, a) = \tfrac{1}{2}\big(\hat{Q}^1_{LB}(s, a) + \hat{Q}^2_{LB}(s, a)\big)$, while the standard deviation is\n\n$$\sigma_Q(s, a) = \sqrt{\sum_{i \in \{1,2\}} \tfrac{1}{2}\Big(\hat{Q}^i_{LB}(s, a) - \mu_Q(s, a)\Big)^2} = \tfrac{1}{2}\Big|\hat{Q}^1_{LB}(s, a) - \hat{Q}^2_{LB}(s, a)\Big|. \quad (7)$$\n\nHere, the second equality is derived in appendix C. The bootstraps are obtained using (1). Since existing algorithms [24, 17] already maintain two bootstraps, we can obtain μ_Q and σ_Q at negligible computational cost. Despite the fact that (1) uses the same target value for both bootstraps, we demonstrate in Section 5 that using a two-network bootstrap leads to a large performance improvement in practice. 
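A minimal Python sketch of the Gaussian epistemic-uncertainty fit in (7) (function name ours): with exactly two bootstraps, the standard deviation collapses to half the absolute gap between the critics.

```python
import math

def epistemic_gaussian(q1, q2):
    # Mean belief: average of the two critic bootstraps.
    mu_q = 0.5 * (q1 + q2)
    # Std of the two-point Gaussian fit; for two bootstraps this
    # equals 0.5 * |q1 - q2| (the second equality in Eq. 7).
    sigma_q = math.sqrt(sum(0.5 * (q - mu_q) ** 2 for q in (q1, q2)))
    return mu_q, sigma_q
```

The gap between the two critics thus serves directly as the uncertainty signal, at no extra cost beyond what the clipped double-Q setup already maintains.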
Moreover, OAC can be easily extended to use more expensive and better uncertainty estimates if required.\n\nUpper bound. Using the uncertainty estimate (7), we define the upper bound as Q̂_UB(s, a) = μ_Q(s, a) + β_UB σ_Q(s, a). We use the parameter β_UB ∈ R+ to fix the level of optimism. In order to obtain a tractable algorithm, we approximate Q̂_UB with a linear function Q̄_UB.\n\n$$\bar{Q}_{UB}(s, a) = a^\top \big[\nabla_a \hat{Q}_{UB}(s, a)\big]_{a = \mu_T} + \mathrm{const} \quad (8)$$\n\nBy Taylor's theorem, Q̄_UB(s, a) is the best possible linear fit to Q̂_UB(s, a) in a sufficiently small region near the current policy mean μ_T for any fixed state s [10, Theorem 3.22]. Since the gradient $[\nabla_a \hat{Q}_{UB}(s, a)]_{a = \mu_T}$ is computationally similar to the lower-bound gradients in (5), our upper bound estimate can be easily obtained in practice without additional tuning.\n\n4.2 Optimistic Exploration\n\nOur exploration policy π_E, introduced in (6), trades off between two criteria: the maximization of an upper bound Q̄_UB(s, a), defined in (8), which increases our chances of executing informative actions, according to the principle of optimism in the face of uncertainty [9], and constraining the maximum KL divergence between the exploration policy and the target policy π_T, which ensures the stability of updates. The KL constraint in (6) is crucial for two reasons. First, it guarantees that the exploration policy π_E is not very different from the target policy π_T. This allows us to preserve the stability of optimization and makes it less likely that we take catastrophically bad actions, ending the episode and preventing further learning. Second, it makes sure that the exploration policy remains within the action range where the approximate upper bound Q̄_UB is accurate. 
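Before stating the closed-form solution, the two bound constructions from section 4.1 can be sketched in a few lines (our names, scalar β_UB; a sketch, not the authors' implementation): the optimistic estimate shifts the mean belief upward, and the linear model of (8) is just an inner product with the gradient at the policy mean.

```python
def q_upper_bound(mu_q, sigma_q, beta_ub):
    # Optimistic estimate: mean belief shifted up by beta_UB standard deviations.
    return mu_q + beta_ub * sigma_q

def linear_q_ub(action, grad_at_mu, const=0.0):
    # First-order (Taylor) model of Q_UB around the policy mean:
    # a^T [grad_a Q_UB]_{a = mu_T} + const, as in Eq. (8).
    return sum(a_i * g_i for a_i, g_i in zip(action, grad_at_mu)) + const
```

Because the model is linear in the action, maximizing it under a Gaussian KL ball admits the closed form given next.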
We chose the KL divergence over other similarity measures for probability distributions since it leads to tractable updates. Thanks to the linear form of Q̄_UB and because both π_E and π_T are Gaussian, the maximization in (6) can be solved in closed form. We state the solution below.\n\nProposition 1. The exploration policy resulting from (6) has the form π_E = N(μ_E, Σ_E), where\n\n$$\mu_E = \mu_T + \frac{\sqrt{2\delta}}{\left\| \big[\nabla_a \hat{Q}_{UB}(s, a)\big]_{a = \mu_T} \right\|_{\Sigma_T}} \, \Sigma_T \big[\nabla_a \hat{Q}_{UB}(s, a)\big]_{a = \mu_T} \quad \text{and} \quad \Sigma_E = \Sigma_T. \quad (9)$$\n\nWe stress that the covariance of the exploration policy is the same as the target policy. The proof is deferred to Appendix A.\n\n4.3 The Optimistic Actor-Critic Algorithm\n\nOptimistic Actor Critic (see Algorithm 1) samples actions using the exploration policy (9) in line 4 and stores it in a memory buffer. The term $[\nabla_a \hat{Q}_{UB}(s, a)]_{a = \mu_T}$ in (9) is computed at minimal cost4 using automatic differentiation, analogous to the critic derivative in the actor update (4). OAC then uses its memory buffer to train the critic (line 10) and the actor (line 12). We also introduced a modification of the lower bound used in the actor, using Q̂'_LB = μ_Q(s, a) + β_LB σ_Q(s, a), allowing us to use more conservative policy updates. The critic (1) is recovered by setting β_LB = −1.\n\nOAC avoids the pitfalls of greedy exploration Figure 2 illustrates OAC's exploration policy π_E visually. Since the policy π_E is far from the spurious maximum of Q̂_LB (red line in figure 2), executing actions sampled from π_E leads to a quick correction to the critic estimate. This way, OAC avoids pessimistic underexploration. 
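A small numpy sketch of this closed form (function name ours): the mean is shifted along Σ_T g, scaled so that the KL constraint is met with equality.

```python
import numpy as np

def exploration_mean(mu_t, cov_t, grad, delta):
    # Proposition 1: mu_E = mu_T + sqrt(2*delta) / ||g||_{Sigma_T} * Sigma_T g,
    # where g = [grad_a Q_UB]_{a = mu_T} and ||g||_{Sigma_T} = sqrt(g^T Sigma_T g).
    cov_grad = cov_t @ grad
    norm = np.sqrt(grad @ cov_grad)
    return mu_t + (np.sqrt(2.0 * delta) / norm) * cov_grad

# With this scaling, KL(N(mu_E, Sigma_T), N(mu_T, Sigma_T))
# = 0.5 * (mu_E - mu_T)^T Sigma_T^{-1} (mu_E - mu_T) equals delta exactly.
```

Plugging the shift back into the Gaussian KL confirms that the constraint in (6) is tight, which is why the covariance can stay equal to Σ_T.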
Since π_E is not symmetric with respect to the mean of π_T (dashed line), OAC also avoids directional uninformedness.\n\n4In practice, the per-iteration wall clock time it takes to run OAC is the same as SAC.\n\nAlgorithm 1 Optimistic Actor-Critic (OAC).\nRequire: w1, w2, θ ⊲ Initial parameters w1, w2 of the critic and θ of the target policy π_T\n1: w̆1 ← w1, w̆2 ← w2, D ← ∅ ⊲ Initialize target network weights and replay pool\n2: for each iteration do\n3:   for each environment step do\n4:     a_t ∼ π_E(a_t|s_t) ⊲ Sample action from exploration policy as in (9)\n5:     s_{t+1} ∼ p(s_{t+1}|s_t, a_t) ⊲ Sample transition from the environment\n6:     D ← D ∪ {(s_t, a_t, R(s_t, a_t), s_{t+1})} ⊲ Store the transition in the replay pool\n7:   end for\n8:   for each training step do\n9:     for i ∈ {1, 2} do ⊲ Update two bootstraps of the critic\n10:      update w_i with ∇̂_{w_i} ‖Q̂^i_LB(s_t, a_t) − R(s_t, a_t) − γ min(Q̆^1_LB(s_{t+1}, a), Q̆^2_LB(s_{t+1}, a))‖²₂\n11:    end for\n12:    update θ with ∇_θ Ĵ^α_{Q̂'_LB} ⊲ Policy gradient update\n13:    w̆1 ← τ w1 + (1 − τ) w̆1, w̆2 ← τ w2 + (1 − τ) w̆2 ⊲ Update target networks\n14:  end for\n15: end for\nOutput: w1, w2, θ ⊲ Optimized parameters\n\nFigure 2: The OAC exploration policy π_E avoids pessimistic underexploration by sampling far from the spurious maximum of the lower bound Q̂_LB. Since π_E is not symmetric wrt. 
the mean of the\ntarget policy (dashed line), it also addresses directional uninformedness.\n\nStability While off-policy deep Reinforcement Learning is dif\ufb01cult to stabilize in general [43, 47],\nOAC is remarkably stable. Due to the KL constraint in equation (6), the exploration policy \u03c0E\nremains close to the target policy \u03c0T . In fact, despite using a separate exploration policy, OAC isn\u2019t\nvery different in this respect from SAC [24] or TD3 [17], which explore with a stochastic policy but\nuse a deterministic policy for evaluation. In Section 5, we demonstrate empirically that OAC and\nSAC are equally stable in practice. Moreover, similarly to other recent state-of-the-art actor-critic\nalgorithms [17, 25], we use target networks [33, 31] to stabilize learning. We provide the details in\nAppendix D.\n\nOverestimation vs Optimism While OAC is an optimistic algorithm, it does not exhibit catas-\ntrophic overestimation [17, 26, 46]. OAC uses the optimistic estimate (8) for exploration only. The\npolicy \u03c0E is computed from scratch (line 4 in Algorithm 1) every time the algorithm takes an action\nand is used only for exploration. The critic and actor updates (1) and (5) are still performed with\na lower bound. This means that there is no way the upper bound can in\ufb02uence the critic except\nindirectly through the distribution of state-action pairs in the memory buffer.\n\n4.4 Related work\n\nOAC is distinct from other methods that maintain uncertainty estimates over the state-action value\nfunction. Actor-Expert [32] uses a point estimate of Q\u22c6, unlike OAC, which uses a bootstrap\napproximating Q\u03c0. Bayesian actor-critic methods [19\u201321] model the probability distribution over Q\u03c0,\nbut unlike OAC, do not use it for exploration. Approaches combining DQN with bootstrap [11, 36] and\nthe uncertainty Bellman equation [34] are designed for discrete actions. 
Model-based reinforcement learning methods that involve uncertainty [18, 15, 12] are very computationally expensive due to the need to learn a distribution over environment models. OAC may seem superficially similar to natural actor critic [5, 29, 37, 38] due to the KL constraint in (6). In fact, it is very different. While natural actor critic uses KL to enforce the similarity between infinitesimally small updates to the target policy, OAC constrains the exploration policy to be within a non-trivial distance of the target policy. Other approaches that define the exploration policy as a solution to a KL-constrained optimization problem include MOTO [2], MORE [4] and Maximum a Posteriori Policy optimization [3]. These methods differ from OAC in that they do not use epistemic uncertainty estimates and explore by enforcing entropy.\n\nFigure 3: OAC versus SAC, TD3, DDPG on 5 MuJoCo environments. The horizontal axis indicates number of environment steps. The vertical axis indicates the total undiscounted return. The shaded areas denote one standard deviation.\n\n5 Experiments\n\nOur experiments have three main goals. First, to test whether Optimistic Actor Critic has performance competitive to state-of-the-art algorithms. Second, to assess whether optimistic exploration based on the bootstrapped uncertainty estimate (7) is sufficient to produce a performance improvement. Third, to assess whether optimistic exploration adversely affects the stability of the learning process.\n\nMuJoCo Continuous Control We test OAC on the MuJoCo [45] continuous control benchmarks. We compare OAC to SAC [25] and TD3 [17], two recent model-free RL methods that achieve state-of-the-art performance. For completeness, we also include a comparison to a tuned version of DDPG [31], an established algorithm that does not maintain multiple bootstraps of the critic network. OAC uses 3 hyper-parameters related to exploration. 
The parameters β_UB and β_LB control the amount of uncertainty used to compute the upper and lower bound respectively. The parameter δ controls the maximal allowed divergence between the exploration policy and the target policy. We provide the values of all hyper-parameters and details of the hyper-parameter tuning in Appendix D. Results in Figure 3 show that using optimism improves the overall performance of actor-critic methods. On Ant, OAC improves the performance somewhat. On Hopper, OAC achieves state-of-the-art final performance. On Walker, we achieve the same performance as SAC, while the high variance of results on HalfCheetah makes it difficult to draw conclusions on which algorithm performs better.5\n\n5Because of this high variance, we measured a lower mean performance of SAC in Figure 3 than previously reported. We provide details in Appendix E.\n\nFigure 4: Impact of the bootstrapped uncertainty estimate on the performance of OAC.\n\nFigure 5: Left figure: individual runs of OAC vs SAC. Right figure: sensitivity to the KL constraint δ. Error bars indicate 90% confidence interval.\n\nState-of-the-art result on Humanoid The upper-right plot of Figure 3 shows that the vanilla version of OAC outperforms SAC on the Humanoid task. To test the statistical significance of our result, we re-ran both SAC and OAC in a setting where 4 training steps per iteration are used. By exploiting the memory buffer more fully, the 4-step versions show the benefit of improved exploration more clearly. The results are shown in the lower-right plot in Figure 3. At the end of training, the 90% confidence interval6 for the performance of OAC was 5033 ± 147 while the performance of SAC was 4586 ± 117. We stress that we did not tune hyper-parameters on the Humanoid environment. 
Overall, the fact that we are able to improve on Soft Actor Critic, which is currently the most sample-efficient model-free RL algorithm for continuous tasks, shows that optimism can be leveraged to benefit sample efficiency. We provide an explicit plot of sample-efficiency in Appendix J.\n\nUsefulness of the Bootstrapped Uncertainty Estimate OAC uses an epistemic uncertainty estimate obtained using two bootstraps of the critic network. To investigate its benefit, we compare the performance of OAC to a modified version of the algorithm, which adjusts the exploration policy to maximize the approximate lower bound, replacing Q̂_UB with Q̂_LB in equation (9). While the modified algorithm does not use the uncertainty estimate, it still uses a shifted exploration policy, preferring actions that achieve higher state-action values. The results are shown in Figure 4 (we include more plots in Figure 8 in the Appendix). Using the bootstrapped uncertainty estimate improves performance on the most challenging Humanoid domain, while producing either a slight improvement or no change in performance on the other domains. Since the upper bound is computationally very cheap to obtain, we conclude that it is worthwhile to use it.\n\nSensitivity to the KL constraint OAC relies on the hyperparameter δ, which controls the maximum allowed KL divergence between the exploration policy and the target policy. In Figure 5, we evaluate how the term √(2δ) used in the exploration policy (9) affects the average performance of OAC trained for 1 million environment steps on the Ant-v2 domain. The results demonstrate that there is a broad range of settings for the hyperparameter δ which leads to good performance.\n\nLearning is Stable in Practice Since OAC explores with a shifted policy, it might at first be expected to have poorer learning stability relative to algorithms that use the target policy for exploration. 
While we have already shown above that the performance difference between OAC and SAC is statistically significant and not due to increased variance across runs, we now investigate stability further. In Figure 5 we compare individual learning runs of both algorithms. We conclude that OAC and SAC are similarly stable, avoiding the problems associated with stabilising deep off-policy RL [43, 47].

6 Due to computational constraints, we used a slightly different target update rate for OAC. We describe the details in Appendix D.

6 Conclusions

We present Optimistic Actor Critic (OAC), a model-free deep reinforcement learning algorithm which explores by maximizing an approximate confidence bound on the state-action value function. By addressing the inefficiencies of pessimistic underexploration and directional uninformedness, we are able to achieve state-of-the-art sample efficiency in continuous control tasks. Our results suggest that the principle of optimism in the face of uncertainty can be used to improve the sample efficiency of policy gradient algorithms in a way which carries almost no additional computational overhead.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.
URL http://arxiv.org/abs/1603.04467.

[2] Abbas Abdolmaleki, Rudolf Lioutikov, Jan R Peters, Nuno Lau, Luis Paulo Reis, and Gerhard Neumann. Model-Based Relative Entropy Stochastic Search. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3537–3545. Curran Associates, Inc., 2015.

[3] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Rémi Munos, Nicolas Heess, and Martin A. Riedmiller. Maximum a Posteriori Policy Optimisation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

[4] Riad Akrour, Gerhard Neumann, Hany Abdulsamad, and Abbas Abdolmaleki. Model-Free Trajectory Optimization for Reinforcement Learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2961–2970. JMLR.org, 2016.

[5] Shun-ichi Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251–276, 1998. doi: 10.1162/089976698300017746.

[6] J. Baxter and P. L. Bartlett. Direct gradient-based reinforcement learning. In IEEE International Symposium on Circuits and Systems, ISCAS 2000, Emerging Technologies for the 21st Century, Geneva, Switzerland, 28-31 May 2000, Proceedings, pages 271–274. IEEE, 2000. doi: 10.1109/ISCAS.2000.856049.

[7] Jonathan Baxter and Peter L. Bartlett. Infinite-Horizon Policy-Gradient Estimation. J. Artif. Intell. Res., 15:319–350, 2001. doi: 10.1613/jair.806.

[8] Jonathan Baxter, Peter L. Bartlett, and Lex Weaver. Experiments with Infinite-Horizon, Policy-Gradient Estimation. J. Artif. Intell. Res., 15:351–381, 2001. doi: 10.1613/jair.807.

[9] Ronen I.
Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

[10] James J. Callahan. Advanced Calculus: A Geometric View. Springer Science & Business Media, September 2010. ISBN 978-1-4419-7332-0.

[11] Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.

[12] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 4759–4770, 2018. URL http://papers.nips.cc/paper/7725-deep-reinforcement-learning-in-a-handful-of-trials-using-probabilistic-dynamics-models.

[13] Kamil Ciosek and Shimon Whiteson. Expected Policy Gradients. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2868–2875. AAAI Press, 2018.

[14] Kamil Ciosek and Shimon Whiteson. Expected Policy Gradients for Reinforcement Learning. CoRR, abs/1801.03326, 2018.

[15] Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.

[16] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. SIAM Review, 36(4):677–678, 1994.
doi: 10.1137/1036171.

[17] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In International Conference on Machine Learning, pages 1582–1591, 2018.

[18] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4, 2016.

[19] Mohammad Ghavamzadeh and Yaakov Engel. Bayesian actor-critic algorithms. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pages 297–304, 2007. doi: 10.1145/1273496.1273534. URL https://doi.org/10.1145/1273496.1273534.

[20] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015. doi: 10.1561/2200000049. URL https://doi.org/10.1561/2200000049.

[21] Mohammad Ghavamzadeh, Yaakov Engel, and Michal Valko. Bayesian policy gradient and actor-critic algorithms. Journal of Machine Learning Research, 17:66:1–66:53, 2016. URL http://jmlr.org/papers/v17/10-245.html.

[22] Shixiang Gu, Tim Lillicrap, Richard E. Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3849–3858, 2017.

[23] Shixiang Gu, Timothy P. Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic.
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

[24] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning, pages 1856–1865, 2018.

[25] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic Algorithms and Applications. CoRR, abs/1812.05905, 2018.

[26] Hado V. Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.

[27] Nicolas Heess, Gregory Wayne, David Silver, Timothy P. Lillicrap, Tom Erez, and Yuval Tassa. Learning Continuous Control Policies by Stochastic Value Gradients. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2944–2952, 2015.

[28] Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-Learning Provably Efficient? In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 4868–4878, 2018.

[29] Sham Kakade. A Natural Policy Gradient. In Thomas G.
Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pages 1531–1538. MIT Press, 2001.

[30] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

[31] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

[32] Sungsu Lim, Ajin Joseph, Lei Le, Yangchen Pan, and Martha White. Actor-expert: A framework for using action-value methods in continuous action spaces. CoRR, abs/1810.09103, 2018. URL http://arxiv.org/abs/1810.09103.

[33] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi: 10.1038/nature14236.

[34] Brendan O'Donoghue, Ian Osband, Rémi Munos, and Volodymyr Mnih. The uncertainty Bellman equation and exploration. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 3836–3845, 2018.
URL http://proceedings.mlr.press/v80/o-donoghue18a.html.

[35] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4026–4034, 2016.

[36] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 8626–8638, 2018. URL http://papers.nips.cc/paper/8080-randomized-prior-functions-for-deep-reinforcement-learning.

[37] Jan Peters and Stefan Schaal. Policy Gradient Methods for Robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2006, October 9-15, 2006, Beijing, China, pages 2219–2225. IEEE, 2006. ISBN 978-1-4244-0258-8. doi: 10.1109/IROS.2006.282564.

[38] Jan Peters and Stefan Schaal. Natural Actor-Critic. Neurocomputing, 71(7-9):1180–1190, 2008. doi: 10.1016/j.neucom.2007.11.026.

[39] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 387–395. JMLR.org, 2014.

[41] Richard S. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In David S. Touretzky, Michael Mozer, and Michael E.
Hasselmo, editors, Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, USA, November 27-30, 1995, pages 1038–1044. MIT Press, 1995. ISBN 978-0-262-20107-0.

[42] Richard S Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 1038–1044. MIT Press, 1996.

[43] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.

[44] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller, editors, Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pages 1057–1063. The MIT Press, 1999. ISBN 978-0-262-19450-1.

[45] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.

[46] Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-Learning. In Dale Schuurmans and Michael P. Wellman, editors, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2094–2100. AAAI Press, 2016. ISBN 978-1-57735-760-5.

[47] Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep Reinforcement Learning and the Deadly Triad. CoRR, abs/1812.02648, 2018.

[48] Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco A. Wiering. A theoretical and empirical analysis of Expected Sarsa.
In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, ADPRL 2009, Nashville, TN, USA, March 31 - April 1, 2009, pages 177–184. IEEE, 2009. ISBN 978-1-4244-2761-1. doi: 10.1109/ADPRL.2009.4927542.

[49] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992. doi: 10.1007/BF00992696.

[50] Brian D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2010.