{"title": "Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8151, "page_last": 8162, "abstract": "Gradient-based methods for optimisation of objectives in stochastic settings with unknown or intractable dynamics require estimators of derivatives. We derive an objective that, under automatic differentiation, produces low-variance unbiased estimators of derivatives at any order. Our objective is compatible with arbitrary advantage estimators, which allows the control of the bias and variance of any-order derivatives when using function approximation. Furthermore, we propose a method to trade off bias and variance of higher order derivatives by discounting the impact of more distant causal dependencies. We demonstrate the correctness and utility of our estimator in analytically tractable MDPs and in meta-reinforcement-learning for continuous control.", "full_text": "Loaded DiCE: Trading off Bias and Variance in\n\nAny-Order Score Function Estimators for\n\nReinforcement Learning\n\nGregory Farquhar \u2217\nUniversity of Oxford\n\nShimon Whiteson\nUniversity of Oxford\n\nJakob Foerster\n\nFacebook AI Research\n\nAbstract\n\nGradient-based methods for optimisation of objectives in stochastic settings with\nunknown or intractable dynamics require estimators of derivatives. We derive an\nobjective that, under automatic differentiation, produces low-variance unbiased\nestimators of derivatives at any order. Our objective is compatible with arbitrary\nadvantage estimators, which allows the control of the bias and variance of any-order\nderivatives when using function approximation. Furthermore, we propose a method\nto trade off bias and variance of higher order derivatives by discounting the impact\nof more distant causal dependencies. 
We demonstrate the correctness and utility of our objective in analytically tractable MDPs and in meta-reinforcement-learning for continuous control.\n\n1 Introduction\n\nIn stochastic settings, such as reinforcement learning (RL), it is often impossible to compute the derivative of our objectives, because they depend on an unknown or intractable distribution (such as the transition function of an RL environment). In these cases, gradient-based optimisation is only possible through the use of stochastic gradient estimators. Great successes in these domains have been found by building estimators of first-order derivatives which are amenable to automatic differentiation, and using them to optimise the parameters of deep neural networks [François-Lavet et al., 2018].\nNonetheless, for a number of exciting applications, first-order derivatives are insufficient. Meta-learning and multi-agent learning often involve differentiating through the learning step of a gradient-based learner [Finn et al., 2017, Stadie et al., 2018, Zintgraf et al., 2019, Foerster et al., 2018a]. Higher-order optimisation methods can also improve sample efficiency [Furmston et al., 2016]. However, estimating these higher order derivatives correctly, with low variance, and easily in the context of automatic differentiation, has proven challenging.\nFoerster et al. [2018b] propose tools for constructing estimators for any-order derivatives that are easy to use because they avoid the cumbersome manipulations otherwise required to account for the dependency of the gradient estimates on the distributions they are sampled from. 
However, their formulation relies on pure Monte-Carlo estimates of the objective, introducing unacceptable variance in estimates of first- and higher-order derivatives and limiting the uptake of methods relying on these derivatives.\nMeanwhile, great strides have been made in the development of estimators for first-order derivatives of stochastic objectives. In reinforcement learning, the use of learned value functions as both critics and baselines has been extensively studied. The trade-off between bias and variance in gradient estimators can be made explicit in mixed objectives that combine Monte-Carlo samples of the objective with learned value functions [Schulman et al., 2015b]. These techniques create families of advantage estimators that can be used to reduce variance and accelerate credit assignment in first-order optimisation, but have not been applied in full generality to higher-order derivatives.\n∗Correspondence to gregory.farquhar@cs.ox.ac.uk\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\nIn this work, we derive an objective that can be differentiated any number of times to produce correct estimators of higher-order derivatives in Stochastic Computation Graphs (SCGs) that have a Markov property, such as those found in RL and sequence modeling. Unlike prior work, this objective is fully compatible with arbitrary choices of advantage estimators. When using approximate value functions, this allows for explicit trade-offs between bias and variance in any-order derivative estimates to be made using known techniques (or using any future advantage estimation methods designed for first-order derivatives). 
Furthermore, we propose a method for trading off bias and variance of higher order derivatives by discounting the impact of more distant causal dependencies.\nEmpirically, we first use small random MDPs that admit analytic solutions to show that our estimator is unbiased and low variance when using a perfect value function, and that bias and variance may be flexibly traded off using two hyperparameters. We further study our objective in more challenging meta-reinforcement-learning problems for simulated continuous control, and show the impact of various parameter choices on training. Demonstration code is available at https://github.com/oxwhirl/loaded-dice. Only a handful of additional lines of code are needed to implement our objective in any existing codebase that uses higher-order derivatives for RL.\n\n2 Background\n\n2.1 Gradient estimators\n\nWe are commonly faced with objectives that have the form of an expectation over random variables. In order to calculate the gradient of the expectation with respect to parameters of interest, we must often employ gradient estimators, because the gradient cannot be computed exactly. For example, in reinforcement learning the environment dynamics are unknown and form a part of our objective, the expected returns. The polyonymous “likelihood ratio”, “score function”, or “REINFORCE” estimator is given by\n\n∇_θ E_x[f(x, θ)] = E_x[f(x, θ) ∇_θ log p(x; θ) + ∇_θ f(x, θ)].    (1)\n\nThe expectation on the RHS may now be estimated from Monte-Carlo samples drawn from p(x; θ). Often f is independent of θ and the second term is dropped. If f depends on θ, but the random variable does not (or may be reparameterised to depend only deterministically on θ) we may instead drop the first term. See Fu [2006] or Mohamed et al.
[2019] for a more comprehensive review.\n\n2.2 Stochastic Computation Graphs and MDPs\n\nStochastic computation graphs (SCGs) are directed acyclic graphs in which nodes are deterministic or stochastic functions, and edges indicate functional dependencies [Schulman et al., 2015a]. The gradient estimators described above may be used to estimate the gradients of the objective (the sum of cost nodes) with respect to parameters θ. Schulman et al. [2015a] propose a surrogate loss, a single objective that produces the desired gradient estimates under differentiation.\nWeber et al. [2019] apply more advanced first-order gradient estimators to SCGs. They formalise Markov properties for SCGs that allow the most flexible and powerful of these estimators, originally developed in the context of reinforcement learning, to be applied. We describe these estimators in the following subsection, but first define the relevant subset of SCGs. To keep the main body of this paper simple and highlight the most important known use case for our method, we adopt the notation of reinforcement learning rather than the more cumbersome notation of generic SCGs.\nThe graph in reinforcement learning describes a Markov Decision Process (MDP), and begins with an initial state s_0 at time t = 0. At each timestep, an action a_t is sampled from a stochastic policy π_θ, parameterised by θ, that maps states to actions. This adds a stochastic node a_t to the graph. The state-action pair leads to a reward r_t, and a next state s_{t+1}, from which the process continues. A simple MDP graph is shown in Figure 1. In the figure, as in many problems, the reward conditions only on the state rather than the state and action. We consider episodic problems that terminate after T steps, although all of our results may be extended to the non-terminating case. 
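As an aside not in the original text, the estimator in (1) can be checked numerically in a one-variable example. Below, x is Bernoulli with success probability σ(θ), and f is independent of θ, so the second term of (1) vanishes and the score is ∇_θ log p(x; θ) = x − σ(θ). This is an illustrative sketch; the function names and constants are arbitrary.

```python
import math
import random

def score_function_gradient(theta, f, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of d/dtheta E_x[f(x)] for x ~ Bernoulli(sigmoid(theta)),
    via the score function estimator f(x) * d/dtheta log p(x; theta)."""
    rng = random.Random(seed)
    p = 1.0 / (1.0 + math.exp(-theta))  # p(x = 1)
    total = 0.0
    for _ in range(n_samples):
        x = 1.0 if rng.random() < p else 0.0
        score = x - p  # d/dtheta log p(x; theta) for the logit parameterisation
        total += f(x) * score
    return total / n_samples

theta = 0.3
p = 1.0 / (1.0 + math.exp(-theta))
est = score_function_gradient(theta, f=lambda x: x)
# With f(x) = x, E_x[f] = sigmoid(theta), whose true gradient is p * (1 - p).
print(est, p * (1 - p))
```

The estimate agrees with the analytic gradient up to Monte-Carlo error; in RL, f becomes the sampled return and p the policy, neither of which is available in closed form.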
The (discounted) rewards are the cost nodes of this graph, leading to the familiar reinforcement learning objective of an expected discounted sum of rewards: J = E[Σ_{t=0}^T γ^t r_t], where the expectation is taken with respect to the policy as well as the unknown transition dynamics of the underlying MDP.\nFigure 1: Some example SCGs that support our new objective. From left to right: (a) vanilla MDP; (b) MDP with stochastic latent goal variable g; (c) POMDP.\nA generalisation of our results holds for a slightly more general class of SCGs as well, whose objective is still a sum of rewards over time. We may have any number of stochastic and deterministic nodes X_t corresponding to each timestep t. However, these nodes may only influence the future rewards through their influence on the next timestep. More formally, this Markov property states that for any node w such that there exists a directed path from w to any r_{t'}, t' ≥ t, not blocked by X_t, none of the descendants of w are in X_t (definition 6 of Weber et al. [2019]). This class of SCGs can capture a broad class of MDP-like models, such as those in Figure 1.\n\n2.3 Gradient estimators with advantages\n\nA value function for a set of nodes in an SCG is the expectation of the objective over the other stochastic variables (excluding that set of nodes). These can reduce variance by serving as control variates (“baselines”), or as critics that also condition on the sampled values taken by the corresponding stochastic nodes (i.e. the sampled actions). The difference of the critic and baseline value functions is known as the advantage, which replaces sampled costs in the gradient estimator.\nBaseline value functions only affect the variance of gradient estimators [Weaver and Tao, 2001]. However, using learned, imperfect critic value functions results in biased gradient estimators. 
We may trade off bias and variance by using different mixtures of sampled costs (unbiased, high variance) and learned critic value functions (biased, low variance). This choice of advantage estimator and its hyperparameters can be used to tune the bias and variance of the resulting gradient estimator to suit the problem at hand.\nThere are many ways to model the advantage function in RL. A popular and simple family of advantage estimators is proposed by Schulman et al. [2015b]:\n\nA^{GAE(γ,τ)}(s_t, a_t) = Σ_{t'=t}^∞ (γτ)^{t'−t} (r_{t'} + γV̂(s_{t'+1}) − V̂(s_{t'})).    (2)\n\nThe parameter τ trades off bias and variance: when τ = 1, A^{GAE} is formed only of sampled rewards and is unbiased, but high variance; when τ = 0, A^{GAE} uses only the next sampled reward r_t and relies heavily on the estimated value function V̂, reducing variance at the cost of bias.\n\n2.4 Higher order estimators\n\nTo construct higher order gradient estimators, we may recursively apply the above techniques, treating gradient estimates as objectives in a new SCG. Foerster et al. [2018b] note several shortcomings of the surrogate loss approach of Schulman et al. [2015a] for higher-order derivatives. The surrogate loss cannot itself be differentiated again to produce correct higher-order estimators. Even estimates produced using the surrogate loss cannot be treated as objectives in a new SCG, because the surrogate loss severs dependencies of the sampled costs on the sampling distribution.
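The family in (2) can be evaluated for a sampled episode by a single backward sweep that accumulates discounted TD errors. The sketch below is illustrative and not from the paper; it assumes episodic data with the trailing bootstrap value set to zero at termination, and uses placeholder value estimates rather than a trained V̂.

```python
def gae_advantages(rewards, values, gamma, tau):
    """A^{GAE(gamma,tau)} for one episode (Schulman et al., 2015b), Eq. (2).

    rewards: [r_0, ..., r_T]
    values:  [V(s_0), ..., V(s_T), V(s_{T+1})], last entry 0 if the episode terminated.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae = delta + gamma * tau * gae  # discounted sum of future TD errors
        advantages[t] = gae
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.2, 0.1, 0.0]  # placeholder value estimates + zero bootstrap
gamma = 0.9

# tau = 0: relies heavily on V-hat; low variance, biased if V-hat is imperfect.
print(gae_advantages(rewards, values, gamma, tau=0.0))
# tau = 1: Monte-Carlo return minus baseline; unbiased but higher variance.
print(gae_advantages(rewards, values, gamma, tau=1.0))
```

At τ = 1 each advantage equals Σ_k γ^k r_{t+k} − V̂(s_t); at τ = 0 it is the single TD error r_t + γV̂(s_{t+1}) − V̂(s_t).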
The DiCE objective for reinforcement learning is given by\n\nT(cid:88)\n\nt=0\n\n(cid:88)\n\nw\u2208W\n\nJ =\n\n\u03b3t\n\n(a\u2264t)rt,\n\n(3)\n\nwhere a\u2264t indicates the set of stochastic nodes (i.e. actions) occurring at timestep t or earlier.\n\nis a special operator that acts on a set of stochastic nodes W.\n\n(\u00b7) always evaluates to 1, but has a\n\nspecial behaviour under differentiation:\n\n\u2207\u03b8\n\n(W) = (W)\n\n\u2207\u03b8 log p(w; \u03b8)\n\n(4)\n\nT(cid:88)\n\n(\u2205) = 1, so it has a zero derivative.\n\nThis operator in effect automates the likelihood-ratio trick for differentiation of expectations, while\nmaintaining dependencies such that the same trick will be applied when computing higher order\nderivatives. For notational convenience in our later derivation, we extend the de\ufb01nition of\nslightly\nby de\ufb01ning its operation on the empty set:\nThe original version of DiCE has two critical drawbacks compared to the state-of-the-art methods\ndescribed above for estimating \ufb01rst-order derivatives of stochastic objectives. First, it has no mecha-\nnism for using baselines to reduce the variance of estimators of higher order derivatives. Mao et al.\n[2019], and Liu et al. [2019] (subsequently but independently) suggest the same partial solution\nfor this problem, but neither provide proof of unbiasedness of their estimator beyond second order.\nSecond, DiCE (and the estimator of Mao et al. [2019] and Liu et al. [2019]) are formulated in a way\nthat requires the use of Monte-Carlo sampled costs. Without a form that permits the use of critic\nvalue functions, there is no way to make use of the full range of possible advantage estimators.\nIn an exact calculation of higher-order derivative estimators, the dependence of a given reward on\nall previous actions leads to nested sums over previous timesteps. 
These terms tend to have high variance when estimated from data, and become small in the vicinity of local optima, as noted by Furmston et al. [2016]. Rothfuss et al. [2018] use this observation to propose a simplified version of the DiCE objective dropping these dependencies:\n\nJ_LVC = Σ_{t=0}^T ⧈(a_t) R_t.    (5)\n\nThis estimator is biased for higher than first-order derivatives, and Rothfuss et al. [2018] do not derive a correct unbiased estimator for all orders, make use of advantage estimation in this objective, or extend its applicability beyond meta-learning in the style of MAML [Finn et al., 2017].\nIn the next section, we introduce a new objective which may make use of the critic as well as baseline value functions, and thereby allows the bias and variance of any-order derivatives to be traded off through the choice of an advantage estimator. Furthermore, we introduce a discounting of past dependencies that allows a smooth trade-off of bias and variance due to the high-variance terms identified by Furmston et al. [2016].\n\n3 Method\n\nThe DiCE objective is cast as a sum over rewards, with the dependencies of the reward node r_t on its stochastic causes captured by ⧈(a_{≤t}). To use critic value functions, on the other hand, we must use forward-looking sums over returns.\nThis is possible if the graph maintains the Markov property defined above in Section 2.2 with respect to its objective, so as to permit a sequential decomposition of the cost nodes, i.e., rewards r_t, and their stochastic causes influenced by θ, i.e., the actions a_t. We begin with the DiCE objective for a discounted sum of rewards given in (3), where our true objective is the expected discounted sum of rewards in trajectories drawn from a policy π_θ.
We define, as is typical in RL, the return R_t = Σ_{t'=t}^T γ^{t'−t} r_{t'}. Now we have r_t = R_t − γR_{t+1}, so:\n\nJ = Σ_{t=0}^T γ^t ⧈(a_{≤t})(R_t − γR_{t+1}) = Σ_{t=0}^T γ^t ⧈(a_{≤t})R_t − Σ_{t=0}^T γ^{t+1} ⧈(a_{≤t})R_{t+1}.\n\nNow we simply take a change of variables t' = t + 1 in the second term, relabeling the dummy variable immediately back to t:\n\nJ = Σ_{t=0}^T γ^t ⧈(a_{≤t})R_t − Σ_{t=1}^{T+1} γ^t ⧈(a_{≤t−1})R_t = Σ_{t=0}^T γ^t (⧈(a_{≤t}) − ⧈(a_{<t}))R_t + γ^0 ⧈(a_{<0})R_0 − γ^{T+1} ⧈(a_{≤T})R_{T+1}.\n\nSince R_{T+1} = 0 and γ^0 ⧈(a_{<0}) = ⧈(∅) = 1, the boundary terms reduce to R_0, which carries no ⧈ factor and therefore contributes nothing to any derivative; it may be discarded when the objective is used for derivative estimation. The coefficient of each remaining R_t depends only on a_{≤t} and s_{≤t}, so we may marginalise the future trajectory τ_{>t} onto R_t without changing the expectation of the estimator. For a complete derivation please see the Supplementary Material. This marginal is simply a critic value function, defined by Q(s_t, a_t) = E_π[R_t | s_t, a_t], which may be used in place of R_t:\n\nΣ_{t=0}^T γ^t (⧈(a_{≤t}) − ⧈(a_{<t})) Q(s_t, a_t).\n\nSubtracting a baseline that conditions on s_t but not on a_t does not change the expectation of the estimator, as shown by the standard derivation reproduced in Schulman et al. [2015a]. In reinforcement learning, it is common to use the expected state value V(s_t) = E_{a_t}[Q(s_t, a_t)] as an approximation of the optimal baseline. The estimator may now use A(s_t, a_t) = Q(s_t, a_t) − V(s_t) in place of R_t, further reducing its variance. We have now derived an estimator in terms of an advantage A(s_t, a_t) that recovers unbiased estimates of derivatives of any order:\n\nJ♦ = Σ_{t=0}^T γ^t (⧈(a_{≤t}) − ⧈(a_{<t})) A(s_t, a_t).\n\nThe terms ⧈(a_{≤t}) − ⧈(a_{<t}) correctly keep rewards after the actions that cause them. Note that, indexing rewards in the derivation below so that r_{t+1} follows the pair (s_t, a_t), R_t = Σ_{k=0}^{T−t} γ^k r_{t+k+1} depends only on τ_{>t}. 
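A quick numerical sanity check of the telescoping algebra above (illustrative only, with arbitrary rewards): every ⧈ factor evaluates to 1 in the forward pass, so the substitution r_t = R_t − γR_{t+1} and the regrouping of the two sums must both leave the forward value Σ_t γ^t r_t = R_0 unchanged.

```python
def returns(rewards, gamma):
    """R_t = sum_{t'=t}^T gamma^(t'-t) r_{t'}, with R_{T+1} = 0."""
    R = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        R[t] = rewards[t] + gamma * R[t + 1]
    return R

rewards = [1.0, -0.5, 2.0, 0.3]  # arbitrary illustrative rewards
gamma = 0.9
R = returns(rewards, gamma)
T = len(rewards) - 1

# DiCE objective (3) with each magic-box factor at its forward value 1:
J_original = sum(gamma**t * rewards[t] for t in range(T + 1))
# After substituting r_t = R_t - gamma * R_{t+1}:
J_substituted = sum(gamma**t * (R[t] - gamma * R[t + 1]) for t in range(T + 1))
# After the change of variables and regrouping (boundary terms included):
J_regrouped = (sum(gamma**t * (1 - 1) * R[t] for t in range(T + 1))
               + 1 * R[0] - gamma**(T + 1) * R[T + 1])
print(J_original, J_substituted, J_regrouped)  # all three equal R_0
```

The three forms differ only in which ⧈ factors multiply each term, so their derivatives, not their values, are what distinguishes them.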
The expectation of our objective is given by:\n\nE_π[J□] = Σ_τ P(τ) J□(τ) = Σ_τ P(τ) Σ_{t=0}^T γ^t (⧈(a_{≤t}) − ⧈(a_{<t})) R_t = Σ_{t=0}^T Σ_τ P(τ) γ^t (⧈(a_{≤t}) − ⧈(a_{<t})) R_t = Σ_{t=0}^T J_t,\n\nwhere J_t = Σ_τ P(τ) f(τ_{≤t}) g(τ_{>t}), with f(τ_{≤t}) = γ^t (⧈(a_{≤t}) − ⧈(a_{<t})) and g(τ_{>t}) = R_t.\nNext we use:\n\nP(τ) = P(τ_{≤t}) P(τ_{>t} | τ_{≤t}) = P(τ_{≤t}) P(τ_{>t} | s_t, a_t),\n\nwhere in the last step we have used the Markov property. Substituting we obtain:\n\nJ_t = Σ_τ P(τ_{≤t}) P(τ_{>t} | s_t, a_t) f(τ_{≤t}) g(τ_{>t}) = Σ_{τ_{≤t}} P(τ_{≤t}) f(τ_{≤t}) Σ_{τ_{>t}} P(τ_{>t} | s_t, a_t) g(τ_{>t}).\n\nIf we substitute back for g and f we obtain:\n\nJ_t = Σ_{τ_{≤t}} P(τ_{≤t}) γ^t (⧈(a_{≤t}) − ⧈(a_{<t})) Σ_{τ_{>t}} P(τ_{>t} | s_t, a_t) R_t = Σ_{τ_{≤t}} P(τ_{≤t}) γ^t (⧈(a_{≤t}) − ⧈(a_{<t})) Q(s_t, a_t).\n\nPutting all together we obtain the final form:\n\nE_π[J□] = E_π[Σ_{t=0}^T γ^t (⧈(a_{≤t}) − ⧈(a_{<t})) Q(s_t, a_t)]."}