{"title": "Hindsight Credit Assignment", "book": "Advances in Neural Information Processing Systems", "page_first": 12488, "page_last": 12497, "abstract": "We consider the problem of efficient credit assignment in reinforcement learning. In order to efficiently and meaningfully utilize new data, we propose to explicitly assign credit to past decisions based on the likelihood of them having led to the observed outcome. This approach uses new information in hindsight, rather than employing foresight. Somewhat surprisingly, we show that value functions can be rewritten through this lens, yielding a new family of algorithms. We study the properties of these algorithms, and empirically show that they successfully address important credit assignment challenges, through a set of illustrative tasks.", "full_text": "Hindsight Credit Assignment\n\nAnna Harutyunyan, Will Dabney, Thomas Mesnard, Nicolas Heess, Mohammad G. Azar,\nBilal Piot, Hado van Hasselt, Satinder Singh, Greg Wayne, Doina Precup, R\u00e9mi Munos\n\n{harutyunyan, wdabney, munos}@google.com\n\nDeepMind\n\nAbstract\n\nWe consider the problem of ef\ufb01cient credit assignment in reinforcement learning.\nIn order to ef\ufb01ciently and meaningfully utilize new data, we propose to explicitly\nassign credit to past decisions based on the likelihood of them having led to the\nobserved outcome. This approach uses new information in hindsight, rather than\nemploying foresight. Somewhat surprisingly, we show that value functions can\nbe rewritten through this lens, yielding a new family of algorithms. We study the\nproperties of these algorithms, and empirically show that they successfully address\nimportant credit assignment challenges, through a set of illustrative tasks.\n\n1\n\nIntroduction\n\nA reinforcement learning (RL) agent is tasked with two fundamental, interdependent problems:\nexploration (how to discover useful data), and credit assignment (how to incorporate it). 
In this work, we take a careful look at the problem of credit assignment. The instrumental learning object in RL \u2013 the value function \u2013 quantifies the following question: \u201chow does choosing an action a in a state x affect future return?\u201d. This is a challenging question for several reasons.\n\nIssue 1: Variance. The simplest way of estimating the value function is by averaging returns (future discounted sums of rewards) starting from taking a in x. This Monte Carlo style of estimation is inefficient, since there can be a lot of randomness in trajectories.\n\nIssue 2: Partial observability. To amortize the search and reduce variance, temporal difference (TD) methods, like Sarsa and Q-learning, use a learned approximation of the value function and bootstrap. This introduces bias due to the approximation, as well as a reliance on the Markov assumption, which is especially problematic when the agent operates outside of a Markov Decision Process (MDP), for example if the state is partially observed, or if there is function approximation. Bootstrapping may then cause the value function to not converge at all, or to remain permanently biased [19].\n\nIssue 3: Time as a proxy. TD(\u03bb) methods control this bias-variance trade-off, but they rely on time as the sole metric for relevance: the more recent the action, the more credit or blame it receives from a future reward [20, 21]. Although time is a reasonable proxy for cause-and-effect (especially in MDPs), in general it is a heuristic, and can hence be improved by learning.\n\nIssue 4: No counterfactuals. The only data used for estimating an action\u2019s value are trajectories that contain that action, while ideally we would like to be able to use the same trajectory to update all relevant actions, not just the ones that happened to (serendipitously) occur.\n\nFigure 1 illustrates these issues concretely.
At a high level, we wish to achieve credit assignment mechanisms that are both sample-efficient (issues 1 and 4), and expressive (issues 2 and 3). To this end, we propose to reverse the key learning question, and learn estimators that measure: \u201cgiven the future outcome (reward or state), how relevant was the choice of a in x to achieve it?\u201d, which is essentially the credit assignment question itself. Although eligibility traces consider the same\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Left. Consider the trajectory shown by solid arrows to be the sampled trajectory, \u03c4. An RL algorithm will typically assign credit for the reward obtained in state y to the actions along \u03c4. This is unsatisfying for two reasons: (1) action a was not essential in reaching state z; any other a\u2032 would have been just as effective; hence, overemphasizing a is a source of variance; (2) from z, action c was sampled, leading to a multi-step trajectory into y, but action b transitions to y from z directly; so, it should get more of the credit for y. Note that c could have been an exploratory action, or even the more likely action according to the policy in z; but given that y was reached, b was more likely. Right. The choice between actions a or b at state x causes a transition to either ya or yb, but they are perceptually aliased. On the next decision, the same action c transitions the agent to different states, depending on the true underlying y. The state y can be a single state, or could itself be a trajectory. This scenario can happen e.g. when the features are being learned. A TD algorithm that bootstraps in y will not be able to learn the correct values of a and b, since it will average over the rewards of za and zb.
When y is a potentially long trajectory with a noisy reward, a Monte Carlo algorithm will incorporate the noise along y into the values of both a and b, despite it being irrelevant to the choice between them. We would like to be able to directly determine the relevance of a to being in za.\n\nquestion, they do so in a way that is (purposefully) equivalent to the forward view [20], and so they have to rely mainly on \u201cvanilla\u201d features, like time, to decide credit assignment. Reasoning in the backward view explicitly opens up a new family of algorithms. Specifically, we propose to use a form of hindsight conditioning to determine the relevance of a past action to a particular outcome. We show that the usual value functions can be rewritten in hindsight, yielding a new family of estimators, and derive policy gradient algorithms that use these estimators. We demonstrate empirically the ability of these algorithms to address the highlighted issues through a set of diagnostic tasks, which are not handled well by other means.\n\n2 Background and Notation\n\nA Markov decision process (MDP) [14] is a tuple (X, A, p, r, \u03b3), with X being the state space, A the action space, p : X \u00d7 A \u00d7 X \u2192 [0, 1] the state-transition distribution (with p(y|x, a) denoting the probability of transitioning to state y from x by choosing action a), r : X \u00d7 A \u2192 R the reward function, and \u03b3 \u2208 [0, 1) the scalar discount factor. A stochastic policy \u03c0 maps each state to a distribution over actions: \u03c0(a|x) denotes the probability of choosing action a in state x. Let T(x, \u03c0) and T(x, a, \u03c0) be the distributions over trajectories \u03c4 = (Xk, Ak, Rk)k\u22650 generated by a policy \u03c0, given X0 = x and (X0, A0) = (x, a), respectively. Let Z(\u03c4) def= \u2211_{k\u22650} \u03b3^k Rk be the return obtained along the trajectory \u03c4. The value (or V-) function V\u03c0 and the action-value (or Q-) function Q\u03c0 denote the expected return under the policy \u03c0 given X0 = x and (X0, A0) = (x, a), respectively:\n\nV\u03c0(x) def= E_{\u03c4\u223cT(x,\u03c0)}[Z(\u03c4)],   Q\u03c0(x, a) def= E_{\u03c4\u223cT(x,a,\u03c0)}[Z(\u03c4)].   (1)\n\nThe benefit of choosing a given action a over the usual policy \u03c0 is measured by the advantage function A\u03c0(x, a) def= Q\u03c0(x, a) \u2212 V\u03c0(x). Policy gradient algorithms improve the policy by changing \u03c0 in the direction of the gradient of the value function [22]. This gradient at some initial state x0 is\n\n\u2207V\u03c0(x0) = \u2211_{x,a} d\u03c0(x|x0) Q\u03c0(x, a) \u2207\u03c0(a|x) = E_{\u03c4\u223cT(x0,\u03c0)}[ \u2211_{k\u22650} \u03b3^k \u2211_a A\u03c0(Xk, a) \u2207\u03c0(a|Xk) ],\n\nwhere d\u03c0(x|x0) def= \u2211_k \u03b3^k P_{\u03c4\u223cT(x0,\u03c0)}(Xk = x) is the (unnormalized) discounted state-visitation distribution. Practical algorithms such as REINFORCE [25] approximate Q\u03c0 or A\u03c0 with an n-step truncated return, possibly combined with a bootstrapped approximate value function V, which is also often used as a baseline (see [22, 12]), along a trajectory \u03c4 = (Xk, Ak, Rk)k \u223c T(x, a, \u03c0):\n\nA\u03c0(x, a) \u2248 \u2211_{k=0}^{n\u22121} \u03b3^k Rk + \u03b3^n V(Xn) \u2212 V(x).\n\n3 Conditioning on the Future\n\nThe classical value function attempts to answer the question: \u201chow does the current action affect future outcomes?\u201d By relying on predictions about these future outcomes, existing approaches often exacerbate problems around variance (issue 1) and partial observability (issue 2).
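For reference, the n-step advantage estimate that closes the background section can be written as a small function. This is a minimal sketch under our own naming assumptions (a hypothetical helper, not from the paper; `rewards` and `values` are arrays collected along a trajectory):

```python
# Hypothetical helper for the n-step advantage estimate of Section 2:
# A(x0, a0) ~= sum_{k<n} gamma^k R_k + gamma^n V(X_n) - V(X_0).
def nstep_advantage(rewards, values, n, gamma):
    """rewards[k] = R_k along the trajectory; values[k] = V(X_k) from a learned baseline."""
    g = sum(gamma**k * rewards[k] for k in range(n))  # truncated discounted return
    return g + gamma**n * values[n] - values[0]       # bootstrap, then subtract baseline

# Example: n = 2, gamma = 0.5
adv = nstep_advantage([1.0, 0.0], [0.5, 0.0, 2.0], n=2, gamma=0.5)
```

With the numbers above, adv = 1.0 + 0.25 * 2.0 - 0.5 = 1.0.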
Furthermore, these methods tend to use temporal distance as a proxy for relevance (issue 3) and are unable to assign credit counter-factually (issue 4). We propose to learn estimators that explicitly consider the credit assignment question: \u201cgiven an outcome, how relevant were past decisions?\u201d.\n\nThis approach can in fact be linked to some classical methods in statistical estimation. In particular, Monte Carlo simulation is known to be inaccurate when there are rare events that are of interest: the averaging requires an infeasible number of samples to obtain an accurate estimate [16]. One solution is to change measures, that is, to use another distribution for which the events are less rare, and correct with importance sampling. The Girsanov theorem is a well-known example of this in processes with Brownian dynamics [4], known to produce lower variance estimates.\n\nThis scenario of rare random events is particularly relevant to efficient credit assignment in RL. When a new significant outcome is experienced, the agent ought to quickly update its estimates and policy accordingly. Let \u03c4 \u223c T(x, \u03c0) be a sampled trajectory, and f some function of it. By changing measures from the policy \u03c0 with which it was sampled to a future-conditional, or hindsight, distribution h(\u00b7|x, \u03c0, f(\u03c4)), we hope to improve the efficiency of credit assignment. The importance sampling ratio h(a|x, \u03c0, f(\u03c4)) / \u03c0(a|x) then precisely denotes the relevance of an action a to the specific future f(\u03c4). If the distribution h(a|x, \u03c0, f(\u03c4)) is accurate, this allows us to quickly assign credit to all actions relevant to achieving f(\u03c4).
In this work, we consider f to be a future state, or a future return. To highlight the use of the future-conditional distribution, we refer to the resulting family of methods as Hindsight Credit Assignment (HCA).\n\nThe remainder of this section formalizes the insight outlined above, and derives the usual value functions and policy gradients in hindsight, while the next one presents new algorithms based on sampling these expressions.\n\n3.1 Conditioning on Future States\n\nThe agent composes its estimates of the return from an action a by summing over the rewards obtained from future states Xk. One option of hindsight conditioning is to consider, at each step, the likelihood of an action a given that the future state Xk was reached.\n\nDefinition 1 (State-conditional hindsight distributions). For any action a and any state y, define hk(a|x, \u03c0, y) to be the conditional probability over trajectories \u03c4 \u223c T(x, \u03c0) of the first action A0 of trajectory \u03c4 being equal to a, given that the state y has occurred at step k along trajectory \u03c4:\n\nhk(a|x, \u03c0, y) def= P_{\u03c4\u223cT(x,\u03c0)}(A0 = a | Xk = y).   (2)\n\nIntuitively, hk(a|x, \u03c0, y) quantifies the relevance of action a to the future state Xk. If a is not relevant to reaching Xk, this probability is simply the policy \u03c0(a|x) (there is no relevant information in Xk). If a is instrumental to reaching Xk, hk(a|x, \u03c0, y) > \u03c0(a|x), and vice versa, if a detracts from reaching Xk, hk(a|x, \u03c0, y) < \u03c0(a|x). In general, hk is a lower-entropy distribution than \u03c0.
The relationship of hk to more familiar quantities can be understood through the following identity, obtained by an application of Bayes\u2019 rule:\n\nhk(a|x, \u03c0, y) / \u03c0(a|x) = P(Xk = y | X0 = x, A0 = a, \u03c0) / P(Xk = y | X0 = x, \u03c0) = P_{\u03c4\u223cT(x,a,\u03c0)}(Xk = y) / P_{\u03c4\u223cT(x,\u03c0)}(Xk = y).\n\nUsing this identity and importance sampling, we can rewrite the usual Q-function in terms of hk. Since there is only one policy \u03c0 involved here, we will drop the explicit conditioning, but it is implied.\n\nTheorem 1. Consider an action a and a state x for which \u03c0(a|x) > 0. Then the following holds:\n\nQ\u03c0(x, a) = r(x, a) + E_{\u03c4\u223cT(x,\u03c0)}[ \u2211_{k\u22651} \u03b3^k (hk(a|x, Xk) / \u03c0(a|x)) Rk ].\n\nSo, each of the rewards Rk along the way is weighted by the ratio hk(a|x, Xk) / \u03c0(a|x), which exactly quantifies how relevant a was in achieving the corresponding state Xk. Following the discussion above, this ratio is 1 if a is irrelevant, and larger or smaller than 1 in the other cases. The expression for the Q-function is similar to that in Eq. (1), but the new expectation is no longer conditioned on the initial action a \u2013 the policy \u03c0 is followed from the start (A0 \u223c \u03c0(\u00b7|x) instead of A0 = a). This is an important point, as it will allow us to use returns generated by any action A0 to update the values of all actions, to the extent that they are relevant according to hk(a|x, Xk) / \u03c0(a|x). Theorem 1 implies the following expression for the advantage:\n\nA\u03c0(x, a) = r(x, a) \u2212 r\u03c0(x) + E_{\u03c4\u223cT(x,\u03c0)}[ \u2211_{k\u22651} \u03b3^k (hk(a|x, Xk) / \u03c0(a|x) \u2212 1) Rk ],   (3)\n\nwhere r\u03c0(x) def= \u2211_{a\u2208A} \u03c0(a|x) r(x, a).
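The Bayes identity and the reweighting in Theorem 1 can be sanity-checked numerically on a small toy chain. The MDP below is a hypothetical example of our own (not from the paper), with state-only rewards so that the hindsight ratio reduces to a ratio of state-visitation probabilities:

```python
import numpy as np

# Toy check: with state-only rewards, the identity
#   h_k(a|x, y) / pi(a|x) = P(X_k = y | x, a) / P(X_k = y | x)
# means reweighting on-policy rewards by this ratio recovers the
# action-conditioned expected discounted reward, term by term.
gamma = 0.9
P = np.array([  # P[a, s, s']: hypothetical transition kernel, rows sum to 1
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 0
    [[0.0, 0.2, 0.8], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 1
])
r = np.array([0.0, 1.0, 2.0])         # reward for arriving in each state
pi = np.array([0.6, 0.4])             # policy, applied in every state
P_pi = np.einsum('a,aij->ij', pi, P)  # on-policy transition kernel
x, H = 0, 60                          # start state, truncation horizon

for a in range(2):
    d_a = P[a, x].copy()              # P(X_k = . | x, a), pi afterwards
    d_pi = P_pi[x].copy()             # P(X_k = . | x) under pi throughout
    q_true = q_hca = 0.0
    for k in range(1, H):
        q_true += gamma**k * d_a @ r  # ground-truth tail of Q(x, a)
        ratio = np.divide(d_a, d_pi, out=np.zeros_like(d_a), where=d_pi > 0)
        q_hca += gamma**k * np.sum(d_pi * ratio * r)  # hindsight-reweighted
        d_a, d_pi = d_a @ P_pi, d_pi @ P_pi
    assert abs(q_true - q_hca) < 1e-6
```

The check relies only on the identity above (the two visitation distributions share support here), so the agreement is exact up to truncation.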
This form of the advantage is particularly appealing, since it directly removes irrelevant rewards from consideration. Indeed, whenever hk(a|x, Xk) / \u03c0(a|x) = 1, the reward Rk does not participate in the advantage for the value of action a. When there is inconsequential noise that is outside of the agent\u2019s control, this may greatly reduce the variance of the estimates.\n\nRemoving time dependence. For clarity of exposition, here we have considered the hindsight distribution to be additionally conditioned on time. Indeed, hk depends not only on reaching the state, but also on the number of timesteps k that it takes to do so. In general, this can be limiting, as it introduces a stronger dependence on the particular trajectory, and a harder estimation problem of the hindsight distribution. It turns out we can generalize all of the results presented here to a time-independent distribution h\u03b2(a|x, y), which gives the probability of a conditioned on reaching y at some point in the future. The scalar \u03b2 \u2208 [0, 1) is the \u201cprobability of survival\u201d at each step. This can either be the discount \u03b3, or a termination probability if the problem is undiscounted. In the discounted reward case, Eq. (3) can be written in terms of h\u03b2 as follows:\n\nA\u03c0(x, a) = r(x, a) \u2212 r\u03c0(x) + E_{\u03c4\u223cT(x,\u03c0)}[ \u2211_{k\u22651} \u03b3^k (h\u03b2(a|x, Xk) / \u03c0(a|x) \u2212 1) Rk ],   (4)\n\nwith the choice of \u03b2 = \u03b3. The interested reader may find the relevant proofs in the appendix. Finally, it is possible to obtain a hindsight V-function, analogously to the Q-function from Theorem 1. The next section does this for return-conditional HCA.
We include other variations in the appendix.\n\n3.2 Conditioning on Future Returns\n\nThe previous section derived Q-functions that explicitly reweigh the rewards at each step, based on the corresponding states\u2019 connection to the action whose value we wish to estimate. Since ultimately we are interested in the return, we could alternatively use it for future conditioning itself.\n\nDefinition 2 (Return-conditional hindsight distributions). For any action a and any possible return z, define hz(a|x, \u03c0, z) to be the conditional probability over trajectories \u03c4 \u223c T(x, \u03c0) of the first action A0 being a, given that z has been observed along \u03c4:\n\nhz(a|x, \u03c0, z) def= P_{\u03c4\u223cT(x,\u03c0)}(A0 = a | Z(\u03c4) = z).\n\nThe distribution hz(a|x, \u03c0, z) is intuitively similar to hk, but instead of future states, it directly quantifies the relevance of a to obtaining the entire return z. This is appealing, since in the end we care about returns. Further, this could be simpler to learn, since instead of the possibly high-dimensional state, we now need to worry only about a scalar outcome. On the other hand, it is no longer \u201cjumpy\u201d in time, so may benefit less from structure in the dynamics. As with hk, we will drop the explicit conditioning on \u03c0, but it is implied. We have the following result.\n\nTheorem 2. Consider an action a, and assume that for any possible random return z = Z(\u03c4) for some trajectory \u03c4 \u223c T(x, \u03c0) we have hz(a|x, z) > 0. Then we have:\n\nV\u03c0(x) = E_{\u03c4\u223cT(x,a,\u03c0)}[ (\u03c0(a|x) / hz(a|x, Z(\u03c4))) Z(\u03c4) ].   (5)\n\nThe V- (rather than Q-) function form here has interesting properties that we will discuss in the next section.
Mathematically, the two forms are analogous to derive, but the ratio is now flipped. Equations (5) and (1) imply the following expression for the advantage:\n\nA\u03c0(x, a) = E_{\u03c4\u223cT(x,a,\u03c0)}[ (1 \u2212 \u03c0(a|x) / hz(a|x, Z(\u03c4))) Z(\u03c4) ].   (6)\n\nThe factor c(a|x, Z) = 1 \u2212 \u03c0(a|x) / hz(a|x, Z) expresses how much a single action a contributed to obtaining a return Z. If other actions (drawn from \u03c0(\u00b7|x)) would have yielded the same return, c(a|x, Z) = 0, and the advantage is 0. If an action a has made achieving Z more likely, then c(a|x, Z) > 0, and conversely, if other actions would have contributed to achieving Z more than a, then c(a|x, Z) < 0. Hence, c(a|x, Z) expresses the impact an action has on the environment, in terms of the return, if everything else (future decisions as well as randomness of the environment) is unchanged.\n\nBoth h\u03b2 and hz can be learned online from sampled trajectories (see Sec. 4 for algorithms, and a discussion in Sec. 4.1). Finally, while we chose to focus on state and return conditioning, one could consider other options. For example, conditioning on the reward (instead of the state) at a future time k, or an embedding of (or part of) the future trajectory, could have interesting properties.\n\n3.3 Policy Gradients\n\nWe now give a policy gradient theorem based on the new expressions of the value function.\n\nTheorem 3. Let \u03c0\u03b8 be the policy parameterized by \u03b8, and \u03b2 = \u03b3.
Then, the gradient of the value at some state x0 is:\n\n\u2207\u03b8V\u03c0\u03b8(x0) = E_{\u03c4\u223cT(x0,\u03c0\u03b8)}[ \u2211_{k\u22650} \u03b3^k \u2211_a \u2207\u03c0\u03b8(a|Xk) Qx(Xk, a) ]   (7)\n\n= E_{\u03c4\u223cT(x0,\u03c0\u03b8)}[ \u2211_{k\u22650} \u03b3^k \u2207 log \u03c0\u03b8(Ak|Xk) Az(Xk, Ak) ],   (8)\n\nwhere\n\nQx(Xk, a) def= r(Xk, a) + \u2211_{t\u2265k+1} \u03b3^{t\u2212k} (h\u03b2(a|Xk, Xt) / \u03c0\u03b8(a|Xk)) Rt,\n\nAz(x, a) def= (1 \u2212 \u03c0\u03b8(a|x) / hz(a|x, Z(\u03c4k:\u221e))) Z(\u03c4k:\u221e).\n\nNote that the expression for state HCA in Eq. (7) is written for all actions, rather than only the sampled one. Interestingly, this form does not require (or benefit from) a baseline. Contrary to the usual all-actions algorithm which must use the critic, the HCA reweighting allows us to use returns sampled from a particular starting action to obtain value estimates for all actions.\n\n4 Algorithms\n\nUsing the new policy gradient theorem 3, we will now give novel algorithms based on sampling the expectations (7) and (8). Then, we will discuss the training of the relevant hindsight distributions.\n\nState-Conditional HCA. Consider a parametric representation of the policy \u03c0(\u00b7|x) and the future-state-conditional distribution h\u03b2(a|x, y), as well as the baseline V and an estimate of the immediate reward \u02c6r. Generate T-step trajectories \u03c4T = (Xs, As, Rs)0\u2264s\u2264T.
We can compose an estimate of the return for all actions a (see Theorem 7 in the appendix):\n\nQx(Xs, a) \u2248 \u02c6r(Xs, a) + \u2211_{t=s+1}^{T\u22121} \u03b3^{t\u2212s} (h\u03b2(a|Xs, Xt) / \u03c0(a|Xs)) Rt + \u03b3^{T\u2212s} (h\u03b2(a|Xs, XT) / \u03c0(a|Xs)) V(XT).\n\nThe algorithm proceeds by training V(Xs) to predict the usual return Zs = \u2211_{t=s}^{T\u22121} \u03b3^{t\u2212s} Rt + \u03b3^{T\u2212s} V(XT) and \u02c6r(Xs, As) to predict Rs (square loss), the hindsight distribution h\u03b2(a|Xs, Xt) to predict As (cross-entropy loss), and finally by updating the policy logits with \u2211_a Qx(Xs, a) \u2207\u03c0(a|Xs). See Algorithm 1 in the appendix for the detailed pseudocode.\n\nReturn-Conditional HCA. Consider a parametric representation of the policy \u03c0(\u00b7|x) and the return-conditioned distribution hz(a|x, z). Generate full trajectories \u03c4 = (Xs, As, Rs)s\u22650 and compute the sampled advantage at each step:\n\nAz(Xs, As) = (1 \u2212 \u03c0(As|Xs) / hz(As|Xs, Zs)) Zs,\n\nwhere Zs = \u2211_{t\u2265s} \u03b3^{t\u2212s} Rt. The algorithm proceeds by training the hindsight distribution hz(a|Xs, Zs) to predict As (cross-entropy loss), and updating the policy gradient with \u2207 log \u03c0(As|Xs) Az(Xs, As). See Algorithm 2 in the appendix for the detailed pseudocode.\n\nRL without value functions. The return-conditional version lends itself to a particularly simple algorithm. In particular, we no longer need to learn the value function V \u2013 if hz(a|Xs, Zs) is estimated well, using complete rollouts is feasible without variance issues. This takes our idea of reversing the direction of the learning question to the extreme: it is now entirely in hindsight.\n\nThe result is an actor-critic algorithm, where the usual baseline V(Xs) is replaced by bs def= (\u03c0(As|Xs) / hz(As|Xs, Zs)) Zs.
This baseline is strongly correlated with the return Zs (it is proportional to it), which is desirable since we would like to remove as much of the variance (due to the dynamics of the world, or the agent\u2019s own policy) as possible. The following proposition verifies that despite being correlated, this baseline does not introduce bias into the policy gradient.\n\nProposition 1. The baseline bs = (\u03c0(As|Xs) / hz(As|Xs, Zs)) Zs does not introduce any bias in the policy gradient:\n\nE_{\u03c4\u223cT(x0,\u03c0)}[ \u2211_s \u03b3^s \u2207 log \u03c0(As|Xs) (Zs(\u03c4) \u2212 bs) ] = \u2207V(x0).\n\n4.1 Learning Hindsight Distributions\n\nWe have given equivalent rewritings of the usual value functions in terms of the proposed hindsight distributions, and have motivated their properties, when they are accurate. Now, the question is if it is feasible to learn good estimates of those distributions from experience, and whether shifting the learning problem in this way is beneficial. The remainder of this section discusses this question, while the next one provides empirical evidence for the affirmative.\n\nThere are several conventional objects that could be learned to help with credit assignment: a value function, a forward model, or an inverse model over states. An accurate forward model allows one to compute value functions directly with no variance, and an accurate inverse model \u2013 to perform precise credit assignment. However, learning such generative models accurately is difficult and has been a long-standing challenge in RL, especially in high-dimensional state spaces. Interestingly, the hindsight distribution is a discriminative, rather than generative, model, and is hence not required to model the full distribution over states. Additionally, the action space is usually much smaller than the state space, and so shifting the focus to actions potentially makes the problem much easier.
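As a toy illustration that a hindsight distribution can indeed be learned from experience in tandem with the policy, the return-conditional update can be sketched end to end on a two-armed bandit. The bandit, the bucketed counting estimator for hz, and all constants below are our own illustrative simplifications, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
r_mean = np.array([0.0, 1.0])  # hypothetical 2-armed bandit; arm 1 is better
logits = np.zeros(2)           # policy parameters
counts = {}                    # return bucket -> action counts: empirical h_z

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    pi = softmax(logits)
    a = rng.choice(2, p=pi)
    z = r_mean[a] + rng.normal(0.0, 0.5)             # sampled return Z
    c = counts.setdefault(round(z, 1), np.ones(2))   # add-1 smoothed bucket
    c[a] += 1.0
    h = c / c.sum()                                  # empirical h_z(.|bucket)
    adv = (1.0 - pi[a] / h[a]) * z                   # sampled advantage, Eq. (6)
    grad = -pi.copy()
    grad[a] += 1.0                                   # d log pi(a) / d logits
    logits += 0.1 * adv * grad                       # policy gradient step
```

With a reasonable hz estimate, returns that any arm would have produced contribute little advantage, while returns characteristic of one arm push its probability up, so the policy should drift toward the better arm.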
When certain structure in the dynamics is present, learning hindsight distributions may be significantly easier still \u2013 e.g. if the transition model is stochastic or the policy is changing, a particular (x, a) can lead to many possible future states, but a particular future state can be explained by a small number of past actions. In general, learning hz and h\u03b2 are supervised learning problems, so the new algorithms delegate some of the learning difficulty in RL to a supervised setting, for which many efficient approaches exist (e.g. [7, 23]).\n\n5 Experiments\n\nTo empirically validate our proposal in a controlled way, we devised a set of diagnostic tasks that highlight issues 1-4, while also being representative of what occurs in practice (Fig. 2). We then systematically verify the intuitions developed throughout the paper. In all cases, we learn the hindsight distributions in tandem with the control policy. For each problem we compare HCA with state and return conditioning to standard baseline policy gradient, that is: n-step advantage actor-critic (with n = \u221e for Monte Carlo). All the results are an average of 100 independent runs, with the plots depicting means and standard deviations. For simplicity we take \u03b3 = 1 in all of the tasks.\n\nShortcut. We begin with an example capturing the intuition from Fig. 1 (left). Fig. 2 (left) depicts a chain of length n with a rewarding final state. At each step, one action takes a shortcut and directly\n\nFigure 2: Left: Shortcut. Each state has two actions, one transitions directly to the goal, the other to the next state of the chain. Center: Delayed effect. Start state presents a choice of two actions, followed by an aliased chain, with the consequence of the initial choice apparent only in the final state. Right: Ambiguous bandit. Each action transitions to a particular state with high probability, but to the other action\u2019s state with low probability.
When the two states have noisy rewards, credit assignment to each action becomes challenging.\n\nFigure 3: Shortcut. Left: learning curves for n = 5 with the policy between long and short paths initialized uniformly. Explicitly considering the likelihood of reaching the final state allows state-conditioned HCA to more quickly adjust its policy. Right: the advantage of the shortcut action estimated by performing 1000 rollouts from a fixed policy. The x-axis depicts the policy probabilities of the actions on the long path. The oracle is computed analytically without sampling. When the shortcut action is unlikely and rarely encountered, it is difficult to obtain an accurate estimate of the advantage. HCA is consistently able to maintain larger (and more accurate) advantages.\n\nFigure 4: Delayed effect. Left: Bootstrapping. The learning curves for n = 5, \u03c3 = 0, and a 3-step return, which causes the agent to bootstrap in the partially observed region. As expected, naive bootstrapping is unable to learn a good estimate. Middle: Using full Monte Carlo returns (for n = 3) overcomes partial observability, but is prone to noise. The plot depicts learning curves for the setting with added white noise of \u03c3 = 2. Right: The average performance w.r.t. different noise levels \u2013 predictably, state HCA is the most robust.\n\nFigure 5: Ambiguous bandit with Gaussian rewards of means 1, 2, and standard deviation 1.5. Left: The state identity is observed. Both HCA methods improve on PG. Middle: The state identity is hidden, handicapping state HCA, but return HCA continues to improve on PG. Right: Average performance w.r.t.
different \u03b5-s, with Gaussian rewards of means 1, 2, and standard deviation 0.5. Note that the optimal value itself decays in this case.\n\ntransitions to the final state, while the other continues on the longer path, which may be more likely according to the policy. There is a per-step penalty (of \u22121), and a final reward of 1. There is also a chance (of 0.1) that the agent transitions to the absorbing state directly.\n\nThis problem highlights two issues: (1) the importance of counter-factual credit assignment (issue 4); when the long path is taken more frequently than the shortcut path, counter-factual updates become increasingly effective (see Fig. 3, right); (2) the use of time as a proxy for relevance (issue 3) is shown to be only a heuristic, even in a fully-observable MDP. The relevance of the states along the chain is not accurately reflected in the long temporal distance between them and the goal state. In Fig. 3 we show that HCA is more effective at quickly adjusting the policy towards the shortcut action.\n\nDelayed Effect. The next task instantiates the example from Fig. 1 (right). Fig. 2 (middle) depicts a POMDP, in which after the first decision, there is aliasing until the final state. This is a common case of partial observability, and is especially pertinent if the features are being learned. We show that (1) bootstrapping naively is inadequate in this case (issue 2), but HCA is able to carry the appropriate information;1 and (2) while Monte Carlo is able to overcome the partial observability, its performance deteriorates when intermediate reward noise is present (issue 1).
HCA, on the other hand, is able to reduce the variance due to the irrelevant noise in the rewards. Additionally, in this example the first decision is the most relevant choice, despite being the most temporally remote, once again highlighting that using temporal proximity for credit assignment is a heuristic (issue 3). One of the final states is rewarding (with r = 1), the other penalizing (with r = \u22121), and the middle states contain white noise of standard deviation \u03c3. Fig. 4 depicts our results. In this task, the return-conditional HCA has a more difficult learning problem, as it needs to correctly model the noise distribution to condition on, which is as difficult as learning the values naively, and hence performs similarly to the baseline.\n\nAmbiguous Bandit. Finally, to emphasize that credit assignment can be challenging even when it is not long-term, we consider a problem without a temporal component. Fig. 2 (right) depicts a bandit with two actions, leading to two different states, whose reward functions are similar (here: drawn from overlapping Gaussian distributions), with some probability \u03b5 of crossover. The challenge here is due to variance (issue 1) and a lack of counter-factual updates (issue 4). It is difficult to tell whether an action was genuinely better, or just happened to be on the tail end of the distribution. This is a common scenario when bootstrapping with similar values. Due to the explicit aim at modeling the distributions, the hindsight algorithms are more efficient (Fig. 5 (left)).\n\nTo highlight the differences between the two types of hindsight conditioning, we introduce partial observability (issue 2), see Fig. 5 (right).
The return-conditional policy is still able to improve over\npolicy gradient, but state-conditioning now fails to provide informative conditioning (by construction).\n\n6 Related Work\n\nHindsight experience replay (HER) [1] introduces the idea of off-policy learning about many goals\nfrom the same trajectory. The intuition is that, regardless of what goal the trajectory was pursuing\noriginally, in hindsight it successfully found, e.g., the goal corresponding to its final state, and there is\nsomething to be learned. Rauber et al. [15] extend the same intuition to policy gradient algorithms,\nwith goal-conditioned policies. Goyal et al. [5] also use goal conditioning and learn a backtracking\nmodel, which predicts the state-action pairs occurring on trajectories that end up in goal states. These\nworks share our intuition of using the same data, in hindsight, to learn about many things, but do so in the\ncontext of goal-conditioned policies, while we essentially contrast conditional and unconditional\npolicies, where the conditioning is on the extra outcome (state or return). Note that we never act w.r.t.\nthe conditional policy; it is used solely for credit assignment.\nThe temporal value transport algorithm [11] also aims to propagate credit efficiently backward in\ntime. It uses an attention mechanism over memory to jump over parts of a trajectory that are irrelevant\nfor the rewards obtained. While demonstrated on challenging problems, that method is biased; a\npromising direction for future research is to apply our unbiased hindsight mechanism with past\nstates chosen by such an attention mechanism. Another line of work with a related intuition is\nRUDDER [2]. It uses an LSTM to predict future returns, and sensitivity analysis to distribute those\nreturns as immediate rewards in order to reduce the learning horizon and make long-term credit\nassignment easier.\n\n1See the discussion in Appendix F.
Instead of aiming to redistribute the return, state HCA up- or down-weights\nindividual rewards according to their relevance to the past action.\nA large number of variance reduction techniques have been applied in RL, e.g. using learned\nvalue functions as critics, and other control variates [e.g. 24]. When a model of the environment is\navailable, it can be used to reduce variance. Rollouts from the same state fill the same role in policy\ngradients [18]. Differentiable system dynamics allow low-variance estimates of the Q-value gradient\nvia the pathwise derivative estimator, effectively backpropagating the gradient of the objective\nalong trajectories [e.g. 17, 9, 10]. In stochastic systems this requires knowledge of the environment\nnoise. To bypass this, Heess et al. [9] infer the noise given an observed trajectory. Buesing et al. [3]\napply this idea to POMDPs, where it can be viewed as reasoning about events in hindsight. They use\na structural causal model of the dynamics and infer the posterior over latent causes from empirical\ntrajectories. Using an empirical rather than a learned distribution over latent causes can reduce bias\nand, together with the (deterministic) model of the system dynamics, allows exploring the effect of\nalternative action choices for an observed trajectory.\nInverse models similar to the ones we use appear, for instance, in variational intrinsic control [6] (see\nalso e.g. [8]).
However, in our work, the inverse model serves as a way of determining the influence\nof an action on a future outcome, whereas the work in [6, 8] uses the inverse model to derive\nan intrinsic reward for training policies whose actions influence the future observations.\nFinally, prioritized sweeping can be viewed as changing the sampling distribution with hindsight\nknowledge of the TD errors [13].\n\n7 Closing\n\nWe proposed a new family of algorithms that explicitly consider the question of credit assignment\nas a part of, or instead of, estimating the traditional value function. The proposed estimators come\nwith new properties and, as we validate empirically, are able to address some of the key issues in\ncredit assignment. Investigating the scalability of these algorithms in the deep reinforcement learning\nsetting is an exciting problem for future research.\n\nAcknowledgements\n\nThe authors thank Joseph Modayil for reviews of earlier manuscripts, Theo Weber for several\ninsightful suggestions, and the anonymous reviewers for their useful feedback.\n\nReferences\n\n[1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder,\nBob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay.\nIn Advances in Neural Information Processing Systems, pages 5048\u20135058, 2017.\n\n[2] Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes\nBrandstetter, and Sepp Hochreiter. RUDDER: Return decomposition for delayed rewards. In\nH. Wallach, H. Larochelle, A. Beygelzimer, F. d\u2019Alch\u00e9-Buc, E. Fox, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems 32, pages 13544\u201313555, 2019.\n\n[3] Lars Buesing, Theophane Weber, Yori Zwols, S\u00e9bastien Racani\u00e8re, Arthur Guez, Jean-Baptiste\nLespiau, and Nicolas Heess.
Woulda, coulda, shoulda: Counterfactually-guided policy search.\nCoRR, abs/1811.06272, 2018.\n\n[4] Igor Vladimirovich Girsanov. On transforming a certain class of stochastic processes by\nabsolutely continuous substitution of measures. Theory of Probability & Its Applications,\n5(3):285\u2013301, 1960.\n\n[5] Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy Lillicrap, Sergey\nLevine, Hugo Larochelle, and Yoshua Bengio. Recall traces: Backtracking models for efficient\nreinforcement learning. In International Conference on Learning Representations (ICLR), 2019.\n\n[6] Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv\npreprint arXiv:1611.07507, 2016.\n\n[7] Michael Gutmann and Aapo Hyv\u00e4rinen. Noise-contrastive estimation: A new estimation\nprinciple for unnormalized statistical models. In Proceedings of the Thirteenth International\nConference on Artificial Intelligence and Statistics, pages 297\u2013304, 2010.\n\n[8] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller.\nLearning an embedding space for transferable robot skills. In International Conference on\nLearning Representations (ICLR), 2018.\n\n[9] Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa.\nLearning continuous control policies by stochastic value gradients. In C. Cortes, N. D. Lawrence,\nD. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing\nSystems 28, pages 2944\u20132952. Curran Associates, Inc., 2015.\n\n[10] Mikael Henaff, William F. Whitney, and Yann LeCun. Model-based planning with discrete and\ncontinuous actions. arXiv preprint arXiv:1705.07177, 2017.\n\n[11] Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico\nCarnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by\ntransporting value.
arXiv preprint arXiv:1810.06721, 2018.\n\n[12] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,\nTim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement\nlearning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The\n33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine\nLearning Research, pages 1928\u20131937, New York, New York, USA, 20\u201322 Jun 2016. PMLR.\n\n[13] Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning\nwith less data and less time. Machine Learning, 13(1):103\u2013130, 1993.\n\n[14] Martin Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John\nWiley & Sons, 1994.\n\n[15] Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and J\u00fcrgen Schmidhuber. Hindsight policy\ngradients. In International Conference on Learning Representations (ICLR), 2019.\n\n[16] Gerardo Rubino, Bruno Tuffin, et al. Rare Event Simulation Using Monte Carlo Methods,\nvolume 73. Wiley Online Library, 2009.\n\n[17] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation\nusing stochastic computation graphs. CoRR, abs/1506.05254, 2015.\n\n[18] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region\npolicy optimization. In International Conference on Machine Learning, pages 1889\u20131897,\n2015.\n\n[19] Satinder Singh, Tommi Jaakkola, and Michael I. Jordan. Learning without state estimation in\npartially observable environments. In International Conference on Machine Learning (ICML),\n1994.\n\n[20] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning,\n3(1):9\u201344, 1988.\n\n[21] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction.
MIT Press,\nCambridge, MA, USA, 2nd edition, 2018.\n\n[22] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient\nmethods for reinforcement learning with function approximation. In Advances in Neural\nInformation Processing Systems, pages 1057\u20131063, 2000.\n\n[23] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive\npredictive coding. arXiv preprint arXiv:1807.03748, 2018.