{"title": "Hindsight Experience Replay", "book": "Advances in Neural Information Processing Systems", "page_first": 5048, "page_last": 5058, "abstract": "Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary and therefore avoid the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum. We demonstrate our approach on the task of manipulating objects with a robotic arm. In particular, we run experiments on three different tasks: pushing, sliding, and pick-and-place, in each case using only binary rewards indicating whether or not the task is completed. Our ablation studies show that Hindsight Experience Replay is a crucial ingredient which makes training possible in these challenging environments. We show that our policies trained on a physics simulation can be deployed on a physical robot and successfully complete the task. The video presenting our experiments is available at https://goo.gl/SMrQnI.", "full_text": "Hindsight Experience Replay\n\nMarcin Andrychowicz\u21e4, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong,\nPeter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel\u2020, Wojciech Zaremba\u2020\n\nOpenAI\n\nAbstract\n\nDealing with sparse rewards is one of the biggest challenges in Reinforcement\nLearning (RL). We present a novel technique called Hindsight Experience Replay\nwhich allows sample-ef\ufb01cient learning from rewards which are sparse and binary\nand therefore avoid the need for complicated reward engineering. It can be com-\nbined with an arbitrary off-policy RL algorithm and may be seen as a form of\nimplicit curriculum.\nWe demonstrate our approach on the task of manipulating objects with a robotic\narm. 
In particular, we run experiments on three different tasks: pushing, sliding,\nand pick-and-place, in each case using only binary rewards indicating whether or\nnot the task is completed. Our ablation studies show that Hindsight Experience\nReplay is a crucial ingredient which makes training possible in these challenging\nenvironments. We show that our policies trained on a physics simulation can\nbe deployed on a physical robot and successfully complete the task. The video\npresenting our experiments is available at https://goo.gl/SMrQnI.\n\nIntroduction\n\n1\nReinforcement learning (RL) combined with neural networks has recently led to a wide range of\nsuccesses in learning policies for sequential decision-making problems. This includes simulated\nenvironments, such as playing Atari games (Mnih et al., 2015), and defeating the best human player\nat the game of Go (Silver et al., 2016), as well as robotic tasks such as helicopter control (Ng et al.,\n2006), hitting a baseball (Peters and Schaal, 2008), screwing a cap onto a bottle (Levine et al., 2015),\nor door opening (Chebotar et al., 2016).\nHowever, a common challenge, especially for robotics, is the need to engineer a reward function\nthat not only re\ufb02ects the task at hand but is also carefully shaped (Ng et al., 1999) to guide the\npolicy optimization. For example, Popov et al. (2017) use a cost function consisting of \ufb01ve relatively\ncomplicated terms which need to be carefully weighted in order to train a policy for stacking a\nbrick on top of another one. The necessity of cost engineering limits the applicability of RL in the\nreal world because it requires both RL expertise and domain-speci\ufb01c knowledge. Moreover, it is\nnot applicable in situations where we do not know what admissible behaviour may look like. It is\ntherefore of great practical relevance to develop algorithms which can learn from unshaped reward\nsignals, e.g. 
a binary signal indicating successful task completion.\nOne ability humans have, unlike the current generation of model-free RL algorithms, is to learn\nalmost as much from achieving an undesired outcome as from the desired one. Imagine that you are\nlearning how to play hockey and are trying to shoot a puck into a net. You hit the puck but it misses\nthe net on the right side. The conclusion drawn by a standard RL algorithm in such a situation would\nbe that the performed sequence of actions does not lead to a successful shot, and little (if anything)\n\n\u21e4 marcin@openai.com\n\u2020 Equal advising.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fwould be learned. It is however possible to draw another conclusion, namely that this sequence of\nactions would be successful if the net had been placed further to the right.\nIn this paper we introduce a technique called Hindsight Experience Replay (HER) which allows the\nalgorithm to perform exactly this kind of reasoning and can be combined with any off-policy RL\nalgorithm. It is applicable whenever there are multiple goals which can be achieved, e.g. achieving\neach state of the system may be treated as a separate goal. Not only does HER improve the sample\nef\ufb01ciency in this setting, but more importantly, it makes learning possible even if the reward signal is\nsparse and binary. Our approach is based on training universal policies (Schaul et al., 2015a) which\ntake as input not only the current state, but also a goal state. The pivotal idea behind HER is to replay\neach episode with a different goal than the one the agent was trying to achieve, e.g. one of the goals\nwhich was achieved in the episode.\n2 Background\n2.1 Reinforcement Learning\nWe consider the standard reinforcement learning formalism consisting of an agent interacting with\nan environment. 
To simplify the exposition we assume that the environment is fully observable.
An environment is described by a set of states S, a set of actions A, a distribution of initial states p(s0), a reward function r : S × A → R, transition probabilities p(st+1|st, at), and a discount factor γ ∈ [0, 1].
A deterministic policy is a mapping from states to actions: π : S → A. Every episode starts with sampling an initial state s0. At every timestep t the agent produces an action based on the current state: at = π(st). Then it gets the reward rt = r(st, at) and the environment's new state is sampled from the distribution p(·|st, at). A discounted sum of future rewards is called a return: Rt = Σ_{i=t}^∞ γ^{i−t} ri. The agent's goal is to maximize its expected return E_{s0}[R0|s0]. The Q-function or action-value function is defined as Q^π(st, at) = E[Rt|st, at].
Let π* denote an optimal policy, i.e. any policy π* s.t. Q^{π*}(s, a) ≥ Q^π(s, a) for every s ∈ S, a ∈ A and any policy π. All optimal policies have the same Q-function, which is called the optimal Q-function and denoted Q*. It is easy to show that it satisfies the following equation, called the Bellman equation:

Q*(s, a) = E_{s′∼p(·|s,a)}[r(s, a) + γ max_{a′∈A} Q*(s′, a′)].

2.2 Deep Q-Networks (DQN)
Deep Q-Networks (DQN) (Mnih et al., 2015) is a model-free RL algorithm for discrete action spaces. Here we sketch it only informally; see Mnih et al. (2015) for more details. In DQN we maintain a neural network Q which approximates Q*. A greedy policy w.r.t. Q is defined as π_Q(s) = argmax_{a∈A} Q(s, a). An ε-greedy policy w.r.t. Q is a policy which with probability ε takes a random action (sampled uniformly from A) and takes the action π_Q(s) with probability 1 − ε.
During training we generate episodes using an ε-greedy policy w.r.t. the current approximation of the action-value function Q. The transition tuples (st, at, rt, st+1) encountered during training are stored in the so-called replay buffer. The generation of new episodes is interleaved with neural network training. The network is trained using mini-batch gradient descent on the loss L which encourages the approximated Q-function to satisfy the Bellman equation: L = E(Q(st, at) − yt)², where yt = rt + γ max_{a′∈A} Q(st+1, a′) and the tuples (st, at, rt, st+1) are sampled from the replay buffer1.
2.3 Deep Deterministic Policy Gradients (DDPG)
Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015) is a model-free RL algorithm for continuous action spaces. Here we sketch it only informally; see Lillicrap et al. (2015) for more details. In DDPG we maintain two neural networks: a target policy (also called an actor) π : S → A and an action-value function approximator (called the critic) Q : S × A → R. The critic's job is to approximate the actor's action-value function Q^π.
Episodes are generated using a behavioral policy which is a noisy version of the target policy, e.g. πb(s) = π(s) + N(0, 1). The critic is trained in a similar way as the Q-function in DQN but the targets yt are computed using actions outputted by the actor, i.e. yt = rt + γQ(st+1, π(st+1)). The actor is trained with mini-batch gradient descent on the loss La = −E_s Q(s, π(s)), where s is sampled from the replay buffer.

1The targets yt depend on the network parameters but this dependency is ignored during backpropagation. Moreover, DQN uses the so-called target network to make the optimization procedure more stable but we omit it here as it is not relevant to our results.
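The replay-buffer training loop shared by DQN and DDPG can be illustrated with a tiny tabular stand-in for the Q-network (an illustrative sketch only: the toy two-state environment, learning rate, and discount factor below are ours, not the paper's setup):

```python
import random
from collections import defaultdict

random.seed(0)
GAMMA = 0.98        # discount factor (illustrative value)
ACTIONS = [0, 1]

Q = defaultdict(float)   # Q[(state, action)] -> value; stands in for the neural network

def epsilon_greedy(state, epsilon=0.1):
    """With probability epsilon take a uniform random action, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def td_target(r, s_next):
    """Bellman target: y_t = r_t + gamma * max_a' Q(s_{t+1}, a')."""
    return r + GAMMA * max(Q[(s_next, a)] for a in ACTIONS)

def update(transition, lr=0.1):
    """One descent step on the squared loss (Q(s, a) - y_t)^2 for a sampled transition."""
    s, a, r, s_next = transition
    y = td_target(r, s_next)
    Q[(s, a)] += lr * (y - Q[(s, a)])   # gradient step, ignoring y's dependence on Q

# Replay-buffer usage: store transitions, then sample them for interleaved updates.
replay_buffer = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
for _ in range(200):
    update(random.choice(replay_buffer))
```

In DDPG the target would instead use the actor's action, yt = rt + γQ(st+1, π(st+1)), but the store-then-sample structure is the same.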
The gradient of La w.r.t. actor parameters can be computed by backpropagation through the combined critic and actor networks.
2.4 Universal Value Function Approximators (UVFA)
Universal Value Function Approximators (UVFA) (Schaul et al., 2015a) is an extension of DQN to the setup where there is more than one goal we may try to achieve. Let G be the space of possible goals. Every goal g ∈ G corresponds to some reward function rg : S × A → R. Every episode starts with sampling a state-goal pair from some distribution p(s0, g). The goal stays fixed for the whole episode. At every timestep the agent gets as input not only the current state but also the current goal (so the policy has the form π : S × G → A) and gets the reward rt = rg(st, at). The Q-function now depends not only on a state-action pair but also on a goal: Q^π(st, at, g) = E[Rt|st, at, g]. Schaul et al. (2015a) show that in this setup it is possible to train an approximator to the Q-function using direct bootstrapping from the Bellman equation (just like in the case of DQN) and that a greedy policy derived from it can generalize to previously unseen state-action pairs. The extension of this approach to DDPG is straightforward.
3 Hindsight Experience Replay
3.1 A motivating example
Consider a bit-flipping environment with the state space S = {0, 1}^n and the action space A = {0, 1, . . . , n − 1} for some integer n, in which executing the i-th action flips the i-th bit of the state. For every episode we sample uniformly an initial state as well as a target state and the policy gets a reward of −1 as long as it is not in the target state, i.e. rg(s, a) = −[s ≠ g].
Standard RL algorithms are bound to fail in this environment for n > 40 because they will never experience any reward other than −1. Notice that using techniques for improving exploration (e.g.
VIME (Houthooft et al., 2016), count-based exploration (Ostrovski et al., 2017) or bootstrapped DQN (Osband et al., 2016)) does not help here because the real problem is not a lack of diversity of states being visited; rather, it is simply impractical to explore such a large state space. The standard solution to this problem would be to use a shaped reward function which is more informative and guides the agent towards the goal, e.g. rg(s, a) = −||s − g||². While using a shaped reward solves the problem in our toy environment, it may be difficult to apply to more complicated problems. We investigate the results of reward shaping experimentally in Sec. 4.4.
Instead of shaping the reward we propose a different solution which does not require any domain knowledge. Consider an episode with a state sequence s1, . . . , sT and a goal g ∉ {s1, . . . , sT}, which implies that the agent received a reward of −1 at every timestep. The pivotal idea behind our approach is to re-examine this trajectory with a different goal — while this trajectory may not help us learn how to achieve the state g, it definitely tells us something about how to achieve the state sT. This information can be harvested by using an off-policy RL algorithm and experience replay where we replace g in the replay buffer by sT. In addition we can still replay with the original goal g left intact in the replay buffer. With this modification at least half of the replayed trajectories contain rewards different from −1 and learning becomes much simpler. Fig. 1 compares the final performance of DQN with and without this additional replay technique which we call Hindsight Experience Replay (HER). DQN without HER can only solve the task for n ≤ 13 while DQN with HER easily solves the task for n up to 50. See Appendix A for the details of the experimental setup.
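A minimal version of this bit-flipping environment, with the −1/0 reward and the hindsight relabelling of g to sT, can be sketched in a few lines (plain Python; the random behavioural policy and episode length here are illustrative, not the DQN agent used in the experiment):

```python
import random

def reward(state, goal):
    """r_g(s, a) = -[s != g]: -1 until the goal state is reached, 0 on success."""
    return 0 if state == goal else -1

def flip(state, i):
    """Executing the i-th action flips the i-th bit of the state (states are tuples)."""
    return state[:i] + (1 - state[i],) + state[i + 1:]

def run_episode(n, horizon=None):
    """Random policy on n bits; returns the transitions under the sampled goal."""
    horizon = horizon or n
    state = tuple(random.randint(0, 1) for _ in range(n))
    goal = tuple(random.randint(0, 1) for _ in range(n))
    transitions = []
    for _ in range(horizon):
        action = random.randrange(n)
        next_state = flip(state, action)
        transitions.append((state, action, reward(next_state, goal), next_state, goal))
        state = next_state
    return transitions

def hindsight_relabel(transitions):
    """Replay the episode as if the final state s_T had been the goal all along."""
    achieved = transitions[-1][3]          # s_T
    return [(s, a, reward(s2, achieved), s2, achieved)
            for (s, a, _, s2, _) in transitions]

episode = run_episode(n=40)
relabelled = hindsight_relabel(episode)
# For large n the original episode almost surely contains only -1 rewards,
# while the relabelled copy always ends with a reward of 0.
```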
Note that this approach combined with powerful function approximators (e.g., deep neural networks) allows the agent to learn how to achieve the goal g even if it has never observed it during training.

Figure 1: Bit-flipping experiment.

We more formally describe our approach in the following sections.
3.2 Multi-goal RL
We are interested in training agents which learn to achieve multiple different goals. We follow the approach from Universal Value Function Approximators (Schaul et al., 2015a), i.e. we train policies and value functions which take as input not only a state s ∈ S but also a goal g ∈ G. Moreover, we show that training an agent to perform multiple tasks can be easier than training it to perform only one task (see Sec. 4.3 for details) and therefore our approach may be applicable even if there is only one task we would like the agent to perform (a similar situation was recently observed by Pinto and Gupta (2016)).
We assume that every goal g ∈ G corresponds to some predicate fg : S → {0, 1} and that the agent's goal is to achieve any state s that satisfies fg(s) = 1. In the case when we want to exactly specify the desired state of the system we may use S = G and fg(s) = [s = g]. The goals can also specify only some properties of the state, e.g. suppose that S = R² and we want to be able to achieve an arbitrary state with the given value of the x coordinate. In this case G = R and fg((x, y)) = [x = g].
Moreover, we assume that given a state s we can easily find a goal g which is satisfied in this state. More formally, we assume that there is given a mapping m : S → G s.t. ∀s∈S fm(s)(s) = 1. Notice that this assumption is not very restrictive and can usually be satisfied. In the case where each goal corresponds to a state we want to achieve, i.e. G = S and fg(s) = [s = g], the mapping m is just the identity.
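For the 2-dimensional example above, the predicate fg and the mapping m can be written out directly (a tiny illustrative sketch; the function names are ours):

```python
def f(g, state):
    """f_g(s) = [x = g]: a goal g in G = R fixes only the x coordinate of s in S = R^2."""
    x, y = state
    return 1 if x == g else 0

def m(state):
    """m maps a state to a goal satisfied in it: m((x, y)) = x, so f_{m(s)}(s) = 1."""
    x, y = state
    return x

s = (3.0, 7.5)
assert f(m(s), s) == 1   # every state satisfies the goal extracted from it...
assert f(2.0, s) == 0    # ...but generally not an arbitrary other goal
```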
For the case of 2-dimensional states and 1-dimensional goals from the previous paragraph this mapping is also very simple: m((x, y)) = x.
A universal policy can be trained using an arbitrary RL algorithm by sampling goals and initial states from some distributions, running the agent for some number of timesteps and giving it a negative reward at every timestep when the goal is not achieved, i.e. rg(s, a) = −[fg(s) = 0]. This does not however work very well in practice because this reward function is sparse and not very informative. In order to solve this problem we introduce the technique of Hindsight Experience Replay which is the crux of our approach.
3.3 Algorithm
The idea behind Hindsight Experience Replay (HER) is very simple: after experiencing some episode s0, s1, . . . , sT we store in the replay buffer every transition st → st+1 not only with the original goal used for this episode but also with a subset of other goals. Notice that the goal being pursued influences the agent's actions but not the environment dynamics and therefore we can replay each trajectory with an arbitrary goal, assuming that we use an off-policy RL algorithm like DQN (Mnih et al., 2015), DDPG (Lillicrap et al., 2015), NAF (Gu et al., 2016) or SDQN (Metz et al., 2017).
One choice which has to be made in order to use HER is the set of additional goals used for replay. In the simplest version of our algorithm we replay each trajectory with the goal m(sT), i.e. the goal which is achieved in the final state of the episode. We experimentally compare different types and quantities of additional goals for replay in Sec. 4.5. In all cases we also replay each trajectory with the original goal pursued in the episode. See Alg. 1 for a more formal description of the algorithm.
HER may be seen as a form of implicit curriculum as the goals used for replay naturally shift from ones which are simple to achieve even by a random agent to more difficult ones.
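The replay-buffer bookkeeping just described — each transition stored once with the original goal and once per additional goal, with the reward recomputed — might look like this (a schematic sketch: the strategy shown is the simplest final variant, and the goal-equality reward and 1-d chain are illustrative):

```python
def her_store(episode, original_goal, buffer, reward_fn, m, sample_goals):
    """Store every transition with the original goal plus additional hindsight goals.

    episode      - list of (s_t, a_t, s_{t+1}) tuples
    reward_fn    - r(s', g): reward for reaching s' under goal g
    m            - maps a state to a goal achieved in it
    sample_goals - strategy S: (episode, m) -> list of extra goals for replay
    """
    extra_goals = sample_goals(episode, m)
    for (s, a, s_next) in episode:
        # standard experience replay, with the goal the agent was pursuing
        buffer.append((s, a, reward_fn(s_next, original_goal), s_next, original_goal))
        for g in extra_goals:  # hindsight replay with substituted goals
            buffer.append((s, a, reward_fn(s_next, g), s_next, g))

def final_strategy(episode, m):
    """S(s_0, ..., s_T) = m(s_T): replay with the goal achieved in the final state."""
    return [m(episode[-1][2])]

# toy usage on a 1-d chain where states are integers and goals are states
def reward_fn(s_next, g):
    return 0 if s_next == g else -1

def m(s):
    return s

episode = [(0, 1, 1), (1, 1, 2), (2, 1, 3)]   # the agent walks 0 -> 3, pursuing goal 9
buffer = []
her_store(episode, original_goal=9, buffer=buffer,
          reward_fn=reward_fn, m=m, sample_goals=final_strategy)
# buffer now holds 6 transitions: all original-goal rewards are -1,
# while the hindsight copy of the last transition has reward 0.
```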
However, in contrast to explicit curriculum, HER does not require having any control over the distribution of initial environment states. Not only does HER learn with extremely sparse rewards, in our experiments it also performs better with sparse rewards than with shaped ones (see Sec. 4.4). These results are indicative of the practical challenges with reward shaping, and that shaped rewards would often constitute a compromise on the metric we truly care about (such as binary success/failure).
4 Experiments
The video presenting our experiments is available at https://goo.gl/SMrQnI.
4.1 Environments
There are no standard environments for multi-goal RL and therefore we created our own environments. We decided to use manipulation environments based on an existing hardware robot to ensure that the challenges we face correspond as closely as possible to the real world. In all experiments we use a 7-DOF Fetch Robotics arm which has a two-fingered parallel gripper. The robot is simulated using the MuJoCo (Todorov et al., 2012) physics engine. The whole training procedure is performed in the simulation but we show in Sec. 4.6 that the trained policies perform well on the physical robot without any finetuning.
Policies are represented as Multi-Layer Perceptrons (MLPs) with Rectified Linear Unit (ReLU) activation functions. Training is performed using the DDPG algorithm (Lillicrap et al., 2015) with Adam (Kingma and Ba, 2014) as the optimizer. See Appendix A for more details and the values of all hyperparameters.

Algorithm 1 Hindsight Experience Replay (HER)
Given:
• an off-policy RL algorithm A, ▷ e.g. DQN, DDPG, NAF, SDQN
• a strategy S for sampling goals for replay, ▷ e.g. S(s0, . . . , sT) = m(sT)
• a reward function r : S × A × G → R. ▷ e.g. r(s, a, g) = −[fg(s) = 0]
Initialize A ▷ e.g. initialize neural networks
Initialize replay buffer R
for episode = 1, M do
  Sample a goal g and an initial state s0.
  for t = 0, T − 1 do
    Sample an action at using the behavioral policy from A: at ← πb(st || g) ▷ || denotes concatenation
    Execute the action at and observe a new state st+1
  end for
  for t = 0, T − 1 do
    rt := r(st, at, g)
    Store the transition (st || g, at, rt, st+1 || g) in R ▷ standard experience replay
    Sample a set of additional goals for replay G := S(current episode)
    for g′ ∈ G do
      r′ := r(st, at, g′)
      Store the transition (st || g′, at, r′, st+1 || g′) in R ▷ HER
    end for
  end for
  for t = 1, N do
    Sample a minibatch B from the replay buffer R
    Perform one step of optimization using A and minibatch B
  end for
end for

We consider 3 different tasks:
1. Pushing. In this task a box is placed on a table in front of the robot and the task is to move it to the target location on the table. The robot fingers are locked to prevent grasping. The learned behaviour is a mixture of pushing and rolling.
2. Sliding. In this task a puck is placed on a long slippery table and the target position is outside of the robot's reach so that it has to hit the puck with such a force that it slides and then stops in the appropriate place due to friction.
3. Pick-and-place. This task is similar to pushing but the target position is in the air and the fingers are not locked.
To make exploration in this task easier we recorded a single state in which the box is grasped and start half of the training episodes from this state2.
The images showing the tasks being performed can be found in Appendix C.
States: The state of the system is represented in the MuJoCo physics engine.
Goals: Goals describe the desired position of the object (a box or a puck depending on the task) with some fixed tolerance ε, i.e. G = R³ and fg(s) = [|g − s_object| ≤ ε], where s_object is the position of the object in the state s. The mapping from states to goals used in HER is simply m(s) = s_object.
Rewards: Unless stated otherwise we use binary and sparse rewards r(s, a, g) = −[fg(s′) = 0], where s′ is the state after the execution of the action a in the state s. We compare sparse and shaped reward functions in Sec. 4.4.
State-goal distributions: For all tasks the initial position of the gripper is fixed, while the initial position of the object and the target are randomized. See Appendix A for details.
Observations: In this paragraph relative means relative to the current gripper position. The policy is given as input the absolute position of the gripper, the relative position of the object and the target3, as well as the distance between the fingers. The Q-function is additionally given the linear velocity of the gripper and fingers as well as the relative linear and angular velocity of the object. We decided to restrict the input to the policy in order to make deployment on the physical robot easier.
Actions: None of the problems we consider require gripper rotation and therefore we keep it fixed. Action space is 4-dimensional.

2This was necessary because we could not successfully train any policies for this task without using the demonstration state. We have later discovered that training is possible without this trick if only the goal position is sometimes on the table and sometimes in the air.
Three dimensions specify the desired relative gripper position at\nthe next timestep. We use MuJoCo constraints to move the gripper towards the desired position but\nJacobian-based control could be used instead4. The last dimension speci\ufb01es the desired distance\nbetween the 2 \ufb01ngers which are position controlled.\nStrategy S for sampling goals for replay: Unless stated otherwise HER uses replay with the goal\ncorresponding to the \ufb01nal state in each episode, i.e. S(s0, . . . , sT ) = m(sT ). We compare different\nstrategies for choosing which goals to replay with in Sec. 4.5.\n4.2 Does HER improve performance?\nIn order to verify if HER improves performance we evaluate DDPG with and without HER on all\n3 tasks. Moreover, we compare against DDPG with count-based exploration5 (Strehl and Littman,\n2005; Kolter and Ng, 2009; Tang et al., 2016; Bellemare et al., 2016; Ostrovski et al., 2017). For\nHER we store each transition in the replay buffer twice: once with the goal used for the generation\nof the episode and once with the goal corresponding to the \ufb01nal state from the episode (we call this\nstrategy final). In Sec. 4.5 we perform ablation studies of different strategies S for choosing goals\nfor replay, here we include the best version from Sec. 4.5 in the plot for comparison.\n\nFigure 2: Multiple goals.\n\nFigure 3: Single goal.\n\nFig. 2 shows the learning curves for all 3 tasks6. DDPG without HER is unable to solve any of the\ntasks7 and DDPG with count-based exploration is only able to make some progress on the sliding\ntask. On the other hand, DDPG with HER solves all tasks almost perfectly. It con\ufb01rms that HER is a\ncrucial element which makes learning from sparse, binary rewards possible.\n4.3 Does HER improve performance even if there is only one goal we care about?\nIn this section we evaluate whether HER improves performance in the case where there is only one\ngoal we care about. 
To this end, we repeat the experiments from the previous section but the goal state is identical in all episodes.
From Fig. 3 it is clear that DDPG+HER performs much better than pure DDPG even if the goal state is identical in all episodes. More importantly, comparing Fig. 2 and Fig. 3 we can also notice that HER learns faster if training episodes contain multiple goals, so in practice it is advisable to train on multiple goals even if we care only about one of them.

3The target position is relative to the current object position.
4The successful deployment on a physical robot (Sec. 4.6) confirms that our control model produces movements which are reproducible on the physical robot despite not being fully physically plausible.
5We discretize the state space and use an intrinsic reward of the form α/√N, where α is a hyperparameter and N is the number of times the given state was visited. The discretization works as follows. We take the relative position of the box and the target and then discretize every coordinate using a grid with a stepsize β which is a hyperparameter. We have performed a hyperparameter search over α ∈ {0.032, 0.064, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32}, β ∈ {1cm, 2cm, 4cm, 8cm}. The best results were obtained using α = 1 and β = 1cm and these are the results we report.
6An episode is considered successful if the distance between the object and the goal at the end of the episode is less than 7cm for pushing and pick-and-place and less than 20cm for sliding. The results are averaged across 5 random seeds and shaded areas represent one standard deviation.
7We also evaluated DQN (without HER) on our tasks and it was not able to solve any of them.

Figure 4: Ablation study of different strategies for choosing additional goals for replay. The top row shows the highest (across the training epochs) test performance and the bottom row shows the average test performance across all training epochs.
On the right top plot the curves for final, episode and future coincide as all these strategies achieve perfect performance on this task.

4.4 How does HER interact with reward shaping?
So far we only considered binary rewards of the form r(s, a, g) = −[|g − s_object| > ε]. In this section we check how the performance of DDPG with and without HER changes if we replace this reward with one which is shaped. We considered reward functions of the form r(s, a, g) = λ|g − s_object|^p − |g − s′_object|^p, where s′ is the state of the environment after the execution of the action a in the state s and λ ∈ {0, 1}, p ∈ {1, 2} are hyperparameters.
Surprisingly, neither DDPG nor DDPG+HER was able to successfully solve any of the tasks with any of these reward functions8 (learning curves can be found in Appendix D). Our results are consistent with the fact that successful applications of RL to difficult manipulation tasks which do not use demonstrations usually have more complicated reward functions than the ones we tried (e.g. Popov et al. (2017)).
The following two reasons can cause shaped rewards to perform so poorly: (1) There is a huge discrepancy between what we optimize (i.e. a shaped reward function) and the success condition (i.e. is the object within some radius from the goal at the end of the episode?); (2) Shaped rewards penalize inappropriate behaviour (e.g. moving the box in a wrong direction) which may hinder exploration. It can cause the agent to learn not to touch the box at all if it cannot manipulate it precisely, and we noticed such behaviour in some of our experiments.
Our results suggest that domain-agnostic reward shaping does not work well (at least in the simple forms we have tried). Of course for every problem there exists a reward which makes it easy (Ng et al., 1999) but designing such shaped rewards requires a lot of domain knowledge and may in some cases not be much easier than directly scripting the policy.
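The binary and shaped rewards compared in this section can be written out explicitly (an illustrative sketch: the object position is abstracted to a scalar distance from the goal, and the tolerance value is ours):

```python
def sparse_reward(dist_after, eps=0.05):
    """r(s, a, g) = -[|g - s_object| > eps]: 0 within tolerance, -1 otherwise."""
    return 0 if dist_after <= eps else -1

def shaped_reward(dist_before, dist_after, lam=1, p=1):
    """r(s, a, g) = lam * |g - s_object|^p - |g - s'_object|^p.

    With lam = 1 this rewards per-step progress towards the goal; with lam = 0
    it penalizes the remaining distance. lam in {0, 1}, p in {1, 2} as in Sec. 4.4.
    """
    return lam * dist_before ** p - dist_after ** p

# moving 10 cm closer to the goal but still 30 cm away:
r_sparse = sparse_reward(0.30)            # -1: the success condition is untouched
r_shaped = shaped_reward(0.40, 0.30)      # positive: signals progress the sparse reward hides
```

The mismatch the section describes is visible here: the shaped reward rewards the step, while the sparse reward (and the success metric) only change once the object enters the tolerance radius.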
This strengthens our belief that learning from sparse, binary rewards is an important problem.

4.5 How many goals should we replay each trajectory with and how to choose them?
In this section we experimentally evaluate different strategies (i.e. S in Alg. 1) for choosing goals to use with HER. So far the only additional goals we used for replay were the ones corresponding to the final state of the environment and we will call this strategy final. Apart from it we consider the following strategies: future — replay with k random states which come from the same episode as the transition being replayed and were observed after it; episode — replay with k random states coming from the same episode as the transition being replayed; random — replay with k random states encountered so far in the whole training procedure. All of these strategies have a hyperparameter k which controls the ratio of HER data to data coming from normal experience replay in the replay buffer.

8We also tried to rescale the distances so that the range of rewards is similar to the binary case, clipping big distances, and adding a simple (linear or quadratic) term encouraging the gripper to move towards the object, but none of these techniques led to successful training.

Figure 5: The pick-and-place policy deployed on the physical robot.

The plots comparing different strategies and different values of k can be found in Fig. 4. We can see from the plots that all strategies apart from random solve pushing and pick-and-place almost perfectly regardless of the values of k. In all cases future with k equal 4 or 8 performs best and it is the only strategy which is able to solve the sliding task almost perfectly. The learning curves for future with k = 4 can be found in Fig. 2. This confirms that the most valuable goals for replay are the ones which are going to be achieved in the near future9.
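The four replay strategies can be sketched as a single sampling function over stored episodes (a schematic sketch: the episode encoding as a plain list of states, the handling of short future pools, and the default k = 4 are our illustrative choices):

```python
import random

def sample_replay_goals(strategy, episode, t, all_states, m, k=4):
    """Return up to k hindsight goals for the transition at index t of `episode`.

    strategy   - 'final', 'future', 'episode' or 'random'
    episode    - list of states s_0, ..., s_T of the current episode
    t          - index of the transition being replayed
    all_states - states encountered so far in the whole training procedure
    m          - mapping from a state to a goal achieved in it
    """
    if strategy == 'final':
        pool = [episode[-1]]        # the state achieved at the end of the episode
    elif strategy == 'future':
        pool = episode[t + 1:]      # states observed after this transition
    elif strategy == 'episode':
        pool = episode              # any state from the same episode
    elif strategy == 'random':
        pool = all_states           # any state seen during training
    else:
        raise ValueError(strategy)
    if not pool:
        return []
    # sample at most k goals; larger k puts more HER data in the buffer
    return [m(random.choice(pool)) for _ in range(min(k, len(pool)))]

# toy usage: replaying the transition at t = 7 of a 10-step episode of integer states
episode_states = list(range(10))
identity = lambda s: s
future_goals = sample_replay_goals('future', episode_states, t=7,
                                   all_states=episode_states, m=identity, k=4)
```

Capping the number of sampled goals at the pool size is one possible way to respect the k trade-off the section describes: pushing k higher only dilutes the normal replay data.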
Notice that increasing the value of k above 8 degrades performance because the fraction of normal replay data in the buffer becomes very low.
4.6 Deployment on a physical robot
We took a policy for the pick-and-place task trained in the simulator (the version with the future strategy and k = 4 from Sec. 4.5) and deployed it on a physical Fetch robot without any finetuning. The box position was predicted using a separately trained CNN on raw Fetch head camera images. See Appendix B for details.
Initially the policy succeeded in 2 out of 5 trials. It was not robust to small errors in the box position estimation because it was trained on perfect state coming from the simulation. After retraining the policy with Gaussian noise (std = 1cm) added to observations10 the success rate increased to 5/5. The video showing some of the trials is available at https://goo.gl/SMrQnI.
5 Related work
The technique of experience replay was introduced in Lin (1992) and became very popular after it was used in the DQN agent playing Atari (Mnih et al., 2015). Prioritized experience replay (Schaul et al., 2015b) is an improvement to experience replay which prioritizes transitions in the replay buffer in order to speed up training. It is orthogonal to our work and the two approaches can easily be combined.
Learning policies for multiple tasks simultaneously has been heavily explored in the context of policy search, e.g. Schmidhuber and Huber (1990); Caruana (1998); Da Silva et al. (2012); Kober et al. (2012); Devin et al. (2016); Pinto and Gupta (2016). Learning off-policy value functions for multiple tasks was investigated by Foster and Dayan (2002) and Sutton et al. (2011). Our work is most heavily based on Schaul et al. (2015a), who consider training a single neural network approximating multiple value functions.
Learning to perform multiple tasks simultaneously has also been investigated for a long time in the context of Hierarchical Reinforcement Learning, e.g. Bakker and Schmidhuber (2004); Vezhnevets et al. (2017).
Our approach may be seen as a form of implicit curriculum learning (Elman, 1993; Bengio et al., 2009). While curricula are now often used for training neural networks (e.g. Zaremba and Sutskever (2014); Graves et al. (2016)), the curriculum is almost always hand-crafted. The problem of automatic curriculum generation was approached by Schmidhuber (2004), who constructed an asymptotically optimal algorithm for this problem using program search. Another interesting approach is PowerPlay (Schmidhuber, 2013; Srivastava et al., 2013), which is a general framework for automatic task selection. Graves et al. (2017) consider a setup where there is a fixed discrete set of tasks and empirically evaluate different strategies for automatic curriculum generation in this setting. Another approach, investigated by Sukhbaatar et al. (2017) and Held et al. (2017), uses self-play between the policy and a task-setter in order to automatically generate goal states which are on the border of what the current policy can achieve. Our approach is orthogonal to these techniques and can be combined with them.

9 We have also tried replaying the goals which are close to the ones achieved in the near future, but this has not performed better than the future strategy.

10 The Q-function approximator was trained using exact observations. It does not have to be robust to noisy observations because it is not used during the deployment on the physical robot.

6 Conclusions
We introduced a novel technique called Hindsight Experience Replay which makes it possible to apply RL algorithms to problems with sparse and binary rewards.
Our technique can be combined with an arbitrary off-policy RL algorithm, and we experimentally demonstrated this with DQN and DDPG. We showed that HER allows training policies which push, slide and pick-and-place objects with a robotic arm to the specified positions, while the vanilla RL algorithms fail to solve these tasks. We also showed that the policy for the pick-and-place task performs well on the physical robot without any finetuning. As far as we know, this is the first time such complicated behaviours have been learned using only sparse, binary rewards.

Acknowledgments
We would like to thank Ankur Handa, Jonathan Ho, John Schulman, Matthias Plappert, Tim Salimans, and Vikash Kumar for providing feedback on previous versions of this manuscript. We would also like to thank Rein Houthooft and the whole OpenAI team for fruitful discussions, as well as Bowen Baker for performing some additional experiments.

References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Bakker, B. and Schmidhuber, J. (2004). Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proc. of the 8th Conf. on Intelligent Autonomous Systems, pages 438–445.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM.

Caruana, R. (1998). Multitask learning. In Learning to learn, pages 95–133.
Springer.

Chebotar, Y., Kalakrishnan, M., Yahya, A., Li, A., Schaal, S., and Levine, S. (2016). Path integral guided policy search. arXiv preprint arXiv:1610.00529.

Da Silva, B., Konidaris, G., and Barto, A. (2012). Learning parameterized skills. arXiv preprint arXiv:1206.6398.

Devin, C., Gupta, A., Darrell, T., Abbeel, P., and Levine, S. (2016). Learning modular neural network policies for multi-task and multi-robot transfer. arXiv preprint arXiv:1609.07088.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99.

Foster, D. and Dayan, P. (2002). Structure in the space of value functions. Machine Learning, 49(2):325–346.

Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.

Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016). Continuous deep q-learning with model-based acceleration. arXiv preprint arXiv:1603.00748.

Held, D., Geng, X., Florensa, C., and Abbeel, P. (2017). Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366.

Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kober, J., Wilhelm, A., Oztop, E., and Peters, J. (2012). Reinforcement learning to adjust parametrized motor primitives to new situations.
Autonomous Robots, 33(4):361–379.

Kolter, J. Z. and Ng, A. Y. (2009). Near-bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2015). End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321.

Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. (2017). Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Ng, A., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. (2006). Autonomous inverted helicopter flight via reinforcement learning. Experimental Robotics IX, pages 363–372.

Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, pages 4026–4034.

Ostrovski, G., Bellemare, M. G., Oord, A. v. d., and Munos, R. (2017). Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310.

Peters, J. and Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients.
Neural networks, 21(4):682–697.

Pinto, L. and Gupta, A. (2016). Learning to push by grasping: Using multiple tasks for effective learning. arXiv preprint arXiv:1609.09025.

Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., Lampe, T., Tassa, Y., Erez, T., and Riedmiller, M. (2017). Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015a). Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1312–1320.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015b). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

Schmidhuber, J. (2004). Optimal ordered problem solver. Machine Learning, 54(3):211–254.

Schmidhuber, J. (2013). Powerplay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in psychology, 4.

Schmidhuber, J. and Huber, R. (1990). Learning to generate focus trajectories for attentive vision. Institut für Informatik.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.

Srivastava, R. K., Steunebrink, B. R., and Schmidhuber, J. (2013). First experiments with powerplay. Neural Networks, 41:130–136.

Strehl, A. L. and Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd international conference on Machine learning, pages 856–863. ACM.

Sukhbaatar, S., Kostrikov, I., Szlam, A., and Fergus, R. (2017). Intrinsic motivation and automatic curricula via asymmetric self-play.
arXiv preprint arXiv:1703.05407.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems.

Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv preprint arXiv:1611.04717.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. arXiv preprint arXiv:1703.06907.

Todorov, E., Erez, T., and Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.

Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161.

Zaremba, W. and Sutskever, I. (2014). Learning to execute.
arXiv preprint arXiv:1410.4615.