{"title": "Policy Continuation with Hindsight Inverse Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 10265, "page_last": 10275, "abstract": "Solving goal-oriented tasks is an important but challenging problem in reinforcement learning (RL). For such tasks, the rewards are often sparse, making it difficult to learn a policy effectively. To tackle this difficulty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This approach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay. Enabling the learning process in a self-imitated manner and thus can be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general, which can work in isolation or be combined with other on-policy and off-policy algorithms. On two multi-goal tasks GridWorld and FetchReach, PCHID significantly improves the sample efficiency as well as the final performance.", "full_text": "Policy Continuation with Hindsight Inverse Dynamics\n\nHao Sun1, Zhizhong Li1, Xiaotong Liu2, Dahua Lin1, Bolei Zhou1\n\n1The Chinese University of Hong Kong, 2Peking University\n\nAbstract\n\nSolving goal-oriented tasks is an important but challenging problem in reinforce-\nment learning (RL). For such tasks, the rewards are often sparse, making it dif\ufb01cult\nto learn a policy effectively. To tackle this dif\ufb01culty, we propose a new approach\ncalled Policy Continuation with Hindsight Inverse Dynamics (PCHID). This ap-\nproach learns from Hindsight Inverse Dynamics based on Hindsight Experience\nReplay. Enabling the learning process in a self-imitated manner and thus can be\ntrained with supervised learning. This work also extends it to multi-step settings\nwith Policy Continuation. The proposed method is general, which can work in\nisolation or be combined with other on-policy and off-policy algorithms. 
On two multi-goal tasks, GridWorld and FetchReach, PCHID significantly improves the sample efficiency as well as the final performance1.

1

Introduction

Imagine you are given the task of Tower of Hanoi with ten disks. What would you do to solve this complex problem? The game looks daunting at first glance. However, through trial and error, one may discover that the key is to recursively relocate the disks on the top of the stack from one pod to another, assisted by an intermediate one. In this case, you are actually learning skills from easier sub-tasks, and those skills help you to learn more. This case exemplifies the procedure of self-imitated curriculum learning, which recursively develops the skills for solving more complex problems.

Tower of Hanoi belongs to an important class of challenging problems in Reinforcement Learning (RL), namely goal-oriented tasks. In such tasks, rewards are usually very sparse. For example, in many goal-oriented tasks, a single binary reward is provided only when the task is completed [1, 2, 3]. Previous works attribute the difficulty of sparse reward problems to the low efficiency of experience collection [4]. Many approaches have thus been proposed to tackle this problem, including automatic goal generation [5], self-imitation learning [6], hierarchical reinforcement learning [7], curiosity driven methods [8, 9], curriculum learning [1, 10], and Hindsight Experience Replay (HER) [11]. Most of these works guide the agent by demonstrating successful choices collected through sufficient exploration to improve learning efficiency. Differently, HER opens up a new way to learn more from failures, assigning hindsight credit to primal experiences.
However, it is limited in that it is only applicable when combined with off-policy algorithms [3].

In this paper we propose an approach to goal-oriented RL called Policy Continuation with Hindsight Inverse Dynamics (PCHID), which leverages the key idea of self-imitation learning. In contrast to HER, our method can work as an auxiliary module for both on-policy and off-policy algorithms, or as an isolated controller itself. Moreover, by learning to predict actions directly through back-propagation in self-imitation [12], instead of temporal difference [13] or policy gradient [14, 15, 16, 17], the data efficiency is greatly improved.

The contributions of this work lie in three aspects: (1) We introduce the state-goal space partition for multi-goal RL and thereon define Policy Continuation (PC) as a new approach to such tasks.

1Code and related materials are available at https://sites.google.com/view/neurips2019pchid

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

(2) We propose Hindsight Inverse Dynamics (HID), which extends the vanilla Inverse Dynamics method to the goal-oriented setting. (3) We further integrate PC and HID into PCHID, which can effectively leverage self-supervised learning to accelerate the process of reinforcement learning. Note that PCHID is a general method; both on-policy and off-policy algorithms can benefit therefrom. We test this method on challenging RL problems, where it achieves considerably higher sample efficiency and performance.

2 Related Work

Hindsight Experience Replay Learning with sparse rewards is a leading challenge in RL, because sparse rewards are rarely reached through random exploration. Hindsight Experience Replay (HER), which relabels failed rollouts as successful ones, was proposed by Andrychowicz et al. [11] as a method to deal with such problems.
The agent in HER receives a reward when reaching either the original goal or the relabeled goal in each episode, by storing both the original transition tuples (st, g, at, r) and the relabeled tuples (st, g′, at, r′) in the replay buffer. HER was later extended to work with demonstration data [4] and boosted with multi-processing training [3]. The work of Rauber et al. [18] further extended hindsight knowledge into policy gradient methods using importance sampling.

Inverse Dynamics Given a state transition pair (st, at, st+1), inverse dynamics [19] takes (st, st+1) as the input and outputs the corresponding action at. Previous works used inverse dynamics to perform feature extraction [20, 9, 21] for policy network optimization. The actions stored in such transition pairs are typically collected with a random policy, so they can barely be used to optimize the policy network directly. In our work, we use hindsight experience to revise the original transition pairs in inverse dynamics, and we call this approach Hindsight Inverse Dynamics. The details are elucidated in the next section.

Auxiliary Task and Curiosity Driven Method Mirowski et al. [22] propose to jointly learn goal-driven reinforcement learning problems with an unsupervised depth prediction task and a self-supervised loop closure classification task, improving data efficiency and task performance. However, their method requires extra supervision such as depth input. Shelhamer et al. [21] introduce several self-supervised auxiliary tasks to perform feature extraction and adopt the learned features in reinforcement learning, improving the data efficiency and returns of end-to-end learning. Pathak et al. [20] propose to learn an intrinsic curiosity reward besides the normal extrinsic reward, formulated as the prediction error in a visual feature space, and improved the learning efficiency.
Both of these approaches belong to self-supervision and utilize inverse dynamics during training. Although our method can also be used as an auxiliary task and trained in a self-supervised way, we improve the vanilla inverse dynamics with hindsight, which enables direct joint training of policy networks with temporal difference and self-supervised learning.

3 Policy Continuation with Hindsight Inverse Dynamics

In this section we first briefly go through the preliminaries in Sec.3.1. In Sec.3.2 we revisit a toy example introduced in HER as a motivating example. Sec.3.3 to 3.6 describe our method in detail.

3.1 Preliminaries

Markov Decision Process We consider a Markov Decision Process (MDP) denoted by a tuple (S, A, P, r, γ), where S, A are the finite state and action spaces, P describes the transition probability as S × A × S → [0, 1], r : S → R is the reward function, and γ ∈ [0, 1] is the discount factor. π : S × A → [0, 1] denotes a policy, and an optimal policy π* satisfies π* = arg max_π E_{s,a∼π}[ Σ_{t=0}^∞ γ^t r(st) ], where at ∼ π(at|st), st+1 ∼ P(st+1|at, st), and s0 is a given start state. When the transition dynamics and policy are deterministic, π* = arg max_π E_{s0}[ Σ_{t=0}^∞ γ^t r(st) ] with at = π(st) and st+1 = T(st, at), where π : S → A is deterministic and T models the deterministic transition dynamics. The expectation is over all possible start states.

Universal Value Function Approximators and Multi-Goal RL The Universal Value Function Approximator (UVFA) [23] extends the state space of Deep Q-Networks (DQN) [24] to include goal

Figure 1: (a): Results in the bit-flipping problem. (b): An illustration of a flat state space.
(c): An example of the GridWorld domain, which is a non-flat case.

state g ∈ G as part of the input, i.e., st is extended to (st, g) ∈ S × G, and the policy becomes π : S × G → A, which is useful in settings where there are multiple goals to achieve. Moreover, Schaul et al. [23] show that in such a setting, the learned policy can generalize to previously unseen state-goal pairs. Our application of UVFA to the Proximal Policy Optimization algorithm (PPO) [25] is straightforward. In the remainder of this work, we will use state-goal pairs to denote the extended state space (s, g) ∈ S × G, with at = π(st, g) and (st+1, g) = T(st, at). The goal g is fixed within an episode, but changes across episodes.

3.2 Revisiting the Bit-Flipping Problem

The bit-flipping problem was provided as a motivating example in HER [11]. There are n bits with the state space S = {0, 1}^n and the action space A = {0, 1, ..., n − 1}. An action a flips the a-th bit of the state. Each episode starts with a randomly generated state s0 and a random goal state g. The agent receives a reward only when the goal state g is reached. HER proposes to relabel failed trajectories to receive more reward signals, thus enabling the policy to learn from failures. However, the method is based on temporal difference, so its data efficiency is limited. Since we can learn from failures, a natural question arises: can we learn a policy by supervised learning where the data is generated using hindsight experience?

Inspired by the self-imitation learning ability of humans, we aim to employ self-imitation to learn how to succeed in RL even when the original goal has not yet been achieved. A straightforward way to utilize self-imitation learning is to adopt the inverse dynamics.
However, in most cases the actions stored in inverse dynamics are irrelevant to the goals.

Specifically, transition tuples like ((st, g), (st+1, g), at) are saved to learn the inverse dynamics of goal-oriented tasks. The learning process can then be executed simply as classification when the action space is discrete, or regression when the action space is continuous. Given a neural network parameterized by φ, the objective of learning inverse dynamics is as follows:

φ = arg min_φ Σ_{st, st+1, at} ||f_φ((st, g), (st+1, g)) − at||².   (1)

Due to the unawareness of the goals while the agent is taking actions, the goals g in Eq.(1) are only placeholders. Thus, it costs nothing to replace g with g′ = m(st+1), which results in a more meaningful form, i.e., encoding the following state as a hindsight goal. That is to say, if the agent wants to reach g′ from st, it should take the action at; thus the decision making process becomes aware of the hindsight goal. We adopt f_φ trained from Eq.(1) as an additional module incorporated with HER in the bit-flipping environment, by simply adding up their logit outputs. As shown in Fig.1(a), such an additional module leads to significant improvement. We attribute this success to the flatness of the state space. Fig.1(b) illustrates such a flat case, where an agent in a grid map is required to reach the goal g3 starting from s0: if the agent has already learned how to reach s1 in the east, intuitively, it has no problem extrapolating its policy to reach g3 in the farther east.

Nevertheless, success is not always within an effortless single step reach. Reaching the goals g1 and g2 is a relatively harder task, and navigating from the start point to the goal point in the GridWorld domain shown in Fig.1(c) is even more challenging.
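To make the relabeling concrete, here is a minimal sketch (ours, not the paper's code) of collecting one-step hindsight inverse dynamics data in the bit-flipping environment, where m is the identity mapping; the resulting pairs could then be fit with the classification objective of Eq.(1):

```python
import numpy as np

def rollout_bitflip(n_bits, horizon, rng):
    """Random rollout in the bit-flipping environment: action a flips bit a."""
    s = rng.integers(0, 2, size=n_bits)
    traj = []
    for _ in range(horizon):
        a = int(rng.integers(n_bits))
        s_next = s.copy()
        s_next[a] ^= 1
        traj.append((s, a, s_next))
        s = s_next
    return traj

def one_step_hid(traj):
    """One-step Hindsight Inverse Dynamics: relabel each transition with the
    hindsight goal g' = m(s_{t+1}); here m is the identity, so g' = s_{t+1}.
    Returns ((s_t, g'), a_t) pairs for supervised learning."""
    return [((s, s_next.copy()), a) for s, a, s_next in traj]

rng = np.random.default_rng(0)
data = one_step_hid(rollout_bitflip(n_bits=8, horizon=10, rng=rng))
# every relabeled pair is 1-step solvable: taking a_t from s_t reaches g'
for (s, g), a in data:
    s_check = s.copy()
    s_check[a] ^= 1
    assert (s_check == g).all()
```

Unlike vanilla inverse dynamics, every pair produced this way is goal-consistent by construction, which is what makes it usable as supervised policy-learning data.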
To further employ self-imitation learning and overcome the single-step limitation of inverse dynamics, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics.

3.3 Perspective of Policy Continuation on Multi-Goal RL Tasks

Our approach is mainly based on policy continuation over sub-policies, which can be viewed as an emendation of the spontaneous extrapolation in the bit-flipping case.

Definition 1: Policy Continuation (PC) Suppose π is a policy function defined on a non-empty sub-state-space SU of the state space S, i.e., SU ⊂ S. If SV is a larger subset of S containing SU, i.e., SU ⊂ SV, and Π is a policy function defined on SV such that

Π(s) = π(s)  ∀s ∈ SU,

then we call Π a policy continuation of π, or we can say the restriction of Π to SU is the policy function π.

Denote the optimal policy as π* : (st, gt) → at. We introduce the concept of k-step solvability:

Definition 2: k-Step Solvability Given a state-goal pair (s, g) as a task of a certain system with deterministic dynamics, if reaching the goal g needs at least k steps under the optimal policy π* starting from s, i.e., starting from s0 = s and executing ai = π*(si, g) for i ∈ {0, 1, ..., k − 1}, the state sk = T(sk−1, ak−1) satisfies m(sk) = g, we say the pair (s, g) has k-step solvability, or (s, g) is k-step solvable.

Ideally, k-step solvability refers to the number of steps it should take from s to g, given the maximum permitted action value.
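As a toy illustration of Definition 2 (our example, with hypothetical interfaces), k-step solvability under deterministic dynamics can be computed exactly by breadth-first search in a small environment, which also yields the state-goal partition used below:

```python
from collections import deque

def k_step_partition(states, actions, step_fn, m):
    """Partition state-goal pairs by k-step solvability (Definition 2) via BFS
    under deterministic dynamics. Returns {(s0, g): k} for all solvable pairs."""
    solvability = {}
    for s0 in states:
        # BFS from s0: shortest distance to every reachable state
        dist = {s0: 0}
        queue = deque([s0])
        while queue:
            s = queue.popleft()
            for a in actions:
                s2 = step_fn(s, a)
                if s2 in states and s2 not in dist:
                    dist[s2] = dist[s] + 1
                    queue.append(s2)
        for s, k in dist.items():
            solvability[(s0, m(s))] = k  # (s0, g) is k-step solvable
    return solvability

# Toy 1-D chain of 5 states with moves {-1, +1}; m is the identity.
states = set(range(5))
part = k_step_partition(states, [-1, 1], lambda s, a: s + a, lambda s: s)
assert part[(0, 3)] == 3 and part[(2, 2)] == 0
```

In realistic continuous tasks this enumeration is infeasible, which is why the paper resorts to the approximate, evolving notion of solvability discussed next.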
In practice, k-step solvability is an evolving concept that can gradually change during the learning process; it is thus defined as "whether the pair can be solved with πk−1 within k steps after the convergence of πk−1 trained on (k−1)-step HIDs".

We follow HER to assume a mapping m : S → G s.t. ∀s ∈ S the reward function r(s, m(s)) = 1; thus, the information of a goal g is encoded in a state s. In the simplest case, m is the identity mapping and G = S, where the goal g is a certain state s of the system.

Following the idea of recursion in curriculum learning, we can divide the finite state-goal space into T + 2 parts according to k-step solvability:

S × G = (S × G)0 ∪ (S × G)1 ∪ ... ∪ (S × G)T ∪ (S × G)U   (2)

where (s, g) ∈ S × G, T is a finite time-step horizon within which we suppose the task should be solved, (S × G)i, i ∈ {0, 1, 2, ..., T} denotes the set of i-step solvable state-goal pairs, (S × G)U denotes the unsolvable state-goal pairs, i.e., pairs that are not k-step solvable for any k ∈ {0, 1, 2, ..., T}, and (S × G)0 is the trivial case g = m(s0). As the optimal policy only aims to solve the solvable state-goal pairs, we can leave (S × G)U out of consideration.
It is clear that we can define a disjoint sub-state-goal space union for the solvable state-goal pairs.

Definition 3: Solvable State-Goal Space Partition Given a certain environment, any solvable state-goal pair can be categorized into exactly one sub-state-goal space by the following partition:

S × G \ (S × G)U = ∪_{j=0}^{T} (S × G)j   (3)

We then define a set of sub-policies {πi}, i ∈ {0, 1, 2, ..., T}, on the solvable sub-state-goal spaces ∪_{j=0}^{i} (S × G)j respectively, with the following definition:

Definition 4: Sub-Policy on Sub-Space πi is a sub-policy defined on the sub-state-goal space (S × G)i. We say πi* is an optimal sub-policy if it is able to solve all i-step solvable state-goal pair tasks in i steps.

Corollary 1: If {πi*} is restricted to be a policy continuation of {πi−1*} for ∀i ∈ {1, 2, ..., k}, then πi* is able to solve any i-step solvable problem for i ≤ k. By definition, the optimal policy π* is a policy continuation of the sub-policy πT*, and πT* is already a substitute for the optimal policy π*.

We can recursively approximate π* by expanding the domain of the sub-state-goal space in policy continuation, starting from an optimal sub-policy π0*. In practice, we use neural networks to approximate such sub-policies to perform policy continuation. We propose to parameterize a policy function π = fθ by θ with neural networks, optimize fθ by self-supervised learning with the data collected by Hindsight Inverse Dynamics (HID) recursively, and optimize πi by joint optimization.

3.4 Hindsight Inverse Dynamics

One-Step Hindsight Inverse Dynamics One-step HID data can be collected easily.
With n randomly rolled-out trajectories {(s0, g), a0, r0, (s1, g), a1, ..., (sT, g), aT, rT}_i, i ∈ {1, 2, ..., n}, we can use a modified inverse dynamics by substituting the original goal g with the hindsight goal g′ = m(st+1) for every st, resulting in {(s0, m(s1)), a0, (s1, m(s2)), a1, ..., (sT−1, m(sT)), aT−1}_i, i ∈ {1, 2, ..., n}. We can then fit f_θ1 by

θ1 = arg min_θ Σ_{st, st+1, at} ||f_θ((st, m(st+1)), (st+1, m(st+1))) − at||²   (4)

By collecting enough trajectories, we can optimize f_θ, implemented as a neural network, with stochastic gradient descent [26]. When m is the identity mapping, the function f_θ1 is a good enough approximator of π1*, which is guaranteed by the approximation ability of neural networks [27, 28, 29]. Otherwise, we should adapt Eq. 4 as θ1 = arg min_θ Σ_{st, st+1, at} ||f_θ((st, m(st+1)), m(st+1)) − at||², i.e., we should omit the state information in the future state st+1, to regard f_θ1 as a policy. In practice this becomes θ1 = arg min_θ Σ_{st, st+1, at} ||f_θ(st, m(st+1)) − at||².

Multi-Step Hindsight Inverse Dynamics Once we have f_θk−1, an approximator of πk−1*, k-step HID data is ready to be collected. We can collect valid k-step HID data recursively by testing whether the k-step HID state-goal pairs indeed need k steps to solve, i.e., for any k-step transitions {(st, g), at, rt, ..., (st+k, g), at+k, rt+k}, if our policy πk−1* at hand cannot provide another solution from (st, m(st+k)) to (st+k, m(st+k)) in less than k steps, the state-goal pair (st, m(st+k)) must be k-step solvable, and this pair together with the action at will be marked as ((s_t^(k), m(s_{t+k}^(k))), a_t^(k)). Fig.2 illustrates this process. The testing process is based on a function TEST(·), and we will focus on the selection of TEST in Sec.3.6.
Transition pairs like this will be collected to optimize θk. In practice, we leverage joint training to ensure that f_θk is a policy continuation of πi*, i ∈ {1, ..., k}, i.e.,

θk = arg min_θ Σ_{s_t^(i), s_{t+i}^(i), a_t^(i), i ∈ {1,...,k}} ||f_θ((st, m(st+i)), (st+i, m(st+i))) − at||²   (5)

3.5 Dynamic Programming Formulation

For most goal-oriented tasks, the learning objective is to find a policy that reaches the goal as soon as possible. In such circumstances,

Lπ(st, g) = Lπ(st+1, g) + 1   (6)

where Lπ(s, g) is defined as the number of steps to be executed from s to g with policy π, and the 1 accounts for the additional step from st to st+1. For the optimal policy,

Lπ*(st, g) = Lπ*(st+1, g) + 1   (7)

and π* = arg min_π Lπ(st, g). As for the learning process, it is impossible to enumerate all possible intermediate states st+1 in a continuous state space.

Suppose now we have the optimal sub-policy πk−1* for all i-step solvable problems, ∀i ≤ k − 1. We will have

Lπk*(st, g) = Lπk−1*(st+1, g) + 1   (8)

Figure 2: Test whether the transitions are 2-step (left) or k-step (right) solvable.
The TEST function returns True if the transition st → st+k needs at least k steps.

Algorithm 1 Policy Continuation with Hindsight Inverse Dynamics (PCHID)

Require: policy πb(s, g), reward function r(s, g) (equal to 1 if g = m(s), else 0), a buffer for PCHID B = {B1, B2, ..., BT−1}, a list K
Initialize πb(s, g), B, K = [1]
for episode = 1, M do
  generate s0, g by the system
  for t = 0, T − 1 do
    Select an action by the behavior policy at = πb(st, g)
    Execute the action at and get the next state st+1
    Store the transition ((st, g), at, (st+1, g)) in a temporary episode buffer
  end for
  for t = 0, T − 1 do
    for k ∈ K do
      Calculate the additional goal according to st+k by g′ = m(st+k)
      if TEST(k, st, g′) = True then
        Store (st, g′, at) in Bk
      end if
    end for
  end for
  Sample a minibatch B from buffer B
  Optimize the behavior policy πb(st, g′) to predict at by supervised learning
  if Converged then
    Add max(K) + 1 to K
  end if
end for

Equation (8) holds for any (st, g) ∈ (S × G)k. We can sample trajectories by random rollouts or any unbiased policies and choose feasible (st, g) pairs from them, i.e., any st and st+k in a trajectory such that the pair cannot be solved by the current policy in fewer than k steps:

st −(at = πk(st, st+k))→ st+1 −(πk−1*)→ st+k   (9)

Such a recursive approach starts from π1*, which can easily be approximated by training with self-supervised learning on any given (st, st+1) pairs, since (st, m(st+1)) ∈ (S × G)1 by definition.

The combination of PC with multi-step HID leads to our algorithm, PCHID. PCHID can work alone or as an auxiliary module with other RL algorithms. We discuss three different combination strategies for PCHID and other algorithms in Sec.4.3.
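To make the Algorithm 1 loop above concrete, here is a self-contained toy sketch (our illustration, with hypothetical interfaces: a tabular policy on a 1-D chain, memorisation in place of supervised learning, and the Interaction variant standing in for TEST(·)):

```python
import random

def pchid_train(episodes=300, horizon=8, max_k=3, seed=0):
    """Toy sketch of the PCHID loop on a 1-D chain (states/goals are ints,
    m = identity). A lookup table stands in for the policy network f_theta;
    nothing here is the authors' implementation."""
    rng = random.Random(seed)
    policy = {}                       # (s, g) -> a
    K = [1]                           # solvability horizons mastered so far
    buffers = {k: [] for k in range(1, max_k + 1)}

    def act(s, g):                    # behaviour policy: learned table or random
        return policy.get((s, g), rng.choice([-1, 1]))

    def TEST(k, s, g):                # Interaction TEST: keep the pair only if
        for _ in range(k - 1):        # the current policy can't solve it in < k
            s += act(s, g)
            if s == g:
                return False
        return True

    for _ in range(episodes):
        s, g = rng.randint(-5, 5), rng.randint(-5, 5)
        traj = [s]
        for _ in range(horizon):      # roll out the behaviour policy
            s += act(s, g)
            traj.append(s)
        for t in range(len(traj) - 1):
            for k in K:               # hindsight goals g' = m(s_{t+k})
                if t + k < len(traj) and TEST(k, traj[t], traj[t + k]):
                    buffers[k].append((traj[t], traj[t + k], traj[t + 1] - traj[t]))
        for k in sorted(K, reverse=True):
            for s_t, g_h, a in buffers[k]:   # "supervised learning" = memorise;
                policy[(s_t, g_h)] = a       # short-horizon pairs overwrite long
        if max(K) < max_k:            # stand-in for the convergence check before
            K.append(max(K) + 1)      # extending the curriculum list K
    return policy
```

The curriculum structure (grow K only after the shorter horizons are learned) and the TEST filter are the essential ingredients; in the paper both are realised with neural networks and the convergence criterion of Sec.3.4.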
The full algorithm of PCHID is presented as Algorithm 1.

3.6 On the Selection of the TEST Function

In Algorithm 1, a crucial step in extending the (k − 1)-step sub-policy to the k-step sub-policy is to test whether a k-step transition st → st+k in a trajectory is indeed a k-step solvable problem if we regard st as a start state s0 and m(st+k) as a goal g. We propose two approaches and evaluate both in Sec.4.

Interaction A straightforward idea is to reset the environment to st and execute the action at given by policy πk−1, followed by the execution of at+1, at+2, ..., and record whether it achieves the goal in less than k steps. We call this approach Interaction because it requires the environment to be resettable and interacts with the environment. This approach remains practical when the transition dynamics are known or can be approximated without heavy computational expense.

Random Network Distillation (RND) Given a state as input, RND [30] provides an exploration bonus by comparing the output difference between a fixed, randomly initialized neural network NA and another neural network NB, which is trained to minimize the output difference between NA and NB on previously seen states. After training NB with 1, 2, ..., (k − 1)-step transition pairs to minimize the output difference between NA and NB, NB has never seen k-step solvable transition pairs, so those pairs can be distinguished by their larger output differences.

3.7 Synchronous Improvement

In PCHID, the learning scheme is curricular, i.e., the agent must learn to master easy skills before learning complex ones. However, in general the efficiency of finding a transition sequence that is i-step solvable decreases as i increases. The size of the buffer Bi thus decreases for i = 1, 2, 3, ..., T, and the learning of πi might be restricted due to limited experience.
Besides, in continuous control tasks, k-step solvability refers to the number of steps it should take from s to g, given the maximum permitted action value. In practice, k-step solvability can be treated as an evolving concept that changes gradually as learning proceeds. Specifically, at the beginning, an agent can only walk with small paces, as it has learned from experiences collected by random movements. As training continues, the agent becomes confident enough to move with larger paces, which may change the distribution of selected actions. Consequently, previously k-step solvable state-goal pairs may become solvable in less than k steps.

Based on this efficiency limitation and the progressive definition of k-step solvability, we propose a synchronous version of PCHID. Please refer to the supplementary material for a detailed discussion of the intuitive interpretation and empirical results.

4 Experiments

As a policy π(s, g) aims at reaching a state s′ where m(s′) = g, intuitively the difficulty of solving such a goal-oriented task depends on the complexity of m. In Sec.4.1 we start with a simple case where m is the identity mapping, in the GridWorld environment, by showing the agent a fully observable map. Moreover, the GridWorld environment permits us to use prior knowledge to calculate the accuracy of any TEST function. We show that PCHID can work independently or augmented with DQN in the discrete action space setting, outperforming DQN as well as DQN augmented with HER. The GridWorld environment corresponds to the identity mapping case G = S. In Sec.4.2 we test our method on a continuous control problem, the FetchReach environment provided by Plappert et al. [3]. Our method outperforms PPO by achieving a 100% success rate in about 100 episodes. We further compare the sensitivity of PPO to reward values with the robustness of PCHID.
The state-goal mapping of the FetchReach environment is G ⊂ S.

4.1 GridWorld Navigation

We use the GridWorld navigation task from Value Iteration Networks (VIN) [31], in which the state information includes the position of the agent and an image of the map of obstacles and the goal position. In our experiments we use 16 × 16 domains, in which navigation is not an effortless task. Fig.1(c) shows an example of our domains. The action space is discrete and contains 8 actions leading the agent to its 8 neighbour positions respectively. A reward of 10 is provided if the agent reaches the goal within 50 timesteps; otherwise the agent receives a reward of −0.02. An action leading the agent into an obstacle will not be executed, so the agent stays where it is. In each episode, a new map is generated and start s and goal g points are randomly selected. We train our agent for 500 episodes in total, so that the agent needs to learn to navigate within just 500 trials, which is much less than the number used in VIN [31].2 Thus we can demonstrate the high data efficiency of PCHID

2Tamar et al. train VIN through imitation learning (IL) with ground-truth shortest paths between start and goal positions. Although both approaches are based on IL, we do not need ground-truth data.

Figure 3: (a): The rollout success rate on test maps in 10 experiments with different random seeds. HER outperforms VIN, but the difference disappears when combined with PCHID. PCHID-1 and PCHID-5 represent 1-step and 5-step PCHID. (b): Performance of the PCHID module alone with different TEST functions. The blue line is from ground-truth testing results, the orange and green lines are Interaction and RND respectively, and the red line is the 1-step result as a baseline. (c)(d): Test accuracy and recall with the Interaction and RND methods under different thresholds.

Figure 4: (a): The FetchReach environment.
(b): The reward obtained by each method. In PPO r10, the reward for achieving the goal is 10 instead of the default 0, and the reward is re-scaled to be comparable with the other approaches; this shows the sensitivity of PPO to the reward value. By contrast, the performance of PCHID is unrelated to the reward value. (c): The success rate of each method. Combining PPO with PCHID brings little improvement over PCHID alone, but combining HER with PCHID improves the performance significantly.

by testing the learned agent on 1000 unseen maps. Our work follows VIN in using the rollout success rate as the evaluation metric.

Our empirical results are shown in Fig.3. Our method is compared with DQN, both equipped with VIN as the policy network. We also apply HER to DQN, but observe only a small improvement. PC with 1-step HID, denoted by PCHID 1, achieves accuracy similar to DQN in far fewer episodes, and combining PC with 5-step HID, denoted by PCHID 5, and HER results in a much more distinctive improvement.

4.2 OpenAI Fetch Env

In the Fetch environments, there are several tasks based on a 7-DoF Fetch robotic arm with a two-fingered parallel gripper. There are four tasks: FetchReach, FetchPush, FetchSlide and FetchPickAndPlace. In those tasks, the states include the Cartesian positions and linear velocity of the gripper, as well as the position and velocity information of an object if present. The goal is presented as a 3-dimensional vector describing the target location the object is to be moved to. The agent gets a reward of 0 if the object is at the target location within a tolerance, and −1 otherwise. The action is a continuous 4-dimensional vector, with the first three components controlling the movement of the gripper and the last one controlling the opening and closing of the gripper.

FetchReach Here we demonstrate PCHID on the FetchReach task. We compare PCHID with PPO and HER based on PPO.
Our work is the first to extend hindsight knowledge to on-policy algorithms [3]. Fig.4 shows our results. PCHID greatly improves the learning efficiency of PPO. Although HER is not designed for on-policy algorithms, our combination of PCHID with PPO-based HER results in the best performance.

Figure 5: (a): Accuracy in GridWorld under different combination strategies. (b): Averaging outputs with different weights. (c): Reward obtained in FetchReach under different strategies.

4.3 Combining PCHID with Other RL Algorithms

As PCHID only requires sufficient exploration in the environment to approximate optimal sub-policies progressively, it can easily be plugged into other RL algorithms, both on-policy and off-policy. In this respect, the PCHID module can be regarded as an extension of HER for off-policy algorithms. We put forward three combination strategies and evaluate each of them in both the GridWorld and FetchReach environments.

Joint Training The first strategy for combining PCHID with a normal RL algorithm is to adopt a shared policy between them. A shared network is trained through both temporal difference learning in RL and self-supervised learning in PCHID.
The PCHID module in joint training can be viewed as a regularizer.

Averaging Outputs Another combination strategy is to train two policy networks separately, with data collected from the same set of trajectories. When the action space is discrete, we simply average the two output vectors of the policy networks, e.g., the Q-value vector and the log-probability vector of PCHID. When the action space is continuous, we average the two predicted action vectors and perform the interpolated action. From this perspective, the RL agent actually learns how to act on top of PCHID, which parallels the key insight of ResNet [32]. If PCHID itself can solve the task perfectly, the RL agent only needs to follow its advice. Otherwise, for complex tasks, PCHID provides a basic proposal for each decision to be made; the RL agent receives hints from those proposals, and thus learning becomes easier.

Intrinsic Reward (IR) This approach is similar to curiosity-driven methods. Instead of using the inverse dynamics to define curiosity, we use the prediction difference between the PCHID module and the RL agent as an intrinsic reward that motivates the RL agent to act like PCHID. Maximizing this intrinsic reward helps the RL agent avoid aimless exploration and hence speeds up learning.

Fig.5 shows our results on GridWorld and FetchReach with the different combination strategies. Joint training performs the best and needs no hyper-parameter tuning. In contrast, averaging outputs requires determining the weights, and the intrinsic reward requires adjusting its scale relative to the external reward.

5 Conclusion

In this work we propose Policy Continuation with Hindsight Inverse Dynamics (PCHID) to solve goal-oriented sparse-reward tasks from a new perspective. Our experiments show that PCHID remarkably improves data efficiency in both discrete and continuous control tasks.
Moreover, our method can be flexibly combined with both on-policy and off-policy RL algorithms.

Acknowledgement: We acknowledge discussions with Yuhang Song and Chuheng Zhang. This work was partially supported by SenseTime Group (CUHK Agreement No.7051699) and CUHK direct fund (No.4055098).

[Figure 5 panels: Combination Methods (Joint Train, Binary IR, Continuous IR, Average Output); Average Outputs with different Weights (HER + 0.1–1.6 PCHID); Reward Obtain (Joint Train, Average Output, Only PCHID)]

References

[1] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.

[2] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.

[3] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

[4] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE, 2018.

[5] David Held, Xinyang Geng, Carlos Florensa, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366, 2017.

[6] Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning.
arXiv preprint arXiv:1806.05635, 2018.

[7] Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical reinforcement learning with hindsight. In International Conference on Learning Representations, 2019.

[8] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In IEEE Conference on Computer Vision & Pattern Recognition Workshops, 2017.

[9] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.

[10] Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In International Conference on Learning Representations, 2017.

[11] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems 30, pages 5048–5058. Curran Associates, Inc., 2017.

[12] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.

[13] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.

[14] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.

[15] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.

[16] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour.
Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.

[17] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[18] Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Juergen Schmidhuber. Hindsight policy gradients. arXiv preprint arXiv:1711.06006, 2017.

[19] Michael I Jordan and David E Rumelhart. Forward models: Supervised learning with a distal teacher. Cognitive science, 16(3):307–354, 1992.

[20] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.

[21] Evan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.

[22] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, and Raia Hadsell. Learning to navigate in complex environments. In International Conference on Learning Representations, 2017.

[23] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1312–1320, Lille, France, 07–09 Jul 2015. PMLR.

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.
Nature, 518(7540):529, 2015.

[25] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[26] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[27] Robert Hecht-Nielsen. Kolmogorov's mapping neural network existence theorem. In Proceedings of the IEEE International Conference on Neural Networks III, pages 11–13. IEEE Press, 1987.

[28] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.

[29] Vera Kuurkova. Kolmogorov's theorem and multilayer neural networks. Neural networks, 5(3):501–506, 1992.

[30] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

[31] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.

[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[33] Larbi Alili, Pierre Patie, and Jesper Lund Pedersen. Representations of the first hitting time density of an Ornstein-Uhlenbeck process. Stochastic Models, 21(4):967–980, 2005.

[34] Marlin U Thomas. Some mean first-passage time approximations for the Ornstein-Uhlenbeck process. Journal of Applied Probability, 12(3):600–604, 1975.

[35] Luigi M Ricciardi and Shunsuke Sato. First-passage-time density and moments of the Ornstein-Uhlenbeck process.
Journal of Applied Probability, 25(1):43–57, 1988.

[36] Ian Blake and William Lindsey. Level-crossing problems for random processes. IEEE Transactions on Information Theory, 19(3):295–315, 1973.