{"title": "Search on the Replay Buffer: Bridging Planning and Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 15246, "page_last": 15257, "abstract": "The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and relative values of states, but fails to plan over long horizons. Despite the successes of each method on various tasks, long horizon, sparse reward tasks with high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only need to provide an example goal state) and avoid injecting bias through reward shaping. We introduce a general-purpose control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our main idea is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a particular subgoal. We use goal-conditioned RL to learn a policy to reach each waypoint and to learn a distance metric for search. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. 
Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over hundreds of steps, and generalizes substantially better than standard RL algorithms.", "full_text": "Search on the Replay Buffer:\n\nBridging Planning and Reinforcement Learning\n\nBenjamin Eysenbach\u03b8\u03c6, Ruslan Salakhutdinov\u03b8, Sergey Levine\u03c6\u03c8\n\n\u03b8CMU, \u03c6Google Brain, \u03c8UC Berkeley\n\nbeysenba@cs.cmu.edu\n\nAbstract\n\nThe history of learning for control has been an exciting back and forth between\ntwo broad classes of algorithms: planning and reinforcement learning. Planning\nalgorithms effectively reason over long horizons, but assume access to a local\npolicy and distance metric over collision-free paths. Reinforcement learning excels\nat learning policies and the relative values of states, but fails to plan over long\nhorizons. Despite the successes of each method in various domains, tasks that\nrequire reasoning over long horizons with limited feedback and high-dimensional\nobservations remain exceedingly challenging for both planning and reinforcement\nlearning algorithms. Frustratingly, these sorts of tasks are potentially the most\nuseful, as they are simple to design (a human only need to provide an example\ngoal state) and avoid reward shaping, which can bias the agent towards \ufb01nd a sub-\noptimal solution. We introduce a general-purpose control algorithm that combines\nthe strengths of planning and reinforcement learning to effectively solve these\ntasks. Our aim is to decompose the task of reaching a distant goal state into\na sequence of easier tasks, each of which corresponds to reaching a particular\nsubgoal. Planning algorithms can automatically \ufb01nd these waypoints, but only if\nprovided with suitable abstractions of the environment \u2013 namely, a graph consisting\nof nodes and edges. 
Our main insight is that this graph can be constructed via\nreinforcement learning, where a goal-conditioned value function provides edge\nweights, and nodes are taken to be previously seen observations in a replay buffer.\nUsing graph search over our replay buffer, we can automatically generate this\nsequence of subgoals, even in image-based environments. Our algorithm, search\non the replay buffer (SoRB), enables agents to solve sparse reward tasks over one\nhundred steps, and generalizes substantially better than standard RL algorithms.1\n\n1\n\nIntroduction\n\nHow can agents learn to solve complex, temporally extended tasks? Classically, planning algorithms\ngive us one tool for learning such tasks. While planning algorithms work well for tasks where\nit is easy to determine distances between states and easy to design a local policy to reach nearby\nstates, both of these requirements become roadblocks when applying planning to high-dimensional\n(e.g., image-based) tasks. Learning algorithms excel at handling high-dimensional observations,\nbut reinforcement learning (RL) \u2013 learning for control \u2013 fails to reason over long horizons to solve\ntemporally extended tasks. In this paper, we propose a method that combines the strengths of planning\nand RL, resulting in an algorithm that can plan over long horizons in tasks with high-dimensional\nobservations.\nRecent work has introduced goal-conditioned RL algorithms (Pong et al., 2018; Schaul et al., 2015)\nthat acquire a single policy for reaching many goals. In practice, goal-conditioned RL succeeds at\n\n1Run our algorithm in your browser: http://bit.ly/rl_search\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Search on the Replay Buffer: (a) Goal-conditioned RL often fails to reach distant goals,\nbut can successfully reach the goal if starting nearby (inside the green region). 
(b) Our goal is to\nuse observations in our replay buffer (yellow squares) as waypoints leading to the goal. (c) We\nautomatically \ufb01nd these waypoints by using the agent\u2019s value function to predict when two states\nare nearby, and building the corresponding graph. (d) We run graph search to \ufb01nd the sequence of\nwaypoints (blue arrows), and then use our goal-conditioned policy to reach each waypoint.\n\nreaching nearby goals but fails to reach distant goals; performance degrades quickly as the number of\nsteps to the goal increases (Levy et al., 2019; Nachum et al., 2018). Moreover, goal-conditioned RL\noften requires large amounts of reward shaping (Chiang et al., 2019) or human demonstrations (Lynch\net al., 2019; Nair et al., 2018), both of which can limit the asymptotic performance of the policy by\ndiscouraging the policy from seeking novel solutions.\nWe propose to solve long-horizon, sparse reward tasks by decomposing the task into a series of easier\ngoal-reaching tasks. We learn a goal-conditioned policy for solving each of the goal-reaching tasks.\nOur main idea is to reduce the problem of \ufb01nding these subgoals to solving a shortest path problem\nover states that we have previous visited, using a distance metric extracted from our goal-conditioned\npolicy. We call this algorithm Search on Replay Buffer (SoRB), and provide a simple illustration of\nthe algorithm in Figure 1.\nOur primary contribution is an algorithm that bridges planning and deep RL for solving long-horizon,\nsparse reward tasks. We develop a practical instantiation of this algorithm using ensembles of\ndistributional value functions, which allows us to robustly learn distances and use them for risk-aware\nplanning. Empirically, we \ufb01nd that our method generates effective plans to solve long horizon\nnavigation tasks, even in image-based domains, without a map and without odometry. 
Comparisons\nwith state-of-the-art RL methods show that SoRB is substantially more successful in reaching distant\ngoals. We also observe that the learned policy generalizes well to navigate in unseen environments.\nIn summary, graph search over previously visited states is a simple tool for boosting the performance\nof a goal-conditioned RL algorithm.\n\n2 Bridging Planning and Reinforcement Learning\n\nPlanning algorithms must be able to (1) sample valid states, (2) estimate the distance between\nreachable pairs of states, and (3) use a local policy to navigate between nearby states. These\nrequirements are dif\ufb01cult to satisfy in complex tasks with high dimensional observations, such as\nimages. For example, consider a robot arm stacking blocks using image observations. Sampling states\nrequires generating photo-realistic images, and estimating distances and choosing actions requires\nreasoning about dozens of interactions between blocks. Our method will obtain distance estimates\nand a local policy using a RL algorithm. To sample states, we will simply use a replay buffer of\npreviously visited states as a non-parametric generative model.\n\n2.1 Building Block: Goal-Conditioned RL\n\nA key building block of our method is a goal-conditioned policy and its associated value function.\nWe consider a goal-reaching agent interacting with an environment. The agent observes its current\nstate s \u2208 S and a goal state sg \u2208 S. The initial state for each episode is sampled s1 \u223c \u03c1(s), and\ndynamics are governed by the distribution p(st+1 | st, at). At every step, the agent samples an action\na \u223c \u03c0(a | s, sg) and receives a corresponding reward r(s, a, sg) that indicates whether the agent\nhas reached the goal. The episode terminates as soon as the agent reaches the goal, or after T steps,\nwhichever occurs \ufb01rst. The agent\u2019s task is to maximize its cumulative, undiscounted, reward. 
We use an off-policy algorithm to learn such a policy, as well as its associated goal-conditioned Q-function and value function:\n\nQ(s, a, sg) = E_{s1 \u223c \u03c1(s), at \u223c \u03c0(at | st, sg), st+1 \u223c p(st+1 | st, at)} [ \u2211_{t=1}^{T} r(st, at, sg) ],    V(s, sg) = max_a Q(s, a, sg)\n\nWe obtain a policy by acting greedily w.r.t. the Q-function: \u03c0(a | s, sg) = arg max_a Q(s, a, sg). We choose an off-policy RL algorithm with goal relabelling (Andrychowicz et al., 2017; Kaelbling, 1993b) and distributional RL (Bellemare et al., 2017) not only for improved data efficiency, but also to obtain good distance estimates (see Section 2.2). We will use DQN (Mnih et al., 2013) for discrete action environments and DDPG (Lillicrap et al., 2015) for continuous action environments. Both algorithms operate by minimizing the Bellman error over transitions sampled from a replay buffer B.\n\n2.2 Distances from Goal-Conditioned Reinforcement Learning\n\nTo ultimately perform planning, we need to compute the shortest path distance between pairs of states. Following Kaelbling (1993b), we define a reward function that returns \u22121 at every step: r(s, a, sg) := \u22121. The episode ends when the agent is sufficiently close to the goal, as determined by a state-identity oracle. Using this reward function and termination condition, there is a close connection between the Q values and shortest paths. We define dsp(s, sg) to be the shortest path distance from state s to state sg. That is, dsp(s, sg) is the expected number of steps to reach sg from s under the optimal policy. The value of state s with respect to goal sg is simply the negative shortest path distance: V(s, sg) = \u2212dsp(s, sg). We likewise define dsp(s, a, sg) as the shortest path distance, conditioned on initially taking action a. Then Q values also equal a negative shortest path distance: Q(s, a, sg) = \u2212dsp(s, a, sg). 
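To make this connection concrete, here is a minimal tabular sketch (our own toy setup, not the paper's deep Q-learning: a small deterministic gridworld solved by value iteration) showing that the r := \u22121 reward with termination at the goal yields V(s, sg) = \u2212dsp(s, sg):

```python
def value_iteration_distances(width, height, walls, goal, n_iters=200):
    """Tabular goal-conditioned value iteration with r(s, a, goal) = -1.

    Because the episode terminates at the goal, the fixed point satisfies
    V(s) = -1 + max_{s'} V(s'), i.e. V(s) = -(shortest-path distance to goal).
    """
    states = {(x, y) for x in range(width) for y in range(height)
              if (x, y) not in walls}
    actions = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def step(s, a):  # deterministic dynamics; bumping into a wall is a no-op
        nxt = (s[0] + a[0], s[1] + a[1])
        return nxt if nxt in states else s

    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        for s in states:
            # the terminal goal state keeps value 0; others back up -1 per step
            V[s] = 0.0 if s == goal else max(-1.0 + V[step(s, a)]
                                             for a in actions)
    return V

# A 5x5 grid with a short wall forces an 8-step detour from (0, 2) to (4, 2).
V = value_iteration_distances(5, 5, walls={(2, 1), (2, 2), (2, 3)}, goal=(4, 2))
print(-V[(0, 2)])  # -> 8.0, the shortest-path distance around the wall
```

Replacing the table with a function approximator is exactly where the estimation errors discussed in Section 3 come from.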
Thus, goal-conditioned RL on a suitable reward function\nyields a Q-function that allows us to estimate shortest-path distances.\n\n2.3 The Replay Buffer as a Graph\n\nWe build a weighted, directed graph directly on top of states in our replay buffer, so each node\ncorresponds to an observation (e.g., an image). We add edges between nodes with weight (i.e., length)\nequal to their predicted distance, using d\u03c0(s1, s2) as our estimate of the distance using our current\nQ-function. While, in theory, going directly to the goal is always a shortest path, in practice the\ngoal-conditioned policy will fail to reach distant goals directly (See Fig. 6.). We will therefore ignore\nedges that are longer than MAXDIST, a hyperparameter:\n\nG (cid:44) (V,E,W)\n\nwhere V = B,\n\nE = B \u00d7 B = {es1\u2192s2 | s1, s2 \u2208 B}\n\n(cid:26)d\u03c0(s1, s2)\n\n\u221e\n\nW(es1\u2192s2) =\n\nif d\u03c0(s1, s2) < MAXDIST\notherwise\n\nGiven a start and goal state, we temporarily add each to the graph. We add directed edges from the\nstart state to every other state, and from every other state to the goal state, using the same criteria as\nabove. We use Dijkstra\u2019s Algorithm to \ufb01nd the shortest path. See Appendix A for details.\n\n2.4 Algorithm Summary\n\nAfter learning a goal-conditioned Q-function,\nwe perform graph search to \ufb01nd a set of way-\npoints and use the goal-conditioned policy to\nreach each. We view the combination of graph\nsearch and the underlying goal-conditioned pol-\nicy as a new SEARCHPOLICY, shown in Algo-\nrithm 1. The algorithm starts by using graph\nsearch to obtain the shortest path sw1 , sw2,\u00b7\u00b7\u00b7\nfrom the current state s to the goal state sg, plan-\nning over the states in our replay buffer B. We\nthen estimate the distance from the current state\nto the \ufb01rst waypoint, as well as the distance\nfrom the current state to the goal. In most cases,\nwe then condition the policy on the \ufb01rst way-\npoint, sw1. 
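The graph construction and search of Section 2.3 can be sketched in a few lines. This is a sketch under stand-in assumptions, not the paper's implementation: integer "observations", d(s1, s2) = |s1 - s2| in place of the learned d\u03c0, and an arbitrary MAXDIST value.

```python
import heapq

MAXDIST = 3.0  # hyperparameter: ignore edges the policy likely cannot traverse

def shortest_path(start, goal, buffer, d):
    """Dijkstra's algorithm over the replay buffer (Section 2.3).

    Nodes are buffered observations plus the temporarily added start and
    goal; an edge s1 -> s2 exists only if d(s1, s2) < MAXDIST.
    """
    nodes = [start, goal] + list(buffer)
    dist = {n: float("inf") for n in nodes}
    prev = {}
    dist[start] = 0.0
    pq = [(0.0, start)]
    while pq:
        cost, u = heapq.heappop(pq)
        if cost > dist[u]:
            continue  # stale queue entry
        if u == goal:
            break
        for v in nodes:
            if v == u:
                continue
            w = d(u, v)
            if w >= MAXDIST:
                continue  # too far for the goal-conditioned policy
            if cost + w < dist[v]:
                dist[v] = cost + w
                prev[v] = u
                heapq.heappush(pq, (dist[v], v))
    if dist[goal] == float("inf"):
        return None  # no path: fall back to plain goal-conditioned RL
    path, n = [], goal
    while n != start:
        path.append(n)
        n = prev[n]
    return path[::-1]  # waypoints, ending at the goal

# Toy 1-D example: states are integers, distance is absolute difference.
buffer = list(range(1, 10))
waypoints = shortest_path(0, 10, buffer, d=lambda a, b: abs(a - b))
assert all(abs(b - a) < MAXDIST for a, b in zip([0] + waypoints, waypoints))
```

Because every edge admitted into the graph is shorter than MAXDIST, each hop of the returned plan is a goal the policy can plausibly reach, even when start and goal are far apart.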
However, if the goal state is closer than the next waypoint and the goal state is not too far away, then we directly condition the policy on the final goal. If the replay buffer is empty or there is not a path in G to the goal, then Algorithm 1 resorts to standard goal-conditioned RL.\n\nAlgorithm 1 Inputs are the current state s, the goal state sg, a buffer of observations B, the learned policy \u03c0 and its value function V. Returns an action a.\n\nfunction SEARCHPOLICY(s, sg, B, V, \u03c0)\n    sw1, \u00b7\u00b7\u00b7 \u2190 SHORTESTPATH(s, sg, B, V)\n    ds\u2192w1 \u2190 \u2212V(s, sw1)\n    ds\u2192g \u2190 \u2212V(s, sg)\n    if ds\u2192w1 < ds\u2192g or ds\u2192g > MAXDIST:\n        a \u2190 \u03c0(a | s, sw1)\n    else:\n        a \u2190 \u03c0(a | s, sg)\n    return a\n\n3 Better Distance Estimates\n\nThe success of our SEARCHPOLICY depends heavily on the accuracy of our distance estimates. This section proposes two techniques to learn better distances with RL.\n\n3.1 Better Distances via Distributional Reinforcement Learning\n\nOff-the-shelf Q-learning algorithms such as DQN (Mnih et al., 2013) or DDPG (Lillicrap et al., 2015) will fail to learn accurate distance estimates using the \u22121 reward function. The true value for a state and goal that are unreachable is \u2212\u221e, which cannot be represented by a standard, feed-forward Q-network. Simply clipping the Q-value estimates to be within some range avoids the problem of ill-defined Q-values, but empirically we found it challenging to train clipped Q-networks. We adopt distributional Q-learning (Bellemare et al., 2017), noting that it has a convenient form when used with the \u22121 reward function. Distributional RL discretizes the possible value estimates into a set of bins B = (B1, B2, \u00b7\u00b7\u00b7, BN). For learning distances, bins correspond to distances, so Bi indicates the event that the current state and goal are i steps away from one another. 
Our Q-function predicts a distribution Q(st, sg, at) \u2208 P^N over these bins, where Q(st, sg, at)_i is the predicted probability that states st and sg are i steps away from one another. To avoid ill-defined Q-values, the final bin, BN, is a catch-all for predicted distances of at least N. Importantly, this gives us a well-defined method to represent large and infinite distances. Under this formulation, the targets Q\u2217 \u2208 P^N for our Q-values have a simple form:\n\nQ\u2217 = (1, 0, \u00b7\u00b7\u00b7, 0) if st = sg,    Q\u2217 = (0, Q1, \u00b7\u00b7\u00b7, QN\u22122, QN\u22121 + QN) if st \u2260 sg\n\nFigure 2: The Bellman update for distributional RL is simple when learning distances: the target distribution is a shifted copy of the predicted one, so the Q-values shift by one bin at every step until the agent reaches the goal.\n\nAs illustrated in Figure 2, if the state and goal are equivalent, then the target places all probability mass in the first bin. Otherwise, the targets are a right-shift of the predictions at the next state. To ensure the target values sum to one, the mass in bin N of the targets is the sum of bins N \u2212 1 and N from the predicted values. Following Bellemare et al. (2017), we update our Q-function by minimizing the KL divergence between our predictions Q\u03b8 and the target Q\u2217:\n\nmin_\u03b8 DKL(Q\u2217 \u2016 Q\u03b8)    (1)\n\n3.2 Robust Distances via Ensembles of Value Functions\n\nSince we ultimately want to use estimated distances to perform search, it is crucial that we have accurate distance estimates. It is challenging to robustly estimate the distance between all |B|^2 pairs of states in our buffer B, some of which may not have occurred during training. If we fail and spuriously predict that a pair of distant states are nearby, graph search will exploit this \u201cwormhole\u201d and yield a path which assumes that the agent can \u201cteleport\u201d from one distant state to another. 
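Both the shifted target above and the ensemble aggregation used for graph search are easy to state with plain Python lists standing in for network outputs. This is a sketch, not the paper's implementation; treating the catch-all bin N as distance exactly N is our simplification.

```python
def shifted_target(next_q, at_goal):
    """Distributional target for the -1 reward (the target Q* in Eq. 1).

    next_q: predicted bin probabilities at the next state. If the current
    state is the goal, all mass goes to the first bin; otherwise the target
    is a right-shift, with the overflow folded into the catch-all last bin.
    """
    n = len(next_q)
    if at_goal:
        return [1.0] + [0.0] * (n - 1)
    return [0.0] + next_q[:n - 2] + [next_q[n - 2] + next_q[n - 1]]

def expected_distance(q):
    """Scalar distance from bin probabilities (bin i means "i steps away");
    the catch-all bin N is simply treated as distance N here."""
    return sum(i * p for i, p in enumerate(q, start=1))

def ensemble_distance(qs):
    """Pessimistic aggregation over an ensemble: take the largest predicted
    distance, so one spurious "wormhole" prediction cannot add a shortcut
    edge to the graph."""
    return max(expected_distance(q) for q in qs)

target = shifted_target([0.0, 0.5, 0.3, 0.2], at_goal=False)
# target is [0.0, 0.0, 0.5, 0.5]: a one-bin right-shift, overflow into bin N
assert abs(sum(target) - 1.0) < 1e-9
assert shifted_target([0.0, 0.5, 0.3, 0.2], at_goal=True) == [1.0, 0.0, 0.0, 0.0]
```

The paper reports little difference between taking the maximum and the mean over the ensemble; the max is shown here because it is the more conservative choice when admitting edges.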
We\nseek to use a bootstrap (Bickel et al., 1981) as a principled way to estimate uncertainty for our\nQ-values. Following prior work (Lakshminarayanan et al., 2017; Osband et al., 2016), we implement\nan approximation to the bootstrap. We train an ensemble of Q-networks, each with independent\nweights, but trained on the same data using the same loss (Eq. 1). When performing graph search, we\naggregate predictions from each Q-network in our ensemble. Empirically, we found that ensembles\nwere crucial for getting graph search to work on image-based tasks, but we observed little difference\nin whether we took the maximum predicted distance or the average predicted distance.\n\n4\n\n\f4 Related Work\nPlanning Algorithms: Planning algorithms (Choset et al., 2005; LaValle, 2006) ef\ufb01ciently solve long-\nhorizon tasks, including those that stymie RL algorithms (see, e.g., Kavraki et al. (1996); Lau and\nKuffner (2005); Levine et al. (2011)). However, these techniques assume that we can (1) ef\ufb01ciently\nsample valid states, (2) estimate the distance between two states, and (3) acquire a local policy for\nreaching nearby states, all of which make it challenging to apply these techniques to high-dimensional\ntasks (e.g., with image observations). Our method removes these assumptions by (1) sampling states\nfrom the replay buffer and (2,3) learning the distance metric and policy with RL. Some prior works\nhave also combined planning algorithms with RL (Chiang et al., 2019; Faust et al., 2018; Savinov\net al., 2018a), \ufb01nding that the combination yields agents adept at reaching distant goals. Perhaps the\nmost similar work is Semi-Parametric Topological Memory (Savinov et al., 2018a), which also uses\ngraph search to \ufb01nd waypoints for a learned policy. 
We compare to SPTM in Section 5.3.\nGoal-Conditioned RL: Goal-conditioned policies (Kaelbling, 1993b; Pong et al., 2018; Schaul et al.,\n2015) take as input the current state and a goal state, and predict a sequence of actions to arrive at\nthe goal. Our algorithm learns a goal-conditioned policy to reach waypoints along the planned path.\nRecent algorithms (Andrychowicz et al., 2017; Pong et al., 2018) combine off-policy RL algorithms\nwith goal-relabelling to improve the sample complexity and robustness of goal-conditioned policies.\nSimilar algorithms have been proposed for visual navigation (Anderson et al., 2018; Gupta et al.,\n2017; Mirowski et al., 2016; Zhang et al., 2018; Zhu et al., 2017). A common theme in recent work\nis learning distance metrics to accelerate RL. While most methods (Florensa et al., 2019; Savinov\net al., 2018b; Wu et al., 2018) simply perform RL on top of the learned representation, our method\nexplicitly performs search using the learned metric.\nHierarchical RL: Hierarchical RL algorithms automatically learn a set of primitive skills to help an\nagent learn complex tasks. One class of methods (Bacon et al., 2017; Frans et al., 2017; Kaelbling,\n1993a; Kulkarni et al., 2016; Nachum et al., 2018; Parr and Russell, 1998; Precup, 2000; Sutton\net al., 1999; Vezhnevets et al., 2017) jointly learn a low-level policy for performing each of the skills\ntogether with a high-level policy for sequencing these skills to complete a desired task. Another class\nof algorithms (Drummond, 2002; Fox et al., 2017; S\u00b8ims\u00b8ek et al., 2005) focus solely on automatically\ndiscovering these skills or subgoals. SoRB learns primitive skills that correspond to goal-reaching\ntasks, similar to Nachum et al. (2018). While jointly learning high-level and low-level policies can be\nunstable (see discussion in Nachum et al. 
(2018)), we sidestep the problem by using graph search as\na \ufb01xed, high-level policy.\nModel Based RL: RL methods are typically di-\nvided into model-free (Schulman et al., 2015a,b,\n2017; Williams, 1992) and model-based (Lill-\nicrap et al., 2015; Watkins and Dayan, 1992)\napproaches. Model-based approaches all per-\nform some degree of planning, from predicting\nthe value of some state (Mnih et al., 2013; Sil-\nver et al., 2016), obtaining representations by\nunrolling a learned dynamics model (Racani`ere\net al., 2017), or learning a policy directly on a\nlearned dynamics model (Agrawal et al., 2016;\nChua et al., 2018; Finn and Levine, 2017; Kurutach et al., 2018; Nagabandi et al., 2018; Oh et al.,\n2015; Sutton, 1990). One line of work (Amos et al., 2018; Lee et al., 2018; Srinivas et al., 2018; Tamar\net al., 2016) embeds a differentiable planner inside a policy, with the planner learned end-to-end\nwith the rest of the policy. Other work (Lenz et al., 2015; Watter et al., 2015) explicitly learns a\nrepresentation for use inside a standard planning algorithm. In contrast, SoRB learns to predict the\ndistances between states, which can be viewed as a high-level inverse model. 
SoRB predicts a scalar (the distance) rather than actions or observations, making the prediction problem substantially easier. By planning over previously visited states, SoRB does not have to cope with infeasible states that can be predicted by forward models in state-space and latent-space.\n\nFigure 3: Four classes of model-based RL methods. Dimensions in the last column correspond to typical robotics tasks with image/lidar observations.\n\nmodel          real states   multi-step   prediction dimension\nstate-space    yes           yes          1000s+\nlatent-space   no            yes          10s\ninverse        yes           no           10s\nSoRB           yes           yes          1\n\n5 Experiments\n\nWe compare SoRB to prior methods on two tasks: a simple 2D environment, and then a visual navigation task, where our method will plan over images. Ablation experiments will illustrate that accurate distance estimates are crucial to our algorithm's success.\n\nFigure 4: Simple 2D Navigation: (Left) Two simple navigation environments. (Center) An agent that combines a goal-conditioned policy with search is substantially more successful at reaching distant goals in these environments than using the goal-conditioned policy alone. (Right) A standard goal-conditioned policy (top) fails to reach distant goals. Applying graph search on top of that same policy (bottom) yields a sequence of intermediate waypoints (yellow squares) that enable the agent to successfully reach distant goals.\n\n5.1 Didactic Example: 2D Navigation\n\nWe start by building intuition for our method by applying it to two simple 2D navigation tasks, shown in Figure 4a. The start and goal state are chosen randomly in free space, and reaching the goal often takes over 100 steps, even for the optimal policy. We used goal-conditioned RL to learn a policy for each environment, and then evaluated this policy on randomly sampled (start, goal) pairs of varying difficulty. 
To implement SoRB, we used exactly the same policy, both to perform graph search and\nthen to reach each of the planned waypoints. In Figure 4b, we observe that the goal-conditioned\npolicy can reach nearby goals, but fails to generalize to distant goals. In contrast, SoRB successfully\nreaches goals over 100 steps away, with little drop in success rate. Figure 4c compares rollouts from\nthe goal-conditioned policy and our policy. Note that our policy takes actions that temporarily lead\naway from the goal so the agent can maneuver through a hallway to eventually reach the goal.\n\n5.2 Planning over Images for Visual Navigation\n\nWe now examine how our method scales to\nhigh-dimensional observations in a visual nav-\nigation task, illustrated in Figure 5. We use 3D\nhouses from the SUNCG dataset (Song et al.,\n2017), similar to the task described by Shah\net al. (2018). The agent receives either RGB\nor depth images and takes actions to move\nNorth/South/East/West. Following Shah et al.\n(2018), we stitch four images into a panorama,\nso the resulting observation has dimension 4 \u00d7\n24\u00d7 32\u00d7 C, where C is the number of channels\n(3 for RGB, 1 for Depth). At the start of each\nepisode, we randomly sample an initial state and\ngoal state. We found that sampling nearby goals\n(within 4 steps) more often (80% of the time)\nimproved the performance of goal-conditioned\nRL. We use the same goal sampling distribution for all methods. The agent observes both the current\nimage and the goal image, and should take actions that lead to the goal state. The episode terminates\nonce the agent is within 1 meter of the goal. We also terminate if the agent has failed to reach the\ngoal after 20 time steps, but treat the two types of termination differently when computing the TD\nerror (see Pardo et al. (2017)). 
Note that it is challenging to specify a meaningful distance metric and\nlocal policy on pixel inputs, so it is dif\ufb01cult to apply standard planning algorithms to this task.\nOn this task, we evaluate four state-of-the-art prior methods: hindsight experience replay\n(HER) (Andrychowicz et al., 2017), distributional RL (C51) (Bellemare et al., 2017), semi-parametric\ntopological memory (SPTM) (Savinov et al., 2018a), and value iteration networks (VIN) (Tamar et al.,\n2016). SoRB uses C51 as its underlying goal-conditioned policy. For VIN, we tuned the number of\niterations as well as the number of hidden units in the recurrent layer. For SPTM, we performed a\ngrid search over the threshold for adding edges, the threshold for choosing the next waypoint along\n\nFigure 5: Visual Navigation: Given an initial state\nand goal state, our method automatically \ufb01nds a\nsequence of intermediate waypoints. The agent\nthen follows those waypoints to reach the goal.\n\n6\n\n\fFigure 6: Visual Navigation: We compare our method (SoRB) to prior work on the visual navigation\nenvironment (Fig. 5), using RGB images (Left) and depth images (Right) . We \ufb01nd that only our\nmethod succeeds in reaching distant goals. Baselines: SPTM (Savinov et al., 2018a), C51 (Bellemare\net al., 2017), VIN (Tamar et al., 2016), HER (Andrychowicz et al., 2017).\n\nthe shortest path, and the parameters for sampling the training data. In total, we performed over 1000\nexperiments to tune baselines, more than an order of magnitude more than we used for tuning our\nown method. See Appendix F for details.\nWe evaluated each method on goals ranging from 2 to 20 steps from the start. For each distance, we\nrandomly sampled 30 (start, goal) pairs, and recorded the average success rate, de\ufb01ned as reaching\nwithin 1 meter of the goal within 100 steps. 
We then repeated each experiment for 5 random seeds.\nIn Figure 6, we plot each random seed as a transparent line; the solid line corresponds to the average\nacross the 5 random seeds. While all prior methods degrade quickly as the distance to the goal\nincreases, our method continues to succeed in reaching goals with probability around 90%. SPTM,\nthe only prior method that also employs search, performs second best, but substantially worse than\nour method.\n\n5.3 Comparison with Semi-Parametric Topological Memory\n\n(a) Goal-Conditioned Policy\n\n(b) Distance Predictions\n\nFigure 7: SoRB vs SPTM: Our method and Semi-Parametric Topological Memory (Savinov et al.,\n2018b) differ in the policy used and how distances are estimated. We \ufb01nd (Left) that both methods\nlearn comparable policies, but (Right) our method learns more accurate distances. See text for details.\n\nTo understand why SoRB succeeds at reaching distant goals more frequently than SPTM, we examine\nthe two key differences between the methods: (1) the goal-conditioned policy used to reach nearby\ngoals and (2) the distance metric used to construct the graph. While SoRB acquires a goal-conditioned\npolicy via goal-conditioned RL, SPTM obtains a policy by learning an inverse model with supervised\nlearning. First, we compared the performance of the RL policy (used in SoRB) with the inverse model\npolicy (used in SPTM). In Figure 7a, the solid colored lines show that, without search, the policy\nused by SPTM is more successful than the RL policy, but performance of both policies degrades as\nthe distance to the goal increases. We also evaluate a variant of our method that uses the policy from\nSPTM to reach each waypoint, and \ufb01nd (dashed-lines) no difference in performance, likely because\nthe policies are equally good at reaching nearby goals (within MAXDIST steps). 
We conclude that the difference in goal-conditioned policies cannot explain the difference in success rate.\nThe other key difference between SoRB and SPTM is their learned distance metrics. When using distances for graph search, it is critical for the predicted distance between two states to reflect whether the policy can successfully navigate between those states: the model should be more successful at reaching goals which it predicts are nearby. We can naturally measure this alignment using the area under a precision-recall curve. Note that while SoRB predicts distances in the range [0, T], SPTM predicts whether two states are reachable, so its predictions will be in the range [0, 1]. Nonetheless, precision-recall curves2 only depend on the ordering of the predictions, not their absolute values. Figure 7b shows that the distances predicted by SoRB more accurately reflect whether the policy will reach the goal, as compared with SPTM. The average AUC across five random seeds is 22% higher for SoRB than for SPTM. In retrospect, this finding is not surprising: while SPTM employs a learned, inverse model policy, it learns distances w.r.t. a random policy.\n\n5.4 Better Distance Estimates\n\nFigure 8: Better Distance Estimates: (Left) Without distributional RL, our method performs poorly. (Right) Ensembles contribute to a moderate increase in success rate, especially for distant goals.\n\nWe now examine the ingredients in SoRB that contribute to its accurate distance estimates: distributional RL and ensembles of value functions. In a first experiment, we evaluated a variant of SoRB trained without distributional RL. As shown in Figure 8a, this variant performed worse than the random policy, clearly illustrating that distributional RL is a key component of SoRB. The second experiment studied the effect of using ensembles of value functions. 
Recalling that we introduced\nensembles to avoid erroneous distance predictions for distant pairs of states, we expect that ensembles\nwill contribute most towards success at reaching distant goals. Figure 8b con\ufb01rms this prediction,\nillustrating that ensembles provide a 10 \u2013 20% increase in success at reaching goals that are at least\n10 steps away. We run additional ablation analysis in Appendix C.\n\n5.5 Generalizing to New Houses\n\nFigure 9: Does SoRB Generalize? After training on 100 SUNCG houses, we collect random data in\nheld-out houses to use for search in those new environments. Whether using depth images or RGB\nimages, SoRB generalizes well to new houses, reaching almost 80% of goals 10 steps away, while\ngoal-conditioned RL reaches less than 20% of these goals. Transparent lines correspond to average\nsuccess rate across 22 held-out houses for each of three random seeds.\n\nWe now study whether our method generalizes to new visual navigation environments. We train on\n100 SUNCG houses, randomly sampling one per episode. We evaluated on a held-out test set of\n22 SUNCG houses. In each house, we collect 1000 random observations and \ufb01ll our replay buffer\nwith those observations to perform search. We use the same goal-conditioned policy and associated\ndistance function that we learned during training. As before, we measure the fraction of goals reached\nas we increase the distance to the goal. In Figure 9, we observe that SoRB reaches almost 80% of\n\n2We negate the distance prediction from SoRB before computing the precision recall curve because small\n\ndistances indicate that the policy should be more successful.\n\n8\n\n\fgoals that are 10 steps away, about four times more than reached by the goal-conditioned RL agent.\nOur method succeeds in reaching 40% of goals 20 steps away, while goal-conditioned RL has a\nsuccess rate near 0%. We repeated the experiment for three random seeds, retraining the policy from\nscratch each time. 
Note that there is no discernible difference between the three random seeds, plotted as transparent lines, indicating the robustness of our method to random initialization.

6 Discussion and Future Work

We presented SoRB, a method that combines planning via graph search with goal-conditioned RL. By exploiting the structure of goal-reaching tasks, we can obtain policies that generalize substantially better than those learned directly from RL. In our experiments, we show that SoRB can solve temporally extended navigation problems, traverse environments with image observations, and generalize to new houses in the SUNCG dataset. Broadly, we expect SoRB to outperform existing RL approaches on long-horizon tasks, especially those with high-dimensional inputs. Our method relies heavily on goal-conditioned RL, and we expect advances in this area to make our method applicable to even more difficult tasks. While we used a stage-wise procedure, first learning the goal-conditioned policy and then applying graph search, in future work we aim to explore how graph search can improve the goal-conditioned policy itself, perhaps via policy distillation or by obtaining better Q-value estimates. In addition, while the planning algorithm we use is simple (namely, Dijkstra's algorithm), we believe that the key idea of using distance estimates obtained from RL algorithms for planning will open doors to incorporating more sophisticated planning techniques into RL.

Acknowledgements: We thank Vitchyr Pong, Xingyu Lin, and Shane Gu for helpful discussions on learning goal-conditioned value functions, Aleksandra Faust and Brian Okorn for feedback on connections to planning, and Nikolay Savinov for feedback on the SPTM baseline. RS is supported by NSF grant IIS1763562, ONR grant N000141812861, AFRL CogDeCON, and Apple.
Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of NSF, AFRL, ONR, or Apple.

References

Agrawal, P., Nair, A. V., Abbeel, P., Malik, J., and Levine, S. (2016). Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082.

Amos, B., Jimenez, I., Sacks, J., Boots, B., and Kolter, J. Z. (2018). Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, pages 8289–8300.

Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and van den Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058.

Bacon, P.-L., Harb, J., and Precup, D. (2017). The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.

Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 449–458. JMLR.org.

Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196–1217.

Chiang, H.-T. L., Faust, A., Fiser, M., and Francis, A. (2019). Learning navigation behaviors end-to-end with AutoRL. IEEE Robotics and Automation Letters, 4(2):2007–2014.

Choset, H. M., Hutchinson, S., Lynch, K. M., Kantor, G., Burgard, W., Kavraki, L. E., and Thrun, S. (2005). Principles of robot motion: Theory, algorithms, and implementation. MIT Press.

Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4759–4770.

Drummond, C. (2002). Accelerating reinforcement learning by composing solutions of automatically identified subtasks. Journal of Artificial Intelligence Research, 16:59–104.

Faust, A., Ramirez, O., Fiser, M., Oslund, K., Francis, A., Davidson, J., and Tapia, L. (2018). PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 5113–5120, Brisbane, Australia.

Finn, C. and Levine, S. (2017). Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE.

Florensa, C., Degrave, J., Heess, N., Springenberg, J. T., and Riedmiller, M. (2019). Self-supervised learning of image embedding for continuous control. arXiv preprint arXiv:1901.00943.

Fox, R., Krishnan, S., Stoica, I., and Goldberg, K. (2017). Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294.

Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. (2017). Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767.

Gupta, S., Davidson, J., Levine, S., Sukthankar, R., and Malik, J. (2017). Cognitive mapping and planning for visual navigation. arXiv preprint arXiv:1702.03920.

Hadar, J. and Russell, W. R. (1969). Rules for ordering uncertain prospects. The American Economic Review, 59(1):25–34.

Kaelbling, L. P. (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, volume 951, pages 167–173.

Kaelbling, L. P. (1993b). Learning to achieve goals. In IJCAI, pages 1094–1099.

Kavraki, L., Svestka, P., and Overmars, M. H. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580.

Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. (2018). Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413.

Lau, M. and Kuffner, J. J. (2005). Behavior planning for character animation. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 271–280. ACM.

LaValle, S. M. (2006). Planning algorithms. Cambridge University Press.

Lee, L., Parisotto, E., Chaplot, D. S., Xing, E., and Salakhutdinov, R. (2018). Gated path planning networks. arXiv preprint arXiv:1806.06408.

Lenz, I., Knepper, R. A., and Saxena, A. (2015). DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems. Rome, Italy.

Levine, S., Lee, Y., Koltun, V., and Popović, Z. (2011). Space-time planning with parameterized locomotion controllers. ACM Transactions on Graphics (TOG), 30(3):23.

Levy, A., Platt, R., and Saenko, K. (2019). Hierarchical reinforcement learning with hindsight. In International Conference on Learning Representations.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. (2019). Learning latent plans from play. arXiv preprint arXiv:1903.01973.

Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al. (2016). Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Nachum, O., Gu, S. S., Lee, H., and Levine, S. (2018). Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3307–3317.

Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.

Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2018). Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE.

Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034.

Pardo, F., Tavakoli, A., Levdik, V., and Kormushev, P. (2017). Time limits in reinforcement learning. arXiv preprint arXiv:1712.00378.

Parr, R. and Russell, S. J. (1998). Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, pages 1043–1049.

Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081.

Precup, D. (2000). Temporal abstraction in reinforcement learning. University of Massachusetts Amherst.

Racanière, S., Weber, T., Reichert, D., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. (2017). Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 5690–5701.

Savinov, N., Dosovitskiy, A., and Koltun, V. (2018a). Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653.

Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. (2018b). Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015b). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Shah, P., Fiser, M., Faust, A., Kew, J. C., and Hakkani-Tur, D. (2018). FollowNet: Robot navigation by following natural language directions with deep reinforcement learning. arXiv preprint arXiv:1805.06150.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484.

Şimşek, Ö., Wolfe, A. P., and Barto, A. G. (2005). Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the 22nd International Conference on Machine Learning, pages 816–823. ACM.

Song, S., Yu, F., Zeng, A., Chang, A. X., Savva, M., and Funkhouser, T. (2017). Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1746–1754.

Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. (2018). Universal planning networks. arXiv preprint arXiv:1804.00645.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier.

Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162.

Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3540–3549. JMLR.org.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.

Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. (2015). Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Wu, Y., Tucker, G., and Nachum, O. (2018). The Laplacian in RL: Learning representations with efficient approximations. arXiv preprint arXiv:1810.04586.

Zhang, A., Lerer, A., Sukhbaatar, S., Fergus, R., and Szlam, A. (2018). Composable planning with attributes. arXiv preprint arXiv:1803.00512.

Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., and Farhadi, A. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3357–3364. IEEE.