{"title": "Visual Reinforcement Learning with Imagined Goals", "book": "Advances in Neural Information Processing Systems", "page_first": 9191, "page_last": 9200, "abstract": "For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised \"practice\" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals in a real-world physical system, and substantially outperforms prior techniques.", "full_text": "Visual Reinforcement Learning with Imagined Goals\n\nAshvin Nair\u2217, Vitchyr Pong\u2217, Murtaza Dalal, Shikhar Bahl, Steven Lin, Sergey Levine\n\nUniversity of California, Berkeley\n\n{anair17,vitchyr,mdalal,shikharbahl,stevenlin598,svlevine}@berkeley.edu\n\nAbstract\n\nFor an autonomous agent to ful\ufb01ll a wide range of user-speci\ufb01ed goals at test time,\nit must be able to learn broadly applicable and general-purpose skill repertoires.\nFurthermore, to provide the requisite level of generality, these skills must handle\nraw sensory input such as images. 
In this paper, we propose an algorithm that\nacquires such general-purpose skills by combining unsupervised representation\nlearning and reinforcement learning of goal-conditioned policies. Since the partic-\nular goals that might be required at test-time are not known in advance, the agent\nperforms a self-supervised \u201cpractice\u201d phase where it imagines goals and attempts\nto achieve them. We learn a visual representation with three distinct purposes: sam-\npling goals for self-supervised practice, providing a structured transformation of\nraw sensory inputs, and computing a reward signal for goal reaching. We also pro-\npose a retroactive goal relabeling scheme to further improve the sample-ef\ufb01ciency\nof our method. Our off-policy algorithm is ef\ufb01cient enough to learn policies that\noperate on raw image observations and goals for a real-world robotic system, and\nsubstantially outperforms prior techniques.\n\n1\n\nIntroduction\n\nReinforcement learning (RL) algorithms hold the promise of allowing autonomous agents, such as\nrobots, to learn to accomplish arbitrary tasks. However, the standard RL framework involves learning\npolicies that are speci\ufb01c to individual tasks, which are de\ufb01ned by hand-speci\ufb01ed reward functions.\nAgents that exist persistently in the world can prepare to solve diverse tasks by setting their own\ngoals, practicing complex behaviors, and learning about the world around them. In fact, humans\nare very pro\ufb01cient at setting abstract goals for themselves, and evidence shows that this behavior is\nalready present from early infancy [43], albeit with simple goals such as reaching. The behavior and\nrepresentation of goals grow more complex over time as humans learn how to manipulate objects and\nlocomote. 
How can we begin to devise a reinforcement learning system that sets its own goals and\nlearns from experience with minimal outside intervention and manual engineering?\n\nIn this paper, we take a step toward this goal by designing an RL framework that jointly learns\nrepresentations of raw sensory inputs and policies that achieve arbitrary goals under this representation\nby practicing to reach self-speci\ufb01ed random goals during training. To provide for automated and\n\ufb02exible goal-setting, we must \ufb01rst choose how a general goal can be speci\ufb01ed for an agent interacting\nwith a complex and highly variable environment. Even providing the state of such an environment\nto a policy is a challenge. For instance, a task that requires a robot to manipulate various objects\nwould require a combinatorial representation, re\ufb02ecting variability in the number and type of objects\nin the current scene. Directly using raw sensory signals, such as images, avoids this challenge, but\nlearning from raw images is substantially harder. In particular, pixel-wise Euclidean distance is\nnot an effective reward function for visual tasks since distances between images do not correspond\nto meaningful distances between states [36, 49]. Furthermore, although end-to-end model-free\nreinforcement learning can handle image observations, this comes at a high cost in sample complexity,\nmaking it dif\ufb01cult to use in the real world.\n\n\u2217Equal contribution. Order was determined by coin \ufb02ip.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: We train a VAE using data generated by our exploration policy (left). We use the VAE for\nmultiple purposes during training time (middle): to sample goals to train the policy, to embed the\nobservations into a latent space, and to compute distances in the latent space. During test time (right),\nwe embed a speci\ufb01ed goal observation og into a goal latent zg as input to the policy. Videos of our\nmethod can be found at sites.google.com/site/visualrlwithimaginedgoals\n\nWe propose to address both challenges by incorporating unsupervised representation learning into\ngoal-conditioned policies. In our method, which is illustrated in Figure 1, a representation of raw\nsensory inputs is learned by means of a latent variable model, which in our case is based on the\nvariational autoencoder (VAE) [19]. This model serves three complementary purposes. First, it\nprovides a more structured representation of sensory inputs for RL, making it feasible to learn from\nimages even in the real world. Second, it allows for sampling of new states, which can be used to\nset synthetic goals during training to allow the goal-conditioned policy to practice diverse behaviors.\nWe can also utilize samples from the environment more ef\ufb01ciently by relabeling synthetic goals\nin an off-policy RL algorithm, which makes our algorithm substantially more ef\ufb01cient. Third, the\nlearned representation provides a space where distances are more meaningful than the original space\nof observations, and can therefore provide well-shaped reward functions for RL. By learning to reach\nrandom goals sampled from the latent variable model, the goal-conditioned policy learns about the\nworld and can be used to achieve new, user-speci\ufb01ed goals at test-time.\n\nThe main contribution of our work is a framework for learning general-purpose goal-conditioned\npolicies that can achieve goals speci\ufb01ed with target observations. We call our method reinforcement\nlearning with imagined goals (RIG). RIG combines sample-ef\ufb01cient off-policy goal-conditioned\nreinforcement learning with unsupervised representation learning. 
We use representation learning\nto acquire a latent distribution that can be used to sample goals for unsupervised practice and\ndata augmentation, to provide a well-shaped distance function for reinforcement learning, and to\nprovide a more structured representation for the value function and policy. While several prior\nmethods, discussed in the following section, have sought to learn goal-conditioned policies, we\ncan do so with image goals and observations without a manually speci\ufb01ed reward signal. Our\nexperimental evaluation illustrates that our method substantially improves the performance of image-\nbased reinforcement learning, can effectively learn policies for complex image-based tasks, and can\nbe used to learn real-world robotic manipulation skills with raw image inputs. Videos of our method\nin simulated and real-world environments can be found at https://sites.google.com/site/\nvisualrlwithimaginedgoals/.\n\n2 Related Work\n\nWhile prior works on vision-based deep reinforcement learning for robotics can ef\ufb01ciently learn a\nvariety of behaviors such as grasping [33, 32, 24], pushing [1, 8, 10], navigation [30, 21], and other\nmanipulation tasks [26, 23, 30], they each make assumptions that limit their applicability to training\ngeneral-purpose robots. Levine et al. [23] uses time-varying models, which requires an episodic setup\nthat makes them dif\ufb01cult to extend to non-episodic and continual learning scenarios. Pinto et al. [33]\nproposed a similar approach that uses goal images, but requires instrumented training in simulation.\nLillicrap et al. [26] uses fully model-free training, but does not learn goal-conditioned skills. As we\nshow in our experiments, this approach is very dif\ufb01cult to extend to the goal-conditioned setting\nwith image inputs. 
Model-based methods that predict images [48, 10, 8, 28] or learn inverse models\n[1] can also accommodate various goals, but tend to limit the horizon length due to model drift. To\nour knowledge, no prior method uses model-free RL to learn policies conditioned on a single goal\nimage with suf\ufb01cient ef\ufb01ciency to train directly on real-world robotic systems, without access to\nground-truth state or reward information during training.\n\nOur method uses a goal-conditioned value function [40] in order to solve more general tasks [45, 18].\nTo improve the sample-ef\ufb01ciency of our method during off-policy training, we retroactively relabel\nsamples in the replay buffer with goals sampled from the latent representation. Goal relabeling has\nbeen explored in prior work [18, 2, 37, 25, 35]. Andrychowicz et al. [2] and Levy et al. [25] use goal\nrelabeling for sparse rewards problems with known goal spaces, restricting the resampled goals to\nstates encountered along that trajectory, since almost any other goal will have no reward signal. We\nsample random goals from our learned latent space to use as replay goals for off-policy Q-learning\nrather than restricting ourselves to states seen along the sampled trajectory, enabling substantially\nmore ef\ufb01cient learning. We use the same goal sampling mechanism for exploration in RL. Goal\nsetting for policy learning has previously been discussed [3] and recently P\u00e9r\u00e9 et al. [31] have also\nproposed using unsupervised learning for setting goals for exploration. 
However, we use a model-free\nQ-learning method that operates on raw state observations and actions, allowing us to solve visually\nand dynamically complex tasks.\n\nA number of prior works have used unsupervised learning to acquire better representations for RL.\nThese methods use the learned representation as a substitute for the state for the policy, but require\nadditional information, such as access to the ground truth reward function based on the true state\nduring training time [16, 14, 48, 11, 21, 17], expert trajectories [44], human demonstrations [42],\nor pre-trained object-detection features [22]. In contrast, we learn to generate goals and use the\nlearned representation to obtain a reward function for those goals without any of these extra sources\nof supervision. Finn et al. [11] combine unsupervised representation learning with reinforcement\nlearning, but in a framework that trains a policy to reach a single goal. Many prior works have also\nfocused on learning controllable and disentangled representations [41, 5, 6, 38, 7, 46]. We use a\nmethod based on variational autoencoders, but these prior techniques are complementary to ours and\ncould be incorporated into our method.\n\n3 Background\n\nOur method combines reinforcement learning with goal-conditioned value functions and unsupervised\nrepresentation learning. Here, we brie\ufb02y review the techniques that we build on in our method.\n\nGoal-conditioned reinforcement learning. In reinforcement learning, the goal is to learn a policy\n\u03c0(st) = at that maximizes expected return, which we denote as Rt = E[\u2211T\ni=t \u03b3(i\u2212t)ri], where\nri = r(si, ai, si+1) and the expectation is under the current policy and environment dynamics. Here,\ns \u2208 S is a state observation, a \u2208 A is an action, and \u03b3 is a discount factor. Standard model-free RL\nlearns policies that achieve a single task. 
If our aim is instead to obtain a policy that can accomplish a\nvariety of tasks, we can construct a goal-conditioned policy and reward, and optimize the expected\nreturn with respect to a goal distribution: Eg\u223cG[Eri,si\u223cE,ai\u223c\u03c0[R0]], where G is the set of goals and\nthe reward is also a function of g. A variety of algorithms can learn goal-conditioned policies, but to\nenable sample-ef\ufb01cient learning, we focus on algorithms that acquire goal-conditioned Q-functions,\nwhich can be trained off-policy. A goal-conditioned Q-function Q(s, a, g) learns the expected return\nfor the goal g starting from state s and taking action a. Given a state s, action a, next state s\u2032, goal g,\nand corresponding reward r, one can train an approximate Q-function parameterized by w by\nminimizing the following Bellman error\n\nE(w) = (1/2) ||Qw(s, a, g) \u2212 (r + \u03b3 maxa\u2032 Q\u00afw(s\u2032, a\u2032, g))||2,   (1)\n\nwhere \u00afw indicates that \u00afw is treated as a constant. Crucially, one can optimize this loss using off-policy\ndata (s, a, s\u2032, g, r) with a standard actor-critic algorithm [26, 13, 27].\n\nVariational Autoencoders. Variational autoencoders (VAEs) have been demonstrated to learn\nstructured latent representations of high dimensional data [19]. The VAE consists of an encoder q\u03c6,\nwhich maps states to latent distributions, and a decoder p\u03c8, which maps latents to distributions over\nstates. The encoder and decoder parameters, \u03c6 and \u03c8 respectively, are jointly trained to maximize\n\nL(\u03c8, \u03c6; s(i)) = \u2212\u03b2DKL(q\u03c6(z|s(i))||p(z)) + Eq\u03c6(z|s(i))[log p\u03c8(s(i) | z)],   (2)\n\nwhere p(z) is some prior, which we take to be the unit Gaussian, DKL is the Kullback-Leibler\ndivergence, and \u03b2 is a hyperparameter that balances the two terms. The use of \u03b2 values other than one\nis sometimes referred to as a \u03b2-VAE [15]. 
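A minimal numpy sketch of a single-sample estimate of the objective in Equation (2), assuming a diagonal Gaussian encoder and a per-pixel Bernoulli decoder; the linear "decoder" and all numbers here are hypothetical stand-ins for the convolutional networks used in practice:

```python
import numpy as np

def beta_vae_objective(x, mu, log_var, decode, beta=1.0, rng=None):
    """Single-sample estimate of the beta-VAE bound (Eq. 2).

    x: observation with pixel values normalized to [0, 1].
    mu, log_var: encoder outputs defining q(z|x) = N(mu, diag(exp(log_var))).
    decode: maps a latent z to per-pixel Bernoulli means.
    """
    rng = rng or np.random.default_rng(0)
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    # Reparameterized sample z = mu + sigma * eps.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    p = np.clip(decode(z), 1e-7, 1 - 1e-7)
    # Bernoulli log-likelihood == negative cross-entropy on pixels.
    log_px = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return -beta * kl + log_px

# Hypothetical 4-pixel "image" and linear sigmoid decoder, for illustration only.
W = np.full((4, 2), 0.1)
decode = lambda z: 1.0 / (1.0 + np.exp(-W @ z))
x = np.array([0.0, 1.0, 0.5, 0.25])
L = beta_vae_objective(x, mu=np.zeros(2), log_var=np.zeros(2), decode=decode, beta=5.0)
```

With mu = 0 and log-variance 0 the KL term vanishes, so the bound reduces to the Bernoulli log-likelihood of the reconstruction; larger beta penalizes encodings that deviate from the prior.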
The encoder q\u03c6 parameterizes the mean and diagonal log-variance\nof a Gaussian distribution, q\u03c6(s) = N(\u00b5\u03c6(s), \u03c3\u00b2\u03c6(s)). The decoder p\u03c8 parameterizes a\nBernoulli distribution for each pixel value. This parameterization corresponds to training the decoder\nwith cross-entropy loss on normalized pixel values. Full details of the hyperparameters are in the\nSupplementary Material.\n\n4 Goal-Conditioned Policies with Unsupervised Representation Learning\n\nTo devise a practical algorithm based on goal-conditioned value functions, we must choose a suitable\ngoal representation. In the absence of domain knowledge and instrumentation, a general-purpose\nchoice is to set the goal space G to be the same as the state observation space S. This choice is fully\ngeneral as it can be applied to any task, and still permits considerable user control since the user can\nchoose a \u201cgoal state\u201d to set a desired goal for a trained goal-conditioned policy. But when the state\nspace S corresponds to high-dimensional sensory inputs such as images,1 learning a goal-conditioned\nQ-function and policy becomes exceedingly dif\ufb01cult, as we illustrate empirically in Section 5.\n\nOur method jointly addresses a number of problems that arise when working with high-dimensional\ninputs such as images: sample-ef\ufb01cient learning, reward speci\ufb01cation, and automated goal-setting.\nWe address these problems by learning a latent embedding using a \u03b2-VAE. We use this latent space\nto represent the goal and state and retroactively relabel data with latent goals sampled from the\nVAE prior to improve sample ef\ufb01ciency. We also show that distances in the latent space give us a\nwell-shaped reward function for images. Lastly, we sample from the prior to allow an agent to set\nand \u201cpractice\u201d reaching its own goal, removing the need for humans to specify new goals during\ntraining time. 
We next describe the speci\ufb01c components of our method, and summarize our complete\nalgorithm in Section 4.5.\n\n4.1 Sample-Ef\ufb01cient RL with Learned Representations\n\nOne challenging problem with end-to-end approaches for visual RL tasks is that the resulting policy\nneeds to learn both perception and control. Rather than operating directly on observations, we embed\nthe state st and goals g into a latent space Z using an encoder e to obtain a latent state zt = e(st)\nand latent goal zg = e(g). To learn a representation of the state and goal space, we train a \u03b2-VAE by\nexecuting a random policy and collecting state observations, {s(i)}, and optimize Equation (2). We\nthen use the mean of the encoder as the state encoding, i.e., z = e(s) \u225c \u00b5\u03c6(s).\n\nAfter training the VAE, we train a goal-conditioned Q-function Q(z, a, zg) and corresponding policy\n\u03c0\u03b8(z, zg) in this latent space. The policy is trained to reach a goal zg using the reward function\ndiscussed in Section 4.2. For the underlying RL algorithm, we use twin delayed deep deterministic\npolicy gradients (TD3) [13], though any value-based RL algorithm could be used. Note that the\npolicy (and Q-function) operates completely in the latent space. During test time, to reach a speci\ufb01c\ngoal state g, we encode the goal zg = e(g) and input this latent goal to the policy.\n\nAs the policy improves, it may visit parts of the state space that the VAE was never trained on,\nresulting in arbitrary encodings that may not make learning easier. Therefore, in addition to the\nprocedure described above, we \ufb01ne-tune the VAE using both the randomly generated state observations {s(i)}\nand the state observations collected during exploration. 
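Since the policy and Q-function operate entirely on latent vectors, the interface is small: encode the observation and the goal with the encoder mean, then feed the pair to the policy. A sketch with hypothetical linear stand-ins for the trained convolutional VAE encoder and TD3 policy:

```python
import numpy as np

# Hypothetical stand-ins for the trained networks (not the paper's actual weights).
ENC_W = np.array([[0.5, 0.0, 0.1],
                  [0.0, 0.5, -0.1]])          # "mu_phi": 3-dim obs -> 2-dim latent
PI_W = np.array([[1.0, 0.0, -1.0, 0.0],
                 [0.0, 1.0, 0.0, -1.0]])      # (z, z_g) -> 2-dim action

def encode(s):
    """Deterministic state encoding: z = e(s), the encoder mean mu_phi(s)."""
    return ENC_W @ s

def policy(z, z_g):
    """Goal-conditioned policy pi(z, z_g) operating purely in latent space."""
    return np.tanh(PI_W @ np.concatenate([z, z_g]))

s = np.array([0.2, -0.4, 1.0])
g = np.array([0.2, -0.4, 1.0])  # goal observation identical to the current state
a = policy(encode(s), encode(g))
```

Because the policy sees only (z, zg), a goal observation identical to the current state yields a zero action under this toy linear policy, which computes z - zg before the tanh.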
We show in Section 8.3 that this additional\ntraining helps the performance of the algorithm.\n\n4.2 Reward Speci\ufb01cation\n\nTraining the goal-conditioned value function requires de\ufb01ning a goal-conditioned reward r(s, g).\nUsing Euclidean distances in the space of image pixels provides a poor metric, since similar con\ufb01gu-\nrations in the world can be extremely different in image space. In addition to compactly representing\nhigh-dimensional observations, we can utilize our representation to obtain a reward function based\non a metric that better re\ufb02ects the similarity between the state and the goal. One choice for such a\nreward is to use the negative Mahalanobis distance in the latent space:\n\nr(s, g) = \u2212||e(s) \u2212 e(g)||A = \u2212||z \u2212 zg||A,\n\nwhere the matrix A weights different dimensions in the latent space. This approach has an appealing\ninterpretation when we set A to be the precision matrix of the VAE encoder, q\u03c6. Since we use a\nGaussian encoder, we have that\n\nr(s, g) = \u2212||z \u2212 zg||A \u221d \u221a(log e\u03c6(zg | s))   (3)\n\nIn other words, minimizing this squared distance in the latent space is equivalent to rewarding\nreaching states that maximize the probability of the latent goal zg. In practice, we found that setting\nA = I, corresponding to Euclidean distance, performed better than Mahalanobis distance, though its\neffect is the same \u2014 to bring z close to zg and maximize the probability of the latent goal zg given\nthe observation. This interpretation would not be possible when using normal autoencoders since\ndistances are not trained to have any probabilistic meaning.\n\n1We make the simplifying assumption that the system is Markovian with respect to the sensory input, and\none could incorporate memory into the state for partially observed tasks. 
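A small sketch of this latent reward: with A omitted it is the Euclidean case A = I used in RIG, and a diagonal precision matrix gives the Mahalanobis variant (all vectors here are made-up examples):

```python
import numpy as np

def latent_reward(z, z_g, A=None):
    """Negative latent distance reward (Eq. 3).

    A=None gives the Euclidean case A = I; passing a precision
    matrix gives the Mahalanobis / log-probability variant.
    """
    d = z - z_g
    if A is None:
        return -np.linalg.norm(d)
    return -np.sqrt(d @ A @ d)

z = np.array([1.0, 2.0])
z_g = np.array([1.0, -1.0])
r_euclid = latent_reward(z, z_g)                         # -||z - z_g||
r_mahal = latent_reward(z, z_g, A=np.diag([4.0, 0.25]))  # precision-weighted

# The reward is maximized (= 0) exactly when the latent matches the goal.
r_at_goal = latent_reward(z_g, z_g)
```

Note how a small encoder variance (large precision entry) inflates the corresponding dimension of the Mahalanobis distance, consistent with the instability of the log-probability reward reported in Section 5.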
Indeed, we show in Section 5 that using\ndistances in a normal autoencoder representation often does not result in meaningful behavior.\n\n4.3\n\nImproving Sample Ef\ufb01ciency with Latent Goal Relabeling\n\nTo further enable sample-ef\ufb01cient learning in the real world, we use the VAE to relabel goals. Note\nthat we can optimize Equation (1) using any valid (s, a, s\u2032, g, r) tuple. If we could arti\ufb01cially generate\nthese tuples, then we could train our entire RL algorithm without collecting any data. Unfortunately,\nwe do not know the system dynamics, and therefore have to sample transitions (s, a, s\u2032) by interacting\nwith the world. However, we have the freedom to relabel the goal and reward synthetically. So if we\nhave a mechanism for generating goals and computing rewards, then given (s, a, s\u2032), we can generate\na new goal g and new reward r(s, a, s\u2032, g) to produce a new tuple (s, a, s\u2032, g, r). By arti\ufb01cially\ngenerating and recomputing rewards, we can convert a single (s, a, s\u2032) transition into potentially\nin\ufb01nitely many valid training datums.\n\nFor image-based tasks, this procedure would require generating goal images, an onerous task on its\nown. However, our reinforcement learning algorithm operates directly in the latent space for goals\nand rewards. So rather than generating goals g, we generate latent goals zg by sampling from the\nVAE prior p(z). We then recompute rewards using Equation (3). By retroactively relabeling the goals\nand rewards, we obtain much more data to train our value function. This sampling procedure is made\npossible by our use of a latent variable model, which is explicitly trained so that sampling from the\nlatent distribution is straightforward.\n\nIn practice, the distribution of latents will not exactly match the prior. 
To mitigate this distribution\nmismatch, we use a \ufb01tted prior when sampling: we \ufb01t a diagonal Gaussian to the latent\nencodings of the VAE training data, and use this \ufb01tted prior in place of the unit Gaussian prior.\n\nRetroactively generating goals is also explored in tabular domains by Kaelbling [18] and in continuous\ndomains by Andrychowicz et al. [2] using hindsight experience replay (HER). However, HER is\nlimited to sampling goals seen along a trajectory, which greatly limits the number and diversity of\ngoals with which one can relabel a given transition. Our \ufb01nal method uses a mixture of the two\nstrategies: half of the goals are generated from the prior and half of the goals use the \u201cfuture\u201d strategy\ndescribed in Andrychowicz et al. [2]. We show in Section 5 that relabeling the goal with samples\nfrom the VAE prior results in signi\ufb01cantly better sample-ef\ufb01ciency.\n\n4.4 Automated Goal-Generation for Exploration\n\nIf we do not know which particular goals will be provided at test time, we would like our RL agent to\ncarry out a self-supervised \u201cpractice\u201d phase during training, where the algorithm proposes its own\ngoals, and then practices how to reach them. Since the VAE prior represents a distribution over latent\ngoals and state observations, we again sample from this distribution to obtain plausible goals. After\nsampling a goal latent from the prior zg \u223c p(z), we give this to our policy \u03c0(z, zg) to collect data.\n\n4.5 Algorithm Summary\n\nWe call the complete algorithm reinforcement learning with imagined goals (RIG) and summarize it\nin Algorithm 1. 
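The fitted prior and the 50/50 relabeling mixture described above can be sketched as follows (a simplified stand-in: in the real method the "latents" are VAE encodings of images, and the relabeled goals feed the off-policy Q-update):

```python
import numpy as np

def fit_prior(latents):
    """Fit a diagonal Gaussian to the latent encodings of the VAE
    training data, used in place of the unit Gaussian prior."""
    return latents.mean(axis=0), latents.std(axis=0)

def relabel_goal(traj_latents, t, prior, rng):
    """Relabel a transition's goal: with probability 0.5 sample from the
    fitted prior, otherwise use the 'future' strategy (the latent of a
    state later in the same trajectory)."""
    mu, sigma = prior
    if rng.random() < 0.5:
        return mu + sigma * rng.standard_normal(mu.shape)  # prior sample
    h = rng.integers(t + 1, len(traj_latents))             # future index, h > t
    return traj_latents[h]

rng = np.random.default_rng(0)
# Hypothetical training latents with mean ~1 and std ~2 per dimension.
train_latents = rng.normal(loc=1.0, scale=2.0, size=(1000, 2))
prior = fit_prior(train_latents)
traj = rng.normal(size=(10, 2))  # latent encodings along one trajectory
new_goal = relabel_goal(traj, t=3, prior=prior, rng=rng)
```

Either branch yields a goal in the same latent space, so the recomputed reward is always the latent distance of Equation (3).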
We \ufb01rst collect data with a simple exploration policy, though any exploration strategy\ncould be used for this stage, including off-the-shelf exploration bonuses [29, 4] or unsupervised\nreinforcement learning methods [9, 12]. Then, we train a VAE latent variable model on state\nobservations and \ufb01netune it over the course of training. We use this latent variable model for multiple\npurposes: We sample a latent goal zg from the model and condition the policy on this goal. We embed\nall states and goals using the model\u2019s encoder. When we train our goal-conditioned value function,\nwe resample goals from the prior and compute rewards in the latent space using Equation (3).\n\nAlgorithm 1 RIG: Reinforcement learning with imagined goals\n\nRequire: VAE encoder q\u03c6, VAE decoder p\u03c8, policy \u03c0\u03b8, goal-conditioned value function Qw.\n1: Collect D = {s(i)} using exploration policy.\n2: Train \u03b2-VAE on D by optimizing (2).\n3: Fit prior p(z) to latent encodings {\u00b5\u03c6(s(i))}.\n4: for n = 0, ..., N \u2212 1 episodes do\n5: Sample latent goal from prior zg \u223c p(z).\n6: Sample initial state s0 \u223c E.\n7: for t = 0, ..., H \u2212 1 steps do\n8: Get action at = \u03c0\u03b8(e(st), zg) + noise.\n9: Get next state st+1 \u223c p(\u00b7 | st, at).\n10: Store (st, at, st+1, zg) into replay buffer R.\n11: Sample transition (s, a, s\u2032, zg) \u223c R.\n12: Encode z = e(s), z\u2032 = e(s\u2032).\n13: (Probability 0.5) replace zg with z\u2032g \u223c p(z).\n14: Compute new reward r = \u2212||z\u2032 \u2212 zg||.\n15: Minimize (1) using (z, a, z\u2032, zg, r).\n16: end for\n17: for t = 0, ..., H \u2212 1 steps do\n18: for i = 0, ..., k \u2212 1 steps do\n19: Sample future state shi, t < hi \u2264 H \u2212 1.\n20: Store (st, at, st+1, e(shi)) into R.\n21: end for\n22: end for\n23: Fine-tune \u03b2-VAE every K episodes on mixture of D and R.\n24: end for\n\n
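An executable sketch of this training loop, with hypothetical stand-ins: a 2-D point "environment", an identity "encoder", and a hand-coded placeholder policy replace the real image environment, VAE, and TD3 update (which is omitted). What remains is the data flow of Algorithm 1: imagine a goal, act, relabel, and compute latent rewards:

```python
import numpy as np

rng = np.random.default_rng(0)
H, LATENT = 8, 2  # episode horizon and latent dimension

def step(s, a):
    """Toy point-mass dynamics standing in for the real environment."""
    return s + 0.1 * a

def encode(s):
    """Identity 'encoder' standing in for the VAE mean mu_phi."""
    return s.copy()

def sample_prior():
    """Stand-in for sampling an imagined latent goal z_g ~ p(z)."""
    return rng.standard_normal(LATENT)

replay = []
for episode in range(3):
    z_g = sample_prior()                # imagined goal for this episode
    s = rng.standard_normal(2)          # initial state
    traj = []
    for t in range(H):
        a = np.clip(z_g - encode(s), -1, 1)  # placeholder policy, not TD3
        s_next = step(s, a)
        traj.append((s, a, s_next))
        s = s_next
    # Store transitions, relabeling half the goals from the prior
    # and half with the 'future' strategy.
    for t, (s0, a0, s1) in enumerate(traj):
        if rng.random() < 0.5:
            g = sample_prior()
        else:
            h = rng.integers(t, H)      # future (or current) index
            g = encode(traj[h][2])
        r = -np.linalg.norm(encode(s1) - g)  # latent-distance reward
        replay.append((encode(s0), a0, encode(s1), g, r))
```

In the full method, each stored tuple would drive a minimization of the Bellman error (1), and the VAE would be periodically fine-tuned on the replayed observations.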
Any\nRL algorithm that trains Q-functions could be used, and we use TD3 [13] in our implementation.\n\n5 Experiments\n\nOur experiments address the following questions:\n\n1. How does our method compare to prior model-free RL algorithms in terms of sample\nef\ufb01ciency and performance, when learning continuous control tasks from images?\n\n2. How critical is each component of our algorithm for ef\ufb01cient learning?\n\n3. Does our method work on tasks where the state space cannot be easily speci\ufb01ed ahead of\ntime, such as tasks that require interaction with variable numbers of objects?\n\n4. Can our method scale to real-world vision-based robotic control tasks?\n\nFor the \ufb01rst two questions, we evaluate our method against a number of prior algorithms and ablated\nversions of our approach on a suite of the following simulated tasks. Visual Reacher: a MuJoCo [47]\nenvironment with a 7-dof Sawyer arm reaching goal positions. The arm is shown on the left of Figure\n2. The end-effector (EE) is constrained to a 2-dimensional rectangle parallel to a table. The action\ncontrols EE velocity, up to a maximum speed. Visual Pusher: a MuJoCo environment with a 7-dof\nSawyer arm and a small puck on a table that the arm must push to a target position. Visual Multi-Object\nPusher: a copy of the Visual Pusher environment with two pucks. Visual Door: a Sawyer arm with a\ndoor it can attempt to open by latching onto the handle. Visual Pick and Place: a Sawyer arm with\na small ball and an additional dimension of control for opening and closing the gripper. Detailed\ndescriptions of the environments are provided in the Supplementary Material.\n\nFigure 2: (Left) The simulated pusher, door opening, and pick-and-place environments are pictured.\n(Right) Test rollouts from our learned policy on the three pushing environments. Each row is one\nrollout. The right two columns show a goal image g and its VAE reconstruction \u02c6g. 
The images to\ntheir left show frames from a trajectory to reach the given goal.\n\nFigure 3: Simulation results, \ufb01nal distance to goal vs. simulation steps.2 RIG (red) consistently\noutperforms the baselines, except for the oracle, which uses ground-truth object state for observations\nand rewards. On the hardest tasks, only our method and the oracle discover viable solutions.\n\nSolving these tasks directly from images poses a challenge since the controller must learn both\nperception and control. The evaluation metric is the distance of objects (including the arm) to their\nrespective goals. To evaluate our policy, we set the environment to a sampled goal position, capture\nan image, and encode the image to use as the goal. Although we use the ground-truth positions for\nevaluation, we do not use the ground-truth positions for training the policies. The only inputs\nfrom the environment that our algorithm receives are the image observations. For Visual Reacher, we\npretrained the VAE with 100 images. For other tasks, we used 10,000 images.\n\nWe compare our method with the following prior works. L&R: Lange and Riedmiller [20] trains an\nautoencoder to handle images. DSAE: Deep spatial autoencoders [11] learns a spatial autoencoder and\nuses guided policy search [23] to achieve a single goal image. HER: Hindsight experience replay [2]\nutilizes a sparse reward signal and relabels trajectories with achieved goals. Oracle: RL with direct\naccess to state information for observations and rewards.\n\nTo our knowledge, no prior work demonstrates policies that can reach a variety of goal images\nwithout access to a true-state reward function, and so we needed to make modi\ufb01cations to make the\ncomparisons feasible. L&R assumes a reward function from the environment. Since we have no\nstate-based reward function, we specify the reward function as distance in the autoencoder latent\nspace. 
HER does not embed inputs into a latent space but instead operates directly on the input, so we\nuse pixel-wise mean squared error (MSE) as the metric. DSAE is trained only for a single goal, so we\nallow the method to generalize to a variety of test goal images by using a goal-conditioned Q-function.\nTo make the implementations comparable, we use the same off-policy algorithm, TD3 [13], to train\nL&R, HER, and our method. Unlike our method, prior methods do not specify how to select goals\nduring training, so we favorably give them real images as goals for rollouts, sampled from the same\ndistribution that we use to test.\n\nWe see in Figure 3 that our method can ef\ufb01ciently learn policies\nfrom visual inputs to perform simulated reaching and pushing,\nwithout access to the object state. Our approach substantially\noutperforms the prior methods, for which the use of image\ngoals and observations poses a major challenge. HER struggles\nbecause pixel-wise MSE is hard to optimize. Our latent-space\nrewards are much better shaped and allow us to learn more\ncomplex tasks. Finally, our method is close to the state-based\n\u201coracle\u201d method in terms of sample ef\ufb01ciency and performance,\nwithout having any access to object state. Notably, in the\nmulti-object environment, our method actually outperforms\nthe oracle, likely because the state-based reward contains local\nminima. Overall, these results show that our method is capable\nof handling raw image observations much more effectively\nthan previously proposed goal-conditioned RL methods. Next,\nwe perform ablations to evaluate our contributions in isolation. Results on Visual Pusher are shown\nhere; see the Supplementary Material (Section 8) for experiments on all three simulated environments.\n\nFigure 4: Reward type ablation results. RIG (red), which uses latent Euclidean distance, outperforms\nthe other methods.\n\n2In all our simulation results, each plot shows a 95% con\ufb01dence interval of the mean across 5 seeds.\n\nFigure 7: (Left) Our method compared to the HER baseline and oracle on a real-world visual reaching\ntask. (Middle) Our robot setup is pictured. (Right) Test rollouts of our learned policy.\n\nReward Speci\ufb01cation Comparison We evaluate how effective distance in the VAE latent space\nis for the Visual Pusher task. We keep our method the same, and only change the reward function\nthat we use to train the goal-conditioned value function. We include the following methods for\ncomparison: Latent Distance, which uses the reward used in RIG, i.e. A = I in Equation (3); Log\nProbability, which uses the Mahalanobis distance in Equation (3), where A is the precision matrix of\nthe encoder; and Pixel MSE, which uses mean-squared error (MSE) between state and goal in pixel\nspace.3 In Figure 4, we see that latent distance signi\ufb01cantly outperforms log probability. We suspect\nthat small variances of the VAE encoder result in drastically large rewards, making the learning more\ndif\ufb01cult. 
We also see that latent distance results in faster learning compared to pixel MSE.

Relabeling Strategy Comparison As described in Section 4.3, our method uses a novel goal relabeling method based on sampling from the generative model. To isolate how much this relabeling method contributes to our algorithm, we vary the resampling strategy while fixing the other components of our algorithm. The resampling strategies that we consider are: Future, relabeling the goal for a transition by sampling uniformly from future states in the trajectory, as done in Andrychowicz et al. [2]; VAE, sampling goals from the VAE only; RIG, relabeling goals with probability 0.5 from the VAE and probability 0.5 using the future strategy; and None, no relabeling. In Figure 5, we see that both the VAE and Future strategies are significantly better than not relabeling at all. In RIG, we use an equal mixture of the VAE and Future sampling strategies, which performs best by a large margin. Appendix Section 8.1 contains results on all simulated environments, and Section 8.4 considers relabeling strategies with a known goal distribution.

Figure 5: Relabeling ablation.

Learning with Variable Numbers of Objects A major advantage of working directly from pixels is that the policy input can easily represent combinatorial structure in the environment, which would be difficult to encode into a fixed-length state vector even if a perfect perception system were available.
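As a toy illustration of this point (the shapes and helper below are illustrative assumptions, not the paper's actual dimensions): a ground-truth state vector changes dimension with the number of objects, while an image observation does not.

```python
import numpy as np

def state_vector(arm_xy, object_xys):
    """Hypothetical fixed-format state: arm position plus one 2D pose per object.
    Its length grows with the number of objects in the scene."""
    return np.concatenate([np.asarray(arm_xy)] + [np.asarray(xy) for xy in object_xys])

# An image observation, by contrast, always has the same shape.
IMAGE_SHAPE = (84, 84, 3)  # illustrative camera resolution

one_object = state_vector(np.zeros(2), [np.zeros(2)])
two_objects = state_vector(np.zeros(2), [np.zeros(2), np.zeros(2)])
# one_object has 4 entries but two_objects has 6: the state representation
# itself must change with the scene, whereas IMAGE_SHAPE stays fixed.
```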
For example, if a robot has to interact with different combinations and numbers of objects, picking a single MDP state representation would be challenging, even with access to object poses. By directly processing images for both the state and the goal, no modification is needed to handle the combinatorial structure: the number of pixels always remains the same, regardless of how many objects are in the scene.

Figure 6: Training curve for learning with a varying number of objects.

We demonstrate that our method can handle this difficult scenario by evaluating on a task where the environment, based on the Visual Multi-Object Pusher, randomly contains zero, one, or two objects in each episode during testing. During training, each episode still always starts with both objects in the scene, so this experiment tests whether a trained policy can handle a variable number of objects at test time. Figure 6 shows that our method can learn to solve this task successfully, without a decrease in performance from the base setting where both objects are present (Figure 3). Developing and demonstrating algorithms that solve tasks with varied underlying structure is an important step toward creating autonomous agents that can handle the diversity of tasks present “in the wild.”

3 To compute the pixel MSE for a sampled latent goal, we decode the goal latent using the VAE decoder, pψ, to generate the corresponding goal image.

Puck Distance to Goal (cm):  RIG 4.5 ± 2.5  |  HER 14.9 ± 5.4

Figure 8: (Left) The learning curve for real-world pushing. (Middle) Our robot pushing setup is pictured, with frames from test rollouts of our learned policy. (Right) Our method compared to the HER baseline on the real-world visual pushing task. We evaluated the performance of each method by manually measuring the distance between the goal position of the puck and the final position of the puck for 15 test rollouts, reporting the mean and standard deviation.

5.1 Visual RL with Physical Robots

RIG is a practical and straightforward algorithm to apply to real physical systems: the efficiency of off-policy learning with goal relabeling makes training times manageable, while the use of image-based rewards through the learned representation frees us from the burden of manually designing reward functions, which can itself require hand-engineered perception systems [39]. We trained policies for visual reaching and pushing on a real-world Sawyer robotic arm, shown in Figure 7. The control setup matches Visual Reacher and Visual Pusher respectively, meaning that the only input from the environment consists of camera images.

We see in Figure 7 that our method is applicable to real-world robotic tasks, almost matching the state-based oracle method and far exceeding the baseline method on the reaching task. Our method needs just 10,000 samples, or about an hour of real-world interaction time, to solve visual reaching.

Real-world pushing results are shown in Figure 8. To solve visual pushing, which is more visually complicated and requires reasoning about the contact between the arm and object, our method requires about 25,000 samples, which is still a reasonable amount of real-world training time. Note that, unlike in the previous results, we do not have access to the true puck position during training, so for the learning curve we report the test episode return under the VAE latent-distance reward.
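The learning-curve metric just described, i.e. the test episode return under the latent-distance reward, can be sketched as follows (a minimal sketch; the encoded trajectory and goal are assumed to be given as arrays):

```python
import numpy as np

def latent_episode_return(latent_traj, latent_goal):
    """Sum over an episode of the per-step reward r_t = -||z_t - z_g||,
    where z_t is the VAE encoding of the observation at step t and z_g is
    the encoding of the goal image. No ground-truth puck state is needed."""
    return float(sum(-np.linalg.norm(z - latent_goal) for z in latent_traj))
```

Higher (less negative) returns indicate that the policy is reaching states whose encodings lie closer to the goal encoding.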
We see RIG making steady progress at optimizing the latent distance as learning proceeds.

6 Discussion and Future Work

In this paper, we present a new RL algorithm that can efficiently solve goal-conditioned, vision-based tasks without access to any ground-truth state or reward functions. Our method trains a generative model that is used for multiple purposes: we embed states and goals using the encoder; we sample from the prior to generate goals for exploration; we sample latents to retroactively relabel goals and rewards; and we use distances in the latent space as rewards to train a goal-conditioned value function. We show that these components culminate in a sample-efficient algorithm that works directly from vision. As a result, we are able to apply our method to a variety of simulated visual tasks, including a variable-object task that cannot be easily represented with a fixed-length vector, as well as real-world robotic tasks. Algorithms that can learn in the real world and directly use raw images can allow a single policy to solve a large and diverse set of tasks, even when these tasks require distinct internal representations.

7 Acknowledgements

We would like to thank Aravind Srinivas and Pulkit Agrawal for useful discussions, and Alex Lee for helpful feedback on an initial draft of the paper. We would also like to thank Carlos Florensa for making multiple useful suggestions on later versions of the draft. This work was supported by the National Science Foundation (IIS-1651843 and IIS-1614653), a Huawei Fellowship, Berkeley DeepDrive, Siemens, and support from NVIDIA.

References

[1] Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to Poke by Poking: Experiential Learning of Intuitive Physics.
In Advances in Neural Information Processing Systems (NIPS), 2016.

[2] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017.

[3] Adrien Baranes and Pierre-Yves Oudeyer. Active Learning of Inverse Models with Intrinsically Motivated Goal Exploration in Robots. Robotics and Autonomous Systems, 61(1):49–73, 2012.

[4] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), pages 1471–1479, 2016.

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2172–2180, 2016.

[6] Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.

[7] Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Disentangling factors of variation via generative entangling. CoRR, abs/1210.5, 2012.

[8] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-Supervised Visual Planning with Temporal Skip Connections. In Conference on Robot Learning (CoRL), 2017.

[9] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is All You Need: Learning Skills without a Reward Function. arXiv preprint arXiv:1802.06070, 2018.

[10] Chelsea Finn and Sergey Levine. Deep Visual Foresight for Planning Robot Motion.
In Advances in Neural Information Processing Systems (NIPS), 2016.

[11] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 512–519, 2016.

[12] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.

[13] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. arXiv preprint arXiv:1802.09477, 2018.

[14] David Ha and Jürgen Schmidhuber. World Models. arXiv preprint arXiv:1803.10122, 2018.

[15] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017.

[16] Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

[17] Rico Jonschkowski, Roland Hafner, Jonathan Scholz, and Martin Riedmiller. PVEs: Position-velocity encoders for unsupervised learning of structured state representations. arXiv preprint arXiv:1705.09805, 2017.

[18] Leslie Pack Kaelbling. Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI), volume 2, pages 1094–1098, 1993.

[19] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

[20] Sascha Lange and Martin A Riedmiller.
Deep learning of visual control policies. In European Symposium on Artificial Neural Networks (ESANN), 2010.