{"title": "Evolved Policy Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 5400, "page_last": 5409, "abstract": "We propose a metalearning approach for learning gradient-based reinforcement learning (RL) algorithms. The idea is to evolve a differentiable loss function, such that an agent, which optimizes its policy to minimize this loss, will achieve high rewards. The loss is parametrized via temporal convolutions over the agent's experience. Because this loss is highly flexible in its ability to take into account the agent's history, it enables fast task learning. Empirical results show that our evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method. We also demonstrate that EPG's learned loss can generalize to out-of-distribution test time tasks, and exhibits qualitatively different behavior from other popular metalearning algorithms.", "full_text": "Evolved Policy Gradients\n\nRein Houthooft\u2217, Richard Y. Chen\u2217, Phillip Isola\u2217\u2020\u00d7, Bradly C. Stadie\u2217\u2020, Filip Wolski\u2217,\n\nJonathan Ho\u2217\u2020, Pieter Abbeel\u2020\nOpenAI\u2217, UC Berkeley\u2020, MIT\u00d7\n\nAbstract\n\nWe propose a metalearning approach for learning gradient-based reinforcement\nlearning (RL) algorithms. The idea is to evolve a differentiable loss function,\nsuch that an agent, which optimizes its policy to minimize this loss, will achieve\nhigh rewards. The loss is parametrized via temporal convolutions over the agent\u2019s\nexperience. Because this loss is highly \ufb02exible in its ability to take into account\nthe agent\u2019s history, it enables fast task learning. 
Empirical results show that\nour evolved policy gradient algorithm (EPG) achieves faster learning on several\nrandomized environments compared to an off-the-shelf policy gradient method.\nWe also demonstrate that EPG\u2019s learned loss can generalize to out-of-distribution\ntest time tasks, and exhibits qualitatively different behavior from other popular\nmetalearning algorithms.\n\n1\n\nIntroduction\n\nMost current reinforcement learning (RL) agents approach each new task de novo. Initially, they have\nno notion of what actions to try out, nor which outcomes are desirable. Instead, they rely entirely\non external reward signals to guide their initial behavior. Coming from such a blank slate, it is no\nsurprise that RL agents take far longer than humans to learn simple skills [12].\n\nOur aim in this paper is to devise agents that have a prior\nnotion of what constitutes making progress on a novel task.\nRather than encoding this knowledge explicitly through a\nlearned behavioral policy, we encode it implicitly through\na learned loss function. The end goal is agents that can use\nthis loss function to learn quickly on a novel task. This\napproach can be seen as a form of metalearning, in which\nwe learn a learning algorithm. Rather than mining rules\nthat generalize across data points, as in traditional ma-\nchine learning, metalearning concerns itself with devising\nalgorithms that generalize across tasks, by infusing prior\nknowledge of the task distribution [7].\n\nFigure 1: High-level overview of our\napproach.\n\nOur method consists of two optimization loops. In the\ninner loop, an agent learns to solve a task, sampled from a particular distribution over a family of\ntasks. The agent learns to solve this task by minimizing a loss function provided by the outer loop. In\nthe outer loop, the parameters of the loss function are adjusted so as to maximize the \ufb01nal returns\nachieved after inner loop learning. 
Figure 1 provides a high-level overview of this approach.\n\nAlthough the inner loop can be optimized with stochastic gradient descent (SGD), optimizing the\nouter loop presents substantial dif\ufb01culty. Each evaluation of the outer objective requires training a\ncomplete inner-loop agent, and this objective cannot be written as an explicit function of the loss\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fparameters we are optimizing over. Due to the lack of easily exploitable structure in this optimization\nproblem, we turn to evolution strategies (ES) [20, 27, 9, 21] as a blackbox optimizer. The evolved\nloss L can be viewed as a surrogate loss [24, 25] whose gradient is used to update the policy, which\nis similar in spirit to policy gradients, lending the name \u201cevolved policy gradients\".\n\nThe learned loss offers several advantages compared to current RL methods. Since RL methods\noptimize for short-term returns instead of accounting for the complete learning process, they may\nget stuck in local minima and fail to explore the full search space. Prior works add auxiliary reward\nterms that emphasize exploration [3, 10, 17, 32, 2, 18] and entropy loss terms [16, 23, 8, 14]. Using\nES to evolve the loss function allows us to optimize the true objective, namely the \ufb01nal trained\npolicy performance, rather than short-term returns, with the learned loss incentivizing the necessary\nexploration to achieve this. Our method also improves on standard RL algorithms by allowing the\nloss function to be adaptive to the environment and agent history, leading to faster learning and the\npotential for learning without external rewards.\n\nThere has been a \ufb02urry of recent work on metalearning policies, e.g., [5, 33, 6, 13], and it is worth\nasking why metalearn the loss as opposed to directly metalearning the policy? 
Our motivation is that\nwe expect loss functions to be the kind of object that may generalize very well across substantially\ndifferent tasks. This is certainly true of hand-engineered loss functions: a well-designed RL loss\nfunction, such as that in [26], can be very generically applicable, \ufb01nding use in problems ranging\nfrom playing Atari games to controlling robots [26]. In Section 4.3, we \ufb01nd evidence that a loss\nlearned by EPG can train an agent to solve a task outside the distribution of tasks on which EPG was\ntrained. This generalization behavior differs qualitatively from MAML [6] and RL2 [5], methods that\ndirectly metalearn policies.\n\nOur contributions include the following: 1) Formulating a metalearning approach that learns a\ndifferentiable loss function for RL agents, called EPG; 2) Optimizing the parameters of this loss\nfunction via ES, overcoming the challenge that \ufb01nal returns are not explicit functions of the loss\nparameters; 3) Designing a loss architecture that takes into account agent history via temporal\nconvolutions; 4) Demonstrating that EPG produces a learned loss that can train agents faster than an\noff-the-shelf policy gradient method; 5) Showing that EPG\u2019s learned loss can generalize to out-of-\ndistribution test time tasks, exhibiting qualitatively different behavior from other popular metalearning\nalgorithms. An implementation of EPG is available at http://github.com/openai/EPG.\n\n2 Notation and Background\n\nWe model reinforcement learning [30] as a Markov decision process (MDP), de\ufb01ned as the tuple\nM = (S,A, T, R, p0, \u03b3), where S and A are the state and action space. The transition dynamic\nT : S \u00d7 A \u00d7 S (cid:55)\u2192 R+ determines the distribution of the next state st+1 given the current state st\nand the action at. R : S \u00d7 A (cid:55)\u2192 R is the reward function and \u03b3 \u2208 (0, 1) is a discount factor. p0\nis the distribution of the initial state s0. 
An agent\u2019s policy \u03c0 : S \u2192 A generates an action after observing a state. An episode \u03c4 \u223c M with horizon H is a sequence (s0, a0, r0, . . . , sH, aH, rH) of states, actions, and rewards at each timestep t. The discounted episodic return of \u03c4 is defined as R\u03c4 = \u2211_{t=0}^{H} \u03b3^t r_t, which depends on the initial state distribution p0, the agent\u2019s policy \u03c0, and the transition distribution T. The expected episodic return given the agent\u2019s policy \u03c0 is E\u03c0[R\u03c4]. The optimal policy \u03c0\u2217 maximizes the expected episodic return: \u03c0\u2217 = arg max_\u03c0 E\u03c4\u223cM,\u03c0[R\u03c4]. In high-dimensional reinforcement learning settings, the policy \u03c0 is often parametrized using a deep neural network \u03c0\u03b8 with parameters \u03b8. The goal is to solve for \u03b8\u2217 that attains the highest expected episodic return\n\n\u03b8\u2217 = arg max_\u03b8 E\u03c4\u223cM,\u03c0\u03b8[R\u03c4].\n\n(1)\n\nThis objective can be optimized via policy gradient methods [34, 31] by stepping in the direction of E[R\u03c4 \u2207 log \u03c0(\u03c4)]. This gradient can be transformed into a surrogate loss function [24, 25]\n\nLpg = E[R\u03c4 log \u03c0(\u03c4)] = E[R\u03c4 \u2211_{t=0}^{H} log \u03c0(at|st)],\n\n(2)\n\nsuch that the gradient of Lpg equals the policy gradient. This loss function is often transformed through variance reduction techniques including actor-critic algorithms [11]. However, this procedure remains limited since it relies on a particular form of discounting returns, and on taking a fixed gradient step with respect to the policy. Our approach instead learns a loss. 
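To fix ideas, the return and surrogate loss above (Eqs. (1)\u2013(2)) can be sketched in a few lines. This is a minimal numpy illustration over toy trajectory data of our own invention, not the paper's implementation; all function and variable names are ours:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted episodic return R_tau = sum_{t=0}^{H} gamma^t * r_t."""
    return float(sum((gamma ** t) * r for t, r in enumerate(rewards)))

def pg_surrogate_loss(logps, returns):
    """Policy-gradient surrogate (Eq. 2): E[R_tau * sum_t log pi(a_t|s_t)],
    estimated as a mean over sampled trajectories. Differentiating this
    scalar w.r.t. the policy parameters recovers the policy gradient."""
    return float(np.mean([R * np.sum(lp) for lp, R in zip(logps, returns)]))

# Two toy trajectories: per-step log-probabilities and their returns.
logps = [np.array([-0.5, -0.5]), np.array([-1.0, -1.0, -1.0])]
returns = [discounted_return([1.0, 1.0], 0.9),       # 1 + 0.9 = 1.9
           discounted_return([0.0, 0.0, 1.0], 0.9)]  # 0.9^2 = 0.81
loss = pg_surrogate_loss(logps, returns)
```

In an autodiff framework the log-probabilities would be differentiable functions of \u03b8, so a gradient step on this scalar is exactly a policy gradient step.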
Thus, it may be able to discover\nmore effective surrogates for making fast progress toward the ultimate objective of maximizing \ufb01nal\nreturns.\n\n3 Methodology\n\nWe aim to learn a loss function L\u03c6 that outperforms the usual policy gradient surrogate loss [24]. The\nlearned loss function consists of temporal convolutions over the agent\u2019s recent history. In addition to\ninternalizing environment rewards, this loss could, in principle, have several other positive effects. For\nexample, by examining the agent\u2019s history, the loss could incentivize desirable extended behaviors,\nsuch as exploration. Further, the loss could perform a form of system identi\ufb01cation, inferring\nenvironment parameters and adapting how it guides the agent as a function of these parameters (e.g.,\nby adjusting the effective learning rate of the agent). The loss function parameters \u03c6 are evolved\nthrough ES and the loss trains an agent\u2019s policy \u03c0\u03b8 in an on-policy fashion via stochastic gradient\ndescent.\n\n3.1 Metalearning Objective\nWe assume access to a distribution p(M) over MDPs. Given a sampled MDP M, the inner loop\noptimization problem is to minimize the loss L\u03c6 with respect to the agent\u2019s policy \u03c0\u03b8:\n\n\u03b8\u2217 = arg min\n\n\u03b8\n\nE\u03c4\u223cM,\u03c0\u03b8 [L\u03c6(\u03c0\u03b8, \u03c4 )].\n\n(3)\n\nNote that this is similar to the usual RL objectives (Eqs. (1) (2)), except that we are optimizing a\nlearned loss L\u03c6 rather than directly optimizing the expected episodic return EM,\u03c0\u03b8 [R\u03c4 ] or other\nsurrogate losses. 
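The shape of the inner-loop problem (Eq. (3)) can be illustrated with a deliberately tiny stand-in: a scalar "learned loss" whose minimizer encodes the outer loop's prior knowledge, minimized over \u03b8 by plain gradient descent. The quadratic form of the loss and all names here are our assumptions for illustration only, not the paper's evolved loss:

```python
# Toy illustration of the inner loop (Eq. 3): the agent's parameter theta is
# trained by SGD on a *learned* loss L_phi, rather than on returns directly.
# L_phi(theta) = (theta - phi)^2 is a hypothetical stand-in whose parameter
# phi is what the outer loop would adjust.

def learned_loss(theta, phi):
    return (theta - phi) ** 2

def learned_loss_grad(theta, phi):
    return 2.0 * (theta - phi)

def inner_loop(phi, theta0=0.0, lr=0.1, steps=100):
    """Minimize L_phi over theta with plain gradient descent (cf. Eq. 5)."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * learned_loss_grad(theta, phi)
    return theta

theta_star = inner_loop(phi=2.0)  # converges toward the loss's minimizer
```

The point of the sketch is the dependency structure: the final \u03b8\u2217 is an implicit function of \u03c6 only through the whole optimization trajectory, which is why the outer loop cannot differentiate through it directly.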
The outer loop objective is to learn L\u03c6 such that an agent\u2019s policy \u03c0\u03b8\u2217 trained with the loss function achieves high expected returns over the MDP distribution:\n\n\u03c6\u2217 = arg max_\u03c6 EM\u223cp(M) E\u03c4\u223cM,\u03c0\u03b8\u2217 [R\u03c4].\n\n(4)\n\n3.2 Algorithm\n\nThe final episodic return R\u03c4 of a trained policy \u03c0\u03b8\u2217 cannot be represented as an explicit function of the loss function L\u03c6. Thus we cannot use gradient-based methods to directly solve Eq. (4). Our approach, summarized in Algorithm 1, relies on evolution strategies (ES) to optimize the loss function in the outer loop.\n\nAs described by Salimans et al. [21], ES computes the gradient of a function F(\u03c6) according to\n\n\u2207\u03c6 E\u03b5\u223cN(0,I) F(\u03c6 + \u03c3\u03b5) = (1/\u03c3) E\u03b5\u223cN(0,I) [F(\u03c6 + \u03c3\u03b5) \u03b5].\n\nSimilar formulations also appear in prior works, including [29, 28, 15]. In our case, F(\u03c6) = EM\u223cp(M) E\u03c4\u223cM,\u03c0\u03b8\u2217 [R\u03c4] (Eq. (4)). Note that the dependence on \u03c6 comes through \u03b8\u2217 (Eq. (3)).\n\nStep by step, the algorithm works as follows. At the start of each epoch in the outer loop, for W inner-loop workers, we generate V standard multivariate normal vectors \u03b5v \u223c N(0, I) with the same dimension as the loss function parameter \u03c6, assigned to V sets of W/V workers. As such, for the w-th worker, the outer loop assigns the \u2308wV/W\u2309-th perturbed loss function Lw = L\u03c6+\u03c3\u03b5v, where v = \u2308wV/W\u2309, with perturbed parameters \u03c6 + \u03c3\u03b5v and \u03c3 the perturbation standard deviation.\n\nGiven a loss function Lw, w \u2208 {1, . . . , W}, from the outer loop, each inner-loop worker w samples a random MDP from the task distribution, Mw \u223c p(M). The worker then trains a policy \u03c0\u03b8 in Mw over U steps of experience. 
Whenever a termination signal is reached, the environment resets with state s0 sampled from the initial state distribution p0(Mw). Every M steps the policy is updated through SGD on the loss function Lw, using minibatches sampled from the steps t \u2212 M, . . . , t:\n\n\u03b8 \u2190 \u03b8 \u2212 \u03b4in \u00b7 \u2207\u03b8 Lw(\u03c0\u03b8, \u03c4t\u2212M,...,t).\n\n(5)\n\nAlgorithm 1: Evolved Policy Gradients (EPG)\n1 [Outer Loop] for epoch e = 1, . . . , E do\n2   Sample \u03b5v \u223c N(0, I) and calculate the loss parameters \u03c6 + \u03c3\u03b5v for v = 1, . . . , V\n3   Each worker w = 1, . . . , W gets assigned noise vector \u2308wV/W\u2309 as \u03b5w\n4   for each worker w = 1, . . . , W do\n5     Sample MDP Mw \u223c p(M)\n6     Initialize buffer with N zero tuples\n7     Initialize policy parameter \u03b8 randomly\n8     [Inner Loop] for step t = 1, . . . , U do\n9       Sample initial state st \u223c p0 if Mw needs to be reset\n10      Sample action at \u223c \u03c0\u03b8(\u00b7|st)\n11      Take action at in Mw and receive rt, st+1, and termination flag dt\n12      Add tuple (st, at, rt, dt) to buffer\n13      if t mod M = 0 then\n14        With loss parameter \u03c6 + \u03c3\u03b5w, calculate losses Li for steps i = t \u2212 M, . . . , t using buffer tuples i \u2212 N, . . . , i\n15        Sample minibatches mb from the last M steps shuffled, compute Lmb = \u2211_{j\u2208mb} Lj, and update the policy parameter \u03b8 and memory parameter (Eq. (5))\n16    In Mw, using trained policy \u03c0\u03b8, sample several trajectories and compute mean return Rw\n17  Update the loss parameter \u03c6 (Eq. (6))\n18 Output: Loss L\u03c6 that trains \u03c0 from scratch according to inner loop scheme, on MDPs \u223c p(M)\n\nAt the end of the inner-loop training, each worker returns the final return Rw\u00b9 to the outer loop. The outer loop aggregates the final returns {Rw} for w = 1, . . . , W from all workers and updates the loss function parameter \u03c6 as follows:\n\n\u03c6 \u2190 \u03c6 + \u03b4out \u00b7 (1/(V\u03c3)) \u2211_{v=1}^{V} F(\u03c6 + \u03c3\u03b5v) \u03b5v,\n\n(6)\n\nwhere F(\u03c6 + \u03c3\u03b5v) = (R_{(v\u22121)W/V+1} + \u00b7\u00b7\u00b7 + R_{vW/V}) / (W/V). As a result, each perturbed loss function Lv is evaluated on W/V randomly sampled MDPs from the task distribution using the final returns. This achieves variance reduction by preventing the outer-loop ES update from promoting loss functions that are assigned to MDPs that consistently generate higher returns. Note that the actual implementation calculates each loss function\u2019s relative rank for the ES update. Algorithm 1 outputs a learned loss function L\u03c6 after E epochs of ES updates.\n\nAt test time, we evaluate the learned loss function L\u03c6 produced by Algorithm 1 on a test MDP M by training a policy from scratch. The test-time training schedule is the same as the inner loop of Algorithm 1 (it is described in full in the supplementary materials).\n\n3.3 Architecture\nThe agent is parametrized using an MLP policy with observation space S and action space A. The loss has a memory unit to assist learning in the inner loop. This memory unit is a single-layer neural network to which an invariable input vector of ones is fed. As such, it is essentially a layer of bias terms. Since this network has a constant input vector, we can view its weights as a very simple form of memory to which the loss can write via emitting the right gradient signals. 
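Stepping back to the outer loop, the ES update of Eq. (6) can be exercised end-to-end on a cheap analytic stand-in for F, since in EPG itself each evaluation of F requires training a whole agent. The objective, hyperparameters, and names below are illustrative choices of ours, not the paper's settings; we center the sampled returns for variance reduction (the paper goes further and uses their relative ranks):

```python
import numpy as np

rng = np.random.default_rng(0)

def F(phi):
    """Hypothetical stand-in for the outer objective; maximized at phi = 3."""
    return -np.sum((phi - 3.0) ** 2)

phi = np.zeros(5)
sigma, delta_out, V = 0.1, 0.02, 64
for epoch in range(200):
    eps = rng.standard_normal((V, phi.size))          # eps_v ~ N(0, I)
    returns = np.array([F(phi + sigma * e) for e in eps])
    adv = returns - returns.mean()                    # centered returns
    # phi <- phi + delta_out * (1/(V*sigma)) * sum_v F(phi + sigma*eps_v) * eps_v
    phi = phi + delta_out / (V * sigma) * adv @ eps
```

Despite never touching \u2207F, the averaged perturbation update drifts \u03c6 toward the maximizer, which is the only property EPG needs from its outer loop.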
An experience buffer stores the agent\u2019s N most recent experience steps, in the form of a list of tuples (st, at, rt, dt), with dt the trajectory termination flag. Since this buffer is limited in the number of steps it stores, the memory unit might allow the loss function to store information over a longer period of time.\n\nThe loss function L\u03c6 consists of temporal convolutional layers, which generate a context vector fcontext, and dense layers, which output the loss. The architecture is depicted in Figure 2.\n\n\u00b9 More specifically, the average return over 3 sampled trajectories using the final policy for worker w.\n\nAt step t, the dense layers output the loss Lt by taking a batch of M sequential samples\n\n{si, ai, di, mem, fcontext, \u03c0\u03b8(\u00b7|si)} for i = t \u2212 M, . . . , t,\n\n(7)\n\nwhere M < N and we augment each transition with the memory output mem, a context vector fcontext generated from the loss\u2019s temporal convolutional layers, and the policy distribution \u03c0\u03b8(\u00b7|si). In a continuous action space, \u03c0\u03b8 is a Gaussian policy, i.e., \u03c0\u03b8(\u00b7|si) = N(\u00b7; \u03bc(si; \u03b80), \u03a3), with \u03bc(si; \u03b80) the MLP output and \u03a3 a learnable parameter vector. The policy parameter vector is defined as \u03b8 = [\u03b80, \u03a3].\n\nTo generate the context vector, we first augment each transition in the buffer with the output of the memory unit mem and the policy distribution \u03c0\u03b8(\u00b7|si) to obtain a set\n\n{si, ai, di, mem, \u03c0\u03b8(\u00b7|si)} for i = t \u2212 N, . . . , t.\n\n(8)\n\nWe stack these items sequentially into a matrix, which the temporal convolutional layers take as input to output the context vector fcontext. The memory unit\u2019s parameters are updated via gradient descent at each inner-loop update (Eq. (5)).\n\nFigure 2: Architecture of a loss computed for timestep t within a batch of M sequential samples (from t \u2212 M to t), using temporal convolutions over a buffer of size N (from t \u2212 N to t), with M \u2264 N: the dense net on the bottom is the policy \u03c0(s), taking as input the observations (orange), while outputting action probabilities (green). The green block on the top represents the loss output. Gray blocks are evolved, yellow blocks are updated through SGD.\n\nNote that neither the temporal convolution layers nor the dense layers observe the environment rewards directly. However, in cases where the reward cannot be fully inferred from the environment, such as the DirectionalHopper environment we will examine in Section 4.1, we add rewards ri to the set of inputs in Eqs. (7) and (8). In fact, any information that can be obtained from the environment could be added as an input to the loss function, e.g., exploration signals, the current timestep number, etc., and we leave further such extensions as future work.\n\nTo bootstrap the learning process, we add to L\u03c6 a guidance policy gradient signal Lpg (in practice, we use the surrogate loss from PPO [26]), making the total loss\n\n\u02c6L\u03c6 = (1 \u2212 \u03b1)L\u03c6 + \u03b1Lpg.\n\n(9)\n\nWe anneal \u03b1 from 1 to 0 over a finite number of outer-loop epochs. 
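The annealed blend of Eq. (9) is simple enough to state directly in code. The linear schedule below is our assumption for illustration (the text only states that \u03b1 is annealed from 1 to 0 over a finite number of epochs), and the names are ours:

```python
# Sketch of the bootstrapped total loss (Eq. 9): a convex combination of the
# evolved loss L_phi and a guidance policy-gradient loss L_pg.

def alpha_schedule(epoch, anneal_epochs):
    """Assumed linear anneal: alpha = 1 at epoch 0, 0 from anneal_epochs on."""
    return max(0.0, 1.0 - epoch / anneal_epochs)

def total_loss(l_phi, l_pg, alpha):
    """hat L_phi = (1 - alpha) * L_phi + alpha * L_pg."""
    return (1.0 - alpha) * l_phi + alpha * l_pg
```

At epoch 0 the agent is trained purely on the guidance signal Lpg; once \u03b1 reaches 0, the evolved loss L\u03c6 drives learning alone.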
As such, learning is first derived mostly from the well-structured Lpg, while over time L\u03c6 takes over and drives learning completely after \u03b1 has been annealed to 0.\n\n4 Experiments\n\nWe apply our method to several randomized continuous control MuJoCo environments [1, 19, 4], namely RandomHopper and RandomWalker (with randomized gravity, friction, body mass, and link thickness), RandomReacher (with randomized link lengths), DirectionalHopper and DirectionalHalfCheetah (with randomized forward/backward reward function), GoalAnt (reward function based on the randomized target location), and Fetch (randomized target location). We describe these environments in detail in the supplementary materials. These environments are chosen because they require the agent to identify a randomly sampled environment at test time via exploratory behavior. Examples of the randomized Hopper environments are shown in Figure 9. The plots in this section show the mean value of 20 test-time training curves as a solid line, while the shaded area represents the interquartile range. The dotted lines plot 5 randomly sampled curves.\n\nFigure 3: RandomHopper test-time training over 128 (policy updates) \u00d7 64 (update frequency) = 8192 timesteps: PPO vs no-reward EPG\n\nFigure 4: RandomWalker test-time training over 256 (policy updates) \u00d7 128 (update frequency) = 32768 timesteps: PPO vs no-reward EPG\n\nFigure 5: RandomReacher test-time training over 512 (policy updates) \u00d7 128 (update frequency) = 65536 timesteps: PG vs no-reward EPG.\n\n4.1 Performance\n\nWe compare metatest-time learning performance, using the EPG loss function, against an off-the-shelf policy gradient method, PPO [26]. Figures 3, 4, 5, and 6 show learning curves for these two methods on the RandomHopper, RandomWalker, RandomReacher, and Fetch environments respectively at test time. The plots show the episodic return w.r.t. 
the number of environment steps taken so far. In\nall experiments, EPG agents learn more quickly and obtain higher returns compared to PPO agents.\n\nIn these experiments, the PPO agent learns by observing reward\nsignals whereas the EPG agent does not observe rewards (note that\nat test time, \u03b1 in Eq. (9) equals 0). Observing rewards is not needed\nin EPG at metatest time, since any piece of information the agent\nencounters forms an input to the EPG loss function. As long as the\nagent can identify which task to solve within the distribution, it does\nnot matter whether this identi\ufb01cation is done through observations\nor rewards. This setting demonstrates the potential to use EPG when\nrewards are only available at metatraining time, for example, if a\nsystem were trained in simulation but deployed in the real world\nwhere reward signals are hard to measure.\n\nFigures 7, 8, and 6 show experiments in which a signaling \ufb02ag is\nrequired to identify the environment. Generally, this is done through\na reward function or an observation \ufb02ag, which is why EPG takes\nthe reward as input in the case where the state space is partially-\nobserved. Similarly to the previous experiments, EPG signi\ufb01cantly\noutperforms PPO on the task distribution it is metatrained on. Specif-\nically, in Figure 8, we compare EPG with both MAML (data from\n[6]) and RL2 [5], \ufb01nding that all three methods obtain similarly high\nperformance after 8000 timesteps of experience. When comparing\nEPG to RL2 (a method that learns a recurrent policy that does not\nreset the internal state upon trajectory resets), we see that RL2 solves\nthe DirectionalHalfCheetah task almost instantly through system\nidenti\ufb01cation. By learning both the algorithm and the policy initial-\nization simultaneously, it is able to signi\ufb01cantly outperform both\nMAML and EPG. 
However, this comes at the cost of generalization power, as we will discuss in\nSection 4.3.\n\nFigure 6: GoalAnt (top) and\nFetch (bottom) environment\nlearning over 512 and 256\n(policy updates) \u00d732 (update\nfrequency): PPO vs EPG (no\nreward for Fetch)\n\n4.2 Learning exploratory behavior\n\nWithout additional exploratory incentives, PG methods lead to suboptimal policies. To understand\nwhether EPG is able to train agents that explore, we test our method and PPO on the DirectionalHopper\nand GoalAnt environments. In DirectionalHopper, each sampled Hopper environment either rewards\nthe agent for forward or backward hopping. Note that without observing the reward, the agent cannot\n\n6\n\n\fFigure 7: DirectionalHopper environment: each Hopper environ-\nment randomly decides whether to reward forward (left) or back-\nward (right) hopping. In the right plot, we can see exploratory\nbehavior, indicated by the negative spikes in the reward curve,\nwhere the agent \ufb01rst tries out walking forwards before realizing\nthat backwards gives higher rewards.\n\nFigure 8: DirectionalHalfChee-\ntah environment from Finn et\nal. [6] (Fig. 5). Blue dots show\n0, 1, and 2 gradient steps of\nMAML after metalearning a\npolicy initialization.\n\nFigure 9: Example of learning to hop backward from a\nrandom policy in a DirectionalHopper environment. Left to\nright: sampled trajectories as learning progresses.\n\nFigure 10: Sampled trajectories at test-\ntime training on two GoalAnt environ-\nments: various directions are explored.\n\ninfer whether the Hopper environment desires forward or backward hopping. Thus we augment the\nenvironment reward to the input batches of the loss function in this setting.\n\nFigure 7 shows learning curves of both PPO agents and agents trained with the learned loss in the\nDirectionalHopper environment. The learning curves give indication that the learned loss is able\nto train agents that exhibit exploratory behavior. 
We see that in most instances, PPO agents stagnate\nin learning, while agents trained with our learned loss manage to explore both forward and backward\nhopping and eventually hop in the correct direction. Figure 7 (right) demonstrates the qualitative\nbehavior of our agent during learning and Figure 9 visualizes the exploratory behavior. We see\nthat the hopper \ufb01rst explores one hopping direction before learning to hop backwards. The GoalAnt\nenvironment randomizes the location of the goal. Figure 10 demonstrates the exploratory behavior\nof a learning ant trained by EPG. The ant \ufb01rst explores in various directions, including the opposite\ndirection of the target location. However, it quickly \ufb01gures out in which quadrant to explore, before\nit fully learns the correct direction to walk in.\n\n4.3 Generalization to out-of-distribution tasks\n\nWe evaluate generalization to out-of-distribution task learning on the GoalAnt environment. During\nmetatraining, goals are randomly sampled on the positive x-axis (ant walking to the right) and at test\ntime, we sample goals from the negative x-axis (ant walking to the left). Achieving generalization\nto the left side is not trivial, since it may be easy for a metalearner to over\ufb01t to the task metatraining\ndistribution. Figure 11 (a) illustrates this generalization task. We compare the performance of EPG\nagainst MAML [6] and RL2 [5]. Since PPO is not metatrained, there is no difference between both\ndirections. Therefore, the performance of PPO is the same as shown in Figure 6.\n\n7\n\n\f(a) Task illustration\n\n(b) Metatrained direction\n\n(c) Generalization direction\n\nFigure 11: Generalization in GoalAnt: the ant has only been metatrained to reach targets on the\npositive x-axis (its right side). 
Can it generalize to targets on the negative x-axis (its left side)?\n\nFirst, we evaluate all metalearning methods\u2019 performance when the test-time task is sampled from the training-time task distribution. Figure 11 (b) shows the test-time training curve of both RL2 and EPG when the test-time goals are sampled from the positive x-axis. As expected, RL2 solves this task extremely fast, since it couples both the learning algorithm and the policy. EPG performs very well on this task as well, learning an effective policy from scratch (random initialization) in 8192 steps, with final performance matching that of RL2. MAML achieves approximately the same final performance after taking a single SGD step (based on 8000 sampled steps).\n\nNext, we look at the generalization setting with test-time goals sampled from the negative x-axis in Figure 11 (c). RL2 seems to have completely overfit to the task distribution: it has not succeeded in learning a general learning algorithm. Note that, although the RL2 agent still walks in the wrong direction, it does so at a lower speed, indicating that it notices a deviation from the expected reward signal. When looking at MAML, we see that MAML has also overfit to the metatraining distribution, resulting in a walking speed in the wrong direction similar to the non-generalization setting. The plot also depicts the result of performing 10 gradient updates from the MAML initialization, denoted MAML10 (note that each gradient update uses a batch of 8000 steps). With multiple gradient steps, MAML does make progress toward improving the returns (unlike RL2 and consistent with [7]), but still learns at a far slower rate than EPG. 
MAML can achieve this because it uses a standard PG learning algorithm to make progress beyond its initialization, and therefore enjoys the generalization property of generic PG methods.\n\nIn contrast, EPG evolves a loss function that trains agents to quickly reach goals sampled from the negative x-axis, never seen during metatraining. This demonstrates rudimentary generalization properties, as may be expected from learning a loss function that is decoupled from the policy. Figure 10 shows trajectories sampled during the EPG learning process for this exact setup.\n\n5 Discussion\n\nWe have demonstrated that EPG can learn a loss that is specialized to the task distribution it is metatrained on, resulting in faster test-time learning on novel tasks sampled from this distribution. In a sense, this loss function internalizes an agent\u2019s notion of what it means to make progress on a task. In some cases, this eliminates the need for external, environmental rewards at metatest time, resulting in agents that learn entirely from intrinsic motivation [22].\n\nAlthough EPG is trained to specialize to a task distribution, it also exhibits generalization properties that go beyond current metalearning methods such as RL2 and MAML. Improving the generalization ability of EPG, as well as that of other metalearning algorithms, will be an important component of future work. Right now, we can train an EPG loss to be effective for one small family of tasks at a time, e.g., getting an ant to walk left and right. However, the EPG loss for this family of tasks is unlikely to be at all effective on a wildly different kind of task, like playing Space Invaders. In contrast, standard RL losses do have this level of generality \u2013 the same loss function can be used to learn a huge variety of skills. EPG gains on performance by losing on generality. 
There may be a long road\nahead toward metalearning methods that both outperform standard RL methods and have the same\nlevel of generality.\n\n8\n\n\fReferences\n\n[1] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,\n\nand Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.\n\n[2] Richard Y Chen, John Schulman, Pieter Abbeel, and Szymon Sidor. UCB exploration via\n\nQ-ensembles. arXiv preprint arXiv:1706.01502, 2017.\n\n[3] Richard Dearden, Nir Friedman, and David Andre. Model based bayesian exploration. In\nProceedings of the Fifteenth conference on Uncertainty in arti\ufb01cial intelligence, pages 150\u2013159.\nMorgan Kaufmann Publishers Inc., 1999.\n\n[4] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking\ndeep reinforcement learning for continuous control. In International Conference on Machine\nLearning, pages 1329\u20131338, 2016.\n\n[5] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2:\nFast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779,\n2016.\n\n[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adapta-\n\ntion of deep networks. arXiv preprint arXiv:1703.03400, 2017.\n\n[7] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and\ngradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622,\n2017.\n\n[8] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning\n\nwith deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.\n\n[9] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolu-\n\ntion strategies. Evolutionary computation, 9(2):159\u2013195, 2001.\n\n[10] J Zico Kolter and Andrew Y Ng. Near-bayesian exploration in polynomial time. 
In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM, 2009.

[11] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.

[12] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

[13] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.

[14] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2772–2782, 2017.

[15] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.

[16] Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.

[17] Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.

[18] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.

[19] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

[20] I. Rechenberg and M. Eigen.
Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. 1973.

[21] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

[22] Juergen Schmidhuber. Exploring the predictable. In Advances in Evolutionary Computing, pages 579–612. Springer, 2003.

[23] John Schulman, Pieter Abbeel, and Xi Chen. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

[24] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536, 2015.

[25] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[27] Hans-Paul Schwefel. Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie: mit einer vergleichenden Einführung in die Hill-Climbing- und Zufallsstrategie. Birkhäuser, 1977.

[28] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.

[29] James C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.

[30] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[31] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour.
Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[32] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2017.

[33] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

[34] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.