{"title": "The Importance of Sampling inMeta-Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9280, "page_last": 9290, "abstract": "We interpret meta-reinforcement learning as the problem of learning how to quickly find a good sampling distribution in a new environment. This interpretation leads to the development of two new meta-reinforcement learning algorithms: E-MAML and E-$\\text{RL}^2$. Results are presented on a new environment we call `Krazy World': a difficult high-dimensional gridworld which is designed to highlight the importance of correctly differentiating through sampling distributions in meta-reinforcement learning. Further results are presented on a set of maze environments. We show E-MAML and E-$\\text{RL}^2$ deliver better performance than baseline algorithms on both tasks.", "full_text": "Some Considerations on Learning to Explore via\n\nMeta-Reinforcement Learning\n\nBradly C. Stadie\u2217\nUC Berkeley\n\nGe Yang\u2217\n\nUniversity of Chicago\n\nRein Houthooft\n\nOpenAI\n\nXi Chen\n\nCovariant.ai\n\nYan Duan\nCovariant.ai\n\nYuhuai Wu\n\nUniversity of Toronto\n\nPieter Abbeel\nUC Berkeley\n\nIlya Sutskever\n\nOpenAI\n\nAbstract\n\nWe interpret meta-reinforcement learning as the problem of learning how to quickly\n\ufb01nd a good sampling distribution in a new environment. This interpretation leads\nto the development of two new meta-reinforcement learning algorithms: E-MAML\nand E-RL2. Results are presented on a new environment we call \u2018Krazy World\u2019: a\ndif\ufb01cult high-dimensional gridworld which is designed to highlight the importance\nof correctly differentiating through sampling distributions in meta-reinforcement\nlearning. Further results are presented on a set of maze environments. 
We show E-MAML and E-RL² deliver better performance than baseline algorithms on both tasks.

1 Introduction

Reinforcement learning can be thought of as a procedure wherein an agent biases its sampling process towards areas with higher rewards. This sampling process is embodied as the policy π, which is responsible for outputting an action a conditioned on past environmental states {s}. Each such action effects a change in the distribution of the next state s′ ∼ T(s, a). As a result, it is natural to identify the policy π with a sampling distribution over the state space.

This perspective highlights a key difference between reinforcement learning and supervised learning: in supervised learning, the data is sampled from a fixed set. The i.i.d. assumption implies that the model does not affect the underlying distribution. In reinforcement learning, however, the very goal is to learn a policy π(a|s) that can manipulate the sampled states P_π(s) to the agent's advantage.

This property of RL algorithms, affecting their own data distribution during the learning process, is particularly salient in the field of meta-reinforcement learning. Meta RL goes by many different names: learning to learn, multi-task learning, lifelong learning, transfer learning, etc. [25, 26, 22, 21, 23, 12, 24, 39, 37]. The goal, however, is usually the same: we wish to train agents to acquire transferable knowledge that helps them generalize to new situations. The most straightforward way to tackle this problem is to explicitly optimize the agent to deliver good performance after some adaptation step. Typically, this adaptation step will take the agent's prior and use it to update its current policy to fit its current environment.

This problem definition induces an interesting consequence: during meta-learning, we are no longer under the obligation to optimize for maximal reward during training.
Instead, we can optimize for a sampling process that maximally informs the meta-learner how it should adapt to new environments. In the context of gradient-based algorithms, this means that one principled approach for learning an optimal sampling strategy is to differentiate the meta RL agent's per-task sampling process with respect to the goal of maximizing the reward attained by the agent post-adaptation. To the best of our knowledge, such a scheme is hitherto unexplored.

*Equal contribution; correspondence to {bstadie, ge.yang}@berkeley.edu.
Code for Krazy World available at: https://github.com/bstadie/krazyworld
Code for meta RL algorithms available at: https://github.com/episodeyang/e-maml

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we derive an algorithm for gradient-based meta-learning that explicitly optimizes the per-task sampling distributions used during adaptation with respect to the expected future returns they produce post-adaptation. This algorithm is closely related to the recently proposed MAML algorithm [7]. For reasons that will become clear later, we call this algorithm E-MAML. Inspired by this algorithm, we develop a less principled extension of RL² that we call E-RL². To demonstrate that this method learns more transferable structures for meta adaptation, we propose a new, high-dimensional, and dynamically-changing set of tasks over a domain we call "Krazy World". "Krazy World" requires the agent to learn high-level structures, and is much more challenging for state-of-the-art RL algorithms than simple locomotion tasks.²

We show that our algorithms outperform baselines on "Krazy World".
To verify we are not over-fitting to this environment, we also present results on a set of maze environments.

2 Preliminaries

Reinforcement Learning Notation: Let $M = (\mathcal{S}, \mathcal{A}, P, R, \rho_0, \gamma, T)$ represent a discrete-time finite-horizon discounted Markov decision process (MDP). The elements of M have the following definitions: $\mathcal{S}$ is a state set, $\mathcal{A}$ an action set, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ a transition probability distribution, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ a reward function, $\rho_0 : \mathcal{S} \to \mathbb{R}_+$ an initial state distribution, $\gamma \in [0, 1]$ a discount factor, and $T$ the horizon. Occasionally we use a loss function L(s) = −R(s) rather than the reward R. In the classical reinforcement learning setting, we optimize to obtain a set of parameters θ which maximize the expected discounted return under the policy $\pi_\theta : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$. That is, we optimize to obtain θ that maximizes $\eta(\pi_\theta) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^{T} \gamma^t r(s_t)]$, where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi_\theta(a_t|s_t)$, and $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$.

The Meta Reinforcement Learning Objective: In meta reinforcement learning, we consider a family of MDPs $\mathcal{M} = \{M_i\}_{i=1}^{N}$ which comprise a distribution of tasks. The goal of meta RL is to find a policy π_θ and paired update method U such that, when $M_i \sim \mathcal{M}$ is sampled, $\pi_{U(\theta)}$ solves M_i quickly. The word quickly is key: by quickly, we mean orders of magnitude more sample efficient than simply solving M_i with policy gradient or value iteration methods from scratch. For example, in an environment where policy gradients require over 100,000 samples to produce good returns, an ideal meta RL algorithm should solve these tasks by collecting fewer than 10 trajectories.
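As a concrete illustration of the objects just defined, the following sketch builds a toy task family and a placeholder update method U. The bandit "MDPs" and the greedy update are our own stand-ins for illustration only; they are not the paper's environments or its policy gradient operator.

```python
import random

random.seed(0)

N_ARMS = 4

def sample_task():
    """Draw M_i ~ M: a toy bandit 'MDP' whose rewarded arm differs per task.
    (Placeholder task family, not the paper's environments.)"""
    good_arm = random.randrange(N_ARMS)
    return lambda action: 1.0 if action == good_arm else 0.0

def act(theta):
    """Greedy policy pi_theta over arm preferences."""
    return max(range(N_ARMS), key=lambda a: theta[a])

def U(theta, rollouts):
    """Toy update method U(theta, rollouts): shift preference toward the best
    observed arm. The paper's U is a policy gradient step; this is a stand-in."""
    best_arm = max(rollouts, key=lambda ar: ar[1])[0]
    adapted = list(theta)
    adapted[best_arm] += 1.0   # boost the preference of the best observed arm
    return adapted

def meta_eval(theta, n_tasks=20, n_explore=8):
    """Average post-adaptation reward of pi_{U(theta)} across sampled tasks."""
    total = 0.0
    for _ in range(n_tasks):
        reward = sample_task()
        # A handful of exploratory pulls, far fewer than solving from scratch.
        pulls = [random.randrange(N_ARMS) for _ in range(n_explore)]
        rollouts = [(a, reward(a)) for a in pulls]
        adapted = U(theta, rollouts)
        total += reward(act(adapted))
    return total / n_tasks

score = meta_eval([0.0] * N_ARMS)
```

The point of the sketch is the shape of the problem: solving a task from scratch would require many samples, while a good (θ, U) pair adapts from only `n_explore` exploratory interactions.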
The assumption is that, if an algorithm can solve a problem with so few samples, then it might be 'learning to learn.' That is, the agent is not learning how to master a particular task but rather how to quickly adapt to new ones. The objective can be written cleanly as

\min_\theta \sum_{M_i} \mathbb{E}_{\pi_{U(\theta)}} \left[ L_{M_i} \right]    (1)

This objective is similar to the one that appears in MAML [7], which we will discuss further below. In MAML, U is chosen to be the stochastic gradient descent operator parameterized by the task.

3 Problem Statement and Algorithms

3.1 Fixing the Sampling Problem with E-MAML

We can expand the expectation from (1) into the integral form

\int R(\tau) \, \pi_{U(\theta)}(\tau) \, d\tau    (2)

It is true that the objective (1) can be optimized by taking a derivative of this integral with respect to θ and carrying out a standard REINFORCE style analysis to obtain a tractable expression for the gradient [40]. However, this decision might be sub-optimal.

²Half-cheetah is an example of a weak benchmark: it can be learned in just a few gradient steps from a random prior. It can also be solved with a linear policy. Stop using half-cheetah.

Our key insight is to recall the sampling process interpretation of RL. In this interpretation, the policy π_θ implicitly defines a sampling process over the state space. Under this interpretation, meta RL tries to learn a strategy for quickly generating good per-task sampling distributions. For this learning process to work, it needs to receive a signal from each per-task sampling distribution which measures its propensity to positively impact the meta-learning process. Such a term does not make an appearance when directly optimizing (1).
Put more succinctly, directly optimizing (1) will not account for the impact of the original sampling distribution π_θ on the future rewards $R(\tau), \tau \sim \pi_{U(\theta,\bar\tau)}$. Concretely, we would like to account for the fact that the samples $\bar\tau$ drawn under π_θ will impact the final returns R(τ) by influencing the initial update $U(\theta, \bar\tau)$. Making this change will allow initial samples $\bar\tau \sim \pi_\theta$ to be reinforced by the expected future returns after the sampling update R(τ). Under this scheme, the initial samples $\bar\tau$ are encouraged to cover the state space enough to ensure that the update $U(\theta, \bar\tau)$ is maximally effective.

Including this dependency can be done by writing the modified expectation as

\iint R(\tau) \, \pi_{U(\theta,\bar\tau)}(\tau) \, \pi_\theta(\bar\tau) \, d\bar\tau \, d\tau    (3)

This provides an expression for computing the gradient which correctly accounts for the dependence on the initial sampling distribution. We now find ourselves wishing to find a tractable expression for the gradient of (3).
This can be done quite smoothly by applying the product rule under the integral sign and going through the REINFORCE style derivation twice to arrive at a two term expression

\frac{\partial}{\partial\theta} \iint R(\tau) \, \pi_{U(\theta,\bar\tau)}(\tau) \, \pi_\theta(\bar\tau) \, d\bar\tau \, d\tau
  = \iint R(\tau) \left[ \pi_\theta(\bar\tau) \frac{\partial}{\partial\theta} \pi_{U(\theta,\bar\tau)}(\tau) + \pi_{U(\theta,\bar\tau)}(\tau) \frac{\partial}{\partial\theta} \pi_\theta(\bar\tau) \right] d\bar\tau \, d\tau
  \approx \frac{1}{T} \sum_{i=1}^{T} R(\tau^i) \frac{\partial}{\partial\theta} \log \pi_{U(\theta,\bar\tau)}(\tau^i) + \frac{1}{T} \sum_{i=1}^{T} R(\tau^i) \frac{\partial}{\partial\theta} \log \pi_\theta(\bar\tau^i), \qquad \bar\tau^i \sim \pi_\theta, \; \tau^i \sim \pi_{U(\theta,\bar\tau)}    (4)

The term on the left is precisely the original MAML algorithm [7]. This term encourages the agent to take update steps U that achieve good final rewards. The second term encourages the agent to take actions such that the eventual meta-update yields good rewards (crucially, it does not try to exploit the reward signal under its own trajectory $\bar\tau$). In our original derivation of this algorithm, we felt this term would afford the policy the opportunity to be more exploratory, as it will attempt to deliver the maximal amount of information useful for the future rewards R(τ) without worrying about its own rewards $R(\bar\tau)$. Since we felt this algorithm augments MAML by adding an exploratory term, we called it E-MAML. At present, the validity of this interpretation remains an open question.

For the experiments presented in this paper, we will assume that the operator U utilized in MAML and E-MAML is stochastic gradient descent.
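The two-term estimator in (4) can be sketched on a two-armed bandit as follows. This is a minimal illustration, not the paper's implementation: it uses a first-order approximation of the MAML term (the gradient is taken at the adapted parameters rather than differentiating through U), and all function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(logits, a):
    # d/d(logits) of log softmax(logits)[a] = onehot(a) - softmax(logits)
    g = -softmax(logits)
    g[a] += 1.0
    return g

def rollout(logits, good_arm, n=16):
    # n pulls of a two-armed bandit; reward 1 on the task's good arm.
    acts = rng.choice(len(logits), size=n, p=softmax(logits))
    return acts, (acts == good_arm).astype(float)

def inner_update(theta, acts, rews, alpha=1.0):
    # U(theta, tau_bar): one REINFORCE step on the explore rollouts.
    g = sum(r * grad_log_pi(theta, a) for a, r in zip(acts, rews)) / len(acts)
    return theta + alpha * g

def emaml_gradient(theta, good_arm):
    acts_bar, rews_bar = rollout(theta, good_arm)          # tau_bar ~ pi_theta
    theta_prime = inner_update(theta, acts_bar, rews_bar)  # U(theta, tau_bar)
    acts, rews = rollout(theta_prime, good_arm)            # tau ~ pi_{U(theta, tau_bar)}
    R = rews.mean()
    # Term 1 (the MAML term), with a first-order approximation: the gradient
    # is taken at theta' instead of being differentiated through U.
    term1 = sum(r * grad_log_pi(theta_prime, a) for a, r in zip(acts, rews)) / len(acts)
    # Term 2 (the E-MAML term): reinforce the initial samples tau_bar with the
    # post-update return R, rewarding exploration that makes U effective.
    term2 = R * sum(grad_log_pi(theta, a) for a in acts_bar) / len(acts_bar)
    return term1 + term2

theta = np.zeros(2)
for _ in range(200):   # meta-train over a symmetric task distribution
    theta = theta + 0.1 * emaml_gradient(theta, good_arm=int(rng.integers(2)))
```

Note how term 2 scores the explore rollouts with the post-update return R rather than with their own rewards, which is exactly the dependence that vanishes when (1) is optimized directly.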
However, many other interesting choices exist.

3.2 Choices for the Update Operator U and the Exploration Policy π_θ

When only a single iteration of inner policy gradient occurs, the initial policy $\pi_{\theta_0}$ is entirely responsible for the exploratory sampling. The best exploration would be a policy $\pi_{\theta_0}$ that can generate the most informative samples for identifying the task. However, if more than one iteration of inner policy gradient occurs, then some of the exploratory behavior can also be attributed to the update operation U. There is a non-trivial interplay between the initial policy and the subsequent policy updates, especially when exploring the environment would take a few different policies.

3.3 Understanding the E-MAML Update

A standard reinforcement learning objective can be represented by the stochastic computation graph [29] as in Figure 1a, where the loss is computed w.r.t. the policy parameter θ using the estimator of Eq. 1. For clarity, we can use a short-hand notation U, as in θ′ = U(θ), to represent this graph (Fig. 1b).

Algorithm 1 E-MAML
Require: Task distribution P(T)
Require: α, β: learning step size hyperparameters
Require: n_inner, n_meta: number of gradient updates for per-task and meta learning
1: function $U^k(\phi, L, \tau_{[0,\dots,k-1]})$    ▷ Inner Update Function
2:   for i in [0, ..., k−1] do
3:     $\phi \leftarrow \phi - \alpha \nabla_\phi L(\phi, \tau_i)$
4:   return φ
5: procedure E-MAML(θ, φ)
6:   randomly initialize θ and φ
7:   while not done do
8:     Sample batch of tasks T_i from p(T)
9:     for T_i ∈ T do
10:      Sample rollouts $\tau_{[0:n_{inner}-1]} \sim \pi_\theta$
11:      $\theta_n \leftarrow U^{n_{inner}}(\theta, L, \tau_{[0:n_{inner}-1]})$    ▷ High-order update
12:      with $\pi_{\theta_{n_{inner}}}$, sample $\tau^{meta}_i = \{s, a, r\}$ from task T_i
13:    for i in n_meta do    ▷ Meta-update the model prior θ
14:      $\theta \leftarrow \theta - \beta \sum_{\tau^{meta}_i} \nabla_\theta L^{meta}[\pi_{U^{n}(\theta,L)}, \tau^{meta}_i]$    ▷ no resample

As described in [1], we can then write down the surrogate objective of E-MAML as the graph in Fig. 1c, whereas model-agnostic meta-learning (MAML) treats the graph as if the first policy gradient update is deterministic. It is important to note that one can in fact smoothly tune between the stochastic treatment of U (Fig. 1c) and the non-stochastic treatment of U (Fig. 1d) by choosing the hyperparameter λ. In this regard, E-MAML considers the sampling distribution $\pi_{\theta'}$ as an extra, exploration-friendly term that one could add to the standard meta-RL objective used in MAML.

Figure 1: Computation graphs for (a) REINFORCE/policy gradient, where θ is the policy parameter and π is the policy function; the surrogate objective is $\mathbb{E}[\log \pi_\theta(a|s) \cdot A(s)]$. During learning, the policy π_θ interacts with the environment, parameterized as the transition function T and the reward function R. L is the proximal policy optimization objective, which allows making multiple gradient steps with the same trajectory sample. Only one meta-gradient update is used in these experiments. (b) shows the short-hand we use to represent (a) in (c) and (d).
Note that circles represent stochastic nodes and squares represent deterministic nodes. The policy gradient update sub-graph U is stochastic, which is why it is drawn as a circle in (b). (c) shows the inner policy gradient sub-graph and the policy gradient meta-loss: the original parameter θ gets updated to θ′, which is then evaluated by the outer proximal policy optimization objective $\mathbb{E}_{s \sim \pi_\theta}[\log \pi_{U(\theta)}(a|s) \cdot A(s) + \lambda \cdot A(\tau_{\pi_\theta}) \cdot \sum \log \pi_\theta(a|s)]$. E-MAML treats the inner policy gradient update as a stochastic node, whereas (d) MAML treats it as a deterministic node with objective $\mathbb{E}_{s \sim \pi_\theta}[\log \pi_{U(\theta)}(a|s) \cdot A(s)]$, thus neglecting the causal entropy term of the inner update operator U.

3.4 E-RL²

RL² optimizes (1) by feeding multiple rollouts from multiple different MDPs into an RNN. The hope is that the RNN hidden state update $h_t = C(x_t, h_{t-1})$ will learn to act as the update function U. Then, performing a policy gradient update on the RNN will correspond to optimizing the meta objective (1).

We can write the RL² update rule more explicitly in the following way. Suppose L represents an RNN. Let $Env_k(a)$ be a function that takes an action, uses it to interact with the MDP representing task k, and returns the next observation o³, reward r, and a termination flag d. Then we have

x_t = [o_{t-1}, a_{t-1}, r_{t-1}, d_{t-1}]    (5)
[a_t, h_{t+1}] = L(h_t, x_t)    (6)
[o_t, r_t, d_t] = Env_k(a_t)    (7)

To train this RNN, we sample N MDPs from $\mathcal{M}$ and obtain k rollouts for each MDP by running the MDP through the RNN as above. We then compute a policy gradient update to move the RNN parameters in a direction which maximizes the returns over the k trials performed for each MDP.

Inspired by our derivation of E-MAML, we attempt to make RL² better account for the impact of its initial sampling distribution on its final returns. However, we will take a slightly different approach. Call the rollouts that help account for the impact of this initial sampling distribution Explore-rollouts.
Call rollouts that do not account for this dependence Exploit-rollouts. For each MDP M_i, we will sample p Explore-rollouts and k − p Exploit-rollouts. During an Explore-rollout, the forward pass through the RNN will receive all information. However, during the backwards pass, the rewards contributed during Explore-rollouts will be set to zero. The gradient computation will only receive rewards provided by Exploit-rollouts. For example, if there is one Explore-rollout and one Exploit-rollout, then we would proceed as follows. During the forward pass, the RNN will receive all information regarding rewards for both episodes. During the backwards pass, the returns will be computed as

R(x_i) = \sum_{j=i}^{T} \gamma^j r_j \cdot \chi_E(x_j)    (8)

where $\chi_E$ is an indicator that returns 0 if the episode is an Explore-rollout and 1 if it is an Exploit-rollout. This return, and not the standard sum of discounted returns, is what is used to compute the policy gradient. The hope is that zeroing the return contributions from Explore-rollouts will encourage the RNN to account for the impact of casting a wider sampling distribution on the final meta-reward. That is, during Explore-rollouts the policy will take actions which may not lead to immediate rewards but rather to RNN hidden weights that perform better system identification. This system identification will in turn lead to higher rewards in later episodes.

4 Experiments

4.1 Krazy World Environment

To test the importance of correctly differentiating through the sampling process in meta reinforcement learning, we engineer a new environment known as Krazy World. To succeed at Krazy World, a meta learning agent must first identify and adapt to many different tile types, color palettes, and dynamics. This environment is challenging even for state-of-the-art RL algorithms.
Figure 2: Three example worlds drawn from the task distribution. A good agent should first complete a successful system identification before exploiting. For example, in the leftmost grid the agent should identify the following: 1) the orange squares give +1 reward, 2) the blue squares replenish energy, 3) the gold squares block progress, 4) the black square can only be passed by picking up the pink key, 5) the brown squares will kill it, 6) it will slide over the purple squares. The center and right worlds show how these dynamics will change and need to be re-identified every time a new task is sampled.

³RL² works well with POMDPs because the RNN is good at system identification. This is why we chose to use o as in "observation" instead of s for "state" in this formulation.

In this environment, it is essential that meta-updates account for the impact of the original sampling distribution on the final meta-updated reward. Without accounting for this impact, the agent will not receive the gradient of the per-task episodes with respect to the meta-update. But this is precisely the gradient that encourages the agent to quickly learn how to correctly identify parts of the environment. See the Appendix for a full description of the environment.

4.2 Mazes

We also consider a collection of maze environments. The agent is placed at a random square within the maze and must learn to navigate the twists and turns to reach the goal square. A good exploratory agent will spend some time learning the maze's layout in a way that minimizes repetition of future mistakes. The mazes are not rendered, and consequently this task is done with state space only. The mazes are 20 × 20 squares large.

4.3 Results

In this section, we present the following experimental results.

1. Learning curves on Krazy World and mazes.
2.
The gap between the agent's initial performance on new environments and its performance after updating. A good meta learning algorithm will have a large gap after updating. A standard RL algorithm will have virtually no gap after only one update.
3. Three experiments that examine the exploratory properties of our algorithms.

Figure 3: Meta learning curves on Krazy World. We see that E-RL² achieves the best final results, but has the highest initial variance. Crucially, E-MAML converges faster than MAML, although both algorithms do manage to converge. RL² has relatively poor performance and high variance. A random agent achieves a score of around 0.05.

When plotting learning curves in Figure 3 and Figure 4, the Y axis is the reward obtained after training at test time on a set of 64 held-out test environments. The X axis is the total number of environmental time-steps the algorithm has used for training. Every time the environment advances forward by one step, this count increments by one. This is done to keep the time-scale consistent across meta-learning curves.

For Krazy World, learning curves are presented in Figure 3. E-MAML and E-RL² have the best final performance. E-MAML has the steepest initial gains for the first 10 million time-steps. Since meta-learning algorithms are often very expensive, having a steep initial ascent is quite valuable. Around 14 million training steps, E-RL² passes E-MAML for the best performance. By 25 million time-steps, E-RL² has converged. MAML delivers comparable final performance to E-MAML. However, it takes much longer to obtain this level of performance. Finally, RL² has comparatively poor performance on this task and very high variance. When we manually examined the RL² trajectories to figure out why, we saw the agent consistently finding a single goal square and then refusing to explore any further.
The additional experiments presented below seem consistent with this finding.

Learning curves for mazes are presented in Figure 4. Here, the story is different than Krazy World. RL² and E-RL² both perform better than MAML and E-MAML. We suspect the reason for this is that RNNs are able to leverage memory, which is more important in mazes than in Krazy World. This environment carries a penalty for hitting the wall, which MAML and E-MAML discover quickly. However, it takes E-RL² and RL² much longer to discover this penalty, resulting in worse performance at the beginning of training. MAML delivers worse final performance and typically only learns how to avoid hitting the wall. RL² and E-MAML sporadically solve mazes. E-RL² manages to solve many of the mazes.

Figure 4: Meta learning curves on mazes. Figure 5 shows each curve in isolation, making it easier to discern their individual characteristics. E-MAML and E-RL² perform better than their counterparts.

Figure 5: Gap between initial performance and performance after one update. All algorithms show some level of improvement after one update. This suggests meta learning is working, because normal policy gradient methods learn nothing after one update.

When examining meta learning algorithms, one important metric is the update size after one learning episode. Our meta learning algorithms should have a large gap between their initial policy, which is largely exploratory, and their updated policy, which should often solve the problem entirely. For MAML, we look at the gap between the initial policy and the policy after one policy gradient step. For RL², we look at the results after three exploratory episodes, which give the RNN hidden state h sufficient time to update. Note that three is the number of exploratory episodes we used during training as well. This metric shares similarities with the Jump Start metric considered in prior literature [35].
These gaps are presented in Figure 5.

Figure 6: Three heuristic metrics designed to measure an algorithm's system identification ability on Krazy World: fraction of tile types visited during test time, number of times killed at a death square during test time, and number of goal squares visited. We see that E-MAML is consistently the most diligent algorithm at checking every tile type during test time. Improving performance on these metrics indicates the meta learners are learning how to do at least some system identification.

Finally, in Figure 6 we see three heuristic metrics designed to measure a meta-learner's capacity for system identification. First, we consider the fraction of tile types visited by the agent at test time. A good agent should visit and identify many different tile types. Second, we consider the number of times an agent visits a death tile at test time. Agents that are efficient at identification should visit this tile type exactly once and then avoid it. More naive agents will run into these tiles repeatedly, dying repeatedly and instilling a sense of pity in onlookers. Finally, we look at how many goals the agent reaches. RL² tends to visit fewer goals. Usually, it finds one goal and exploits it. Overall, our suggested algorithms achieve better performance under these metrics.

5 Related Work

This work builds upon recent advances in deep reinforcement learning. [15, 16, 13] allow for discrete control in complex environments directly from raw images. [28, 16, 27, 14] have allowed for high-dimensional continuous control in complex environments from raw state information.

It has been suggested that our algorithm is related to the exploration vs. exploitation dilemma. There exists a large body of RL work addressing this problem [10, 5, 11].
In practice, these methods are often not used due to difficulties with high-dimensional observations, difficulty of implementation on arbitrary domains, and lack of promising results. This has resulted in most deep RL work utilizing epsilon-greedy exploration [15], or perhaps a simple scheme like Boltzmann exploration [6]. As a result of these shortcomings, a variety of new approaches to exploration in deep RL have recently been suggested [34, 9, 31, 18, 4, 19, 17, 8, 23, 26, 25, 32, 33, 12, 24]. In spite of these numerous efforts, the problem of exploration in RL remains difficult.

Many of the problems in meta RL can alternatively be addressed within the field of hierarchical reinforcement learning. In hierarchical RL, a major focus is on learning primitives that can be reused and strung together. Frequently, these primitives will relate to better coverage over state visitation frequencies. Recent work in this direction includes [38, 2, 36, 20, 3, 39]. The primary reason we consider meta-learning over hierarchical RL is that we find hierarchical RL tends to focus more on defining specific architectures that should lead to hierarchical behavior, whereas meta learning instead attempts to directly optimize for these behaviors.

As for meta RL itself, the literature is spread out and goes by many different names. There exists relevant literature on life-long learning, learning to learn, continual learning, and multi-task learning [22, 21]. We encourage the reader to look at the review articles [30, 35, 37] and their citations. The work most similar to ours has focused on adding curiosity or on a free learning phase during training. However, these approaches are still quite different because they focus on defining intrinsic motivation signals. We only consider better utilization of the existing reward signal.
Our algorithm makes an explicit connection between free learning phases and their effect on meta-updates.

6 Closing Remarks

In this paper, we considered the importance of sampling in meta reinforcement learning. Two new algorithms were derived and their properties were analyzed. We showed that these algorithms tend to learn more quickly and cover more of their environment's states during learning than existing algorithms. It is likely that future work in this area will focus on meta-learning a curiosity signal which is robust and transfers across tasks, or on learning an explicit exploration policy. Another interesting avenue for future work is learning intrinsic rewards that communicate long-horizon goals, thus better justifying exploratory behavior. Perhaps this will enable meta agents which truly want to explore rather than being forced to explore by mathematical trickery in their objectives.

7 Acknowledgements

This work was supported in part by ONR PECASE N000141612723 and by AWS and GCE compute credits.

Appendix A: Krazy-World

We find this environment challenging for even state-of-the-art RL algorithms. For each environment in the test set, we optimize for 5 million steps using the Q-Learning algorithm from OpenAI Baselines. This exercise delivered a mean score of 1.2 per environment, well below the human baseline score of 2.7. The environment has the following challenging features:

8 different tile types: Goal squares provide +1 reward when retrieved. The agent reaching the goal does not cause the episode to terminate, and there can be multiple goals. Ice squares will be skipped over in the direction the agent is traversing. Death squares will kill the agent and end the episode. Wall squares act as a wall, impeding the agent's movement. Lock squares can only be passed once the agent has collected the corresponding key from a key square.
Teleporter squares transport the agent to a different teleporter square on the map. Energy squares provide the agent with additional energy. If the agent runs out of energy, it can no longer move. The agent proceeds normally across normal squares.

Figure 7: A comparison of local and global observations for the Krazy World environment. In the local mode, the agent only views a 3 × 3 grid centered about itself. In global mode, the agent views the entire environment.

Ability to randomly swap color palette: The color palette for the grid can be permuted randomly, changing the color that corresponds to each of the tile types. The agent will thus have to identify the new system to achieve a high score. Note that in representations of the grid wherein basis vectors are used rather than images to describe the state space, each basis vector corresponds to a tile type; permuting the colors corresponds to permuting the types of tiles these basis vectors represent. We prefer to use the basis vector representation in our experiments, as it is more sample efficient.

Ability to randomly swap dynamics: The game's dynamics can be altered. The most naive alteration simply permutes the player's inputs and corresponding actions (issuing the command for down moves the player up, etc.). More complex dynamics alterations allow the agent to move multiple steps at a time in arbitrary directions, making the movement more akin to that of chess pieces.

Local or Global Observations: The agent's observation space can be set to some fixed number of squares around the agent, the squares in front of the agent, or the entire grid. Observations can be given as images or as a grid of basis vectors. For the case of basis vectors, each element of the grid is embedded as a basis vector that corresponds to that tile type. These embeddings are then concatenated together to form the observation proper.
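The basis-vector observation just described can be sketched as follows. This is a minimal illustration of one-hot tile encoding and a local crop; the tile names, grid sizes, and padding convention are our own assumptions, not taken from the Krazy World codebase.

```python
import numpy as np

# Hypothetical tile-type indices; the real environment defines 8 tile types.
TILE_TYPES = ["normal", "goal", "ice", "death", "wall", "lock", "key", "energy"]

def encode_grid(grid, n_types=len(TILE_TYPES)):
    """Embed each tile as a one-hot basis vector and concatenate the
    embeddings into a single flat observation vector."""
    grid = np.asarray(grid)
    onehot = np.eye(n_types)[grid]   # shape: (H, W, n_types)
    return onehot.reshape(-1)        # concatenated observation

def local_view(grid, agent_rc, radius=1):
    """Crop a (2*radius+1)^2 window centered on the agent (local mode),
    padding out-of-bounds cells with the 'wall' tile type (our assumption)."""
    wall = TILE_TYPES.index("wall")
    padded = np.pad(np.asarray(grid), radius, constant_values=wall)
    r, c = agent_rc[0] + radius, agent_rc[1] + radius
    return padded[r - radius:r + radius + 1, c - radius:c + radius + 1]

# A tiny 3x3 world: agent at the center, a goal in one corner, a wall in another.
world = [[0, 0, 1],
         [0, 0, 0],
         [4, 0, 0]]
obs = encode_grid(local_view(world, (1, 1)))   # 3x3 window, 8 types -> 72 dims
```

One-hot tile encodings keep the observation invariant to color permutations, which is consistent with the paper's note that the basis-vector representation is more sample efficient than images.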
We will use local observations.

Appendix B: Table of Hyperparameters

Table 1: Table of Hyperparameters: E-MAML

Hyperparameter        Value            Comments
Parallel Samplers     40 ~ 128
Batch Timesteps       500
Inner Algo            PPO / VPG / CPI
Meta Algo             PPO / VPG / CPI  Much simpler than TRPO
Max Grad Norm         0.9 ~ 1.0        Improves stability
PPO Clip Range        0.2
Gamma                 0 ~ 1
GAE Lambda            0.997
Alpha                 0.01             Tunable hyperparameter
Beta                  1e-3             Could be learned
Vf Coeff              0                Value baseline is disabled
Ent Coeff             1e-3             Improves training stability
Inner Optimizer       SGD
Meta Optimizer        Adam
Inner Gradient Steps  1 ~ 20           With PPO as the inner algorithm, the same path sample can be reused for multiple gradient steps
Simple Sampling       True / False
Meta Gradient Steps   1 ~ 20           Requires PPO when > 1

Appendix C: Experiment Details

For both Krazy World and mazes, training proceeds in the following way. First, we initialize 32 training environments and 64 test environments. Every initialized environment has a different seed. Next, we initialize our chosen meta-RL algorithm by uniformly sampling hyper-parameters from predefined ranges. Data is then collected from all 32 training environments. The meta-RL algorithm then uses this data to make a meta-update, as detailed in the algorithm section of this paper. The meta-RL algorithm is then allowed to do one training step on the 64 test environments to see how fast it can train at test time. These test environment results are recorded, 32 new tasks are sampled from the training environments, and data collection begins again. For MAML and E-MAML, training at test time means performing one VPG update at test time (see figure ?? for evidence that taking only one gradient step is sufficient).
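The experiment loop just described can be sketched as below. All names here (make_env, make_algo, collect, meta_update, adapt_and_evaluate) are hypothetical stand-ins for the actual implementation, not its API.

```python
import random

# Constants taken from the procedure described in the text.
N_TRAIN, N_TEST, N_REPEATS = 32, 64, 64

def run_experiment(make_env, make_algo, hparam_ranges):
    # Every initialized environment gets a different seed.
    train_envs = [make_env(seed=s) for s in range(N_TRAIN)]
    test_envs = [make_env(seed=N_TRAIN + s) for s in range(N_TEST)]

    # Initialize the meta-RL algorithm with hyper-parameters sampled
    # uniformly from predefined ranges.
    hparams = {k: random.uniform(lo, hi) for k, (lo, hi) in hparam_ranges.items()}
    algo = make_algo(**hparams)

    learning_curve = []
    for _ in range(N_REPEATS):
        # Collect data from the 32 training tasks and make a meta-update
        # (here we simply reuse all 32 training environments each round).
        data = [algo.collect(env) for env in train_envs]
        algo.meta_update(data)
        # One training step on each test environment measures how
        # quickly the algorithm adapts at test time.
        learning_curve.append([algo.adapt_and_evaluate(env) for env in test_envs])
    return learning_curve
```

The recorded per-round test results form the learning curves that are averaged across runs.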
For RL2 and E-RL2, this means running three exploratory episodes to allow the RNN memory weights time to update, and then reporting the loss on the fourth and fifth episodes. For both algorithms, meta-updates are calculated with PPO [27]. The entire process from the above paragraph is repeated from the beginning 64 times and the results are averaged to provide the final learning curves featured in this paper.

References

[1] Anonymous. ProMP: Proximal meta-policy search. ICLR Submission, 2019.

[2] Bacon, P. and Precup, D. The option-critic architecture. In NIPS Deep RL Workshop, 2015.

[3] Barto and Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 2003.

[4] Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. NIPS, 2016.

[5] Brafman, R. I. and Tennenholtz, M. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002.

[6] Carmel, D. and Markovitch, S. Exploration strategies for model-based learning in multi-agent systems. Autonomous Agents and Multi-Agent Systems, 1999.

[7] Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, 2017.

[8] Graziano, V., Glasmachers, T., Schaul, T., Pape, L., Cuccu, G., Leitner, J., and Schmidhuber, J. Artificial curiosity for autonomous space exploration. Acta Futura, 1(1):41–51, 2011. doi: 10.2420/AF04.2011.41. URL http://dx.doi.org/10.2420/AF04.2011.41.

[9] Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. VIME: Variational information maximizing exploration. NIPS, 2016.

[10] Kearns, M. J. and Singh, S. P. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.

[11] Kolter, J. Z. and Ng, A. Y.
Near-Bayesian exploration in polynomial time. ICML, 2009.

[12] Kompella, Stollenga, Luciw, and Schmidhuber. Exploring the predictable. Advances in Evolutionary Computing, 2002.

[13] Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. Evolving large-scale neural networks for vision-based reinforcement learning. GECCO, pp. 1061–1068, 2013.

[14] Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv:1509.02971, 2015.

[15] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[16] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. arXiv:1602.01783, 2016.

[17] Ngo, H., Luciw, M., Forster, A., and Schmidhuber, J. Learning skills from play: Artificial curiosity on a Katana robot arm. Proceedings of the International Joint Conference on Neural Networks, pp. 10–15, 2012. ISSN 2161-4393. doi: 10.1109/IJCNN.2012.6252824.

[18] Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. NIPS, 2016.

[19] Ostrovski, G., Bellemare, M. G., Oord, A. v. d., and Munos, R. Count-based exploration with neural density models. arXiv:1703.01310, 2017.

[20] Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive neural networks. arXiv:1606.04671, 2016.

[21] Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-...
hook. Diploma thesis, TUM, 1987.

[22] Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers. Artificial General Intelligence, 2006.

[23] Schmidhuber, J. Continual curiosity-driven skill acquisition from high-dimensional video inputs for humanoid robots. Artificial Intelligence, 2015.

[24] Schmidhuber, J., Zhao, J., and Schraudolph, N. Reinforcement learning with self-modifying policies. Learning to Learn, Kluwer, 1997.

[25] Schmidhuber, J. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv:1511.09249, 2015. URL http://arxiv.org/abs/1511.09249.

[26] Schmidhuber, J. A possibility for implementing curiosity and boredom in model-building neural controllers. In Meyer, J. A. and Wilson, S. W. (eds.), From Animals to Animats, pp. 222–227, 1991.

[27] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.

[28] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. Trust region policy optimization. arXiv:1502.05477, 2015.

[29] Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528–3536, 2015.

[30] Silver, Yang, and Li. Lifelong machine learning systems: Beyond learning algorithms. AAAI Spring Symposium - Technical Report, 2013.

[31] Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv:1507.00814, 2015.

[32] Storck, J., Hochreiter, S., and Schmidhuber, J. Reinforcement driven information acquisition in non-deterministic environments.
Proceedings of the International . . . , 2:159–164, 1995.

[33] Sun, Y., Gomez, F., and Schmidhuber, J. Planning to be surprised: Optimal Bayesian exploration in dynamic environments. Lecture Notes in Computer Science (LNAI), 6830:41–51, 2011. ISSN 03029743. doi: 10.1007/978-3-642-22887-2_5.

[34] Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv:1611.04717, 2016.

[35] Taylor and Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 2009.

[36] Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. A deep hierarchical approach to lifelong learning in Minecraft. arXiv:1604.07255, 2016.

[37] Thrun. Is learning the n-th thing any easier than learning the first? NIPS, 1996.

[38] Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. Feudal networks for hierarchical reinforcement learning. arXiv:1703.01161, 2017.

[39] Wiering and Schmidhuber. HQ-learning. Adaptive Behavior, 1997.

[40] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning, 8(3–4):229–256, 1992.