{"title": "Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck", "book": "Advances in Neural Information Processing Systems", "page_first": 13978, "page_last": 13990, "abstract": "The ability for policies to generalize to new environments is key to the broad application of RL agents. A promising approach to prevent an agent\u2019s policy from overfitting to a limited set of training environments is to apply regularization techniques originally developed for supervised learning. However, there are stark differences between supervised learning and RL. We discuss those differences and propose modifications to existing regularization techniques in order to better adapt them to RL. In particular, we focus on regularization techniques relying on the injection of noise into the learned function, a family that includes some of the most widely used approaches such as Dropout and Batch Normalization. To adapt them to RL, we propose Selective Noise Injection (SNI), which maintains the regularizing effect the injected noise has, while mitigating the adverse effects it has on the gradient quality. Furthermore, we demonstrate that the Information Bottleneck (IB) is a particularly well suited regularization technique for RL as it is effective in the low-data regime encountered early on in training RL agents. 
Combining the IB with SNI, we significantly outperform current state of the art results, including on the recently proposed generalization benchmark Coinrun.", "full_text": "Generalization in Reinforcement Learning with\n\nSelective Noise Injection and Information Bottleneck\n\nMaximilian Igl \u21e4\nUniversity of Oxford\n\nKamil Ciosek\n\nMicrosoft Research\n\nYingzhen Li\n\nMicrosoft Research\n\nSebastian Tschiatschek\n\nMicrosoft Research\n\nCheng Zhang\n\nMicrosoft Research\n\nSam Devlin \u2020\n\nMicrosoft Research\n\nKatja Hofmann \u2020\nMicrosoft Research\n\nAbstract\n\nThe ability for policies to generalize to new environments is key to the broad\napplication of RL agents. A promising approach to prevent an agent\u2019s policy\nfrom over\ufb01tting to a limited set of training environments is to apply regularization\ntechniques originally developed for supervised learning. However, there are stark\ndifferences between supervised learning and RL. We discuss those differences\nand propose modi\ufb01cations to existing regularization techniques in order to better\nadapt them to RL. In particular, we focus on regularization techniques relying on\nthe injection of noise into the learned function, a family that includes some of\nthe most widely used approaches such as Dropout and Batch Normalization. To\nadapt them to RL, we propose Selective Noise Injection (SNI), which maintains\nthe regularizing effect the injected noise has, while mitigating the adverse effects\nit has on the gradient quality. 
Furthermore, we demonstrate that the Information\nBottleneck (IB) is a particularly well suited regularization technique for RL as\nit is effective in the low-data regime encountered early on in training RL agents.\nCombining the IB with SNI, we signi\ufb01cantly outperform current state of the art\nresults, including on the recently proposed generalization benchmark Coinrun.\n\n1\n\nIntroduction\n\nDeep Reinforcement Learning (RL) has been used to successfully train policies with impressive\nperformance on a range of challenging tasks, including Atari [6, 16, 31], continuous control [35, 46]\nand tasks with long-ranged temporal dependencies [33]. In those settings, the challenge is to be able to\nsuccessfully explore and learn policies complex enough to solve the training tasks. Consequently, the\nfocus of these works was to improve the learning performance of agents in the training environment\nand less attention was being paid to generalization to testing environments.\nHowever, being able to generalize is a key requirement for the broad application of autonomous agents.\nSpurred by several recent works showing that most RL agents over\ufb01t to the training environment\n[15, 58, 62, 65, 66], multiple benchmarks to evaluate the generalization capabilities of agents were\nproposed, typically by procedurally generating or modifying levels in video games [8, 11, 20, 21,\n32, 63]. How to learn generalizable policies in these environments remains an open question, but\nearly results have shown the use of regularization techniques (like weight decay, dropout and batch\nnormalization) established in the supervised learning paradigm can also be useful for RL agents\n[11]. 
Our work builds on these results, but highlights two important differences between supervised\nlearning and RL which need to be taken into account when regularizing agents.\n\n\u21e4Work performed during an internship at Microsoft Research Cambridge\n\u2020Co-Senior Authors\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFirst, because in RL the training data depends on the model and, consequently, the regularization\nmethod, stochastic regularization techniques like Dropout or BatchNorm can have adverse effects. For\nexample, injecting stochasticity into the policy can lead to prematurely ending episodes, preventing\nthe agent from observing future rewards. Furthermore, stochastic regularization can destabilize\ntraining through the learned critic and off-policy importance weights. To mitigate those adverse\neffects and effectively apply stochastic regularization techniques to RL, we propose Selective Noise\nInjection (SNI). It selectively applies stochasticity only when it serves regularization and otherwise\ncomputes the output of the regularized networks deterministically. We focus our evaluation on\nDropout and the Variational Information Bottleneck (VIB), but the proposed method is applicable to\nmost forms of stochastic regularization.\nA second difference between RL and supervised learning is the non-stationarity of the data-distribution\nin RL. Despite many RL algorithms utilizing millions or even billions of observations, the diversity\nof states encountered early on in training can be small, making it dif\ufb01cult to learn general features.\nWhile it remains an open question as to why deep neural networks generalize despite being able\nto perfectly memorize the training data [5, 64], it has been shown that the optimal point on the\nworst-case generalization bound requires the model to rely on a more compressed set of features the\nfewer data-points we have [47, 54]. 
Therefore, to bias our agent towards more general features even early on in training, we adapt the Information Bottleneck (IB) principle to an actor-critic agent, which we call Information Bottleneck Actor Critic (IBAC). In contrast to other regularization techniques, IBAC directly incentivizes the compression of input features, resulting in features that are more robust under a shifting data-distribution and that enable better generalization to held-out test environments.
We evaluate our proposed techniques using Proximal Policy Optimization (PPO), an off-policy actor-critic algorithm, on two challenging generalization tasks, Multiroom [9] and Coinrun [11]. We show the benefits of both IBAC and SNI individually as well as in combination, with the resulting IBAC-SNI significantly outperforming the previous state of the art results.

2 Background

We consider a distribution $q(m)$ over Markov decision processes (MDPs) $m \in \mathcal{M}$, with $m$ being a tuple $(\mathcal{S}_m, \mathcal{A}, T_m, R_m, p_m)$ consisting of state-space $\mathcal{S}_m$, action-space $\mathcal{A}$, transition distribution $T_m(s'|s,a)$, reward function $R_m(s,a)$ and initial state distribution $p_m(s_0)$ [38]. For training, we either assume unlimited access to $q(m)$ (as in section 5.2, Multiroom) or restrict ourselves to a fixed set of training environments $\mathcal{M}_{train} = \{m_1, \dots, m_n\}$, $m_i \sim q$ (as in section 5.3, Coinrun).
The goal of the learning process is to find a policy $\pi_\theta(a|s)$, parameterized by $\theta$, which maximizes the discounted expected reward $J(\pi_\theta) = \mathbb{E}_{q, \pi, T_m, p_m}\big[\sum_{t=0}^{T} \gamma^t R_m(s_t, a_t)\big]$. Although any RL method with an off-policy correction term could be used with our proposed method of SNI, PPO [46] has shown strong performance and enables direct comparison with prior work [11].
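The discounted return inside this expectation can be computed with a simple backward recursion. As a minimal illustration (the function below is our own sketch, not from the paper):

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T} gamma^t * r_t for one episode by iterating
    backwards over the rewards: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.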
The actor-critic version of this algorithm collects trajectory data $D_\tau$ using a rollout policy $\pi^r_\theta(a_t|s_t)$ and subsequently optimizes a surrogate loss

$$\mathcal{L}_{PPO} = -\mathbb{E}_{D_\tau}\big[\min\big(c_t(\theta) A_t,\ \mathrm{clip}(c_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\big] \quad (1)$$

with $c_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi^r_\theta(a_t|s_t)}$ for $K$ epochs. The advantage $A_t$ is computed as in A2C [30]. This is an efficient approximate trust region method [44], optimizing a pessimistic lower bound of the objective function on the collected data. It corresponds to estimating the gradient w.r.t. the policy conservatively, since moving $\pi_\theta$ further away from $\pi^r_\theta$, such that $c_t(\theta)$ moves outside a chosen range $[1-\epsilon, 1+\epsilon]$, is only taken into account if it decreases performance. Similarly, the value function loss minimizes an upper bound on the squared error:

$$\mathcal{L}^V_{PPO} = \mathbb{E}_{D_\tau}\Big[\tfrac{1}{2}\max\big((V_\theta - V_{target})^2,\ (V^r + \mathrm{clip}(V_\theta - V^r, -\epsilon, +\epsilon) - V_{target})^2\big)\Big] \quad (2)$$

with a bootstrapped value function target $V_{target}$ [30] and previous value function $V^r$. The overall minimization objective is then:

$$\mathcal{L}_t(\theta) = \mathcal{L}_{PPO} + \lambda_V \mathcal{L}^V_{PPO} - \lambda_H \mathcal{H}[\pi_\theta] \quad (3)$$

where $\mathcal{H}[\cdot]$ denotes an entropy bonus to encourage exploration and prevent the policy from collapsing prematurely. In the following, we discuss regularization techniques that can be used to mitigate overfitting to the states and MDPs so far seen during training.

2.1 Regularization Techniques in Supervised Learning

In supervised learning, classifiers are often regularized using a variety of techniques to prevent overfitting.
Here, we briefly present several major approaches which we either utilize as baseline or extend to RL in section 4.
Weight decay, also called L2 regularization, reduces the magnitude of the weights $\theta$ by adding an additional loss term $\frac{\lambda_w}{2}\|\theta\|^2_2$. With a gradient update of the form $\theta \leftarrow \theta - \alpha \nabla_\theta\big(L(\theta) + \frac{\lambda_w}{2}\|\theta\|^2_2\big)$, this decays the weights in addition to optimizing $L(\theta)$, i.e. we have $\theta \leftarrow (1 - \alpha\lambda_w)\theta - \alpha\nabla_\theta L(\theta)$.
Data augmentation refers to changing or distorting the available input data to improve generalization. In this work, we use a modified version of cutout [12], proposed by [11], in which a random number of rectangular areas in the input image is filled with random colors.
Batch Normalization [17, 18] normalizes activations of specified layers by estimating their mean and variance using the current mini-batch. Estimating the batch statistics introduces noise which has been shown to help improve generalization [28] in supervised learning.
Another widely used regularization technique for deep neural networks is Dropout [48]. Here, during training, individual activations are randomly zeroed out with a fixed probability $p_d$. This serves to prevent co-adaptation of neurons and can be applied to any layer inside the network. One common choice, which we follow in our architecture, is to apply it to the last hidden layer.
Lastly, we briefly describe the Variational Information Bottleneck (VIB) [2], a deep variational approximation to the Information Bottleneck (IB) [53]. While not typically used for regularization in deep supervised learning, we demonstrate in section 5 that our adaptation IBAC shows strong performance in RL. Given a data distribution $p(X, Y)$, the learned model $p_\theta(y|x)$ is regularized by inserting a stochastic latent variable $Z$ and minimizing the mutual information $I(X, Z)$ between the input $X$ and $Z$, while maximizing the predictive power of the latent variable, i.e. $I(Z, Y)$.
The VIB objective function is:

$$\mathcal{L}_{VIB} = \mathbb{E}_{p(x,y),\, p_\theta(z|x)}\big[-\log q_\theta(y|z) + \beta\, D_{KL}[p_\theta(z|x)\,\|\,q(z)]\big] \quad (4)$$

where $p_\theta(z|x)$ is the encoder, $q_\theta(y|z)$ the decoder, $q(z)$ the approximated latent marginal, often fixed to a normal distribution $\mathcal{N}(0, I)$, and $\beta$ is a hyperparameter. For a normally distributed $p_\theta(z|x)$, eq. (4) can be optimized by gradient descent using the reparameterization trick [25].

3 The Problem of Using Stochastic Regularization in RL

We now take a closer look at a prototypical objective for training actor-critic methods and highlight important differences to supervised learning. Based on those observations, we propose an explanation for the finding that some stochastic optimization methods are less effective [11] or can even be detrimental to performance when combined with other regularization techniques (see appendix D).
In supervised learning, the optimization objective takes a form similar to $\max_\theta \mathbb{E}_D[\log p_\theta(y|x)]$, where we highlight the model $p_\theta(y|x)$ to be updated in blue, $D$ is the available data and $\theta$ the parameters to be learned. On the other hand, in RL the objective for the actor is to maximize $J(\pi_\theta) = \mathbb{E}_{\pi_\theta(a|s)}\big[\sum_t \gamma^t R_m(s_t, a_t)\big]$, where, for convenience, we drop $q$, $T_m$ and $p_m$ from the notation of the expectation. Because now the learned distribution, $\pi_\theta(a|s)$, is part of data-generation, computing the gradients, as done in policy gradient methods, requires the log-derivative trick. For the class of deep off-policy actor-critic methods we are experimentally evaluating in this paper, one also typically uses the policy gradient theorem [52] and an estimated critic $V_\theta(s)$ as baseline and for bootstrapping to reduce the gradient variance.
Consequently, the gradient estimation becomes:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi^r_\theta(a_t|s_t)}\Bigg[\sum_{t=0}^{T} \frac{\pi_\theta(a_t|s_t)}{\pi^r_\theta(a_t|s_t)}\, \nabla_\theta \log \pi_\theta(a_t|s_t)\big(r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)\big)\Bigg] \quad (5)$$

where we utilize a rollout policy $\pi^r_\theta$ to collect trajectories. It can deviate from $\pi_\theta$ but should be similar to keep the off-policy correction term $\pi_\theta/\pi^r_\theta$ low variance. In eq. (5), only the term $\pi_\theta(a_t|s_t)$ is being updated and we highlight in orange all the additional influences of the learned policy and critic on the gradient.
Denoting by the superscript $\star$ that $V^\star_\theta$ is assumed constant, we can write the optimization objective for the critic as

$$\mathcal{L}^V_{AC} = \min_\theta\ \mathbb{E}_{\pi^r_\theta(a_t|s_t)}\Big[\big(\gamma V^\star_\theta(s_{t+1}) + r_t - V_\theta(s_t)\big)^2\Big] \quad (6)$$

From eqs. (5) and (6) we can see that the injection of noise into the computation of $\pi^r_\theta$, $\pi_\theta$ and $V_\theta$ can degrade performance in several ways: i) During rollouts using the rollout policy $\pi^r_\theta$, it can lead to undesirable actions, potentially ending episodes prematurely, and thereby deteriorating the quality of the observed data; ii) It leads to a higher variance of the off-policy correction term $\pi_\theta/\pi^r_\theta$ because the injected noise can be different for $\pi_\theta$ and $\pi^r_\theta$, increasing gradient variance; iii) It increases variance in the gradient updates of both the policy and the critic through variance in the computation of $V_\theta$.

4 Method

To utilize the strength of noise-injecting regularization techniques in RL, we introduce Selective Noise Injection (SNI) in the following section.
Its goal is to allow us to make use of such techniques while mitigating the adverse effects the added stochasticity can have on the RL gradient computation. Then, in section 4.2, we propose Information Bottleneck Actor Critic (IBAC) as a new regularization method and detail how SNI applies to IBAC, resulting in our state-of-the-art method IBAC-SNI.

4.1 Selective Noise Injection

We have identified three sources of negative effects due to noise which we need to mitigate: in the rollout policy $\pi^r_\theta$, in the critic $V_\theta$ and in the off-policy correction term $\pi_\theta/\pi^r_\theta$. We first introduce a short notation for eq. (5) as $\nabla_\theta J(\pi_\theta) = G_{AC}(\pi^r_\theta, \pi_\theta, V_\theta)$.
To apply SNI to a regularization technique relying on noise-injection, we need to be able to temporarily suspend the noise and compute the output of the model deterministically. This is possible for most techniques³: For example, in Dropout, we can freeze one particular dropout mask, in VIB we can pass in the mode instead of sampling from the posterior distribution and in Batch Normalization we can either utilize the moving average instead of the batch statistics or freeze and re-use one statistic multiple times. Formally, we denote by $\bar\pi_\theta$ the version of a component $\pi_\theta$ with the injected regularization noise suspended. Note that this does not mean that $\bar\pi_\theta$ is deterministic, for example when the network approximates the parameters of a distribution.
Then, for SNI we modify the policy gradient loss as follows: i) We use $\bar V_\theta$ as critic instead of $V_\theta$ in both eqs. (5) and (6), eliminating unnecessary noise through the critic; ii) We use $\bar\pi^r$ as rollout policy instead of $\pi^r$.
For some regularization techniques this will reduce the probability of undesirable actions; iii) We compute the policy gradient as a mixture between gradients for $\pi_\theta$ and $\bar\pi_\theta$ as follows:

$$G^{SNI}_{AC}(\pi^r_\theta, \pi_\theta, V_\theta) = \lambda\, G_{AC}(\bar\pi^r_\theta, \bar\pi_\theta, \bar V_\theta) + (1-\lambda)\, G_{AC}(\bar\pi^r_\theta, \pi_\theta, \bar V_\theta) \quad (7)$$

The first term guarantees a lower variance of the off-policy importance weight, which is especially important early on in training when the network has not yet learned to compensate for the injected noise. The second term uses the noise-injected policy for updates, thereby taking advantage of its regularizing effects while still reducing unnecessary variance through the use of $\bar\pi^r$ and $\bar V_\theta$. Note that sharing the rollout policy $\bar\pi^r$ between both terms allows us to use the same collected data. Furthermore, most computations are shared between both terms or can be parallelized.

4.2 Information Bottleneck Actor Critic

Early on in training an RL agent, we are often faced with little variation in the training data. Observed states are distributed only around the initial states $s_0$, making spurious correlations in the low amount of data more likely.
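As a concrete illustration of the mixture in eq. (7), a per-transition sketch can be layered on top of the clipped PPO surrogate of eq. (1). The following is our own minimal sketch (all names are ours; the actual implementation operates on batched network outputs): both terms share the rollout log-probability from the noise-suspended policy and the advantage from the noise-suspended critic, and only the second term evaluates the noise-injected policy.

```python
import math

def sni_policy_loss(logp_det, logp_stoch, logp_rollout, adv, lam=0.5, eps=0.2):
    """Eq. (7) for one transition: lambda-weighted mix of the clipped PPO
    surrogate evaluated with the noise-suspended policy (logp_det) and with
    the noise-injected policy (logp_stoch). logp_rollout comes from the shared
    noise-suspended rollout policy, adv from the noise-suspended critic."""
    def clipped_term(logp):
        ratio = math.exp(logp - logp_rollout)            # importance weight
        clipped = max(min(ratio, 1 + eps), 1 - eps)
        # pessimistic lower bound of the surrogate, negated for minimization
        return -min(ratio * adv, clipped * adv)
    return lam * clipped_term(logp_det) + (1 - lam) * clipped_term(logp_stoch)
```

With lam = 1 only the noise-suspended term contributes; lam = 0.5 mixes both terms equally, the setting compared experimentally in section 5.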
Furthermore, because neither the policy nor the critic have sufficiently converged yet, we have a high variance in the target values of our loss function. This combination makes it harder and less likely for the network to learn desirable features that are robust under a shifting data-distribution during training and generalize well to held-out test MDPs.

³In this work, we will focus on VIB and Dropout as those show the most promising results without SNI (see section 5) and will leave its application to other regularization techniques for future work.

To counteract this reduced signal-to-noise ratio, our goal is to explicitly bias the learning towards finding more compressed features, which are shown to have a tighter worst-case generalization bound [54]. While a higher compression does not guarantee robustness under a shifting data-distribution, we believe this to be a reasonable assumption in the majority of MDPs, for example because they rely on a consistent underlying transition mechanism like physical laws.
To incentivize more compressed features, we use an approach similar to the VIB [2], which minimizes the mutual information $I(S, Z)$ between the state $S$ and its latent representation $Z$ while maximizing $I(Z, A)$, the predictive power of $Z$ on actions $A$. To do so, we re-interpret the policy gradient update as maximization of the log-marginal likelihood of $\pi_\theta(a|s)$ under the data distribution $p(s,a) := \rho_\pi(s)\pi_\theta(a|s)A_\pi(s,a)/Z$ with discounted state distribution $\rho_\pi(s)$, advantage function $A_\pi(s,a)$ and normalization constant $Z$. Taking the semi-gradient of this objective, i.e.
assuming $p(s,a)$ to be fixed, recovers the policy gradient:

$$\nabla_\theta\, Z\, \mathbb{E}_{p(s,a)}[\log \pi_\theta(a|s)] = \int \rho_\pi(s)\pi_\theta(a|s)\nabla_\theta \log \pi_\theta(a|s)\, A_\pi(s,a)\, ds\, da. \quad (8)$$

Now, following the same steps as [2], we introduce a stochastic latent variable $z$ and minimize $I(S,Z)$ while maximizing $I(Z,A)$ under $p(s,a)$, resulting in the new objective:

$$\mathcal{L}_{IB} = \mathbb{E}_{p(s,a),\, p_\theta(z|s)}\big[-\log q_\theta(a|z) + \beta\, D_{KL}[p_\theta(z|s)\,\|\,q(z)]\big] \quad (9)$$

We take the gradient and use the reparameterization trick [25] to write the encoder $p_\theta(z|s)$ as a deterministic function $f_\theta(s,\epsilon)$ with $\epsilon \sim p(\epsilon)$:

$$\nabla_\theta \mathcal{L}_{IB} = -\mathbb{E}_{\rho_\pi(s)\pi_\theta(a|s)p(\epsilon)}\big[\nabla_\theta \log q_\theta(a|f_\theta(s,\epsilon))\, A_\pi(s,a)\big] + \beta\, \nabla_\theta D_{KL}[p_\theta(z|s)\,\|\,q(z)] = \nabla_\theta\big(\mathcal{L}^{IB}_{AC} + \beta \mathcal{L}_{KL}\big), \quad (10)$$

resulting in a modified policy gradient objective and an additional regularization term $\mathcal{L}_{KL}$.
Policy gradient algorithms heuristically add an entropy bonus $\mathcal{H}[\pi_\theta(a|s)]$ to prevent the policy distribution from collapsing. However, this term also influences the distribution over $z$. In practice, we are only interested in preventing $q_\theta(a|z)$ (not $\pi_\theta(a|s) = \mathbb{E}_z[q_\theta(a|z)]$) from collapsing because our rollout policy $\bar\pi_\theta$ will not rely on stochasticity in $z$. Additionally, $p_\theta(z|s)$ is already entropy-regularized by the IB loss term⁴. Consequently, we adapt the heuristic entropy bonus to

$$\mathcal{H}^{IB}[\pi_\theta(a|s)] := \int p_\theta(s,z)\,\mathcal{H}[q_\theta(a|z)]\, ds\, dz, \quad (11)$$

resulting in the overall loss function of the proposed Information Bottleneck Actor Critic (IBAC)

$$\mathcal{L}^{IBAC}_t(\theta) = \mathcal{L}^{IB}_{AC} + \lambda_V \mathcal{L}^V_{AC} - \lambda_H \mathcal{H}^{IB}[\pi_\theta] + \beta \mathcal{L}_{KL} \quad (12)$$

with the hyperparameters $\lambda_V$, $\lambda_H$ and $\beta$ balancing the loss terms.
While IBAC incentivizes more compressed features, it also introduces stochasticity. Consequently, combining it with SNI improves performance, as we demonstrate in sections 5.2 and 5.3.
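Two pieces of eqs. (9) to (12) have simple closed forms worth making explicit: for a diagonal-Gaussian encoder and standard-normal marginal $q(z)$ the KL term is available analytically, and the adapted bonus $\mathcal{H}^{IB}$ averages decoder entropies over latent samples rather than taking the entropy of the marginal policy. A minimal sketch (our own function names, purely illustrative):

```python
import math

def kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL[N(mu, diag(exp(logvar))) || N(0, I)], summed over
    latent dimensions: 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def ib_entropy_bonus(action_probs_per_z):
    """Monte-Carlo estimate of H^IB (eq. 11): average the entropy of the
    decoder q(a|z) over samples z ~ p(z|s), instead of taking the entropy
    of the marginal policy pi(a|s)."""
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0.0)
    return sum(entropy(p) for p in action_probs_per_z) / len(action_probs_per_z)
```

Note that two decoders that each put all mass on a (different) single action have zero bonus even though the marginal policy has positive entropy, which is exactly the distinction motivating eq. (11).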
To compute the noise-suspended policy $\bar\pi_\theta$ and critic $\bar V_\theta$, we use the mode $z = \mu_\theta(s)$ as input to $q_\theta(a|z)$ and $V_\theta(z)$, where $\mu_\theta(s)$ is the mode of $p_\theta(z|s)$ and $V_\theta(z)$ now conditions on $z$ instead of $s$, also using the compressed features. Note that for SNI with $\lambda = 1$, i.e. with only the term $G_{AC}(\bar\pi^r_\theta, \bar\pi_\theta, \bar V_\theta)$, this effectively recovers an L2 penalty on the activations, since the variance of $z$ will then always be ignored and the KL-divergence between two Gaussians minimizes the squared difference of their means.

5 Experiments

In the following, we present a series of experiments to show that the IB finds more general features in the low-data regime and that this translates to improved generalization in RL for IBAC agents, especially when combined with SNI. We evaluate our proposed regularization techniques on two environments: a grid-world with challenging generalization requirements [9] in which most previous approaches are unable to find the solution, and the recently proposed Coinrun benchmark [11]. We show that IBAC-SNI outperforms previous state of the art on both environments by a large margin. Details about the used hyperparameters and network architectures can be found in the Appendix; code to reproduce the results can be found at https://github.com/microsoft/IBAC-SNI/.

⁴We have $D_{KL}[p_\theta(z|s)\,\|\,r(z)] = \mathbb{E}_{p_\theta(z|s)}[\log p_\theta(z|s) - \log r(z)] = -\mathcal{H}[p_\theta(z|s)] - \mathbb{E}_{p_\theta(z|s)}[\log r(z)]$

5.1 Learning Features in the Low-Data Regime

Figure 1: We show the loss on the test-data (lower is better).
Left: Higher $\omega_f$ results in a larger difference in generality between features $f^c$ and $g^c$, making it easier to fit to the more general $g^c$. Right: Learning $g^c$ with fewer datapoints is more challenging, but needed early in training RL agents.

First we start in the supervised setting and show on a synthetic dataset that the VIB is particularly strong at finding more general features in the low-data regime and in the presence of multiple signals with varying degrees of generality. Our motivation is that the low-data regime is commonly encountered in RL early on in training, and many environments allow the agent to base its decision on a variety of features in the state, of which we would like to find the most general ones.
We generate the training dataset $D_{train} = \{(c_i, x_i)\}_{i=1}^{N}$ with observations $x_i \in \mathbb{R}^{d_x}$ and classes $c_i \in \{1, \dots, n_c\}$. Each data point $i$ is generated by first drawing the class $c_i \sim \mathrm{Cat}(n_c)$ from a uniform categorical distribution and generating the vector $x_i$ by embedding the information about $c_i$ in two different ways, $g^c$ and $f^c$ (see appendix B for details). Importantly, only $g^c$ is shared between the training and test set. This allows us to measure the model's relative reliance on $g^c$ and $f^c$ by measuring the test performance (all models perfectly fit the training data). We allow $f^c$ to encode the information about $c_i$ in $\omega_f$ different ways. Consequently, the higher $\omega_f$, the less general $f^c$ is.
In fig. 1 we measure how the test performance of fully trained classification models varies for different regularization techniques when we i) vary the generality of $f^c$ and ii) vary the number of data-points in the training set. We find that most techniques perform comparably, with the exception of the VIB, which is able to find more general features both in the low-data regime and in the presence of multiple features with only small differences in generality.
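The exact construction of $g^c$ and $f^c$ is specified in the paper's appendix B. Purely to illustrate the setup described above, a hypothetical toy analogue (entirely our own, not the paper's construction) might pair one stable codebook for $g$, shared with the test set, with $\omega_f$ interchangeable codebooks for $f$:

```python
import random

def make_dataset(n, n_classes, omega_f, dim=8, seed=0):
    """Hypothetical toy analogue of the synthetic dataset (illustrative only;
    the paper's exact construction is in its appendix B). Each observation
    concatenates a stable class encoding g(c), shared between train and test,
    with a spurious encoding f(c) drawn from one of omega_f codebooks, so a
    higher omega_f makes f less general."""
    rng = random.Random(seed)
    g_code = {c: [rng.gauss(0, 1) for _ in range(dim)] for c in range(n_classes)}
    f_codes = [{c: [rng.gauss(0, 1) for _ in range(dim)] for c in range(n_classes)}
               for _ in range(omega_f)]
    data = []
    for _ in range(n):
        c = rng.randrange(n_classes)                      # c ~ Cat(n_classes)
        data.append((c, g_code[c] + rng.choice(f_codes)[c]))
    return data
```

A classifier that keys on the $g$ half of each vector transfers to a test set built with fresh $f$ codebooks; one that keys on $f$ does not.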
In the next section, we show that this translates to faster training and performance gains in RL for our proposed algorithm IBAC.

5.2 Multiroom

Figure 2: Left: Typical layout of the environment. The red triangle denotes the agent and its direction, the green full square is the goal, colored boxes are doors and grey squares are walls. Middle: Probability of finding the goal depending on level size for models trained on all levels. Shown are mean and standard error across 30 different seeds. Right: Mean and standard error of the return of the same models, averaged across all room sizes.

In this section, we show how IBAC can help learning in RL tasks which require generalization. For this task, we do not distinguish between training and testing, but for each episode, we draw $m$ randomly from the full distribution over MDPs $q(m)$. As the number of MDPs is very large, learning can only be successful if the agent learns general features that are transferable between episodes.
This experiment is based on [9]. The aim of the agent is to traverse a sequence of rooms to reach the goal (green square in fig. 2) as quickly as possible. It takes discrete actions to rotate 90° in either direction, move forward and toggle doors to be open or closed. The observation received by the agent includes the full grid, one pixel per square, with object type and object status (like direction) encoded in the 3 color channels. Crucially, for each episode, the layout is generated randomly by placing a random number of rooms $n_r \in \{1, 2, 3\}$ in a sequence connected by one door each.
The results in fig. 2 show that IBAC agents are much better at successfully learning to solve this task, especially for layouts with more rooms. While all other fully trained agents can solve less than 3% of the layouts with two rooms and none of the ones with three, IBAC-SNI still succeeds in an impressive 43% and 21% of those layouts.
The difficulty of this seemingly simple task arises from its generalization requirements: since the layout is randomly generated in each episode, each state is observed very rarely, especially for multi-room layouts, requiring generalization to allow learning. While in the 1-room layout the reduced policy stochasticity of the SNI agent slightly reduces performance, it improves performance for more complex layouts in which higher noise becomes detrimental. In the next section we will see that this also holds for the much more complex Coinrun environment, in which SNI significantly improves the IBAC performance.

5.3 Coinrun

Figure 3: Left: Performance of various agents on the test environments. We note that 'BatchNorm' corresponds to the best performing agent in [11]. Furthermore, 'Dropout-SNI ($\lambda = 1$)' is similar to the Dropout implementation used in [11] but was previously not evaluated with weight decay and data augmentation. Middle: Difference between test performance and train performance (see fig. 7), shown without standard deviation for readability. Right: Averaged approximate KL-divergence between rollout policy and updated policy, used as proxy for the variance of the importance weight. Mean and standard deviation are across three random seeds.

On the previous environment, we were able to show that IBAC and SNI help agents to find more general features and to do so faster. Next, we show that this can lead to a higher final performance on previously unseen test environments. We evaluate our proposed regularization techniques on Coinrun [11], a recently proposed generalization benchmark with high-dimensional observations and a large variety in levels. Several regularization techniques were previously evaluated there, making it an ideal evaluation environment for IBAC and SNI.
We follow the setting proposed in [11], using the same\n500 levels for training and evaluate on randomly drawn, new levels of only the highest dif\ufb01culty.\nAs [11] have shown, combining multiple regularization techniques can improve performance, with\ntheir best- performing agent utilizing data augmentation, weight decay and batch normalization. As\nour goal is to push the state of the art on this environment and to accurately compare against their\nresults, \ufb01g. 3 uses weight decay and data-augmentation on all experiments. Consequently, \u2018Baseline\u2019\nin \ufb01g. 3 refers to only using weight decay and data-augmentation whereas the other experiments use\nDropout, Batch Normalization or IBAC in addition to weight decay and data-augmentation. Results\nwithout those baseline techniques can be found in appendix D.\nFirst, we \ufb01nd that almost all previously proposed regularization techniques decrease performance\ncompared to the baseline, see \ufb01g. 3 (left), with batch normalization performing worst, possibly due to\nits unusual interaction with weight decay [56]. Note that this combination with batch normalization\nwas the highest performing agent in [11]. We conjecture that regularization techniques relying\non stochasticity can introduce additional instability into the training update, possibly deteriorating\nperformance, especially if their regularizing effect is not suf\ufb01ciently different from what weight decay\nand data-augmentation already achieve. This result applies to both batch normalization and Dropout,\n\n7\n\n\fwith and without SNI, although SNI mitigates the adverse effects. Consequently, we can already\nimprove on the state of the art by only relying on those two non-stochastic techniques. Furthermore,\nwe \ufb01nd that IBAC in combination with SNI is able to signi\ufb01cantly outperform our new state of the\nart baseline. 
We also find that for IBAC, $\lambda = 0.5$ achieves better performance than $\lambda = 1$, justifying using both terms in eq. (7).
As a proxy for the variance of the off-policy correction term $\pi_\theta/\bar\pi^r_\theta$, we show in fig. 3 (right) the estimated, averaged KL-divergence between the rollout policy and the update policy for both terms, $G_{AC}(\bar\pi^r_\theta, \bar\pi_\theta, \bar V_\theta)$, denoted by '(det)', and $G_{AC}(\bar\pi^r_\theta, \pi_\theta, \bar V_\theta)$, denoted by '(stoch)'. Because PPO uses data-points multiple times, it is non-zero even for the deterministic term. First, we can see that using the deterministic version reduces the KL-divergence, explaining the positive influence of $G_{AC}(\bar\pi^r_\theta, \bar\pi_\theta, \bar V_\theta)$. Second, we see that the KL-divergence of the stochastic part is much higher for Dropout than for IBAC, offering an explanation of why for Dropout relying on purely the deterministic part ($\lambda = 1$) outperforms an equal mixing $\lambda = 0.5$ (see fig. 6).

6 Related Work

Generalization in RL can take a variety of forms, each necessitating different types of regularization. To position this work, we distinguish two types that, whilst not mutually exclusive, we believe to be conceptually distinct and found useful to isolate when studying approaches to improve generalization.
The first type, robustness to uncertainty, refers to settings in which the unobserved MDP $m$ influences the transition dynamics or reward structure. Consequently, the current state $s$ might not contain enough information to act optimally in the current MDP and we need to find the action which is optimal under the uncertainty about $m$. This setting often arises in robotics and control where exact physical characteristics are unknown and domain shifts can occur [27].
Consequently, domain randomization, the injection of randomness into the environment, is often purposefully applied during training to allow for sim-to-real transfer [26, 55]. Noise can be injected into the states of the environment [50] or the parameters of the transition distribution like friction coefficients or mass values [3, 34, 60]. The noise injected into the dynamics can also be manipulated adversarially [29, 36, 39]. As the goal is to prevent overfitting to specific MDPs, it has also been found that using smaller [40] or simpler [66] networks can help. We can also aim to learn an adaptive policy by treating the environment as a partially observable Markov decision process (POMDP) [35, 60] (similar to viewing the learning problem in the framework of Bayesian RL [37]) or as a meta-learning problem [1, 10, 13, 43, 51, 57].

On the other hand, we distinguish feature robustness, which applies to environments with high-dimensional observations (like images) in which generalization to previously unseen states can be improved by learning to extract better features, as the focus of this paper. Recently, a range of benchmarks, typically utilizing procedurally generated levels, have been proposed to evaluate this type of generalization [4, 11, 19, 20, 21, 22, 32, 59, 63].

Improving generalization in those settings can rely on generating more diverse observation data [11, 42, 55], or strong, often relational, inductive biases applied to the architecture [23, 49, 61]. Contrary to the results in continuous control domains, here deeper networks have been found to be more successful [7, 11]. Furthermore, this setting is more similar to that of supervised learning, so established regularization techniques like weight decay, dropout or batch-normalization have also successfully been applied, especially in settings with a limited number of training environments [11]. This is the work most closely related to ours.
We build on those results and improve upon them by taking into account the specific ways in which RL is different from the supervised setting. They also do not consider the VIB as a regularization technique.

Combining RL and VIB has recently been explored for learning goal-conditioned policies [14] and meta-RL [41]. Both of these previous works [14, 41] also differ from the IBAC architecture we propose by conditioning action selection on both the encoded and raw state observation. These studies complement the contribution made here by providing evidence that the VIB can be used with a wider range of RL algorithms, including demonstrated benefits when used with Soft Actor-Critic for continuous control in MuJoCo [41] and on-policy A2C in MiniGrid and MiniPacMan [14].

7 Conclusion

In this work we highlight two important differences between supervised learning and RL: First, the training data is generated using the learned model. Consequently, using stochastic regularization methods can induce adverse effects and reduce the quality of the data. We conjecture that this explains the observed lower performance of Batch Normalization and Dropout. Second, in RL, we often encounter a noisy, low-data regime early on in training, complicating the extraction of general features.

We argue that these differences should inform the choice of regularization techniques used in RL. To mitigate the adverse effects of stochastic regularization, we propose Selective Noise Injection (SNI) which only selectively injects noise into the model, preventing reduced data quality and higher gradient variance through a noisy critic.
On the other hand, to learn more compressed and general features in the noisy low-data regime, we propose Information Bottleneck Actor Critic (IBAC), which utilizes a variational information bottleneck as part of the agent.

We experimentally demonstrate that the VIB is able to extract better features in the low-data regime and that this translates to better generalization of IBAC in RL. Furthermore, on complex environments, SNI is key to good performance, allowing the combined algorithm, IBAC-SNI, to achieve state of the art on challenging generalization benchmarks. We believe the results presented here can inform a range of future works, both to improve existing algorithms and to find new regularization techniques adapted to RL.

Acknowledgments

We would like to thank Shimon Whiteson for his helpful feedback; Sebastian Lee, Luke Harris, Hiske Overweg and Patrick Fernandes for help with experimental evaluations; and Adrian O'Grady, Jaroslaw Rzepecki and Andre Kramer for help with the computing infrastructure. M. Igl is supported by the UK EPSRC CDT on Autonomous Intelligent Machines and Systems.

References

[1] Maruan Al-Shedivat, Trapit Bansal, Yura Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations, 2018.

[2] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[3] Rika Antonova, Silvia Cruciani, Christian Smith, and Danica Kragic. Reinforcement learning for pivoting task. arXiv preprint arXiv:1703.00472, 2017.

[4] Edward Beeching, Christian Wolf, Jilles Dibangoye, and Olivier Simonin. Deep reinforcement learning on a budget: 3d control and reasoning without a supercomputer.
CoRR, abs/1904.01806, 2019.

[5] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.

[6] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[7] Alon Brutzkus and Amir Globerson. Over-parameterization improves generalization in the XOR detection problem. CoRR, abs/1810.03037, 2018.

[8] Devendra Singh Chaplot, Guillaume Lample, Kanthashree Mysore Sathyendra, and Ruslan Salakhutdinov. Transfer deep reinforcement learning in 3d environments: An empirical study. In NIPS Deep Reinforcement Learning Workshop, 2016.

[9] Maxime Chevalier-Boisvert and Lucas Willems. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.

[10] Ignasi Clavera, Anusha Nagabandi, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt: Meta-learning for model-based control. CoRR, abs/1803.11347, 2018.

[11] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.

[12] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[13] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[14] Anirudh Goyal, Riashat Islam, DJ Strouse, Zafarali Ahmed, Hugo Larochelle, Matthew Botvinick, Sergey Levine, and Yoshua Bengio. Infobot: Transfer and exploration via the information bottleneck.
In International Conference on Learning Representations, 2019.

[15] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[16] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[17] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.

[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[19] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247, 2016.

[20] Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. CoRR, abs/1902.01378, 2019.

[21] Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv preprint arXiv:1806.10729, 2018.

[22] Yuji Kanagawa and Tomoyuki Kaneko. Rogue-gym: A new challenge for generalization in reinforcement learning.
arXiv preprint arXiv:1904.08129, 2019.

[23] Ken Kansky, Tom Silver, David A Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1809–1818. JMLR.org, 2017.

[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[25] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[26] Sylvain Koos, Jean-Baptiste Mouret, and Stéphane Doncieux. The transferability approach: Crossing the reality gap in evolutionary robotics. IEEE Transactions on Evolutionary Computation, 17(1):122–145, 2013.

[27] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.

[28] Ping Luo, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. Towards understanding regularization in batch normalization. In International Conference on Learning Representations, 2019.

[29] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3932–3939.
IEEE, 2017.

[30] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.

[31] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

[32] Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720, 2018.

[33] OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.

[34] Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.

[35] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.

[36] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2817–2826. JMLR.org, 2017.

[37] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete bayesian reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 697–704. ACM, 2006.

[38] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[39] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles.
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[40] Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pages 6550–6561, 2017.

[41] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proceedings of the 36th International Conference on Machine Learning, 2019.

[42] Fereshteh Sadeghi and Sergey Levine. CAD2RL: real single-image flight without a single real image. In Robotics: Science and Systems XIII, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, July 12-16, 2017, 2017.

[43] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320, 2015.

[44] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[45] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

[46] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[47] Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck.
Theoretical Computer Science, 411(29-30):2696–2711, 2010.

[48] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[49] Mario Srouji, Jian Zhang, and Ruslan Salakhutdinov. Structured control nets for deep reinforcement learning. In International Conference on Machine Learning, pages 4749–4758, 2018.

[50] Freek Stulp, Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. Learning to grasp under uncertainty. In 2011 IEEE International Conference on Robotics and Automation, pages 5703–5708. IEEE, 2011.

[51] Flood Sung, Li Zhang, Tao Xiang, Timothy Hospedales, and Yongxin Yang. Learning to learn: Meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529, 2017.

[52] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.

[53] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

[54] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.

[55] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.

[56] Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.

[57] Jane X.
Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matthew Botvinick. Learning to reinforcement learn. CoRR, abs/1611.05763, 2016.

[58] Shimon Whiteson, Brian Tanner, Matthew E Taylor, and Peter Stone. Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 120–127. IEEE, 2011.

[59] Marek Wydmuch, Michał Kempka, and Wojciech Jaśkowski. Vizdoom competitions: playing doom from pixels. IEEE Transactions on Games, 2018.

[60] Wenhao Yu, Jie Tan, C. Karen Liu, and Greg Turk. Preparing for the unknown: Learning a universal policy with online system identification. In Robotics: Science and Systems XIII, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, July 12-16, 2017, 2017.

[61] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, 2019.

[62] Amy Zhang, Nicolas Ballas, and Joelle Pineau. A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937, 2018.

[63] Amy Zhang, Yuxin Wu, and Joelle Pineau. Natural environment benchmarks for reinforcement learning. arXiv preprint arXiv:1811.06032, 2018.

[64] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization.
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[65] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.

[66] Chenyang Zhao, Olivier Sigaud, Freek Stulp, and Timothy M Hospedales. Investigating generalisation in continuous deep reinforcement learning. arXiv preprint arXiv:1902.07015, 2019.