{"title": "Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents", "book": "Advances in Neural Information Processing Systems", "page_first": 5027, "page_last": 5038, "abstract": "Evolution strategies (ES) are a family of black-box optimization algorithms able to train deep neural networks roughly as well as Q-learning and policy gradient methods on challenging deep reinforcement learning (RL) problems, but are much faster (e.g. hours vs. days) because they parallelize better. However, many RL problems require directed exploration because they have reward functions that are sparse or deceptive (i.e. contain local optima), and it is unknown how to encourage such exploration with ES. Here we show that algorithms that have been invented to promote directed exploration in small-scale evolved neural networks via populations of exploring agents, specifically novelty search (NS) and quality diversity (QD) algorithms, can be hybridized with ES to improve its performance on sparse or deceptive deep RL tasks, while retaining scalability. Our experiments confirm that the resultant new algorithms, NS-ES and two QD algorithms, NSR-ES and NSRA-ES, avoid local optima encountered by ES to achieve higher performance on Atari and simulated robots learning to walk around a deceptive trap. This paper thus introduces a family of fast, scalable algorithms for reinforcement learning that are capable of directed exploration. It also adds this new family of exploration algorithms to the RL toolbox and raises the interesting possibility that analogous algorithms with multiple simultaneous paths of exploration might also combine well with existing RL algorithms outside ES.", "full_text": "Improving Exploration in Evolution Strategies for\nDeep Reinforcement Learning via a Population of\n\nNovelty-Seeking Agents\n\nEdoardo Conti\u21e4\n\nJoel Lehman\n\nVashisht Madhavan\u21e4\n\nKenneth O. 
Stanley
Felipe Petroski Such
Jeff Clune
Uber AI Labs

Abstract

Evolution strategies (ES) are a family of black-box optimization algorithms able to train deep neural networks roughly as well as Q-learning and policy gradient methods on challenging deep reinforcement learning (RL) problems, but are much faster (e.g. hours vs. days) because they parallelize better. However, many RL problems require directed exploration because they have reward functions that are sparse or deceptive (i.e. contain local optima), and it is unknown how to encourage such exploration with ES. Here we show that algorithms that have been invented to promote directed exploration in small-scale evolved neural networks via populations of exploring agents, specifically novelty search (NS) and quality diversity (QD) algorithms, can be hybridized with ES to improve its performance on sparse or deceptive deep RL tasks, while retaining scalability. Our experiments confirm that the resultant new algorithms, NS-ES and two QD algorithms, NSR-ES and NSRA-ES, avoid local optima encountered by ES to achieve higher performance on Atari and simulated robots learning to walk around a deceptive trap. This paper thus introduces a family of fast, scalable algorithms for reinforcement learning that are capable of directed exploration. It also adds this new family of exploration algorithms to the RL toolbox and raises the interesting possibility that analogous algorithms with multiple simultaneous paths of exploration might also combine well with existing RL algorithms outside ES.

1 Introduction
In RL, an agent tries to learn to perform a sequence of actions in an environment that maximizes some notion of cumulative reward [1]. However, reward functions are often deceptive, and solely optimizing for reward without some mechanism to encourage intelligent exploration can lead to getting stuck in local optima and the agent failing to properly learn [1-3]. 
Unlike in supervised learning with deep neural networks (DNNs), wherein local optima are not thought to be a problem [4, 5], the training data in RL is determined by the actions an agent takes. If the agent greedily takes actions that maximize reward, the training data for the algorithm will be limited and it may not discover alternate strategies with larger payoffs (i.e. it can get stuck in local optima) [1-3]. Sparse reward signals can also be a problem for algorithms that only maximize reward, because at times there may be no reward gradient to follow. The possibility of deceptiveness and/or sparsity in the reward signal motivates the need for efficient and directed exploration, in which an agent is motivated to visit unexplored states in order to learn to accumulate higher rewards. Although deep RL algorithms have performed amazing feats in recent years [6-8], they have mostly done so despite relying on simple, undirected (aka dithering) exploration strategies, in which an agent hopes to explore new areas of its environment by taking random actions (e.g. epsilon-greedy exploration) [1].
A number of methods have been proposed to promote directed exploration in RL [9, 10], including recent methods that handle high-dimensional state spaces with DNNs. A common idea is to encourage an agent to visit states it has rarely or never visited (or take novel actions in those states). 
*Equal contribution, corresponding authors: vashisht@uber.com, edoardo.conti@gmail.com.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Methods proposed to track state (or state-action pair) visitation frequency include (1) approximating state visitation counts based on either auto-encoded latent codes of states [11] or pseudo-counts from state-space density models [12, 13], (2) learning a dynamics model that predicts future states (assuming predictions will be worse for rarely visited states/state-action pairs) [14-16], and (3) methods based on compression (novel states should be harder to compress) [9].
Those methods all count each state separately. A different approach is to hand-design (or learn) an abstract, holistic description of an agent's lifetime behavior, and then encourage the agent to exhibit different behaviors from those previously performed. That is the approach of novelty search (NS) [3] and quality diversity (QD) algorithms [17-19], which are described in detail below. Such algorithms are also interestingly different, and have different capabilities, because they perform exploration with a population of agents rather than a single agent (discussed in SI Sec. 6.2). NS and QD have shown promise with smaller neural networks on problems with low-dimensional input and output spaces [17-22]. Evolution strategies (ES) [23] was recently shown to perform well on high-dimensional deep RL tasks in a short amount of wall-clock time by scaling well to many distributed computers. In this paper, for the first time, we study how these two types of algorithms can be hybridized with ES to scale them to deep neural networks and thus tackle hard, high-dimensional deep reinforcement learning problems, without sacrificing the speed/scalability benefits of ES. We first study NS, which performs exploration only (ignoring the reward function) to find a set of novel solutions [3]. 
We then\ninvestigate algorithms that balance exploration and exploitation, speci\ufb01cally novel instances of QD\nalgorithms, which seek to produce a set of solutions that are both novel and high-performing [17\u201320].\nBoth NS and QD are explained in detail in Sec. 3.\nES directly searches in the parameter space of a neural network to \ufb01nd an effective policy. A team\nfrom OpenAI recently showed that ES can achieve competitive performance on many reinforcement\nlearning (RL) tasks while offering some unique bene\ufb01ts over traditional gradient-based RL methods\n[24]. Most notably, ES is highly parallelizable, which enables near linear speedups in runtime as a\nfunction of CPU/GPU workers. For example, with hundreds of parallel CPUs, ES was able to achieve\nroughly the same performance on Atari games with the same DNN architecture in 1 hour as A3C did\nin 24 hours [24]. In this paper, we investigate adding NS and QD to ES only; in future work, we will\ninvestigate how they might be hybridized with Q-learning and policy gradient methods. We start with\nES because (1) its fast wall-clock time allows rapid experimental iteration, and (2) NS and QD were\noriginally developed as neuroevolution methods, making it natural to try them \ufb01rst with ES, which is\nalso an evolutionary algorithm.\nHere we test whether encouraging novelty via NS and QD improves the performance of ES on\nsparse and/or deceptive control tasks. Our experiments con\ufb01rm that NS-ES and two simple versions\nof QD-ES (NSR-ES and NSRA-ES) avoid local optima encountered by ES and achieve higher\nperformance on tasks ranging from simulated robots learning to walk around a deceptive trap to the\nhigh-dimensional pixel-to-action task of playing Atari games. 
Our results add these new families of exploration algorithms to the RL toolbox, opening up avenues for studying how they can best be combined with RL algorithms, whether ES or others.

2 Background
2.1 Evolution Strategies
Evolution strategies (ES) are a class of black box optimization algorithms inspired by natural evolution [23]: At every iteration (generation), a population of parameter vectors (genomes) is perturbed (mutated) and, optionally, recombined (merged) via crossover. The fitness of each resultant offspring is then evaluated according to some objective function (reward) and some form of selection then ensures that individuals with higher reward tend to produce offspring for the next generation. Many algorithms in the ES class differ in their representation of the population and methods of recombination; the algorithms subsequently referred to in this work belong to the class of Natural Evolution Strategies (NES) [25, 26]. NES represents the population as a distribution of parameter vectors θ characterized by parameters ψ: p_ψ(θ). Under a fitness function f(θ), NES seeks to maximize the average fitness of the population, E_{θ∼p_ψ}[f(θ)], by optimizing ψ with stochastic gradient ascent.
Recent work from OpenAI outlines a version of NES applied to standard RL benchmark problems [24]. We will refer to this variant simply as ES going forward. In their work, a fitness function f(θ) represents the stochastic reward experienced over a full episode of agent interaction, where θ parameterizes the policy π_θ. From the population distribution p_{ψ_t}, parameters θ^i_t ∼ N(θ_t, σ²I) are sampled and evaluated to obtain f(θ^i_t). In a manner similar to REINFORCE [27], θ_t is updated using an estimate of the approximate gradient of expected reward:

∇_ψ E_{θ∼p_ψ}[f(θ)] ≈ (1/n) Σ_{i=1}^{n} f(θ^i_t) ∇_ψ log p_ψ(θ^i_t)

where n is the number of samples evaluated per generation. Intuitively, NES samples parameters in the neighborhood of θ_t and determines the direction in which θ_t should move to improve expected reward. Since this gradient estimate has high variance, NES relies on a large n for variance reduction. Generally, NES also evolves the covariance of the population distribution, but for the sake of fair comparison with Salimans et al. [24] we consider only static covariance distributions, meaning σ is fixed throughout training.
To sample from the population distribution, Salimans et al. [24] apply additive Gaussian noise to the current parameter vector: θ^i_t = θ_t + σε_i where ε_i ∼ N(0, I). Although θ is high-dimensional, previous work has shown Gaussian parameter noise to have beneficial exploration properties when applied to deep networks [26, 28, 29]. The gradient is then estimated by taking a sum of sampled parameter perturbations weighted by their reward:

∇_{θ_t} E_{ε∼N(0,I)}[f(θ_t + σε)] ≈ (1/(nσ)) Σ_{i=1}^{n} f(θ^i_t) ε_i

To ensure that the scale of reward between domains does not bias the optimization process, we follow the approach of Salimans et al. [24] and rank-normalize f(θ^i_t) before taking the weighted sum. Overall, this NES variant exhibits performance on par with contemporary, gradient-based algorithms on difficult RL domains, including simulated robot locomotion and Atari environments [30].
2.2 Novelty Search (NS)
Inspired by nature's drive towards diversity, NS encourages policies to engage in notably different behaviors than those previously seen. 
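Before moving on, the ES estimator of Sec. 2.1 can be summarized in code. The following minimal numpy sketch performs one ES update with rank normalization; the function names, hyperparameter values, and toy fitness are our own illustrative choices, not the authors' released implementation:

```python
import numpy as np

def centered_ranks(x):
    """Rank-normalize rewards to [-0.5, 0.5], as in Salimans et al. [24],
    so the scale of reward in a domain does not bias the update."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks / (len(x) - 1) - 0.5

def es_gradient_step(theta, fitness_fn, sigma=0.02, alpha=0.01, n=100, rng=None):
    """One ES update: perturb theta with Gaussian noise, weight each
    perturbation by its rank-normalized reward, and ascend the resulting
    estimate of the expected-reward gradient."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n, theta.size))            # eps_i ~ N(0, I)
    rewards = np.array([fitness_fn(theta + sigma * e) for e in eps])
    weights = centered_ranks(rewards)
    grad = (weights[:, None] * eps).sum(axis=0) / (n * sigma)
    return theta + alpha * grad

# Toy fitness (ours): maximize -||theta - 3||^2, so on average the step
# should move theta toward 3.
theta0 = np.zeros(5)
theta1 = es_gradient_step(theta0, lambda th: -np.sum((th - 3.0) ** 2))
```

In practice this loop is distributed across many workers, each evaluating a slice of the n perturbations; only scalar rewards and random seeds need to be communicated.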
The algorithm encourages different behaviors by computing the novelty of the current policy with respect to previously generated policies and then encourages the population distribution to move towards areas of parameter space with high novelty. NS outperforms reward-based methods in maze and biped walking domains, which possess deceptive reward signals that attract agents to local optima [3]. In this work, we investigate the efficacy of NS at the scale of DNNs by combining it with ES. In NS, a policy π is assigned a domain-dependent behavior characterization b(π) that describes its behavior. For example, in the case of a humanoid locomotion problem, b(π) may be as simple as a two-dimensional vector containing the humanoid's final {x, y} location. Throughout training, every π_θ evaluated adds b(π_θ) to an archive set A with some probability. A particular policy's novelty N(b(π_θ), A) is then computed by selecting the k-nearest neighbors of b(π_θ) from A and computing the average distance between them:

N(θ, A) = N(b(π_θ), A) = (1/|S|) Σ_{j∈S} ||b(π_θ) − b(π_j)||_2
S = kNN(b(π_θ), A) = {b(π_1), b(π_2), ..., b(π_k)}

Above, the distance between behavior characterizations is calculated with an L2-norm, but any distance function can be substituted. Previously, NS has been implemented with a genetic algorithm [3]. We next explain how NS can now be combined with ES, to leverage the advantages of both.

3 Methods
3.1 NS-ES
We use the ES optimization framework, described in Sec. 2.1, to compute and follow the gradient of expected novelty with respect to θ_t. 
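The novelty computation above reduces to a k-nearest-neighbor query against the archive. A minimal sketch (the names and the toy archive of final {x, y} positions are illustrative, not the authors' code):

```python
import numpy as np

def novelty(bc, archive, k=10):
    """N(theta, A): mean L2 distance from a policy's behavior
    characterization bc to its k nearest neighbors in the archive A."""
    archive = np.asarray(archive, dtype=float)
    dists = np.linalg.norm(archive - bc, axis=1)   # distance to every b(pi_j) in A
    k = min(k, len(dists))
    nearest = np.sort(dists)[:k]                   # S = kNN(b(pi_theta), A)
    return nearest.mean()

# Final {x, y} positions as BCs: a point far from everything in the
# archive is more novel than one sitting among archived behaviors.
archive = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
far, near = novelty(np.array([5.0, 5.0]), archive, k=2), novelty(np.array([0.5, 0.5]), archive, k=2)
```

A KD-tree or ball tree would replace the brute-force scan once the archive grows large, which is one reason the paper adds only one BC per iteration rather than one per sampled perturbation.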
Given an archive A and sampled parameters θ^i_t = θ_t + σε_i, the gradient estimate can be computed:

∇_{θ_t} E_{ε∼N(0,I)}[N(θ_t + σε, A) | A] ≈ (1/(nσ)) Σ_{i=1}^{n} N(θ^i_t, A) ε_i

The gradient estimate obtained tells us how to change the current policy's parameters θ_t to increase the average novelty of our parameter distribution. We condition the gradient estimate on A, as the archive is fixed at the beginning of a given iteration and updated only at the end. We add only the behavior characterization corresponding to each θ_t, as adding those for each sample θ^i_t would inflate the archive and slow the nearest-neighbors computation. As more behavior characterizations are added to A, the novelty landscape changes, resulting in commonly occurring behaviors becoming "boring." Optimizing for expected novelty leads to policies that move towards unexplored areas of behavior space.
NS-ES could operate with a single agent that is rewarded for acting differently than its ancestors. However, to encourage additional diversity and get the benefits of population-based exploration described in SI Sec. 6.2, we can instead create a population of M agents, which we will refer to as the meta-population. Each agent, characterized by a unique θ^m, is rewarded for being different from all prior agents in the archive (ancestors, other agents, and the ancestors of other agents), an idea related to that of Liu et al. [31], which optimizes for a distribution of M diverse, high-performing policies. We hypothesize that the selection of M is domain dependent and that identifying which domains favor which regime is a fruitful area for future research.
We initialize M random parameter vectors and at every iteration select one to update. 
For our experiments, we probabilistically select which θ^m to advance from a discrete probability distribution as a function of θ^m's novelty. Specifically, at every iteration, for a set of agent parameter vectors Π = {θ^1, θ^2, ..., θ^M}, we calculate each θ^m's probability of being selected, P(θ^m), as its novelty normalized by the sum of novelty across all policies:

P(θ^m) = N(θ^m, A) / Σ_{j=1}^{M} N(θ^j, A)    (1)

Having multiple, separate agents represented as independent Gaussians is a simple choice for the meta-population distribution. In future work, more complex sampling distributions that represent the multi-modal nature of meta-population parameter vectors could be tried.
After selecting an individual m from the meta-population, we compute the gradient of expected novelty with respect to m's current parameter vector, θ^m_t, and perform an update step accordingly:

θ^m_{t+1} ← θ^m_t + α (1/(nσ)) Σ_{i=1}^{n} N(θ^{i,m}_t, A) ε_i

where n is the number of sampled perturbations to θ^m_t, α is the stepsize, and θ^{i,m}_t = θ^m_t + σε_i, where ε_i ∼ N(0, I). Once the current parameter vector is updated, b(π_{θ^m_{t+1}}) is computed and added to the shared archive A. The whole process is repeated for a pre-specified number of iterations, as there is no true convergence point of NS. During training, the algorithm preserves the policy with the highest average episodic reward and returns this policy once training is complete. Although Salimans et al. [24] return only the final policy after training with ES, the ES experiments in this work return the best-performing policy to facilitate fair comparison with NS-ES. Algorithm 1 in SI Sec. 6.5 outlines a simple, parallel implementation of NS-ES. 
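The selection rule of Eq. (1) and the NS-ES update can be sketched as follows. This is a simplified single-process illustration with our own names and hyperparameters; we also rank-normalize the novelty values before weighting, mirroring the rank normalization applied to reward in [24]:

```python
import numpy as np

def centered_ranks(x):
    """Rank-transform values to [-0.5, 0.5] (rank normalization, as in [24])."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks / (len(x) - 1) - 0.5

def select_agent(novelties, rng):
    """Eq. (1): choose meta-population member m with probability
    N(theta^m, A) / sum_j N(theta^j, A)."""
    p = np.asarray(novelties, dtype=float)
    return rng.choice(len(p), p=p / p.sum())

def ns_es_step(theta_m, novelty_fn, archive, sigma=0.02, alpha=0.01, n=200, rng=None):
    """One NS-ES update: step theta^m along the estimated gradient of
    expected novelty with respect to the (fixed) archive."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n, theta_m.size))          # eps_i ~ N(0, I)
    nov = np.array([novelty_fn(theta_m + sigma * e, archive) for e in eps])
    weights = centered_ranks(nov)
    grad = (weights[:, None] * eps).sum(axis=0) / (n * sigma)
    return theta_m + alpha * grad
```

After the step, the updated policy's BC would be appended to the shared archive, changing the novelty landscape for every agent at the next iteration.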
It is important to note that the addition of the archive and the replacement of the fitness function with novelty does not damage the scalability of the ES optimization procedure (SI Sec. 6.4).
3.2 QD-ES Algorithms: NSR-ES and NSRA-ES
NS-ES alone can enable agents to avoid deceptive local optima in the reward function. Reward signals, however, are still very informative and discarding them completely may cause performance to suffer. Consequently, we train a variant of NS-ES, which we call NSR-ES, that combines the reward ("fitness") and novelty calculated for a given set of policy parameters θ. Similar to NS-ES and ES, NSR-ES operates on entire episodes and can thus evaluate reward and novelty simultaneously for any sampled parameter vector θ^{i,m}_t = θ^m_t + σε_i. Specifically, we compute f(θ^{i,m}_t) and N(θ^{i,m}_t, A), average the two values, and set the average as the weight for the corresponding ε_i. The averaging process is integrated into the parameter update rule as:

θ^m_{t+1} ← θ^m_t + α (1/(nσ)) Σ_{i=1}^{n} [f(θ^{i,m}_t) + N(θ^{i,m}_t, A)] / 2 · ε_i

Intuitively, the algorithm follows the approximated gradient in parameter-space towards policies that both exhibit novel behaviors and achieve high rewards. Often, however, the scales of f(θ) and N(θ, A) differ. To combine the two signals effectively, we rank-normalize f(θ^{i,m}_t) and N(θ^{i,m}_t, A) independently before computing the average. Optimizing a linear combination of novelty and reward was previously explored in Cuccu and Gomez [32] and Cuccu et al. [33], but not with large neural networks on high-dimensional problems. 
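The NSR-ES update, including the independent rank normalization of reward and novelty, can be sketched as below (a minimal illustration with our own names and hyperparameters, not the authors' parallel implementation):

```python
import numpy as np

def centered_ranks(x):
    """Rank-transform values to [-0.5, 0.5], as in [24]."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks / (len(x) - 1) - 0.5

def nsr_es_step(theta, fitness_fn, novelty_fn, archive,
                sigma=0.02, alpha=0.01, n=200, rng=None):
    """One NSR-ES update: rank-normalize reward and novelty independently
    (their raw scales differ), average them per perturbation, and use the
    average as that perturbation's weight."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n, theta.size))
    thetas = theta + sigma * eps
    f = centered_ranks(np.array([fitness_fn(t) for t in thetas]))
    nov = centered_ranks(np.array([novelty_fn(t, archive) for t in thetas]))
    weights = (f + nov) / 2.0          # the (f + N) / 2 term of the update rule
    grad = (weights[:, None] * eps).sum(axis=0) / (n * sigma)
    return theta + alpha * grad
```

Both signals come from the same episode rollouts, so combining them adds no extra environment interaction over plain ES.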
The result of NSR-ES is a set of M agents being optimized to be high-performing, yet different from each other.
NSR-ES has an equal weighting of the performance and novelty gradients that is static across training. We explore a further extension of NSR-ES called NSRAdapt-ES (NSRA-ES), which takes advantage of the opportunity to dynamically weight the priority given to the performance gradient f(θ^{i,m}_t) vs. the novelty gradient N(θ^{i,m}_t, A) by intelligently adapting a weighting parameter w during training. By doing so, the algorithm can follow the performance gradient when it is making progress, increasingly try different things if stuck in a local optimum, and switch back to following the performance gradient once unstuck. For a specific w at a given generation, the parameter update rule for NSRA-ES is expressed as follows:

θ^m_{t+1} ← θ^m_t + α (1/(nσ)) Σ_{i=1}^{n} [w f(θ^{i,m}_t) ε_i + (1 − w) N(θ^{i,m}_t, A) ε_i]

We set w = 1.0 initially and decrease it if performance stagnates across a fixed number of generations. We continue decreasing w until performance increases, at which point we increase w. While many previous works have adapted exploration pressure online by learning the amount of noise to add to the parameters [25, 26, 28, 34], such approaches rest on the assumption that an increased amount of parameter noise leads to increased behavioral diversity, which is often not the case (e.g. too much noise may lead to degenerate policies) [20]. Here we directly adapt the weighting between behavioral diversity and performance, which more directly controls the trade-off of interest. SI Sec. 6.5 provides a more detailed description of how we adapt w as well as pseudocode for NSR-ES and NSRA-ES. 
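The adaptive schedule for w can be sketched as a small state machine; the caller would then weight the two gradient estimates as w · reward_grad + (1 − w) · novelty_grad. The delta and patience values below are illustrative placeholders, not the paper's tuned settings (SI Sec. 6.5 gives the actual procedure):

```python
class WeightAdapter:
    """Sketch of NSRA-ES's adaptive trade-off: w weights the reward
    gradient, (1 - w) the novelty gradient. Start fully greedy (w = 1.0);
    if best reward stagnates for `patience` generations, shift weight
    toward novelty; when reward improves again, shift back toward reward."""

    def __init__(self, delta=0.05, patience=50):
        self.w = 1.0
        self.delta = delta
        self.patience = patience
        self.best = float("-inf")
        self.stagnant = 0

    def update(self, reward):
        if reward > self.best:                       # progress: exploit more
            self.best = reward
            self.stagnant = 0
            self.w = min(1.0, self.w + self.delta)
        else:                                        # stuck: explore more
            self.stagnant += 1
            if self.stagnant >= self.patience:
                self.w = max(0.0, self.w - self.delta)
                self.stagnant = 0
        return self.w
```

As the Seaquest discussion later in the paper suggests, how quickly and how often w is adjusted matters: too-slow adaptation can leave the agent in a local optimum for many generations.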
Source code and hyperparameter settings for our experiments can be found here: https://github.com/uber-research/deep-neuroevolution

4 Experiments
4.1 Simulated Humanoid Locomotion Problem
We first tested our implementation of NS-ES, NSR-ES, and NSRA-ES on the problem of having a simulated humanoid learn to walk. We chose this problem because it is a challenging continuous control benchmark where most would presume a reward function is necessary to solve the problem. With NS-ES, we test whether searching through novelty alone can find solutions to the problem. A similar result has been shown for much smaller neural networks (≈50-100 parameters) on a simpler simulated biped [20], but here we test whether NS-ES can enable locomotion at the scale of deep neural networks in a much more sophisticated environment. NSR-ES and NSRA-ES experiments then test the effectiveness of combining exploration and reward pressures on this difficult continuous control problem. SI Sec. 6.7 outlines complete experimental details.
The first environment is a slightly modified version of OpenAI Gym's Humanoid-v1 environment. Because the heart of this challenge is to learn to walk efficiently, not to walk in a particular direction, we modified the environment reward to be isotropic (i.e. indifferent to the direction the humanoid traveled) by setting the velocity component of reward to distance traveled from the origin as opposed to distance traveled in the positive x direction.
As described in Sec. 2.2, novelty search requires a domain-specific behavior characterization (BC) for each policy, which we denote as b(π_{θ_i}). For the Humanoid Locomotion problem the BC is the agent's final {x, y} location, as it was in Lehman and Stanley [20]. NS also requires a distance function between two BCs. 
Following Lehman and Stanley [20], the distance function is the square of the Euclidean distance:

dist(b(π_{θ_i}), b(π_{θ_j})) = ||b(π_{θ_i}) − b(π_{θ_j})||₂²

The first result is that ES obtains a higher final reward than NS-ES (p < 0.05) and NSR-ES (p < 0.05); these and all future p values are calculated via a Mann-Whitney U test. The performance gap is more pronounced for smaller amounts of computation (Fig. 1 (c)). However, many will be surprised that NS-ES is still able to consistently solve the problem despite ignoring the environment's reward function. While the BC is aligned [35] with the problem in that reaching new {x, y} positions tends to also encourage walking, there are many parts of the reward function that the BC ignores (e.g. energy-efficiency, impact costs).
We hypothesize that with a sophisticated BC that encourages diversity in all of the behaviors the multi-part reward function cares about, there would be no performance gap. However, such a BC may be difficult to construct and would likely further exaggerate the amount of computation required for NS to match ES. NSR-ES demonstrates faster learning than NS-ES due to the addition of reward pressure, but ultimately results in similar final performance after 600 generations (p > 0.05, Fig. 1 (c)). Promisingly, on this non-deceptive problem, NSRA-ES does not pay a cost for its latent exploration capabilities and performs similarly to ES (p > 0.05).
The Humanoid Locomotion problem does not appear to be a deceptive problem, at least for ES. To test whether NS-ES, NSR-ES, and NSRA-ES specifically help with deception, we also compare ES to these algorithms on a variant of this environment we created that adds a deceptive trap (a local optimum) that must be avoided for maximum performance (Fig. 1 (b)). 
In this new environment, a small three-sided enclosure is placed at a short distance in front of the starting position of the humanoid and the reward function is simply the distance traveled in the positive x direction.
Fig. 1 (d) and SI Sec. 6.8 show the reward received by each algorithm, and Fig. 2 shows how the algorithms differ qualitatively during search on this problem. In every run, ES gets stuck in the local optimum due to following reward into the deceptive trap. NS-ES is able to avoid the local optimum as it ignores reward completely and instead seeks to thoroughly explore the environment, but doing so also means it makes slow progress according to the reward function. NSR-ES demonstrates superior performance to NS-ES (p < 0.01) and ES (p < 0.01) as it benefits from both optimizing for reward and escaping the trap via the pressure for novelty. Like ES, NSRA-ES initially walks into the deceptive trap, as it is at first optimizing for reward only. Once stuck in the local optimum, the algorithm continually increases its pressure for novelty, allowing it to escape the deceptive trap and ultimately achieve much higher rewards than NS-ES (p < 0.01) and NSR-ES (p < 0.01). Based just on these two domains, NSRA-ES seems to be the best algorithm across the board because it can exploit well when there is no deception, add exploration dynamically when there is, and return to exploiting once unstuck. The latter is likely why NSRA-ES outperforms even NSR-ES on the deceptive humanoid locomotion problem.

Figure 1: Humanoid Locomotion Experiment. The humanoid locomotion task is shown without a deceptive trap (a) and with one (b), and results on them in (c) and (d), respectively. Here and in similar figures below, the median reward (of the best seen policy so far) per generation across 10 runs is plotted as the bold line with 95% bootstrapped confidence intervals of the median (shaded). Following Salimans et al. 
[24], policy performance is measured as the average performance over \u21e030\nstochastic evaluations.\n\nFig. 2 also shows the bene\ufb01t of maintaining a meta-population (M = 5) in the NS-ES, NSR-ES, and\nNSRA-ES algorithms. Some lineages get stuck in the deceptive trap, incentivizing other policies\nto explore around the trap. At that point, all three algorithms begin to allocate more computational\nresources to this newly discovered, more promising strategy via the probabilistic selection method\noutlined in Sec. 3.1. Both the novelty pressure and having a meta-population thus appear to be useful,\nbut in future work we look to disambiguate the relative contribution made by each.\n\n4.2 Atari\nWe also tested NS-ES, NSR-ES, and NSRA-ES on numerous games from the Atari 2600 environment\nin OpenAI Gym [36]. Atari games serve as an informative benchmark due to their high-dimensional\npixel input and complex control dynamics; each game also requires different levels of exploration\n\n6\n\n\fFigure 2: ES gets stuck in the deceptive local optimum while NS-ES, NSR-ES & NSRA-ES\nexplore to \ufb01nd better solutions. An overhead view of a representative run is shown for each\nalgorithm on the Humanoid Locomotion with Deceptive Trap problem. The black star represents\nthe humanoid\u2019s starting point. Each diamond represents the \ufb01nal location of a generation\u2019s policy,\ni.e. \u21e1(\u2713t), with darker shading for later generations. For NS-ES, NSR-ES, & NSRA-ES plots, each\nof the M = 5 agents in the meta-population and its descendants are represented by different colors.\nSimilar plots for all 10 runs of each algorithm are provided in SI Sec. 6.10.\n\nto solve. To demonstrate the effectiveness of NS-ES, NSR-ES, and NSRA-ES for local optima\navoidance and directed exploration, we tested on 12 different games with varying levels of complexity,\nas de\ufb01ned by the taxonomy in Bellemare et al. [12]. 
Primarily, we focused on games in which, during\npreliminary experiments, we observed ES prematurely converging to local optima (Seaquest, Q*Bert,\nFreeway, Frostbite, and Beam Rider). However, we also included a few other games where ES did not\nconverge to local optima to understand the performance of our algorithm in less-deceptive domains\n(Alien, Amidar, Bank Heist, Gravitar, Zaxxon, and Montezuma\u2019s Revenge). SI Sec. 6.6 describes\nadditional experimental details. We report the median reward across 5 independent runs of the best\npolicy found in each run (see Table 1).\nFor the behavior characterization, we follow an idea from Naddaf [37] and concatenate Atari game\nRAM states for each timestep in an episode. RAM states in Atari 2600 games are integer-valued vec-\ntors of length 128 in the range [0, 255] that describe all the state variables in a game (e.g. the location\nof the agent and enemies). Ultimately, we want to automatically learn behavior characterizations\ndirectly from pixels. A plethora of recent research suggests that this is a viable approach [12, 38, 39].\nFor example, low-dimensional, latent representations of the state space could be extracted from\nauto-encoders [11, 40] or networks trained to predict future states [14, 16]. In this work, however,\nwe focus on learning with a pre-de\ufb01ned, informative behavior characterization and leave the task\nof jointly learning a policy and latent representation of states for future work. In effect, basing\nnovelty on RAM states provides a con\ufb01rmation of what is possible in principle with a suf\ufb01ciently\ninformed behavior characterization. We also emphasize that, while during training NS-ES, NSR-ES,\nand NSRA-ES use RAM states to guide novelty search, the policy itself, \u21e1\u2713t, operates only on image\ninput and can be evaluated without any RAM state information. 
The distance between behavior characterizations is the sum of L2-distances at each timestep t:

dist(b(π_{θ_i}), b(π_{θ_j})) = Σ_{t=1}^{T} ||b_t(π_{θ_i}) − b_t(π_{θ_j})||₂

For trajectories of different lengths, the last state of the shorter trajectory is repeated until the lengths of both match. Because the BC distance is not normalized by trajectory length, novelty is biased to be higher for longer trajectories. In some Atari games, this bias can lead to higher performing policies, but in other games longer trajectories tend to have a neutral or even negative relationship with performance. In this work we found it beneficial to keep novelty unnormalized, but further investigation into different BC designs could yield additional improvements.
Table 1 compares the performance of each algorithm discussed above to each other and with those from two popular methods for exploration in RL, namely Noisy DQN [29] and A3C+ [12]. Noisy DQN and A3C+ only outperform all the ES variants considered in this paper on 3/12 games and 2/12 games, respectively. NSRA-ES, however, outperforms the other algorithms on 5/12 games, suggesting that NS and QD are viable alternatives to contemporary exploration methods.
While the novelty pressure in NS-ES does help it avoid local optima in some cases (discussed below), optimizing for novelty alone does not result in higher reward in most games (although it does in some). However, it is surprising how well NS-ES does in many tasks given that it is not explicitly attempting to increase reward. Because NSR-ES combines exploration with reward maximization, it is able to avoid local optima encountered by ES while also learning to play the game well. In each of the 5 games in which we observed ES converging to premature local optima (i.e. Seaquest, Q*Bert, Freeway, Beam Rider, Frostbite), NSR-ES achieves a higher median reward. 
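Returning to the trajectory-based BC distance defined above, the padding and unnormalized sum can be sketched as follows (function and variable names are ours, for illustration):

```python
import numpy as np

def bc_distance(bc_i, bc_j):
    """Sum of per-timestep L2 distances between two trajectory BCs (e.g.
    sequences of 128-dim Atari RAM vectors). The shorter trajectory's last
    state is repeated until the lengths match, and the sum is deliberately
    left unnormalized, so long, distinct trajectories register as more novel."""
    bc_i, bc_j = np.asarray(bc_i, dtype=float), np.asarray(bc_j, dtype=float)
    T = max(len(bc_i), len(bc_j))

    def pad(b):
        # Repeat the final state until the trajectory has length T.
        return np.vstack([b, np.repeat(b[-1:], T - len(b), axis=0)])

    bc_i, bc_j = pad(bc_i), pad(bc_j)
    return np.linalg.norm(bc_i - bc_j, axis=1).sum()
```

Normalizing by T instead would remove the bias toward longer episodes discussed above; the paper found the unnormalized form worked better overall.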
In the other games, ES does not benefit from adding an exploration pressure and NSR-ES performs worse. This is expected: if there are no local optima and reward maximization is sufficient to perform well, the extra cost of encouraging exploration will hurt performance. Mitigating such costs, NSRA-ES optimizes solely for reward until a performance plateau is reached. After that, the algorithm assigns more weight to novelty and thus encourages exploration. We found this to be beneficial, as NSRA-ES achieves higher median rewards than ES on 8/12 games and NSR-ES on 9/12 games. Its superior performance validates NSRA-ES as the best among the evolutionary algorithms considered and suggests that using an adaptive weighting between novelty and reward is a promising direction for future research.

In the game Seaquest, the avoidance of local optima is particularly important (Fig. 3). ES performance flatlines early at a median reward of 960, which corresponds to a behavior of the agent descending to the bottom, shooting fish, and never coming up for air. This strategy represents a classic local optimum, as coming up for air requires temporarily foregoing reward, but enables far higher rewards to be earned in the long run (Salimans et al. [24] did not encounter this particular local optimum with their hyperparameters, but the point is that ES without exploration can get stuck indefinitely on whichever major local optimum it encounters). NS-ES learns to come up for air in all 5 runs and achieves a slightly higher median reward of 1044.5 (p < 0.05). NSR-ES also avoids this local optimum, but its additional reward signal helps it play the game better (e.g. it is better at shooting enemies), resulting in a significantly higher median reward of 2329.7 (p < 0.01). Because NSRA-ES takes reward steps initially, it falls into the same local optimum as ES.
Because we chose (without performing a hyperparameter search) to change the weighting w between performance and novelty infrequently (only every 50 generations), and to change it by a small amount (only 0.05), 200 generations was not long enough to emphasize novelty enough to escape this local optimum. We found that changing w every 10 generations instead remedies this problem, and the performance of NSRA-ES then equals that of NSR-ES (p > 0.05, Fig. 3). These results motivate future research into better hyperparameters for changing w, and into more complex, intelligent methods of dynamically adjusting w, including with a population of agents with different dynamic w strategies.

The Atari results illustrate that NS is an effective mechanism for encouraging directed exploration, given an appropriate behavior characterization, for complex, high-dimensional control tasks. A novelty pressure alone produces impressive performance on many games, sometimes even beating ES. Combining novelty and reward performs far better, and improves ES performance on tasks where it appears to get stuck on local optima.

5 Discussion and Conclusion

NS and QD are classes of evolutionary algorithms designed to avoid local optima and promote exploration in RL environments, but they have previously been shown to work only with small neural networks (on the order of hundreds of connections). ES was recently shown to be capable of training deep neural networks that can solve challenging, high-dimensional RL tasks [24]. It is also much faster when many parallel computers are available. Here we demonstrate that, when hybridized with ES, NS and QD not only preserve the attractive scalability properties of ES, but also help ES explore and avoid local optima in domains with deceptive reward functions. To the best of our knowledge, this paper reports the first attempt at augmenting ES to perform directed exploration in high-dimensional environments.
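For concreteness, the NSRA-ES weight-adaptation rule evaluated above can be sketched as follows. The plateau test is simplified, `delta` and `patience` correspond to the 0.05 step and the 50- (or 10-) generation interval mentioned in the text, and all names are ours:

```python
def adapt_weight(w, stagnant_gens, improved, delta=0.05, patience=50):
    """Sketch of NSRA-ES-style adaptation of the weight w in [0, 1],
    where the update direction mixes reward and novelty gradients as
    roughly g = w * g_reward + (1 - w) * g_novelty. Reward improvement
    pushes w back toward pure reward (w = 1); `patience` generations
    without improvement shifts weight toward novelty instead."""
    if improved:
        return min(1.0, w + delta), 0
    stagnant_gens += 1
    if stagnant_gens >= patience:
        return max(0.0, w - delta), 0
    return w, stagnant_gens
```

Under this sketch, setting `patience=10` corresponds to the faster schedule that let NSRA-ES escape the Seaquest local optimum in Fig. 3, since w shifts toward novelty five times sooner during a plateau.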
We thus provide an option for those interested in taking advantage of the scalability of ES, but who also want higher performance on domains that have reward functions that are sparse or have local optima. The latter scenario will likely hold for most challenging, real-world domains that machine learning practitioners will wish to tackle in the future.

GAME          ES       NS-ES    NSR-ES   NSRA-ES    DQN   NOISYDQN     A3C+
ALIEN         3283.8   1124.5   2186.2   4846.4    2404     2403     1848.3
AMIDAR         322.2    134.7    255.8    305.0    1610      924      964.7
BANK HEIST     140.0     50.0    130.0    152.9    1068      455      991.9
BEAM RIDER†    871.7    805.5    876.9    906.4   20793    10564     5992.1
FREEWAY†        31.1     22.8     32.3     32.9      32       31       27.3
FROSTBITE†     367.4    250.0   2978.6   3785.4     753     1000      506.6
GRAVITAR      1129.4    527.5    732.9   1140.9     366      447      246.02
MONTEZUMA        0.0      0.0      0.0      0.0       3        2      142.5
MS. PACMAN    4498.0   2252.2   3495.2   5171.0    2674     2722     2380.6
Q*BERT†       1075.0   1234.1   1400.0   1350.0   15545    11241    15804.7
SEAQUEST†      960.0   1044.5   2329.7    960.0    4163     2282     2274.1
ZAXXON        9885.0   1761.9   6723.3   7303.3    4806     6920     7956.1

Table 1: Atari Results. The scores are the median, across 5 runs, of the mean reward (over ~30 stochastic evaluations) of each run's best policy. SI Sec. 6.9 plots performance over time, along with bootstrapped confidence intervals of the median, for each ES algorithm for each game. In some cases rewards reported here for ES are lower than those in Salimans et al. [24], which could be due to differing hyperparameters (SI Sec. 6.6). Games with a † are those in which we observed ES to converge prematurely, presumably due to it encountering local optima. The DQN and A3C results are reported after 200M frames of training (one to many days). All evolutionary algorithm results are reported after ~2B frames of training (~2 hours).

Figure 3: Seaquest Case Study.
By switching the weighting between novelty and reward, w, every 10 generations instead of every 50, NSRA-ES is able to overcome the local optimum ES finds and achieve high scores on Seaquest.

Additionally, this work highlights alternate options for exploration in RL domains. The first is to holistically describe the behavior of an agent instead of defining a per-state exploration bonus. The second is to encourage a population of agents to simultaneously explore different aspects of an environment. These new options thereby open new research areas into (1) comparing holistic vs. state-based exploration, and population-based vs. single-agent exploration, more systematically and on more domains, (2) investigating the best way to combine the merits of all of these options, and (3) hybridizing holistic and/or population-based exploration with other algorithms that work well on deep RL problems, such as policy gradients and DQN. It should be relatively straightforward to combine NS with policy gradients (NS-PG). It is less obvious how to combine it with Q-learning (NS-Q), but it may be possible.

As with any exploration method, encouraging novelty can come at a cost if such an exploration pressure is not necessary. In Atari games such as Alien and Gravitar, and in the Humanoid Locomotion problem without a deceptive trap, both NS-ES and NSR-ES perform worse than ES. To avoid this cost, we introduce the NSRA-ES algorithm, which attempts to invest in exploration only when necessary. NSRA-ES tends to produce better results than ES, NS-ES, and NSR-ES across many different domains, making it an attractive new algorithm for deep RL tasks.
Similar strategies for adapting the amount of exploration online may also be advantageous for other deep RL algorithms. How best to dynamically balance exploitation and exploration in deep RL remains an open, critical research challenge, and our work underscores the importance of, and motivates further, such work. Overall, our work shows that ES is a rich and unexploited parallel path for deep RL research. It is worth exploring not only because it is an alternative algorithm for RL problems, but also because innovations created in the ES family of algorithms could be ported to improve other deep RL algorithm families, like policy gradients and Q-learning, or hybrids thereof.

Acknowledgments

We thank all of the members of Uber AI Labs, in particular Thomas Miconi, Rui Wang, Peter Dayan, John Sears, Joost Huizinga, and Theofanis Karaletsos, for helpful discussions. We also thank Justin Pinkul, Mike Deats, Cody Yancey, Joel Snow, Leon Rosenshein, and the entire OpusStack Team inside Uber for providing our computing platform and for technical support.

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. 1998.

[2] Gunar E. Liepins and Michael D. Vose. Deceptiveness and genetic algorithm dynamics. Technical Report CONF-9007175-1, Oak Ridge National Lab., TN (USA); Tennessee Univ., Knoxville, TN (USA), 1990.

[3] Joel Lehman and Kenneth O. Stanley. Novelty search and the problem with objectives. In Genetic Programming Theory and Practice IX (GPTP 2011), 2011.

[4] Kenji Kawaguchi. Deep learning without poor local minima. In NIPS, pages 586–594, 2016.

[5] Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.
arXiv preprint arXiv:1406.2572, 2014.

[6] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[7] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.

[8] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.

[9] Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.

[10] Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.

[11] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, pages 2750–2759, 2017.

[12] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In NIPS, pages 1471–1479, 2016.

[13] Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.

[14] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

[15] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel.
VIME: Variational information maximizing exploration. In NIPS, pages 1109–1117, 2016.

[16] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. arXiv preprint arXiv:1705.05363, 2017.

[17] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret. Robots that can adapt like animals. Nature, 521:503–507, 2015. doi: 10.1038/nature14422.

[18] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.

[19] Justin K. Pugh, Lisa B. Soros, and Kenneth O. Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016. ISSN 2296-9144.

[20] Joel Lehman and Kenneth O. Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In GECCO '11: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 211–218, 2011.

[21] Roby Velez and Jeff Clune. Novelty search creates robots with general skills for exploration. In Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, GECCO '14, pages 737–744, 2014.

[22] Joost Huizinga, Jean-Baptiste Mouret, and Jeff Clune. Does aligning phenotypic and genotypic modularity improve the evolution of neural networks? In Proceedings of the 2016 Genetic and Evolutionary Computation Conference (GECCO), pages 125–132, 2016.

[23] Ingo Rechenberg. Evolutionsstrategien. In Simulationsmethoden in der Medizin und Biologie, pages 83–114. 1978.

[24] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

[25] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber.
Natural evolution strategies. In IEEE Congress on Evolutionary Computation (CEC 2008), pages 3381–3387, 2008.

[26] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.

[27] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[28] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.

[29] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.

[30] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279, 2013.

[31] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.

[32] Giuseppe Cuccu and Faustino Gomez. When novelty is not enough. In European Conference on the Applications of Evolutionary Computation, pages 234–243. Springer, 2011.

[33] Giuseppe Cuccu, Faustino Gomez, and Tobias Glasmachers. Novelty-based restarts for evolution strategies. In Evolutionary Computation (CEC), 2011 IEEE Congress on, pages 158–163. IEEE, 2011.

[34] Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.

[35] Justin K. Pugh, Lisa B. Soros, Paul A. Szerlip, and Kenneth O. Stanley.
Confronting the challenge of quality diversity. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation (GECCO), pages 967–974, 2015.

[36] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[37] Yavar Naddaf. Game-independent AI agents for playing Atari 2600 console games. 2010.

[38] Sascha Lange and Martin Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In The International Joint Conference on Neural Networks, pages 1–8. IEEE, 2010.

[39] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[40] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In NIPS, pages 4790–4798, 2016.

[41] Christopher Stanton and Jeff Clune. Curiosity search: Producing generalists by encouraging individuals to continually explore and acquire skills throughout their lifetime. PLoS ONE, 2016.

[42] Phillip Paquette. Super Mario Bros. in OpenAI Gym, 2016.

[43] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835, 2017.

[44] Roby Velez and Jeff Clune. Diffusion-based neuromodulation can eliminate catastrophic forgetting in simple neural networks. arXiv preprint arXiv:1705.07241, 2017.

[45] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

[46] Antoine Cully and Jean-Baptiste Mouret. Behavioral repertoire learning in robotics.
In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pages 175–182. ACM, 2013.

[47] Andrei A. Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

[48] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[49] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.

[50] Jorge Gomes, Pedro Mariano, and Anders Lyhne Christensen. Systematic derivation of behaviour characterisations in evolutionary robotics. arXiv preprint arXiv:1407.0577, 2014.

[51] Elliot Meyerson, Joel Lehman, and Risto Miikkulainen. Learning behavior characterizations for novelty search. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016), Denver, Colorado, 2016. ACM.

[52] Joel Lehman and Kenneth O. Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223, 2011. URL http://www.mitpressjournals.org/doi/pdf/10.1162/EVCO_a_00025.

[53] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, pages 2234–2242, 2016.

[54] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.

[55] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.