{"title": "Recurrent World Models Facilitate Policy Evolution", "book": "Advances in Neural Information Processing Systems", "page_first": 2450, "page_last": 2462, "abstract": "A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state of the art results in various environments. We also train our agent entirely inside of an environment generated by its own internal world model, and transfer this policy back into the actual environment. Interactive version of this paper is available at https://worldmodels.github.io", "full_text": "Recurrent World Models Facilitate Policy Evolution\n\nDavid Ha\nGoogle Brain\nTokyo, Japan\n\nhadavid@google.com\n\nJ\u00fcrgen Schmidhuber\n\nNNAISENSE\n\nThe Swiss AI Lab, IDSIA (USI & SUPSI)\n\njuergen@idsia.ch\n\nAbstract\n\nA generative recurrent neural network is quickly trained in an unsupervised manner\nto model popular reinforcement learning environments through compressed spatio-\ntemporal representations. The world model\u2019s extracted features are fed into compact\nand simple policies trained by evolution, achieving state of the art results in various\nenvironments. We also train our agent entirely inside of an environment generated\nby its own internal world model, and transfer this policy back into the actual\nenvironment. Interactive version of paper: https://worldmodels.github.io\n\n1\n\nIntroduction\n\nHumans develop a mental model of the world based on what they are able to perceive with their\nlimited senses, learning abstract representations of both spatial and temporal aspects of sensory inputs.\nFor instance, we are able to observe a scene and remember an abstract description thereof [7, 67]. Our\ndecisions and actions are in\ufb02uenced by our internal predictive model. 
For example, what we perceive\nat any given moment seems to be governed by our predictions of the future [59, 52]. One way of\nunderstanding the predictive model inside our brains is that it might not simply be about predicting\nthe future in general, but predicting future sensory data given our current motor actions [38, 48]. We\nare able to instinctively act on this predictive model and perform fast re\ufb02exive behaviours when we\nface danger [55], without the need to consciously plan out a course of action [52].\nFor many reinforcement learning (RL) problems [37, 96, 106], an arti\ufb01cial RL agent may also bene\ufb01t\nfrom a predictive model (M) of the future [104, 95] (model-based RL). The backpropagation algorithm\n[50, 39, 103] can be used to train a large M in form of a neural network (NN). In partially observable\nenvironments, we can implement M through a recurrent neural network (RNN) [74, 75, 78, 49] to\nallow for better predictions based on memories of previous observation sequences.\n\nFigure 1: We build probabilistic generative models of OpenAI Gym [5] environments. These models\ncan mimic the actual environments (left). We test trained policies in the actual environments (right).\n\nIn fact, our M will be a large RNN that learns to predict the future given the past in an unsupervised\nmanner. M\u2019s internal representations of memories of past observations and actions are perceived and\nexploited by another NN called the controller (C) which learns through RL to perform some task\nwithout a teacher. 
A small and simple C limits C\u2019s credit assignment problem to a comparatively\nsmall search space, without sacri\ufb01cing the capacity and expressiveness of the large and complex M.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fWe combine several key concepts from a series of papers from 1990\u20132015 on RNN-based world\nmodels and controllers [74, 75, 78, 76, 83] with more recent tools from probabilistic modelling, and\npresent a simpli\ufb01ed approach to test some of those key concepts in modern RL environments [5].\nExperiments show that our approach can be used to solve a challenging race car navigation from\npixels task that previously has not been solved using more traditional methods.\nMost existing model-based RL approaches learn a model of the RL environment, but still train on the\nactual environment. Here, we also explore fully replacing an actual RL environment with a generated\none, training our agent\u2019s controller C only inside of the environment generated by its own internal\nworld model M, and transfer this policy back into the actual environment.\nTo overcome the problem of an agent exploiting imperfections of the generated environments,\nwe adjust a temperature parameter of M to control the amount of uncertainty of the generated\nenvironments. We train C inside of a noisier and more uncertain version of its generated environment,\nand demonstrate that this approach helps prevent C from taking advantage of the imperfections of M.\nWe will also discuss other related works in the model-based RL literature that share similar ideas of\nlearning a dynamics model and training an agent using this model.\n\n2 Agent Model\n\nOur simple model is inspired by our own cognitive system. Our agent has a visual sensory component\nV that compresses what it sees into a small representative code. 
It also has a memory component M that makes predictions about future codes based on historical information. Finally, our agent has a decision-making component C that decides what actions to take based only on the representations created by its vision and memory components.

Figure 2: Flow diagram showing how V, M, and C interact with the environment (left). Pseudocode for how our agent model is used in the OpenAI Gym [5] environment (right).

Let the agent's life span be defined as a sequence of time steps, t = 1, 2, . . . , tdone. Let Nz, Na, Nh be positive integer constants. The environment provides our agent with a high-dimensional input observation at each time step t. This input is usually a 2D image frame that is part of a video sequence. The role of V is to learn an abstract, compressed representation of each observed input at each time step. Here, we use a Variational Autoencoder (VAE) [42, 71] as V to compress an image observed at time step t into a latent vector zt ∈ RNz, with Nz being a hyperparameter. While V's role is to compress what the agent sees at each time step, we also want to compress what happens over time. The RNN M serves as a predictive model of the future zt vectors that V is expected to produce. Since many complex environments are stochastic in nature, we train our RNN to output a probability density function p(zt) instead of a deterministic prediction of zt.

The agent takes an action at ∈ RNa at time t, where Na is the dimension of the action space. In our approach, we approximate p(zt) as a mixture of Gaussians, and train M to output the probability distribution of the next latent vector zt+1 given the current and past information made available to it. More specifically, the RNN, with Nh hidden units, models P(zt+1 | at, zt, ht), where ht ∈ RNh is the hidden state of the RNN at time step t.
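To make the mixture-density output concrete, here is a minimal sketch of the per-dimension likelihood that training M maximizes. The scalar, single-dimension simplification and the function name are ours; the actual M produces one such mixture per latent dimension from a layer on top of ht.

```python
import math

def mdn_nll(pi, mu, sigma, z_next):
    """Negative log-likelihood of the observed next latent z_next under a
    mixture of Gaussians with weights pi, means mu, and std devs sigma.
    For clarity this treats a single latent dimension; the real M outputs
    one such mixture per dimension of z."""
    density = 0.0
    for w, m, s in zip(pi, mu, sigma):
        density += w * math.exp(-0.5 * ((z_next - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return -math.log(density)
```

Training M then amounts to minimizing this quantity summed over time steps and latent dimensions of each recorded sequence.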
During sampling, we can adjust a real-valued temperature parameter τ to control model uncertainty, as done in previous work [28]. We will find adjusting τ useful for training our controller later on. This approach is known as a Mixture Density Network [3] combined with an RNN (MDN-RNN) [24], and has been applied in the past to sequence generation problems such as generating handwriting [24, 6] and sketches [28].

C is responsible for determining the course of actions to take in order to maximize the expected cumulative reward of the agent during a rollout of the environment. In our experiments, we deliberately make C as simple and small as possible, and train it separately from V and M, so that most of our agent's complexity resides in V and M. C is a simple single-layer linear model that maps zt and ht directly to action at at each time step: at = Wc [zt ht] + bc, where Wc ∈ RNa×(Nz+Nh) and bc ∈ RNa are the parameters that map the concatenated input [zt ht] to the output action at.

This minimal design for C also offers important practical benefits. Advances in deep learning provided us with the tools to train large, sophisticated models efficiently, provided we can define a well-behaved, differentiable loss function. V and M are designed to be trained efficiently with the backpropagation algorithm using modern GPU accelerators, so we would like most of the model's complexity and parameters to reside in V and M. The number of parameters of C, a linear model, is minimal in comparison.
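The linear controller at = Wc [zt ht] + bc fits in a few lines. The dimensions below follow the CarRacing setup (three continuous actions) and are illustrative; the random initialization is only a placeholder for parameters that would be found by evolution.

```python
import random

Nz, Nh, Na = 32, 256, 3  # latent, hidden, and action dimensions (CarRacing-style; illustrative)

# Placeholder parameters; 3 * (32 + 256) + 3 = 867 values, found by evolution in practice.
Wc = [[random.gauss(0.0, 0.1) for _ in range(Nz + Nh)] for _ in range(Na)]
bc = [0.0] * Na

def controller(z, h):
    """at = Wc [zt ht] + bc: a single linear map from the concatenated
    feature vector [z, h] to the action vector."""
    x = z + h  # list concatenation plays the role of [zt ht]
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(Wc, bc)]
```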
This choice allows us to explore more unconventional ways to\ntrain C \u2013 for example, even using evolution strategies (ES) [70, 87] to tackle more challenging RL\ntasks where the credit assignment problem is dif\ufb01cult.\nTo optimize the parameters of C, we chose the Covariance-Matrix Adaptation Evolution Strategy\n(CMA-ES) [29, 30] as our optimization algorithm since it is known to work well for solution spaces\nof up to a few thousand parameters. We evolve parameters of C on a single machine with multiple\nCPU cores running multiple rollouts of the environment in parallel. For more information about the\nmodels, training procedures, and experiment con\ufb01gurations, please see the Supplementary Materials.\n\n3 Car Racing Experiment: World Model for Feature Extraction\n\nIn this section, we describe how we can train the Agent model described earlier to solve a car racing\ntask. To our knowledge, our agent is the \ufb01rst known to solve this task.1\nFrame compressor V and predictive model M can help us extract useful representations of space\nand time. By using these features as inputs of C, we can train a compact C to perform a continuous\ncontrol task, such as learning to drive from pixel inputs for a top-down car racing environment called\nCarRacing-v0 [44]. In this environment, the tracks are randomly generated for each trial, and our\nagent is rewarded for visiting as many tiles as possible in the least amount of time. The agent controls\nthree continuous actions: steering left/right, acceleration, and brake.\n\nAlgorithm 1 Training procedure in our experiments.\n\n1. Collect 10,000 rollouts from a random policy.\n2. Train VAE (V) to encode frames into z \u2208 RNz .\n3. Train MDN-RNN (M) to model P (zt+1 | at, zt, ht).\n4. Evolve controller (C) to maximize the expected cumulative reward of a rollout.\n\nTo train V, we \ufb01rst collect a dataset of 10k random rollouts of the environment. 
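Step 4 of Algorithm 1 can be sketched with a deliberately simplified hill-climbing evolution strategy standing in for CMA-ES (which additionally adapts a full covariance matrix over the search distribution); the point is that the optimizer only ever sees a flat parameter vector and the cumulative reward of a rollout, never gradients or per-step credit.

```python
import random

def evolve(fitness, n_params, pop_size=16, n_gens=30, sigma=0.1):
    """Toy evolution strategy standing in for CMA-ES: perturb the current
    best parameter vector with Gaussian noise and keep any improvement.
    `fitness` maps a flat parameter vector to a scalar score (in our
    setting, the cumulative reward of a rollout)."""
    best = [0.0] * n_params
    best_fit = fitness(best)
    for _ in range(n_gens):
        for _ in range(pop_size):
            candidate = [p + random.gauss(0.0, sigma) for p in best]
            f = fitness(candidate)
            if f > best_fit:
                best, best_fit = candidate, f
    return best, best_fit
```

In the actual setup, `fitness` would install the candidate parameters into C, run a full rollout, and return the cumulative reward, with candidates evaluated in parallel across CPU cores.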
We first have an agent act randomly to explore the environment multiple times, and record the random actions at taken and the resulting observations from the environment. We use this dataset to train our VAE to encode each frame into a low-dimensional latent vector z by minimizing the difference between a given frame and the reconstructed version of the frame produced by the decoder from z. We can now use our trained V to pre-process each frame at time t into zt to train our M. Using this pre-processed data, along with the recorded random actions at taken, our MDN-RNN can now be trained to model P(zt+1 | at, zt, ht) as a mixture of Gaussians.2

In this experiment, V and M have no knowledge about the actual reward signals from the environment. Their task is simply to compress and predict the sequence of image frames observed. Only C has access to the reward information from the environment. Since there are a mere 867 parameters inside the linear C, evolutionary algorithms such as CMA-ES are well suited for this optimization task.

1We find this task interesting because although it is not difficult to train an agent to wobble around randomly generated tracks and obtain a mediocre score, CarRacing-v0 defines solving as getting an average reward of 900 over 100 consecutive trials, which means the agent can only afford very few driving mistakes.

2Although in principle we can train V and M together in an end-to-end manner, we found that training each separately is more practical, achieves satisfactory results, and does not require exhaustive hyperparameter tuning. As images are not required to train M on its own, we can even train on large batches of long sequences of latent vectors encoding the entire 1000 frames of an episode to capture longer-term dependencies, on a single GPU.

3.1 Experiment Results

V without M

Training an agent to drive is not a difficult task if we have a good representation of the
observation. Previous works [35, 46] have shown that with a good set of hand-engineered information about the observation, such as LIDAR information, angles, positions and velocities, one can easily train a small feed-forward network to take this hand-engineered input and output a satisfactory navigation policy. For this reason, we first want to test our agent by handicapping C to only have access to V but not M, so we define our controller as at = Wc zt + bc.

Although the agent is still able to navigate the race track in this setting, we notice it wobbles around and misses the tracks on sharper corners, e.g., see Figure 1 (right). This handicapped agent achieved an average score of 632 ± 251, in line with the performance of other agents on OpenAI Gym's leaderboard [44] and traditional Deep RL methods such as A3C [41, 36]. Adding a hidden layer to C's policy network helps to improve the results to 788 ± 141, but not enough to solve this environment.

Table 1: CarRacing-v0 results over 100 trials.

Method                       Average Score
DQN [66]                     343 ± 18
A3C (continuous) [36]        591 ± 45
A3C (discrete) [41]          652 ± 10
Gym Leader [44]              838 ± 11
V model                      632 ± 251
V model with hidden layer    788 ± 141
Full World Model             906 ± 21

Table 2: DoomTakeCover-v0 results, varying τ.

Temperature τ      Virtual Score    Actual Score
0.10               2086 ± 140       193 ± 58
0.50               2060 ± 277       196 ± 50
1.00               1145 ± 690       868 ± 511
1.15               918 ± 546        1092 ± 556
1.30               732 ± 269        753 ± 139
Random Policy      N/A              210 ± 108
Gym Leader [62]    N/A              820 ± 58

Full World Model (V and M)

The representation zt provided by V only captures a representation at a moment in time and does not have much predictive power. In contrast, M is trained to do one thing, and to do it really well, which is to predict zt+1.
Since M's prediction of zt+1 is produced from the RNN's hidden state ht at time t, ht is a good candidate for a feature vector we can give to our agent. Combining zt with ht gives C a good representation of both the current observation and what to expect in the future.

We see that allowing the agent to access both zt and ht greatly improves its driving capability. The driving is more stable, and the agent is able to attack sharp corners effectively. Furthermore, we see that in making these fast reflexive driving decisions during a car race, the agent does not need to plan ahead and roll out hypothetical scenarios of the future. Since ht contains information about the probability distribution of the future, the agent can just re-use the RNN's internal representation instinctively to guide its action decisions. Like a Formula One driver or a baseball player hitting a fastball [52], the agent can instinctively predict when and where to navigate in the heat of the moment.

Our agent is able to achieve a score of 906 ± 21, effectively solving the task and obtaining a new state of the art. Previous attempts [41, 36] using Deep RL methods obtained average scores in the 591–652 range, and the best reported solution on the leaderboard obtained an average score of 838 ± 11. Traditional Deep RL methods often require pre-processing of each frame, such as employing edge detection [36], in addition to stacking a few recent frames [41, 36] into the input. In contrast, our agent's V and M take in a stream of raw RGB pixel images and directly learn a spatio-temporal representation. To our knowledge, our method is the first reported solution to solve this task.

Since our agent's world model is able to model the future, we can use it to come up with hypothetical car racing scenarios on its own.
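The per-step interaction of V, M, and C with the environment (the pseudocode of Figure 2) can be sketched as a simple loop: V encodes, C acts on [z, h], M advances h. The component interfaces here are illustrative stand-ins for the trained models, not the paper's exact API.

```python
def rollout(env, V, M, C, max_steps=1000):
    """One episode of the agent loop: V encodes the current observation
    into z, C picks an action from [z, h], the environment steps, and M
    advances its hidden state h from (z, a)."""
    obs = env.reset()
    h = M.initial_state()
    cumulative_reward = 0.0
    for _ in range(max_steps):
        z = V.encode(obs)
        a = C.action(z, h)
        obs, reward, done, _ = env.step(a)
        cumulative_reward += reward
        h = M.next_state(z, a, h)
        if done:
            break
    return cumulative_reward
```

Note that only the scalar `cumulative_reward` is ever reported back to the evolutionary optimizer.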
We can use it to produce the probability distribution of zt+1 given the current state, sample a zt+1, and use this sample as if it were a real observation. We can put our trained C back into this generated environment. Figure 1 (left) shows a screenshot of the generated car racing environment. The interactive version of this work includes a demo of the generated environments.

4 VizDoom Experiment: Learning Inside of a Generated Environment

We have just seen that a policy learned inside of the real environment appears to function, to some extent, inside of the generated environment. This raises the question: can we train our agent to learn inside of its own generated environment, and transfer this policy back to the actual environment?

If our world model is sufficiently accurate for its purpose, and complete enough for the problem at hand, we should be able to substitute the actual environment with this world model. After all, our agent does not directly observe reality, but merely sees what the world model lets it see. In this experiment, we train an agent inside the environment generated by its world model trained to mimic a VizDoom [40] environment. In DoomTakeCover-v0 [62], the agent must learn to avoid fireballs shot by monsters from the other side of the room with the sole intent of killing the agent. The cumulative reward is defined to be the number of time steps the agent manages to stay alive during a rollout. Each rollout of the environment runs for a maximum of 2100 time steps, and the task is considered solved if the average survival time over 100 consecutive rollouts is greater than 750 time steps.

4.1 Experiment Setup

The setup of our VizDoom experiment is largely the same as the Car Racing task, except for a few key differences. In the Car Racing task, M is only trained to model the next zt.
Since we want to build a world model we can train our agent in, our M model here will also predict whether the agent dies in the next frame (as a binary event donet), in addition to the next latent vector zt+1.

Since M can predict the done state in addition to the next observation, we now have all of the ingredients needed to make a full RL environment to mimic DoomTakeCover-v0 [62]. We first build an OpenAI Gym environment interface by wrapping a gym.Env [5] interface over our M as if it were a real Gym environment, and then train our agent inside of this virtual environment instead of using the actual environment. In our simulation, we do not need the V model to encode any real pixel frames during the generation process, so our agent trains entirely in a more efficient latent-space environment. Both virtual and actual environments share an identical interface, so after the agent learns a satisfactory policy inside of the virtual environment, we can easily deploy this policy back into the actual environment to see how well the policy transfers over.

Here, our RNN-based world model is trained to mimic a complete game environment designed by human programmers. By learning only from raw image data collected from random episodes, it learns how to simulate the essential aspects of the game, such as the game logic, enemy behaviour, physics, and also the 3D graphics rendering. We can even play inside of this generated environment.

Unlike the actual game environment, however, we note that it is possible to add extra uncertainty into the virtual environment, thus making the game more challenging in the generated environment. We can do this by increasing the temperature parameter τ during the sampling process of zt+1. By increasing the uncertainty, our generated environment becomes more difficult compared to the actual environment.
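A minimal sketch of wrapping M as a Gym-style environment might look like the following. M's interface (`initial_state`, `sample_initial_z`, `sample_next`) and the +1-per-step survival reward (matching DoomTakeCover) are our assumptions about one possible implementation, not the paper's code.

```python
class DreamEnv:
    """Gym-style wrapper over the world model M: observations are latent
    states sampled from M instead of rendered pixels, and M also predicts
    the done event."""

    def __init__(self, M, max_steps=2100):
        self.M = M
        self.max_steps = max_steps

    def reset(self):
        self.h = self.M.initial_state()
        self.z = self.M.sample_initial_z()
        self.t = 0
        return (self.z, self.h)

    def step(self, action):
        # M samples the next latent state and predicts whether the agent dies.
        self.z, self.h, done = self.M.sample_next(self.z, action, self.h)
        self.t += 1
        done = done or self.t >= self.max_steps
        return (self.z, self.h), 1.0, done, {}
```

Because the virtual and actual environments expose the same reset/step interface, a controller trained against such a wrapper can be dropped into the real environment unchanged.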
The fireballs may move more randomly along a less predictable path than in the actual game. Sometimes the agent may even die due to sheer misfortune, without explanation.

After training, our controller learns to navigate around the virtual environment and escape from deadly fireballs launched by monsters generated by M. Our agent achieved an average score of 918 time steps in the virtual environment. We then took the agent trained inside of the virtual environment and tested its performance on the original VizDoom environment. The agent obtained an average score of 1092 time steps, far beyond the required score of 750 time steps, and also much higher than the score obtained inside the more difficult virtual environment. The full results are listed in Table 2.

We see that even though V is not able to capture all of the details of each frame correctly, for instance, getting the number of monsters right, C is still able to learn to navigate in the real environment. As the virtual environment cannot even keep track of the exact number of monsters in the first place, an agent that is able to survive a noisier and more uncertain generated environment can thrive in the original, cleaner environment. We also find that agents that perform well in higher temperature settings generally perform better in the normal setting. In fact, increasing τ helps prevent our controller from taking advantage of the imperfections of our world model. We will discuss this in depth in the next section.

4.2 Cheating the World Model

In our childhood, we may have encountered ways to exploit video games that were not intended by the original game designer [9].
Players discover ways to collect unlimited lives or health, and by taking advantage of these exploits, they can easily complete an otherwise difficult game. However, in the process of doing so, they may have forfeited the opportunity to learn the skill required to master the game as intended by the game designer. In our initial experiments, we noticed that our agent discovered an adversarial policy: it moves around in such a way that the monsters in the virtual environment governed by M never shoot a single fireball during some rollouts. Even when there are signs of a fireball forming, the agent moves in a way that extinguishes the fireballs.

Because M is only an approximate probabilistic model of the environment, it will occasionally generate trajectories that do not follow the laws governing the actual environment. As we previously pointed out, even the number of monsters on the other side of the room in the actual environment is not exactly reproduced by M. For this reason, our world model will be exploitable by C, even if such exploits do not exist in the actual environment.

As a result of using M to generate a virtual environment for our agent, we are also giving the controller access to all of the hidden states of M. This is essentially granting our agent access to all of the internal states and memory of the game engine, rather than only the game observations that the player gets to see. Therefore our agent can efficiently explore ways to directly manipulate the hidden states of the game engine in its quest to maximize its expected cumulative reward.
The weakness of this approach of learning a policy inside of a learned dynamics model is that our agent can easily find an adversarial policy that fools the dynamics model: a policy that looks good under the model but fails in the actual environment, usually because it visits states where the model is wrong, as they are far from the training distribution.

This weakness could be the reason that many previous works that learn dynamics models of RL environments do not actually use those models to fully replace the actual environments [60, 8]. Like in the M model proposed in [74, 75, 78], the dynamics model is deterministic, making it easily exploitable by the agent if it is not perfect. Using Bayesian models, as in PILCO [10], helps to address this issue with uncertainty estimates to some extent; however, they do not fully solve the problem. Recent work [57] combines the model-based approach with traditional model-free RL training by first initializing the policy network with the learned policy, but must subsequently rely on model-free methods to fine-tune this policy in the actual environment.

To make it more difficult for our C to exploit deficiencies of M, we chose to use the MDN-RNN to model the distribution of possible outcomes in the actual environment, rather than merely predicting a deterministic future. Even if the actual environment is deterministic, the MDN-RNN would in effect approximate it as a stochastic environment. This has the advantage of allowing us to train C inside a more stochastic version of any environment: we can simply adjust the temperature parameter τ to control the amount of randomness in M, hence controlling the tradeoff between realism and exploitability.
However, the discrete modes in a mixture density model are useful for environments with random discrete events, such as whether a monster decides to shoot a fireball or stay put. While a single diagonal Gaussian might be sufficient to encode individual frames, an RNN with a mixture density output layer makes it easier to model the logic behind a more complicated environment with discrete random states.

For instance, if we set the temperature parameter to a very low value of τ = 0.1, effectively training our C with an M that is almost identical to a deterministic LSTM, the monsters inside this generated environment fail to shoot fireballs, no matter what the agent does, due to mode collapse. M is not able to transition to another mode of the mixture of Gaussians in which fireballs are formed and shot. Whatever policy is learned inside this generated environment will achieve a perfect score of 2100 most of the time, but will obviously fail when unleashed into the harsh reality of the actual world, underperforming even a random policy.

By making the temperature τ an adjustable parameter of M, we can see the effect of training C inside of virtual environments with different levels of uncertainty, and see how well the resulting policies transfer over to the actual environment. We experiment with varying τ of the virtual environment, training an agent inside of this virtual environment, and observing its performance in the actual environment.

In Table 2, we see that while increasing τ of M makes it more difficult for C to find adversarial policies, increasing it too much makes the virtual environment too difficult for the agent to learn anything, so in practice it is a hyperparameter we tune. The temperature also affects the types of strategies the agent discovers.
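The mode-collapse behaviour at low τ can be illustrated with a small sampling sketch. The logit-scaling scheme shown is one common choice for temperature sampling and may differ in detail from the model's actual sampler.

```python
import math
import random

def sample_component(logits, tau):
    """Sample a mixture-component index with temperature-scaled
    probabilities: dividing logits by tau before the softmax sharpens
    (tau < 1) or flattens (tau > 1) the distribution over modes."""
    scaled = [l / tau for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r = random.random() * sum(exps)
    for i, e in enumerate(exps):
        r -= e
        if r <= 0.0:
            return i
    return len(exps) - 1

# With logits favouring a "no fireball" mode (index 0) over a "fireball"
# mode (index 1), tau = 0.1 makes the rare mode all but vanish, while
# tau > 1 samples it noticeably more often.
```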
For example, although the best score obtained is 1092 ± 556 with τ = 1.15, increasing τ a notch to 1.30 results in a lower score but at the same time a less risky strategy with a lower variance of returns. For comparison, the best reported score [62] is 820 ± 58.

5 Related Work

There is extensive literature on learning a dynamics model and using this model to train a policy. Many basic concepts first explored in the 1980s for feed-forward neural networks (FNNs) [104, 56, 72, 105, 58] and in the 1990s for RNNs [74, 75, 78, 76] laid some of the groundwork for Learning to Think [83]. The more recent PILCO [10, 53] is a probabilistic model-based policy search method designed to solve difficult control problems. Using data collected from the environment, PILCO uses a Gaussian process (GP) model to learn the system dynamics, and uses this model to sample many trajectories in order to train a controller to perform a desired task, such as swinging up a pendulum. While GPs work well with a small set of low-dimensional data, their computational complexity makes them difficult to scale up to model a large history of high-dimensional observations. Other recent works [17, 12] use Bayesian neural networks instead of GPs to learn a dynamics model. These methods have demonstrated promising results on challenging control tasks [32], where the states are well defined and the observations are relatively low-dimensional. Here we are interested in modelling dynamics observed from high-dimensional visual data, as a sequence of raw pixel frames.

In robotic control applications, the ability to learn the dynamics of a system from observing only camera-based video inputs is a challenging but important problem.
Early work on RL for active vision trained an FNN to take the current image frame of a video sequence and predict the next frame [85], and used this predictive model to train a fovea-shifting control network that tries to find targets in a visual scene. To get around the difficulty of training a dynamical model to learn directly from high-dimensional pixel images, researchers explored using neural networks to first learn a compressed representation of the video frames. Recent work along these lines [99, 100] was able to train controllers using the bottleneck hidden layer of an autoencoder as low-dimensional feature vectors to control a pendulum from pixel inputs. Learning a model of the dynamics from a compressed latent space enables RL algorithms to be much more data-efficient [15, 101].

Video game environments are also popular in model-based RL research as a testbed for new ideas. Previous work [51] used a feed-forward convolutional neural network (CNN) to learn a forward simulation model of a video game. Learning to predict how different actions affect future states in the environment is useful for game-play agents, since if our agent can predict what happens in the future given its current state and action, it can simply select the best action that suits its goal. This has been demonstrated not only in early work [58, 85] (when compute was a million times more expensive than today) but also in recent studies [13] on several competitive VizDoom environments.

The works mentioned above use FNNs to predict the next video frame. We may want to use models that can capture longer-term time dependencies. RNNs are powerful models suitable for sequence modelling [24]. Using RNNs to develop internal models to reason about the future was explored as early as 1990 [74], and then further in [75, 78, 76].
A more recent work [83] presented a\nunifying framework for building an RNN-based general problem solver that can learn a world model\nof its environment and also learn to reason about the future using this model. Subsequent works\nhave used RNN-based models to generate many frames into the future [8, 60, 11, 25], and also as an\ninternal model to reason about the future [90, 68, 102].\nIn this work, we used evolution strategies (ES) to train our controller, as this offers many bene\ufb01ts. For\ninstance, we only need to provide the optimizer with the \ufb01nal cumulative reward, rather than the entire\nhistory. ES is also easy to parallelize \u2013 we can launch many instances of rollout with different\nsolutions to many workers and quickly compute a set of cumulative rewards in parallel. Recent works\n[14, 73, 26, 94] have demonstrated that ES is a viable alternative to traditional Deep RL methods on\nmany strong baselines. Before the popularity of Deep RL methods [54], evolution-based algorithms\nhave been shown to be effective at solving RL tasks [92, 22, 21, 18, 88]. Evolution-based algorithms\nhave even been able to solve dif\ufb01cult RL tasks from high dimensional pixel inputs [45, 31, 63, 1].\n\n7\n\n\f6 Discussion\n\nWe have demonstrated the possibility of training an agent to perform tasks entirely inside of its\nsimulated latent space world. This approach offers many practical bene\ufb01ts. For instance, video\ngame engines typically require heavy compute resources for rendering the game states into image\nframes, or calculating physics not immediately relevant to the game. We may not want to waste\ncycles training an agent in the actual environment, but instead train the agent as many times as we\nwant inside its simulated environment. Agents that are trained incrementally to simulate reality may\nprove to be useful for transferring policies back to the real world. 
Our approach may complement sim2real approaches outlined in previous work [4, 33].
The choice of implementing V as a VAE and training it as a standalone model also has its limitations, since it may encode parts of the observations that are not relevant to a task. After all, unsupervised learning cannot, by definition, know what will be useful for the task at hand. For instance, our VAE reproduced unimportant detailed brick-tile patterns on the side walls in the Doom environment, but failed to reproduce task-relevant tiles on the road in the Car Racing environment. By training together with an M that predicts rewards, the VAE may learn to focus on task-relevant areas of the image, but the tradeoff is that we may not be able to reuse the VAE for new tasks without retraining. Learning task-relevant features has connections to neuroscience as well. Primary sensory neurons are released from inhibition when rewards are received, which suggests that they generally learn task-relevant features, rather than just any features, at least in adulthood [65].
In our experiments, the tasks are relatively simple, so a reasonable world model can be trained on a dataset collected with a random policy. But what if our environments become more sophisticated? In a difficult environment, parts of the world become available to the agent only after it learns how to navigate that world strategically. For more complicated tasks, an iterative training procedure is required: the agent must explore its world and constantly collect new observations, so that its world model can be improved and refined over time. Future work will incorporate such an iterative training procedure [83], where our controller actively explores the parts of the environment that are beneficial for improving its world model.
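The iterative procedure just described might look as follows. Every function here is a hypothetical stand-in (named for illustration only, with trivial toy bodies so the sketch runs); the point is the control flow of alternating exploration, model refinement, and controller training, not the paper's implementation.

```python
import random

# Toy stand-ins (assumptions, not the paper's components):
def init_random_policy():
    return random.Random(0).random()

def collect_rollouts(policy, n=10):
    """Explore the (toy) environment with the current policy."""
    rng = random.Random(int(policy * 1e6))
    return [rng.random() for _ in range(n)]

def train_world_model(buffer):
    """Refine M on all data collected so far ("model" = mean observation)."""
    return sum(buffer) / len(buffer)

def evolve_controller(world_model):
    """Train C inside M (here the "controller" just tracks the model)."""
    return world_model

def iterative_training(n_rounds=3):
    buffer, policy = [], init_random_policy()
    for _ in range(n_rounds):
        buffer += collect_rollouts(policy)   # 1. explore with current C
        model = train_world_model(buffer)    # 2. refine M on the growing buffer
        policy = evolve_controller(model)    # 3. retrain C inside M
    return policy, len(buffer)

policy, n_obs = iterative_training()
```

Each round's improved controller reaches parts of the environment a random policy could not, so the buffer, and therefore the world model, keeps improving.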
An exciting research direction is to incorporate artificial curiosity and intrinsic motivation [81, 80, 77, 64, 61] and information-seeking [86, 23] abilities into an agent to encourage exploration [47]. In particular, we can augment the reward function based on improvement in compression quality [81, 80, 77, 83].
Another concern is the limited capacity of our world model. While modern storage devices can store the large amounts of historical data generated by an iterative training procedure, our LSTM-based [34, 20] world model may not be able to store all of the recorded information inside its weight connections. While the human brain can hold decades and even centuries of memories to some resolution [2], our neural networks trained with backpropagation have more limited capacity and suffer from issues such as catastrophic forgetting [69, 16, 43]. Future work will explore replacing the VAE and MDN-RNN with higher-capacity models [89, 27, 93, 97, 98], or incorporating an external memory module [19, 107], if we want our agent to learn to explore more complicated worlds.
Like early RNN-based C–M systems [74, 75, 78, 76], ours simulates possible futures time step by time step, without profiting from human-like hierarchical planning or abstract reasoning, which often ignores irrelevant spatio-temporal details. However, the more general Learning To Think [83] approach is not limited to this rather naive scheme. Instead, it allows a recurrent C to learn to address subroutines of the recurrent M and reuse them for problem solving in arbitrary computable ways, e.g., through hierarchical planning or other ways of exploiting parts of M's program-like weight matrix.
A recent One Big Net [84] extension of the C–M approach collapses C and M into a single network, and uses PowerPlay-like [82, 91] behavioural replay (where the behaviour of a teacher net is compressed into a student net [79]) to avoid forgetting old prediction and control skills when learning new ones. Experiments with those more general approaches are left for future work.

Acknowledgments

We would like to thank Blake Richards, Kory Mathewson, Chris Olah, Kai Arulkumaran, Denny Britz, Kyle McDonald, Ankur Handa, Elwin Ha, Nikhil Thorat, Daniel Smilkov, Alex Graves, Douglas Eck, Mike Schuster, Rajat Monga, Vincent Vanhoucke, Jeff Dean and Natasha Jaques for their thoughtful feedback. This work was partially funded by SNF project RNNAISSANCE (200021_165675) and by an ERC Advanced Grant (no: 742870).

References

[1] S. Alvernaz and J. Togelius. Autoencoder-augmented neuroevolution for visual doom playing. In Computational Intelligence and Games (CIG), 2017 IEEE Conference on, pages 1–8. IEEE, 2017.

[2] T. M. Bartol Jr, C. Bromer, J. Kinney, M. A. Chirillo, J. N. Bourne, K. M. Harris, and T. J. Sejnowski. Nanoconnectomic upper bound on the variability of synaptic plasticity. Elife, 4, 2015.

[3] C. M. Bishop. Neural Networks for Pattern Recognition (chapter 6). Oxford University Press, 1995.

[4] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. Preprint arXiv:1709.07857, Sept. 2017.

[5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. Preprint arXiv:1606.01540, June 2016.

[6] S. Carter, D. Ha, I. Johnson, and C. Olah. Experiments in handwriting with a neural network. Distill, 2016.

[7] L. Chang and D. Y. Tsao. The code for facial identity in the primate brain.
Cell, 169(6):1013–1028, 2017.

[8] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed. Recurrent environment simulators. Preprint arXiv:1704.02254, Apr. 2017.

[9] M. Consalvo. Cheating: Gaining Advantage in Videogames (chapter 5). The MIT Press, 2007.

[10] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.

[11] E. L. Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4417–4426, 2017.

[12] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. Preprint arXiv:1605.07127, May 2016.

[13] A. Dosovitskiy and V. Koltun. Learning to act by predicting the future. Preprint arXiv:1611.01779, Nov. 2016.

[14] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. Preprint arXiv:1701.08734, Jan. 2017.

[15] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 512–519. IEEE, 2016.

[16] R. M. French. Catastrophic interference in connectionist networks: Can it be predicted, can it be prevented? In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 1176–1177. Morgan-Kaufmann, 1994.

[17] Y. Gal, R. McAllister, and C. E. Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML, 2016.

[18] J. Gauci and K. O. Stanley.
Autonomous evolution of topographic regularities in artificial neural networks. Neural Computation, 22(7):1860–1898, July 2010.

[19] M. Gemici, C. Hung, A. Santoro, G. Wayne, S. Mohamed, D. Rezende, D. Amos, and T. Lillicrap. Generative temporal models with memory. Preprint arXiv:1702.04649, Feb. 2017.

[20] F. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, Oct. 2000.

[21] F. Gomez and J. Schmidhuber. Co-evolving recurrent neurons learn deep memory POMDPs. Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pages 491–498, 2005.

[22] F. Gomez, J. Schmidhuber, and R. Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9:937–965, June 2008.

[23] J. Gottlieb, P.-Y. Oudeyer, M. Lopes, and A. Baranes. Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends in Cognitive Sciences, 17(11):585–593, 2013.

[24] A. Graves. Generating sequences with recurrent neural networks. Preprint arXiv:1308.0850, 2013.

[25] A. Graves. Hallucination with recurrent neural networks. https://youtu.be/-yX1SYeDHbg, 2015.

[26] D. Ha. Evolving stable strategies. http://blog.otoro.net/, 2017.

[27] D. Ha, A. Dai, and Q. V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.

[28] D. Ha and D. Eck. A neural representation of sketch drawings. In International Conference on Learning Representations, 2018.

[29] N. Hansen. The CMA evolution strategy: A tutorial. Preprint arXiv:1604.00772, 2016.

[30] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, June 2001.

[31] M. Hausknecht, J. Lehman, R. Miikkulainen, and P. Stone.
A neuroevolution approach to general Atari game playing. IEEE Transactions on Computational Intelligence and AI in Games, 6(4):355–366, 2014.

[32] D. Hein, S. Depeweg, M. Tokic, S. Udluft, A. Hentschel, T. Runkler, and V. Sterzing. A benchmark environment motivated by industrial control problems. Preprint arXiv:1709.09480, Sept. 2017.

[33] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. Preprint arXiv:1707.08475, 2017.

[34] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[35] J. Hünermann. Self-driving cars in the browser. http://janhuenermann.com/, 2017.

[36] S. Jang, J. Min, and C. Lee. Reinforcement car racing with A3C. https://goo.gl/58SKBp, 2017.

[37] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of AI Research, 4:237–285, 1996.

[38] G. Keller, T. Bonhoeffer, and M. Hübener. Sensorimotor mismatch signals in primary visual cortex of the behaving mouse. Neuron, 74(5):809–815, 2012.

[39] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.

[40] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski. VizDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, pages 341–348, Santorini, Greece, Sept. 2016. IEEE. Best Paper Award.

[41] M. Khan and O. Elibol. Car racing using reinforcement learning. https://goo.gl/neSBSx, 2016.

[42] D. Kingma and M. Welling. Auto-encoding variational Bayes. Preprint arXiv:1312.6114, 2013.

[43] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al.
Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[44] O. Klimov. CarRacing-v0. http://gym.openai.com/, 2016.

[45] J. Koutnik, G. Cuccu, J. Schmidhuber, and F. Gomez. Evolving large-scale neural networks for vision-based reinforcement learning. Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pages 1061–1068, 2013.

[46] B. Lau. Using Keras and deep deterministic policy gradient to play TORCS. https://yanpanlau.github.io/, 2016.

[47] J. Lehman and K. Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223, 2011.

[48] M. Leinweber, D. R. Ward, J. M. Sobczak, A. Attinger, and G. B. Keller. A sensorimotor circuit in mouse cortex for visual flow predictions. Neuron, 95(6):1420–1432.e5, 2017.

[49] L. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, January 1993.

[50] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.

[51] M. Guzdial, B. Li, and M. O. Riedl. Game engine learning from video. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3707–3713, 2017.

[52] G. W. Maus, J. Fischer, and D. Whitney. Motion-dependent representation of space in area MT+. Neuron, 78(3):554–562, 2013.

[53] R. McAllister and C. E. Rasmussen. Data-efficient reinforcement learning in continuous state-action Gaussian-POMDPs. In Advances in Neural Information Processing Systems, pages 2037–2046, 2017.

[54] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning.
Preprint arXiv:1312.5602, Dec. 2013.

[55] D. Mobbs, C. C. Hagan, T. Dalgleish, B. Silston, and C. Prévost. The ecology of human fear: survival optimization and the nervous system. Frontiers in Neuroscience, 9:55, 2015.

[56] P. W. Munro. A dual back-propagation scheme for scalar reinforcement learning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176, 1987.

[57] A. Nagabandi, G. Kahn, R. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. Preprint arXiv:1708.02596, Aug. 2017.

[58] N. Nguyen and B. Widrow. The truck backer-upper: An example of self learning in neural networks. In Proceedings of the International Joint Conference on Neural Networks, pages 357–363. IEEE Press, 1989.

[59] N. Nortmann, S. Rekauzke, S. Onat, P. König, and D. Jancke. Primary visual cortex represents the difference between past and present. Cerebral Cortex, 25(6):1427–1440, 2015.

[60] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.

[61] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.

[62] P. Paquette. DoomTakeCover-v0. https://gym.openai.com/, 2016.

[63] M. Parker and B. D. Bryant. Neurovisual control in the Quake II environment. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):44–54, 2012.

[64] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.

[65] H.-J. Pi, B. Hangya, D. Kvitsiani, J. I. Sanders, Z. J. Huang, and A. Kepecs.
Cortical interneurons that specialize in disinhibitory control. Nature, 503(7477):521, 2013.

[66] L. Prieur. Deep-Q learning for racecar reinforcement learning problem. https://goo.gl/VpDqSw, 2017.

[67] R. Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried. Invariant visual representation by single neurons in the human brain. Nature, 435(7045):1102, 2005.

[68] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 5694–5705, 2017.

[69] R. M. Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990.

[70] I. Rechenberg. Evolutionsstrategien. In Simulationsmethoden in der Medizin und Biologie, pages 83–114. Springer, 1978.

[71] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. Preprint arXiv:1401.4082, 2014.

[72] T. Robinson and F. Fallside. Dynamic reinforcement driven error propagation networks with application to game playing. In Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, pages 836–843, 1989.

[73] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. Preprint arXiv:1703.03864, 2017.

[74] J. Schmidhuber. Making the world differentiable: On using supervised learning fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technische Universität München Tech. Report FKI-126-90, 1990.

[75] J. Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments.
In Neural Networks, 1990 IJCNN International Joint Conference on, pages 253–258. IEEE, 1990.

[76] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222–227, 1990.

[77] J. Schmidhuber. Curious model-building control systems. In Neural Networks, 1991 IEEE International Joint Conference on, pages 1458–1463. IEEE, 1991.

[78] J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In Advances in Neural Information Processing Systems, pages 500–506, 1991.

[79] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992. (Based on TR FKI-148-91, TUM, 1991).

[80] J. Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173–187, 2006.

[81] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.

[82] J. Schmidhuber. PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in Psychology, 4:313, 2013.

[83] J. Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. Preprint arXiv:1511.09249, 2015.

[84] J. Schmidhuber. One big net for everything. Preprint arXiv:1802.08864, Feb. 2018.

[85] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1-2):125–134, 1991.

[86] J. Schmidhuber, J. Storck, and S. Hochreiter.
Reinforcement driven information acquisition in nondeterministic environments. Technical Report FKI- -94, TUM Department of Informatics, 1994.

[87] H. Schwefel. Numerical Optimization of Computer Models. John Wiley and Sons, Inc., New York, NY, USA, 1977.

[88] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.

[89] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.

[90] D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, and T. Degris. The predictron: End-to-end learning and planning. Preprint arXiv:1612.08810, Dec. 2016.

[91] R. K. Srivastava, B. R. Steunebrink, and J. Schmidhuber. First experiments with PowerPlay. Neural Networks, 41:130–136, 2013.

[92] K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

[93] J. Suarez. Language modeling with recurrent highway hypernetworks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3269–3278. Curran Associates, Inc., 2017.

[94] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. Preprint arXiv:1712.06567, Dec. 2017.

[95] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.

[96] R. S.
Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[97] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. Preprint arXiv:1609.03499, Sept. 2016.

[98] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

[99] N. Wahlström, T. B. Schön, and M. P. Deisenroth. Learning deep dynamical models from image pixels. In 17th IFAC Symposium on System Identification (SYSID), October 19-21, Beijing, China, 2015.

[100] N. Wahlström, T. Schön, and M. Deisenroth. From pixels to torques: Policy learning with deep dynamical models. Preprint arXiv:1502.02251, June 2015.

[101] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.

[102] N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran. Visual interaction networks. Preprint arXiv:1706.01433, June 2017.

[103] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In System modeling and optimization, pages 762–770. Springer, 1982.

[104] P. J. Werbos. Learning how the world works: Specifications for predictive networks in robots and brains. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, N.Y., 1987.

[105] P. J. Werbos. Neural networks for control and system identification. In Decision and Control, 1989, Proceedings of the 28th IEEE Conference on, pages 260–265. IEEE, 1989.

[106] M. Wiering and M. van Otterlo. Reinforcement Learning.
Springer, 2012.

[107] Y. Wu, G. Wayne, A. Graves, and T. Lillicrap. The Kanerva machine: A generative distributed memory. In International Conference on Learning Representations, 2018.