{"title": "Learning to Predict Without Looking Ahead: World Models Without Forward Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 5379, "page_last": 5390, "abstract": "Much of model-based reinforcement learning involves learning a model of an agent's world, and training an agent to leverage this model to perform a task more efficiently. While these models are demonstrably useful for agents, every naturally occurring model of the world of which we are aware---e.g., a brain---arose as the byproduct of competing evolutionary pressures for survival, not minimization of a supervised forward-predictive loss via gradient descent. That useful models can arise out of the messy and slow optimization process of evolution suggests that forward-predictive modeling can arise as a side-effect of optimization under the right circumstances. Crucially, this optimization process need not explicitly be a forward-predictive loss. In this work, we introduce a modification to traditional reinforcement learning which we call observational dropout, whereby we limit the agent's ability to observe the real environment at each timestep. In doing so, we can coerce an agent into learning a world model to fill in the observation gaps during reinforcement learning. We show that the emerged world model, while not explicitly trained to predict the future, can help the agent learn key skills required to perform well in its environment. Videos of our results are available at https://learningtopredict.github.io/", "full_text": "Learning to Predict Without Looking Ahead:\nWorld Models Without Forward Prediction\n\nC. Daniel Freeman, Luke Metz, David Ha\n\nGoogle Brain\n\n{cdfreeman, lmetz, hadavid}@google.com\n\nAbstract\n\nMuch of model-based reinforcement learning involves learning a model of an\nagent\u2019s world, and training an agent to leverage this model to perform a task more\nefficiently. 
While these models are demonstrably useful for agents, every naturally\noccurring model of the world of which we are aware\u2014e.g., a brain\u2014arose as the\nbyproduct of competing evolutionary pressures for survival, not minimization of a\nsupervised forward-predictive loss via gradient descent. That useful models can\narise out of the messy and slow optimization process of evolution suggests that\nforward-predictive modeling can arise as a side-effect of optimization under the\nright circumstances. Crucially, this optimization process need not explicitly be a\nforward-predictive loss. In this work, we introduce a modification to traditional\nreinforcement learning which we call observational dropout, whereby we limit\nthe agent\u2019s ability to observe the real environment at each timestep. In doing so,\nwe can coerce an agent into learning a world model to fill in the observation gaps\nduring reinforcement learning. We show that the emerged world model, while\nnot explicitly trained to predict the future, can help the agent learn key skills\nrequired to perform well in its environment. Videos of our results are available at\nhttps://learningtopredict.github.io/\n\n1 Introduction\n\nMuch of the motivation of model-based reinforcement learning (RL) derives from the potential utility\nof learned models for downstream tasks, like prediction [13, 15], planning [1, 35, 40, 41, 43, 64], and\ncounterfactual reasoning [9, 28]. Whether such models are learned from data, or created from domain\nknowledge, there\u2019s an implicit assumption that an agent\u2019s world model [21, 52, 66] is a forward model\nfor predicting future states. While perfect forward models would undoubtedly deliver great utility, they\nare difficult to create; thus much of the research has been focused on either dealing with uncertainties\nof forward models [11, 16, 21], or improving their prediction accuracy [22, 28]. 
While progress has\nbeen made with current approaches, it is not clear that models trained explicitly to perform forward\nprediction are the only possible or even desirable solution.\n\nFigure 1: Our agent is given only infrequent observations of its environment (e.g., frames 1, 8),\nand must learn a world model to fill in the observation gaps. The colorless cart-pole represents the\npredicted observations seen by the policy. Under such constraints, we show that world models can\nemerge so that the policy can still perform well on a swing-up cart-pole environment.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe hypothesize that explicit forward prediction is not required to learn useful models of the world,\nand that prediction may arise as an emergent property if it is useful for an agent to perform its task.\nTo encourage prediction to emerge, we introduce a constraint to our agent: at each timestep, the agent\nis only allowed to observe its environment with some probability p. To cope with this constraint, we\ngive our agent an internal model that takes as input both the previous observation and action, and it\ngenerates a new observation as an output. Crucially, the input observation to the model will be the\nground truth only with probability p, while the input observation will be its previously generated\none with probability 1 \u2212 p. The agent\u2019s policy will act on this internal observation without knowing\nwhether it is real, or generated by its internal model. In this work, we investigate to what extent world\nmodels trained with policy gradients behave like forward predictive models, by restricting the agent\u2019s\nability to observe its environment.\nBy jointly learning both the policy and model to perform well on the given task, we can directly\noptimize the model without ever explicitly optimizing for forward prediction. 
This allows the model\nto focus on generating any \u201cpredictions\u201d that are useful for the policy to perform well on the task,\neven if they are not realistic. The models that emerge under our constraints capture the essence of\nwhat the agent needs to see from the world. We conduct various experiments to show, under certain\nconditions, that the models learn to behave like imperfect forward predictors. We demonstrate that\nthese models can be used to generate environments that do not follow the rules that govern the actual\nenvironment, but nonetheless can be used to teach the agent important skills needed in the actual\nenvironment. We also examine the role of inductive biases in the world model, and show that the\narchitecture of the model plays a role not only in performance, but also in interpretability.\n\n2 Related Work\n\nOne promising reason to learn models of the world is to accelerate learning of policies by training\nthese models. These works obtain experience from the real environment, and fit a model directly\nto this data. Some of the earliest work leverages simple model parameterizations \u2013 e.g., learnable\nparameters for system identification [46]. Recently, there has been great interest in using more\nflexible parameterizations in the form of function approximators. The earliest work we are aware of\nthat uses feed forward neural networks as predictive models for tasks is Werbos [66]. To model time\ndependence, recurrent neural networks were introduced in [52]. Recently, as our modeling abilities\nhave increased, there has been renewed interest in directly modeling pixels [22, 29, 45, 59]. Mathieu et al.\n[37] modify the loss function used to generate more realistic predictions. Denton and Fergus [12]\npropose a stochastic model which learns to predict the next frame in a sequence, whereas Finn et al.\n[15] employ a different parameterization involving predicting pixel movement as opposed to directly\npredicting pixels. Kumar et al. 
[32] employ flow-based tractable density models to learn models, and\nHa and Schmidhuber [21] leverage a VAE-RNN architecture to learn an embedding of pixel data\nacross time. Hafner et al. [22] propose to learn a latent space, and learn forward dynamics in this\nlatent space. Other methods utilize probabilistic dynamics models which allow for better planning in\nthe face of uncertainty [11, 16]. Presaging much of this work is [57], which learns a model that can\npredict environment state over multiple timescales via imagined rollouts.\nAs both predictive modeling and control improve, there have been a large number of successes\nleveraging learned predictive models in Atari [8, 28] and robotics [14]. Unlike our work, all of\nthese methods leverage transitions to learn an explicit dynamics model. Despite advances in forward\npredictive modeling, the application of such models is limited to relatively simple domains where\nmodels perform well.\nErrors in the world model compound, and cause issues when used for control [3, 62]. Amos et al. [2],\nsimilar to our work, directly optimize the dynamics model against the loss by differentiating through a\nplanning procedure, and Schmidhuber [51] proposes a similar idea of improving the internal model\nusing an RNN, although the RNN world model is initially trained to perform forward prediction.\nIn this work we structure our learning problem so a model of the world will emerge as a result of\nsolving a given task. This notion of emergent behavior has been explored in a number of different\nareas and broadly is called \u201crepresentation learning\u201d [6]. Early work on autoencoders leverages\nreconstruction-based losses to learn meaningful features [26, 33]. Follow-up work focuses on learning\n\u201cdisentangled\u201d representations by enforcing more structure in the learning procedure [24, 25]. Self-supervised\napproaches construct other learning problems, e.g. 
solving a jigsaw puzzle [42], or\nleveraging temporal structure [44, 56]. Alternative setups, closer to our own, specify a specific\nlearning problem and observe that solving these problems leads to interesting learned behavior (e.g.\ngrid cells) [4, 10]. In the context of learning models, Watter et al. [65] construct a locally linear latent\nspace where planning can then be performed.\nThe force driving model improvement in our work consists of black box optimization. In an effort to\nemulate nature, evolutionary algorithms were proposed [18, 23, 27, 60, 67]. These algorithms are\nrobust and will adapt to constraints such as ours while still solving the given task [7, 34]. Recently,\nreinforcement learning has emerged as a promising framework to tackle optimization leveraging\nthe sequential nature of the world for increased efficiency [38, 39, 53, 54, 61]. The exact type\nof the optimization is of less importance to us in this work and thus we choose to use a simple\npopulation-based optimization algorithm [68] with connections to evolution strategies [47, 50, 55].\nThe boundary between what is considered model-free and model-based reinforcement learning is\nblurred when one considers both the model network and controller network together as one giant\npolicy that can be trained end-to-end with model-free methods. [49] demonstrates this by training\nboth world model and policy via evolution. Earlier works [17, 36] demonstrate that agents can learn\ngoal-directed internal models by delaying or omitting sensory information. Instead of performance,\nhowever, this work focuses on understanding what these models learn and on showing their usefulness \u2013 e.g.,\ntraining a policy inside the learned models.\n\n3 Motivation: When a random world model is good enough\n\nA common goal when learning a world model is to learn a perfect forward predictor. 
In this section,\nwe provide intuitions for why this is not always necessary, and demonstrate how learning on random\n\u201cworld models\u201d can lead to performant policies when transferred to the real world. For simplicity, we\nconsider the classical control task of balance cart-pole [5]. While there are many ways of constructing\nworld models for cart-pole, an optimal forward predictive model will have to generate trajectories of\nsolutions to the simple linear differential equation describing the pole\u2019s dynamics near the unstable\nequilibrium point (footnote 1). One particular coefficient matrix fully describes these dynamics; thus, for this\nexample, we identify this coefficient matrix as the free parameters of the world model, M.\nWhile this unique M perfectly describes the dynamics of the pole, if our objective is only to stabilize\nthe system\u2014not achieve perfect forward prediction\u2014it stands to reason that we may not necessarily\nneed to know these exact dynamics. In fact, if one solves for the linear feedback parameters that\nstabilize a cart-pole system with coefficient matrix M' (not necessarily equal to M), for a wide\nvariety of M', those same linear feedback parameters will also stabilize the \u201ctrue\u201d dynamics M. Thus\none successful, albeit silly, strategy for solving balance cart-pole is choosing a random M', finding\nlinear feedback parameters that stabilize this M', and then deploying those same feedback controls to\nthe \u201creal\u201d model M. We provide the details of this procedure in the Appendix.\nNote that the world model learned in this way is almost arbitrarily wrong. It does not produce useful\nforward predictions, nor does it accurately estimate any of the parameters of the real world like\nthe length of the pole, or the mass of the cart. Nonetheless, it can be used to produce a successful\nstabilizing policy. In sum, this toy problem exhibits three interesting qualities: 1. 
That a world model\ncan be learned that produces a valid policy without needing a forward predictive loss, 2. That a world\nmodel need not itself be forward predictive (at all) to facilitate finding a valid policy, and 3. That\nthe inductive bias intrinsic to one\u2019s world model almost entirely controls the ease of optimization of\nthe final policy. Unfortunately, most real world environments are not this simple and will not lead to\nperformant policies without ever observing the real world. Nonetheless, the underlying lesson that a\nworld model can be quite wrong, so long as it is wrong in the right way, will be a recurring theme.\n\nFootnote 1: In general, the full dynamics describing cart-pole are non-linear. However, in the limit of a heavy cart and\nsmall perturbations about the vertical at low speeds, they reduce to a linear system. See the Appendix for details.\n\n4 Emergent world models by learning to fill in gaps\n\nIn the previous section, we outlined a strategy for finding policies without even \u201cseeing\u201d the real\nworld. In this section, we relax this constraint and allow the agent to periodically switch between\nreal observations and simulated observations generated by a world model. We call this method\nobservational dropout, inspired by [58].\nMechanistically, this amounts to a map from a single Markov decision process (MDP) to a\ndifferent MDP with an augmented state space. Instead of only optimizing the agent in the real\nenvironment, with some probability, at every frame, the agent uses its internal world model to produce\nan observation of the world conditioned on its previous observation. 
When samples from the real\nworld are used, the state of the world model is reset to the real state\u2014effectively resynchronizing the\nagent\u2019s model to the real world.\nTo show this, consider an MDP with states s \u2208 S, transition distribution s^{t+1} \u223c P(s^t, a^t), and\nreward distribution R(s^t, a^t, s^{t+1}). We can create a new partially observed MDP whose state is a pair, s' =\n(s_{orig}, s_{model}) \u2208 (S, S), consisting of both the original state and the internal state produced by the\nworld model. The transition function then switches between the real and world model states with\nsome probability p:\n\nP'(a^t, (s')^t) = (s^{t+1}_{orig}, s^{t+1}_{orig}) if r < p,\nP'(a^t, (s')^t) = (s^{t+1}_{orig}, s^{t+1}_{model}) if r \u2265 p,    (1)\n\nwhere r \u223c Uniform(0, 1), s^{t+1}_{orig} \u223c P(s^t_{orig}, a^t) is the real environment transition, s^{t+1}_{model} \u223c M(s^t_{model}, a^t; \u03c6) is\nthe next world model transition, and p is the peek probability.\nThe observation space of this new partially observed MDP is always the second entry of the state\ntuple, s'. As before, we care about performing well on the real environment, thus the reward function\nis the same as in the original environment: R'(s^t, a^t, s^{t+1}) = R(s^t_{orig}, a^t, s^{t+1}_{orig}). Our learning task\nconsists of training an agent, \u03c0(s; \u03b8), and the world model, M(s, a^t; \u03c6), to maximize reward in this\naugmented MDP. In our work, we parameterize our world model M, and our policy \u03c0, as neural\nnetworks with parameters \u03c6 and \u03b8 respectively. While it\u2019s possible to optimize this objective with any\nreinforcement learning method [38, 39, 53, 54], we choose to use population-based REINFORCE\n[68] due to its simplicity and effectiveness at achieving high scores on various tasks [19, 20, 50]. By\nrestricting the observations, we make optimization harder and thus expect worse performance on the\nunderlying task. 
We can use this optimization procedure, however, to drive learning of the world\nmodel much in the same way evolution drove our internal world models.\nOne might worry that a policy with sufficient capacity could extract useful data from a world model,\neven if that world model\u2019s features weren\u2019t easily interpretable. In this limit, our procedure starts\nlooking like a strange sort of recurrent network, where the world model \u201clearns\u201d to extract\ndifficult-to-interpret features (like, e.g., the hidden state of an RNN) from the world state, and then the policy\nis powerful enough to learn to use these features to make decisions about how to act. While this is\nindeed a possibility, in practice, we usually constrained the capacity of the policies we studied to be\nsmall enough that this did not occur. For a counter-example, see the fully connected world model for\nthe grid world tasks in Section 4.2.\n\n4.1 What policies can be learned from world models emerged from observational dropout?\n\nAs the balance cart-pole task discussed earlier can be trivially solved with a wide range of parameters\nfor a simple linear policy, we conduct experiments where we apply observational dropout on the\nmore difficult swing-up cart-pole\u2014a task that cannot be solved with a linear policy, as it requires\nthe agent to learn two distinct subtasks: (1) to add energy to the system when it needs to swing up\nthe pole, and (2) to remove energy to balance the pole once the pole is close to the unstable, upright\nequilibrium [63]. Our setup is closely based on the environment described in [16, 69], where the\nground truth dynamics of the environment are described as [x\u0308, \u03b8\u0308] = F(x, \u03b8, x\u0307, \u03b8\u0307). 
F is a system of\nnon-linear equations, and the agent is rewarded for getting x close to zero and cos(\u03b8) close to one.\nFor more details, see the Appendix (footnote 2).\nThe setup of the cart-pole experiment augmented with observational dropout is visualized in Figure 1.\nWe report the performance of our agent trained in environments with various peek probabilities, p, in\nFigure 2 (left). A result higher than \u223c 500 means that the agent is able to swing up and balance the\ncart-pole most of the time. Interestingly, the agent is still able to solve the task even when only looking\nat a tenth of the frames (p = 10%), and even at a lower p = 5%, it solves the task half of the time.\n\nFootnote 2: Released code to facilitate reproduction of experiments is available at https://learningtopredict.github.io/\n\nFigure 2: Left: Performance of cart-pole swing up under various observational dropout probabilities,\np. Here, both the policy and world model are learned. Right: Performance of deploying policies\ntrained from scratch inside of the environment generated by the world model, in the actual environment.\nFor each p, the experiment is run 10 times independently (orange). Performance is measured\nby averaging cumulative scores over 100 rollouts. Model-based baseline performances learned via\na forward-predictive loss are indicated in red and blue. Note how world models trained\nunder approximately 3-5% observational dropout can be used to train performant policies.\n\nTo understand the extent to which the policy, \u03c0, relies on the learned world model, M, and to probe the\ndynamics of the learned world model, we trained new policies entirely within the learned world model and then\ndeployed these policies back to the original environment. Results are shown in Figure 2 (right). Qualitatively,\nthe agent learns to swing up the pole and balance it for a short period of time when it achieves a mean\nreward above \u223c 300. 
Below this threshold the agent typically swings the pole around continuously,\nor navigates off the screen. We observe that at low peek probabilities, a higher percentage of learned\nworld models can be used to train policies that behave correctly under the actual dynamics, despite\nfailing to completely solve the task. At higher peek probabilities, the learned dynamics model is not\nneeded to solve the task, and thus is never learned.\nWe compared our approach to a baseline model-based approach where we explicitly train our\nmodel to predict the next observation on a dataset collected from training a model-free agent from\nscratch to solve the task. To our surprise, our approach can produce\nmodels that outperform an explicitly learned model with the same architecture size (120 units) for\nthe cart-pole transfer task. This advantage goes away, however, if we scale up the forward predictive\nmodel width by 10x.\n\nFigure 3: a. In the generated environment, the cart-pole stabilizes at an angle that is not perfectly\nperpendicular, due to the imperfect nature of the generated environment. b. This policy is still able to swing up the cart-pole in the\nactual environment, although it remains balanced only for some time before falling down. The world\nmodel is jointly trained with an observational dropout probability of p = 5%.\n\nFigure 3 depicts a trajectory of a policy trained entirely within a learned world model and then deployed\non the actual environment. It is interesting to note that the dynamics in the world model, M, are\nnot perfect; for instance, the optimal policy inside the world model can only swing up and balance\nthe pole at an angle that is not perpendicular to the ground. We notice that in other world models, the\noptimal policy learns to swing up the pole and only balance it for a short period of time, even in the\nself-contained world model. 
It should not surprise us, then, that the most successful policies, when\ndeployed back to the actual environment, can swing up and only balance the pole for a short while\nbefore the pole falls down.\nAs noted earlier, the task of stabilizing the pole once it is near its target state (when x, \u03b8, x\u0307, \u03b8\u0307 are near\nzero) is trivial, hence a policy, \u03c0, jointly trained with world model, M, will not require accurate\npredictions to keep the pole balanced. For this subtask, \u03c0 needs only to occasionally observe the\nactual world and realign its internal observation with reality. Conversely, the subtask of swinging\nthe pole upwards and then lowering the velocities is much more challenging, hence \u03c0 will rely on\nthe world model to capture the essence of the dynamics for it to accomplish the subtask. The world\nmodel M only learns the difficult part of the real world, as that is all that is required of it to facilitate\nthe policy performing well on the task.\n\n[Figure 2 panels. Left: Cartpole Swingup Mean Cumulative Score vs Peek Probability (x-axis: Peek Probability; y-axis: Mean Cumulative Reward). Right: Cartpole Swingup: Deploying Policy Learned in World Model to Actual Environment (x-axis: World Model Learned with Peek Probability; y-axis: Mean Cumulative Reward). Legend: learned model (1200 hidden units): 430 \u00b1 15; learned model (120 hidden units): 274 \u00b1 122; champion solution in population: 593 \u00b1 24.]\n[Figure 3 panels. (a) Policy learned in environment generated using world model. (b) Deploying policy learned in (a) into real environment.]\n\n4.2 Examining world models\u2019 inductive biases in a grid world\n\nTo illustrate the generality of our method to more varied domains, and to further emphasize the\nrole played by inductive bias in our models, we consider an additional problem: a classic search /\navoidance task in a grid world. 
In this problem, an agent navigates a grid environment with randomly\nplaced apples and \ufb01res. Apples provide reward, and \ufb01res provide negative reward. The agent is\nallowed to move in the four cardinal directions, or to perform a no-op. For more details, please refer\nto the Appendix.\n\nFigure 4: A cartoon demonstrating the shift of the receptive \ufb01eld of the world model as it moves to\nthe right. The greyed out column indicates the column of forgotten data, and the light blue column\nindicates the \u201cnew\u201d information gleaned from moving to the right. An optimal predictor would learn\nthe distribution function p and sample from it to populate this rightmost column, and would match the\nground truth everywhere else. The rightmost heatmap illustrates how predictions of a convolutional\nmodel correlate with the ground truth (more orange = more predictive) when moving to the right,\naveraged over 1000 randomized right-moving steps. See the Appendix for more details. Crucially,\nthis heat map is most predictive for the cells the agent can actually see, and is less predictive for the\ncells right outside its \ufb01eld of view (the rightmost column) as expected.\n\nFor simplicity, we considered only stateless policies and world models. While this necessarily limits\nthe expressive capacity of our world models, the optimal forward predictive model within this class of\nnetworks is straightforward to consider: movement of the agent essentially corresponds to a bit-shift\nmap on the world model\u2019s observation vectors. For example, for an optimal forward predictor, if an\nagent moves rightwards, every apple and \ufb01re within its receptive \ufb01eld should shift to the left. 
The\nleftmost column of observations shifts out of sight, and is forgotten\u2014as the model is stateless\u2014and\nthe rightmost column of observations should be populated according to some distribution which\ndepends on the locations of apples and fires visible to the agent, as well as the particular scheme used\nto populate the world with apples and fires. Figure 4 illustrates the receptive field of the world model.\n\nFigure 5: Performance, R, of the two architectures, empirically averaged over one hundred policies and a\nthousand rollouts as a function of peek probability, p. The convolutional architecture reliably\noutperforms the fully connected architecture. Error bars indicate standard error. Intuitively, a score near 0\namounts to random motion on the lattice\u2014encountering apples as often as fires, and 2 approximately\ncorresponds to encountering apples two to three times more often than fires. A baseline that is\ntrained on a version of the environment without any fires\u2014i.e., a proxy baseline for an agent that can\nperfectly avoid fires\u2014reliably achieves a score of 3. Agents were trained for 4000 generations.\n[Figure 5 plot. x-axis: Peek Probability; y-axis: R; series: conv, fc.]\n\nThis partial observability of the world immediately handicaps the ability of the world model to\nperform long imagined trajectories in comparison with the previous continuous, fully observed\ncart-pole tasks. Nonetheless, there remains sufficient information in the world to train world models\nvia observational dropout that are predictive.\nFor our numerical experiments we compared two different world model architectures: a fully\nconnected model and a convolutional model. See the Appendix for more details. 
Naively, these\nmodels are listed in increasing order of inductive bias, but decreasing order of overall capacity\n(10650 parameters for the fully connected model, 1201 learnable parameters for the convolutional\nmodel)\u2014i.e., the fully connected architecture has the highest capacity and the least bias, whereas the\nconvolutional model has the most bias but the least capacity. The performance of these models on the\ntask as a function of peek probability is provided in Figure 5. As in the cart-pole tasks, we trained the\nagent\u2019s policy and world model jointly, where with some probability p the agent sees the ground truth\nobservation instead of predictions from its world model.\nCuriously, even though the fully connected architecture has the highest overall capacity, and is capable\nof learning a transition map closer to the \u201coptimal\u201d forward predictive function for this task if taught\nto do so via supervised learning of a forward-predictive loss, it reliably performs worse than the\nconvolutional architectures on the search and avoidance task. This is not entirely surprising: the\nconvolutional architectures induce a considerably better prior over the space of world models than\nthe fully connected architecture via their translational invariance. It is comparatively much easier for\nthe convolutional architectures to randomly discover the right sort of transition maps.\n\nFigure 6: Empirically averaged correlation matrices between a world model\u2019s output and the ground\ntruth. Averages were calculated using 1000 random transitions for each direction of a typical\nconvolutional p = 75% world model. Higher correlation (yellow-white) translates to a world model\nthat is closer to a next frame predictor. 
Note that a predictive map is not learned for every direction.\nThe row and column of dark pixels for \u2193 and \u2192, respectively, correspond exactly to the newly seen\npixels for those directions, which are indicated in light blue in Figure 4.\n\nBecause the world model is not being explicitly optimized to achieve forward prediction, it often\ndoesn\u2019t learn a predictive function for every direction. We selected a typical convolutional world model\nand plotted its empirically averaged correlation with the ground truth next-frames in Figure 6. Here, the\nworld model clearly only learns reliable transition maps for moving down and to the right, which is\nsufficient. Qualitatively, we found that the convolutional world models learned with peek probability\nclose to p = 50% were \u201cbest\u201d in that they were more likely to result in accurate transition maps,\nsimilar to the cart-pole results indicated in Figure 2 (right). Fully connected world models reliably\nlearned completely uninterpretable transition maps (e.g., see the additional correlation plots in the\nAppendix). That policies could almost achieve the same performance with fully connected world\nmodels as with convolutional world models is reminiscent of a recurrent architecture that uses the\n(generally not-easily-interpretable) hidden state as a feature.\n\n4.3 Car Racing: Keep your eyes off the road\n\nIn more challenging environments, observations are often expressed as high dimensional pixel images\nrather than state vectors. In this experiment, we apply observational dropout to learn a world model of\na car racing game from pixel observations. We would like to know to what extent the world model\ncan help the policy drive if the agent is allowed to see the road only a fraction of\nthe time. 
We are also interested in the representations the model learns to facilitate driving, and in\nmeasuring the usefulness of its internal representation for this task.\nIn Car Racing [31], the agent\u2019s goal is to drive around the tracks, which are randomly generated\nfor each trial, and drive over as many tiles as possible in the shortest time. At each timestep, the\nenvironment provides the agent with a high dimensional pixel image observation, and the agent\noutputs 3 continuous action parameters that control the car\u2019s steering, acceleration, and brakes.\n\n[Figure 6 panels: \u2193, \u2191, \u2192, \u2190, no-op; color scale from 0.0 to 1.0.]\n\nFigure 7: Two examples of action-conditioned predictions from a world model trained at p = 10%\n(bottom rows). Red boxes indicate actual observations from the environment the agent is allowed\nto see. While the agent is devoid of sight, the world model predicts (1) small movements of the car\nrelative to the track and (2) upcoming turns. Without access to actual observations for many timesteps,\nit incorrectly predicts a turn in (3) until an actual observation realigns the world model with reality.\n\nTo reduce the dimensionality of the pixel observations, we follow the procedure in [21] and train\na Variational Autoencoder (VAE) [30, 48] on rollouts collected from a random policy, to\ncompress a pixel observation into a low-dimensional latent vector z. Our agent will use z instead\nas its observation. Examples of pixel observations, and reconstructions from their compressed\nrepresentations, are shown in the first 2 rows of Figure 7. Our policy, a feed forward network, will act\non actual observations with probability p, otherwise on observations produced by the world model.\nOur world model, M, a small feed forward network with a hidden layer, outputs the change of the\nmean latent vector z, conditioned on the previous observation (actual or predicted) and action taken\n(i.e., \u2206z = M(z, a)). 
We can use the VAE\u2019s decoder to visualize the latent vectors produced by M,\nand compare them with the actual observations that the agent is not able to see (Figure 7). We observe\nthat our world model, while not explicitly trained to predict future frames, is still able to make\nmeaningful action-conditioned predictions. The model learns to predict local changes in the car\u2019s\nposition relative to the road given the action taken, and also attempts to predict upcoming curves.\n\nFigure 8: Left: Mean performance on Car Racing under various p over 100 trials. Right: Mean\nperformance achieved by training a linear policy using only the outputs of the hidden layer of a world\nmodel learned at peek probability p. We run 5 independent seeds for each p (orange). Model-based\nbaseline performances learned via a forward-predictive loss are indicated in red and blue. We note\nthat in this constrained linear policy setup, our best solution out of a population of trials achieves a\nperformance slightly below reported state-of-the-art results (i.e., [21, 49]). As in the swingup cartpole\nexperiments, the best world models for training policies occur at a characteristic peek probability\nthat roughly coincides with the peek probability at which performance begins to degrade for jointly\ntrained models (i.e., the bend in the left panel occurs near the peak of the right panel).\n\nOur policy \u03c0 is jointly trained with world model M in the car racing environment augmented with\na peek probability p. The agent\u2019s performance is reported in Figure 8 (left). 
Qualitatively, a score\nabove \u223c800 means that the agent can navigate around the track with only occasional driving errors.\nWe see that the agent is still able to perform the task when 70% of the actual observation frames are\ndropped out, and the world model is relied upon to fill in the observation gaps for the policy.\n\n[Figure 7 panels, for rollouts (a) and (b): actual frames from rollout; VAE reconstructions of actual frames; VAE decoded images of predicted latent vectors; time runs left to right.]\n[Figure 8 panels. Left: Car Racing mean cumulative score vs. peek probability. Right: Car Racing performance using the world model\u2019s hidden units as inputs vs. the peek probability at which the world model was learned. Vertical axes: mean cumulative reward. Legend: Ha and Schmidhuber (2018): 906 \u00b1 21; Risi and Stanley (2019): 903 \u00b1 72; champion solution: 873 \u00b1 71.]\n\nIf the world model produces useful predictions for the policy, then the hidden representation it uses\nto produce those predictions should also provide useful features for the task at hand. We can test\nwhether the hidden units of the world model are directly useful for the task by first freezing the\nweights of the world model, and then training from scratch a linear policy that takes the outputs of\nthe intermediate hidden layer of the world model as its only inputs. 
This feature vector extracted from the\nhidden layer is mapped directly to the 3 outputs controlling the car, and we can measure the\nperformance of a linear policy using features of world models trained at various peek probabilities.\nThe results reported in Figure 8 (right) show that world models trained at lower peek probabilities\nhave a higher chance of learning features that are useful enough for a linear controller to achieve an\naverage score of 800. The average performance of the linear controller peaks when using models\ntrained with p around 40%. This suggests that a world model learns more useful representations\nwhen the policy needs to rely more on its predictions as the agent\u2019s ability to observe the environment\ndecreases. However, a peek probability too close to zero will hinder the agent\u2019s ability to perform its\ntask, especially in non-deterministic environments such as this one, and thus also reduce the usefulness\nof its world model for the real world, as the agent is almost completely disconnected from reality.\n\n5 Discussion\n\nIn this work, we explore world models that emerge when training with observational dropout for\nseveral reinforcement learning tasks. In particular, we have demonstrated how effective world models\ncan emerge from the optimization of total reward. Even on these simple environments, the emerged\nworld models do not perfectly model the world, but they facilitate policy learning well enough to\nsolve the studied tasks.\nThe deficiencies of the world models learned in this way follow a consistent pattern: the cart-pole world\nmodels learned to swing up the pole, but did not have a perfect notion of equilibrium; the grid world\nworld models could perform reliable bit-shift maps, but only in certain directions; and the car racing\nworld model tended to ignore the forward motion of the car, unless a turn was visible to the agent\n(or imagined). 
Crucially, none of these deficiencies were catastrophic enough to cripple the agent\u2019s\nperformance. In fact, these deficiencies were, in some cases, irrelevant to the performance of the\npolicy. We speculate that the complexity of world models could be greatly reduced if they could fully\nleverage this idea: that a complete model of the world is actually unnecessary for most tasks, and that by\nidentifying the important part of the world, policies could be trained significantly more quickly, or\nmore sample efficiently.\nWe hope this work stimulates further exploration of both model-based and model-free reinforcement\nlearning, particularly in areas where learning a perfect world model is intractable.\n\nAcknowledgments\n\nWe would like to thank our three reviewers for their helpful comments. Additionally, we would like\nto thank Alex Alemi, Tom Brown, Douglas Eck, Jaehoon Lee, B\u0142a\u017cej Osi\u0144ski, Ben Poole, Jascha\nSohl-Dickstein, Mark Woodward, Andrea Benucci, Julian Togelius, Sebastian Risi, Hugo Ponte,\nand Brian Cheung for helpful comments, discussions, and advice on early versions of this work.\nExperiments in this work were conducted with the support of Google Cloud Platform.\n\nReferences\n\n[1] James F Allen and Johannes A Koomen. Planning using a temporal world model. In Proceedings of\nthe Eighth international joint conference on Artificial intelligence-Volume 2, pages 741\u2013747. Morgan\nKaufmann Publishers Inc., 1983.\n\n[2] Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J Zico Kolter. Differentiable MPC for\nend-to-end planning and control. In Advances in Neural Information Processing Systems, pages 8289\u20138300,\n2018.\n\n[3] Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based reinforcement\nlearning. 
arXiv preprint arXiv:1804.07193, 2018.\n\n[4] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski,\nAlexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al. Vector-based navigation\nusing grid-like representations in artificial agents. Nature, 557(7705):429, 2018.\n\n[5] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve\ndifficult learning control problems. IEEE transactions on systems, man, and cybernetics, pages 834\u2013846,\n1983.\n\n[6] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new\nperspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798\u20131828, 2013.\n\n[7] Josh Bongard, Victor Zykov, and Hod Lipson. Resilient machines through continuous self-modeling.\nScience, 314(5802):1118\u20131121, 2006.\n\n[8] Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert,\nFabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative\nmodels for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.\n\n[9] Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, Jean-Baptiste Lespiau,\nand Nicolas Heess. Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv preprint\narXiv:1811.06272, 2018.\n\n[10] Christopher J Cueva and Xue-Xin Wei. Emergence of grid-like representations by training recurrent neural\nnetworks to perform spatial localization. arXiv preprint arXiv:1803.07770, 2018.\n\n[11] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search.\nIn Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465\u2013472,\n2011.\n\n[12] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. 
arXiv preprint\narXiv:1802.07687, 2018.\n\n[13] Bradley B Doll, Dylan A Simon, and Nathaniel D Daw. The ubiquity of model-based reinforcement\nlearning. Current opinion in neurobiology, 22(6):1075\u20131081, 2012.\n\n[14] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight:\nModel-based deep reinforcement learning for vision-based robotic control. arXiv preprint\narXiv:1812.00568, 2018.\n\n[15] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through\nvideo prediction. In Advances in neural information processing systems, pages 64\u201372, 2016.\n\n[16] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural\nnetwork dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4, 2016.\n\n[17] Onofrio Gigliotta, Giovanni Pezzulo, and Stefano Nolfi. Evolution of a predictive internal model in an\nembodied and situated agent. Theory in biosciences, 130(4):259\u2013276, 2011.\n\n[18] David E Goldberg and John H Holland. Genetic algorithms and machine learning. Machine learning,\n3(2):95\u201399, 1988.\n\n[19] D. Ha. Evolving stable strategies. http://blog.otoro.net/, 2017. URL\nhttp://blog.otoro.net/2017/11/12/evolving-stable-strategies/.\n\n[20] David Ha. Reinforcement learning for improving agent design. arXiv:1810.03779, 2018. URL\nhttps://designrl.github.io.\n\n[21] David Ha and J\u00fcrgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. In Advances in\nNeural Information Processing Systems 31, pages 2451\u20132463. Curran Associates, Inc., 2018.\n\n[22] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James\nDavidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.\n\n[23] Nikolaus Hansen, Sibylle D M\u00fcller, and Petros Koumoutsakos. 
Reducing the time complexity of the\nderandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary computation,\n11(1):1\u201318, 2003.\n\n[24] Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed,\nand Alexander Lerchner. Early visual concept learning with unsupervised deep learning. arXiv preprint\narXiv:1606.05579, 2016.\n\n[25] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and\nAlexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230,\n2018.\n\n[26] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks.\nScience, 313(5786):504\u2013507, 2006.\n\n[27] John Henry Holland et al. Adaptation in natural and artificial systems: an introductory analysis with\napplications to biology, control, and artificial intelligence. MIT press, 1975.\n\n[28] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad\nCzechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based\nreinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.\n\n[29] Nal Kalchbrenner, A\u00e4ron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves,\nand Koray Kavukcuoglu. Video pixel networks. In Proceedings of the 34th International Conference on\nMachine Learning-Volume 70, pages 1771\u20131779. JMLR.org, 2017.\n\n[30] D. Kingma and M. Welling. Auto-encoding variational bayes. Preprint arXiv:1312.6114, 2013. URL\nhttps://arxiv.org/abs/1312.6114.\n\n[31] Oleg Klimov. CarRacing-v0. https://gym.openai.com/envs/CarRacing-v0/, 2016.\n\n[32] M Kumar, M Babaeizadeh, D Erhan, C Finn, S Levine, L Dinh, and D Kingma. Videoflow: A flow-based\ngenerative model for video. 
arXiv preprint arXiv:1903.01434, 2019.\n\n[33] Quoc V Le, Marc\u2019Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff Dean,\nand Andrew Y Ng. Building high-level features using large scale unsupervised learning. arXiv preprint\narXiv:1112.6209, 2011.\n\n[34] Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley,\nSamuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution:\nA collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv\npreprint arXiv:1803.03453, 2018.\n\n[35] Ian Lenz, Ross A Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model\npredictive control. In Robotics: Science and Systems. Rome, Italy, 2015.\n\n[36] Hugo Marques, Julian Togelius, Magdalena Kogutowska, Owen Holland, and Simon M Lucas. Sensorless\nbut not senseless: Prediction in evolutionary car racing. In 2007 IEEE Symposium on Artificial Life, pages\n370\u2013377. IEEE, 2007.\n\n[37] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean\nsquare error. arXiv preprint arXiv:1511.05440, 2015.\n\n[38] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,\nAlex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through\ndeep reinforcement learning. Nature, 518(7540):529, 2015.\n\n[39] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley,\nDavid Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In\nInternational conference on machine learning, pages 1928\u20131937, 2016.\n\n[40] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics\nfor model-based deep reinforcement learning with model-free fine-tuning. 
In 2018 IEEE International\nConference on Robotics and Automation (ICRA), pages 7559\u20137566. IEEE, 2018.\n\n[41] Anusha Nagabandi, Guangzhao Yang, Thomas Asmar, Ravi Pandya, Gregory Kahn, Sergey Levine, and\nRonald S Fearing. Learning image-conditioned dynamics models for control of underactuated legged\nmillirobots. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages\n4606\u20134613. IEEE, 2018.\n\n[42] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw\npuzzles. In European Conference on Computer Vision, pages 69\u201384. Springer, 2016.\n\n[43] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video\nprediction using deep networks in Atari games. In Advances in neural information processing systems,\npages 2863\u20132871, 2015.\n\n[44] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive\ncoding. arXiv preprint arXiv:1807.03748, 2018.\n\n[45] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with\ndifferentiable memory. arXiv preprint arXiv:1511.06309, 2015.\n\n[46] Gianluigi Pillonetto, Francesco Dinuzzo, Tianshi Chen, Giuseppe De Nicolao, and Lennart Ljung. Kernel\nmethods in system identification, machine learning and function estimation: A survey. Automatica,\n50(3):657\u2013682, 2014.\n\n[47] Ingo Rechenberg. Evolutionsstrategie\u2013optimierung technischer systeme nach prinzipien der biologischen\nevolution. Frommann-Holzboog, 1973.\n\n[48] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep\ngenerative models. Preprint arXiv:1401.4082, 2014. URL https://arxiv.org/abs/1401.4082.\n\n[49] Sebastian Risi and Kenneth O. Stanley. Deep neuroevolution of recurrent and discrete world models. 
In\nProceedings of the Genetic and Evolutionary Computation Conference, GECCO \u201919, pages 456\u2013462,\nNew York, NY, USA, 2019. ACM. ISBN 978-1-4503-6111-8. doi: 10.1145/3321707.3321817. URL\nhttp://doi.acm.org/10.1145/3321707.3321817.\n\n[50] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to\nreinforcement learning. Preprint arXiv:1703.03864, 2017.\n\n[51] J. Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of\nreinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249,\n2015.\n\n[52] J\u00fcrgen Schmidhuber. Making the world differentiable: On using self-supervised fully recurrent neural\nnetworks for dynamic reinforcement learning and planning in non-stationary environments. Technical\nReport, 1990.\n\n[53] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy\noptimization. In International Conference on Machine Learning, pages 1889\u20131897, 2015.\n\n[54] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy\noptimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[55] H-P Schwefel. Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie. (Teil 1,\nKap. 1-5). Birkh\u00e4user, 1977.\n\n[56] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, and Sergey\nLevine. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International\nConference on Robotics and Automation (ICRA), pages 1134\u20131141. IEEE, 2018.\n\n[57] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-\nArnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and\nplanning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages\n3191\u20133199. JMLR. 
org, 2017.\n\n[58] N Srivastava, G Hinton, A Krizhevsky, I Sutskever, and R Salakhutdinov. Dropout: a simple way to prevent\nneural networks from overfitting. JMLR, 15(1):1929\u20131958, 2014.\n\n[59] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video\nrepresentations using LSTMs. arXiv preprint arXiv:1502.04681, 2015.\n\n[60] Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune.\nDeep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks\nfor reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.\n\n[61] Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press\nCambridge, 1998.\n\n[62] Erik Talvitie. Model regularization for stable sample rollouts. In UAI, pages 780\u2013789, 2014.\n\n[63] Russ Tedrake. Underactuated robotics: Learning, planning, and control for efficient and agile machines:\nCourse notes for MIT 6.832. Working draft edition, 3, 2009.\n\n[64] Sebastian Thrun, Knut M\u00f6ller, and Alexander Linden. Planning with an adaptive world model. In Advances\nin neural information processing systems, pages 450\u2013456, 1991.\n\n[65] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally\nlinear latent dynamics model for control from raw images. In Advances in neural information processing\nsystems, pages 2746\u20132754, 2015.\n\n[66] Paul J Werbos. Learning how the world works: Specifications for predictive networks in robots and brains.\nIn Proceedings of IEEE International Conference on Systems, Man and Cybernetics, NY, 1987.\n\n[67] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In 2008\nIEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence),\npages 3381\u20133387. IEEE, 2008.\n\n[68] Ronald J Williams. 
Simple statistical gradient-following algorithms for connectionist reinforcement\nlearning. Machine learning, 8(3-4):229\u2013256, 1992.\n\n[69] Xingdong Zuo. PyTorch implementation of Improving PILCO with Bayesian neural network dynamics\nmodels, 2018. https://github.com/zuoxingdong/DeepPILCO.\n", "award": [], "sourceid": 2886, "authors": [{"given_name": "Daniel", "family_name": "Freeman", "institution": "Google Brain"}, {"given_name": "David", "family_name": "Ha", "institution": "Google Brain"}, {"given_name": "Luke", "family_name": "Metz", "institution": "Google Brain"}]}