{"title": "Unsupervised State Representation Learning in Atari", "book": "Advances in Neural Information Processing Systems", "page_first": 8769, "page_last": 8782, "abstract": "State representation learning, or the ability to capture latent generative factors of an environment is crucial for building intelligent agents that can perform a wide variety of tasks. Learning such representations in an unsupervised manner without supervision from rewards is an open problem. We introduce a method that tries to learn better state representations by maximizing mutual information across spatially and temporally distinct features of a neural encoder of the observations. We also introduce a new benchmark based on Atari 2600 games where we evaluate representations based on how well they capture the ground truth state. We believe this new framework for evaluating representation learning models will be crucial for future representation learning research. Finally, we compare our technique with other state-of-the-art generative and contrastive representation learning methods.", "full_text": "Unsupervised State Representation Learning in Atari\n\nAnkesh Anand*\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nMicrosoft Research\n\nEvan Racah*\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nSherjil Ozair*\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nYoshua Bengio\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nMarc-Alexandre C\u00f4t\u00e9\n\nMicrosoft Research\n\nR Devon Hjelm\nMicrosoft Research\n\nMila, Universit\u00e9 de Montr\u00e9al\n\nAbstract\n\nState representation learning, or the ability to capture latent generative factors\nof an environment, is crucial for building intelligent agents that can perform a\nwide variety of tasks. Learning such representations without supervision from\nrewards is a challenging open problem. 
We introduce a method that learns\nstate representations by maximizing mutual information across spatially and\ntemporally distinct features of a neural encoder of the observations. We also\nintroduce a new benchmark based on Atari 2600 games where we evaluate\nrepresentations based on how well they capture the ground truth state variables.\nWe believe this new framework for evaluating representation learning models\nwill be crucial for future representation learning research. Finally, we\ncompare our technique with other state-of-the-art generative and contrastive\nrepresentation learning methods. The code associated with this work is available at\nhttps://github.com/mila-iqia/atari-representation-learning\n\n1 Introduction\n\nThe ability to perceive visual sensory data and represent it as useful and concise descriptions is\nconsidered a fundamental cognitive capability in humans [1, 2], and thus crucial for building intelligent\nagents [3]. Representations that concisely capture the true state of the environment should empower\nagents to effectively transfer knowledge across different tasks in the environment, and enable learning\nwith fewer interactions [4].\nRecently, deep representation learning has led to tremendous progress in a variety of machine learning\nproblems across numerous domains [5, 6, 7, 8, 9]. Typically, such representations are learned\nvia end-to-end learning using the signal from labels or rewards, which often makes such techniques\nvery sample-inefficient. Human perception in the natural world, however, appears to require almost\nno explicit supervision [10].\nUnsupervised [11, 12, 13] and self-supervised representation learning [14, 15, 16] have emerged as\nalternatives to supervised learning that can yield useful representations with reduced sample\ncomplexity. 
In the context of learning state representations [17], current unsupervised methods rely\non generative decoding of the data using either VAEs [18, 19, 20, 21] or prediction in pixel-space\n[22, 23]. Since these objectives are based on reconstruction error in the pixel space, they are not\nincentivized to capture abstract latent factors and often default to capturing pixel-level details.\nIn this work, we leverage recent advances in self-supervision that rely on scalable estimation of\nmutual information [24, 25, 26, 27], and propose a new contrastive state representation learning\nmethod named Spatiotemporal Deep Infomax (ST-DIM), which maximizes the mutual information\nacross both the spatial and temporal axes.\n\n*Equal contribution. {anandank, racaheva, ozairs}@mila.quebec\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n[Figure 1 annotations. Montezuma\u2019s Revenge: score, lives, items in inventory, room number, agent\u2019s x, y location, skull\u2019s x, y location. MsPacman: number of dots eaten, number of ghosts, x, y locations of Sue, Inky, Blinky, and the agent, lives, score.]\n\nFigure 1: We use a collection of 22 Atari 2600 games to evaluate state representations. We leveraged\nthe source code of the games to annotate the RAM states with important state variables such as the\nlocation of various objects in the game. We compare various unsupervised representation learning\ntechniques based on how well the representations linearly separate the state variables. Shown above\nare examples of state variables annotated for Montezuma\u2019s Revenge and MsPacman.\n\nTo systematically evaluate the ability of different representation learning methods to capture the true\nunderlying factors of variation, we propose a benchmark based on Atari 2600 games using the Arcade\nLearning Environment [ALE, 28]. 
A simulated environment provides access to the underlying\ngenerative factors of the data, which we extract using the source code of the games. These factors\ninclude variables such as the location of the player character, location of various items of interest\n(keys, doors, etc.), and various non-player characters, such as enemies (see Figure 1). Performance of\na representation learning technique in the Atari representation learning benchmark is then evaluated\nusing linear probing [29], i.e., the accuracy of linear classifiers trained to predict the latent generative\nfactors from the learned representations.\nOur contributions are the following:\n\n1. We propose a new self-supervised state representation learning technique which exploits the\nspatiotemporal nature of visual observations in a reinforcement learning setting.\n\n2. We propose a new state representation learning benchmark using 22 Atari 2600 games based\non the Arcade Learning Environment (ALE).\n\n3. We conduct extensive evaluations of existing representation learning techniques on the\nproposed benchmark and compare with our proposed method.\n\n2 Spatiotemporal Deep Infomax\n\nWe assume a setting where an agent interacts with an environment and observes a set of\nhigh-dimensional observations $X = \\{x_1, x_2, \\ldots, x_N\\}$ across several episodes. Our goal is to learn an\nabstract representation of the observation that captures the underlying latent generative factors of the\nenvironment.\nThis representation should focus on high-level semantics (e.g., the concept of agents, enemies,\nobjects, score, etc.) and ignore low-level details such as the precise texture of the background,\nwhich warrants a departure from the class of methods that rely on a generative decoding of the full\nobservation. Prior work in neuroscience [30, 31] has suggested that the brain maximizes predictive\ninformation [32] at an abstract level to avoid sensory overload. 
Predictive information, or the mutual\ninformation between consecutive states, has also been shown to be the organizing principle of\nretinal ganglion cells in salamander brains [33]. Thus our representation learning approach relies\non maximizing an estimate based on a lower bound on the mutual information over consecutive\nobservations $x_t$ and $x_{t+1}$.\n\n2.1 Maximizing mutual information across space and time\n\nFigure 2: A schematic overview of Spatiotemporal Deep Infomax (ST-DIM). Left: The two different\nmutual information objectives: local-local infomax and global-local infomax. Right: A simplified\nversion of the global-local contrastive task. In practice, we use multiple negative samples.\n\nGiven a mutual information estimator, we follow DIM [26] and maximize a sum of patch-level mutual\ninformation objectives. The global-local objective in equation 2 maximizes the mutual information\nbetween the full observation at time t and small patches of the observation at time t + 1. The\nrepresentations of the small image patches are taken to be the hidden activations of the convolutional\nencoder applied to the full observation. The layer is picked appropriately to ensure that the hidden\nactivations only have a limited receptive field corresponding to 1/16th the size of the full observations.\nThe local-local objective in equation 3 maximizes the mutual information between the local feature\nat time t and the corresponding local feature at time t + 1. Figure 2 is a visual depiction of our\nmodel, which we call Spatiotemporal Deep Infomax (ST-DIM).\nIt has been shown that mutual information bounds can be loose for large values of the mutual\ninformation [34] and in practice fail to capture all the relevant features in the data [35] when used\nto learn representations. 
To alleviate this issue, our approach constructs multiple small mutual\ninformation objectives (rather than a single large one), which are easier to estimate via lower bounds;\nthis strategy has been concurrently found to work well in the context of semi-supervised learning [36].\nFor the mutual information estimator, we use InfoNCE [25], a multi-sample variant of\nnoise-contrastive estimation [37] that was also shown to work well with DIM. Let $\\{(x_i, y_i)\\}_{i=1}^N$ be a\npaired dataset of N samples from some joint distribution p(x, y). For any index i, $(x_i, y_i)$ is a sample\nfrom the joint p(x, y), which we refer to as a positive example, and for any $i \\neq j$, $(x_i, y_j)$ is a sample\nfrom the product of marginals p(x)p(y), which we refer to as a negative example. The InfoNCE\nobjective learns a score function f(x, y) which assigns large values to positive examples and small\nvalues to negative examples by maximizing the following bound (see [25, 38] for more details on this\nbound):\n\n$$I_{NCE}(\\{(x_i, y_i)\\}_{i=1}^N) = \\sum_{i=1}^N \\log \\frac{\\exp f(x_i, y_i)}{\\sum_{j=1}^N \\exp f(x_i, y_j)}$$ (1)\n\nThe above objective has also been referred to as multi-class n-pair loss [39, 40] and ranking-based\nNCE [41], and is similar to MINE [24] and the JSD-variant of DIM [26].\n\nFollowing van den Oord et al. [25], we use a bilinear model for the score function, $f(x, y) = \\phi(x)^T W \\phi(y)$, where $\\phi$ is the representation encoder. The bilinear model combined with the InfoNCE\nobjective forces the encoder to learn linearly predictable representations, which we believe helps in\nlearning representations at the semantic level.\nLet $X = \\{(x_t, x_{t+1})_i\\}_{i=1}^B$ be a minibatch of consecutive observations that are randomly sampled\nfrom several collected episodes. 
Let $X_{next} = X[:, 1]$ correspond to the set of next observations.\nIn our context, the positive samples correspond to pairs of consecutive observations $(x_t, x_{t+1})$ and\nnegative samples correspond to pairs of non-consecutive observations $(x_t, x_{t^*})$, where $x_{t^*}$ is a\nrandomly sampled observation from the same minibatch.\nAs mentioned above, in ST-DIM, we construct two losses: the global-local objective (GL) and the\nlocal-local objective (LL). The global-local objective is as follows:\n\n$$L_{GL} = \\sum_{m=1}^M \\sum_{n=1}^N -\\log \\frac{\\exp(g_{m,n}(x_t, x_{t+1}))}{\\sum_{x_{t^*} \\in X_{next}} \\exp(g_{m,n}(x_t, x_{t^*}))}$$ (2)\n\nwhere the score function for the global-local objective is $g_{m,n}(x_t, x_{t+1}) = \\phi(x_t)^T W_g \\phi_{m,n}(x_{t+1})$ and\n$\\phi_{m,n}$ is the local feature vector produced by an intermediate layer in $\\phi$ at the (m, n) spatial location.\nThe local-local objective is as follows:\n\n$$L_{LL} = \\sum_{m=1}^M \\sum_{n=1}^N -\\log \\frac{\\exp(f_{m,n}(x_t, x_{t+1}))}{\\sum_{x_{t^*} \\in X_{next}} \\exp(f_{m,n}(x_t, x_{t^*}))}$$ (3)\n\nwhere the score function of the local-local objective is $f_{m,n}(x_t, x_{t+1}) = \\phi_{m,n}(x_t)^T W_l \\phi_{m,n}(x_{t+1})$.\n\n3 The Atari Annotated RAM Interface (AtariARI)\n\nMeasuring the usefulness of a representation is still an open problem, as a core utility of\nrepresentations is their use as feature extractors in tasks that are different from those used for training (e.g.,\ntransfer learning). Measuring classification performance, for example, may only reveal the amount\nof class-relevant information in a representation, but may not reveal other information useful for\nsegmentation. 
It would be useful, then, to have a more general set of measures of the usefulness of a\nrepresentation, such as ones that may indicate more general utility across numerous real-world tasks.\nIn this vein, we assert that in the context of dynamic, visual, interactive environments, the capability\nof a representation to capture the underlying high-level factors of the state of an environment will be\ngenerally useful for a variety of downstream tasks such as prediction, control, and tracking.\nWe find video games to be a useful candidate for evaluating visual representation learning algorithms\nprimarily because they are spatiotemporal in nature, which matters for two reasons: (1) spatiotemporal data is more realistic than static\ni.i.d. datasets, and (2) prior work [42, 43] has argued that without temporal structure, recovering\nthe true underlying latent factors is undecidable. Apart from this, video games, unlike real-world datasets, also provide ready\naccess to the underlying ground truth states, which we need to evaluate\nthe performance of different techniques.\n\nAnnotating Atari RAM: ALE does not explicitly expose any ground truth state information.\nHowever, ALE does expose the RAM state (128 bytes per timestep), which is used by the game\nprogrammer to store important state information such as the location of sprites, the state of the\nclock, or the current room the agent is in. To extract these variables, we consulted commented\ndisassemblies [44] (or source code) of Atari 2600 games which were made available by Engelhardt\n[45] and Jentzsch and CPUWIZ [46]. We were able to find and verify important state variables for a\ntotal of 22 games. Once this information is acquired, combining it with the ALE interface produces\na wrapper that can automatically output a state label for every example frame generated from the\ngame. We make this available with an easy-to-use gym wrapper, which returns this information with\nno change to existing code using gym interfaces. 
Table 1 lists the 22 games along with the categories\nof variables for each game. We describe the meaning of each category in the next section.\n\nTable 1: Number of ground truth labels available in the benchmark for each game across each category.\nLocalization is shortened to Local. See section 3 for descriptions and examples for each category.\n\nGAME | AGENT LOCAL. | SMALL OBJECT LOCAL. | OTHER LOCAL. | SCORE/CLOCK/LIVES/DISPLAY | MISC | OVERALL\nASTEROIDS | 2 | 4 | 30 | 3 | 3 | 41\nBERZERK | 2 | 4 | 19 | 4 | 5 | 34\nBOWLING | 2 | 2 | 0 | 2 | 10 | 16\nBOXING | 2 | 0 | 2 | 3 | 0 | 7\nBREAKOUT | 1 | 2 | 0 | 1 | 31 | 35\nDEMONATTACK | 1 | 1 | 6 | 1 | 1 | 10\nFREEWAY | 1 | 0 | 10 | 1 | 0 | 12\nFROSTBITE | 2 | 0 | 9 | 4 | 2 | 17\nHERO | 2 | 0 | 0 | 3 | 3 | 8\nMONTEZUMAREVENGE | 2 | 0 | 4 | 4 | 5 | 15\nMSPACMAN | 2 | 0 | 10 | 2 | 3 | 17\nPITFALL | 2 | 0 | 3 | 0 | 0 | 5\nPONG | 1 | 2 | 1 | 2 | 0 | 6\nPRIVATEEYE | 2 | 0 | 2 | 4 | 2 | 10\nQBERT | 3 | 0 | 2 | 0 | 0 | 5\nRIVERRAID | 1 | 2 | 0 | 2 | 0 | 5\nSEAQUEST | 2 | 1 | 8 | 4 | 3 | 18\nSPACEINVADERS | 1 | 1 | 2 | 2 | 1 | 7\nTENNIS | 2 | 2 | 2 | 2 | 0 | 8\nVENTURE | 2 | 0 | 12 | 3 | 1 | 18\nVIDEOPINBALL | 2 | 2 | 0 | 2 | 0 | 6\nYARSREVENGE | 2 | 4 | 2 | 0 | 0 | 8\nTOTAL | 39 | 27 | 124 | 49 | 70 | 308\n\nState variable categories: We categorize the state variables of all the games among five major\ncategories: agent localization, small object localization, other localization, score/clock/lives/display,\nand miscellaneous. Agent Loc. (agent localization) refers to state variables that represent the x\nor y coordinates on the screen of any sprite controllable by actions. Small Loc. (small object\nlocalization) variables refer to the x or y screen position of small objects, like balls or missiles.\nProminent examples include the ball in Breakout and Pong, and the torpedo in Seaquest. Other\nLoc. (other localization) denotes the x or y location of any other sprites, including enemies or large\nobjects to pick up. For example, the location of ghosts in Ms. 
Pacman or the ice floes in Frostbite.\nScore/Clock/Lives/Display refers to variables that track the score of the game, the clock, the\nnumber of remaining lives the agent has, or some other display variable, like the oxygen meter in\nSeaquest. Misc. (Miscellaneous) consists of state variables that are largely specific to a game, and\ndon\u2019t fall within one of the above-mentioned categories. Examples include the existence of each\nblock or pin in Breakout and Bowling, the room number in Montezuma\u2019s Revenge, or Ms. Pacman\u2019s\nfacing direction.\n\nProbing: Evaluating representation learning methods is a challenging open problem. The notion of\ndisentanglement [47, 48] has emerged as a way to measure the usefulness of a representation [49, 50].\nIn this work, we focus only on explicitness, i.e., the degree to which underlying generative factors\ncan be recovered using a linear transformation from the learned representation. This is standard\nmethodology in the self-supervised representation learning literature [15, 16, 25, 26, 51]. Specifically,\nto evaluate a representation we train linear classifiers predicting each state variable, and we report the\nmean F1 score.\n\n4 Related Work\n\nUnsupervised representation learning via mutual information objectives: Recent work in\nunsupervised representation learning has focused on extracting latent representations by maximizing a\nlower bound on the mutual information between the representation and the input. Belghazi et al. [24]\nestimate the mutual information with neural networks using the Donsker-Varadhan representation of\nthe KL divergence [52], while Chen et al. [53] use the variational bound from Barber and Agakov\n[54] to learn discrete latent representations. Hjelm et al. [26] learn representations by maximizing\nthe Jensen-Shannon divergence between joint and product of marginals of an image and its\npatches. van den Oord et al. 
[25] maximize mutual information using a multi-sample version of\nnoise-contrastive estimation [37, 41]. See [38] for a review of different variational bounds for mutual\ninformation.\n\nState representation learning: Learning better state representations is an active area of research\nwithin robotics and reinforcement learning. Recently, Cuccu et al. [55] and Eslami et al. [4] showed\nthat visual processing and policy learning can be effectively decoupled in pixel-based environments.\nJonschkowski and Brock [56] and Jonschkowski et al. [57] propose to learn representations using\na set of handcrafted robotic priors. Several prior works use a VAE and its variations to learn a\nmapping from observations to state representations [18, 50, 58]. Single-view TCN [40] and TDC\n[59] learn state representations using self-supervised objectives that leverage temporal information in\ndemonstrations. ST-DIM can be considered an extension of TDC and TCN that also leverages the\nlocal spatial structure (see Figure 3b for an ablation of spatial losses in ST-DIM).\nA few works have focused on learning state representations that capture factors of an environment that\nare under the agent\u2019s control in order to guide exploration [60, 61] or unsupervised control [62]. EMI\n[61] harnesses mutual information between state embeddings and actions to learn representations\nthat capture just the controllable factors of the environment, like the agent\u2019s position. ST-DIM, on\nthe other hand, aims to capture every temporally evolving factor (not just the controllable ones)\nin an environment, like the position of enemies, score, balls, missiles, moving obstacles, and the\nagent position. The ST-DIM objective is also different from EMI in that it maximizes the mutual\ninformation between global and local representations in consecutive time steps, whereas EMI just\nconsiders mutual information between global representations. 
Lastly, ST-DIM uses an InfoNCE\nobjective instead of the JSD one used in EMI. Our work is also closely related to recent work in\nlearning object-oriented representations [63, 64, 65].\n\nEvaluation frameworks of representations: Evaluating representations is an open problem, and\ndoing so is usually domain-specific. In vision tasks, it is common to evaluate based on the presence\nof linearly separable label-relevant information, either in the domain the representation was learned\non [66] or in transfer learning tasks [67, 68]. In NLP, the SentEval [69] and GLUE [70] benchmarks\nprovide a more linguistically specific understanding of what the model has learned,\nand these have become standard tools in NLP research. Such et al. [71] have shown initial quantitative\nand qualitative comparisons between the performance and representations of several DRL algorithms.\nOur evaluation framework can be thought of as a GLUE-like benchmarking tool for RL, providing a\nfine-grained understanding of how well the RL agent perceives the objects in the scene. We anticipate that,\nanalogous to GLUE in NLP, our benchmarking tool will be useful in RL research for\nbetter designing components of agent learning.\n\n5 Experimental Setup\n\nWe evaluate the performance of different representation learning methods on our benchmark. Our\nexperimental pipeline consists of first training an encoder, then freezing its weights and evaluating its\nperformance on linear probing tasks. 
For each identified generative factor in each game, we construct\na linear probing task where a linear classifier on top of the frozen representation is trained to predict the ground truth value of that factor.\nNote that the gradients are not backpropagated through the encoder network, and are only used to train\nthe linear classifier on top of the representation.\n\n5.1 Data preprocessing and acquisition\n\nWe consider two different modes for collecting the data: (1) using a random agent (which steps through\nthe environment by selecting actions randomly), and (2) using a PPO [72] agent trained for 50M\ntimesteps. For both these modes, we ensure there is enough data diversity by collecting data using 8\ndifferently initialized workers. We also add stochasticity to the pretrained PPO agent by\nusing an \u03b5-greedy-like mechanism wherein at each timestep we take a random action with probability\n\u03b5 (for all our experiments, we used \u03b5 = 0.2).\n\n5.2 Methods\n\nIn our evaluations, we compare ST-DIM with the following methods:\n\n1. Randomly-initialized CNN encoder (RANDOM-CNN).\n2. Variational autoencoder (VAE) [12] on raw observations.\n3. Next-step pixel prediction model (PIXEL-PRED) inspired by the \"No-action Feedforward\"\nmodel from [22].\n4. Contrastive Predictive Coding (CPC) [25], which maximizes the mutual information between\ncurrent latents and latents at a future timestep.\n5. SUPERVISED model, which learns the encoder and the linear probe using the labels. The\ngradients are backpropagated through the encoder in this case, so this provides a best-case\nperformance bound.\n\nAll methods use the same base encoder architecture, which is the CNN from [73], but adapted for the\nfull 160x210 Atari frame size. To ensure a fair comparison, we use a representation size of 256 for\neach method. As a sanity check, we include a blind majority classifier (MAJ-CLF), which predicts\nlabel values based on the mode of the train set. 
More details are in the Appendix, section A.\n\n5.3 Probing\n\nWe train a separate 256-way (see footnote 3) linear classifier for each state variable, with the representation under consideration as input.\nWe ensure the distribution of realizations of each state variable has high entropy by pruning any\nvariable with entropy less than 0.6. We also ensure there are no duplicates between the train and\ntest set. We train each linear probe with 35,000 frames and use 5,000 and 10,000 frames for\nvalidation and test respectively. We use early stopping and a learning rate scheduler based on plateaus\nin the validation loss.\n\n6 Results\n\nTable 2: Probe F1 scores averaged across categories for each game (data collected by random agents)\n\nGAME | MAJ-CLF | RANDOM-CNN | VAE | PIXEL-PRED | CPC | ST-DIM | SUPERVISED\nASTEROIDS | 0.28 | 0.34 | 0.36 | 0.34 | 0.42 | 0.49 | 0.52\nBERZERK | 0.18 | 0.43 | 0.45 | 0.55 | 0.56 | 0.53 | 0.68\nBOWLING | 0.33 | 0.48 | 0.50 | 0.81 | 0.90 | 0.96 | 0.95\nBOXING | 0.01 | 0.19 | 0.20 | 0.44 | 0.29 | 0.58 | 0.83\nBREAKOUT | 0.17 | 0.51 | 0.57 | 0.70 | 0.74 | 0.88 | 0.94\nDEMONATTACK | 0.16 | 0.26 | 0.26 | 0.32 | 0.57 | 0.69 | 0.83\nFREEWAY | 0.01 | 0.50 | 0.01 | 0.81 | 0.47 | 0.81 | 0.98\nFROSTBITE | 0.08 | 0.57 | 0.51 | 0.72 | 0.76 | 0.75 | 0.85\nHERO | 0.22 | 0.75 | 0.69 | 0.74 | 0.90 | 0.93 | 0.98\nMONTEZUMAREVENGE | 0.08 | 0.68 | 0.38 | 0.74 | 0.75 | 0.78 | 0.87\nMSPACMAN | 0.10 | 0.49 | 0.56 | 0.74 | 0.65 | 0.72 | 0.87\nPITFALL | 0.07 | 0.34 | 0.35 | 0.44 | 0.46 | 0.60 | 0.83\nPONG | 0.10 | 0.17 | 0.09 | 0.70 | 0.71 | 0.81 | 0.87\nPRIVATEEYE | 0.23 | 0.70 | 0.71 | 0.83 | 0.81 | 0.91 | 0.97\nQBERT | 0.29 | 0.49 | 0.49 | 0.52 | 0.65 | 0.73 | 0.76\nRIVERRAID | 0.04 | 0.34 | 0.26 | 0.41 | 0.40 | 0.36 | 0.57\nSEAQUEST | 0.29 | 0.57 | 0.56 | 0.62 | 0.66 | 0.67 | 0.85\nSPACEINVADERS | 0.14 | 0.41 | 0.52 | 0.57 | 0.54 | 0.57 | 0.75\nTENNIS | 0.09 | 0.41 | 0.29 | 0.57 | 0.60 | 0.60 | 0.81\nVENTURE | 0.09 | 0.36 | 0.38 | 0.46 | 0.51 | 0.58 | 0.68\nVIDEOPINBALL | 0.09 | 0.37 | 0.45 | 0.57 | 0.58 | 0.61 | 0.82\nYARSREVENGE | 0.01 | 0.22 | 0.08 | 0.19 | 0.39 | 0.42 | 0.74\nMEAN | 0.14 | 0.44 | 0.40 | 0.58 | 0.61 | 0.68 | 0.82\n\nFootnote 3: Each RAM 
variable is a single byte and thus has 256 possible values ranging from 0 to 255.\n\nTable 3: Probe F1 scores for different methods averaged across all games for each category (data\ncollected by random agents)\n\nCATEGORY | MAJ-CLF | RANDOM-CNN | VAE | PIXEL-PRED | CPC | ST-DIM | SUPERVISED\nSMALL LOC. | 0.14 | 0.19 | 0.18 | 0.31 | 0.42 | 0.51 | 0.66\nAGENT LOC. | 0.12 | 0.31 | 0.32 | 0.48 | 0.43 | 0.58 | 0.81\nOTHER LOC. | 0.14 | 0.50 | 0.39 | 0.61 | 0.66 | 0.69 | 0.80\nSCORE/CLOCK/LIVES/DISPLAY | 0.13 | 0.58 | 0.54 | 0.76 | 0.83 | 0.87 | 0.91\nMISC. | 0.26 | 0.59 | 0.63 | 0.70 | 0.71 | 0.75 | 0.83\n\n[Figure 3 panels: (a) InfoNCE vs JSD; (b) Effect of Spatial Loss]\n\nFigure 3: Different ablations for the ST-DIM model\n\nWe report the F1 score averaged across all categories for each method and for each game in Table 2 for data\ncollected by a random agent. In addition, we provide a breakdown of probe results in each category,\nsuch as small object localization or score/lives classification, in Table 3 for the random agent. We\ninclude the corresponding tables for these results with data collected by a pretrained PPO agent in\nTables 6 and 7. The results in Table 2 show that ST-DIM largely outperforms the other unsupervised methods in terms of\nmean F1 score. In general, contrastive methods (ST-DIM and CPC) seem to perform better\nthan generative methods (VAE and PIXEL-PRED) at these probing tasks. We find that RANDOM-CNN\nis a strong prior in Atari games, as has been observed before [74], possibly due to the inductive bias\ncaptured by the CNN architecture, empirically observed in [75]. We find similar trends to hold on\nresults with data collected by a PPO agent. 
Despite contrastive methods performing well, there is still\na sizable gap between ST-DIM and the fully supervised approach, leaving room for improvement\nfrom new unsupervised representation learning techniques for the benchmark.\n\n7 Discussion\n\nAblations: We investigate two ablations of our ST-DIM model: Global-T-DIM, which only maximizes\nthe mutual information between the global representations (similar in construction to PCL [76]),\nand JSD-ST-DIM, which replaces the InfoNCE loss with the NCE loss [77], equivalent to\nmaximizing the Jensen-Shannon divergence between representations. We report results from these\nablations in Figure 3. We see from the results that 1) the InfoNCE loss performs better than the\nJSD loss and 2) contrasting spatiotemporally (and not just temporally) is important across the board\nfor capturing all categories of latent factors.\nWe found ST-DIM has two main advantages which explain its superior performance over other\nmethods and over its own ablations. It captures small objects much better than other methods, and is\nmore robust to the presence of easy-to-exploit features, which hurt other contrastive methods. Both\nthese advantages are due to ST-DIM maximizing mutual information of patch representations.\n\nCapturing small objects: As we can see in Table 3, ST-DIM performs better at capturing small\nobjects than other methods, especially generative models like VAE and pixel prediction methods.\nThis is likely because generative models try to model every pixel, so they are not penalized much if\nthey fail to model the few pixels that make up a small object. 
Similarly, ST-DIM holds this same\nadvantage over Global-T-DIM (see Table 9), which is likely due to the fact that Global-T-DIM is not\npenalized if its global representation fails to capture features from some patches of the frame.\n\nRobust to presence of easy-to-exploit features: Representation learning with mutual information\nor contrastive losses often fails to capture all salient features if a few easy-to-learn features are\nsufficient to saturate the objective. This phenomenon has been linked to the looseness of mutual\ninformation lower bounds [34, 35] and gradient starvation [78]. We see the most prominent example\nof this phenomenon in Boxing. The observations in Boxing have a clock showing the time remaining\nin the round. A representation which encodes the shown time can make near-perfect predictions on the contrastive task\nwithout learning any other salient features in the observation. Table 4 shows that CPC, Global-T-DIM,\nand ST-DIM perform well at predicting the clock variable. However, only ST-DIM does well on\nencoding the other variables such as the score and the position of the boxers.\nWe also observe that the best generative model (PIXEL-PRED) does not suffer from this problem.\nIt performs its worst on high-entropy features such as the clock and player score (where ST-DIM\nexcels), and does slightly better than ST-DIM on low-entropy features which have a large contribution\nin the pixel space such as player and enemy locations. This sheds light on the qualitative difference\nbetween contrastive and generative methods: contrastive methods prefer capturing high-entropy\nfeatures (irrespective of contribution to pixel space) while generative methods do not, and generative\nmethods prefer capturing large objects which have low entropy. 
This complementary nature suggests\nhybrid models as an exciting direction of future work.\n\nTable 4: Breakdown of F1 scores for every state variable in Boxing for VAE, PIXEL-PRED, CPC, Global-T-DIM\n(an ablation of ST-DIM that removes the spatial contrastive constraint), and ST-DIM\n\nSTATE VARIABLE | VAE | PIXEL-PRED | CPC | GLOBAL-T-DIM | ST-DIM\nCLOCK | 0.03 | 0.27 | 0.79 | 0.81 | 0.92\nENEMY_SCORE | 0.19 | 0.58 | 0.59 | 0.74 | 0.70\nENEMY_X | 0.32 | 0.49 | 0.15 | 0.17 | 0.51\nENEMY_Y | 0.22 | 0.42 | 0.04 | 0.16 | 0.38\nPLAYER_SCORE | 0.08 | 0.32 | 0.56 | 0.45 | 0.88\nPLAYER_X | 0.33 | 0.54 | 0.19 | 0.13 | 0.56\nPLAYER_Y | 0.16 | 0.43 | 0.04 | 0.14 | 0.37\n\n8 Conclusion\n\nWe present a new representation learning technique which maximizes the mutual information of\nrepresentations across spatial and temporal axes. We also propose a new benchmark for state\nrepresentation learning based on the Atari 2600 suite of games to emphasize learning multiple\ngenerative factors. We demonstrate that the proposed method excels at capturing the underlying\nlatent factors of a state even for small objects or when a large number of objects are present, cases which\nprove difficult for generative and other contrastive techniques, respectively. We have shown that\nour proposed benchmark can be used to study qualitative and quantitative differences between\nrepresentation learning techniques, and hope that it will encourage more research in the problem of\nstate representation learning.\n\nAcknowledgements\n\nWe are grateful for the collaborative research environment provided by Mila and Microsoft Research.\nWe thank Aaron Courville, Chris Pal, Remi Tachet, Eric Yuan, Chin-Wei Huang, Khimya Khetrapal,\nTristan Deleu, and Aravind Srinivas for helpful discussions and feedback during the course of this\nwork. We would also like to thank the developers of PyTorch [79] and Weights&Biases.\n\nReferences\n\n[1] David Marr. 
Vision: A Computational Investigation into the Human Representation and\nProcessing of Visual Information. Henry Holt and Co., Inc., 1982. ISBN 0716715678.\n\n[2] Robert D Gordon and David E Irwin. What\u2019s in an object file? evidence from priming studies.\nPerception & Psychophysics, 58(8):1260\u20131277, 1996.\n\n[3] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building\nmachines that learn and think like people. Behavioral and brain sciences, 40, 2017.\n\n[4] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta\nGarnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene\nrepresentation and rendering. Science, 360(6394):1204\u20131210, 2018.\n\n[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep\nconvolutional neural networks. In Advances in neural information processing systems, pages\n1097\u20131105, 2012.\n\n[6] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg,\nCarl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech\n2: End-to-end speech recognition in english and mandarin. In International conference on\nmachine learning, pages 173\u2013182, 2016.\n\n[7] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang\nMacherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google\u2019s neural machine\ntranslation system: Bridging the gap between human and machine translation. arXiv preprint\narXiv:1609.08144, 2016.\n\n[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G\nBellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.\nHuman-level control through deep reinforcement learning. 
Nature, 518(7540):529, 2015.\n\n[9] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,\nJulian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.\nMastering the game of go with deep neural networks and tree search. Nature, 529(7587):484,\n2016.\n\n[10] Jerzy Konorski. Integrative activity of the brain; an interdisciplinary approach. 1967.\n\n[11] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin\nArjovsky, and Aaron Courville. Adversarially learned inference. International Conference on\nLearning Representations (ICLR), 2017.\n\n[12] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. International Conference\non Learning Representations (ICLR), 2014.\n\n[13] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp.\nInternational Conference on Learning Representations (ICLR), 2017.\n\n[14] Deepak Pathak, Philipp Kr\u00e4henb\u00fchl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context\nencoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, 2016.\n\n[15] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings\nof the IEEE International Conference on Computer Vision, pages 2051\u20132060, 2017.\n\n[16] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual\nrepresentation learning. arXiv preprint arXiv:1901.09005, 2019.\n\n[17] Timoth\u00e9e Lesort, Natalia D\u00edaz-Rodr\u00edguez, Jean-Fran\u00e7ois Goudou, and David Filliat. State\nrepresentation learning for control: An overview. Neural Networks, 2018.\n\n[18] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to\ncontrol: A locally linear latent dynamics model for control from raw images. 
In Advances in\nneural information processing systems, pages 2746\u20132754, 2015.\n\n[19] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel,\nMatthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot\ntransfer in reinforcement learning. In Proceedings of the 34th International Conference on\nMachine Learning-Volume 70, pages 1480\u20131490. JMLR. org, 2017.\n\n[20] David Ha and J\u00fcrgen Schmidhuber. Recurrent world models facilitate policy evolution. In\nAdvances in Neural Information Processing Systems, pages 2450\u20132462, 2018.\n\n[21] Wuyang Duan. Learning state representations for robotic control: Information disentangling\nand multi-modal learning. Master\u2019s thesis, Delft University of Technology, 2017.\n\n[22] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional\nvideo prediction using deep networks in atari games. In Advances in neural\ninformation processing systems, pages 2863\u20132871, 2015.\n\n[23] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction\nthrough video prediction. In Advances in neural information processing systems, pages\n64\u201372, 2016.\n\n[24] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio,\nAaron Courville, and Devon Hjelm. Mutual information neural estimation. In Proceedings\nof the 35th International Conference on Machine Learning, pages 531\u2013540, 2018. URL\nhttp://proceedings.mlr.press/v80/belghazi18a.html.\n\n[25] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive\npredictive coding. arXiv preprint arXiv:1807.03748, 2018.\n\n[26] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman,\nAdam Trischler, and Yoshua Bengio. Learning deep representations by mutual information\nestimation and maximization. 
International Conference on Learning Representations (ICLR),\n2019.\n\n[27] Petar Veli\u010dkovi\u0107, William Fedus, William L Hamilton, Pietro Li\u00f2, Yoshua Bengio, and R Devon\nHjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.\n\n[28] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning\nenvironment: An evaluation platform for general agents. Journal of Artificial Intelligence\nResearch, 47:253\u2013279, 2013.\n\n[29] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier\nprobes. International Conference on Learning Representations (Workshop Track), 2017.\n\n[30] Karl Friston. A theory of cortical responses. Philosophical transactions of the Royal Society B:\nBiological sciences, 360(1456):815\u2013836, 2005.\n\n[31] Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional\ninterpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79,\n1999.\n\n[32] William Bialek and Naftali Tishby. Predictive information. arXiv preprint cond-mat/9902341,\n1999.\n\n[33] Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information\nin a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908\u20136913,\n2015.\n\n[34] David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information.\narXiv preprint arXiv:1811.04251, 2018.\n\n[35] Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron Van den Oord, Sergey Levine, and Pierre\nSermanet. Wasserstein dependency measure for representation learning. arXiv preprint\narXiv:1903.11780, 2019.\n\n[36] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by\nmaximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.\n\n[37] Michael Gutmann and Aapo Hyv\u00e4rinen. 
Noise-contrastive estimation: A new estimation\nprinciple for unnormalized statistical models. In Proceedings of the Thirteenth International\nConference on Artificial Intelligence and Statistics, pages 297\u2013304, 2010.\n\n[38] Ben Poole, Sherjil Ozair, A\u00e4ron Van den Oord, Alexander A Alemi, and George Tucker. On\nvariational bounds of mutual information. In International Conference on Machine Learning,\n2019.\n\n[39] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances\nin Neural Information Processing Systems, pages 1857\u20131865, 2016.\n\n[40] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey\nLevine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In\n2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134\u20131141.\nIEEE, 2018.\n\n[41] Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for\nconditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812,\n2018.\n\n[42] Aapo Hyv\u00e4rinen, Juha Karhunen, and Erkki Oja. Independent component analysis, volume 46.\nJohn Wiley & Sons, 2004.\n\n[43] Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Sch\u00f6lkopf, and\nOlivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled\nrepresentations. International Conference on Machine Learning, 2019.\n\n[44] Zach Whalen and Laurie N Taylor. Playing the past. History and Nostalgia in Video Games.\nNashville, TN: Vanderbilt University Press, 2008.\n\n[45] Steve Engelhardt. BJARS.com Atari Archives. http://bjars.com, 2019. [Online; accessed\n1-March-2019].\n\n[46] Thomas Jentzsch and CPUWIZ. Atariage atari 2600 forums, 2019. URL\nhttp://atariage.com/forums/forum/16-atari-2600/.\n\n[47] Yoshua Bengio. Learning deep architectures for AI. 
Foundations and Trends in Machine\nLearning, 2(1):1\u2013127, 2009.\n\n[48] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review\nand new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8):\n1798\u20131828, 2013.\n\n[49] Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of\ndisentangled representations. International Conference on Learning Representations (ICLR),\n2018.\n\n[50] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende,\nand Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint\narXiv:1812.02230, 2018.\n\n[51] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering\nfor unsupervised learning of visual features. In Proceedings of the European Conference on\nComputer Vision (ECCV), pages 132\u2013149, 2018.\n\n[52] Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov\nprocess expectations for large time. iv. Communications on Pure and Applied Mathematics, 36\n(2):183\u2013212, 1983.\n\n[53] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In\nAdvances in neural information processing systems, pages 2172\u20132180, 2016.\n\n[54] David Barber and Felix Agakov. The im algorithm: A variational approach to information\nmaximization. In Proceedings of the 16th International Conference on Neural Information\nProcessing Systems, NIPS\u201903, pages 201\u2013208, Cambridge, MA, USA, 2003. MIT Press. URL\nhttp://dl.acm.org/citation.cfm?id=2981345.2981371.\n\n[55] Giuseppe Cuccu, Julian Togelius, and Philippe Cudr\u00e9-Mauroux. Playing atari with six neurons.\nInternational Conference on Autonomous Agents and Multiagent Systems, 2019.\n\n[56] Rico Jonschkowski and Oliver Brock. 
Learning state representations with robotic priors.\nAutonomous Robots, 39(3):407\u2013428, 2015.\n\n[57] Rico Jonschkowski, Roland Hafner, Jonathan Scholz, and Martin Riedmiller. Pves: Position-velocity\nencoders for unsupervised learning of structured state representations. arXiv preprint\narXiv:1705.09805, 2017.\n\n[58] Herke van Hoof, Nutan Chen, Maximilian Karl, Patrick van der Smagt, and Jan Peters. Stable\nreinforcement learning with autoencoders for tactile and visual data. In 2016 IEEE/RSJ\nInternational Conference on Intelligent Robots and Systems (IROS), pages 3928\u20133934. IEEE,\n2016.\n\n[59] Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas.\nPlaying hard exploration games by watching youtube. In Advances in Neural Information\nProcessing Systems, pages 2930\u20132941, 2018.\n\n[60] Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi,\nand Honglak Lee. Contingency-aware exploration in reinforcement learning. arXiv preprint\narXiv:1811.01483, 2018.\n\n[61] Hyoungseok Kim, Jaekyeom Kim, Yeonwoo Jeong, Sergey Levine, and Hyun Oh Song. Emi:\nExploration with mutual information. In International Conference on Machine Learning, pages\n3360\u20133369, 2019.\n\n[62] David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and\nVolodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. arXiv\npreprint arXiv:1811.11359, 2018.\n\n[63] Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt\nBotvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation.\narXiv preprint arXiv:1901.11390, 2019.\n\n[64] Guangxiang Zhu, Zhiao Huang, and Chongjie Zhang. Object-oriented dynamics predictor. 
In\nAdvances in Neural Information Processing Systems, pages 9804\u20139815, 2018.\n\n[65] Klaus Greff, Rapha\u00ebl Lopez Kaufmann, Rishab Kabra, Nick Watters, Chris Burgess, Daniel\nZoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation\nlearning with iterative variational inference. arXiv preprint arXiv:1903.00450, 2019.\n\n[66] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised\nfeature learning. In Proceedings of the fourteenth international conference on artificial\nintelligence and statistics, pages 215\u2013223, 2011.\n\n[67] Yongqin Xian, Christoph H. Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning -\na comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern\nAnalysis and Machine Intelligence, page 1\u20131, 2018. ISSN 1939-3539. doi: 10.1109/tpami.2018.2857768.\nURL http://dx.doi.org/10.1109/TPAMI.2018.2857768.\n\n[68] Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. Few-shot learning through an information\nretrieval lens, 2017.\n\n[69] Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations.\nIn Proceedings of the Eleventh International Conference on Language Resources\nand Evaluation (LREC-2018), 2018.\n\n[70] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.\nGLUE: A multi-task benchmark and analysis platform for natural language understanding. In\nInternational Conference on Learning Representations, 2019. URL\nhttps://openreview.net/forum?id=rJ4km2R5t7.\n\n[71] Felipe Petroski Such, Vashisht Madhavan, Rosanne Liu, Rui Wang, Pablo Samuel Castro, Yulun\nLi, Ludwig Schubert, Marc Bellemare, Jeff Clune, and Joel Lehman. An atari model zoo for\nanalyzing, visualizing, and comparing deep reinforcement learning agents. 
arXiv preprint\narXiv:1812.07069, 2018.\n\n[72] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal\npolicy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[73] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan\nWierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep\nLearning Workshop. MIT Press, 2013.\n\n[74] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A\nEfros. Large-scale study of curiosity-driven learning. International Conference on Learning\nRepresentations (ICLR), 2019.\n\n[75] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of\nthe IEEE Conference on Computer Vision and Pattern Recognition, pages 9446\u20139454, 2018.\n\n[76] AJ Hyvarinen and Hiroshi Morioka. Nonlinear ica of temporally dependent stationary sources.\nProceedings of Machine Learning Research, 2017.\n\n[77] Aapo Hyv\u00e4rinen and Petteri Pajunen. Nonlinear independent component analysis: Existence\nand uniqueness results. Neural Networks, 12(3):429\u2013439, 1999.\n\n[78] Remi Tachet des Combes, Mohammad Pezeshki, Samira Shabanian, Aaron Courville, and\nYoshua Bengio. On the learning dynamics of deep neural networks. arXiv preprint\narXiv:1809.06848, 2018.\n\n[79] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,\nZeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in\npytorch. 2017.\n\n[80] Ken Kansky, Tom Silver, David A M\u00e9ly, Mohamed Eldawy, Miguel L\u00e1zaro-Gredilla, Xinghua\nLou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks:\nZero-shot transfer with a generative causal model of intuitive physics. In Proceedings of the\n34th International Conference on Machine Learning-Volume 70, pages 1809\u20131818. JMLR. 
org,\n2017.\n\n[81] Amy Zhang, Yuxin Wu, and Joelle Pineau. Natural environment benchmarks for reinforcement\nlearning. arXiv preprint arXiv:1811.06032, 2018.\n", "award": [], "sourceid": 4727, "authors": [{"given_name": "Ankesh", "family_name": "Anand", "institution": "Mila, University of Montreal"}, {"given_name": "Evan", "family_name": "Racah", "institution": "Mila, Universit\u00e9 de Montr\u00e9al"}, {"given_name": "Sherjil", "family_name": "Ozair", "institution": "Mila, Universit\u00e9 de Montr\u00e9al"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "Mila"}, {"given_name": "Marc-Alexandre", "family_name": "C\u00f4t\u00e9", "institution": "Microsoft Research"}, {"given_name": "R Devon", "family_name": "Hjelm", "institution": "Microsoft Research"}]}