{"title": "Playing hard exploration games by watching YouTube", "book": "Advances in Neural Information Processing Systems", "page_first": 2930, "page_last": 2941, "abstract": "Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent\u2019s exact environment setup and the demonstrator\u2019s action and reward trajectories. Here we propose a method that overcomes these limitations in two stages. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e. vision and sound). Second, we embed a single YouTube video in this representation to learn a reward function that encourages an agent to imitate human gameplay. This method of one-shot imitation allows our agent to convincingly exceed human-level performance on the infamously hard exploration games Montezuma\u2019s Revenge, Pitfall! and Private Eye for the first time, even if the agent is not presented with any environment rewards.", "full_text": "Playing hard exploration games by watching YouTube\n\nYusuf Aytar\u2217, Tobias Pfaff\u2217, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas\n\n{yusufaytar,tpfaff,budden,tpaine,ziyu,nandodefreitas}@google.com\n\nDeepMind, London, UK\n\nAbstract\n\nDeep reinforcement learning methods traditionally struggle with tasks where en-\nvironment rewards are particularly sparse. One successful method of guiding\nexploration in these domains is to imitate trajectories provided by a human demon-\nstrator. However, these demonstrations are typically collected under arti\ufb01cial\nconditions, i.e. 
with access to the agent\u2019s exact environment setup and the demon-\nstrator\u2019s action and reward trajectories. Here we propose a two-stage method that\novercomes these limitations by relying on noisy, unaligned footage without access\nto such data. First, we learn to map unaligned videos from multiple sources to a\ncommon representation using self-supervised objectives constructed over both time\nand modality (i.e. vision and sound). Second, we embed a single YouTube video\nin this representation to construct a reward function that encourages an agent to\nimitate human gameplay. This method of one-shot imitation allows our agent to\nconvincingly exceed human-level performance on the infamously hard exploration\ngames MONTEZUMA\u2019S REVENGE, PITFALL! and PRIVATE EYE for the \ufb01rst time,\neven if the agent is not presented with any environment rewards.\n\n1\n\nIntroduction\n\nPeople learn many tasks, from knitting to dancing to playing games, by watching videos online. They\ndemonstrate a remarkable ability to transfer knowledge from the online demonstrations to the task\nat hand, despite huge gaps in timing, visual appearance, sensing modalities, and body differences.\nThis rich setup with abundant unlabeled data motivates a research agenda in AI, which could result in\nsigni\ufb01cant progress in third-person imitation, self-supervised learning, reinforcement learning (RL)\nand related areas. 
In this paper, we show how this proposed research agenda enables us to make some initial progress in self-supervised alignment of noisy demonstration sequences for RL agents, enabling human-level performance on the most complex and previously unsolved Atari 2600 games.

Despite the recent advancements in deep reinforcement learning algorithms [7, 9, 17, 19] and architectures [22, 35], there are many \u201chard exploration\u201d challenges, characterized by particularly sparse environment rewards, that continue to pose difficulties for existing RL agents. One epitomizing example is Atari\u2019s MONTEZUMA\u2019S REVENGE [10], which requires a human-like avatar to navigate a series of platforms and obstacles (the nature of which changes substantially room-to-room) to collect point-scoring items. Such tasks are practically impossible using naive \u03b5-greedy exploration methods, as the number of possible action trajectories grows exponentially in the number of frames separating rewards. For example, reaching the first environment reward in MONTEZUMA\u2019S REVENGE takes approximately 100 environment steps, equivalent to 100^18 possible action sequences. Even if a reward is randomly encountered, \u03b3-discounted RL struggles to learn stably if this signal is backed up across particularly long time horizons.

\u2217denotes equal contribution

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.

(a) ALE frame

(b) Frames from different YouTube videos

Figure 1: Illustration of the domain gap that exists between the Arcade Learning Environment and YouTube videos from which our agent is able to learn to play MONTEZUMA\u2019S REVENGE. Note the different size, resolution, aspect ratio and color, and the addition of visual artifacts such as text and avatars.

Successful attempts at overcoming the issue of sparse rewards have fallen broadly into two categories of guided exploration. 
First, intrinsic motivation methods provide an auxiliary reward that encourages\nthe agent to explore states or action trajectories that are \u201cnovel\u201d or \u201cinformative\u201d with respect to some\nmeasure [8, 30, 27]. These methods tend to help agents to re-explore discovered parts of state space\nthat appear novel or uncertain (known-unknowns), but often fail to provide guidance about where\nin the environment such states are to be found in the \ufb01rst place (unknown-unknowns). Accordingly,\nthese methods typically rely on an additional random component to drive the initial exploration\nprocess. The other category is imitation learning [20, 41], whereby a human demonstrator generates\nstate-action trajectories that are used to guide exploration toward areas considered salient with respect\nto their inductive biases. These biases prove to be a very useful constraint in the context of Atari, as\nhumans can immediately identify e.g. that a skull represents danger, or that a key unlocks a door.\nAmong existing imitation learning methods, DQfD by Hester et al. [20] has shown the best per-\nformance on Atari\u2019s hardest exploration games. Despite these impressive results, there are two\nlimitations of DQfD [20] and related methods. First, they assume that there is no \u201cdomain gap\u201d\nbetween the agent\u2019s and demonstrator\u2019s observation space, e.g. variations in color or resolution, or\nthe introduction of other visual artifacts. An example of domain gap in MONTEZUMA\u2019S REVENGE\nis shown in Figure 1, considering the \ufb01rst frame of (a) our environment compared to (b) YouTube\ngameplay footage. Second, they assume that the agent has access to the exact action and reward\nsequences that led to the demonstrator\u2019s observation trajectory. 
In both cases, these assumptions\nconstrain the set of useful demonstrations to those collected under arti\ufb01cial conditions, typically\nrequiring a specialized software stack for the sole purpose of RL agent training.\nTo address these limitations, this paper proposes a method for overcoming domain gaps between the\nobservation sequences of multiple demonstrations, by using self-supervised classi\ufb01cation tasks that\nare constructed over both time (temporal distance classi\ufb01cation) and modality (cross-modal temporal\ndistance classi\ufb01cation) to learn a common representation (see Figure 2). Unlike previous approaches,\nour method requires neither (a) frame-by-frame alignment between demonstrations, or (b) class labels\nor other annotations from which an alignment might be indirectly inferred. We additionally propose a\nnew unsupervised measure (cycle-consistency) for evaluating the quality of such a learnt embedding.\nUsing our embedding, we propose an auxiliary imitation loss that allows an agent to successfully\nplay hard exploration games without requiring the knowledge of the demonstrator\u2019s action trajectory.\nSpeci\ufb01cally, providing a standard RL agent with an imitation reward learnt from a single YouTube\nvideo, we are the \ufb01rst to convincingly exceed human-level performance on three of Atari\u2019s hardest\nexploration games: MONTEZUMA\u2019S REVENGE, PITFALL! and PRIVATE EYE. Despite the challenges\nof designing reward functions [18, 36] or learning them using inverse reinforcement learning [1, 49],\nwe also achieve human-level performance even in the absence of an environment reward signal.\n\n2 Related Work\n\nImitation learning methods such as DQfD have yielded promising results for guiding agent exploration\nin sparse-reward tasks, both in game-playing [20, 32] and robotics domains [41]. However, these\nmethods have traditionally leveraged observations collected in arti\ufb01cial conditions, i.e. 
in the absence\nof a domain gap (see Figure 1) and with full visibility over the demonstrator\u2019s action and reward\ntrajectories. Other approaches include interacting with the environment before introducing the expert\n\n2\n\n\f(a) An example path\n\n(b) Aligned frames\n\n(c) Our embedding\n\n(d) Pixel embedding\n\nFigure 2: For the path shown in (a), t-SNE projections [25] of observation sequences using (c)\nour embedding, versus (d) raw pixels. Four different domains are compared side-by-side in (b)\nfor an example frame of MONTEZUMA\u2019S REVENGE: (purple) the Arcade Learning Environment,\n(cyan/yellow) two YouTube training videos, and (red) an unobserved YouTube video. It is evident\nthat all four trajectories are well-aligned in our embedding space, despite (purple) and (red) being\nheld-aside during training. Using raw pixel values fails to achieve any meaningful alignment.\n\ndemonstrations [38, 31] and goal conditioned policies for high \ufb01delity imitation [29], although these\npapers typically do not assume domain gap or operate in sparse reward settings.\nThere are several methods of overcoming the domain gap in the previous literature. In the simple\nscenario of demonstrations that are aligned frame-by-frame [24, 34], methods such as CCA [2],\nDCTW [39] or time-contrastive networks (TCN) [34] can be used to learn a common representation\nspace. However, YouTube videos of Atari gameplay are more complex, as the actions taken by\ndifferent demonstrators can lead to very different observation sequences lacking such an alignment.\nIn this scenario, another common approach for domain alignment involves solving a shared auxiliary\nobjective across the domains [5, 6]. For example, Aytar et al. [5] demonstrated that by solving the\nsame scene classi\ufb01cation task bottlenecked by a common decision mechanism (i.e. using the same\nnetwork), several very different domains (i.e. 
natural images, line drawings and text descriptions)\ncould be successfully aligned. Similarly, domain adaptive meta-learning [44] uses a shared policy\nnetwork for addressing the domain gap for robotic tasks, though they require both \ufb01rst person robot\ndemonstrations and third person human demonstrations. Our work differs from the above approaches\nin that we do not make use of any category-guided supervision or \ufb01rst person demonstrations. Instead,\nwe de\ufb01ne our shared tasks using self-supervision over unlabeled data. This idea is motivated by\nseveral recent works in the self-supervised feature learning literature [3, 11, 12, 28, 42, 45, 26].\nOther related approaches include single-view TCN [34], which is another self-supervised task that\ndoes not require paired training data. We differ from this work by using temporal classi\ufb01cation instead\nof triplet-based ranking, which removes the need to empirically tune sensitive hyper parameters (local\nneighborhood size, ranking margin, etc). Another approach [33] performs temporal classi\ufb01cation but\nlimits its categories to frames close or far away in time. With respect to our use of cross-modal data,\nanother similar existing method in the feature learning literature is L3-net [3]. This approach learns\nto align vision and sound modalities, whereas we learn to align multiple audio-visual sequences (i.e.\ndemonstrations) using multi-modal alignment as a self-supervised objective. We adapt both TCN\nand L3-net for domain alignment and provide an evaluation compared to our proposed method in\nSection 6. 
We also experimented with third-person imitation methods [37] that combine the ideas of\ngenerative adversarial imitation learning (GAIL) [21] and adversarial domain confusion [16, 40], but\nwere unable to make progress using the very long YouTube demonstration trajectories.\nConsidering the imitation component of our work, one perspective is that we are learning a reward\nfunction that explains the demonstrator\u2019s behavior, which is closely related to inverse reinforcement\nlearning [1, 49]. There have also been many previous studies that consider supervised [4] and\nfew-shot methods [13, 15] for imitation learning. However, in both cases, our setting is more complex\ndue to the presence of domain gap and absence of demonstrator action and reward sequences.\n\n3\n\n\f(a) Temporal and cross-modal pair selection\n\n(b) Embedding networks\n\n(c) Classi\ufb01cation networks\n\nFigure 3: Illustration of the network architectures and interactions involved in our combined\nTDC+CMC self-supervised loss calculation. The \ufb01nal layer FC2 of \u03c6 is later used to embed the\ndemonstration trajectory to imitate. Although the Arcade Learning Environment does not expose\nan audio signal to our agent at training time, the audio signal present in YouTube footage made a\nsubstantial contribution to the learnt visual embedding function \u03c6.\n\n3 Closing the domain gap\n\nLearning from YouTube videos is made dif\ufb01cult by both the lack of frame-by-frame alignment, and the\npresence of domain-speci\ufb01c variations in color, resolution, screen alignment and other visual artifacts.\nWe propose that by learning a common representation across multiple demonstrations, our method\nwill generalize to agent observations without ever being explicitly exposed to the Atari environment.\nIn the absence of pre-aligned data, we adopt self-supervision in order to learn this embedding. 
The\nrationale of self-supervision is to propose an auxiliary task that we learn to solve simultaneously\nacross all domains, thus encouraging the network to learn a common representation. This is motivated\nby the work of Aytar et al. [5], but differs in that we do not have access to class labels to establish a\nsupervised objective. Instead, we propose two novel self-supervised objectives: temporal distance\nclassi\ufb01cation (TDC), described in Section 3.1 and cross-modal temporal distance classi\ufb01cation\n(CMC), described in Section 3.2. We also propose cycle-consistency in Section 3.3 as a quantitative\nmeasure for evaluating the one-to-one alignment capacity of an embedding.\n\n3.1 Temporal distance classi\ufb01cation (TDC)\n\nWe \ufb01rst consider the unsupervised task of predicting the temporal distance \u2206t between two frames\nof a single video sequence. This task requires an understanding of how visual features move\nand transform over time, thus encouraging an embedding that learns meaningful abstractions of\nenvironment dynamics conditioned on agent interactions.\nWe cast this problem as a classi\ufb01cation task, with K categories corresponding to temporal distance\nintervals, dk \u2208 {[0], [1], [2], [3 \u2212 4], [5 \u2212 20], [21 \u2212 200]}. Given two frames from the same video,\nv, w \u2208 I, we learn to predict the interval dk s.t. \u2206t \u2208 dk. Speci\ufb01cally, we implement two functions:\nan embedding function \u03c6 : I \u2192 RN , and a classi\ufb01er \u03c4tdc : RN \u00d7 RN \u2192 RK, both implemented as\nneural networks (see Section 5 for implementation details). 
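As a concrete illustration of this K = 6 interval scheme, the sketch below maps a temporal distance \u2206t to its class label and samples labelled frame pairs in the manner later described in Section 5 (a minimal sketch; the function names are our own, and the real pipeline operates on video frames rather than integer indices):

```python
import numpy as np

# Temporal-distance intervals d_k from Section 3.1: a frame pair whose
# gap falls inside interval k is assigned class label k.
INTERVALS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 20), (21, 200)]

def distance_class(delta_t):
    """Map a temporal distance (in frames) to its TDC class index."""
    for k, (lo, hi) in enumerate(INTERVALS):
        if lo <= delta_t <= hi:
            return k
    raise ValueError("delta_t outside the modelled range")

def sample_pair(video, rng):
    """Sample (v, w, label): first an interval d_k, then a distance
    delta_t within it, then a start frame."""
    k = int(rng.integers(len(INTERVALS)))
    lo, hi = INTERVALS[k]
    delta_t = int(rng.integers(lo, hi + 1))
    i = int(rng.integers(0, len(video) - delta_t))
    return video[i], video[i + delta_t], k
```

Here `video` may be any indexable sequence of frames; casting the problem as classification over these bins, rather than regressing \u2206t directly, lets the same cross-entropy machinery serve the cross-modal task of Section 3.2 as well.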
We can then train \u03c4tdc(\u03c6(v), \u03c6(w)) to predict the distribution over class labels, dk, using the following cross-entropy classification loss:

Ltdc(v, w, y) = \u2212\u2211_{j=1}^{K} yj log(\u02c6yj),   with \u02c6y = \u03c4tdc(\u03c6(v), \u03c6(w)),    (1)

where y and \u02c6y are the true and predicted label distributions respectively.

3.2 Cross-modal temporal distance classification (CMC)

In addition to visual observations, our YouTube videos contain audio tracks that can be used to define an additional self-supervised task. As the audio of Atari games tends to correspond with salient events such as jumping, obtaining items or collecting points, a network that learns to correlate audio and visual observations should learn an abstraction that emphasizes important game events.

(a) Cycle-consistency visualization

(b) One-shot imitation

Figure 4: (a) Visualization of two embedding spaces with low and high cycle-consistency. Note that the selected point in sequence V (left) fails and (right) succeeds at cycling back to the original point. (b) Demonstration of one-shot imitation through RL visualized in the embedding space.

We define the cross-modal classification task of predicting the temporal distance between a given video frame, v \u2208 I, and audio snippet, a \u2208 A. To achieve this, we introduce an additional embedding function, \u03c8 : A \u2192 RN, which maps from a frequency-decomposed audio snippet to an N-dimensional embedding vector, \u03c8(a). The associated classification loss, Lcmc(v, a, y), is equivalent to Equation (1) using the classification function \u02c6y = \u03c4cmc(\u03c6(v), \u03c8(a)). Note that by limiting our categories to the two intervals d0 = [0] and d1 = [l, . . . , \u221e], with l being the local positive neighborhood, this method reduces to the L3-Net of Arandjelovic et al. [3]. 
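Both Ltdc and Lcmc instantiate the same cross-entropy of Equation (1); for a single pair it reduces to softmax cross-entropy over the K = 6 intervals. A minimal numpy sketch, where the `logits` argument stands in for the output of \u03c4tdc or \u03c4cmc (names chosen for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pair_loss(logits, label, num_classes=6):
    """Cross-entropy of Equation (1) for one frame (or frame/audio) pair.

    `label` is the index k of the true temporal-distance interval d_k,
    so the true distribution y is one-hot at k.
    """
    y = np.zeros(num_classes)
    y[label] = 1.0
    y_hat = softmax(logits)                      # predicted distribution
    return float(-np.sum(y * np.log(y_hat + 1e-12)))
```

With uniform logits the loss is log 6 \u2248 1.79, and it falls toward zero as the classifier concentrates probability mass on the correct interval.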
In our following\nexperiments, we obtain a \ufb01nal embedding by a \u03bb-weighted combination of both cross-modal and\ntemporal distance classi\ufb01cation losses, i.e. minimizing L = Ltdc + \u03bbLcmc.\n\n3.3 Model selection through cycle-consistency\n\nA challenge of evaluating and meta-optimizing the models presented in Section 3 is de\ufb01ning a measure\nof the quality of an embedding \u03c6. Motivated by the success of cyclic relations in CycleGAN [48]\nand for matching visual features across images [47], we propose cycle-consistency for this purpose.\nAssume that we have two length-N sequences, V = {v1, v2, ...vn} and W = {w1, w2, . . . , wn}.\nWe also de\ufb01ne the distance, d\u03c6, as the Euclidean distance in the associated embedding space,\nd\u03c6(vi, wj) = ||\u03c6(vi)\u2212 \u03c6(wj)||2. To evaluate cycle-consistency, we \ufb01rst select vi \u2208 V and determine\nits nearest neighbor, wj = argminw\u2208W d\u03c6(vi, w). We then repeat the process to \ufb01nd the nearest\nneighbor of wj, i.e. vk = argminv\u2208V d\u03c6(v, wj). We say that vi is cycle-consistent if and only if\n|i \u2212 k| \u2264 1, and further de\ufb01ne the one-to-one alignment capacity, P\u03c6, of the embedding space \u03c6 as\nthe percentage of v \u2208 V that are cycle-consistent. Figure 4(a) illustrates cycle-consistency in two\nexample embedding spaces. The same process can be extended to evaluate the 3-cycle-consistency\n\u03c6, by requiring that vi remains cycle consistent along both paths V \u2192 W \u2192 U \u2192 V and\nof \u03c6, P 3\nV \u2192 U \u2192 W \u2192 V , where U is a third sequence.\n\n4 One-shot imitation from YouTube footage\n\nIn Section 3, we learned to extract features from unlabeled and unaligned gameplay footage, and\nintroduced a measure to evaluate the quality of the learnt embedding. In this section, we describe\nhow these features can be exploited to learn to play games with very sparse rewards, such as the\ninfamously dif\ufb01cult PITFALL! 
and MONTEZUMA\u2019S REVENGE. Specifically, we demonstrate how a sequence of checkpoints placed along the embedding of a single YouTube video can be presented as a reward signal to a standard reinforcement learning agent (IMPALA for our experiments [14]), allowing successful one-shot imitation even in the complete absence of the environment rewards.

Taking a single YouTube gameplay video, we simply generate a sequence of \u201ccheckpoints\u201d every N = 16 frames along the embedded trajectory. We then define the following reward:

rimitation = 0.5 if \u00af\u03c6(vagent) \u00b7 \u00af\u03c6(vcheckpoint) > \u03b1, and 0.0 otherwise,    (2)

where \u00af\u03c6(v) are the zero-centered and l2-normalized embeddings of the agent and checkpoint observations. We also require that checkpoints be visited in soft-order, i.e. if the last collected checkpoint is at v(n), then vcheckpoint \u2208 {v(n+1), . . . , v(n+1+\u2206t)}. We set \u2206t = 1 and \u03b1 = 0.5 for our experiments (except when considering pixel-only embeddings, where \u03b1 = 0.92 provided the best performance).

Embedding Method        P\u03c6    P^3_\u03c6
l2 pixel distance       30.5    8.4
single-view TCN [34]    32.2   15.9
TDC (ours)              42.0   23.0
L3-Net [3]              27.3   10.9
CMC (ours)              41.7   23.6
combined (TDC+CMC)      44.2   27.5

Figure 5: Cycle-consistency evaluation considering different embedding spaces. We compare naive l2 pixel loss to temporal methods (TDC and single-view TCN) and cross-modal methods (CMC and L3-Net). Combining TDC and CMC yields the best performance for both 2- and 3-cycle-consistency, particularly at deeper levels of abstraction (e.g. no performance loss using FC1 or FC2).

5 Implementation Details

The visual embedding function, \u03c6, is composed of three spatial, padded, 3x3 convolutional layers with (32, 64, 64) channels and 2x2 max-pooling, followed by three residual-connected blocks with 64 channels and no down-sampling. 
Each layer is ReLU-activated and batch-normalized, and the\noutput fed into a 2-layer 1024-wide MLP. The network input is a 128x128x3x4 tensor constructed\nby random spatial cropping of a stack of four consecutive 140x140 RGB images, sampled from our\ndataset. The \ufb01nal embedding vector is l2-normalized.\nThe audio embedding function, \u03c8, is as per \u03c6 except that it has four, width-8, 1D convolutional layers\nwith (32, 64, 128, 256) channels and 2x max-pooling, and a single width-1024 linear layer. The input\nis a width-137 (6ms) sample of 256 frequency channels, calculated using STFT. ReLU-activation and\nbatch-normalization are applied throughout and the embedding vector is l2-normalized.\nThe same shallow network architecture, \u03c4, is used for both temporal and cross-modal classi\ufb01cation.\nBoth input vectors are combined by element-wise multiplication, with the result fed into a 2-layer\nMLP with widths (1024, 6) and ReLU non-linearity in between. A visualization of these networks and\ntheir interaction is provided in Figure 3. Note that although \u03c4tdc and \u03c4cmc share the same architecture,\nthey are operating on two different problems and therefore maintain separate sets of weights.\nTo generate training data, we sample input pairs (vi, wi) (where vi and wi are sampled from the\nsame domain) as follows. First, we sample a demonstration sequence from our three training videos.\nNext, we sample both an interval, dk \u2208 {[0], [1], [2], [3 \u2212 4], [5 \u2212 20], [21 \u2212 200]}, and a distance,\n\u2206t \u2208 dk. Finally, we randomly select a pair of frames from the sequence with temporal distance \u2206t.\nThe model is trained with Adam using a learning rate of 10\u22124 and batch size of 32 for 200,000 steps.\nAs described in Section 4, our imitation loss is constructed by generating checkpoints every N = 16\nframes along the \u03c6-embedded observation sequence of a single YouTube video. 
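Under these settings (checkpoints every N = 16 frames, \u2206t = 1, \u03b1 = 0.5), the soft-order checkpoint reward of Equation (2) can be sketched as follows. This is a simplified, illustrative implementation: `mean` denotes the zero-centering statistic and `last_idx` the index of the last checkpoint collected, both names of our own choosing:

```python
import numpy as np

ALPHA = 0.5      # similarity threshold from Section 4
DELTA_T = 1      # soft-order window

def _normalize(e, mean):
    e = e - mean                               # zero-center
    return e / (np.linalg.norm(e) + 1e-12)     # l2-normalize

def imitation_reward(agent_emb, checkpoints, last_idx, mean):
    """Sketch of the checkpoint reward of Equation (2).

    checkpoints: phi-embeddings sampled every N = 16 frames from one
    YouTube video. Only the next DELTA_T + 1 checkpoints after
    last_idx are eligible, enforcing the soft ordering.
    """
    a = _normalize(agent_emb, mean)
    upper = min(last_idx + 2 + DELTA_T, len(checkpoints))
    for idx in range(last_idx + 1, upper):
        if float(a @ _normalize(checkpoints[idx], mean)) > ALPHA:
            return 0.5, idx                    # checkpoint collected
    return 0.0, last_idx
```

Returning the advanced `last_idx` is what enforces the soft order: the agent may only collect checkpoints from the small window immediately following the last one it reached.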
We train an agent\nusing the sum of imitation and (optionally) environment rewards. We use the distributed A3C RL\nagent IMPALA [14] with 100 actors for our experiments. The only modi\ufb01cation we make to the\npublished network is to calculate the distance (as per Equation(2)) between the agent and the next\ntwo checkpoints and concatenate this 2-vector with the \ufb02attened output of the last convolutional layer.\nWe also tried re-starting our agent from checkpoints recorded along its trajectory, similar to Hosu et\nal. [23], but found that it provided minimal improvement given even our very long demonstrations.\n\n6 Analysis and Experiments\n\nIn this section we analyze (a) the learnt embedding spaces, and (b) the performance of our RL agent.\nWe consider three Atari 2600 games that are considered very dif\ufb01cult exploration challenges: MON-\nTEZUMA\u2019S REVENGE, PITFALL! and PRIVATE EYE. For each, we select four YouTube videos (three\ntraining and one test) of human gameplay, varying in duration from 3-to-10 minutes. Importantly,\n\n6\n\n\fFigure 6: For each embedding method, we visualize the t-SNE projection of four observation\nsequences traversing the \ufb01rst room of MONTEZUMA\u2019S REVENGE. Using pixel space alone fails to\nprovide any meaningful cross-domain alignment. Purely cross-modal methods perform better, but\nproduce a very scattered embedding due to missing long-range dependencies. The combination of\ntemporal and cross-modal objectives yields the best alignment and continuity of trajectories.\n\nnone of the YouTube videos were collected using our speci\ufb01c Arcade Learning Environment [10], and\nthe only pre-processing that we apply is keypoint-based (i.e. Harris corners) af\ufb01ne transformation to\nspatially align the game screens from the \ufb01rst frame only. 
The dataset used and additional experiments\ncan be found in the supplemental material to this paper.\n\n6.1 Embedding space evaluation\n\nTo usefully close the domain gap between YouTube gameplay footage and our environment observa-\ntions, our learnt embedding space should exhibit two desirable properties: (1) one-to-one alignment\ncapacity and (2) meaningful abstraction. We consider each of these properties in turn.\nFirst, one-to-one alignment is desirable for reliably mapping observations between different sequences.\nWe evaluate this property using the cycle-consistency measure introduced in Section 3.3. The features\nfrom earlier layers in \u03c6 (see Figure 5) are centered and l2-normalized before computing cycle-\nconsistency. Speci\ufb01cally, we consider both (a) the 2-way cycle-consistency, P\u03c6, between the test\nvideo and the \ufb01rst training video, and (b) the 3-way cycle-consistency, P 3\n\u03c6, between the test video and\nthe \ufb01rst two training videos. These results are presented in Figure 5, comparing the cycle-consistencies\nof our TDC, CMC and combined methods to a naive l2-distance in pixel space, single-view time-\ncontrastive networks (TCN) [34] and L3-Net [3]. Note that we implemented single-view TCN and\n\u03c6 cycle-consistency. As\nL3-Net in our framework and tuned the hyperparameters to achieve the best P 3\nexpected, pixel loss performs worst in the presence of sequence-to-sequence domain gaps. Our TDC\nand CMC methods alone yield improved performance compared to TCN and L3-Net (particularly at\ndeeper levels of abstraction), and combining both methods provides the best results overall.\nNext, Figure 6 shows the t-SNE projection of observation trajectories taken by different human\ndemonstrators to traverse the \ufb01rst room of MONTEZUMA\u2019S REVENGE. It is again evident that a\npixel-based loss entirely fails to align the sequences. The embeddings learnt using purely cross-modal\nalignment (i.e. 
L3-Net and CMC) perform better but still yield particularly scattered and disjoint\ntrajectories, which is an undesirable property likely due to the sparsity of salient audio signals. TDC\nand our combined TDC+CMC methods provide the more globally consistent trajectories, and are less\nlikely to produce false-positives with respect to the distance metric described in Section 4.\nFinally, a useful embedding should provide a useful abstraction that encodes meaningful, high-level\ninformation of the game while ignoring irrelevant features. To aid in visualizing this property, Figure 7\ndemonstrates the spatial activation of neurons in the \ufb01nal convolutional layer of the embedding\nnetwork \u03c6, using the visualization method proposed in [46]. It is compelling that the top activations\nare centered on features including the player and enemy positions in addition to the inventory state,\nwhich is informative of the next location that needs to be explored (e.g. if we have collected the key\nrequired to open a door). Important objects such as the key are emphasized more in the cross-modal\nand combined embeddings, likely due to the unique sounds that are played when collected (see \ufb01gure\n7(d) and (e)). Notably absent are activations associated with distractors such as the moving sand\nanimation, or video-speci\ufb01c artifacts indicative of the domain gap we wished to close.\n\n7\n\n\f(a) Neuron #46\n\n(b) Neuron #8\n\n(c) Neuron #39\n\n(d) TDC, Overall\n\n(e) CMC, Overall\n\nFigure 7: (Left) Visualization of select activations in the \ufb01nal convolutional layer. Individual neurons\nfocus on e.g. (a) the player, (b) enemies, and (c) the inventory. Notably absent are activations\nassociated with distractors or domain-speci\ufb01c artifacts. (Right) Visualization of activations summed\nacross all channels in the \ufb01nal layer. 
We observe that use of the audio signal in CMC results in more emphasis being placed on key items and their location in the inventory.

6.2 Solving hard exploration games with one-shot imitation

Using the method described in Section 4, we train an IMPALA agent to play the hard exploration Atari games MONTEZUMA\u2019S REVENGE, PITFALL! and PRIVATE EYE, using a learned auxiliary reward to guide exploration. For each game, the embedding network, \u03c6, was trained using just three YouTube videos, and an additional video was embedded to generate a sequence of exploration checkpoints. Videos of our agent playing these games can be found here2.

Figure 8 presents our learning curves for each hard exploration Atari game. Without an imitation reward, the pure RL agent is unable to collect any of the sparse rewards in MONTEZUMA\u2019S REVENGE and PITFALL!, and only reaches the first two rewards in PRIVATE EYE (consistent with previous studies using DQN variants [19, 22]). Using pixel-space features, the guided agent is able to obtain 17k points in PRIVATE EYE but still fails to make progress in the other games. Replacing the pixel embedding with our combined TDC+CMC embedding convincingly yields the best results, even if the agent is presented only with our TDC+CMC imitation reward (i.e. no environment reward).

To test the impact of the choice of expert trajectory, we generate checkpoints from two additional videos of MONTEZUMA\u2019S REVENGE from our set, and train agents with those sequences (Figure 8, left). While all three agents manage to clear the first level, expert 1 achieves the highest score. Of the three expert sequences considered, expert 1 also has the biggest domain shift. This is in line with our findings from Section 6.1 that our embedding space can sufficiently align our sequences. 
Domain\nshift in the expert trajectories is therefore not a signi\ufb01cant factor on performance.\nFinally, in Table 1 we compare our best policies for each game to the best previously published\nresults; Rainbow [19] and ApeX DQN [22] without demonstrations, and DQfD [20] using expert\ndemonstrations. Unlike DQfD our demonstrations are unaligned YouTube footage without access\nto action or reward trajectories. Our results are calculated using the standard approach of averaging\nover 200 episodes initialized using a random 1-to-30 no-op actions. Importantly, our approach is\nthe \ufb01rst to convincingly exceed human-level performance on all three games \u2013 even in the absence\nof an environment reward signal. We are the \ufb01rst to solve the entire \ufb01rst level of MONTEZUMA\u2019S\nREVENGE and PRIVATE EYE, and substantially outperform state-of-the-art on PITFALL!.\n\n7 Conclusion\n\nIn this paper, we propose a method of guiding agent exploration through hard exploration challenges\nby watching YouTube videos. Unlike traditional methods of imitation learning, where demonstrations\nare generated under controlled conditions with access to action and reward sequences, YouTube\nvideos contain only unaligned and often noisy audio-visual sequences. We have proposed novel\nself-supervised objectives that allow a domain-invariant representation to be learnt across videos, and\ndescribed a one-shot imitation mechanism for guiding agent exploration by embedding checkpoints\nthroughout this space. Combining these methods with a standard IMPALA agent, we demonstrate\n\n2https://www.youtube.com/playlist?list=PLZuOGGtntKlaOoq_8wk5aKgE_u_Qcpqhu\n\n8\n\n\fMONTEZUMA\u2019S REVENGE\n\nPITFALL!\n\nPRIVATE EYE\n\nFigure 8: Learning curves of our combined TDC+CMC algorithm with (purple) and without (yellow)\nenvironment reward, versus imitation from pixel-space features (blue) and IMPALA without demon-\nstrations (green). 
The red line represents the maximum reward achieved using previously published
methods, and the brown line denotes the score obtained by an average human player.

                              MONTEZUMA'S REVENGE    PITFALL!    PRIVATE EYE
Rainbow [19]                                384.0         0.0        4,234.0
ApeX [22]                                 2,500.0        -0.6           49.8
DQfD [20]                                 4,659.7        57.3       42,457.2
Average Human [43]                        4,743.0     6,464.0       69,571.0
Ours (r_imitation only)                  37,232.7    54,912.4       98,212.5
Ours (r_imitation + r_env)               58,175.1    76,812.5       98,763.2

Table 1: Comparison of our best policy (mean of 200 evaluation episodes) to previously published
results across MONTEZUMA'S REVENGE, PITFALL! and PRIVATE EYE. Our agent is the first to
exceed average human-level performance on all three games, even without environment rewards.

the first human-level performance in the infamously difficult exploration games MONTEZUMA'S
REVENGE, PITFALL! and PRIVATE EYE.

Acknowledgments  We would like to thank the team, especially Serkan Cabi, Bilal Piot and Tobias
Pohlen, for many fruitful discussions. We thank the reviewers for their comments, which helped
make this a better paper. Finally, we say 'thank you' to all the amazing Atari players on
YouTube and Twitch, who inspired this project.

References
[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.

[2] Theodore Wilbur Anderson. An introduction to multivariate statistical analysis, volume 2. Wiley New York, 1958.

[3] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617. IEEE, 2017.

[4] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration.
Robotics and autonomous systems, 57(5):469–483, 2009.

[5] Yusuf Aytar, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Cross-modal scene networks. IEEE transactions on pattern analysis and machine intelligence, 2017.

[6] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932, 2017.

[7] Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. International Conference on Learning Representations (ICLR), 2018.

[8] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.

[9] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. International Conference on Machine Learning (ICML), 2017.

[10] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[11] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.

[12] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In The IEEE International Conference on Computer Vision (ICCV), 2017.

[13] Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning.
In Advances in neural information\nprocessing systems, pages 1087\u20131098, 2017.\n\n[14] Lasse Espeholt, Hubert Soyer, R\u00e9mi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam\nDoron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: scalable\ndistributed deep-rl with importance weighted actor-learner architectures. CoRR, abs/1802.01561, 2018.\n\n[15] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation\n\nlearning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.\n\n[16] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Fran\u00e7ois Laviolette,\nMario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of\nMachine Learning Research, 17(1):2096\u20132030, 2016.\n\n[17] Audrunas Gruslys, Mohammad Gheshlaghi Azar, Marc G Bellemare, and Remi Munos. The reactor: A\nsample-ef\ufb01cient actor-critic architecture. International Conference on Learning Representations (ICLR),\n2017.\n\n[18] Dylan Had\ufb01eld-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward\n\ndesign. In Advances in Neural Information Processing Systems, pages 6768\u20136777, 2017.\n\n[19] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan\nHorgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep\nreinforcement learning. Proceedings of the AAAI Conference on Arti\ufb01cial Intelligence, 2017.\n\n[20] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John\nQuan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep q-learning from demonstrations. Proceedings\nof the AAAI Conference on Arti\ufb01cial Intelligence, 2017.\n\n[21] Jonathan Ho and Stefano Ermon. 
Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[22] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. International Conference on Learning Representations (ICLR), 2018.

[23] Ionel-Alexandru Hosu and Traian Rebedea. Playing atari games with deep reinforcement learning and human checkpoint replay. CoRR, abs/1607.05077, 2016.

[24] Yuxuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2017.

[25] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

[26] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[27] Ian Osband, Daniel Russo, Zheng Wen, and Benjamin Van Roy. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.

[28] Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, pages 801–816. Springer, 2016.

[29] Tom Le Paine, Sergio Gómez Colmenarejo, Ziyu Wang, Scott Reed, Yusuf Aytar, Tobias Pfaff, Matt W Hoffman, Gabriel Barth-Maron, Serkan Cabi, David Budden, et al. One-shot high-fidelity imitation: Training large-scale deep nets with rl. arXiv preprint arXiv:1810.05017, 2018.

[30] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction.
In International Conference on Machine Learning (ICML), 2017.

[31] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. arXiv preprint arXiv:1804.08606, 2018.

[32] Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018.

[33] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.

[34] Pierre Sermanet, Corey Lynch, Jasmine Hsu, and Sergey Levine. Time-contrastive networks: Self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888, 2017.

[35] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[36] Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from. In Proceedings of the annual conference of the cognitive science society, pages 2601–2606, 2009.

[37] Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv preprint arXiv:1703.01703, 2017.

[38] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.

[39] George Trigeorgis, Mihalis A Nicolaou, Björn W Schuller, and Stefanos Zafeiriou. Deep canonical time warping for simultaneous alignment and representation learning of sequences.
IEEE transactions on pattern analysis and machine intelligence, 40(5):1128–1138, 2018.

[40] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

[41] Matej Večerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

[42] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.

[43] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. International Conference on Machine Learning (ICML), 2015.

[44] Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.

[45] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.

[46] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856, 2014.

[47] Tinghui Zhou, Philipp Krahenbuhl, Mathieu Aubry, Qixing Huang, and Alexei A Efros. Learning dense correspondence via 3d-guided cycle consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 117–126, 2016.

[48] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks.
arXiv preprint arXiv:1703.10593, 2017.

[49] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.