{"title": "One-Shot Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1087, "page_last": 1098, "abstract": "Imitation learning has been commonly applied to solve different tasks in isolation. This usually requires either careful feature engineering, or a significant number of samples. This is far from what we desire: ideally, robots should be able to learn from very few demonstrations of any given task, and instantly generalize to new situations of the same task, without requiring task-specific engineering. In this paper, we propose a meta-learning framework for achieving such capability, which we call one-shot imitation learning.  Specifically, we consider the setting where there is a very large (maybe infinite) set of tasks, and each task has many instantiations.  For example, a task could be to stack all blocks on a table into a single tower, another task could be to place all blocks on a table into two-block towers, etc. In each case, different instances of the task would consist of different sets of blocks with different initial states.  At training time, our algorithm is presented with pairs of demonstrations for a subset of all tasks.  A neural net is trained that takes as input one demonstration and the current state (which initially is the initial state of the other demonstration of the pair), and outputs an action with the goal that the resulting sequence of states and actions matches as closely as possible with the second demonstration. At test time, a demonstration of a single instance of a new task is presented, and the neural net is expected to perform well on new instances of this new task. Our experiments show that the use of soft attention allows the model to generalize to conditions and tasks unseen in the training data. 
We anticipate that by training this model on a much greater variety of tasks and settings, we will obtain a general system that can turn any demonstrations into robust policies that can accomplish an overwhelming variety of tasks.", "full_text": "One-Shot Imitation Learning\n\nYan Duan\u2020\u00a7, Marcin Andrychowicz\u2021, Bradly Stadie\u2020\u2021, Jonathan Ho\u2020\u00a7,\nJonas Schneider\u2021, Ilya Sutskever\u2021, Pieter Abbeel\u2020\u00a7, Wojciech Zaremba\u2021\n\n{rockyduan, jonathanho, pabbeel}@eecs.berkeley.edu\n{marcin, bstadie, jonas, ilyasu, woj}@openai.com\n\n\u2020Berkeley AI Research Lab, \u2021OpenAI\n\n\u00a7Work done while at OpenAI\n\nAbstract\n\nImitation learning has been commonly applied to solve different tasks in isolation.\nThis usually requires either careful feature engineering, or a signi\ufb01cant number of\nsamples. This is far from what we desire: ideally, robots should be able to learn\nfrom very few demonstrations of any given task, and instantly generalize to new\nsituations of the same task, without requiring task-speci\ufb01c engineering. In this\npaper, we propose a meta-learning framework for achieving such capability, which\nwe call one-shot imitation learning.\nSpeci\ufb01cally, we consider the setting where there is a very large (maybe in\ufb01nite)\nset of tasks, and each task has many instantiations. For example, a task could be\nto stack all blocks on a table into a single tower, another task could be to place\nall blocks on a table into two-block towers, etc. In each case, different instances\nof the task would consist of different sets of blocks with different initial states.\nAt training time, our algorithm is presented with pairs of demonstrations for a\nsubset of all tasks. A neural net is trained such that when it takes as input the \ufb01rst\ndemonstration and a state sampled from the second demonstration,\nit should predict the action corresponding to the sampled state. 
At test time, a full\ndemonstration of a single instance of a new task is presented, and the neural net\nis expected to perform well on new instances of this new task. Our experiments\nshow that the use of soft attention allows the model to generalize to conditions and\ntasks unseen in the training data. We anticipate that by training this model on a\nmuch greater variety of tasks and settings, we will obtain a general system that can\nturn any demonstrations into robust policies that can accomplish an overwhelming\nvariety of tasks.\n\n1\n\nIntroduction\n\nWe are interested in robotic systems that are able to perform a variety of complex useful tasks, e.g.\ntidying up a home or preparing a meal. The robot should be able to learn new tasks without long\nsystem interaction time. To accomplish this, we must solve two broad problems. The \ufb01rst problem is\nthat of dexterity: robots should learn how to approach, grasp and pick up complex objects, and how\nto place or arrange them into a desired con\ufb01guration. The second problem is that of communication:\nhow to communicate the intent of the task at hand, so that the robot can replicate it in a broader set of\ninitial conditions.\nDemonstrations are an extremely convenient form of information we can use to teach robots to over-\ncome these two challenges. Using demonstrations, we can unambiguously communicate essentially\nany manipulation task, and simultaneously provide clues about the speci\ufb01c motor skills required to\nperform the task. We can compare this with an alternative form of communication, namely natural\nlanguage. Although language is highly versatile, effective, and ef\ufb01cient, natural language processing\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1:\n(a) Traditionally, policies are task-speci\ufb01c. 
For example, a policy might have been\ntrained through an imitation learning algorithm to stack blocks into towers of height 3, and then\nanother policy would be trained to stack blocks into towers of height 2, etc. (b) In this paper, we\nare interested in training networks that are not speci\ufb01c to one task, but rather can be told (through a\nsingle demonstration) what the current new task is, and be successful at this new task. For example,\nwhen it is conditioned on a single demonstration for task F, it should behave like a good policy for\ntask F. (c) We can phrase this as a supervised learning problem, where we train this network on a set\nof training tasks, and with enough examples it should generalize to unseen, but related tasks. To train\nthis network, in each iteration we sample a demonstration from one of the training tasks, and feed it\nto the network. Then, we sample an observation and its corresponding action from a second demonstration\nof the same task. When conditioned on both the \ufb01rst demonstration and this observation, the network\nis trained to output the corresponding action.\n\nsystems are not yet at a level where we could easily use language to precisely describe a complex\ntask to a robot. Compared to language, using demonstrations has two fundamental advantages: \ufb01rst,\nit does not require the knowledge of language, as it is possible to communicate complex tasks to\nhumans that don\u2019t speak one\u2019s language. And second, there are many tasks that are extremely dif\ufb01cult\nto explain in words, even if we assume perfect linguistic abilities: for example, explaining how to\nswim without demonstration and experience seems to be, at the very least, an extremely challenging\ntask.\nIndeed, learning from demonstrations has had many successful applications. However, so far\nthese applications have either required careful feature engineering, or a signi\ufb01cant amount of system\ninteraction time. 
This is far from what we desire: ideally, we hope to demonstrate a certain task\nonly once or a few times to the robot, and have it instantly generalize to new situations of the same\ntask, without long system interaction time or domain knowledge about individual tasks.\nIn this paper we explore the one-shot imitation learning setting illustrated in Fig. 1, where the\nobjective is to maximize the expected performance of the learned policy when faced with a new,\npreviously unseen, task, and having received as input only one demonstration of that task. For the\ntasks we consider, the policy is expected to achieve good performance without any additional system\ninteraction, once it has received the demonstration.\nWe train a policy on a broad distribution over tasks, where the number of tasks is potentially in\ufb01nite.\nFor each training task we assume the availability of a set of successful demonstrations. Our learned\npolicy takes as input: (i) the current observation, and (ii) one demonstration that successfully solves\na different instance of the same task (this demonstration is \ufb01xed for the duration of the episode).\nThe policy outputs the current controls. We note that any pair of demonstrations for the same task\nprovides a supervised training example for the neural net policy, where one demonstration is treated\nas the input, while the other as the output.\n\n\fTo make this model work, we made essential use of soft attention [6] for processing both the (poten-\ntially long) sequence of states and actions that correspond to the demonstration, and for processing the\ncomponents of the vector specifying the locations of the various blocks in our environment. The use\nof soft attention over both types of inputs made strong generalization possible. In particular, on a\nfamily of block stacking tasks, our neural network policy was able to perform well on novel block\ncon\ufb01gurations which were not present in any training data. Videos of our experiments are available\nat http://bit.ly/nips2017-oneshot.\n\n2 Related Work\n\nImitation learning considers the problem of acquiring skills from observing demonstrations. Survey\narticles include [48, 11, 3].\nTwo main lines of work within imitation learning are behavioral cloning, which performs supervised\nlearning from observations to actions (e.g., [41, 44]); and inverse reinforcement learning [37], where\na reward function [1, 66, 29, 18, 22] is estimated that explains the demonstrations as (near) optimal\nbehavior. 
While this past work has led to a wide range of impressive robotics results, it considers\neach skill separately, and having learned to imitate one skill does not accelerate learning to imitate\nthe next skill.\nOne-shot and few-shot learning has been studied for image recognition [61, 26, 47, 42], generative\nmodeling [17, 43], and learning \u201cfast\u201d reinforcement learning agents with recurrent policies [16, 62].\nFast adaptation has also been achieved through fast-weights [5]. Like our algorithm, many of the\naforementioned approaches are a form of meta-learning [58, 49, 36], where the algorithm itself is\nbeing learned. Meta-learning has also been studied to discover neural network weight optimization\nalgorithms [8, 9, 23, 50, 2, 31]. This prior work on one-shot learning and meta-learning, however,\nis tailored to respective domains (image recognition, generative models, reinforcement learning,\noptimization) and not directly applicable in the imitation learning setting. Recently, [19] propose a\ngeneric framework for meta learning across several aforementioned domains. However they do not\nconsider the imitation learning setting.\nReinforcement learning [56, 10] provides an alternative route to skill acquisition, by learning through\ntrial and error. Reinforcement learning has had many successes, including Backgammon [57],\nhelicopter control [39], Atari [35], Go [52], continuous control in simulation [51, 21, 32] and on\nreal robots [40, 30]. However, reinforcement learning tends to require a large number of trials and\nrequires specifying a reward function to de\ufb01ne the task at hand. The former can be time-consuming\nand the latter can often be signi\ufb01cantly more dif\ufb01cult than providing a demonstration [37].\nMulti-task and transfer learning considers the problem of learning policies with applicability and\nre-use beyond a single task. 
Success stories include domain adaptation in computer vision [64, 34,\n28, 4, 15, 24, 33, 59, 14] and control [60, 45, 46, 20, 54]. However, while acquiring a multitude of\nskills faster than what it would take to acquire each of the skills independently, these approaches do\nnot provide the ability to readily pick up a new skill from a single demonstration.\nOur approach heavily relies on an attention model over the demonstration and an attention model\nover the current observation. We use the soft attention model proposed in [6] for machine translation,\nwhich has also been successful in image captioning [63]. The interaction networks proposed\nin [7, 12] also leverage locality of physical interaction in learning. Our model is also related to\nthe sequence to sequence model [55, 13], as in both cases we consume a very long demonstration\nsequence and, effectively, emit a long sequence of actions.\n\n3 One Shot Imitation Learning\n\n3.1 Problem Formalization\nWe denote a distribution of tasks by T, an individual task by t \u223c T, and a distribution of demon-\nstrations for the task t by D(t). A policy is symbolized by \u03c0\u03b8(a|o, d), where a is an action, o is\nan observation, d is a demonstration, and \u03b8 are the parameters of the policy. A demonstration\nd \u223c D(t) is a sequence of observations and actions: d = [(o1, a1), (o2, a2), . . . , (oT , aT )]. We\nassume that the distribution of tasks T is given, and that we can obtain successful demonstrations for\neach task. We assume that there is some scalar-valued evaluation function Rt(d) (e.g. a binary value\nindicating success) for each task, although this is not required during training. 
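The formalization of Section 3.1 can be sketched in code. The type names and the Monte-Carlo estimate of the objective below are our illustrative assumptions, not the paper's implementation:

```python
# Sketch of Sec. 3.1: a demonstration d ~ D(t) is a sequence of
# (observation, action) pairs, and the policy pi_theta(a | o, d) maps the
# current observation plus one full demonstration to an action.
from typing import Callable, List, Tuple

Observation = List[float]   # e.g. block positions relative to the gripper
Action = List[float]        # e.g. a gripper motion command
Demonstration = List[Tuple[Observation, Action]]
Policy = Callable[[Observation, Demonstration], Action]

def expected_performance(policy, tasks, sample_demo, evaluate):
    """Monte-Carlo estimate of the objective: average evaluation R_t of the
    policy over tasks t and conditioning demonstrations d ~ D(t)."""
    scores = [evaluate(t, policy, sample_demo(t)) for t in tasks]
    return sum(scores) / len(scores)
```

Here `evaluate` plays the role of the scalar-valued evaluation function Rt, which is needed only at evaluation time, not during training.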
The objective is to\nmaximize the expected performance of the policy, where the expectation is taken over tasks t \u2208 T,\nand demonstrations d \u2208 D(t).\n\n3.2 Block Stacking Tasks\n\nTo clarify the problem setting, we describe a concrete example of a distribution of block stacking\ntasks, which we will also later study in the experiments. The compositional structure shared among\nthese tasks allows us to investigate nontrivial generalization to unseen tasks. For each task, the goal is\nto control a 7-DOF Fetch robotic arm to stack various numbers of cube-shaped blocks into a speci\ufb01c\ncon\ufb01guration speci\ufb01ed by the user. Each con\ufb01guration consists of a list of blocks arranged into\ntowers of different heights, and can be identi\ufb01ed by a string. For example, ab cd ef gh means\nthat we want to stack 4 towers, each with two blocks, and we want block A to be on top of block B,\nblock C on top of block D, block E on top of block F, and block G on top of block H. Each of these\ncon\ufb01gurations corresponds to a different task. Furthermore, in each episode the starting positions\nof the blocks may vary, which requires the learned policy to generalize even within the training\ntasks. In a typical task, an observation is a list of (x, y, z) object positions relative to the gripper,\nand an indicator of whether the gripper is open or closed. The number of objects may vary across different\ntask instances. We de\ufb01ne a stage as a single operation of stacking one block on top of another. For\nexample, the task ab cd ef gh has 4 stages.\n\n3.3 Algorithm\n\nIn order to train the neural network policy, we make use of imitation learning algorithms such\nas behavioral cloning and DAGGER [44], which only require demonstrations rather than reward\nfunctions to be speci\ufb01ed. 
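The task-string convention of Section 3.2 can be made concrete with a small hypothetical parser (the helper names are ours, not the paper's):

```python
# Each word of a task string is one tower, listed top-to-bottom, e.g.
# "ab cd ef gh" means a on b, c on d, e on f, g on h. Each stacking of one
# block onto another is one "stage" of the task.

def parse_task(task: str):
    """Return the list of towers, each a top-to-bottom list of block ids."""
    return [list(tower) for tower in task.split()]

def num_stages(task: str) -> int:
    """One stage per block stacked on another: (height - 1) per tower."""
    return sum(len(tower) - 1 for tower in parse_task(task))
```

For example, `num_stages("ab cd ef gh")` returns 4, matching the example above, and `num_stages("ab cde fg hij")` returns 6, matching the 6-stage task discussed with Fig. 4.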
This has the potential to be more scalable, since it is often easier to\ndemonstrate a task than to specify a well-shaped reward function [38].\nWe start by collecting a set of demonstrations for each task, where we add noise to the actions in order\nto have wider coverage in the trajectory space. In each training iteration, we sample a list of tasks\n(with replacement). For each sampled task, we sample a demonstration as well as a small batch of\nobservation-action pairs. The policy is trained to regress against the desired actions when conditioned\non the current observation and the demonstration, by minimizing an \u21132 or cross-entropy loss based on\nwhether actions are continuous or discrete. A high-level illustration of the training procedure is given\nin Fig. 1(c). Across all experiments, we use Adamax [25] to perform the optimization with a learning\nrate of 0.001.\n\n4 Architecture\n\nWhile, in principle, a generic neural network could learn the mapping from demonstration and current\nobservation to appropriate action, we found it important to use an appropriate architecture. Our\narchitecture for learning block stacking is one of the main contributions of this paper, and we believe\nit is representative of what architectures for one-shot imitation learning could look like in the future\nwhen considering more complex tasks.\nOur proposed architecture consists of three modules: the demonstration network, the context network,\nand the manipulation network. An illustration of the architecture is shown in Fig. 2. We will describe\nthe main operations performed in each module below, and a full speci\ufb01cation is available in the\nAppendix.\n\n4.1 Demonstration Network\n\nThe demonstration network receives a demonstration trajectory as input, and produces an embedding\nof the demonstration to be used by the policy. 
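The supervised regression step of Section 3.3 can be sketched as follows, assuming continuous actions and the \u21132 loss; the linear policy, flattened demonstration embedding, and plain-SGD update are stand-ins of ours (the paper trains the attention architecture of Section 4 with Adamax at learning rate 0.001):

```python
import numpy as np

# Sketch of one behavioral-cloning regression step: given a conditioning
# demonstration (here already reduced to a fixed-size embedding) and one
# (observation, action) pair, minimize the squared error between the
# predicted and demonstrated action.

def l2_loss(W, demo_embed, obs, action):
    """0.5 * ||pi_W(obs, demo) - action||^2 for a linear stand-in policy."""
    pred = W @ np.concatenate([demo_embed, obs])
    return 0.5 * np.sum((pred - action) ** 2)

def sgd_step(W, demo_embed, obs, action, lr=1e-3):
    """One gradient step on the l2 loss (closed-form gradient for linear W)."""
    x = np.concatenate([demo_embed, obs])
    pred = W @ x
    grad = np.outer(pred - action, x)   # d(loss)/dW
    return W - lr * grad
```

In the paper the same loss is driven through the full attention network by Adamax rather than this hand-written update; the sketch only illustrates the shape of the supervision signal.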
The size of this embedding grows linearly as a function\nof the length of the demonstration as well as the number of blocks in the environment.\nTemporal Dropout: For block stacking, the demonstrations can span hundreds to thousands of time\nsteps, and training with such long sequences can be demanding in both time and memory usage.\nHence, we randomly discard a subset of time steps during training, an operation we call temporal\ndropout, analogous to [53, 27]. We denote p as the proportion of time steps that are thrown away.\n\n\fFigure 2: Illustration of the network architecture.\n\nIn our experiments, we use p = 0.95, which reduces the length of demonstrations by a factor of 20.\nDuring test time, we can sample multiple downsampled trajectories, use each of them to compute\ndownstream results, and average these results to produce an ensemble estimate. In our experience,\nthis consistently improves the performance of the policy.\nNeighborhood Attention: After downsampling the demonstration, we apply a sequence of opera-\ntions, composed of dilated temporal convolution [65] and neighborhood attention. We now describe\nthis second operation in more detail.\nSince our neural network needs to handle demonstrations with variable numbers of blocks, it must\nhave modules that can process variable-dimensional inputs. Soft attention is a natural operation which\nmaps variable-dimensional inputs to \ufb01xed-dimensional outputs. However, by doing so, it may lose\ninformation compared to its input. This is undesirable, since the amount of information contained\nin a demonstration grows as the number of blocks increases. Therefore, we need an operation that\ncan map variable-dimensional inputs to outputs with comparable dimensions. 
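Temporal dropout as described above can be sketched as follows; always keeping the final time step is our own assumption (so that the goal configuration survives downsampling), not something the paper specifies:

```python
import numpy as np

# Sketch of temporal dropout (Sec. 4.1): during training, independently drop
# each time step of the demonstration with probability p. With p = 0.95 a
# 1000-step demonstration keeps ~50 steps, a ~20x reduction. At test time,
# several independently downsampled copies can be processed and their
# outputs averaged as an ensemble.

def temporal_dropout(demo_len: int, p: float, rng) -> np.ndarray:
    """Return sorted indices of the kept time steps (temporal order intact)."""
    keep = rng.random(demo_len) >= p
    keep[-1] = True  # assumption: keep the last step so the goal state survives
    return np.flatnonzero(keep)
```

A downsampled demonstration is then simply `[demo[i] for i in temporal_dropout(len(demo), 0.95, rng)]`.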
Intuitively, rather than\nhaving a single output as a result of attending to all inputs, we have as many outputs as inputs, and\nhave each output attending to all other inputs in relation to its own corresponding input.\nWe start by describing the soft attention module as speci\ufb01ed in [6]. The input to the attention includes\na query q, a list of context vectors {cj}, and a list of memory vectors {mj}. The ith attention weight\nis given by wi \u2190 vT tanh(q + ci), where v is a learned weight vector. The output of attention is a\nweighted combination of the memory content, where the weights are given by a softmax operation\nover the attention weights. Formally, we have output \u2190 \u2211i [exp(wi) / \u2211j exp(wj)] mi. Note that the output has\nthe same dimension as a memory vector. The attention operation can be generalized to multiple query\nheads, in which case there will be as many output vectors as there are queries.\nNow we turn to neighborhood attention. We assume there are B blocks in the environment. We\ndenote the robot\u2019s state as srobot, and the coordinates of each block as (x1, y1, z1), . . . , (xB, yB, zB).\nThe input to neighborhood attention is a list of embeddings hin1, . . . , hinB of the same dimension,\nwhich can be the result of a projection operation over a list of block positions, or the output of a\nprevious neighborhood attention operation. Given this list of embeddings, we use two separate linear\nlayers to compute a query vector and a context embedding for each block: qi \u2190 Linear(hini), and\nci \u2190 Linear(hini). The memory content to be extracted consists of the coordinates of each block,\nconcatenated with the input embedding. The ith query result is given by the following soft attention\noperation: resulti \u2190 SoftAttn(query: qi, context: {cj}Bj=1, memory: {((xj, yj, zj), hinj)}Bj=1).\nIntuitively, this operation allows each block to query other blocks in relation to itself (e.g. \ufb01nd the\nclosest block), and extract the queried information. The gathered results are then combined with\neach block\u2019s own information, to produce the output embedding per block. Concretely, we have\noutputi \u2190 Linear(concat(hini, resulti, (xi, yi, zi), srobot)). In practice, we use multiple query\nheads per block, so that the size of each resulti will be proportional to the number of query heads.\n\n4.2 Context network\n\nThe context network is the crux of our model. It processes both the current state and the embedding\nproduced by the demonstration network, and outputs a context embedding, whose dimension does\nnot depend on the length of the demonstration, or the number of blocks in the environment. Hence, it\nis forced to capture only the relevant information, which will be used by the manipulation network.\nAttention over demonstration: The context network starts by computing a query vector as a function\nof the current state, which is then used to attend over the different time steps in the demonstration\nembedding. The attention weights over different blocks within the same time step are summed\ntogether, to produce a single weight per time step. The result of this temporal attention is a vector\nwhose size is proportional to the number of blocks in the environment. We then apply neighborhood\nattention to propagate the information across the embeddings of each block. 
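A minimal NumPy sketch of the soft attention and neighborhood attention operations of Section 4.1, with a single query head and illustrative shapes (the learned Linear layers are replaced by plain weight matrices):

```python
import numpy as np

# Additive soft attention of [6]: w_i = v^T tanh(q + c_i),
# output = sum_i softmax(w)_i * m_i, so the output has a memory vector's
# dimension regardless of how many inputs there are.

def soft_attention(q, context, memory, v):
    """q: (d,); context: (n, d); memory: (n, k); v: (d,) -> output: (k,)."""
    w = np.tanh(q + context) @ v              # attention scores w_i
    w = np.exp(w - w.max())                   # stabilized exponentials
    weights = w / w.sum()                     # softmax over the n inputs
    return weights @ memory                   # convex combination of memory

def neighborhood_attention(h, coords, v, Wq, Wc):
    """One output per block: block i queries all blocks relative to itself.
    h: (B, d) input embeddings; coords: (B, 3) block positions."""
    q, c = h @ Wq, h @ Wc                     # per-block query and context
    mem = np.concatenate([coords, h], axis=1) # memory: (x, y, z) ++ embedding
    return np.stack([soft_attention(q[i], c, mem, v) for i in range(len(h))])
```

Unlike plain soft attention, the neighborhood variant returns as many outputs as inputs (one per block), which is what lets the dimensionality track the number of blocks; the paper additionally concatenates each block's own embedding, position, and the robot state before a final Linear layer, and uses multiple query heads.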
This process is repeated\nmultiple times, where the state is advanced using an LSTM cell with untied weights.\nAttention over current state: The previous operations produce an embedding whose size is inde-\npendent of the length of the demonstration, but still dependent on the number of blocks. We then\napply standard soft attention over the current state to produce \ufb01xed-dimensional vectors, where the\nmemory content only consists of positions of each block, which, together with the robot\u2019s state, forms\nthe context embedding, which is then passed to the manipulation network.\nIntuitively, although the number of objects in the environment may vary, at each stage of the\nmanipulation operation, the number of relevant objects is small and usually \ufb01xed. For the block\nstacking environment speci\ufb01cally, the robot should only need to pay attention to the position of the\nblock it is trying to pick up (the source block), as well as the position of the block it is trying to place\non top of (the target block). Therefore, a properly trained network can learn to match the current\nstate with the corresponding stage in the demonstration, and infer the identities of the source and\ntarget blocks expressed as soft attention weights over different blocks, which are then used to extract\nthe corresponding positions to be passed to the manipulation network. Although we do not enforce\nthis interpretation in training, our experiment analysis supports this interpretation of how the learned\npolicy works internally.\n\n4.3 Manipulation network\n\nThe manipulation network is the simplest component. 
After extracting the information of the source\nand target blocks, it computes the action needed to complete the current stage of stacking one block\non top of another one, using a simple MLP network.1 This division of labor opens up the possibility\nof modular training: the manipulation network may be trained to complete this simple procedure,\nwithout knowing about demonstrations or more than two blocks present in the environment. We leave\nthis possibility for future work.\n\n5 Experiments\n\nWe conduct experiments with the block stacking tasks described in Section 3.2.2 These experiments\nare designed to answer the following questions:\n\n\u2022 How does training with behavioral cloning compare with DAGGER?\n\u2022 How does conditioning on the entire demonstration compare to conditioning on the \ufb01nal\nstate, even when it already has enough information to fully specify the task?\n\u2022 How does conditioning on the entire demonstration compare to conditioning on a \u201csnapshot\u201d\nof the trajectory, which is a small subset of frames that are most informative?\n\u2022 Can our framework generalize to tasks that it has never seen during training?\n\n1In principle, one can replace this module with an RNN module. But we did not \ufb01nd this necessary for the\ntasks we consider.\n2Additional experiment results are available in the Appendix, including a simple illustrative example of\nparticle reaching tasks and further analysis of block stacking.\n\n\fTo answer these questions, we compare the performance of the following architectures:\n\n\u2022 DAGGER: We use the architecture described in the previous section, and train the policy\nusing DAGGER.\n\u2022 BC: We use the same architecture as previous, but train the policy using behavioral cloning.\n\u2022 Final state: This architecture conditions on the \ufb01nal state rather than on the entire demon-\nstration trajectory. 
For the block stacking task family, the \ufb01nal state uniquely identi\ufb01es the\ntask, and there is no need for additional information. However, a full trajectory, one which\ncontains information about intermediate stages of the task\u2019s solution, can make it easier to\ntrain the optimal policy, because it could learn to rely on the demonstration directly, without\nneeding to memorize the intermediate steps into its parameters. This is related to the way in\nwhich reward shaping can signi\ufb01cantly affect performance in reinforcement learning [38].\nA comparison between the two conditioning strategies will tell us whether this hypothesis is\nvalid. We train this policy using DAGGER.\n\u2022 Snapshot: This architecture conditions on a \u201csnapshot\u201d of the trajectory, which includes the\nlast frame of each stage along the demonstration trajectory. This assumes that a segmentation\nof the demonstration into multiple stages is available at test time, which gives it an unfair\nadvantage compared to the other conditioning strategies. Hence, it may perform better than\nconditioning on the full trajectory, and serves as a reference, to inform us whether the policy\nconditioned on the entire trajectory can perform as well as if the demonstration is clearly\nsegmented. Again, we train this policy using DAGGER.\n\nWe evaluate the policy on tasks seen during training, as well as tasks unseen during training. Note\nthat generalization is evaluated at multiple levels: the learned policy not only needs to generalize to\nnew con\ufb01gurations and new demonstrations of tasks seen already, but also needs to generalize to new\ntasks.\nConcretely, we collect 140 training tasks, and 43 test tasks, each with a different desired layout of the\nblocks. The number of blocks in each task can vary between 2 and 10. We collect 1000 trajectories\nper task for training, and maintain a separate set of trajectories and initial con\ufb01gurations to be used\nfor evaluation. 
The trajectories are collected using a hard-coded policy.\n\n5.1 Performance Evaluation\n\n(a) Performance on training tasks.\n\n(b) Performance on test tasks.\n\nFigure 3: Comparison of different conditioning strategies. The darkest bar shows the performance of the\nhard-coded policy, which unsurprisingly performs the best most of the time. For architectures that use temporal\ndropout, we use an ensemble of 10 different downsampled demonstrations and average the action distributions.\nThen for all architectures we use the greedy action for evaluation.\n\nFig. 3 shows the performance of various architectures. Results for training and test tasks are presented\nseparately, where we group tasks by the number of stages required to complete them. This is because\ntasks that require more stages to complete are typically more challenging. In fact, even our scripted\npolicy frequently fails on the hardest tasks. We measure success rate per task by executing the greedy\npolicy (taking the most con\ufb01dent action at every time step) in 100 different con\ufb01gurations, each\nconditioned on a different demonstration unseen during training. We report the average success rate\nover all tasks within the same group.\n\n\fFrom the \ufb01gure, we can observe that for the easier tasks with fewer stages, all of the different\nconditioning strategies perform equally well and almost perfectly. As the dif\ufb01culty (number of stages)\nincreases, however, conditioning on the entire demonstration starts to outperform conditioning on the\n\ufb01nal state. 
One possible explanation is that when conditioned only on the \ufb01nal state, the policy may\nstruggle to determine which block it should stack \ufb01rst, a piece of information that is readily accessible from the\ndemonstration, which not only communicates the task, but also provides valuable information to help\naccomplish it.\nMore surprisingly, conditioning on the entire demonstration also seems to outperform conditioning\non the snapshot, which we originally expected to perform the best. We suspect that this is due\nto the regularization effect introduced by temporal dropout, which effectively augments the set of\ndemonstrations seen by the policy during training.\nAnother interesting \ufb01nding was that training with behavioral cloning has the same level of performance\nas training with DAGGER, which suggests that the entire training procedure could work without\nrequiring interactive supervision. In our preliminary experiments, we found that injecting noise into\nthe trajectory collection process was important for behavioral cloning to work well, hence in all\nexperiments reported here we use noise injection. In practice, such noise can come from natural\nhuman-induced noise through tele-operation, or by arti\ufb01cially injecting additional noise before\napplying it on the physical robot.\n\n5.2 Visualization\n\nWe visualize the attention mechanisms underlying the main policy architecture to have a better\nunderstanding of how it operates. There are two kinds of attention we are mainly interested in,\none where the policy attends to different time steps in the demonstration, and the other where the\npolicy attends to different blocks in the current state. Fig. 4 shows some of the attention heatmaps.\n\n(a) Attention over blocks in the current state.\n\n(b) Attention over downsampled demonstration.\n\nFigure 4: Visualizing attentions performed by the policy during an entire execution. The task\nbeing performed is ab cde fg hij. 
Note that the policy has multiple query heads for each type of attention, and only one query head per type is visualized. (a) We can observe that the policy almost always focuses on a small subset of the block positions in the current state, which allows the manipulation network to generalize to operations over different blocks. (b) We can observe a sparse pattern of time steps that have high attention weights. This suggests that the policy has essentially learned to segment the demonstrations, and only attends to important key frames. Note that there are roughly 6 regions of high attention weights, which nicely corresponds to the 6 stages required to complete the task.

6 Conclusions

In this work, we presented a simple model that maps a single successful demonstration of a task to an effective policy that solves said task in a new situation. We demonstrated the effectiveness of this approach on a family of block stacking tasks. There are many exciting directions for future work. We plan to extend the framework to demonstrations in the form of image data, which will allow more end-to-end learning without requiring a separate perception module. We are also interested in enabling the policy to condition on multiple demonstrations, in cases where one demonstration does not fully resolve ambiguity in the objective. Furthermore, and most importantly, we hope to scale up our method to a much larger and broader distribution of tasks, and explore its potential towards a general robotics imitation learning system that would be able to accomplish an overwhelming variety of tasks.

7 Acknowledgement

We would like to thank our colleagues at UC Berkeley and OpenAI for insightful discussions. This research was funded in part by ONR through a PECASE award. Yan Duan was also supported by a Huawei Fellowship. Jonathan Ho was also supported by an NSF Fellowship.