{"title": "Compositional Plan Vectors", "book": "Advances in Neural Information Processing Systems", "page_first": 14989, "page_last": 15000, "abstract": "Autonomous agents situated in real-world environments must be able to master large repertoires of skills.\nWhile a single short skill can be learned quickly, it would be impractical to learn every task independently. Instead, the agent should share knowledge across behaviors such that each task can be learned efficiently, and such that the resulting model can generalize to new tasks, especially ones that are compositions or subsets of tasks seen previously.\nA policy conditioned on a goal or demonstration has the potential to share knowledge between tasks if it sees enough diversity of inputs. However, these methods may not generalize to a more complex task at test time. We introduce compositional plan vectors (CPVs) to enable a policy to perform compositions of tasks without additional supervision. CPVs represent trajectories as the sum of the subtasks within them. We show that CPVs can be learned within a one-shot imitation learning framework without any additional supervision or information about task hierarchy, and enable a demonstration-conditioned policy to generalize to tasks that sequence twice as many skills as the tasks seen during training.\n Analogously to embeddings such as word2vec in NLP, CPVs can also support simple arithmetic operations -- for example, we can add the CPVs for two different tasks to command an agent to compose both tasks, without any additional training.", "full_text": "Plan Arithmetic: Compositional Plan Vectors for\n\nMulti-Task Control\n\nColine Devin\n\nDaniel Geng\n\nPieter Abbeel\n\nTrevor Darrell\n\nSergey Levine\n\nUniversity of California, Berkeley\n\nAbstract\n\nAutonomous agents situated in real-world environments must be able to master\nlarge repertoires of skills. 
While a single short skill can be learned quickly, it\nwould be impractical to learn every task independently. Instead, the agent should\nshare knowledge across behaviors such that each task can be learned ef\ufb01ciently,\nand such that the resulting model can generalize to new tasks, especially ones\nthat are compositions or subsets of tasks seen previously. A policy conditioned\non a goal or demonstration has the potential to share knowledge between tasks if\nit sees enough diversity of inputs. However, these methods may not generalize\nto a more complex task at test time. We introduce compositional plan vectors\n(CPVs) to enable a policy to perform compositions of tasks without additional\nsupervision. CPVs represent trajectories as the sum of the subtasks within them.\nWe show that CPVs can be learned within a one-shot imitation learning framework\nwithout any additional supervision or information about task hierarchy, and enable\na demonstration-conditioned policy to generalize to tasks that sequence twice as\nmany skills as the tasks seen during training. Analogously to embeddings such\nas word2vec in NLP, CPVs can also support simple arithmetic operations \u2013 for\nexample, we can add the CPVs for two different tasks to command an agent to\ncompose both tasks, without any additional training.\n\n1\n\nIntroduction\n\nA major challenge in current machine learning is to not only interpolate within the distribution of\ninputs seen during training, but also to generalize to a wider distribution. While we cannot expect\narbitrary generalization, models should be able to compose concepts seen during training into new\ncombinations. With deep learning, agents learn high level representations of the data they perceive. 
If\nthe data is drawn from a compositional environment, then agents that model the data accurately and\nef\ufb01ciently would represent that compositionality without needing speci\ufb01c priors or regularization.\nIn fact, prior work has shown that compositional representations can emerge automatically from\nsimple objectives, most notably a highly structured distribution such as language. These techniques\ndo not explicitly train for compositionality, but employ simple structural constraints that lead to\ncompositional representations. For example, Mikolov et al. found that a language model trained to\npredict nearby words represented words in a vector space that supported arithmetic analogies: \u201cking\u201d\n-\u201cman\" + \u201cwoman\" = \u201cqueen\" [29]. In this work, we aim to learn a compositional feature space to\nrepresent robotic skills, such that the addition of multiple skills results in a plan to accomplish all of\nthese skills.\nMany tasks can be expressed as compositions of skills, where the same set of skills is shared across\nmany tasks. For example, assembling a chair may require the subtask of picking up a hammer,\nwhich is also found in the table assembly task. We posit that a task representation that leverages this\ncompositional structure can generalize more easily to more complex tasks. 
We propose learning an embedding space such that tasks can be composed simply by adding their respective embeddings. This idea is illustrated in Figure 2.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Compositional plan vectors embed tasks into a space where adding two vectors represents the composition of the tasks, and subtracting a sub-task leaves an embedding of the remaining sub-tasks needed for the task.

In order to learn these representations without additional supervision, we cannot depend on known segmentation of the trajectories into subtasks, or on labels about which subtasks are shared between different tasks. Instead, we incorporate compositionality directly into the architecture of the policy. Rather than conditioning the policy on the static embedding of the reference demonstration, we condition the policy on the difference between the embedding of the whole reference trajectory and the embedding of the partially completed trajectory that the policy is outputting an action for.

The main contributions of our work are the compositional plan vector (CPV) representation and a policy architecture that enables learning of CPVs without any sub-task level supervision. CPVs enable policies to generalize to significantly longer tasks, and they can be added together to represent a composition of tasks. We evaluate CPVs in the one-shot imitation learning paradigm [11, 12, 19] on a discrete-action environment inspired by Minecraft, where tools must be picked up to remove or build objects, as well as on a 3D simulated pick-and-place environment.

2 Related Work

For many types of high dimensional inputs, Euclidean distances are often meaningless in the raw input space.
Words represented as one-hot vectors are equally distant from all other words, and\nimages of the same scene may have entirely different pixel values if the viewpoint is shifted slightly.\nThis has motivated learning representations of language and images that respect desirable properties.\nChopra et al. [3] showed that a simple contrastive loss can be used to learn face embeddings. A\nsimilar method was also used on image patches to learn general image features [35]. Word2vec\nfound that word representations trained to be predictive of their neighbor words support some level\nof addition and subtraction [29, 24, 21]. More recently, Nagarajan used a contrastive approach in\nlearning decomposable embeddings of images by representing objects as vectors and attributes as\ntransformations of those vectors [30]. These methods motivate our goal of learning an embedding\nspace over tasks that supports transformations such as addition and subtraction. Notably, these\nmethods don\u2019t rely on explicit regularization for arithmetic operations, but rather use a simple\nobjective combined with the right model structure to allow a compositional representation to emerge.\nOur method also uses a simple end-to-end policy learning objective, combined with a structural\nconstraint that leads to compositionality.\nHierarchical RL algorithms learn representations of sub-tasks explicitly, by using primitives or\ngoal-conditioning [4, 32, 26, 13, 27, 7, 2, 39], or by combining multiple Q-functions [15, 36]. Our\napproach does not learn explicit primitives or skills, but instead aims to summarize the task via a\ncompositional task embedding. A number of prior works have also sought to learn policies that are\nconditioned on a goal or task [22, 8, 23, 38, 5, 14, 18, 28, 6, 33], but without explicitly considering\ncompositionality. Recent imitation learning methods have learned to predict latent intentions of\ndemonstrations [16, 25]. 
In the one-shot imitation learning paradigm, the policy is conditioned on reference demonstrations at both test and train time. This problem has been explored with meta-learning [12] and metric learning [19] for short reaching and pushing tasks. Duan et al. used attention over the reference trajectory to perform block stacking tasks [11]. Our work differs in that we aim to generalize to new compositions of tasks that are out of the distribution of tasks seen during training. Hausman et al. obtain generalization to new compositions of skills by training a generative model over skills [17]. However, unlike our method, this approach does not easily allow for sequencing skills into longer-horizon tasks or composing tasks via arithmetic operations on the latent representation. Prior methods have learned composable task representations by using ground-truth knowledge about the task hierarchy. Neural task programming and the neural subtask graph solver generalize to new tasks by decomposing a demonstration into a hierarchical program for the task, but require ground-truth hierarchical decomposition during training [40, 37]. Using supervision about the relations between tasks, prior approaches have used analogy-based objectives to learn task representations that decompose across objects and actions [31], or have set up modular architectures over subtasks [1] or environments [9]. Unlike our approach, these methods require labels about relationships. We implicitly learn to decompose tasks without supervising the task hierarchy.

Figure 2: By adding the CPVs for two different tasks, we obtain the CPV for the composition of the tasks. To determine what steps are left in the task, the policy subtracts the embedding of its current trajectory from the reference CPV.

3 Compositional Plan Vectors

In this paper, we introduce compositional plan vectors (CPVs).
The goal of CPVs is to obtain policies\nthat generalize to new compositions of skills without requiring skills to be labeled and without\nknowing the list of skills that may be required. Consider a task named \u201cred-out-yellow-in\u201d which\ninvolves taking a red cube out of a box and placing a yellow cube into the box. A plan vector encodes\nthe task as the sum of its parts: a plan vector for taking the red cube out of the box plus the vector for\nputting the yellow cube into the box should equal the plan vector for the full task. Equivalently, the\nplan vector for the full task minus the vector for taking the red cube out of the box should equal the\nvector that encodes \u201cput yellow cube in box.\u201d\nIf the list of all possible skills was known ahead of time, separate policies could be learned for each\nskill, and then the policies could be used in sequence. However, this knowledge is often unavailable\nin general and limits compositionality to a \ufb01xed set of skills. Instead, our goal is to formulate an\narchitecture and regularization that are compositional by design and do not need additional supervision.\nWith our method, CPVs acquire compositional structure because of the structural constraint they place\non the policy. To derive the simplest possible structural constraint, we observe that the minimum\ninformation that the policy needs about the task in order to complete it is knowledge of the steps\nthat have not yet been done. That is, in the cube example above, after taking out the red cube, only\nthe \u201cyellow-in\u201d portion of the task is needed by the policy. One property of this representation is\nthat task ordering cannot be represented by the CPV because addition is commutative. 
If ordering is necessary to choose the right action, the policy will have to learn to decode which component of the compositional plan vector must be done first.

As an example, let v be a plan vector for the “red-out-yellow-in” task. To execute the task, a policy π(o_0, v) outputs an action for the first observation o_0. After some number t of timesteps, the policy has successfully removed the red cube from the box. This partial trajectory O_{0:t} can be embedded into a plan vector u, which should encode the “red-out” task. We would like (v − u) to encode the remaining portion of the task, in this case placing the yellow block into the box. In other words, π(o_t, v − u) should take the action that leads to accomplishing the plan described by v given that u has already been accomplished. In order for both v and v − u to encode the yellow-in task, u must not encode the yellow-in task as strongly as v does. If v is equal to the sum of the vectors for “red-out” and “yellow-in,” then v may not encode the ordering of the tasks. However, the policy π(o_0, v) should have learned that the box must be empty in order to perform the yellow-in task, and that therefore it should perform the red-out task first.

We posit that this structure can be learned without supervision at the subtask level. Instead, we impose simple architectural and arithmetic constraints on the policy: the policy must choose its action based on the arithmetic difference between the plan vector embedding of the whole task and the plan vector embedding of the trajectory completed so far.
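To make the intended arithmetic concrete, here is a toy sketch (ours, not the paper's learned model) in which each subtask is assigned a fixed random vector and a task embedding is simply the sum of its subtasks' vectors; the names `red-out` and `yellow-in` mirror the running example:

```python
import numpy as np

# Toy model of the intended CPV structure: assign each subtask an arbitrary
# fixed vector and represent a task as the sum of its parts.
# (Illustrative only -- in the paper these embeddings are learned end-to-end.)
rng = np.random.default_rng(0)
subtask_vecs = {name: rng.normal(size=8) for name in ["red-out", "yellow-in"]}

def task_embedding(subtasks):
    """Embed a task as the sum of its subtask vectors."""
    return sum(subtask_vecs[s] for s in subtasks)

v = task_embedding(["red-out", "yellow-in"])  # full task
u = task_embedding(["red-out"])               # portion already completed

# v - u recovers the embedding of the remaining subtask, so a policy
# conditioned on (v - u) only "sees" yellow-in.
assert np.allclose(v - u, subtask_vecs["yellow-in"])

# Addition is commutative, so ordering is not represented:
assert np.allclose(task_embedding(["red-out", "yellow-in"]),
                   task_embedding(["yellow-in", "red-out"]))
```

The commutativity check at the end is exactly the ordering limitation discussed above: any ordering information must be recovered by the policy, not by the vector.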
Additionally, the plan vectors of two halves of the same trajectory should add up to the plan vector of the whole trajectory, which we can write down as a regularization objective for the embedding function. By training the policy and the embedding function together to optimize their objectives, we obtain an embedding of tasks that supports compositionality and generalizes to more complex tasks. In principle, CPVs can be used with any end-to-end policy learning objective, including behavioral cloning, reinforcement learning, or inverse reinforcement learning. In this work, we validate CPVs in a one-shot imitation learning setting.

One-shot imitation learning setup. In one-shot imitation learning, the agent must perform a task conditioned on one reference example of the task. For example, given a demonstration of how to fold a paper crane, the agent would need to fold a paper crane. During training, the agent is provided with pairs of demonstrations, and learns a policy by predicting the actions in one trajectory using the second as a reference. In the origami example, the agent may have trained on demonstrations of folding paper into a variety of different creatures.

We consider the one-shot imitation learning scenario where an agent is given a reference trajectory in the form of a list of T observations O^ref_{0:T} = (o^ref_0, ..., o^ref_T). The agent starts with o_0 ∼ p(o_0), where o_0 may be different from o^ref_0. At each timestep t, the agent performs an action drawn from π(a_t | O_{0:t}, O^ref_{0:T}).

Plan vectors. We define a function g_φ(O_{k:l}), parameterized by φ, which takes in a trajectory and outputs a plan vector. The plan vector of a reference trajectory g_φ(O^ref_{0:T}) should encode the sequence of steps required to accomplish the goal. Similarly, the plan vector of a partially accomplished trajectory g_φ(O_{0:t}) should encode the steps already taken.
We can therefore consider the subtraction of these vectors to encode the steps necessary to complete the task defined by the reference trajectory. Thus, the policy can be structured as

π_θ(a_t | o_t, g_φ(O^ref_{0:T}) − g_φ(O_{0:t})),   (1)

a function parameterized by θ that takes in just the trajectory's endpoints instead of considering the full reference trajectory, and is learned end-to-end.

In this work we use a fully observable state space and only consider tasks that cause a change in the state. For example, we do not consider tasks such as lifting a block and placing it exactly where it was, because this does not result in a useful change to the state. Thus, instead of embedding a whole trajectory O_{0:t}, we limit g to only look at the first and last state of the trajectory we wish to embed. Then, π becomes

π_θ(a_t | o_t, g(o^ref_0, o^ref_T) − g(o_0, o_t)).   (2)

Training. With π defined as above, we learn the parameters of the policy with imitation learning. A dataset D containing N demonstrations paired with reference trajectories is collected. Each trajectory may be a different arbitrary length, and the tasks performed by each pair of trajectories are unlabeled. The demonstrations include actions, but the reference trajectories do not. In our settings, the reference trajectories only need to include their first and last states. Formally,

D = {(O^{ref_i}_{0:T^i}, O^i_{0:H^i}, A^i_{0:H^i−1})}^N_{i=1},

Figure 3: Illustrations of the 5 skills in the GridCraft environment. To ChopTree, the agent must pick up the axe and bring it to the tree, which transforms the tree into logs. To BuildHouse, the agent picks up the hammer and brings it to logs to transform them into a house. To MakeBread, the agent brings the axe to the wheat which transforms it into bread. The agent eats bread if it lands on a state that contains bread.
To BreakRock, the agent picks up a hammer and destroys the rock.

where T^i is the length of the ith reference trajectory and H^i is the length of the ith demonstration. Given the policy architecture defined in Equation 1, the behavioral cloning loss for a discrete-action policy is

L_IL(D, θ, φ) = Σ_{i=0}^{N} Σ_{t=0}^{H^i} −log(π_θ(a^i_t | o^i_t, g_φ(o^{ref_i}_0, o^{ref_i}_T) − g_φ(o^i_0, o^i_t))).

We also introduce a regularization loss function to improve compositionality by enforcing that the sum of the embeddings of two parts of a trajectory is close to the embedding of the full trajectory. We denote this a homomorphism loss L_Hom because it constrains the embedding function g to preserve a mapping between concatenation of trajectories and addition of real-valued vectors. We implement the loss using the triplet margin loss from [34] with a margin equal to 1:

l_tri(a, p, n) = max{||a − p||_2 − ||a − n||_2 + 1.0, 0}

L_Hom(D, φ) = Σ_{i=0}^{N} Σ_{t=0}^{H^i} l_tri(g_φ(o^i_0, o^i_t) + g_φ(o^i_t, o^i_T), g_φ(o^i_0, o^i_T), g_φ(o^j_0, o^j_T))

Finally, we follow James et al. in regularizing embeddings of paired trajectories to be close in embedding space, which has been shown to improve performance on new examples [19]. This “pair” loss L_Pair pushes the embedding of a demonstration to be similar to the embedding of its reference trajectory and different from other embeddings, which enforces that embeddings are a function of the behavior within a trajectory rather than the appearance of a state:

L_Pair(D, φ) = Σ_{i=0}^{N} Σ_{t=0}^{H^i} l_tri(g_φ(o^i_0, o^i_T), g_φ(o^{ref_i}_0, o^{ref_i}_T), g_φ(o^{ref_j}_0, o^{ref_j}_T))

for any j ≠ i. We empirically evaluate how these losses affect the composability of learned embeddings.
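As an illustration, the triplet loss and one term of the homomorphism objective can be sketched in NumPy (a minimal re-implementation for intuition; the toy embedding `g` below is hypothetical and chosen to be exactly additive, so the term vanishes):

```python
import numpy as np

def l_tri(a, p, n, margin=1.0):
    """Triplet margin loss: pull anchor a toward positive p, away from negative n."""
    return max(np.linalg.norm(a - p) - np.linalg.norm(a - n) + margin, 0.0)

def homomorphism_term(g, traj_i, traj_j, t):
    """One term of L_Hom: the embeddings of the two halves of trajectory i
    should sum to the embedding of the whole, with trajectory j as the negative."""
    anchor = g(traj_i[0], traj_i[t]) + g(traj_i[t], traj_i[-1])  # sum of halves
    positive = g(traj_i[0], traj_i[-1])                          # whole trajectory
    negative = g(traj_j[0], traj_j[-1])                          # other trajectory
    return l_tri(anchor, positive, negative)

# Toy endpoint embedding g(first, last) = last - first. This g is perfectly
# additive over concatenated trajectories, so the homomorphism term is driven
# to zero (provided the negative is at least the margin away).
g = lambda a, b: b - a
traj_i = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])
traj_j = np.array([[5.0, 5.0], [9.0, 9.0]])
print(homomorphism_term(g, traj_i, traj_j, t=1))  # 0.0
```

The learned g_φ is, of course, a neural network rather than this closed form; the loss pressures it toward the same additive behavior.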
While L_Pair leverages the supervision from the reference trajectories, L_Hom is entirely self-supervised.

Measuring compositionality. To evaluate whether the representation learned by g is compositional, we condition the policy on the sum of plan vectors from multiple tasks and measure the policy's success rate. Given two reference trajectories O^{ref_i}_{0:T^i} and O^{ref_j}_{0:T^j}, we condition the policy on g_φ(o^{ref_i}_0, o^{ref_i}_{T^i}) + g_φ(o^{ref_j}_0, o^{ref_j}_{T^j}). The policy is successful if it accomplishes both tasks. We also evaluate whether the representation generalizes to more complex tasks.

Figure 4: Two example skills from the pick and place environment. Time evolves from left to right. If the relevant objects are in the box, the agent must first remove the lid to interact with the object and also return the lid to the box in order to complete a task.

4 Sequential Multitask Environments

We introduce two new learning environments, shown in Figures 3 and 4, that test an agent's ability to perform tasks that require different sequences and different numbers of sub-skills. We designed these environments such that the actions change the environment and make new sub-goals possible: in the 3D environment, opening a box and removing its contents makes it possible to put something else into the box. In the crafting environment, chopping down a tree makes it possible to build a house. Along with the environments, we will release code to generate demonstrations of the compositional tasks.

4.1 Crafting Environment

The first evaluation domain is a discrete-action world where objects can be picked up and modified using tools. The environment contains 7 types of objects: tree, rock, logs, wheat, bread, hammer, axe. Logs, hammers, and axes can be picked up, and trees and rocks block the agent. The environment allows for 6 actions: up, down, left, right, pickup, and drop.
The transitions are deterministic, and only one object can be held at a time. Pickup has no effect unless the agent is at the same position as a pickupable object. Drop has no effect unless an object is currently held. When an object is held, it moves with the agent. Unlike the Malmo environment, which runs a full game engine [20], this environment can be easily modified to add new object types and interaction rules. We define 5 skills within the environment: ChopTree, BreakRock, BuildHouse, MakeBread, and EatBread, as illustrated in Figure 3. A task is defined by a list of skills. For example, a task with 3 skills could be [ChopTree, ChopTree, MakeBread]. Thus, considering tasks that use between 1 and 4 skills with replacement, there are 125 distinct tasks and 780 total orderings. Unlike in Andreas et al. [1] and Oh et al. [31], skill list labels are only used for data generation and evaluation; they are not used for training and are not provided to the model. The quantities and positions of each object are randomly selected at each reset. The observation space is a top-down image view of the environment, as shown in Figure 6a.

4.2 3D Pick and Place Environment

The second domain is a 3D simulated environment where a robot arm can pick up and drop objects. Four cubes of different colors, as well as a box with a lid, are randomly placed within the workspace. The robot's action space is a continuous 4-dimensional vector: an (x, y) position at which to close the gripper and an (x, y) position at which to open the gripper. The z coordinate of the grasp is chosen automatically. The observation space is a concatenation of the (x, y, z) positions of each of the 4 objects, the box, and the box lid. We define 3 families of skills within the environment: PlaceInCorner, Stack, and PlaceInBox, each of which can be applied to different objects or pairs of objects.
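Returning briefly to the crafting environment, the task counts quoted above can be reproduced with a short combinatorial check, counting multisets (distinct tasks) and sequences (orderings) of 1 to 4 skills over the 5 skills:

```python
from math import comb

n_skills = 5  # ChopTree, BreakRock, BuildHouse, MakeBread, EatBread

# Distinct tasks: multisets of k skills drawn with replacement ("stars and bars").
distinct_tasks = sum(comb(n_skills + k - 1, k) for k in range(1, 5))

# Total orderings: ordered sequences of k skills.
total_orderings = sum(n_skills ** k for k in range(1, 5))

print(distinct_tasks, total_orderings)  # 125 780
```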
Considering tasks that use 1 to 2 skills, there are 420 different tasks. An example of each skill is shown in Figure 4.

5 Experiments

Our experiments aim to understand how well CPVs can learn tasks of varying complexity, how well they can generalize to tasks that are more complex than those seen during training (thus demonstrating compositionality), and how well they can handle additive composition of tasks, where the policy is expected to perform both of the tasks in sequence. We hypothesize that, by conditioning a policy on the subtraction of the current progress from the goal task embedding, we will learn a task representation that encodes tasks as the sum of their component subtasks. We additionally evaluate how regularizing objectives improve generalization and compositionality.

Figure 5: The network architecture used for the crafting environment. Orange denotes convolutional layers and dark blue denotes fully connected layers. The trajectories (o^ref_0, o^ref_T) and (o_0, o_t) are each passed through g (the pale green box) independently, but with shared weights. The current observation o_t is processed through a separate convolutional network before being concatenated with the vector g(o^ref_0, o^ref_T) − g(o_0, o_t).

Implementation. We implement g_φ and π_θ as neural networks. For the crafting environment, where the observations are RGB images, we use the convolutional architecture in Figure 5. The encoder g outputs a 512-dimensional CPV. The policy, shaded in red, takes the subtraction of CPVs concatenated with features from the current observation and outputs a discrete classification over actions.

For the 3D environment, the observation is a state vector containing the positions of each object in the scene, including the box and box lid.
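Structurally, the policy just described can be sketched as follows. This is only a schematic: tiny random MLPs stand in for the convolutional encoder and policy head of Figure 5, and the class name and layer sizes are ours, not the paper's:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class CPVPolicySketch:
    """Schematic of the CPV policy structure (Eq. 2): an embedding g maps a
    (first, last) observation pair to a plan vector, and the policy acts on the
    current observation plus the difference of two plan vectors."""

    def __init__(self, obs_dim, cpv_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W_g = rng.normal(scale=0.1, size=(2 * obs_dim, cpv_dim))
        self.W_pi = rng.normal(scale=0.1, size=(obs_dim + cpv_dim, n_actions))

    def g(self, o_first, o_last):
        # Plan-vector embedding of a trajectory's endpoints.
        return relu(np.concatenate([o_first, o_last]) @ self.W_g)

    def act_logits(self, o_t, o_0, oref_0, oref_T):
        # Condition on what remains: g(reference endpoints) - g(progress so far).
        remaining = self.g(oref_0, oref_T) - self.g(o_0, o_t)
        return np.concatenate([o_t, remaining]) @ self.W_pi

policy = CPVPolicySketch(obs_dim=4, cpv_dim=8, n_actions=6)
obs = np.ones(4)
logits = policy.act_logits(obs, np.zeros(4), np.zeros(4), np.ones(4))
print(logits.shape)  # (6,)
```

The key structural point is that the policy head never sees the reference embedding directly, only the difference of the two plan vectors.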
The function g again concatenates the inputs, but here the network is fully connected, and the current observation is directly concatenated to the subtraction of the CPVs. To improve the performance of all models and comparisons, we use an object-centric policy inspired by Devin et al. [10], where the policy outputs a softmaxed weighting over the objects in the state. The position of the most attended object is output as the first coordinates of the action (where to grasp). The object attention as well as the features from the observation and CPVs are passed to another fully connected layer to output the position for placing.

Data generation. For the crafting environment, we train all models on a dataset containing 40k pairs of demonstrations, where each pair performs the same task. The demonstration pairs are not labeled with the task they are performing. The tasks are randomly generated by sampling 2-4 skills with replacement from the five skills listed previously. A planning algorithm is used to generate demonstration trajectories. For the 3D environment, we collect 180k trajectories of tasks with 1 and 2 skills. All models are trained on this dataset to predict actions from the environment observations shown in Figure 6a. For both environments, we added 10% noise to the planner's actions but discarded any trajectories that were unsuccessful. The data is divided into training and validation sets with a 90/10 split. To evaluate the models, reference trajectories were either regenerated or pulled from the validation set. Compositions of trajectories were never used in training or validation.

Comparisons. We compare our CPV model to several one-shot imitation learning models. All models are based on Equation 2, where the policy is a function of four images: o_0, o_t, o^ref_0, and o^ref_T.
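The object-centric action head described above might be sketched as follows (illustrative only; the per-object scores here are placeholders for logits the real network computes from state and CPV features):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def object_centric_grasp(object_xyz, scores):
    """Sketch of the object-centric head: a softmax attention over objects
    selects where to grasp, and the (x, y) of the most-attended object becomes
    the first half of the 4-dimensional action."""
    attn = softmax(scores)
    grasp_xy = object_xyz[np.argmax(attn), :2]
    return attn, grasp_xy

objects = np.array([[0.1, 0.2, 0.0],   # cube 1
                    [0.5, 0.4, 0.0],   # cube 2
                    [0.9, 0.1, 0.3]])  # box lid
attn, grasp_xy = object_centric_grasp(objects, scores=np.array([0.1, 2.0, -1.0]))
print(grasp_xy)  # [0.5 0.4]
```

Tying the grasp location to an attended object, rather than regressing free-form coordinates, is what makes the head "object-centric."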
The naïve baseline simply concatenates the four inputs as input to a neural network policy. The TECNets baseline is an implementation of task embedding control networks from [19], where the embeddings are normalized to a unit ball and a margin loss is applied over the cosine distance to push together embeddings of the same task. The policy in TECNets is conditioned on the static reference embedding rather than the subtraction of two embeddings. For both TECNets and our model, g is applied to the concatenation of the two input observations.

We perform several ablations of our model, which comprises the CPV architecture (including the embedding subtraction as input to the policy), the homomorphism regularization, and the pair regularization. We compare the plain version of our model, where the objective is purely imitation learning, to versions that use the regularizations. CPV-Plain uses no regularization, CPV-Pair uses only L_Pair, CPV-Hom uses only L_Hom, and CPV-Full uses both. To ablate the effect of the architecture vs. the regularizations, we run the same set of comparisons for a model denoted TE (task embeddings), which has the same architecture as TECNets without normalizing embeddings. These experiments test whether the regularization losses produce compositionality on their own, or whether they work in conjunction with the CPV architecture.

Table 1: Evaluation of generalization and compositionality in the crafting environment. Policies were trained on tasks using between 1 and 4 skills. We evaluate the policies conditioned on reference trajectories that use 4, 8, and 16 skills. We also evaluate the policies on the composition of skills: “2,2” means that the embeddings of two demonstrations that each use 2 skills were added together, and the policy was conditioned on this sum. For the naïve model, we instead average the observations of the references, which performed somewhat better. All models are variants on the architecture in Figure 5. The max horizon is three times the average number of steps used by the expert for that length of task: 160, 280, and 550, respectively. Numbers are all success percentages.

MODEL | 4 SKILLS | 8 SKILLS | 16 SKILLS | 1,1 | 2,2 | 4,4
NAIVE | 29 ± 2 | 9 ± 2 | 7 ± 2 | 29 ± 10 | 24 ± 5 | 5 ± 2
TECNET | 49 ± 11 | 17 ± 7 | 7 ± 6 | 59 ± 11 | 43 ± 11 | 29 ± 23
TE | 53 ± 4 | 28 ± 2 | 25 ± 20 | 32 ± 1 | 44 ± 25 | 18 ± 12
TE-PAIR | 64 ± 1 | 31 ± 1 | 18 ± 2 | 55 ± 3 | 53 ± 8 | 21 ± 2
TE-HOM | 50 ± 4 | 27 ± 2 | 21 ± 1 | 51 ± 1 | 52 ± 1 | 20 ± 1
TE-FULL | 61 ± 8 | 28 ± 8 | 13 ± 2 | 60 ± 1 | 47 ± 7 | 23 ± 7
CPV-NAIVE | 51 ± 8 | 19 ± 5 | 9 ± 2 | 31 ± 16 | 30 ± 15 | 5 ± 2
CPV-PAIR | 68 ± 11 | 44 ± 14 | 31 ± 13 | 2 ± 3 | 1 ± 2 | 0 ± 0
CPV-HOM | 63 ± 3 | 35 ± 5 | 27 ± 8 | 71 ± 8 | 60 ± 11 | 26 ± 14
CPV-FULL | 73 ± 2 | 40 ± 3 | 28 ± 6 | 76 ± 3 | 64 ± 6 | 30 ± 10

Table 2: 3D Pick and Place Results. Each model was trained on tasks with 1 to 2 skills. We evaluate the models on tasks with 1 and 2 skills, as well as the compositions of two 1-skill tasks. For each model we list the success rate of the best epoch of training. All numbers are averaged over 100 tasks. All models are variants of the object-centric architecture, shown in the supplement.
We find that the CPV architecture plus regularizations enable composing two reference trajectories better than the other methods.

MODEL | 1 SKILL | 2 SKILLS | 1,1
NAIVE | 65 ± 7 | 34 ± 8 | 6 ± 2
TECNET | 82 ± 6 | 50 ± 2 | 33 ± 4
TE-PLAIN | 91 ± 2 | 55 ± 5 | 22 ± 2
TE-PAIR | 81 ± 11 | 51 ± 8 | 15 ± 3
TE-HOM | 92 ± 1 | 59 ± 1 | 24 ± 12
TE-FULL | 88 ± 2 | 55 ± 8 | 9 ± 6
CPV-PLAIN | 87 ± 2 | 55 ± 2 | 52 ± 2
CPV-PAIR | 82 ± 4 | 42 ± 3 | 7 ± 1
CPV-HOM | 88 ± 1 | 54 ± 5 | 55 ± 4
CPV-FULL | 87 ± 4 | 54 ± 4 | 56 ± 6

Results. We evaluate the methods on both domains. To be considered successful in the crafting environment, the agent must perform the same sub-skills with the same types of objects as those seen in the reference trajectory. The results on the crafting environment are shown in Table 1, where we report the mean and standard deviation across 3 independent training seeds. We see that both the naïve model and the TECNet model struggle to represent these complex tasks, even the 4-skill tasks that are in the training distribution. We also find that both the CPV architecture and the regularization losses are necessary for generalizing to longer tasks and for composing multiple tasks. The pair loss seems to help mostly with generalization, while the homomorphism loss helps more with compositionality. CPVs are able to generalize to 8 and 16 skills, despite being trained on combinations of at most 4 skills, and achieve 76% success at composing two tasks just by adding their embedding vectors. Recall that CPVs are not explicitly trained to compose multiple reference trajectories in this way – the compositionality is an extrapolation from the training.
The TE ablation, which does not use the subtraction of embeddings as input to the policy, shows worse compositionality than our method even with the homomorphism loss. This supports our hypothesis that structural constraints on the embedding representation contribute significantly to the learning.

These trends continue in the pick-and-place environment in Table 2, where we report the mean and standard deviation across 3 independent training seeds. In this environment, a trajectory is successful if the objects that were moved in the reference trajectory are in the correct positions: placed in each corner, placed inside the box, or stacked on top of a specific cube. As expected, TECNet performs well on 1-skill tasks, which only require moving a single object. TECNet and the naïve model fail to compose tasks, but the CPV model performs as well at composing two 1-skill tasks as it does when imitating 2-skill tasks directly. As before, the TE ablation fails to compose as well as CPV, indicating that the architecture and losses together are needed to learn composable embeddings.

6 Discussion

Many tasks can be understood as a composition of multiple subtasks. To take advantage of this latent structure without subtask labels, we introduce the compositional plan vector (CPV) architecture along with a homomorphism-preserving loss function, and show that this learns a compositional representation of tasks. Our method learns a task representation and multi-task policy jointly. Our main idea is to condition the policy on the arithmetic difference between the embedding of the goal task and the embedding of the trajectory seen so far. This constraint ensures that the representation space is structured such that subtracting the embedding of a partial trajectory from the embedding of the full trajectory encodes the portion of the task that remains to be completed.
Put another way, CPVs encode tasks as the set of subtasks that the agent has left to perform to complete the full task. CPVs enable policies to generalize to tasks twice as long as those seen during training, and two plan vectors can be added together to form a new plan for performing both tasks.
We evaluated CPVs in a one-shot imitation learning setting. Extending our approach to a reinforcement learning setting is a natural next step, as are further improvements to the architecture's efficiency. A particularly promising future direction would be to enable CPVs to learn from unstructured, self-supervised data, reducing the dependence on hand-specified objectives and reward functions.

7 Acknowledgements

We thank Kate Rakelly for insightful discussions and Hexiang Hu for writing the initial version of the 3D simulated environment. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814.

A Network Architectures

A.1 Crafting environment

The observation is an RGB image of 33x30 pixels. The architecture for g concatenates the first and last images of the reference trajectory along the channel dimension, to obtain an input size of 33x30x6. This is followed by 4 convolutions with 16, 32, 64, and 64 channels, respectively, with ReLU activations. The 3x3x64 output is flattened, and a fully connected layer reduces it to the desired embedding dimension. The same architecture is used for the TECNet encoder. For the policy, the observation is passed through a convolutional network with the same architecture as above, and the output is concatenated with the subtraction of embeddings as defined in the paper's method. This concatenation is passed through a 4-layer fully connected network with 64 hidden units per layer and ReLU activations. The output is softmaxed to produce a distribution over the 6 actions.
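A shape-level sketch of this policy head may help. Random vectors stand in for the convolutional features and the learned weights; the layer widths follow the description above, but everything else (names, initialization scale) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 512               # embedding dimension used for crafting (Appendix B)
CONV_FEAT = 3 * 3 * 64  # flattened output of the 4-conv encoder

def mlp(x, widths):
    # 4-layer fully connected net with ReLU activations; random weights
    # stand in for the trained parameters.
    for i, n in enumerate(widths):
        x = x @ (rng.standard_normal((x.shape[-1], n)) * 0.05)
        if i < len(widths) - 1:
            x = np.maximum(x, 0.0)
    return x

conv_obs = rng.standard_normal(CONV_FEAT)  # conv features of the observation
v_ref = rng.standard_normal(EMB)           # g(first, last frame of reference)
v_so_far = rng.standard_normal(EMB)        # g(first frame, current frame)

# Policy input: observation features concatenated with the embedding difference.
policy_in = np.concatenate([conv_obs, v_ref - v_so_far])
logits = mlp(policy_in, [64, 64, 64, 6])
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # distribution over the 6 actions
```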
The TECNet uses the same architecture, but the reference trajectory embeddings are normalized and there is no subtraction; instead, the initial image of the current trajectory is concatenated with the observation. The naive model uses the same architecture, but all four input images are concatenated for the initial convolutional network and there is no concatenation at the embedding level.

Figure 6: The crafting environment. (a) Shows a state observation as rendered for the agent. The white square in the bottom left indicates that an object is held by the agent. (b) Shows the same state, but rendered in a human-readable format. The axe shown in the last row indicates that the agent is currently holding an axe.

A.2 3D environment

The environment has 6 objects: 4 cubes (red, blue, green, white), a box body, and a box lid. The state space is the concatenation of the (x, y, z) positions of these objects, resulting in an 18-dimensional state. As the object positions are known, we use an attention over the objects as part of the action, as shown in Figure 7. The actions are 2 positions: the (x0, y0) position at which to grasp and the (x1, y1) position at which to place. When training the policy using the object-centric model, (x0, y0) is a weighted sum of the object positions, with the z coordinate being ignored. Weights over the 6 objects are output by a neural network given the difference of CPVs and the current observation. At evaluation time, (x0, y0) is the arg max object position. This means that all policies will always grasp at an object position. For (x1, y1), we do not have the same constraint. Instead, the softmaxed weights over the objects are concatenated with the previous layer's activations, and another fully connected layer maps this directly to continuous-valued (x1, y1). This means that the policy can place at any position in the workspace.
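The object-centric action head can be sketched as follows. Random weights stand in for the learned layers and the layer widths are hypothetical; the sketch shows only the attention mechanism over objects described above.

```python
import numpy as np

rng = np.random.default_rng(0)
obj_xyz = rng.uniform(0.0, 1.0, size=(6, 3))  # positions of the 6 objects
state = obj_xyz.ravel()                       # 18-dimensional observation

# Hypothetical layers; random weights stand in for trained parameters.
hidden = np.tanh(state @ rng.standard_normal((18, 32)))
logits = hidden @ rng.standard_normal((32, 6))
w = np.exp(logits - logits.max())
w /= w.sum()                                  # softmaxed attention over objects

# Training-time grasp point: attention-weighted sum of object (x, y) positions.
grasp_xy_train = w @ obj_xyz[:, :2]
# Evaluation-time grasp point: position of the arg max object.
grasp_xy_eval = obj_xyz[np.argmax(w), :2]

# Place point: the attention weights are concatenated with the hidden
# activations and mapped to an unconstrained continuous (x, y).
place_xy = np.concatenate([w, hidden]) @ rng.standard_normal((38, 2))
```

Because the training-time grasp point is a convex combination of object positions, it always lies within the objects' bounding region, while the place point is free to land anywhere in the workspace.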
The naïve model, TECNet model, and CPV models all use this object-centric policy, and differ only in the input to the policy.

Figure 7: The object-centric network architecture we use for the 3D grasping environment. Because the observations include the concatenated positions of the objects in the scene, the policy chooses a grasp position by predicting a discrete classification over the objects, grasping at the weighted sum of the object positions. The classification logits are passed back to the network to output the position at which to place the object.

B Hyperparameters

We compared all models across embedding dimension sizes of 64, 128, 256, and 512. In the crafting environment, size 512 was best for all methods. In the grasping environment, size 64 was best for all methods. For TECNets, we tested λctr = 1 and 0.1, and found that 0.1 was best. All models were trained on either K80 GPUs or Titan X GPUs.

C Additional Experiments

We ran a pared-down experiment on a ViZDoom environment to show the method working from first-person images, as shown in Figure 8. In this experiment, the skills are reaching 4 different waypoints in the environment. The actions are "turn left," "turn right," and "go forward." The observation space consists of a first-person image observation as well as the (x, y) locations of the waypoints. We evaluate on trajectories that must visit 1 or 2 waypoints (skills), and also evaluate on the compositions of these trajectories. The policies were only trained on trajectories that visit up to 3 waypoints. These evaluations are shown in Table 3.

Figure 8: First-person view in the ViZDoom environment.

Table 3: ViZDoom Navigation Results.
All numbers are success rates of arriving within 1 meter of each waypoint.

MODEL     1 SKILL   2 SKILLS   1+1    2+2
NAIVE     97        94         36.7    2
TECNET    96        95.3       48.3    0
CPV       93        90.7       91     64