{"title": "Synthesized Policies for Transfer and Adaptation across Tasks and Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 1168, "page_last": 1177, "abstract": "The ability to transfer in reinforcement learning is key towards building an agent of general artificial intelligence. In this paper, we consider the problem of learning to simultaneously transfer across both environments and tasks, probably more importantly, by learning from only sparse (environment, task) pairs out of all the possible combinations. We propose a novel compositional neural network architecture which depicts a meta rule for composing policies from environment and task embeddings. Notably, one of the main challenges is to learn the embeddings jointly with the meta rule. We further propose new training methods to disentangle the embeddings, making them both distinctive signatures of the environments and tasks and effective building blocks for composing the policies. Experiments on GridWorld and THOR, of which the agent takes as input an egocentric view, show that our approach gives rise to high success rates on all the (environment, task) pairs after learning from only 40% of them.", "full_text": "Synthesized Policies for Transfer and Adaptation\n\nacross Tasks and Environments\n\nHexiang Hu \u2217\n\nLiyu Chen \u2217\n\nUniversity of Southern California\n\nUniversity of Southern California\n\nLos Angeles, CA 90089\nhexiangh@usc.edu\n\nLos Angeles, CA 90089\n\nliyuc@usc.edu\n\nBoqing Gong\nTencent AI Lab\n\nBellevue, WA 98004\n\nboqinggo@outlook.com\n\nFei Sha \u2020\nNet\ufb02ix\n\nLos Angeles, CA 90028\nfsha@netflix.com\n\nAbstract\n\nThe ability to transfer in reinforcement learning is key towards building an agent\nof general arti\ufb01cial intelligence. In this paper, we consider the problem of learning\nto simultaneously transfer across both environments (\u03b5) and tasks (\u03c4), probably\nmore importantly, by learning from only sparse (\u03b5, \u03c4) pairs out of all the possible\ncombinations. We propose a novel compositional neural network architecture\nwhich depicts a meta rule for composing policies from environment and task\nembeddings. Notably, one of the main challenges is to learn the embeddings jointly\nwith the meta rule. We further propose new training methods to disentangle the\nembeddings, making them both distinctive signatures of the environments and\ntasks and effective building blocks for composing the policies. Experiments on\nGRIDWORLD and THOR, of which the agent takes as input an egocentric view,\nshow that our approach gives rise to high success rates on all the (\u03b5, \u03c4) pairs after\nlearning from only 40% of them.\n\n1\n\nIntroduction\n\nRemarkable progress has been made in reinforcement learning in the last few years [16, 21, 26].\nAmong these, an agent learns to discover its best policy of actions to accomplish a task, by interacting\nwith the environment. However, the skills the agent learns are often tied for a speci\ufb01c pair of\nthe environment (\u03b5) and the task (\u03c4). Consequently, when the environment changes even slightly,\nthe agent\u2019s performance deteriorates drastically [11, 28]. Thus, being able to swiftly adapt to new\nenvironments and transfer skills to new tasks is crucial for the agents to act in real-world settings.\nHow can we achieve swift adaptation and transfer? In this paper, we consider several progressively\ndif\ufb01cult settings. 
In the first setting, the agent needs to adapt and transfer to a new pair of environment and task when it has been exposed to the environment and the task before, but not simultaneously. Our goal is to train the agent on as few seen pairs as possible, i.e., a subset of all possible (ε, τ) combinations that is as sparse as possible.
In the second setting, the agent needs to adapt and transfer across either environments or tasks, to ones previously unseen by the agent. For instance, a home service robot needs to adapt from one home to another while accomplishing essentially the same set of tasks, or it learns new tasks in the same home. In the third setting, the agent has encountered neither the environment nor the task before.

* Equal contribution.
† On leave from University of Southern California (feisha@usc.edu).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: We consider a transfer learning scenario in reinforcement learning that involves transfer in both task and environment. Three different settings are presented here (see text for details). Red dots denote SEEN combinations, gray dots denote UNSEEN combinations, and arrows denote transfer directions.

Intuitively, the second and the third settings are much more challenging than the first one and appear to be intractable. Thus, the agent is allowed a very limited amount of learning data in the target environment and/or task, for instance one demonstration, in order to transfer knowledge from its prior learning.
Figure 1 schematically illustrates the three settings. Several existing approaches have been proposed to address some of these settings [1–3, 14, 17, 24, 25]; for a detailed discussion, see the related work in Section 2. A common strategy behind these works is to learn jointly through multi-task (reinforcement) learning [9, 18, 25]. Despite much progress, however, adaptation and transfer remain challenging in reinforcement learning, where a powerful learning agent easily overfits to the environment or the task it has encountered, leading to poor generalization to new ones [11, 28].
In this paper, we propose a new approach to tackle this challenge. Our main idea is to learn a meta rule to synthesize policies whenever the agent encounters new environments or tasks. Concretely, the meta rule uses the embeddings of the environment and the task to compose a policy, which is parameterized as a linear combination of a policy basis. On the training data from seen pairs of environments and tasks, our algorithm learns the embeddings as well as the policy basis. For new environments or tasks, the agent learns only the corresponding embeddings while holding the policy basis fixed. Since the embeddings are low-dimensional, a limited amount of training data in the new environment or task is often adequate to learn them well and thus compose the desired policy.
While deep reinforcement learning algorithms are capable of memorizing and thus entangling the representations of tasks and environments [28], we propose a disentanglement objective such that the embeddings for the tasks and the environments can be extracted to maximize the efficacy of the synthesized policy. 
Empirical studies demonstrate the importance of disentangling the representations.
We evaluate our approach on GRIDWORLD, which we have created, and on the photo-realistic robotic environment THOR [13]. We compare to several leading methods for transfer learning in a significant number of settings. The proposed approach noticeably outperforms most of them in improving the effectiveness of transfer and adaptation.

2 Related Work

Multi-task learning [27] and transfer learning [24] for reinforcement learning (RL) have long been extensively studied. Teh et al. [25] presented a distillation-based method that transfers the knowledge from task-specific agents to a multi-task learning agent. Andreas et al. [1] combined the option framework [23] and modular networks [2], and presented an efficient multi-task learning approach which shares sub-policies across the policy sketches of different tasks. Schaul et al. [19] encoded the goal state into value functions and showed generalization to new goals. More recently, Oh et al. [17] proposed to learn a meta controller along with a set of parameterized policies to compose a policy that generalizes to unseen instructions. In contrast, we jointly consider the tasks and environments, both of which can be atomic, as we learn their embeddings without resorting to any external knowledge (e.g., text or attributes).
Several recent works [3, 6, 14, 29] factorize Q value functions with an environment-agnostic state-action feature encoding function and task-specific embeddings. Our model is related to this line of work in spirit. However, as opposed to learning the value functions, we directly learn a factorized policy network with strengthened disentanglement between environments and tasks. This allows us to generalize better to new environments or tasks, as shown in our empirical studies.

3 Approach

We begin by introducing notations and stating the research problem formally. We then describe the main idea behind our approach, followed by the details of each of its components.

3.1 Problem Statement and Main Idea

Problem statement. We follow the standard framework for reinforcement learning [22]. An agent interacts with an environment by sequentially choosing actions over time and aims to maximize its cumulative rewards. This learning process is abstractly described by a Markov decision process with the following components: a space of agent states s ∈ S, a space of possible actions a ∈ A, an initial distribution of states p_0(s), a stationary distribution characterizing how the state at time t transitions to the next state at time (t+1), p(s_{t+1} | s_t, a_t), and a reward function r := r(s, a).
The agent's actions follow a policy π(a|s) : S × A → [0, 1], defined as a conditional distribution p(a|s). The goal of learning is to identify the optimal policy that maximizes the discounted cumulative reward R = E[∑_{t=0}^∞ γ^t r(s_t, a_t)], where γ ∈ (0, 1] is a discount factor and the expectation is taken with respect to the randomness in state transitions and action selection. We denote by p(s | s′, t, π) the probability of being at state s after transitioning t time steps, starting from state s′ and following the policy π. With it, we define the discounted state distribution as ρ_π(s) = ∑_{s′} ∑_{t=1}^∞ γ^{t−1} p_0(s′) p(s | s′, t, π).
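As a quick illustration (ours, not the paper's code; function names are made up), the two quantities above can be estimated from sampled trajectories as follows, with states assumed hashable:

```python
import collections

def discounted_return(rewards, gamma=0.99):
    # R = sum_t gamma^t r(s_t, a_t): one Monte Carlo sample of the objective.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def discounted_state_weights(trajectories, gamma=0.99):
    # Unnormalized Monte Carlo estimate of rho_pi(s): each visit to s at step
    # t >= 1 contributes gamma^(t-1), matching the definition with s_0 ~ p0.
    weights = collections.Counter()
    for states in trajectories:
        for t, s in enumerate(states[1:], start=1):
            weights[s] += gamma ** (t - 1)
    return weights
```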
We denote by p(s|s(cid:48), t, \u03c0)\nthe probability at state s after transitioning t time steps, starting from state s(cid:48) and following the policy\nt=1 \u03b3t\u22121p0(s(cid:48))p(s|s(cid:48), t, \u03c0).\nIn this paper, we study how an agent learns to accomplish a variety of tasks in different environments.\nLet E and T denote the sets of the environments and the tasks, respectively. We assume the cases of\n\ufb01nite sets but it is possible to extend our approach to in\ufb01nite ones. While the most basic approach\nis to learn an optimal policy under each pair (\u03b5, \u03c4 ) of environment and task, we are interested in\ngeneralizing to all combinations in (E,T ), with interactive learning from a limited subset of (\u03b5, \u03c4 )\npairs. Clearly, the smaller the subset is, the more desirable the agent\u2019s generalization capability is.\n\nMain idea.\nIn the rest of the paper, we refers to the limited subset of pairs as seen pairs or training\npairs and the rest ones as unseen pairs or testing pairs. We assume that the agent does not have\naccess to the unseen pairs to obtain any interaction data to learn the optimal policies directly. In\ncomputer vision, such problems have been intensively studied in the frameworks of unsupervised\ndomain adaptation and zero-shot learning, for example, [4,5,8,15]. There are totally |E| \u00d7 |T | pairs \u2013\nour goal is to learn from O(|E| + |T |) training pairs and generalize to all.\nOur main idea is to synthesize policies for the unseen pairs of environments and tasks. In particular,\nour agent learns two sets of embeddings: one for the environments and the other for the tasks.\nMoreover, the agent also learns how to compose policies using such embeddings. Note that learning\nboth the embeddings and how to compose happens on the training pairs. For the unseen pairs, the\npolicies are constructed and used right away \u2014 if there is interaction data, the policies can be further\n\ufb01ne-tuned. However, even without such interaction data, the synthesized policies still perform well.\nTo this end, we desire our approach to jointly supply two aspects: a compositional structure of\nSynthesized Policies (SYNPO) from environment and task embeddings and a disentanglement\nlearning objective to learn the embeddings. We refer this entire framework as SYNPO and describe\nits details in what follows.\n\n3.2 Policy Factorization and Composition\n\nGiven a pair z = (\u03b5, \u03c4 ) of an environment \u03b5 and a task \u03c4, we denote by e\u03b5 and e\u03c4 their embeddings,\nrespectively. The policy is synthesized with a bilinear mapping\n\n\u03c0z(a|s) \u221d exp(\u03c8T\n\ns U (e\u03b5, e\u03c4 )\u03c6a + b\u03c0)\n\n(1)\n\nwhere b\u03c0 is a scalar bias, and \u03c8s and \u03c6a are featurized states and actions (for instances, image\npixels or the feature representations of an image). The bilinear mapping given by the matrix U is\n\n3\n\n\fFigure 2: Overview of our proposed model. Given a task and an environment, the corresponding embeddings\ne\u03b5 and e\u03c4 are retrieved to compose the policy coef\ufb01cients and reward coef\ufb01cients. 
Figure 2: Overview of our proposed model. Given a task and an environment, the corresponding embeddings e_ε and e_τ are retrieved to compose the policy coefficients and reward coefficients. Such coefficients then linearly combine the shared basis and synthesize a policy (and a reward prediction) for the agent.

Analogously, during learning (to be explained in detail in a later section), we predict the rewards by modeling them with the same set of basis matrices but different combination coefficients:

r̃_z(s, a) = ψ_s^T V(e_ε, e_τ) φ_a + b_r = ψ_s^T (∑_k β_k(e_ε, e_τ) Θ_k) φ_a + b_r,   (3)

where b_r is a scalar bias. Note that similar strategies for learning to predict rewards alongside the policies have also been studied in recent works [3, 12, 29]. We find this strategy helpful too (cf. the empirical studies in Section 4).
Figure 2 illustrates the model architecture described above. In this paper, we consider agents that take egocentric views of the environment, so a convolutional neural network is used to extract the state features ψ_s (cf. the bottom left panel of Figure 2). The action features φ_a are learned as a look-up table. The other model parameters include the basis Θ, the embeddings e_ε and e_τ in look-up tables for the environments and the tasks, and the coefficient functions α_k(·, ·) and β_k(·, ·) for synthesizing the policy and the reward predictor, respectively. The coefficient functions α_k(·, ·) and β_k(·, ·) are parameterized as one-hidden-layer MLPs whose input is the concatenation of e_ε and e_τ.
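For completeness, a plain-NumPy version of such a coefficient function might look as follows; the hidden width and the initialization scale are our assumptions, not values from the paper:

```python
import numpy as np

class CoefficientMLP:
    """One-hidden-layer MLP mapping [e_env; e_task] to K combination coefficients
    (one such MLP each for alpha in Eq. (2) and beta in Eq. (3))."""

    def __init__(self, d_embed, K, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, size=(hidden, 2 * d_embed))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, size=(K, hidden))
        self.b2 = np.zeros(K)

    def __call__(self, e_concat):
        h = np.maximum(0.0, self.W1 @ e_concat + self.b1)  # ReLU hidden layer
        return self.W2 @ h + self.b2                       # K coefficients
```

An instance of this class can be plugged in as alpha_fn in the earlier synthesis sketch to compose U for a given (ε, τ) pair.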
3.3 Disentanglement of the Embeddings for Environments and Tasks

In SYNPO, both the embeddings and the bilinear mapping are to be learnt. In an alternative but equivalent form, the policies are formulated as

π_z(a|s) ∝ exp(∑_k α_k(e_ε, e_τ) ψ_s^T Θ_k φ_a + b_π).   (4)

As the defining coefficients α_k are parameterized by a neural network whose inputs and parameters are both optimized, we need to impose additional structure so that the learned embeddings facilitate transfer across environments or tasks. Otherwise, the learning could overfit to the seen pairs, treating each pair as a unit and generalizing poorly to unseen pairs.
To this end, we introduce discriminative losses that distinguish different environments or tasks through the agent's trajectories. Let x = {ψ_s^T Θ_k φ_a} ∈ R^K be the state-action representation. For an agent interacting with an environment-task pair z = (ε, τ), we denote its trajectory by {x_1, x_2, ..., x_t, ...}. We argue that a good embedding (either e_ε or e_τ) ought to be able to tell which environment or task the trajectory is from. 
In particular, we formulate this as a multi-way classification in which we desire that x_t (on average) be telltale of its environment ε or task τ:

ℓ_ε := −∑_t log P(ε | x_t), with P(ε | x_t) ∝ exp(g(x_t)^T e_ε),   (5)
ℓ_τ := −∑_t log P(τ | x_t), with P(τ | x_t) ∝ exp(h(x_t)^T e_τ),   (6)

where we use two nonlinear mapping functions (g(·) and h(·), parameterized by one-hidden-layer MLPs) to transform the state-action representation x_t so that it retrieves e_ε and e_τ, respectively. These two functions are also learnt using the interaction data from the seen pairs.
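The following sketch (ours; a softmax cross-entropy over all environment embeddings) shows one way the loss ℓ_ε of Eq. (5) could be computed along one trajectory; the symmetric ℓ_τ with h and the task embedding table is analogous:

```python
import numpy as np

def disentangle_loss(xs, env_id, E_env, g):
    """Eq. (5): -sum_t log P(env | x_t), with P(env | x_t) prop. to exp(g(x_t)^T e_env).

    xs:     (T, K)   state-action representations x_t along one trajectory
    env_id: int      index of the environment the trajectory came from
    E_env:  (|E|, d) embedding table, one row per environment
    g:      nonlinear map from R^K to R^d (a one-hidden-layer MLP in the paper)
    """
    loss = 0.0
    for x in xs:
        logits = E_env @ g(x)                # score x_t against every environment
        logits = logits - logits.max()       # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        loss -= log_probs[env_id]            # -log P(true env | x_t)
    return loss
```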
3.4 Learning

Our approach (SYNPO) relies on the modeling assumption that the policies (and the reward prediction functions) factorize along the axes of the environment and the task. This is a generic assumption and can be integrated with many reinforcement learning algorithms. In this paper, we study its effectiveness mostly on imitation learning, and also on reinforcement learning.
In imitation learning, we denote by π^e_z the expert policy of combination z and apply the simple strategy of "behavior cloning" with random perturbations to learn our model from expert demonstrations [10]. We employ a cross-entropy loss for the policy,

ℓ_{π_z} := −E_{s ∼ ρ_{π^e_z}, a ∼ π^e_z} [log π_z(a|s)].

An ℓ_2 loss is used for learning the reward prediction function, ℓ_{r_z} := E_{s ∼ ρ_{π^e_z}, a ∼ π^e_z} ‖r̃_z(s, a) − r_z(s, a)‖^2. Together with the disentanglement losses, they form the overall loss function

L := E_z [ℓ_{π_z} + λ_1 ℓ_{r_z} + λ_2 ℓ_ε + λ_3 ℓ_τ],

which is optimized through experience replay, as shown in Algorithm 1 in the supplementary materials (Suppl. Materials). We choose the values of the hyper-parameters λ_i so that the contributions of the objectives are balanced. More details are presented in the Suppl. Materials.

3.5 Transfer to Unseen Environments and Tasks

Eq. (1) can synthesize a policy for any (ε, τ) pair, as long as the environment and the task (not necessarily the pair of them) have each appeared at least once among the training pairs. If, however, a new environment and/or a new task appears (corresponding to transfer setting 2 or 3 in Section 1), fine-tuning is required to extract their embeddings. To do so, we keep all the components of our model fixed except the look-up tables (i.e., embeddings) for the environment and/or the task. This effectively re-uses the policy composition rule and enables fast learning of the environment and/or task embeddings after seeing a small number of demonstrations. In the experiments, we find this works well even with a single demonstration.
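This embedding-only adaptation can be sketched as follows, assuming a PyTorch model that exposes env_embed / task_embed embedding tables and a behavior-cloning loss(batch); those attribute names, the choice of SGD, and the learning rate are all our assumptions:

```python
import torch

def finetune_embeddings(model, demos, steps=100, lr=1e-2):
    # Transfer settings 2/3: freeze the basis, coefficient MLPs, and feature
    # extractors; update only the embedding look-up tables. (In setting 2,
    # unfreezing just the one new table would suffice.)
    for p in model.parameters():
        p.requires_grad_(False)                      # hold the composition rule fixed
    for table in (model.env_embed, model.task_embed):
        table.weight.requires_grad_(True)            # learn only the look-up tables
    opt = torch.optim.SGD([model.env_embed.weight, model.task_embed.weight], lr=lr)
    for _ in range(steps):
        for batch in demos:                          # e.g., a single demonstration
            opt.zero_grad()
            model.loss(batch).backward()
            opt.step()
    return model
```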
4 Experiments

We validate our approach (SYNPO) with extensive experimental studies, comparing it with several baselines and state-of-the-art transfer learning methods.

4.1 Setup

We experiment with two simulated environments, GRIDWORLD and THOR [13], in both of which the agent takes as input an egocentric view (cf. Figure 3). Please refer to the Suppl. Materials for more details about the state feature function ψ_s used in these simulators. (Implementations of the two simulators are available at https://www.github.com/sha-lab/gridworld and https://www.github.com/sha-lab/thor, respectively.)

Figure 3: From left to right: (a) some sample mazes of our GRIDWORLD dataset; they are similar in appearance but different in topology. Demonstrations of an agent's egocentric views in (b) GRIDWORLD and (c) THOR.

GRIDWORLD and tasks. We design twenty 16 × 16 grid-aligned mazes, some of which are visualized in Figure 3(a). The mazes are similar in appearance but differ from each other in topology. There are five colored blocks serving as "treasures", and the agent's goal is to collect the treasures in a pre-specified order, e.g., "Pick up Red and then pick up Blue". At each time step, the "egocentric" view observed by the agent consists of the agent's surroundings within a 3 × 3 window and the treasures' locations. At each run, the locations of the agent and the treasures are randomized. We consider twenty tasks in each environment, resulting in |E| × |T| = 400 pairs of (ε, τ) in total. In transfer setting 1 (cf. Figure 1(a)), we randomly choose 144 pairs as the training set under the constraint that each environment appears at least once, and so does each task; a sketch of such a sampler appears below. The remaining 256 pairs are used for testing. For the transfer settings 2 and 3 (cf. Figure 1(b) and (c)), we postpone the detailed setups to Section 4.2.2.
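One simple way to draw such a covering training split (our illustration; the paper does not specify its sampling procedure beyond the coverage constraint) is to first place one pair per environment and per task, then fill up to the budget at random:

```python
import random

def sample_seen_pairs(n_envs, n_tasks, budget, seed=0):
    # Draw `budget` (env, task) training pairs such that every environment and
    # every task appears at least once; all remaining pairs form the unseen split.
    rng = random.Random(seed)
    tasks = list(range(n_tasks))
    rng.shuffle(tasks)
    seen = {(e, tasks[e % n_tasks]) for e in range(n_envs)}   # cover every environment
    seen |= {(rng.randrange(n_envs), t)                       # cover any task still missing
             for t in range(n_tasks) if all(p[1] != t for p in seen)}
    assert budget >= len(seen), "budget too small to cover every env and task"
    rest = [(e, t) for e in range(n_envs) for t in range(n_tasks) if (e, t) not in seen]
    rng.shuffle(rest)
    seen |= set(rest[: budget - len(seen)])
    return sorted(seen)
```

With n_envs = n_tasks = 20 and budget = 144, this yields a 144/256 split of the 400 pairs, matching the split sizes used here (the exact split in the paper may of course differ).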
THOR [13] and tasks. We also test our method on THOR, a challenging 3D simulator in which the agent is placed in photo-realistic indoor scenes. The tasks are to search for and act on objects, e.g., "Put the cabbage into the fridge". Different from GRIDWORLD, the objects' locations are unknown, so the agent has to search for the objects of interest using its understanding of the visual scene (cf. Figure 3(c)). There are 7 actions in total (look up, look down, turn left, turn right, move forward, open/close, pick up/put down). We run experiments with 19 scenes × 21 tasks in this simulator.

Evaluations. We evaluate the agent's performance by the averaged success rate (AvgSR.) of accomplishing the tasks, limiting the maximum trajectory length to 300 steps. For the results reported in numbers (e.g., Table 1), we run 100 rounds of experiments for each (ε, τ) pair, randomizing the agent's starting point and the treasures' locations. To plot the convergence curves (e.g., Figure 4), we sample 100 (ε, τ) combinations and run one round of experiments for each, to save computation time. We train our algorithms under 3 random seeds and report the mean and standard deviation (std).

Competing methods. We compare our approach (SYNPO) with the following baselines and competing methods. Note that our problem setup is new, so we adapt the competing methods, which were proposed for other scenarios, to fit it.
• MLP. The policy network is a multilayer perceptron whose input concatenates the state features with the environment and task embeddings. We train this baseline using the losses proposed for our approach, including the disentanglement losses ℓ_ε and ℓ_τ; it performs worse without them.
• Successor Feature (SF). We learn the successor feature model [3] by Q-imitation learning for a fair comparison. We strictly follow [14] to set up the learning objectives. The key difference between SF and our approach is its lack of capability in capturing the environmental priors.
• Module Network (ModuleNet). We also implement a module network following [7]. Here we train an environment-specific module for each environment and a task-specific module for each task. The policy for a given (ε, τ) pair is assembled by combining the corresponding environment module and task module.
• Multi-Task Reinforcement Learning (MTL). This is a degenerate version of our method that ignores the distinctions among environments: we simply replace the environment embeddings with zeros in the coefficient functions. The disentanglement loss on the task embeddings is still used, since it leads to better performance than otherwise.
Please refer to the Suppl. Materials for more experimental details, including all twenty GRIDWORLD mazes, how we configure the rewards, optimization techniques, feature extraction for the states, and our implementation of the baseline methods.

4.2 Experimental Results on GRIDWORLD

We first report results in the adaptation and transfer learning setting 1, as described in Section 1 and Figure 1(a). There, the agent acts upon a new pair of environment and task, both of which it has encountered during training but not in the same (ε, τ) pair. The goal is to learn from (ε, τ) pairs that are as sparse as possible among all the combinations and yet still transfer successfully.

Figure 4: On GRIDWORLD: averaged success rate (AvgSR.) over training iterations on (a) SEEN pairs and (b) UNSEEN pairs, with |E| = 20 and |T| = 20. We report mean and std over 3 training random seeds.

Figure 5: (a) Transfer learning performance (in AvgSR.) with respect to the ratio # SEEN pairs / # TOTAL pairs, with |E| = 10 and |T| = 10. (b) Reinforcement learning performance on unseen pairs for different approaches (fine-tuned with PPO [20]). MLP overfits, MTL improves slightly, and SYNPO achieves 96.16% AvgSR.

4.2.1 Transfer to Previously Encountered Environments and Tasks

Main results. Table 1 and Figure 4 show the success rates and convergence curves, respectively, of our approach and the competing methods, averaged over the seen and unseen (ε, τ) pairs. SYNPO consistently outperforms the others in terms of both convergence and final performance, by a significant margin. On the seen split, MTL and MLP perform similarly, while MTL performs worse than MLP on the unseen split (i.e., in terms of generalization), possibly because it treats all the environments the same.
We design an extreme scenario to further challenge the environment-agnostic methods (e.g., MTL). We reduce the window size of the agent's view to one, so the agent sees only the cell it resides in and the treasures' locations. As a result, MTL suffers severely, MLP performs moderately well, and SYNPO outperforms both significantly (unseen AvgSR: MTL = 6.1%, MLP = 66.1%, SYNPO = 76.8%). We conjecture that the environment information embodied in the states is crucial for the agent to be aware of, and generalize across, distinct environments. More discussion is deferred to the Suppl. Materials.

How many seen (ε, τ) pairs do we need to transfer well? Figure 5(a) shows that, not surprisingly, the transfer learning performance increases as the number of seen pairs increases. The gain slows down after the seen/total ratio reaches 0.4. In other words, when there is a limited budget, our approach enables the agent to learn from 40% of all possible (ε, τ) pairs and yet generalize well across the tasks and environments.

Does reinforcement learning help transfer? Beyond imitation learning, we further study SYNPO for reinforcement learning (RL) under the same transfer learning setting. Specifically, we use PPO [20] to fine-tune the three top-performing algorithms on GRIDWORLD. The results, averaged over 3 random seeds, are shown in Figure 5(b). 
We find that RL fine-tuning improves the transfer performance of all three algorithms. In general, MLP suffers from over-fitting, MTL is improved moderately yet with a significant gap to the best result, and SYNPO achieves the best AvgSR., 96.16%.

Table 1: Performance (AvgSR.) of each method on GRIDWORLD (SEEN/UNSEEN = 144/256).

Method           | SF         | ModuleNet    | MLP         | MTL         | SYNPO
AvgSR. (SEEN)    | 0.0 ± 0.0% | 50.9 ± 33.8% | 69.0 ± 2.0% | 64.1 ± 1.2% | 83.3 ± 0.5%
AvgSR. (UNSEEN)  | 0.0 ± 0.0% | 30.4 ± 20.1% | 66.1 ± 2.6% | 41.5 ± 1.4% | 82.1 ± 1.5%

Table 2: Performance of transfer learning in the settings 2 and 3 on GRIDWORLD.

Setting   | Method | Cross Pair (Q's ε, P's τ) | Cross Pair (P's ε, Q's τ) | Q Pairs
Setting 2 | MLP    | 6.3%                      | 20.7%                     | 13.8%
Setting 2 | SYNPO  | 13.5%                     | 21.5%                     | 50.5%
Setting 3 | MLP    | 7.2%                      | 18.3%                     | 14.6%
Setting 3 | SYNPO  | 12.9%                     | 19.4%                     | 42.7%

Ablation studies. We refer readers to the Suppl. Materials for ablation studies of the learning objectives.

4.2.2 Transfer to Previously Unseen Environments or Tasks

We now investigate how effectively one can schedule transfer from seen environments and tasks to unseen ones, i.e., the settings 2 and 3 described in Section 1 and Figure 1(b) and (c). The seen pairs (denoted by P) are constructed from ten environments and ten tasks; the remaining ten environments and ten tasks are unseen (denoted by Q). We then have two settings of transfer learning.
One is to transfer to pairs which cross the seen set P and the unseen set Q. This corresponds to setting 2, as the embeddings for either the unseen tasks or the unseen environments need to be learnt, but not both. Once these embeddings are learnt, we use them to synthesize policies for the test (ε, τ) pairs. This mimics the style of "incremental learning of small pieces and integrating knowledge later".
The other is transfer setting 3. The agent learns policies by learning embeddings for the tasks and environments of the unseen set Q and then composing, as described in Section 3.5. Using the embeddings from P and Q, we can synthesize policies for any (ε, τ) pair. This mimics the style of "learning in giant jumps and connecting the dots".

Main results. Table 2 contrasts the results of the two transfer learning settings. Clearly, setting 2 attains stronger performance, as it "incrementally learns" the embeddings of either the tasks or the environments but not both, while setting 3 requires learning both simultaneously. It is interesting that this result aligns with how effectively humans learn.
Figure 6 visualizes the results, whose rows are indexed by tasks and columns by environments. The seen pairs in P are in the upper-left quadrant and the unseen set Q is at the bottom-right. We refer readers to the Suppl. Materials for more details and discussion of the results.

4.3 Experimental Results on THOR

Main results. The results on the THOR simulator are shown in Table 3, where we report our approach as well as the top-performing methods from GRIDWORLD. SYNPO significantly outperforms the three competing methods on both seen pairs and unseen pairs. Moreover, our approach also performs best when moving from seen to unseen pairs, indicating that it is less prone to overfitting than the other methods. More details are included in the Suppl. Materials.
5 Conclusion

In this paper, we consider the problem of learning to simultaneously transfer across both environments (ε) and tasks (τ) under the reinforcement learning framework and, more importantly, by learning from only sparse (ε, τ) pairs out of all the possible combinations. Specifically, we present a novel approach that learns to synthesize policies from the disentangled embeddings of environments and tasks. We evaluate our approach in challenging transfer scenarios in two simulators, GRIDWORLD and THOR. Empirical results verify that our method generalizes better across environments and tasks than several competing baselines.

Figure 6: Transfer results of settings 2 and 3, with AvgSRs marked in the grid (see Suppl. Materials for more visually discernible plots). The tasks and environments in the purple cells are from the unseen set Q, and the red cells correspond to the rest. Darker color means better performance. The figure shows that cross-task transfer is easier than cross-environment transfer.

Table 3: Performance of each method on THOR (SEEN/UNSEEN = 144/199).

Method           | ModuleNet | MLP   | MTL   | SYNPO
AvgSR. (SEEN)    | 51.5%     | 47.5% | 52.2% | 55.6%
AvgSR. (UNSEEN)  | 14.4%     | 25.8% | 33.3% | 35.4%

Acknowledgments. We appreciate the feedback from the reviewers. This work is partially supported by DARPA# FA8750-18-2-0117, NSF IIS-1065243, 1451412, 1513966/1632803/1833137, 1208500, CCF-1139148, a Google Research Award, an Alfred P. Sloan Research Fellowship, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484.

References

[1] J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. In ICML, 2017.
[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
[3] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, D. Silver, and H. P. van Hasselt. Successor features for transfer in reinforcement learning. In NIPS, 2017.
[4] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5327–5336, 2016.
[5] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.
[6] P. Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5:613–624, 1993.
[7] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2169–2176. IEEE, 2017.
[8] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073, 2012.
[9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[10] J. Ho and S. Ermon. 
Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
[11] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
[12] M. Jaderberg, V. Mnih, W. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016.
[13] E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An interactive 3D environment for visual AI. CoRR, abs/1712.05474, 2017.
[14] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. Gershman. Deep successor reinforcement learning. CoRR, abs/1606.02396, 2016.
[15] I. Misra, A. Gupta, and M. Hebert. From red wine to red tomato: Composition with context. In CVPR, pages 1160–1169, 2017.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
[17] J. Oh, S. Singh, H. Lee, and P. Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. arXiv preprint arXiv:1706.05064, 2017.
[18] E. Parisotto, J. L. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
[19] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In ICML, 2015.
[20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[21] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. 
Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.
[22] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. IEEE Transactions on Neural Networks, 16:285–286, 1998.
[23] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
[24] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685, 2009.
[25] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4499–4509, 2017.
[26] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
[27] A. Wilson, A. Fern, S. Ray, and P. Tadepalli. Multi-task reinforcement learning: A hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pages 1015–1022. ACM, 2007.
[28] C. Zhang, O. Vinyals, R. Munos, and S. Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
[29] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi. Visual semantic planning using deep successor representations. In Proceedings of the IEEE International Conference on Computer Vision, volume 2, page 7, 2017.
", "award": [], "sourceid": 610, "authors": [{"given_name": "Hexiang", "family_name": "Hu", "institution": "University of Southern California"}, {"given_name": "Liyu", "family_name": "Chen", "institution": "University of Southern California"}, {"given_name": "Boqing", "family_name": "Gong", "institution": "Tencent AI Lab"}, {"given_name": "Fei", "family_name": "Sha", "institution": "University of Southern California (USC)"}]}