{"title": "Hardware Conditioned Policies for Multi-Robot Transfer Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9333, "page_last": 9344, "abstract": "Deep reinforcement learning could be used to learn dexterous robotic policies but it is challenging to transfer them to new robots with vastly different hardware properties. It is also prohibitively expensive to learn a new policy from scratch for each robot hardware due to the high sample complexity of modern state-of-the-art algorithms. We propose a novel approach called Hardware Conditioned Policies where we train a universal policy conditioned on a vector representation of robot hardware. We considered robots in simulation with varied dynamics, kinematic structure, kinematic lengths and degrees-of-freedom. First, we use the kinematic structure directly as the hardware encoding and show great zero-shot transfer to completely novel robots not seen during training. For robots with lower zero-shot success rate, we also demonstrate that fine-tuning the policy network is significantly more sample-efficient than training a model from scratch. In tasks where knowing the agent dynamics is important for success, we learn an embedding for robot hardware and show that policies conditioned on the encoding of hardware tend to generalize and transfer well. 
Videos of experiments are available at: https://sites.google.com/view/robot-transfer-hcp.", "full_text": "Hardware Conditioned Policies for Multi-Robot\n\nTransfer Learning\n\nTao Chen\n\nThe Robotics Institute\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\ntaoc1@cs.cmu.edu\n\nAdithyavairavan Murali\n\nThe Robotics Institute\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\namurali@cs.cmu.edu\n\nAbhinav Gupta\n\nThe Robotics Institute\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nabhinavg@cs.cmu.edu\n\nAbstract\n\nDeep reinforcement learning could be used to learn dexterous robotic policies\nbut it is challenging to transfer them to new robots with vastly different hardware\nproperties. It is also prohibitively expensive to learn a new policy from scratch\nfor each robot hardware due to the high sample complexity of modern state-of-\nthe-art algorithms. We propose a novel approach called Hardware Conditioned\nPolicies where we train a universal policy conditioned on a vector representation\nof robot hardware. We considered robots in simulation with varied dynamics,\nkinematic structure, kinematic lengths and degrees-of-freedom. First, we use the\nkinematic structure directly as the hardware encoding and show great zero-shot\ntransfer to completely novel robots not seen during training. For robots with lower\nzero-shot success rate, we also demonstrate that \ufb01ne-tuning the policy network\nis signi\ufb01cantly more sample-ef\ufb01cient than training a model from scratch.\nIn\ntasks where knowing the agent dynamics is important for success, we learn an\nembedding for robot hardware and show that policies conditioned on the encoding\nof hardware tend to generalize and transfer well. 
Videos of experiments are\navailable at: https://sites.google.com/view/robot-transfer-hcp.\n\n1\n\nIntroduction\n\nIn recent years, we have seen remarkable success in the \ufb01eld of deep reinforcement learning (DRL).\nFrom learning policies for games [1, 2] to training robots in simulators [3], neural network based\npolicies have shown remarkable success. But will these successes translate to real world robots? Can\nwe use DRL for learning policies of how to open a bottle, grasping or even simpler tasks like \ufb01xturing\nand peg-insertion? One major shortcoming of current approaches is that they are not sample-ef\ufb01cient.\nWe need millions of training examples to learn policies for even simple actions. Another major\nbottleneck is that these policies are speci\ufb01c to the hardware on which training is performed. If we\napply a policy trained on one robot to a different robot it will fail to generalize. Therefore, in this\nparadigm, one would need to collect millions of examples for each task and each robot.\nBut what makes this problem even more frustrating is that since there is no standardization in terms\nof hardware, different labs collect large-scale data using different hardware. These hardware vary in\ndegrees of freedom (DOF), kinematic design and even dynamics. Because the learning process is so\nhardware-speci\ufb01c, there is no way to pool and use all the shared data collected across using different\ntypes of robots, especially when the robots are trained under torque control. There have been efforts\nto overcome dependence on hardware properties by learning invariance to robot dynamics using\ndynamic randomization [4]. 
However, learning a policy invariant to other hardware properties such\nas degrees of freedom and kinematic structure is a challenging problem.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fIn this paper, we propose an alternative solution: instead of trying to learn the invariance to hardware;\nwe embrace these differences and propose to learn a policy conditioned on the hardware properties\nitself. Our core idea is to formulate the policy \u03c0 as a function of current state st and the hardware\nproperties vh. So, in our formulation, the policy decides the action based on current state and its own\ncapabilities (as de\ufb01ned by hardware vector). But how do you represent the robot hardware as vector?\nIn this paper, we propose two different possibilities. First, we propose an explicit representation\nwhere the kinematic structure itself is fed as input to the policy function. But such an approach\nwill not be able to encode robot dynamics which might be hard to measure. Therefore, our second\nsolution is to learn an embedding space for robot hardware itself. Our results indicate that encoding\nthe kinematic structures explicitly enables high success rate on zero-shot transfer to new kinematic\nstructure. And learning the embedding vector for hardware implicitly without using kinematics and\ndynamics information is able to give comparable performance to the model where we use all of the\nkinematics and dynamics information. Finally, we also demonstrate that the learned policy can also\nadapt to new robots with much less data samples via \ufb01netuning.\n\n2 Related Work\n\nTransfer in Robot Learning Transfer learning has a lot of practical value in robotics, given that it\nis computationally expensive to collect data on real robot hardware and that many reinforcement\nlearning algorithms have high sample complexity. Taylor et al. 
present an extensive survey of transfer learning work in reinforcement learning [5]. Prior work has broadly focused on the transfer of policies between tasks [6-9], control parameters [10], dynamics [4, 11-13], visual inputs [14], non-stationary environments [15], and goal targets [16]. Nilim et al. presented theoretical results on the performance of transfer under conditions with bounded disturbances in the dynamics [12]. There have been efforts to apply domain adaptation, such as learning common invariant feature spaces between domains [17] and learning a mapping from the target to the source domain [18]. Such approaches require prior knowledge and data from the target domain. A lot of recent work has focused on transferring policies trained in simulation to a real robot [4, 14, 19, 20]. However, there has been very limited work on transferring knowledge and skills between different robots [6, 21]. The most relevant paper is Devin et al. [6], who propose module networks for transfer learning and use them to transfer 2D planar policies across hand-designed robots. The key idea is to decompose the policy network into a robot-specific module and a task-specific module. In our work, a universal policy is conditioned on a vector representation of the robot hardware - the policy does not necessarily need to be completely retrained for a new robot. There has also been some concurrent work applying graph neural networks (GNN) as the policy class in continuous control [22, 23]. [22] uses a GNN instead of an MLP to train a policy. [23] uses a GNN to learn a forward prediction model for future states and performs model predictive control on agents. Our work is orthogonal to these methods, as we condition the policy on a state augmented with the robot hardware representation, which is independent of the policy class.

Robust Control and Adaptive Control Robust control can be considered from several vantage points. 
In the context of trajectory optimization methods, model predictive control (MPC) is a\npopular framework which continuously resolves an open-loop optimization problem, resulting in a\nclosed loop algorithm that is robust to dynamics disturbances. In the context of deep reinforcement\nlearning, prior work has explored trajectory planning with an ensemble of dynamics models [24],\nadversarial disturbances [13], training with random dynamics parameters in simulation [4, 11], etc.\n[4] uses randomization over dynamics so that the policy network can generalize over a large range\nof dynamics variations for both the robot and the environment. However, it uses position control\nwhere robot dynamics has little direct impact on control. We use low-level torque control which is\nseverely affected by robot dynamics and show transfer even between kinematically different agents.\nThere have been similar works in the area of adaptive control [25] as well, where unknown dynamics\nparameters are estimated online and adaptive controllers adapt the control parameters by tracking\nmotion error. Our work is a model-free method which does not make assumptions like linear system\ndynamics. We also show transfer results on robots with different DOFs and joint displacements.\nSystem Identi\ufb01cation System identi\ufb01cation is a necessary process in robotics to \ufb01nd unknown\nphysical parameters or to address model inconsistency during training and execution. For control\nsystems based on analytic models, as is common in the legged locomotion community, physical\nparameters such as the moment of inertia or friction have to be estimated for each custom robotic\n\n2\n\n\fhardware [26, 27]. Another form of system identi\ufb01cation involves the learning of a dynamics model\nfor use in model-based reinforcement learning. Several prior research work have iterated between\nbuilding a dynamics model and policy optimization [28\u201330]. In the context of model-free RL, Yu\net al. 
proposed an Online System Identi\ufb01cation [31] module that is trained to predict physical\nenvironmental factors such as the agent mass, friction of the \ufb02oor, etc. which are then fed into the\npolicy along with the agent state [31]. However, results were shown for simple simulated domains\nand even then it required a lot of samples to learn an accurate regression function of the environmental\nfactors. There is also concurrent work which uses graph networks [23] to learn a forward prediction\nmodel for future states and to perform model predictive control. Our method is model-free and only\nrequires a simple hardware augmentation as input regardless of the policy class or DRL algorithms.\n\n3 Preliminaries\n\nWe consider the multi-robot transfer learning problem under the reinforcement learning framework\nand deal with fully observable environments that are modeled as continuous space Markov Decision\nProcesses (MDP). The MDPs are represented by the tuple (S,A, P, r, \u03c10, \u03b3), where S is a set of\ncontinuous states, A is a set of continuous actions, P : S \u00d7 A \u00d7 S \u2192 R is the transition probability\ndistribution, r : S \u00d7 A \u2192 R is the reward function, \u03c10 is the initial state distribution, and \u03b3 \u2208 (0, 1]\nis the discount factor. The aim is to \ufb01nd a policy \u03c0 : S \u2192 A that maximizes the expected return.\nThere are two classes of approaches used for optimization: on-policy and off-policy. On-policy\napproaches (e.g., Proximal Policy Optimization (PPO) [32]) optimize the same policy that is used\nto make decisions during exploration. On the other hand, off-policy approaches allow policy op-\ntimization on data obtained by a behavior policy different from the policy being optimized. 
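The objective above - finding a policy that maximizes the expected discounted return under γ - can be made concrete with a short sketch. The episode rewards below are illustrative placeholders, not values from the paper:

```python
def discounted_return(rewards, gamma):
    """Return G = sum_t gamma^t * r_t for one episode.

    Accumulating backwards avoids computing gamma^t explicitly.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A toy sparse-reward episode: -1 per step until a final success (+1).
rewards = [-1, -1, -1, 1]
print(discounted_return(rewards, 0.99))  # -> -1.999801 (approximately)
```

Note that with sparse -1/+1 rewards, a shorter episode yields a strictly higher return, which is what drives the policy to reach the goal quickly.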
Deep\ndeterministic policy gradient (DDPG) [3] is a model-free actor-critic off-policy algorithm which uses\ndeterministic action policy and is applicable to continuous action spaces.\nOne common issue with training these approaches is sparse rewards. Hindsight experience replay\n(HER) [33] was proposed to improve the learning under the sparse reward setting for off-policy\nalgorithms. The key insight of HER is that even though the agent has not succeeded at reaching the\nspeci\ufb01ed goal, the agent could have at least achieved a different one. So HER pretends that the agent\nwas asked to achieve the goal it ended up with in that episode at the \ufb01rst place, instead of the one\nthat we set out to achieve originally. By repeating the goal substitution process, the agent eventually\nlearns how to achieve the goals we speci\ufb01ed.\n\n4 Hardware Conditioned Policies\n\nOur proposed method, Hardware Conditioned Policies (HCP), takes robot hardware information\ninto account in order to generalize the policy network over robots with different kinematics and\ndynamics. The main idea is to construct a vector representation vh of each robot hardware that\ncan guide the policy network to make decisions based on the hardware characteristics. Therefore,\nthe learned policy network should learn to act in the environment, conditioned on both the state st\nand vh. There are several factors that encompass robot hardware that we have considered in our\nframework - robot kinematics (degree of freedom, kinematic structure such as relative joint positions\nand orientations, and link length), robot dynamics (joint damping, friction, armature, and link mass) -\nand other aspects such as shape geometry, actuation design, etc. that we will explore in future work.\nIt is also noteworthy that the robot kinematics is typically available for any newly designed robot, for\ninstance through the Universal Robot Description Format-URDF [34]. 
Nonetheless, the dynamics are typically not available, and may be inaccurate or change over time even if provided. We now explain two ways to encode the robot hardware via the vector vh.

4.1 Explicit Encoding

First, we propose to represent robot hardware information via an explicit encoding method (HCP-E). In explicit encoding, we directly use the kinematic structure as input to the policy function. Note that while estimating the kinematic structure is feasible, it is not feasible to measure dynamics. However, some environments and tasks might not depend heavily on robot dynamics, and in those scenarios explicit encoding (HCP-E) might be simpler and more practical than implicit embedding1. We followed the popular URDF convention in ROS to frame our explicit encoding, incorporating the least amount of information needed to fully define a multi-DOF robot. It is difficult to completely define a robot with just its end-effector information, as the kinematic structure (even for the same DOF) affects the robot behaviour. For instance, the whole kinematic chain is important when there are obstacles in the workspace and the policy has to learn to avoid collisions with its links.

We consider manipulators composed of n revolute joints (J0, J1, ..., Jn-1). Figure 1 shows two consecutive joints Ji, Ji+1 on the two ends of an L-shaped link and their corresponding local coordinate systems {xiyizi} with origin Oi and {xi+1yi+1zi+1} with origin Oi+1, where the z-axis is the direction of the revolute joint axis. To represent the spatial relationship between Ji and Ji+1, one needs to know the relative pose2 Pi between Ji and Ji+1.

Relative pose Pi can be decomposed into relative position and orientation. Relative position is represented by the difference vector di between Oi and Oi+1, i.e., di = Oi+1 - Oi, di ∈ R^3. 
The relative rotation matrix from {xi+1yi+1zi+1} to {xiyizi} is R_i^{i+1} = (R_w^i)^{-1} R_w^{i+1}, where R_w^i is the rotation matrix of {xiyizi} relative to the world coordinate system. One can further convert a rotation matrix, which has 9 elements, into an Euler rotation vector with only 3 independent elements. Therefore, the relative rotation can be represented by an Euler rotation vector ei = (θix, θiy, θiz), ei ∈ R^3. The relative pose is then Pi = di ⊕ ei ∈ R^6, where ⊕ denotes concatenation.

Figure 1: Local coordinate systems for two consecutive joints.

With the relative poses Pi of consecutive joints in hand, the encoding vector vh representing the robot can be explicitly constructed as follows3:

vh = P-1 ⊕ P0 ⊕ ··· ⊕ Pn-1

4.2 Implicit Encoding

In the above section, we discussed how the kinematic structure of the hardware can be explicitly encoded as vh. However, in most cases, we need to encode not only the kinematic structure but also the underlying dynamic factors. In such scenarios, explicit encoding is not possible, since one cannot easily measure friction or damping in motors. In this section, we discuss how we can learn an embedding space for robot hardware while simultaneously learning the action policies. Our goal is to estimate vh for each robot hardware such that when a policy function π(st, vh) is used to take actions, it maximizes the expected return. For each robot hardware, we initialize vh randomly. We also randomly initialize the parameters of the policy network. We then use standard policy optimization algorithms to update the network parameters via back-propagation. However, since vh is also a learned parameter, the gradients flow back all the way to the encoding vector vh and update it via gradient descent: vh ← vh - α∇vh L(vh, θ), where L is the cost function and α is the learning rate. 
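The joint update of shared parameters and per-robot embeddings can be illustrated with a minimal numpy sketch. A toy quadratic objective stands in for the RL loss here, and all names, dimensions, and targets are our own placeholders; the point is only that the same gradient step updates both θ and the sampled robot's vh:

```python
import numpy as np

rng = np.random.default_rng(0)
n_robots, emb_dim, state_dim = 4, 3, 5

vh = rng.normal(size=(n_robots, emb_dim))        # learnable per-robot embeddings
theta = rng.normal(size=(state_dim + emb_dim,))  # shared "policy" parameters
targets = rng.normal(size=n_robots)              # toy robot-specific objectives

# Fixed evaluation set to measure progress.
eval_set = [(r, rng.normal(size=state_dim)) for r in range(n_robots) for _ in range(10)]

def avg_loss():
    total = 0.0
    for r, s in eval_set:
        x = np.concatenate([s, vh[r]])           # augmented input s ⊕ vh
        total += (x @ theta - targets[r]) ** 2
    return total / len(eval_set)

before = avg_loss()
alpha = 0.01
for step in range(2000):
    r = int(rng.integers(n_robots))              # sample a robot each "episode"
    s = rng.normal(size=state_dim)
    x = np.concatenate([s, vh[r]])
    err = x @ theta - targets[r]
    g_theta = 2 * err * x                        # dL/dtheta
    g_vh = 2 * err * theta[state_dim:]           # dL/dvh: gradient flows into the embedding
    theta -= alpha * g_theta                     # shared-parameter update
    vh[r] -= alpha * g_vh                        # vh <- vh - alpha * grad_vh L
after = avg_loss()
print(before, after)                             # loss drops as theta and vh co-adapt
```

In the actual method, the same mechanism runs inside a DRL optimizer (PPO or DDPG+HER) rather than on a toy regression loss.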
Intuitively, HCP-I trains the policy π(st, vh) such that it not only learns a mapping from states to actions that maximizes the expected return, but also simultaneously finds a good representation for the robot hardware.

4.3 Algorithm

The hardware representation vector vh can be incorporated into many deep reinforcement learning algorithms by augmenting the state: ŝt ← st ⊕ vh. In this paper, we use PPO in environments with dense reward and DDPG + HER in environments with sparse reward. During training, a robot is randomly sampled in each episode from a pre-generated robot pool P filled with a large number of robots with different kinematics and dynamics. Alg. 1 provides an overview of our algorithm. The detailed algorithms are summarized in Appendix A (Alg. 2 for on-policy and Alg. 3 for off-policy).

1We elaborate on such environments in Section 5.1. We also experimentally show the effect of dynamics for transferring policies in such environments in Appendix C.1 and C.2.
2If the manipulators only differ in the lengths of the links, vh can simply be the vector of each link's length. When the kinematic structure and DOF also vary, a vh composed of link lengths is not enough.
3Pi is the relative pose from U to V. If i = -1, U = J0, V = robot base. If i = 0, 1, ..., n - 2, U = Ji+1, V = Ji. If i = n - 1, U = end effector, V = Jn-1.

Algorithm 1 Hardware Conditioned Policies (HCP)
  Initialize a RL algorithm Ψ    ▷ e.g. PPO, DDPG, DDPG+HER
  Initialize a robot pool P of size N with robots of different kinematics and dynamics
  for episode = 1:M do
    Sample a robot instance I ∈ P
    Sample an initial state s0
    Retrieve the robot hardware representation vector vh
    Run policy π in the environment for T timesteps
    Augment all states with vh: ŝ ← s ⊕ vh    ▷ ⊕ denotes concatenation
    for n = 1:W do
      Optimize actor and critic networks with Ψ via minibatch gradient descent
      if vh is to be learned (i.e., for implicit encoding, HCP-I) then
        update vh via gradient descent in the optimization step as well
      end if
    end for
  end for

5 Experimental Evaluation

Our aim is to demonstrate the importance of conditioning the policy on a hardware representation vh for transferring complicated policies between dissimilar robotic agents. We show performance gains in two diverse settings: manipulation and hopper.

5.1 Explicit Encoding

Robot Hardware: We created a set of robot manipulators based on the Sawyer robot in MuJoCo [35]. The basic robot types are shown in Figure 2. By permuting the chronology of revolute joints and ensuring the robot design is feasible, we designed 9 types of robots (named A, B, ..., I), of which the first four are 5 DOF, the next four are 6 DOF, and the last one is 7 DOF, following the main feasible kinematic designs described in the hardware design literature [36]. Each of these 9 robot types was further varied with different link lengths and dynamics.

Tasks: We consider reacher and peg-insertion tasks to demonstrate the effectiveness of the explicitly encoded representation. In reacher, the robot starts from a random initial pose and needs to move its end effector to a random target position. In peg insertion, a peg is attached to the robot gripper and the task is to insert the peg into the hole on the table. It is considered a success only if the peg bottom goes more than 0.03 m into the hole. Goals are described by the 3D target positions (xg, yg, zg) of the end effector (reacher) or peg bottom (peg insertion).

States and Actions: The states of both environments consist of the angles and velocities of all robot joints. 
The action is n-dimensional torque control over the n (n ≤ 7) joints. Since we consider robots with different DOF in this paper, we use zero-padding for robots with fewer than 7 joints to construct a fixed-length state vector across robots. The policy network always outputs 7-dimensional actions, but only the first n elements are used as the control command.

Robot Representation: As mentioned in Section 4.1, vh is explicitly constructed to represent the robot kinematic chain: vh = P-1 ⊕ P0 ⊕ ··· ⊕ Pn-1. We use zero-padding for robots with fewer than 7 joints to construct a fixed-length representation vector vh across robots.

Figure 2: Robots with different DOF and kinematic structures. The white rings represent joints. There are 4 variants each of the 5 and 6 DOF robots due to the different placements of joints. (a) A: 5 DOF, (b) B: 5 DOF, (c) C: 5 DOF, (d) D: 5 DOF, (e) E: 6 DOF, (f) F: 6 DOF, (g) G: 6 DOF, (h) H: 6 DOF, (i) I: 7 DOF.

Rewards: We use a binary sparse reward setting because sparse reward is more realistic in robotics applications, and we use DDPG+HER as the backbone training algorithm. The agent only gets a +1 reward if the POI is within ε Euclidean distance of the desired goal position; otherwise, it gets a -1 reward. We use ε = 0.02 m in all experiments. However, this kind of sparse reward setting encourages the agent to complete the task in as few time steps as possible to maximize the return in an episode, which encourages the agent to apply maximum torques on all the joints so that it can move fast. This is referred to as bang-bang control in control theory [37]. 
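The fixed-length explicit representation - per-joint relative poses Pi = di ⊕ ei, concatenated and zero-padded up to the 7-joint maximum - can be sketched as below. The joint offsets and Euler vectors are made-up placeholders, and conventions (pose ordering, Euler convention) are simplified relative to the paper's URDF-based construction:

```python
import numpy as np

MAX_JOINTS = 7   # policy I/O is sized for the largest robot
POSE_DIM = 6     # relative position (3) + Euler rotation vector (3)

def hardware_vector(rel_positions, rel_rotations):
    """Build vh = P_-1 ⊕ P_0 ⊕ ... ⊕ P_{n-1}, zero-padded to a fixed length.

    rel_positions: list of d_i = O_{i+1} - O_i, each a 3-vector
    rel_rotations: list of Euler rotation vectors e_i, each a 3-vector
    An n-joint arm contributes n+1 relative poses (base->J0, ..., J_{n-1}->end effector).
    """
    poses = [np.concatenate([d, e]) for d, e in zip(rel_positions, rel_rotations)]
    vh = np.concatenate(poses)
    pad = (MAX_JOINTS + 1) * POSE_DIM - vh.size
    return np.concatenate([vh, np.zeros(pad)])

# A made-up 5-DOF arm: 6 relative poses with placeholder offsets/rotations.
d = [np.array([0.0, 0.0, 0.1])] * 6
e = [np.array([0.0, 0.0, np.pi / 2])] * 6
vh = hardware_vector(d, e)
print(vh.shape)  # fixed length (48,) regardless of DOF
```

The zero-padded tail plays the same role as the zero-padded state vector: the policy network sees identically shaped inputs for every robot.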
Hence, we added action penalty\non the reward.\nMore experiment details are shown in Appendix B.\n\n5.1.1 Does HCP-E improve performance?\n\nTo show the importance of hardware information as input to the policy network, we experiment on\nlearning robotic skills among robots with different dynamics (joint damping, friction, armature, link\nmass) and kinematics (link length, kinematic structure, DOF). The 9 basic robot types are listed in\nFigure 2. We performed several leave-one-out experiments (train on 8 robot types, leave 1 robot\ntype untouched) on these robot types. The sampling ranges for link length and dynamics parameters\nare shown in Table 2 in Appendix B.1.1. We compare our algorithm with vanilla DDPG+HER\n(trained with data pooled from all robots) to show the necessity of training a universal policy network\nconditioned on hardware characteristics. Figure 3 shows the learning curves4 of training on robot\ntypes A-G and I. It clearly shows that our algorithm HCP-E outperforms the baseline. In fact,\nDDPG+HER without any hardware information is unable to learn a common policy across multiple\nrobots as different robots will behave differently even if they execute the same action in the same\nstate. More leave-one-out experiments are shown in Appendix C.4.\n\nFigure 3: Learning curves for multi-DOF setup. Training robots contain Type A-G and Type I robots\n(four 5-DOF types, three 6-DOF types, one 7-DOF type). Each type has 140 variants with different\ndynamics and link lengths. The 100 testing robots used to generate the learning curves are from the\nsame training robot types but with different link lengths and dynamics. (a): reacher task with random\ninitial pose and target position. (b): peg insertion with \ufb01xed hole position. (c): peg insertion with\nhole position (x, y, z) randomly sampled in a 0.2m box region. Notice that the converged success\nrate in (c) is only about 70%. 
This is because, when we randomly generate the hole position, some robots cannot actually insert the peg into the hole due to physical limits: some hole positions are not inside the reachable space (workspace) of the robot. This is especially common for 5-DOF robots.

5.1.2 Is HCP-E capable of zero-shot transfer to unseen robot kinematic structure?

We now perform testing in the leave-one-out experiments. Specifically, we test the zero-shot transfer ability of the policy network on new types of robots. Table 1 shows quantitative statistics of testing performance on robot types that differ from the training robot types. Each entry5 in the table is obtained by running the model on 1000 unseen test robots (averaged over 10 trials, 100 robots per trial) of that robot type but with different link lengths and dynamics.

4The learning curves are averaged over 5 random seeds on 100 testing robots, and shaded areas represent 1 standard deviation.
5The success rate is represented by the mean and standard deviation.

Table 1: Zero-shot testing performance on new robot types

Tasks                       | Training robot types | Testing robot type | Exp.  | Alg.      | Success rate (%)
Reacher (random goals)      | A-G + I              | H                  | I     | HCP-E     | 92.50 ± 1.96
                            |                      |                    | II    | DDPG+HER  | 0.20 ± 0.40
                            | A-D + F-I            | E                  | III   | HCP-E     | 88.00 ± 2.00
                            |                      |                    | IV    | DDPG+HER  | 2.70 ± 2.19
Peg insertion (fixed goal)  | A-G + I              | H                  | V     | HCP-E     | 92.20 ± 2.75
                            |                      |                    | VI    | DDPG+HER  | 0.00 ± 0.00
                            | A-D + F-I            | E                  | VII   | HCP-E     | 87.60 ± 2.01
                            |                      |                    | VIII  | DDPG+HER  | 0.80 ± 0.60
                            | A-H                  | I                  | IX    | HCP-E     | 65.60 ± 3.77
                            |                      |                    | X     | DDPG+HER  | 0.10 ± 0.30
Peg insertion (random goals)| A-G + I              | H                  | XI    | HCP-E     | 4.10 ± 1.50
                            |                      |                    | XII   | DDPG+HER  | 0.10 ± 0.30
                            | A-D + F-I            | E                  | XIII  | HCP-E     | 76.10 ± 3.96
                            |                      |                    | XIV   | DDPG+HER  | 0.00 ± 0.00
                            | A-H                  | I                  | XV    | HCP-E     | 23.50 ± 4.22
                            |                      |                    | XVI   | DDPG+HER  | 0.20 ± 0.40

From Table 1, it is clear that HCP-E still maintains high success rates when the policy is applied to new types of robots that were never used in training, while DDPG+HER barely succeeds at controlling new types of robots at all. The difference between using robot types A-G+I and A-D+F-I (both have four 5-DOF types, three 6-DOF types, and one 7-DOF type) is that robot type H is harder for the peg insertion task than robot type E due to its joint configuration (it removes joint J5). As we can see from Exp. I and V, HCP-E achieves about 90% zero-shot transfer success rate even when applied to the hard robot type H. Exp. IX and XV show the model trained with only 5-DOF and 6-DOF robots being applied to the 7-DOF robot type. We can see that it is able to achieve about 65% success rate in the peg insertion task with fixed goal6.

Zero-shot transfer to a real Sawyer robot: We show results on the multi-goal reacher task, as peg insertion required additional lab setup. Though the control frequency on the real robot is not as stable as in simulation, we still found a high zero-shot transfer rate. For quantitative evaluation, we ran three policies on the real robot with results averaged over 20 random goal positions. A used the policy from Exp. I (HCP-E), B used the policy from Exp. II (DDPG+HER), while C used the policy trained with the actual Sawyer CAD model in simulation with just randomized dynamics. The distances from target for the 20 trials are summarized in Figure 4. Despite the large reality gap7, HCP-E (BLUE) is able to reach the target positions with a high success rate (75%)8. 
DDPG+HER (RED) without hardware information was not even able to move the arm close to the desired position.

Figure 4: Testing distance distribution on a real Sawyer robot. A used the policy from Exp. I, B used the policy from Exp. II, and C used the policy trained with the actual Sawyer CAD model in simulation with randomized dynamics.

6Since we are using direct torque control without gravity compensation, the trivial transfer solution in which the network regards the 7-DOF robot as a 6-DOF robot by keeping one joint fixed does not exist here.
7The reality gap is further exaggerated by the fact that we did not do any form of gravity compensation in the simulation, while the real-robot tests used gravity compensation to make tests safer.
8The HCP-E policy resulted in jerky motion on the real Sawyer robot when reaching the target positions. This was because we used sparse reward during training; it could be mitigated with better reward design to enforce smoothness.

Fine-tuning the zero-shot policy: Table 1 also shows that Exp. XI and Exp. XV have relatively low zero-shot success rates on new types of robots. Exp. XI is trained on easier 6-DOF robots (E, F, G) and applied to a harder 6-DOF robot type (H). Exp. XV is trained only on 5-DOF and 6-DOF robots and applied to 7-DOF robots (I). The hole positions are randomly sampled in both experiments. Even though the success rates are low, HCP-E is actually able to move the peg bottom close to the hole for most testing robots, while DDPG+HER is much worse, as shown in Figure 5a. We also fine-tune the model specifically on the new robot type for these two experiments, as shown in Figures 5b and 5c. It is clear that even though the zero-shot success rates are low, the model can quickly adapt to the new robot type when starting from the pretrained model weights.

Figure 5: (a): Distribution (violin plots) of the distance between the peg bottom at the end of the episode and the desired position. The three horizontal lines in each violin plot are the lower extremum, median, and upper extremum. The plot clearly shows that HCP-E moves the pegs much closer to the hole than DDPG+HER. (b): The brown curve is the learning curve of training HCP-E from scratch on robot type H with different link lengths and dynamics in the multi-goal setup. The pink curve is the learning curve of training HCP-E on the same robots starting from the pretrained model from Exp. XI. (c): Similar to (b); the training robots are robot type I (7 DOF) and the pretrained model is from Exp. XV. (b) and (c) show that applying a pretrained model trained on different robot types to a new robot type can accelerate learning by a large margin.

5.2 Implicit Encoding

Environment: HCP-E shows remarkable success in transferring manipulator tasks to different types of robots. However, in explicit encoding we only condition on the kinematics of the robot hardware. For unstable systems in robotics, such as legged locomotion with its frequent nonlinear contacts [13], it is crucial to consider robot dynamics as well. We propose to learn an implicit, latent encoding (HCP-I) for each robot without actually using any kinematics or dynamics information. We evaluate the effectiveness of HCP-I on the 2D hopper [38]. Hopper is an ideal environment as it is an unstable system with sophisticated second-order dynamics involving contact with the ground. We demonstrate that adding an implicitly learned robot representation can lead to performance comparable to the case where we know the ground-truth kinematics and dynamics. 
To create robots with different kinematics and dynamics, we varied the length and mass of each hopper link as well as the damping, friction, and armature of each joint, as shown in Table 5 in Appendix B.

Performance We compare HCP-I with HCP-E, HCP-E augmented with ground-truth dynamics (HCP-E+Dyn), and a vanilla PPO model whose states are not augmented with any kinematics or dynamics information. As shown in Figure 6a, HCP-I outperforms the baseline (PPO) by a large margin. In fact, with the robot representation vh learned automatically, HCP-I achieves performance comparable to HCP-E+Dyn, which uses both kinematics and dynamics information; this suggests that vh captures the kinematics and dynamics implicitly. Since dynamics plays a key role in hopper performance, as can be seen from the gap between HCP-E and HCP-E+Dyn, the implicit encoding obtains a much higher return than the explicit encoding. The reason is that the implicit encoding can automatically learn a robot hardware representation that includes both kinematics and dynamics information, whereas the explicit encoding can only capture kinematics, since dynamics information is generally unavailable.

Transfer Learning on New Agents We now apply the learned HCP-I model as a pretrained model to new robots. However, since vh for a new robot is unknown, we fine-tune the policy parameters and also estimate vh. As shown in Figure 6b, HCP-I with pretrained weights learns much faster than HCP-I trained from scratch.
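A toy sketch of the adaptation step (again, the linear policy, the surrogate objective, and the decision to freeze the shared weights are illustrative simplifications — the paper fine-tunes the policy parameters as well): on a new robot, a fresh vh is initialized and estimated by gradient descent from the robot's own interaction data.

```python
import numpy as np

rng = np.random.default_rng(0)
S, H, A = 4, 3, 2   # state, embedding, action dims (toy sizes)

# "Pretrained" shared policy weights; here we keep them frozen and adapt
# only the new robot's embedding (the paper also fine-tunes the policy).
W = rng.normal(size=(A, S + H))

def policy(state, vh):
    return W @ np.concatenate([state, vh])

# The new robot's behaviour is summarised by a hidden "true" embedding the
# agent never observes; it must be recovered from interaction data.
vh_true = rng.normal(size=H)
states = rng.normal(size=(32, S))
targets = np.array([policy(s, vh_true) for s in states])

def loss(vh):
    preds = np.array([policy(s, vh) for s in states])
    return float(np.mean((preds - targets) ** 2))

vh = np.zeros(H)          # fresh embedding for the unseen robot
W_vh = W[:, S:]           # columns of W that multiply vh
initial = loss(vh)
for _ in range(1000):
    preds = np.array([policy(s, vh) for s in states])
    grad = 2 * W_vh.T @ np.mean(preds - targets, axis=0)
    vh -= 0.02 * grad     # gradient descent on vh alone; W stays fixed
final = loss(vh)
print(final < initial)  # True: the surrogate loss shrinks as vh is estimated
```

Because only a low-dimensional vector is estimated here, adaptation needs far fewer interactions than relearning the full policy, which is the intuition behind the speedup in Figure 6b.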
While we do not show explicit few-shot results in the current version, one can train a regression network to predict the trained vh from the agent's limited interaction with the environment.

Figure 6: (a): Learning curves using 1000 hoppers with different kinematics and dynamics in training. HCP-I automatically learns a robot representation good enough that its learning performance is on par with HCP-E+Dyn, which uses the ground-truth kinematics and dynamics values, and HCP-I performs much better than vanilla PPO. (b): If we reuse the pretrained HCP-I model from (a) (only the hidden layers) on 100 new hoppers, HCP-I with pretrained weights learns much faster than training from scratch. (c): Embedding visualization. The colorbar shows the hopper torso mass; the embedding is smooth, as the color transition is smooth.

Embedding Smoothness We found the implicit encoding vh to be a smooth embedding space over the dynamics. For example, we vary only one parameter (the torso mass) and plot the resulting embedding vectors. Note that we reduce the dimension of vh to 2 since we vary only the torso mass. Figure 6c shows a smooth transition over torso mass (the color bar represents the torso mass value for 1000 hoppers with different torso masses), where robots with similar mass are clustered together.

6 Conclusion

We introduced a novel framework of Hardware Conditioned Policies for multi-robot transfer learning. To represent the hardware properties as a vector, we propose two methods depending on the task: explicit encoding (HCP-E) and implicit encoding (HCP-I).
HCP-E works well when the task policy does not depend heavily on agent dynamics. It has the obvious advantage that the policy network can be transferred to new robots in a zero-shot fashion. Even when zero-shot transfer gives a low success rate, we showed that HCP-E still brings the agent very close to the goal and adapts to new robots quickly with fine-tuning. When the robot dynamics is complicated enough that feeding dynamics information into the policy network improves learning, explicit encoding is no longer sufficient: it can only encode kinematic information, and dynamics information is usually difficult and expensive to acquire. To handle such cases, we propose an implicit encoding scheme (HCP-I) that learns the hardware embedding representation automatically via back-propagation. We showed that HCP-I, without using any kinematics or dynamics information, achieves performance on par with a model that uses both ground-truth kinematics and dynamics information.

Acknowledgement

This research is partly sponsored by ONR MURI N000141612007 and the Army Research Office and was accomplished under Grant Number W911NF-18-1-0019. Abhinav was supported in part by a Sloan Research Fellowship and Adithya was partly supported by an Uber Fellowship. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
The authors would also like to thank Senthil Purushwalkam and Deepak Pathak for feedback on the early draft and Lerrel Pinto and Roberto Shu for insightful discussions.

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[2] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[3] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015. URL http://arxiv.org/abs/1509.02971.

[4] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. arXiv preprint arXiv:1710.06537, 2017.

[5] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.

[6] Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2169–2176. IEEE, 2017.

[7] Justin Fu, Sergey Levine, and Pieter Abbeel.
One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 4019–4026. IEEE, 2016.

[8] Lerrel Pinto and Abhinav Gupta. Learning to push by grasping: Using multiple tasks for effective learning. ICRA, 2017.

[9] Andrei Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. 2016. URL https://arxiv.org/abs/1606.04671.

[10] Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. CASSL: Curriculum accelerated self-supervised learning. ICRA, 2018.

[11] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. EPOpt: Learning robust neural network policies using model ensembles. ICLR, 2017. URL http://arxiv.org/abs/1610.01283.

[12] A Nilim and L El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. In Operations Research, pages 780–798, 2005.

[13] Ajay Mandlekar, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. IROS, 2017.

[14] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. 2017. URL https://arxiv.org/abs/1703.06907.

[15] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.

[16] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators.
In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1312–1320, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/schaul15.html.

[17] Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv preprint arXiv:1703.02949, 2017.

[18] Botond Bocsi, Lehel Csató, and Jan Peters. Alignment-based transfer learning for robot models. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1–7. IEEE, 2013.

[19] Samuel Barrett, Matthew E Taylor, and Peter Stone. Transfer learning for reinforcement learning on a physical robot. In Ninth International Conference on Autonomous Agents and Multiagent Systems - Adaptive Learning Agents Workshop (AAMAS-ALA), 2010.

[20] Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. RSS, 2017.

[21] Mohamed K. Helwa and Angela P. Schoellig. Multi-robot transfer learning: A dynamical system perspective. 2017. URL https://arxiv.org/abs/1707.08689.

[22] Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. NerveNet: Learning structured policy with graph neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1sqHMZCb.

[23] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242, 2018.

[24] Igor Mordatch, Kendall Lowrey, and Emanuel Todorov.
Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 5307–5314. IEEE, 2015.

[25] Jean-Jacques E Slotine and Weiping Li. Adaptive manipulator control: A case study. IEEE Transactions on Automatic Control, 33(11):995–1003, 1988.

[26] Hae-Won Park, Koushil Sreenath, Jonathan Hurst, and J. W. Grizzle. System identification and modeling for MABEL, a bipedal robot with a cable-differential-based compliant drivetrain. In Dynamic Walking Conference (DW), MIT, July 2010.

[27] Umashankar Nagarajan, Anish Mampetta, George A Kantor, and Ralph L Hollis. State transition, balancing, station keeping, and yaw control for a dynamically stable single spherical wheel mobile robot. In Robotics and Automation, 2009. ICRA'09. IEEE International Conference on, pages 998–1003. IEEE, 2009.

[28] Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 1–8. ACM, 2006.

[29] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[30] Gregory Kahn, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. PLATO: Policy learning using adaptive trajectory optimization. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3342–3349. IEEE, 2017.

[31] Wenhao Yu, C Karen Liu, and Greg Turk. Preparing for the unknown: Learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453, 2017.

[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
URL http://arxiv.org/abs/1707.06347.

[33] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. CoRR, abs/1707.01495, 2017. URL http://arxiv.org/abs/1707.01495.

[34] ROS URDF. http://wiki.ros.org/urdf.

[35] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

[36] Xijian Huo, Yiwei Liu, Li Jiang, and Hong Liu. Design and development of a 7-DOF humanoid arm. In Robotics and Biomimetics (ROBIO), 2012 IEEE International Conference on, pages 277–282. IEEE, 2012.

[37] Zvi Artstein. Discrete and continuous bang-bang and facial spaces or: Look for the extreme points. SIAM Review, 22(2):172–185, 1980.

[38] Tom Erez, Yuval Tassa, and Emanuel Todorov. Infinite-horizon model predictive control for periodic tasks with contacts. Robotics: Science and Systems VII, page 73, 2012.