{"title": "Learning-In-The-Loop Optimization: End-To-End Control And Co-Design Of Soft Robots Through Learned Deep Latent Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 8284, "page_last": 8294, "abstract": "Soft robots have continuum solid bodies that can deform in an infinite number of ways. Controlling soft robots is very challenging as there are no closed form solutions. We present a learning-in-the-loop co-optimization algorithm in which a latent state representation is learned as the robot figures out how to solve the task. Our solution marries hybrid particle-grid-based simulation with deep, variational convolutional autoencoder architectures that can capture salient features of robot dynamics with high efficacy. We demonstrate our dynamics-aware feature learning algorithm on both 2D and 3D soft robots, and show that it is more robust and faster converging than the dynamics-oblivious baseline. We validate the behavior of our algorithm with visualizations of the learned representation.", "full_text": "Learning-In-The-Loop Optimization: End-To-End\nControl And Co-Design of Soft Robots Through\n\nLearned Deep Latent Representations\n\nCSAIL\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nAndrew Spielberg, Allan Zhao, Tao Du, Yuanming Hu, Daniela Rus, Wojciech Matusik\n\naespielberg@csail.mit.edu, azhao@mit.edu, taodu@csail.mit.edu\n\nyuanming@mit.edu, rus@csail.mit.edu, wojciech@csail.mit.edu\n\nAbstract\n\nSoft robots have continuum solid bodies that can deform in an in\ufb01nite number\nof ways. Controlling soft robots is very challenging as there are no closed form\nsolutions. 
We present a learning-in-the-loop co-optimization algorithm in which a latent state representation is learned as the robot figures out how to solve the task. Our solution marries hybrid particle-grid-based simulation with deep, variational convolutional autoencoder architectures that can capture salient features of robot dynamics with high efficacy. We demonstrate our dynamics-aware feature learning algorithm on both 2D and 3D soft robots, and show that it is more robust and faster converging than the dynamics-oblivious baseline. We validate the behavior of our algorithm with visualizations of the learned representation.

Figure 1: Our algorithm learns a latent representation of robot state which it uses as input for control. Above are velocity field snapshots of a soft 2D biped walker moving to the right (top), the corresponding latent representations (middle), and their reconstructions (bottom) from our algorithm. In each box, the x (left) and y (right) components of the velocity fields are shown; red indicates negative values, blue positive.

1 Introduction

Recent breakthroughs have demonstrated capable computational methods for both controlling (Heess et al. [2017], Schulman et al. [2017], Lillicrap et al. [2015]) and designing (Ha et al. [2017], Spielberg et al., Wampler and Popović [2009]) rigid robots. However, control and design of soft robots have been explored comparatively little due to the incredible computational complexity they present. Due to their continuum solid bodies, soft robots' state dimensionality is inherently infinite. High, but finite dimensional approximations such as finite elements can provide robust and accurate forward simulations; however, such representations have thousands or millions of degrees of freedom, making them ill-suited for most control tasks.
To date, few compact, closed-form models exist for describing soft robot state, and none apply to the general case. In this paper, we address the problem of learning low-dimensional robot state while simultaneously optimizing robot control and/or material parameters. In particular, we require a representation applicable to physical control of real-world robots.

We propose a computer vision-inspired approach which makes use of the robot's observed dynamics in learning a compact observation model for soft robots. Our task-centric method interleaves controller (and material) optimization with learning low-dimensional state representations.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our "learning-in-the-loop optimization" method is inspired by recent advances in hybrid particle-grid-based differentiable simulation techniques and deep, unsupervised learning techniques. In the learning phase, simulation grid data is fed into a deep, variational convolutional autoencoder to learn a compact latent state representation of the soft robot's motion. In the optimization phase, the learned encoder function creates a compact state description to feed into a parametric controller; the resulting, fully differentiable representation allows for backpropagating through an entire simulation and directly optimizing a simulation loss with respect to controller and material parameters. Because learning is interleaved with optimization, learned representations are catered to the task, robot design (including, e.g., discrete actuator placement), and environment at hand, and not just the static geometry of the soft structure.
Because of our judicious choice of a physics engine which operates (in part) on a grid, we are able to easily employ modern, deep learning architectures (convolutional neural networks) to extract robust low-dimensional state representations while providing a representation amenable to real-world control through optical flow. Because of our fully-differentiable representation of the controller, observations, and physics, we can directly co-design robot performance.

To our knowledge, our pipeline is the first end-to-end method for optimizing soft robots without the use of a pre-chosen, fixed representation, minimizing human overhead. In this paper, we contribute: 1) an algorithm for control and co-design of soft robots without the need for manual feature engineering; 2) experiments on five model robots evaluating our system's performance compared to baseline methods; 3) visualizations of the learned representations, validating the efficacy of our learning procedure.

2 Related Work

Dimensionality Reduction for Control A compact, descriptive latent space is crucial for tractably modeling and controlling soft robots. Methods for extracting and employing such spaces for control typically fall into two categories: a) analytical methods, and b) learning-based methods.

Analytical methods examine the underlying physics and geometry of soft structures in order to extract an optimal subspace for capturing low-energy (likely) deformations. Most popular among these methods are modal bases [Sifakis and Barbic, 2012], formed by solving a generalized eigenvalue problem based on the harmonic dynamics of a system. These methods suffer from inadequately modeling actuation, contact, and tasks, and only represent a linear approximation of the system's dynamics. Still, such representations have been successfully applied to real-time linear control (LQR) in Barbič and Popović [2008] and Thieffry et al. [2018], and (with some human labeling) animation [Barbič et al., 2009], but lack the physical accuracy needed for physical fabrication. In another line of work, Chen et al. [2017] presented a method for using analytical modal bases in order to reduce the degrees of freedom of a finite element system for faster simulation while maintaining physical accuracy. However, the resulting number of degrees of freedom is still impractical for most modern control algorithms. For the specific case of soft robot arms, geometrically-inspired reduced coordinates may be employed. Della Santina et al. [2018] developed a model for accurately and compactly describing the state of soft robot arms by exploiting segment-wise constant curvature of arms.

Learning-based methods, by contrast, use captured data in order to learn representative latent spaces for control. Since these representations are derived from robot simulations or real-world data, they can naturally handle contact and actuation, and be catered to the task. Goury and Duriez [2018] demonstrated some theoretical guarantees on how first-order model reduction techniques could be applied to motion planning and control for real soft robots. As drawbacks, their method is catered to FEM simulation, requires a priori knowledge of how the robot will move, and representations are never re-computed, making it ill-suited to co-design where dynamics can change throughout optimization.

Two works from different domains have similarities to ours. Ma et al. [2018] applied deep learning of convolutional autoencoders in the context of controlling rigid bodies with directed fluids. Our algorithm shares high-level similarities, but operates in the domain of soft robot co-optimization and exploits simulation differentiability for fast convergence. Amini et al.
[2018] employed latent representations for autonomous vehicle control in the context of supervised learning on images.

Co-Design of Soft Robots There exist two main threads of work in which robots are co-designed over morphology and control: gradient-free and gradient-based. Most of the work in model-free co-optimization of soft robots is based on evolutionary algorithms. Cheney et al. [2013], Corucci et al. [2016], and Cheney et al. [2018] have demonstrated task-specific co-optimization of soft robots over materials, actuators, and topology. These approaches are less susceptible to local minima than gradient-based approaches but are vastly more sample-inefficient. For instance, a single evolved robot in Cheney et al. [2013] requires 30,000 forward simulations; by comparison, robots in our work are optimized in the equivalent of 400 simulations (treating each gradient calculation as equal to 3 forward simulations). Further, their approach was limited to simple open-loop controllers tied to form, while ours solves for more robust, closed-loop control.

While some algorithms exist for gradient-based co-optimization of rigid robots (Wampler and Popović [2009], Spielberg et al., Ha et al. [2017], Wang et al. [2019]), results in model-based co-optimization of soft robots have been sparse. Closest to our work, Hu et al. [2019] presented a method for gradient-based co-optimization of soft robotic arms using a fully-differentiable simulator based on the material point method (MPM). As a limitation, their method relied on ad-hoc features that needed to be labeled at the time the robot topology was specified and could not be easily measured in the physical world, making them ill-suited for physical control tasks. Our work addresses this shortcoming.

Figure 2: At each step of our simulation, the following procedure is performed. First, the unreduced state is fed into an observer function (centroids of a segmentation, as in Hu et al. [2019], or, as we demonstrate in this paper, an automatically learned latent space). Regardless, the observer outputs features to be processed by a parametric controller, which converts these features to actuation signals. Finally, the actuation is fed into our MPM simulator, which performs a simulation step. The entire pipeline is differentiable and therefore we can compute derivatives with respect to design variables even when executing the workflow for many steps.

3 Overview and Preliminaries

We seek an algorithm for co-optimizing soft robots over control and design parameters without manually prescribing a state description for the controller to observe. Our solution will be to periodically learn an updated observation model from the simulation data generated during optimization. For the remainder of this paper, we refer to the dimensionally reduced representation of the soft robot as the latent representation and the unreduced representation as the full representation. To avoid confusion, we use the term learning to refer to the procedure of learning a latent representation and the term optimization to exclusively refer to the procedure of improving a robot's controller or design.

A full overview of our system is shown in Fig. 2. At each time step, the full representation is fed into a (learned) observer function, which reduces it down to a latent representation. The latent representation is fed into an (optimized) controller function, which produces control signals for the robot's actuators. Those control signals are applied to the full robot state to simulate the robot forward one time step, producing the next full state. At the end of the simulation, a specified final loss function L is computed.
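This observe-control-step loop can be sketched with toy stand-ins for the observer, controller, step function, and loss (hypothetical names and dynamics, not the authors' simulator):

```python
# Sketch of the observe-control-step loop described above. The observer,
# controller, and dynamics here are toy scalar stand-ins for illustration.

def observer(full_state):
    # O: reduce the full state to a low-dimensional latent code;
    # here, just the mean and max of a toy particle-velocity list.
    return [sum(full_state) / len(full_state), max(full_state)]

def controller(latent):
    # C: map the latent code to a single actuation signal.
    return 0.1 * latent[0] - 0.05 * latent[1]

def step(full_state, actuation):
    # f: advance the full state by one time step (toy dynamics).
    return [v + actuation for v in full_state]

def simulate(initial_state, num_steps, loss):
    # S: roll out the pipeline, then evaluate the final loss L.
    state = initial_state
    for _ in range(num_steps):
        u = controller(observer(state))
        state = step(state, u)
    return loss(state)

final_loss = simulate([0.5, 1.0, 1.5], num_steps=10,
                      loss=lambda s: sum(v * v for v in s))
```

In the real system each of these stages is differentiable, so the same composition can be backpropagated through end to end.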
Direct optimization of this loss function over controller and physical design parameters is possible since each component of our system, including our simulator, is differentiable.

Formally, let υ_t ∈ R^u denote a robot's full state at time t, let q_t ∈ R^r denote the corresponding latent state, and let u_t ∈ R^m denote the actuation control signal at time t. The observer function O : R^u → R^r maps a full state to a latent state and is governed by observer parameters Θ. The controller function C : R^r → R^m maps a reduced state to deterministic actuation output and is governed by control parameters θ. The simulation step function f : R^u × R^m → R^u time-steps the system for some specified Δt given the full state and the actuation, and is governed by physical design parameters φ. In other words, υ_{t+Δt} = f(υ_t, C(O(υ_t; Θ); θ); φ). For brevity, we will omit writing the parameterizations explicitly except when necessary. The final loss L : R^u → R at final time T operates on some final full state υ_T to produce a scalar loss; this could be the distance the robot has traveled, its final velocity, etc.; anything that depends on the robot's final state. Computing L amounts to iteratively applying f to generate states υ_0, υ_Δt, υ_2Δt, ..., υ_T and applying L to the final one. We use S to denote the full process of simulating a robot and then computing the loss value l; in other words, l = L(υ_T) = S(L, υ_0) for some simulation time length T. The chain rule can then be used to backward-propagate through this chain of functions to compute gradients for optimization. In our algorithm, the learning phase learns Θ while the optimization step optimizes φ and θ:

    minimize over θ, φ:   L(υ_T)
    where                 ∀t, υ_{t+Δt} = f(υ_t, C(O(υ_t; Θ); θ); φ)
    subject to            φ_min ≤ φ ≤ φ_max

Though Θ is not part of the optimization, it is an auxiliary variable that must be learned in tandem.

Figure 3: Left: The architecture of our convolutional variational autoencoder. The autoencoder takes in 2-channel pixel grid data as input, with each channel representing the x or y velocity field at that pixel. We use five layers of strided convolutions followed by ReLU operations. This is followed by a final flattening operation which coalesces the weights into latent variables. The latent variables parameterize Gaussians, used in our variational loss (Lv). The architecture is mirrored on the opposite side (without a final ReLU, to allow for negative outputs). The 3D version is completely analogous, but takes in 3-channel voxel grid velocity field data and applies 3D convolutions. Above, the filter sizes are specified with K, and the strides are specified with S. Right: At inference time, simulation data is fed into the encoder, E, which produces a latent [µ, σ]† vector. The mean µ variables are then fed as inputs to the controller, C.

4 Method

Simulation and Data Generation We use a simulator based on ChainQueen [Hu et al., 2019], the differentiable Moving Least Squares Material Point Method (MLS-MPM) [Hu et al., 2018] simulator, for the underlying physics of robots. In ChainQueen, robots are represented as collections of particles, and a background velocity grid is used for particle interaction and is exposed to users. ChainQueen also provides analytical gradients of functions of simulation trajectories with respect to controller parameters and robot design.
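To illustrate the kind of grid data involved, here is a toy particle-to-grid velocity transfer using bilinear weights. This is an illustrative assumption for brevity, not ChainQueen's implementation (MLS-MPM itself uses quadratic B-spline kernels):

```python
# Toy 2D particle-to-grid velocity transfer. Particle momenta and masses
# are scattered to nearby grid nodes; the resulting velocity field is the
# image-like input consumed by the convolutional autoencoder.

def particles_to_grid(particles, nx, ny, dx):
    """particles: list of (x, y, vx, vy, mass) tuples.
    Returns an nx-by-ny grid of (vx, vy) velocity samples."""
    mom_x = [[0.0] * ny for _ in range(nx)]
    mom_y = [[0.0] * ny for _ in range(nx)]
    mass = [[0.0] * ny for _ in range(nx)]
    for (x, y, vx, vy, m) in particles:
        i, j = int(x / dx), int(y / dx)   # lower-left grid node
        fx, fy = x / dx - i, y / dx - j   # fractional offsets in the cell
        for di in (0, 1):
            for dj in (0, 1):
                # Bilinear weight of this particle on node (i+di, j+dj).
                w = ((1 - fx) if di == 0 else fx) * ((1 - fy) if dj == 0 else fy)
                mom_x[i + di][j + dj] += w * m * vx
                mom_y[i + di][j + dj] += w * m * vy
                mass[i + di][j + dj] += w * m
    # Divide momentum by mass to recover nodal velocities.
    return [[(mom_x[i][j] / mass[i][j], mom_y[i][j] / mass[i][j])
             if mass[i][j] > 0 else (0.0, 0.0)
             for j in range(ny)] for i in range(nx)]

grid = particles_to_grid([(0.25, 0.25, 1.0, 0.0, 1.0)], nx=4, ny=4, dx=0.5)
```

The per-node velocities form exactly the sort of dense, translation-equivariant field a convolutional encoder operates on naturally.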
Our algorithm is not simulator-specific, and can operate on any fully differentiable simulator from which differentiable grid velocity data can be extracted.

In the remainder of this section we describe our learning-in-the-loop optimization algorithm. First, we assume we have a large dataset of robot motion data from the simulator, representative of the way the robot will move when completing the prescribed task, and describe the learning phase of the algorithm. Next, we describe how we use the simulation data to optimize the controller and design. Finally, we describe how to combine these two phases into a cohesive, complete algorithm.

4.1 Learning

During the learning phase, we seek to learn a compact, expressive representation of robot state to feed to the controller. As input, learning takes in snapshots of robot simulation; namely, the robot's velocity field on a background grid. Note that this field implicitly also provides robot positional information. As output, weights for an observer function with a descriptive latent space are learned. In particular, we learn a variational autoencoder [Kingma and Welling, 2013, Rezende et al., 2014] that takes, as input, a state description of a robot and minimizes the reconstruction cost of said state. Our assumption is that features which allow reconstruction are highly expressive. Fig. 3 presents the architecture we used for all experiments. We experimented with various network depths; ours was chosen for stability and generality across experiments. We use a convolutional architecture, which operates naturally on input velocity grid data and generalizes to robot translation due to equivariance.

Formally, for an unreduced u-dimensional object, let E : R^u → R^r be an encoder function with parameter weights Θ_E, and D : R^r → R^u be a decoder function with parameter weights Θ_D. For an input training dataset of grid velocity data Υ, the reconstruction loss is defined as:

    L_R(Υ) = (1 / |Υ|) Σ_{υ ∈ Υ} ‖D_{Θ_D}(E_{Θ_E}(υ)) − υ‖₂²

We omit the details of the VAE formulation, which adds a variational representation and regularization term. For a more extensive treatment, please refer to [Doersch, 2016]. We minimize this loss by performing mini-batch stochastic gradient descent on our input dataset. For updates, we employ the Adam [Kingma and Ba, 2014] first-order optimizer. We also experimented with a non-variational autoencoder. However, in the majority of experiments, that network overfit to a 1D manifold. This caused control optimization to fail, quickly resulting in unpredictable, erratic behaviors. The regularization from the variational formulation is necessary for avoiding collapse in the latent space.

4.2 Optimization

Our optimization procedure is similar to that of Hu et al. [2019]. At each optimization iteration, we compute ∇_{θ,φ}L, providing a gradient of our loss with respect to all of our decision variables. We then use this gradient to apply a gradient descent update step to our parameters. Finally, we account for potential bounds on our design variables (e.g., maximum and minimum Young's modulus) by projecting the variables back to the feasible region; i.e., φ_i ← max(min(φ_i, φ_max), φ_min). In practice, we use the Adam optimizer. At each iteration of the optimization, during forward simulation, we record snapshots of the grid data to be used in the learning phase.

4.3 Algorithm

Our algorithm is an alternating minimization.
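At a high level, the alternation can be sketched with scalar stand-ins for the controller, observer, and simulator (a hypothetical toy, not the paper's implementation; the real system trains a convolutional VAE and differentiates through an MPM simulator):

```python
# Toy sketch of the alternating minimization: optimize the controller for a
# few iterations while recording "snapshots", then retrain a copy of the
# observer on the replay buffer, then apply a smoothed target update.

def alternate(theta=0.0, observer=0.5, episodes=3, opt_iters=5,
              lr=0.2, alpha=0.5):
    replay_buffer = []
    for _ in range(episodes):
        # Optimization phase: gradient steps on the controller parameter.
        for _ in range(opt_iters):
            out = observer * theta            # controller acting via observer
            grad = 2.0 * (out - 1.0) * observer  # d/dtheta of (out - 1)^2
            replay_buffer.append(out)         # record a "snapshot"
            theta -= lr * grad
        # Learning phase: train a COPY of the observer (toy update rule
        # standing in for VAE training on the replay buffer).
        obs_copy = observer
        for _ in range(3):
            obs_copy += 0.1 * (1.0 - obs_copy)
        # Smoothed target update toward the retrained copy.
        observer = alpha * obs_copy + (1 - alpha) * observer
    return theta, observer

theta_opt, obs_opt = alternate()
```

Even in this toy, the controller re-adapts after each observer update, which is the behavior the smoothed update and replay buffer below are designed to stabilize.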
First, we optimize the robot controller and design parameters for a fixed number of iterations, during which we record snapshots of grid velocities. Then, we use these grid velocities to learn an observer for a fixed number of iterations. With the observer and latent representation improved, we return to our optimization procedure, and keep alternating until convergence. The initial grid velocity dataset is generated by simulating just once with the initial, untrained controller. This is enough to bootstrap our learning. Two key design decisions are discussed below:

Alternating vs. Simultaneous Minimization Learning a descriptive latent encoding is harder than optimizing the controller, and therefore requires more minimization iterations. When trained simultaneously, the controller tends to get trapped in a local minimum under a non-descriptive latent space. Performance-wise, evaluating ∇_Θ L is orders of magnitude more expensive than evaluating ∇_Θ L_R, since it requires backpropagating through an entire simulation. The alternating scheme allows us to economically draw a large number of historical snapshots, and to evaluate only ∇_Θ L_R for the autoencoder training.

Continuous vs. One-Shot Autoencoder Training Since robot dynamics change with changing control and design, continual retraining is critical. See, for example, Fig. 4 a. (robot arm control). With one-shot autoencoder training using only initial motion, the autoencoder only disambiguates motions of a mostly static arm, and optimization fails.

Our algorithm has no obvious guarantee of convergence; here, we describe three specific ways this algorithm can theoretically fail, and the steps we take to make our algorithm work reliably in practice.

Overfitting to Historical Snapshots It is important to fit to an entire trajectory, and not just the trajectory's individually captured historical snapshots.
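The patience-based stopping rule we rely on for autoencoder training can be sketched as follows (a generic early-stopping sketch with illustrative names, not the authors' code):

```python
# Patience-based early stopping on a held-out validation set: stop once the
# validation loss has failed to improve for `patience` consecutive iterations.

def train_with_early_stopping(val_losses, patience=3):
    """val_losses: validation losses observed after each training iteration.
    Returns the iteration (1-indexed) at which training stops."""
    best = float("inf")
    bad_iters = 0
    for it, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:
            best = val_loss
            bad_iters = 0
        else:
            bad_iters += 1
            if bad_iters >= patience:
                return it              # no improvement for `patience` iters
    return len(val_losses)             # ran to completion

stopped_at = train_with_early_stopping([1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.71])
```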
Overfitting the autoencoder to history will degrade feature quality on future scenarios. Therefore, we employ early stopping to be conservative with autoencoder training. Before training the autoencoder, we evenly split the snapshots into a training and a validation set. We stop training early when the validation loss has remained worse than the best seen loss value for a certain number of consecutive iterations.

Overfitting to Recent Trajectories The autoencoder tends to prioritize learning the most recent trajectories, harming generalization to future snapshots. To remedy this problem, we maintain an experience replay buffer [Mnih et al., 2015] of snapshots from multiple simulations. We use all snapshots currently in the replay buffer to train the autoencoder. This increases the diversity of autoencoder training inputs, and stabilizes our algorithm against a changing controller.

Algorithm 1 Learning-In-The-Loop Co-Optimization

Hyperparameters: max episodes K, max optimization iterations M, max learning iterations N, minibatch size b, max replay buffer size B, target update step size α, latent space dimensionality r, early-stopping patience q.
Given: user-specified robot morphology R, loss function L, design parameter bounds φ_min, φ_max, initial design parameters φ_0, and initial full state υ_0.
Randomly initialize network weights θ and Θ (with latent space of dimension r), and initialize autoencoder copy Θ′ ← Θ.
Initialize empty replay buffer I ← [] with maximum size B.
for episode i = 1 ... K do
    for optimization iteration j = 1 ... M do
        Compute loss l_j and simulation snapshots Υ_j: l_j, Υ_j = S(L, υ_0).
        Store snapshots Υ_j in I.
        Update θ, φ using the analytical simulation loss gradients ∇_{θ,φ}L.
        Clamp physical design variables φ_i ← max(min(φ_i, φ_max), φ_min).
    end for
    Split I randomly and evenly into training set I_τ and validation set I_v.
    for learning iteration j = 1 ... N do
        for minibatch 1 ... len(I_τ)/b do
            Sample a minibatch of size b from I_τ (without replacement).
            Update Θ′ using the analytical autoencoder loss gradients ∇_{Θ′} L_R on the minibatch.
        end for
        Compute validation loss ℓ_j = L_R(I_v).
        if ℓ_j has not decreased for q iterations then
            Break (early stopping).
        end if
    end for
    Update autoencoder weights using target: Θ ← αΘ′ + (1 − α)Θ.
end for
Return: R with optimized design φ and controller θ.

Feature Oscillation Despite our precautions thus far, the autoencoder can still change rapidly, which injects instability into the controller optimization. Inspired by the smoothed target network update scheme of Lillicrap et al. [2015], we perform learning on a copy of the autoencoder network weights, and use the original autoencoder network throughout an episode. After each episode, we step the original autoencoder toward the updated copy Θ′; i.e., Θ ← αΘ′ + (1 − α)Θ.

We combine these refinements with our learning and optimization phases in our final algorithm (see Alg. 1 for details).

5 Results and Discussion

In this section, we summarize the results of our experiments on four of our model robots: the 2D Biped, 2D Arm, 2D "Elephant," and 2D "Bunny," and briefly describe demonstrations on four further robots: the 2D Rhino, 3D Quadruped, 3D Curved Quadruped, and 3D Hexapod. Robot morphologies can be seen in Figs.
4 and 6. The Biped and its variants are co-design experiments, while the others are pure control, as we found co-design matters much more for locomotion tasks. We encourage the reader to watch the accompanying video for simulations of our optimized robots. For each 2D example, we compare to another automated procedure, a k-means clustering baseline, inspired as an automated version of the manual labels from Hu et al. [2019]. In this baseline, the particles were clustered prior to optimization based on their Euclidean distance in the robot's rest configuration; the average position and velocity of each cluster in each Cartesian coordinate are fed as input to the controller network. Further details about our hyperparameters are included in the Appendix for reproducibility. Each experiment was run ten times. All results are presented with a 90% confidence interval. We provide some ablation tests in each experiment to justify the necessity of certain aspects of our algorithm. For all 2D experiments, iteration vs. loss is presented; autoencoder training time is trivial compared to simulation, and the VAE and k-means simulation and backpropagation times are similar. Each 2D simulation and corresponding backpropagation is computed in less than 20 seconds. All experiments were performed on a computer with an Intel i7 2.91-GHz processor, NVIDIA GeForce GTX 1080 GPU, and 16GB of RAM.

2D Arm The 2D arm presents the simplest of all of our tasks, in which the centroid of a region of a fixed-base soft robot arm must reach a prescribed point in space with no gravity. While geometrically simple, the problem is not dynamically trivial.
The actuators are too weak to allow the robot to directly bend to the goal; it must swing back and forth to build up momentum to reach its target. This is the easiest problem we present; in this one example, k-means clustering is competitive, and at finer resolutions, faster converging than our VAE (though slower at coarser resolutions). See Fig. 4 a.

Figure 4: Progress of robot performance vs. optimization iteration, along with drawings of our 2D demos. Each black rectangle denotes an actuated region; in precision tasks, regions denoted with black X's are those which must reach target locations, denoted with green circles.

We use the 2D arm as an opportunity to present ablation tests for what happens if the replay buffer is eliminated, and what happens if it is never refreshed after the initial trajectories. In the former case, the representation oscillates wildly, making control optimization impossible. In the latter case, since earlier iterations have less dynamic motion, they provide less dynamically descriptive, insufficient representations of the robot's full motion.

2D Biped The 2D Biped presents a locomotion task in which the robot must run to the right as far as possible in the allotted time. The biped's progress can be seen in Fig. 4 b. In the video, we present two design variations in the robot's shape. We also show the results of a typical VAE training procedure in Fig. 5, showing typical convergence. In four out of the ten trials, k-means clustering completely failed to converge, landing in poor local minima near the robot's starting configuration.

Figure 5: The average reconstruction loss L_R per pixel for the 2D Biped, measured in average pixel distance, vs. stochastic gradient descent step, demonstrating that our algorithm converges not only in objective value but also in model learning.

We further use the Biped as an opportunity to show the minimal adverse effects of retraining. Table 1 presents the change in the distance traveled after retraining the autoencoder and then performing a single optimization step. The single optimization step cancels out virtually all backward progress caused by retraining.

    Retrain #    Mean              Standard Dev.
    1            −1.08 × 10⁻²      2.64 × 10⁻²
    2            −1.68 × 10⁻²      1.53 × 10⁻²
    3            −1.25 × 10⁻²      3.06 × 10⁻²
    4             3.56 × 10⁻²      8.36 × 10⁻²
    5            −1.86 × 10⁻²      2.88 × 10⁻²
    6            −1.17 × 10⁻²      1.28 × 10⁻²
    7            −4.13 × 10⁻³      1.55 × 10⁻²
    8            −9.69 × 10⁻³      1.44 × 10⁻²
    9            −8.53 × 10⁻³      6.07 × 10⁻²
    10           −1.26 × 10⁻²      1.80 × 10⁻²
    11           −2.70 × 10⁻⁵      2.61 × 10⁻²
    12           −1.00 × 10⁻²      1.36 × 10⁻²
    13           −6.54 × 10⁻³      4.29 × 10⁻³
    14            5.93 × 10⁻³      1.16 × 10⁻²
    15           −3.27 × 10⁻⁴      8.39 × 10⁻³

Table 1: The mean backward progress remaining from retraining, after a single optimization iteration, on the 2D Biped locomotion task, with corresponding standard deviations. A negative value indicates a decrease in the distance traversed. As can be seen, the backward progress is either a very small negative number or positive in all cases, indicating that a single optimization step almost completely reverses the adverse effects of retraining.

2D Elephant The 2D Elephant presents a task which is a mixture of locomotion and manipulation. The elephant must walk to the right while a part of the trunk must reach a prescribed location. A subset of the results is shown in Fig. 4 c for readability; further results can be found in the Appendix. We use the Elephant to perform experiments over a wide range of latent variable and cluster sizes.
While we try to compare the same number of clusters and latent variables in experiments (since inputs from a cluster give highly dependent data), we acknowledge that each cluster (in 2D) provides six inputs to the controller. Thus, this experiment also allows comparisons over controllers of similar size.

The VAE dominates k-means over all cluster/latent variable counts. As can be seen, the latent variable procedure has a weakness: the autoencoder can suffer the well-known "collapse" phenomenon as the latent variable size grows; increasing the VAE's regularizer weight combats this phenomenon.

2D Bunny The 2D Bunny provides a task in which two arms must reach two target locations in space. The robot must walk forward and bend the arms to reach the target points as closely as possible. This is our most dynamically challenging task and cannot be solved perfectly. Results are in Fig. 4 d.

Further Demonstrations Finally, we present further control tasks, including extensions to 3D and curvier designs (Fig. 6). Like the 2D Biped, these robots must run as far to the right as possible in the allotted time. The 2D Rhino is instantiated directly from a .png file. 3D optimizations take much longer, since the 3D autoencoder and the corresponding simulation data are much larger. They take more time to process, and the VAE capacity is larger, meaning it requires larger minibatches, a larger replay buffer, and more iterations during learning. Each 3D walker takes over a day to optimize, and thus each was only run once; please see the video for demonstrations of the four additional walkers.

Figure 6: Four further robot demonstrations we present in more detail in the supplementary video.

Discussion As can be seen, our VAE observer tends to converge faster and get stuck in poor local minima much more rarely than k-means clustering. A natural question is to ask why k-means clustering performs worse.
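For concreteness, the cluster-feature baseline can be sketched as below. This is a simplified stand-in with hypothetical names, not the paper's implementation, and it reports only each cluster's mean position and velocity (the paper counts six inputs per 2D cluster):

```python
# Simplified k-means cluster features (2D): cluster particles by rest
# position with Lloyd's algorithm, then report each cluster's mean position
# and velocity as controller inputs.
import random

def kmeans_assign(points, k, iters=20, seed=0):
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            groups[j].append(p)
        centers = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                   if g else centers[j] for j, g in enumerate(groups)]
    return [min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            for p in points]

def cluster_features(rest_pos, positions, velocities, k):
    # Clusters are fixed from REST positions; features use current state.
    assign = kmeans_assign(rest_pos, k)
    feats = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        n = max(len(idx), 1)
        feats.append([sum(positions[i][0] for i in idx) / n,
                      sum(positions[i][1] for i in idx) / n,
                      sum(velocities[i][0] for i in idx) / n,
                      sum(velocities[i][1] for i in idx) / n])
    return feats  # fed as input to the controller network
```

Note that the assignment depends only on rest-configuration geometry, which is exactly the dynamics-obliviousness discussed next.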
There are several reasons why clustering-based observers can lead to worse outcomes. First, k-means clustering can select poor regions of the robot to track. For illustrative purposes, Fig. 7 shows two clusterings of the bunny. In the first, because Euclidean distance is used, two segments that are geometrically close but dynamically dissimilar are clustered together; this is especially a problem when the task demands that they be tracked separately (we note that a geodesic distance might not suffer as seriously from this behavior). Second, clustering does not gracefully handle changes in robot feature size. Even though the top arms are more dynamically interesting than the body of the robot, the body is allocated the majority of the clusters. This can be compensated for in a brute-force manner by adding more clusters, but experiments showed that simulation slows tremendously as the number of clusters grows large (with, say, k = 1000, simulation on the same problems can take minutes). Finally, clustering is dynamics-oblivious; it cannot adapt as different motions are explored, nor does it consider other design or task specifics such as actuator placement.

Fig. 8 provides a visualization of the extracted latent features for the 2D Biped and describes their computation. Natural “physical modes” emerge as the procedure continues, with the more significant latent features representing more rigid motion modes (such as velocity to the right) and the less significant latent features representing higher-frequency, dynamic deformations.
Such representations are not only valuable for robust control, but can also make it easier to understand learned observers and controllers in relation to the underlying physical processes.

To understand how the mapping between velocity fields and actuations changes over time, we generated saliency maps for each frame of a 2D Elephant simulation, included in the supplemental video. The saliency maps show the gradient of each actuator control signal with respect to the x and y velocities, multiplied point-wise by the velocities. The gradients of the leg actuators (rows 1–4 from the top) with respect to the latent variables are similar to one another, as is true for the trunk actuators (rows 5–10), implying that similar parts of the task rely on similar latent coordinates.

Finally, we note that while our VAE observer dominates on the more challenging problems, k-means is still sufficient for simpler ones, as can be seen with the arm. Another example where the k-means observer performs better is the 2D Biped when the problem is made sufficiently easier by giving it much stronger actuators and dropping it from a high initial height to give it initial momentum. In this case, the walker can quickly learn to “bounce” forward; in the much more difficult case shown here, however, the k-means observer often completely fails.

Figure 7: Two suboptimal clusterings of the bunny with different k values. For k = 5 (left) the upper arms are clustered together; for k = 10 (right) the clusters overemphasize the importance of the body compared with the feet or arms.

(a) After Episode 1    (b) After Episode 10

Figure 8: Visualization of the latent space of the autoencoder (2D Biped) with 10 latent variables. Each row represents a (normalized) decoded output for a one-hot latent feature, varied from −1 to 1. As the algorithm proceeds, the latent features become more descriptive.
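The gradient-times-input saliency described above can be sketched generically. This is an assumption-laden stand-in, not the paper’s pipeline: the paper differentiates the actual controller through its differentiable simulator, whereas here the Jacobian of a hypothetical `control_fn` (mapping a flattened velocity field to actuator signals) is estimated by central finite differences and multiplied point-wise by the input velocities.

```python
import numpy as np

def saliency_map(control_fn, v, eps=1e-5):
    """Gradient-times-input saliency: S[i, j] = (d u_i / d v_j) * v_j,
    where u = control_fn(v) are the actuator signals and v is the
    flattened velocity field. The Jacobian is estimated with central
    finite differences in place of automatic differentiation."""
    u0 = np.asarray(control_fn(v))
    sal = np.zeros((u0.shape[0], v.shape[0]))
    for j in range(v.shape[0]):
        vp, vm = v.copy(), v.copy()
        vp[j] += eps
        vm[j] -= eps
        # Central difference for column j of the Jacobian, scaled by input v_j.
        sal[:, j] = (control_fn(vp) - control_fn(vm)) / (2.0 * eps) * v[j]
    return sal
```

For a linear controller u = Wv this reduces to W scaled column-wise by v, which provides a quick correctness check.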
Formatting per box is the same as in Fig. 1.

6 Conclusions and Future Work

In this work, we demonstrated a method for end-to-end co-optimization of soft robots that requires minimal human intervention. Our method interleaves optimization with learning a deep latent space representation, allowing improved state estimates to improve the control and design, and vice versa. We have demonstrated our algorithm’s superior reliability compared with naïve dynamics-oblivious methods.

Our method has two notable drawbacks. First, although the 2D version of our algorithm can be applied to a visual cross-section in 3D, the fully 3D version can be hard to realize in the physical world; further, autoencoder training times can be quite long for the 3D convolutional architecture. Second, retraining the autoencoder, while necessary, can sometimes undo some forward progress and interfere with momentum in optimization, both of which slow the tail end of convergence.

Finally, we envision three future extensions of our work. First, since we learn a low-dimensional latent space, it would be interesting to use the learned latent space outside the context of its counterpart controller, e.g. for optimal control methods such as LQR. Second, we would like to apply our control algorithm to other soft robot simulation methods. For example, the nodes of an FEM simulation could be similarly rasterized to a background grid (though with additional overhead; MPM generates this grid “for free”), to which our algorithm could be directly applied. Finally, we hope to demonstrate our optimized controllers on real, physical soft robots using vision-based sensors and optical flow.

7 Acknowledgments

We thank Alexander Amini for insightful discussions on convolutional variational autoencoders and for starter code. We thank Liane Makatura for help in drawing explanatory figures.
We thank Buttercup Foshey (and of course Michael Foshey) for moral support during this work. This work was supported by NSF grant No. 1138967, the Unity Global Graduate Fellowship, IARPA grant No. 2019-19020100001, and the MIT EECS David S. Y. Wong Fellowship.

References

Alexander Amini, Wilko Schwarting, Guy Rosman, Brandon Araki, Sertac Karaman, and Daniela Rus. Variational autoencoder for end-to-end control of autonomous driving with novelty detection and training de-biasing. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.

Jernej Barbič and Jovan Popović. Real-time control of physically based simulations using gentle forces. ACM Transactions on Graphics (TOG), 27(5):163, 2008.

Jernej Barbič, Marco da Silva, and Jovan Popović. Deformable object animation using reduced optimal control. ACM Transactions on Graphics (TOG), 28(3):53, 2009.

Desai Chen, David Levin, Wojciech Matusik, and Danny Kaufman. Dynamics-aware numerical coarsening for fabrication design. ACM Transactions on Graphics (TOG), 36(4):84, 2017.

Nick Cheney, Robert MacCurdy, Jeff Clune, and Hod Lipson. Unshackling evolution: evolving soft robots with multiple materials and a powerful generative encoding. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pages 167–174. ACM, 2013.

Nick Cheney, Josh Bongard, Vytas SunSpiral, and Hod Lipson. Scalable co-optimization of morphology and control in embodied machines. Journal of The Royal Society Interface, 15(143):20170937, 2018.

Francesco Corucci, Nick Cheney, Hod Lipson, Cecilia Laschi, and Josh Bongard. Evolving swimming soft-bodied creatures. International Conference on the Synthesis and Simulation of Living Systems, 2016.

Cosimo Della Santina, Robert Katzschmann, Antonio Bicchi, and Daniela Rus. Dynamic control of soft robots interacting with the environment.
Institute of Electrical and Electronics Engineers (IEEE), 2018.

Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.

Olivier Goury and Christian Duriez. Fast, generic, and reliable control and simulation of soft robots using model order reduction. IEEE Transactions on Robotics, (99):1–12, 2018.

Sehoon Ha, Stelian Coros, Alexander Alspach, Joohyung Kim, and Katsu Yamane. Joint optimization of robot design and motion parameters using the implicit function theorem. Robotics: Science and Systems (RSS), 2017.

Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin A. Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

Yuanming Hu, Yu Fang, Ziheng Ge, Ziyin Qu, Yixin Zhu, Andre Pradhana, and Chenfanfu Jiang. A moving least squares material point method with displacement discontinuity and two-way rigid body coupling. ACM Transactions on Graphics (TOG), 37(4):150, 2018.

Yuanming Hu, Jiancheng Liu, Andrew Spielberg, Joshua Tenenbaum, William Freeman, Jiajun Wu, Daniela Rus, and Wojciech Matusik. ChainQueen: A real-time differentiable physical simulator for soft robotics. IEEE International Conference on Robotics and Automation (ICRA), 2019.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2015.

Pingchuan Ma, Yunsheng Tian, Zherong Pan, Bo Ren, and Dinesh Manocha.
Fluid directed rigid body control using deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(4):96, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (ICML), 2014.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Eftychios Sifakis and Jernej Barbič. FEM simulation of 3D deformable solids: A practitioner’s guide to theory, discretization and model reduction. In ACM SIGGRAPH Courses, 2012.

Andrew Spielberg, Brandon Araki, Cynthia Sung, Russ Tedrake, and Daniela Rus. Functional co-optimization of articulated robots. International Conference on Robotics and Automation (ICRA).

Maxime Thieffry, Alexandre Kruszewski, Thierry-Marie Guerra, and Christian Duriez. Reduced order control of soft robots with guaranteed stability. European Control Conference (ECC), 2018.

Kevin Wampler and Zoran Popović. Optimal gait and form for animal locomotion. ACM Transactions on Graphics (TOG), 28(3):60, 2009.

Tingwu Wang, Yuhao Zhou, Sanja Fidler, and Jimmy Ba. Neural graph evolution: Towards efficient automatic robot design.
International Conference on Learning Representations (ICLR), 2019.