{"title": "End-to-End Differentiable Physics for Learning and Control", "book": "Advances in Neural Information Processing Systems", "page_first": 7178, "page_last": 7189, "abstract": "We present a differentiable physics engine that can be integrated as a module in deep neural networks for end-to-end learning. As a result, structured physics knowledge can be embedded into larger systems, allowing them, for example, to match observations by performing precise simulations, while achieving high sample efficiency. Specifically, in this paper we demonstrate how to perform backpropagation analytically through a physical simulator defined via a linear complementarity problem. Unlike traditional finite difference methods, such gradients can be computed analytically, which allows for greater flexibility of the engine. Through experiments in diverse domains, we highlight the system's ability to learn physical parameters from data, efficiently match and simulate observed visual behavior, and readily enable control via gradient-based planning methods. Code for the engine and experiments is included with the paper.", "full_text": "End-to-End Differentiable Physics\n\nfor Learning and Control\n\nFilipe de A. Belbute-Peres\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nfiliped@cs.cmu.edu\n\nKevin A. Smith\n\nBrain and Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nk2smith@mit.edu\n\nKelsey R. Allen\n\nBrain and Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nkrallen@mit.edu\n\nJoshua B. Tenenbaum\n\nBrain and Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\njbt@mit.edu\n\nJ. 
Zico Kolter\n\nSchool of Computer Science\n\nCarnegie Mellon University and\n\nBosch Center for Artificial Intelligence\n\nPittsburgh, PA 15213\nzkolter@cs.cmu.edu\n\nAbstract\n\nWe present a differentiable physics engine that can be integrated as a module in deep neural networks for end-to-end learning. As a result, structured physics knowledge can be embedded into larger systems, allowing them, for example, to match observations by performing precise simulations, while achieving high sample efficiency. Specifically, in this paper we demonstrate how to perform backpropagation analytically through a physical simulator defined via a linear complementarity problem. Unlike traditional finite difference methods, such gradients can be computed analytically, which allows for greater flexibility of the engine. Through experiments in diverse domains, we highlight the system's ability to learn physical parameters from data, efficiently match and simulate observed visual behavior, and readily enable control via gradient-based planning methods. Code for the engine and experiments is included with the paper.1\n\n1 Introduction\n\nPhysical simulation environments, such as MuJoCo [Todorov et al., 2012], Bullet [Coumans et al., 2013], and others, have played a fundamental role in developing intelligent reinforcement learning agents. 
Such environments are widely used, both as benchmark tasks for RL agents [Brockman et al., 2016], and as "cheap" simulation environments that can (ideally) allow for transfer to real domains. However, despite their ubiquity, these simulation environments are in some sense poorly suited for deep learning settings: the environments are not natively differentiable, and so gradients (e.g., policy gradients for control tasks, physical property gradients for model fitting, or dynamics Jacobians for model-based control) must all be evaluated via finite differencing, with some attendant issues of speed and numerical stability.\n\n1Available at https://github.com/locuslab/lcp-physics.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nRecent work has also proposed the development of a differentiable physical simulator [Degrave et al., 2016], but this was accomplished by simply writing the simulation engine entirely in an automatic differentiation framework; the limitations of this framework meant that the system only supported balls as objects, with limited extensibility.\n\nIn this paper, we propose and present a differentiable two-dimensional physics simulator that addresses the main limitations of past work. Specifically, like many past simulation engines, our system simulates rigid body dynamics via a linear complementarity problem (LCP) [Cottle, 2008, Cline, 2002], which computes the equations of motion subject to contact and friction constraints. In addition to this, however, in this paper we also show how to differentiate, analytically, through the optimal solution to the LCP; this allows us to use general simulation methods for determining the non-differentiable parts of the dynamics (namely, the presence or absence of collisions between convex shapes), while still providing a simulation environment that is end-to-end differentiable (given the observed set of collisions). 
The end result is that we can embed an entire physical simulation environment as a "layer" in a deep network, enabling agents to both learn the parameters of the environments to match observed behavior and improve control performance via traditional gradient-based learning. We highlight the utility of this system in a wide variety of different domains, each highlighting a different benefit that such differentiable physics can bring to deep learning systems: learning physical parameters from data; simulating observed (visual) behavior with minimal data requirements; and learning physical deep RL tasks, ranging from pure physical systems like Cartpole to "physics-based" games like Atari Breakout, via gradient-based planning methods. The environment itself is implemented as a function within the popular PyTorch library [Paszke et al., 2017]. Code for the engine and experiments is available at https://github.com/locuslab/lcp-physics.\n\n2 Background and related work\n\nThe work in this paper relates in some way to several different threads of recent work in deep learning and cognition.\n\nPhysical simulation As mentioned above, although they were not developed purely within the machine learning community, physical simulation tools such as MuJoCo [Todorov et al., 2012], Bullet [Coumans et al., 2013], and DART [Lee et al., 2018], have become ubiquitous tools for the creation and development of deep RL agents. In general, these environments also use LCP techniques to compute equations of motion, though often with additional enhancements such as O(n)-time simulation for open chains via Featherstone's algorithm [Featherstone, 1984]. Despite their power, computing derivatives through these engines mostly involves using finite differences, that is, evaluating the forward simulation multiple times with small perturbations to the relevant parameters to approximate the relevant gradients. 
This strategy is often impractical due to (1) the high computational burden of finite differencing when computing the gradient with respect to a large number of model/policy parameters; and (2) the instability of numerical gradients over long time horizons, especially if contacts change over the course of a rollout. The analytic LCP differentiation approach has no such issues, and can give gradients with respect to a large number of parameters essentially "for free" given a forward solution. The use of analytical gradients in physics simulation has previously been investigated for spring-damper systems [Hermans et al., 2014]. However, due to the limitations of spring-damper models, such as instability and unrealistic contact handling, most engines used in practice do not use them.\n\nIn probably the most closely related work, Degrave et al. [2016] also develop a differentiable physics engine, with motivations similar to our own. However, in this case they made their engine differentiable by simply implementing it in its entirety in the Theano framework [Al-Rfou et al., 2016]. This severely limited the complexity of the allowable operations: for instance, the engine only allowed for collisions between balls and the ground plane. In contrast, because our method analytically differentiates the LCP, it can be substituted within the traditional computations of most existing physics engines, only requiring the added differentiability within the LCP portion itself; indeed, in our system we use existing efficient methods for portions of the simulator such as collision detection or Euler stabilization. 
Finally, we also evaluate the method in a broader context than in this previous work: while the approach there centered around policy optimization (within the physics engine itself), we additionally highlight applications in system identification, prediction in visual settings, and using the simulation engine internally within a policy to perform tasks in a different environment.\n\nIntuitive physics In a related but orthogonal body of work, many studies have investigated the human ability to intuitively understand physics. Battaglia et al. [2013], Hamrick et al. [2015] and Smith and Vul [2013] suggested that people have an "intuitive physics engine" that they can use to simulate future or hypothetical states of the world for inference and planning. Recent work in machine learning has leveraged this idea by attempting to design networks that can learn physical dynamics in a differentiable system [Lerer et al., 2016, Chang et al., 2016, Battaglia et al., 2016], but because these dynamics must be learned, they require extensive training before they can be used as a layer in a larger network, and it is not clear how well they generalize across tasks. Conversely, by performing explicit simulation (similar to how people do), embedded as a "layer" in the system, our approach requires no pre-training and can generalize across scenarios that can be captured by a rigid-body engine.\n\nModel-based reinforcement learning Although focusing on an orthogonal issue, our work is of course highly relevant to the entire field of model-based RL. 
Although model-free RL methods have achieved some notable successes [Mnih et al., 2015, Heess et al., 2015], model-based RL also underpins much of the recent success, and there is both old [Atkeson and Santamaria, 1997] and recent [Kurutach et al., 2018] work that argues that model-based approaches are often superior for many tasks.\n\nModel-based (deep) RL typically focuses on one of two settings: either a general neural network is used to simulate the dynamics (e.g. Werbos [1989], Nagabandi et al. [2017]), often with a specific loss function or network structure aimed at predicting on the relevant time scales [Abbeel and Ng, 2005, Mishra et al., 2017], or the model used is a more "pure" physics model, where learning essentially corresponds to traditional system identification [Ljung, 1999]. Our approach lies somewhere in between these two extremes (though closer to the system identification approach), where we can use the differentiability of the simulation system to identify model parameters and use the system within a model-based control method, but where the generic formulation is substantially more general than traditional system identification, and e.g. the number of objects or joints can even be dictated by another portion of the network.\n\nAnalytic differentiation of optimization Finally, our work relates methodologically to a recent trend of incorporating more structured layers within deep networks. Specifically, recent work has looked at incorporating quadratic programs [Amos and Kolter, 2017], combinatorial optimization [Djolonga and Krause, 2017], computing equilibria in games [Ling et al., 2018], or dynamic programming [Mensch and Blondel, 2018]. 
Our work relates most closely to that of Amos and Kolter [2017]. Like this work, we use an interior point primal-dual method to solve a nonlinear set of equations (in our case a general LCP, in their case a symmetric LCP resulting from the KKT conditions of a QP). However, both the general nature of the LCP and the application to physical simulation differ substantially from what has been considered in previous work.\n\n3 Differentiable Physics Engine\n\nA detailed description of the physics engine architecture is presented in Appendix A due to space constraints. The LCP solution and the gradients are presented in more detail in Appendix B. Below we present a brief summary of the LCP formulation.\n\n3.1 Formulating the LCP\n\nRigid body dynamics are commonly formulated as a linear complementarity problem, with the different constraints on the movement of bodies (such as joints, interpenetrations, friction, etc.) represented as equality and inequality constraints [Anitescu and Potra, 1997, Cline, 2002]. In this work, we follow closely the framework described in Cline [2002], in which at each time step an LCP is solved to find the constrained velocities of the objects.\n\nTo formulate such an LCP, we first find which contacts between bodies are present at the current time-step. Let $t$ be the current time-step and $t + dt$ the following time-step, for a step of size $dt$. If the distance between possibly contacting objects is less than a predefined threshold, the interaction is considered a contact. From the equality constraints specified in the system, such as joints, we can build the matrix $J_e$ such that $J_e v_{t+dt} = 0$. From the contacts at each step, we can build a contact constraint matrix $J_c$, such that $J_c v_{t+dt} \geq 0$. Similarly, we have a friction constraint matrix $J_f$ that introduces frictional interactions. From the definition of the simulated bodies we construct the inertia matrix $M$. 
The structure of these block matrices is described in detail in Appendix A. Finally, given the forces acting on the bodies at time $t$, $f_t$, and the collision coefficient $c$, the constrained dynamics can be formulated as the following mixed LCP\n\n$$\begin{bmatrix} 0 \\ 0 \\ a \\ \sigma \\ \zeta \end{bmatrix} - \begin{bmatrix} M & -J_e^T & -J_c^T & -J_f^T & 0 \\ J_e & 0 & 0 & 0 & 0 \\ J_c & 0 & 0 & 0 & 0 \\ J_f & 0 & 0 & 0 & E \\ 0 & 0 & \mu & -E^T & 0 \end{bmatrix} \begin{bmatrix} v_{t+dt} \\ \lambda_e \\ \lambda_c \\ \lambda_f \\ \gamma \end{bmatrix} = \begin{bmatrix} -M v_t - dt\, f_t \\ 0 \\ c \\ 0 \\ 0 \end{bmatrix} \qquad (1)$$\n\n$$\text{subject to}\quad \begin{bmatrix} a \\ \sigma \\ \zeta \end{bmatrix} \geq 0, \qquad \begin{bmatrix} \lambda_c \\ \lambda_f \\ \gamma \end{bmatrix} \geq 0, \qquad \begin{bmatrix} a \\ \sigma \\ \zeta \end{bmatrix}^T \begin{bmatrix} \lambda_c \\ \lambda_f \\ \gamma \end{bmatrix} = 0,$$\n\nwhere $[a, \sigma, \zeta]^T$ are slack variables for the inequality constraints, and $[v_{t+dt}, \lambda_e, \lambda_c, \lambda_f, \gamma]^T$ are the unknowns. By solving this LCP, we obtain the velocities for the next time-step $v_{t+dt}$, which are used to update the positions of the bodies.\n\n3.2 Solving the LCP\n\nAnalogously to the differentiable optimizer in OptNet [Amos and Kolter, 2017], our LCP solver is adapted from the primal-dual interior point method described in Mattingley and Boyd [2012]. The advantage of using such a method is that it allows for efficient computation of the gradients, as we show in Section 3.3.\n\nFirst, to simplify the notation from the LCP formulation of the dynamics in Equation 1, let us define\n\n$$x := v_{t+dt}, \quad y := \lambda_e, \quad z := \begin{bmatrix} \lambda_c \\ \lambda_f \\ \gamma \end{bmatrix}, \quad s := \begin{bmatrix} a \\ \sigma \\ \zeta \end{bmatrix}, \quad q := -M v_t - dt\, f_t, \quad m := \begin{bmatrix} c \\ 0 \\ 0 \end{bmatrix},$$\n\n$$A := J_e, \quad G := \begin{bmatrix} J_c \\ J_f \\ 0 \end{bmatrix}, \quad F := \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & E \\ \mu & -E^T & 0 \end{bmatrix}.$$\n\nThen we can rewrite the LCP above as the system below, which can be solved with only slight adaptations to the primal-dual interior point method of Mattingley and Boyd [2012]:\n\n$$\begin{bmatrix} 0 \\ s \\ 0 \end{bmatrix} - \begin{bmatrix} M & -G^T & -A^T \\ G & F & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ z \\ y \end{bmatrix} = \begin{bmatrix} q \\ m \\ 0 \end{bmatrix}, \qquad \text{subject to}\quad s \geq 0, \; z \geq 0, \; s^T z = 0. \qquad (2)$$\n\n3.3 Gradients\n\nAll the work leading to the construction of the dynamics LCP in Equation 1 consists of differentiable operations on the simulation's parameters and initial settings. 
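To make the time-stepping scheme concrete, the following minimal sketch (our own illustrative code, not the engine's) reduces the contact LCP of Equation 1 to its simplest instance: a single point mass above a floor, where the complementarity condition between the contact impulse and the separating velocity can be solved in closed form.

```python
# Minimal 1D analogue of one LCP time step (illustrative only): a point mass
# above a floor. The scalar complementarity condition is
#   0 <= lambda,  v_next >= 0,  lambda * v_next = 0,
# i.e., the contact impulse is zero unless it is needed to stop penetration.

def lcp_step_1d(v, m, dt, g=-9.8, in_contact=False):
    """One velocity update; returns (v_next, contact_impulse)."""
    v_free = v + dt * g              # unconstrained (free-fall) velocity update
    if not in_contact or v_free >= 0:
        return v_free, 0.0           # no contact, or bodies already separating
    lam = -m * v_free                # impulse that makes v_next exactly zero,
    return 0.0, lam                  # so lam >= 0, v_next >= 0, lam * v_next = 0

v_next, lam = lcp_step_1d(v=-1.0, m=2.0, dt=0.01, in_contact=True)
```

When the contact is active, the impulse is an analytic function of the pre-contact velocity and the mass; the LCP generalizes exactly this structure to many bodies, joints, and friction constraints.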
Therefore, if we could differentiate through the solution of the LCP as well, the system would be differentiable end to end. To derive these gradients we apply the method described in Amos and Kolter [2017] to the LCP in Equation 2, which gives us the gradients of the solution of the LCP with respect to the input parameters from the previous time-step. Following this method, we arrive at the partials that can then be used for the backward step:\n\n$$\frac{\partial \ell}{\partial q} = d_x, \qquad \frac{\partial \ell}{\partial m} = D(z^*)\, d_z, \qquad \frac{\partial \ell}{\partial A} = d_y x^T + y d_x^T,$$\n\n$$\frac{\partial \ell}{\partial M} = \frac{1}{2}\left(d_x x^T + x d_x^T\right), \qquad \frac{\partial \ell}{\partial G} = D(z^*)\left(d_z x^T + z d_x^T\right), \qquad \frac{\partial \ell}{\partial F} = D(z^*)\, d_z z^T, \qquad (3)$$\n\nwhere $D(z^*)$ is the diagonal matrix formed from the solution $z^*$, and the differentials $d_x$, $d_y$, $d_z$ are obtained by solving the corresponding transposed system (see Appendix B).\n\n3.4 Implementation\n\nThe physics engine is implemented in PyTorch [Paszke et al., 2017] in order to take advantage of the autograd automatic differentiation graph functionality. The LCP solver is implemented as an autograd Function, with the analytical gradients provided according to the definitions above. This allows the derivatives to be propagated across time-steps in the simulation. Furthermore, the autograd graph then allows the derivatives to be propagated backwards into the leaf parameters of the dynamics, such as the bodies' masses, positions, etc.\n\n4 Experiments\n\nTo demonstrate the flexibility of the differentiable physics engine, we test its performance across three classes of experiments. First, we show that it can infer the mass of an object by observing the dynamics of a scene. 
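As a preview of this first experiment, the recipe can be sketched on a hypothetical stand-in system (a point mass under constant force rather than the actual scene; all names and constants below are our own toy choices): unroll a simulation, compute the position MSE against the observations, clip the gradient, and perform gradient descent on the mass.

```python
import numpy as np

# Toy stand-in for gradient-based mass inference: recover an unknown mass from
# an observed trajectory of a point mass under a constant force. The closed
# form x_t = s_t / m gives the analytic gradient of the position MSE.

def simulate(m, force=4.0, dt=0.1, steps=50):
    """Semi-implicit Euler rollout of a point mass; returns positions."""
    x, v, xs = 0.0, 0.0, []
    for _ in range(steps):
        v += dt * force / m
        x += dt * v
        xs.append(x)
    return np.array(xs)

def fit_mass(observed, m0=1.0, lr=0.05, iters=2000, force=4.0, dt=0.1):
    t = np.arange(1, len(observed) + 1)
    s = dt * dt * force * t * (t + 1) / 2        # closed form: x_t = s_t / m
    m = m0
    for _ in range(iters):
        grad = np.mean(2.0 * (s / m - observed) * (-s / m ** 2))  # d MSE / d m
        grad = float(np.clip(grad, -100.0, 100.0))                # gradient clipping
        m -= lr * grad
    return m

observed = simulate(m=7.0)    # "ground truth" trajectory
m_hat = fit_mass(observed)    # approaches the true mass of 7
```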
Next, we demonstrate that embedding a differentiable physics engine within a deep autoencoder network can lead to high accuracy predictions and improved sample efficiency. Finally, we use the differentiable physics engine together with gradient-based control methods to show that we can learn to perform physics-based tasks with low sample complexity when compared to model-free methods.\n\n4.1 Parameter learning\n\nTask To evaluate the engine's capabilities for inference, we devised an experiment in which one object has an unknown mass that has to be inferred from its interactions with the other bodies. As depicted in Figure 1, a scene in which a ball of known mass hits a chain is observed and the resulting positions of the objects are recorded for 10s. The goal is to infer the mass of the chain.\n\nLearning and results Simulations are iteratively unrolled starting with an arbitrarily chosen mass of 1 for the chain. After each iteration, the mean squared error (MSE) between the observed positions and the simulated positions is computed, and then used to obtain its gradient with respect to the mass. Gradients are clipped to a maximum absolute value of 100 and then used to perform gradient descent on the value of the mass, with a learning rate of 0.01. As shown in Figure 1, this process is able to quickly reduce the position MSE by converging to the true value of the mass.\n\nComparison to numerical derivatives We also compared using analytic and numerical gradients. In this experiment, the same optimization process described above was repeated for a varying number of links in the chain. The number of gradient updates was fixed to 30 and the run times were averaged over 5 runs for each condition. 
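The analytic-versus-numerical trade-off can also be reproduced on a much smaller problem. The sketch below is our own miniature analogue in the style of Amos and Kolter [2017]: it differentiates through the KKT system of an equality-constrained QP (standing in for the full LCP of Equation 2), obtaining all parameter gradients from a single transposed solve, whereas finite differencing would need one extra forward solve per perturbed parameter.

```python
import numpy as np

# Differentiate through x* = argmin_x 0.5 x'Mx + q'x  s.t.  Ax = b, via its
# KKT system (OptNet-style). This QP is our simplest checkable analogue of the
# engine's LCP backward pass, not the engine's own code.

def qp_forward(M, q, A, b):
    """Solve the KKT system; returns primal x*, dual y*, and the KKT matrix."""
    n, p = len(q), len(b)
    K = np.block([[M, A.T], [A, np.zeros((p, p))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    return sol[:n], sol[n:], K

def qp_backward(K, x, y, grad_x):
    """Given dL/dx*, return dL/dq, dL/dM, dL/dA, dL/db via one transposed solve."""
    n = len(x)
    d = np.linalg.solve(K.T, np.concatenate([-grad_x, np.zeros(len(y))]))
    dx, dy = d[:n], d[n:]
    return (dx,                                         # dL/dq
            0.5 * (np.outer(dx, x) + np.outer(x, dx)),  # dL/dM
            np.outer(dy, x) + np.outer(y, dx),          # dL/dA
            -dy)                                        # dL/db

M = np.array([[3.0, 0.5], [0.5, 2.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, y, K = qp_forward(M, q, A, b)
dq, dM, dA, db = qp_backward(K, x, y, np.ones(2))  # loss = sum(x*)
```

A finite-difference check on `q` and `b` reproduces these gradients, which is the consistency property the run-time comparison above relies on.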
As can be seen in Figure 1, the run time with analytical gradients grows much more slowly with the increasing number of parameters.\n\n4.2 Prediction on visual data\n\nTask To test our approach on a benchmark for visual physical prediction, we generated a dataset of videos of billiard ball-like scenarios using the code from [Fragkiadaki et al., 2015]. Simulations lasting 10 seconds were generated, totalling 8,000 trials for training, 1,000 for validation and 1,000 for testing. Datasets with 1 and 3 balls were used, with all balls having the same mass. Frames from sample trials can be seen in Figure 3. In our task setup, balls bouncing in a box are observed for a period of time. The model is provided with 3 frames as input and has to learn to predict the state of the world at a future time, 10 frames later.\n\nFigure 1: Inferring the mass of a chain. Top: Sequence of frames from the inference experiment. The goal is to infer the mass of the chain by unrolling simulations and using the gradient to minimize the loss from the predicted positions to the observed ones. Bottom left: The estimated mass quickly converges to the true value, m = 7, indicated by the dashed line. Bottom center: As a consequence of the improving mass estimation, the MSE (represented in log scale) between the true and simulated positions for the bodies decreases quickly. Bottom right: Run time comparison between using analytical gradients or finite differences for 30 updates, as a function of the number of links in the chain.\n\nFigure 2: Diagram of autoencoder architecture. The encoder learns to map from input frames to the physical state of the objects (i.e., position, velocity, etc.). The physics engine steps the world forward using the parameters from the encoder. The decoder takes the predicted physical parameters and generates a frame to match the true future frame. The system is trained end-to-end. 
Part of the data has strong supervision, with ground truth values available for the output of the encoder and the physics engine. Different proportions of strong and weak supervision (where only the future frame is provided) are evaluated. Using a large amount of weakly labelled data improves sample efficiency with respect to strongly labelled data.\n\nArchitecture To make visual predictions given the visual input, we use an autoencoder architecture summarized in Figure 2. It consists of three parts: (1) the encoder maps input frames to the physical state of the objects (i.e., position, velocity, etc.). Specifically, we take in a sequence of 3 RGB frames from the simulation. We then use a pretrained spatial pyramid network [Ranjan and Black, 2016] to obtain two optical flow frames (each consisting of two matrices, for x and y flow). Color filters are applied to the RGB images to segment the objects. The segmented region of each object is then used as a mask for the RGB and optical flow frames, such that at the end of this pipeline we have, for each object, a collection of 3 RGB frames and 2 optical flow frames (13 channels) with only a single segmented object. Then, each of these per-object processed inputs is passed to a VGG-11 network with its last layer modified to output size 4, in order to regress two position and two velocity parameters as outputs. (2) the physics engine steps the world forward using the physical parameters received from the encoder. The physics engine can be integrated into the autoencoder pipeline and allow for end-to-end training due to its differentiability, as described in Section 3. (3) the decoder takes the predicted physical parameters and generates a frame to match the true future frame. 
The architecture used is a mirror of the VGG encoder network, with transposed convolutions in the place of convolutions and bilinear upsampling layers in the place of the maxpooling ones.\n\nFigure 3: Qualitative results for the prediction task, comparing the ground truth and predicted future frames (panels show the initial frame, the true future frame, and the predicted frame). Only the initial frame and the two preceding frames are used as input, with physical parameters extracted and used to simulate the state forward and generate the predicted frame. In most cases the predicted frame is accurate. However, small differences can still be perceived in some cases, due to differences between the two engines.\n\nLearning In order to evaluate the sample efficiency of the model, the network was trained with varying amounts of labelled samples. The labels used consist of the ground truth physical parameters $\phi$ of the objects both at the present ($t$) and the future time-step ($t+dt$). When a label is available for a given sample, the model uses these ground truth physical parameters (instead of the estimated ones) to generate the predicted frame $\hat{y}$ from input frames $x$, such that\n\n$$\hat{\phi}_t = \text{encoder}(x), \qquad \hat{\phi}_{t+dt} = \text{physics}(\phi_t), \qquad \hat{y} = \text{decoder}(\phi_{t+dt}). \qquad (4)$$\n\nUsing the labels and the true future frame $y$, the model is then trained to minimize a loss consisting of the sum of three terms, the encoder, physics and decoder losses\n\n$$L = L_{enc} + L_{phys} + L_{dec}, \qquad L_{enc} = \ell(\hat{\phi}_t, \phi_t), \quad L_{phys} = \ell(\hat{\phi}_{t+dt}, \phi_{t+dt}), \quad L_{dec} = \ell(\hat{y}, y), \qquad (5)$$\n\nwhere $\ell(\cdot,\cdot)$ is the mean squared error loss.\n\nWhen labels are not available for a given sample, the model uses its own estimated parameters to generate the predicted frame, that is\n\n$$\hat{\phi}_t = \text{encoder}(x), \qquad \hat{\phi}_{t+dt} = \text{physics}(\hat{\phi}_t), \qquad \hat{y} = \text{decoder}(\hat{\phi}_{t+dt}). \qquad (6)$$\n\nNotice that here, unlike in Equation 4, the arguments to the physics and decoder functions are the estimated $\hat{\phi}_t$ and $\hat{\phi}_{t+dt}$. 
In this case, since there are no labels to use for the other losses, the loss consists only of $L = L_{dec}$. The gradients are thus propagated end-to-end through the physics model back to the encoder. As shown in Figure 4, this signal from unlabeled examples allows the autoencoder to learn with greater sample efficiency. For all losses, the MSE is used. In our experiments, the squared loss performed better than the $\ell_1$ loss, which was not able to produce meaningful decoder outputs.\n\nResults As demonstrated in Figure 4, the model was able to learn to perform the task with high accuracy. Figure 3 contains sample predicted frames and their matching ground truth frames for a qualitative analysis of the results. As a comparison point, an MLP with two hidden layers of size 100 and trained with only labeled data was used as a baseline, replacing the physics(·) function in Equation 4 above. In our experiments, using the baseline model in such a way, as a replacement for the physics function, provided better results than using it in an unstructured manner, relying solely on the decoder loss. It is clear from Figure 4 that the autoencoder with the physics model is able to learn more efficiently and with higher accuracy than the baseline model. To evaluate the sample efficiency of this model, we compare its performance on training regimens in which 100%, 25%, 10% and 2% of the available samples contain labels. Some supervision is still necessary: when provided with no supervision at all (a 0% condition), relying solely on the decoder loss, the model was not able to learn to extract meaningful physical parameters. 
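The switch between the fully supervised loss of Equation 5 and the purely reconstructive loss can be summarized schematically as follows (a minimal sketch with our own function names; the real encoder, physics engine, and decoder are the networks described above).

```python
# Schematic of the supervision switch in Equations 4-6 (names are ours, not
# the paper's code): with a label, three losses supervise the pipeline on the
# ground-truth states; without one, only the decoder loss remains and the
# gradient flows end to end through the physics model back to the encoder.

def training_loss(x, y, label, encoder, physics, decoder, mse):
    phi_t = encoder(x)
    if label is not None:
        phi_true_t, phi_true_next = label                    # ground-truth states
        return (mse(phi_t, phi_true_t)                       # encoder loss
                + mse(physics(phi_true_t), phi_true_next)    # physics loss
                + mse(decoder(phi_true_next), y))            # decoder loss
    # Unlabelled sample: only the decoder loss, through the whole chain.
    return mse(decoder(physics(phi_t)), y)
```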
Still, as can be seen in Figure 4, the model is able to leverage the unlabeled data to quickly learn even from few labelled data points.\n\nFigure 4: Sample efficiency for the prediction task, measured by the validation loss per number of labelled samples used in training. The autoencoder is able to leverage unlabelled examples to improve its sample efficiency: training regimes that employ unlabeled data learn faster for a given amount of labeled data. The loss is the mean squared error of the predicted image to the ground truth. Each line represents a training regimen with a different proportion of labeled to unlabeled data. Note that the x-axis is already adjusted to the number of labeled samples used, to facilitate comparisons.\n\n4.3 Control\n\nTasks Finally, in this section we demonstrate the physics engine's ability to be readily used with gradient-based control methods. To this end, we evaluate its performance on physics-based tasks from the OpenAI Gym environment [Brockman et al., 2016], namely Cartpole and the Atari game Breakout.\n\nModel and Controller For the Cartpole environment, a model is built using two articulated rectangles, whose dimensions and masses are learned from simulated trajectories using random actions. The physics engine-based model is compared to a baseline consisting of an MLP with two hidden layers of size 100 trained on the same data. A variation of the environment is used in which the actions to be taken by the cart are continuous, instead of discrete. Episodes are also limited to a total reward of 1000, instead of the default 200, at which the task is considered done.\n\nFor Breakout, a model of the environment is built by applying color filters, segmenting the diverse objects (the paddle, the ball, etc.) and translating their positions into the physics engine. The ball's velocity is estimated by the difference in its position over the last two frames. 
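This state-extraction step can be sketched as follows (a minimal version with a hypothetical color id and label images rather than the actual Breakout frames and filters): segment by color, take centroids as positions, and difference the last two frames for velocity.

```python
import numpy as np

# Toy version of the Breakout state extraction (our own sketch): objects are
# segmented by a color id in a label image, positions are segment centroids,
# and velocity is the position difference between the last two frames.

BALL_COLOR = 2  # hypothetical color id of the ball

def object_position(frame, color):
    ys, xs = np.nonzero(frame == color)
    return np.array([xs.mean(), ys.mean()])   # (x, y) centroid

def ball_state(prev_frame, frame, dt=1.0):
    pos = object_position(frame, BALL_COLOR)
    vel = (pos - object_position(prev_frame, BALL_COLOR)) / dt
    return pos, vel

f0 = np.zeros((8, 8), dtype=int); f0[2, 3] = BALL_COLOR
f1 = np.zeros((8, 8), dtype=int); f1[3, 5] = BALL_COLOR
pos, vel = ball_state(f0, f1)
```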
The paddle's velocity when moving at each step is learned by unrolling game episodes with randomly chosen actions, performing the same actions in the physics simulation, and then fitting the simulation parameter via gradient descent to minimize the mean squared error to the observed trajectory, analogously to the process in Section 4.1. The physics engine model is compared to a Double Q-learning with prioritized replay [van Hasselt et al., 2015] baseline from OpenAI [Dhariwal et al., 2017].\n\nSince the resulting physics models described above are differentiable, they are used in conjunction with iLQR [Li and Todorov, 2004] to control the agent in the tasks. The iLQR is set up with a time-horizon of 5 frames for both tasks. For Cartpole, the cost consists of the square of the pole's angular deviation from vertical. For the Atari game, the cost consists of the squared difference in the x position of the paddle and the ball when the ball is descending, and the squared distance to the center of the screen otherwise.\n\nResults Results for the Cartpole task are shown in Figure 5. Even though the MLP baseline achieves a lower MSE faster in predicting the next state of the cartpole system, the physics engine is able to learn parameters for a model that allows for high reward on the task, even when its prediction error is higher.\n\nIn the Atari benchmark, the system is able to achieve high reward on the task with extremely low sample complexity. 
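The role the differentiable model plays in planning can be illustrated with a toy stand-in (ours: a double integrator optimized by plain gradient descent rather than iLQR). Because the rollout is differentiable, the whole action sequence is updated directly from the gradient of the cost:

```python
import numpy as np

# Gradient-based planning sketch (our toy, not the paper's iLQR setup): drive a
# double integrator, x' = x + dt*v, v' = v + dt*u, to a target by descending
# the terminal cost with respect to the whole action sequence u.

def plan(target, horizon=20, dt=0.1, lr=0.5, iters=200, reg=1e-3):
    u = np.zeros(horizon)
    # The rollout is linear in u: x_T = sum_k w_k u_k with w_k = dt^2 (T-1-k),
    # so the model gradient of the terminal position is simply w.
    w = dt * dt * (horizon - 1 - np.arange(horizon))
    for _ in range(iters):
        x_T = w @ u
        grad = 2 * (x_T - target) * w + 2 * reg * u   # d cost / d u
        u -= lr * grad
    return u, w @ u

u, x_T = plan(target=1.0)
```

The small action penalty `reg` keeps the problem well conditioned; in the paper's setting the same idea is applied over a short receding horizon with iLQR instead of batch gradient descent.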
Specifically, the model is able to learn the paddle parameters quickly from random trajectories, improving the control precision and leading to high reward on Breakout, as shown in Figure 6.\n\nFigure 5: Even though the baseline is able to achieve a lower MSE over one-step predictions of the dynamics of the Cartpole environment (left), the physics engine-based controller is able to achieve a higher reward very quickly (right).\n\nFigure 6: The physics-based controller is able to quickly learn good parameter values that lead to high reward. Even though the asymptotic performance is lower than that of the model-free method, it achieves a high level of reward with orders of magnitude fewer samples (the horizontal axis is log-scaled). The human level of 31 for a professional game tester is as reported by Mnih et al. [2015].\n\nThe model performs close to model-free reinforcement learning methods and is able to achieve a high level of reward with orders of magnitude fewer samples.\n\n5 Conclusion\n\nIn this work, we have presented a differentiable physics engine. Unlike most previous work, this engine provides analytical gradients by differentiating the solution to the physics LCP. The differentiable nature of the engine allows it to be used as a part of gradient-based learning and control systems. Our experiments demonstrate the diverse possibilities this system entails, such as inferring parameters from observed simulations, learning from visual observations and performing accurate predictions, and achieving high reward with gradient-based control methods on physics-based tasks, all the while demonstrating sample efficiency. This modular, differentiable system contributes to a recent trend of incorporating components with structured modules into end-to-end learning systems, such as deep networks. 
In general, we believe structured constraints such as the ones provided by\nphysics simulation are essential for providing a scaffolding for more ef\ufb01cient learning.\n\nReferences\nPieter Abbeel and Andrew Y Ng. Learning \ufb01rst-order markov models for control. In Advances in\n\nneural information processing systems, pages 1\u20138, 2005.\n\n9\n\n\fRami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau,\nNicolas Ballas, Fr\u00e9d\u00e9ric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua\nBengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas\nBouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Br\u00e9bisson, Olivier\nBreuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano, Tim Cooijmans,\nMarc-Alexandre C\u00f4t\u00e9, Myriam C\u00f4t\u00e9, Aaron Courville, Yann N. Dauphin, Olivier Delalleau, Julien\nDemouth, Guillaume Desjardins, Sander Dieleman, Laurent Dinh, M\u00e9lanie Ducoffe, Vincent\nDumoulin, Samira Ebrahimi Kahou, Dumitru Erhan, Ziye Fan, Orhan Firat, Mathieu Germain,\nXavier Glorot, Ian Goodfellow, Matt Graham, Caglar Gulcehre, Philippe Hamel, Iban Harlouchet,\nJean-Philippe Heng, Bal\u00e1zs Hidasi, Sina Honari, Arjun Jain, S\u00e9bastien Jean, Kai Jia, Mikhail\nKorobov, Vivek Kulkarni, Alex Lamb, Pascal Lamblin, Eric Larsen, C\u00e9sar Laurent, Sean Lee,\nSimon Lefrancois, Simon Lemieux, Nicholas L\u00e9onard, Zhouhan Lin, Jesse A. 
Livezey, Cory Lorenz, Jeremiah Lowin, Qianli Ma, Pierre-Antoine Manzagol, Olivier Mastropietro, Robert T. McGibbon, Roland Memisevic, Bart van Merriënboer, Vincent Michalski, Mehdi Mirza, Alberto Orlandi, Christopher Pal, Razvan Pascanu, Mohammad Pezeshki, Colin Raffel, Daniel Renshaw, Matthew Rocklin, Adriana Romero, Markus Roth, Peter Sadowski, John Salvatier, François Savard, Jan Schlüter, John Schulman, Gabriel Schwartz, Iulian Vlad Serban, Dmitriy Serdyuk, Samira Shabanian, Étienne Simon, Sigurd Spieckermann, S. Ramana Subramanyam, Jakub Sygnowski, Jérémie Tanguay, Gijs van Tulder, Joseph Turian, Sebastian Urban, Pascal Vincent, Francesco Visin, Harm de Vries, David Warde-Farley, Dustin J. Webb, Matthew Willson, Kelvin Xu, Lijun Xue, Li Yao, Saizheng Zhang, and Ying Zhang. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

Brandon Amos and J. Zico Kolter. OptNet: Differentiable Optimization as a Layer in Neural Networks. arXiv:1703.00443 [cs, math, stat], March 2017. URL http://arxiv.org/abs/1703.00443.

Mihai Anitescu and Florian A. Potra. Formulating dynamic multi-rigid-body contact problems with friction as solvable linear complementarity problems. Nonlinear Dynamics, 14(3):231–247, 1997. URL http://www.springerlink.com/index/J71678405QK31722.pdf.

Christopher G Atkeson and Juan Carlos Santamaria. A comparison of direct and model-based reinforcement learning. In Proceedings of the 1997 IEEE International Conference on Robotics and Automation, volume 4, pages 3557–3564. IEEE, 1997.

Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, November 2013. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1306572110.
URL http://www.pnas.org/content/110/45/18327.

Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction Networks for Learning about Objects, Relations and Physics. arXiv:1612.00222 [cs], December 2016. URL http://arxiv.org/abs/1612.00222.

Gino Johannes Apolonia van den Bergen. Collision detection in interactive 3D environments. The Morgan Kaufmann series in interactive 3D technology. Elsevier/Morgan Kaufmann, Amsterdam; Boston, 2004. ISBN 978-1-55860-801-6.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Erin Catto. Computing Distance Using GJK. In GDC, 2010. URL http://box2d.org/downloads/.

Michael B. Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A Compositional Object-Based Approach to Learning Physical Dynamics. arXiv:1612.00341 [cs], December 2016. URL http://arxiv.org/abs/1612.00341.

Michael Bradley Cline. Rigid body simulation with contact and constraints. PhD thesis, University of British Columbia, 2002. URL https://pdfs.semanticscholar.org/8567/e2467bb5ad67f3a3f11e7c3c4386d9ca8210.pdf.

Richard W Cottle. Linear complementarity problem. In Encyclopedia of Optimization, pages 1873–1878. Springer, 2008.

Erwin Coumans et al. Bullet physics library. Open source: bulletphysics.org, 15(49):5, 2013.

Jonas Degrave, Michiel Hermans, Joni Dambre, and Francis wyffels. A Differentiable Physics Engine for Deep Learning in Robotics. arXiv:1611.01652 [cs], November 2016. URL http://arxiv.org/abs/1611.01652.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Josip Djolonga and Andreas Krause. Differentiable learning of submodular models.
In Neural Information Processing Systems (NIPS), 2017.

Roy Featherstone. Robot dynamics algorithms. 1984.

Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning Visual Predictive Models of Physics for Playing Billiards. arXiv:1511.07404 [cs], November 2015. URL http://arxiv.org/abs/1511.07404.

Helmut Garstenauer and Gerhard Kurka. A unified framework for rigid body dynamics. PhD thesis, 2006.

Dirk Gregorius. The Separating Axis Test. In GDC, 2013. URL http://box2d.org/downloads/.

Dirk Gregorius. Robust Contact Creation for Physics Simulations. In GDC, 2015. URL http://box2d.org/downloads/.

Jessica B. Hamrick, Kevin A. Smith, Thomas L. Griffiths, and Edward Vul. Think again? The amount of mental simulation tracks uncertainty in the outcome. In Proceedings of the Thirty-Seventh Annual Conference of the Cognitive Science Society. Citeseer, 2015. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.704.359&rep=rep1&type=pdf.

Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, and Tom Erez. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv:1510.09142 [cs], October 2015. URL http://arxiv.org/abs/1510.09142.

Michiel Hermans, Benjamin Schrauwen, Peter Bienstman, and Joni Dambre. Automated Design of Complex Dynamic Systems. PLoS ONE, 9(1):e86696, January 2014. ISSN 1932-6203. doi: 10.1371/journal.pone.0086696. URL http://dx.plos.org/10.1371/journal.pone.0086696.

Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.

Jeongseok Lee, Michael X. Grey, Sehoon Ha, Tobias Kunz, Sumit Jain, Yuting Ye, Siddhartha S. Srinivasa, Mike Stilman, and C. Karen Liu. DART: Dynamic Animation and Robotics Toolkit. The Journal of Open Source Software, 3(22):500, February 2018. ISSN 2475-9066.
doi: 10.21105/joss.00500. URL http://joss.theoj.org/papers/10.21105/joss.00500.

Adam Lerer, Sam Gross, and Rob Fergus. Learning Physical Intuition of Block Towers by Example. arXiv:1603.01312 [cs], March 2016. URL http://arxiv.org/abs/1603.01312.

Weiwei Li and Emanuel Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.

Chun Kai Ling, Fei Fang, and J Zico Kolter. What game are we playing? End-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777, 2018.

Lennart Ljung. System identification: Theory for the user. Prentice Hall, 1999.

Jan R. Magnus and Heinz Neudecker. Matrix differential calculus. 1988.

Jacob Mattingley and Stephen Boyd. CVXGEN: a code generator for embedded convex optimization. Optimization and Engineering, 13(1):1–27, March 2012. ISSN 1389-4420, 1573-2924. doi: 10.1007/s11081-011-9176-9. URL http://link.springer.com/10.1007/s11081-011-9176-9.

Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. arXiv preprint arXiv:1802.03676, 2018.

Nikhil Mishra, Pieter Abbeel, and Igor Mordatch. Prediction and control with temporal segment models. In International Conference on Machine Learning, pages 2459–2468, 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 0028-0836, 1476-4687.
doi: 10.1038/nature14236. URL http://www.nature.com/doifinder/10.1038/nature14236.

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Anurag Ranjan and Michael J. Black. Optical Flow Estimation using a Spatial Pyramid Network. arXiv:1611.00850 [cs], November 2016. URL http://arxiv.org/abs/1611.00850.

Kevin A. Smith and Edward Vul. Sources of Uncertainty in Intuitive Physics. Topics in Cognitive Science, 5(1):185–199, January 2013. ISSN 1756-8765. doi: 10.1111/tops.12009. URL http://onlinelibrary.wiley.com/doi/10.1111/tops.12009/abstract.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. ISBN 978-1-4673-1736-8 978-1-4673-1737-5 978-1-4673-1735-1. doi: 10.1109/IROS.2012.6386109. URL http://ieeexplore.ieee.org/document/6386109/.

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461, 2015. URL http://arxiv.org/abs/1509.06461.

Paul J Werbos. Neural networks for control and system identification. In Proceedings of the 28th IEEE Conference on Decision and Control, pages 260–265. IEEE, 1989.