{"title": "Learning to See Physics via Visual De-animation", "book": "Advances in Neural Information Processing Systems", "page_first": 153, "page_last": 164, "abstract": "We introduce a paradigm for understanding physical scenes without human annotations. At the core of our system is a physical world representation that is first recovered by a perception module and then utilized by physics and graphics engines. During training, the perception module and the generative models learn by visual de-animation --- interpreting and reconstructing the visual information stream. During testing, the system first recovers the physical world state, and then uses the generative models for reasoning and future prediction.  Even more so than forward simulation, inverting a physics or graphics engine is a computationally hard problem; we overcome this challenge by using a convolutional inversion network. Our system quickly recognizes the physical world state from appearance and motion cues, and has the flexibility to incorporate both differentiable and non-differentiable physics and graphics engines. We evaluate our system on both synthetic and real datasets involving multiple physical scenes, and demonstrate that our system performs well on both physical state estimation and reasoning problems. We further show that the knowledge learned on the synthetic dataset generalizes to constrained real images.", "full_text": "Learning to See Physics via Visual De-animation\n\nJiajun Wu\nMIT CSAIL\n\nErika Lu\n\nUniversity of Oxford\n\nPushmeet Kohli\n\nDeepMind\n\nWilliam T. Freeman\n\nMIT CSAIL, Google Research\n\nJoshua B. Tenenbaum\n\nMIT CSAIL\n\nAbstract\n\nWe introduce a paradigm for understanding physical scenes without human an-\nnotations. At the core of our system is a physical world representation that is\n\ufb01rst recovered by a perception module and then utilized by physics and graphics\nengines. During training, the perception module and the generative models learn\nby visual de-animation \u2014 interpreting and reconstructing the visual information\nstream. During testing, the system \ufb01rst recovers the physical world state, and then\nuses the generative models for reasoning and future prediction.\nEven more so than forward simulation, inverting a physics or graphics engine is\na computationally hard problem; we overcome this challenge by using a convo-\nlutional inversion network. Our system quickly recognizes the physical world\nstate from appearance and motion cues, and has the \ufb02exibility to incorporate both\ndifferentiable and non-differentiable physics and graphics engines. We evaluate our\nsystem on both synthetic and real datasets involving multiple physical scenes, and\ndemonstrate that our system performs well on both physical state estimation and\nreasoning problems. We further show that the knowledge learned on the synthetic\ndataset generalizes to constrained real images.\n\n1\n\nIntroduction\n\nInspired by human abilities, we wish to develop machine systems that understand scenes. Scene\nunderstanding has multiple de\ufb01ning characteristics which break down broadly into two features. First,\nhuman scene understanding is rich. Scene understanding is physical, predictive, and causal: rather\nthan simply knowing what is where, one can also predict what may happen next, or what actions\none can take, based on the physics afforded by the objects, their properties, and relations. 
These\npredictions, hypotheticals, and counterfactuals are probabilistic, integrating uncertainty as to what is\nmore or less likely to occur. Second, human scene understanding is fast. Most of the computation has\nto happen in a single, feedforward, bottom-up pass.\nThere have been many systems proposed recently to tackle these challenges, but existing systems\nhave architectural features that allow them to address one of these features but not the other. Typical\napproaches based on inverting graphics engines and physics simulators [Kulkarni et al., 2015b]\nachieve richness at the expense of speed. Conversely, neural networks such as PhysNet [Lerer et al.,\n2016] are fast, but their ability to generalize to rich physical predictions is limited.\nWe propose a new approach to combine the best of both. Our overall framework for representation is\nbased on graphics and physics engines, where graphics is run in reverse to build the initial physical\nscene representation, and physics is then run forward to imagine what will happen next or what can\nbe done. Graphics can also be run in the forward direction to visualize the outputs of the physics\nsimulation as images of what we expect to see in the future, or under different viewing conditions.\nRather than use traditional, often slow inverse graphics methods [Kulkarni et al., 2015b], we learn to\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Visual de-animation \u2014 we would like to recover the physical world representation behind\nthe visual input, and combine it with generative physics simulation and rendering engines.\n\ninvert the graphics engine ef\ufb01ciently using convolutional nets. Speci\ufb01cally, we use deep learning to\ntrain recognition models on the objects in our world for object detection, structure and viewpoint\nestimation, and physical property estimation. Bootstrapping from these predictions, we then infer the\nremaining scene properties through inference via forward simulation of the physics engine.\nWithout human supervision, our system learns by visual de-animation: interpreting and reconstructing\nvisual input. We show the problem formulation in Figure 1. The simulation and rendering engines in\nthe framework force the perception module to extract physical world states that best explain the data.\nAs the physical world states are inputs to physics and graphics engines, we simultaneously obtain an\ninterpretable, disentangled, and compact physical scene representation.\nOur framework is \ufb02exible and adaptable to a number of graphics and physics engines. We present\nmodel variants that use neural, differentiable physics engines [Chang et al., 2017], and variants that\nuse traditional physics engines, which are more mature but non-differentiable [Coumans, 2010]. We\nalso explore various graphics engines operating at different levels, ranging from mid-level cues such\nas object velocity, to pixel-level rendering of images.\nWe demonstrate our system on real and synthetic datasets across multiple domains: synthetic billiard\nvideos [Fragkiadaki et al., 2016], in which balls have varied physical properties, real billiard videos\nfrom the web, and real images of block towers from Facebook AI Research [Lerer et al., 2016].\nOur contributions are three-fold. 
First, we propose a novel generative pipeline for physical scene\nunderstanding, and demonstrate its \ufb02exibility by incorporating various graphics and physics engines.\nSecond, we introduce the problem of visual de-animation \u2013 learning rich scene representations\nwithout supervision by interpreting and reconstructing visual input. Third, we show that our system\nperforms well across multiple scenarios and on both synthetic and constrained real videos.\n\n2 Related Work\n\nPhysical scene understanding has attracted increasing attention in recent years [Gupta et al., 2010,\nJia et al., 2015, Lerer et al., 2016, Zheng et al., 2015, Battaglia et al., 2013, Mottaghi et al., 2016b,\nFragkiadaki et al., 2016, Battaglia et al., 2016, Mottaghi et al., 2016a, Chang et al., 2017, Agrawal\net al., 2016, Pinto et al., 2016, Finn et al., 2016, Hamrick et al., 2017, Ehrhardt et al., 2017, Shao et al.,\n2014, Zhang et al., 2016]. Researchers have attempted to go beyond the traditional goals of high-level\ncomputer vision, inferring \u201cwhat is where\u201d, to capture the physics needed to predict the immediate\nfuture of dynamic scenes, and to infer the actions an agent should take to achieve a goal. Most of these\nefforts do not attempt to learn physical object representations from raw observations. Some systems\nemphasize learning from pixels but without an explicitly object-based representation [Lerer et al.,\n2016, Mottaghi et al., 2016b, Fragkiadaki et al., 2016, Agrawal et al., 2016, Pinto et al., 2016, Li et al.,\n2017], which makes generalization challenging. Others learn a \ufb02exible model of the dynamics of\nobject interactions, but assume a decomposition of the scene into physical objects and their properties\nrather than learning directly from images [Chang et al., 2017, Battaglia et al., 2016].\n\n2\n\nvisual datavisual dataphysical worldphysical world\fFigure 2: Our visual de-animation (VDA) model contains three major components: a convolutional\nperception module (I), a physics engine (II), and a graphics engine (III). The perception module\nef\ufb01ciently inverts the graphics engine by inferring the physical object state for each segment proposal\nin input (a), and combines them to obtain a physical world representation (b). The generative physics\nand graphics engines then run forward to reconstruct the visual data (e). See Section 3 for details.\n\nThere have been some works that aim to estimate physical object properties [Wu et al., 2016, 2015,\nDenil et al., 2017]. Wu et al. [2015] explored an analysis-by-synthesis approach that is easily\ngeneralizable, but less ef\ufb01cient. Their framework also lacked a perception module. Denil et al.\n[2017] instead proposed a reinforcement learning approach. These approaches, however, assumed\nstrong priors of the scene, and approximate object shapes with primitives. Wu et al. [2016] used\na feed-forward network for physical property estimation without assuming prior knowledge of the\nenvironment, but the constrained setup did not allow interactions between multiple objects. By\nincorporating physics and graphics engines, our approach can jointly learn the perception module\nand physical model, optionally in a Helmholtz machine style [Hinton et al., 1995], and recover an\nexplicit physical object representation in a range of scenarios.\nAnother line of related work is on future state prediction in either image pixels [Xue et al., 2016,\nMathieu et al., 2016] or object trajectories [Kitani et al., 2017, Walker et al., 2015]. 
There has also\nbeen abundant research making use of physical models for human or scene tracking [Salzmann\nand Urtasun, 2011, Kyriazis and Argyros, 2013, Vondrak et al., 2013, Brubaker et al., 2009]. Our\nmodel builds upon and extends these ideas by jointly modeling an approximate physics engine and a\nperceptual module, with wide applications including, but not limited to, future prediction.\nOur framework also relates to the \ufb01eld of \u201cvision as inverse graphics\u201d [Zhu and Mumford, 2007,\nYuille and Kersten, 2006, Bai et al., 2012]. Connected to but different from traditional analysis-\nby-synthesis approaches, recent works explored using deep neural networks to ef\ufb01ciently explain\nan object [Kulkarni et al., 2015a, Rezende et al., 2016], or a scene with multiple objects [Ba et al.,\n2015, Huang and Murphy, 2015, Eslami et al., 2016]. In particular, Wu et al. [2017] proposed \u201cscene\nde-rendering\u201d, building an object-based, structured representation from a static image. Our work\nincorporates inverse graphics with simulation engines for physical scene understanding and scene\ndynamics modeling.\n\n3 Visual De-animation\n\nOur visual de-animation (VDA) model consists of an ef\ufb01cient inverse graphics component to build\nthe initial physical world representation from visual input, a physics engine for physical reasoning of\nthe scene, and a graphics engine for rendering videos. We show the framework in Figure 2. In this\nsection, we \ufb01rst present an overview of the system, and then describe each component in detail.\n\n3.1 Overview\n\nThe \ufb01rst component of our system is an approximate inverse graphics module for physical object and\nscene understanding, as shown in Figure 2-I. Speci\ufb01cally, the system sequentially computes object\nproposals, recognizes objects and estimates their physical state, and recovers the scene layout.\n\n3\n\n(e) Output(a) Input(II) Physics engine (simulation)(III) Graphics engine (rendering)(b) Physical world representation(c) Appearance cuesObject proposalObject proposalPhysical object statePhysical object stateNMS(I) Perception module(d) Likelihood\fThe second component of our system is a physics engine, which uses the physical scene representation\nrecovered by the inverse graphics module to simulate future dynamics of the environment (Figure 2-\nII). Our system adapts to both neural, differentiable simulators, which can be jointly trained with the\nperception module, and rigid-body, non-differentiable simulators, which can be incorporated using\nmethods such as REINFORCE [Williams, 1992].\nThe third component of our framework is a graphics engine (Figure 2-III), which takes the scene\nrepresentations from the physics engine and re-renders the video at various levels (e.g. optical \ufb02ow,\nraw pixel). The graphics engine may need additional appearance cues such as object shape or color\n(Figure 2c). Here, we approximate them using simple heuristics, as they are not a focus of our paper.\nThere is a tradeoff between various rendering levels: while pixel-level reconstruction captures details\nof the scene, rendering at a more abstract level (e.g. silhouettes) may better generalize. 
We then use a\nlikelihood function (Figure 2d) to evaluate the difference between synthesized and observed signals,\nand compute gradients or rewards for differentiable and non-differentiable systems, respectively.\nOur model combines ef\ufb01cient and powerful deep networks for recognition with rich simulation\nengines for forward prediction. This provides us two major advantages over existing methods:\n\ufb01rst, simulation engines take an interpretable representation of the physical world, and can thus\neasily generalize and supply rich physical predictions; second, the model learns by explaining the\nobservations \u2014 it can be trained in a self-supervised manner without requiring human annotations.\n\n3.2 Physical Object and Scene Modeling\n\nWe now discuss each component in detail, starting with the perception module.\n\nObject proposal generation Given one or a few frames (Figure 2a), we \ufb01rst generate a number of\nobject proposals. The masked images are then used as input to the following stages of the pipeline.\n\nPhysical object state estimation For each segment proposal, we use a convolutional network to\nrecognize the physical state of the object, which consists of intrinsic properties such as shape, mass,\nand friction, as well as extrinsic properties such as 3D position and pose. The input to the network is\nthe masked image of the proposal, and the output is an interpretable vector for its physical state.\n\nPhysical world reconstruction Given objects\u2019 physical states, we \ufb01rst apply non-maximum sup-\npression to remove object duplicates, and then reconstruct the physical world according to object\nstates. The physical world representation (Figure 2b) will be employed by the physics and graphics\nengines for simulation and rendering.\n\n3.3 Physical Simulation and Prediction\n\nThe two types of physics engines we explore in this paper include a neural, differentiable physics\nengine and a standard rigid-body simulation engine.\n\nNeural physics engines The neural physics engine is an extension of the recent work from Chang\net al. [2017], which simulates scene dynamics by taking object mass, position, and velocity. We extend\ntheir framework to model object friction in our experiments on billiard table videos. Though basic,\nthe neural physics engine is differentiable, and thus can be end-to-end trained with our perception\nmodule to explain videos. Please refer to Chang et al. [2017] for details of the neural physics engine.\n\nRigid body simulation engines There exist rather mature, rigid-body physics simulation engines,\ne.g. Bullet [Coumans, 2010]. 
Such physics engines are much more powerful, but non-differentiable.\nIn our experiments on block towers, we used a non-differentiable simulator with multi-sample\nREINFORCE [Rezende et al., 2016, Mnih and Rezende, 2016] for joint training.\n\n3.4 Re-rendering with a Graphics Engine\n\nIn this work, we consider two graphics engines operating at different levels: for the billiard table\nscenario, we use a renderer that takes the output of a physics engine and generates pixel-level\nrendering; for block towers, we use one that computes only object silhouettes.\n\n4\n\n\fFigure 3: The three settings of our synthetic billiard videos: (a) balls have the same appearance and\nphysical properties, where the system learns to discover them and simulate the dynamics; (b) balls\nhave the same appearance but different physics, and the system learns their physics from motion; (c)\nballs have varied appearance and physics, and the system learns to associate appearance cues with\nunderlying object states, even from a single image.\n\nFigure 4: Results on the billiard videos, comparing ground truth videos with our predictions. We\nshow two of three input frames (in red) due to space constraints. Left: balls share appearance and\nphysics (I), where our framework learns to discover objects and simulate scene dynamics. Top right:\nballs have different appearance and physics (II), where our model learns to associate appearance with\nphysics and simulate collisions. It learns that the green ball should move further than the heavier\nblue ball after the collision. Bottom right: balls share appearance but have different frictions (III),\nwhere our model learns to associate motion with friction. It realizes from three input frames that the\nright-most ball in the \ufb01rst frame has a large friction coef\ufb01cient and will stop before the other balls.\n\n4 Evaluation\n\nWe evaluate variants of our frameworks in three scenarios: synthetic billiard videos, real billiard\nvideos, and block towers. We also test how models trained on synthetic data generalize to real cases.\n\n4.1 Billiard Tables: A Motivating Example\n\nWe begin with synthetic billiard videos to explore end-to-end learning of the perceptual module along\nwith differentiable simulation engines. We explore how our framework learns the physical object\nstate (position, velocity, mass, and friction) from its appearance and/or motion.\n\nData For the billiard table scenario, we generate data using the released code from Fragkiadaki\net al. [2016]. We updated the code to allow balls of different mass and friction. We used the billiard\ntable scenario as an initial exploration of whether our models can learn to associate visual object\nappearance and motion with physical properties. As shown in Figure 3, we generated three subsets,\nin which balls may have shared or differing appearance (color), and physical properties. For each\ncase, we generated 9,000 videos for training and 200 for testing.\n(I) Shared appearance and physics (Figure 3a): balls all have the same appearance and the same\nphysical properties. 
This basic setup evaluates whether we can jointly learn an object (ball) discoverer\nand a physics engine for scene dynamics.\n\n5\n\n(a) shared appearance,shared physics(b) varied appearance,varied physics(c) shared appearance,varied physicsInput (red) and ground truthReconstruction and predictionInput (red) and ground truthReconstruction and predictionFrame t-2 Frame tFrame t+2Frame t+5Frame t+10Frame t-2 Frame tFrame t+2Frame t+5Frame t+10\fDatasets\n\nAppear. Physics\n\nSame\n\nSame\n\nDiff\n\nDiff.\n\nSame\n\nDiff.\n\nMethods\n\nBaseline\nVDA (init)\nVDA (full)\nBaseline\nVDA (init)\nVDA (full)\nBaseline\nVDA (init)\nVDA (full)\n\n0.046\n0.046\n0.044\n0.058\n0.058\n0.055\n0.038\n0.038\n0.035\n\nRecon.\n\nPixel MSE 1st\n\n10th\n46.06\n12.76\n12.34\n70.47\n17.09\n16.33\n\nPosition Prediction (Abs)\n20th\n119.97\n26.10\n25.55\n180.04\n34.65\n32.97\n\nVelocity Prediction (Abs)\n20th\n1st\n10th\n5th\n8.45\n2.95\n7.32\n4.58 18.20\n2.65\n1.41\n2.34\n6.61\n3.46\n1.37\n2.55\n2.22\n3.25\n6.52\n10.51 12.02\n3.78\n6.57 26.38\n3.21\n3.02\n1.65\n8.92\n3.82\n3.55\n1.64\n2.89\n3.05\n8.58\n9.58 79.78 143.67 202.56 12.37 23.42 25.08 23.98\n3.77\n6.06 19.75\n5.77 19.34\n3.35\n\n5th\n5.63\n1.97\n1.87\n7.62\n2.27\n2.20\n\n5.16\n4.98\n\n5.01\n4.77\n\n34.24\n33.25\n\n46.40\n43.42\n\n3.37\n3.23\n\nTable 1: Quantitative results on synthetic billiard table videos. We evaluate our visual de-animation\n(VDA) model before and after joint training (init vs. full). For scene reconstruction, we compute\nMSE between rendered images and ground truth. For future prediction, we compute the Manhattan\ndistance in pixels between predicted object position and velocity and ground truth in pixels, at the 1st,\n5th, 10th, and 20th future frames. Our full model performs better. See qualitative results in Figure 4.\n\n(a) Future prediction results on synthetic billiard\nvideos of balls of the same physics\nFigure 5: Behavioral study results on future position prediction of billiard videos where balls have\nthe same physical properties (a), and balls have varied physical properties (b). For each model and\nhumans, we compare how their long-term relative prediction error grows, by measuring the ratio with\nrespect to errors in predicting the \ufb01rst next frame. Compared to the baseline model, the behavior of\nour prediction models aligns well with human predictions.\n\n(b) Future prediction results on synthetic billiard\nvideos of balls of varied frictions\n\n(II) Varied appearance and physics (Figure 3b): balls can be of three different masses (light, medium,\nheavy), and two different friction coef\ufb01cients. Each of the six possible combinations is associated\nwith a unique color (appearance). In this setup, the scene de-rendering component should be able to\nassociate object appearance with its physical properties, even from a single image.\n(III) Shared appearance, varied physics (Figure 3c): balls have the same appearance, but have one of\ntwo different friction coef\ufb01cients. Here, the perceptual component should be able to associate object\nmotion with its corresponding friction coef\ufb01cients, from just a few input images.\n\nSetup For this task, the physical state of an object is its intrinsic properties, including mass m and\nfriction f, and its extrinsic properties, including 2D position {x, y} and velocity v. Our system takes\nthree 256\u00d7256 RGB frames I1, I2, I3 as input. 
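To make this interface between the perception module and the simulation engines concrete, the short Python sketch below shows one way such a per-ball state could be represented in code. It is only an illustrative sketch (our actual implementation, described below, is in Torch7); the class and function names and the example values are assumptions made for the illustration, not part of any released code.

from dataclasses import dataclass
from typing import List

@dataclass
class BallState:
    """Interpretable physical state of one ball (illustrative names, not the actual code)."""
    mass: float      # intrinsic property m
    friction: float  # intrinsic property f
    x: float         # extrinsic: 2D position
    y: float
    vx: float        # extrinsic: 2D velocity
    vy: float

def to_vector(s: BallState) -> List[float]:
    """Pack a state into the flat vector a perception network could regress."""
    return [s.mass, s.friction, s.x, s.y, s.vx, s.vy]

def from_vector(v: List[float]) -> BallState:
    """Unpack a regressed vector back into an interpretable state."""
    m, f, x, y, vx, vy = v
    return BallState(m, f, x, y, vx, vy)

# A physical world representation is then simply a list of such states,
# one per detected object, which the physics engine rolls forward in time.
world: List[BallState] = [
    BallState(mass=1.0, friction=0.02, x=64.0, y=96.0, vx=3.0, vy=-1.5),
    BallState(mass=2.0, friction=0.02, x=180.0, y=40.0, vx=0.0, vy=0.0),
]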
The system first obtains flow fields from I1 to I2 and from I2 to I3 using a pretrained spatial pyramid network (SPyNet) [Ranjan and Black, 2017]. It then generates object proposals by applying color filters to the input images.

Our perceptual model is a ResNet-18 [He et al., 2015], which takes as input the three masked RGB frames and two masked flow images of each object proposal, and recovers the object's physical state. We use a differentiable, neural physics engine with object intrinsic properties as parameters; at each step, it predicts objects' extrinsic properties (position {x, y} and velocity v) in the next frame, based on their current estimates. We employ a graphics engine that renders the original images from the predicted positions, where the color of each ball is set to the mean color of the input object proposal. The likelihood function compares, at a pixel level, these rendered images and the observations. It is straightforward to compute the gradient of object position from rendered RGB images and ground truth. Thus, this simple graphics engine is also differentiable, making our system end-to-end trainable.

Figure 6: Sample results on web videos of real billiard games and computer games with realistic rendering. Left: our method correctly estimates the trajectories of multiple objects. Right: our framework correctly predicts the two collisions (white vs. red, white vs. blue), despite the motion blur in the input, though it underestimates the velocity of the red ball after the collision. Note that the billiard table is a chaotic system, and highly accurate long-term prediction is intractable.

Our training paradigm consists of two steps. First, we pretrain the perception module and the neural physics engine separately on synthetic data, where ground truth is available. The second step is end-to-end fine-tuning without annotations. We observe that the framework does not converge well without pre-training, possibly due to the multiple hypotheses that can explain a scene (e.g., we can only observe relative, not absolute, masses from collisions). We train our framework using SGD, with a learning rate of 0.001 and a momentum of 0.9. We implement our framework in Torch7 [Collobert et al., 2011]. During testing, the perception module recovers object physical states, and the learned physics engine is then run forward for future prediction.

Results  Our formulation recovers a rich representation of the scene. With the generative models, we show results on scene reconstruction and future prediction. We compare two variants of our algorithm: the initial system has its perception module and neural physics engine separately trained, while the full system has an additional end-to-end fine-tuning step, as discussed above. We also compare with a baseline, which has the same perception model but, for prediction, simply repeats each object's past dynamics without modeling interactions among objects.

Scene reconstruction: given input frames, we are able to reconstruct the images based on the inferred physical states. For evaluation, we compute pixel-level MSE between reconstructions and observed images.
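To make this evaluation concrete, the short Python/NumPy sketch below renders balls as solid discs at their estimated positions and scores the reconstruction with pixel-level MSE. It is a simplified stand-in for our actual renderer and likelihood; the function names, the fixed disc radius, and the example coordinates are assumptions made for the sketch, not the exact implementation.

import numpy as np

def render_frame(positions, colors, height=256, width=256, radius=8):
    """Draw each ball as a solid disc of its mean proposal color (simplified renderer)."""
    frame = np.zeros((height, width, 3), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (cx, cy), color in zip(positions, colors):
        mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
        frame[mask] = color
    return frame

def pixel_mse(reconstruction, observation):
    """Pixel-level mean squared error between a rendered frame and the observed frame."""
    return float(np.mean((reconstruction - observation) ** 2))

# Example: score a reconstruction of two balls against a (here synthetic) observation.
obs = render_frame([(64, 96), (180, 40)], [(1.0, 1.0, 1.0), (0.9, 0.1, 0.1)])
rec = render_frame([(66, 95), (178, 42)], [(1.0, 1.0, 1.0), (0.9, 0.1, 0.1)])
print(pixel_mse(rec, obs))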
We show qualitative results in Figure 4 and quantitative results in Table 1.

Future prediction: with the learned neural simulation engine, our system is able to predict future events based on physical world states. We show qualitative results in Figure 4 and quantitative results in Table 1, where we compute the Manhattan distance in pixels between the predicted positions and velocities and the ground truth. Our model achieves good performance in reconstructing the scene, understanding object physics, and predicting scene dynamics. See the caption for details.

Behavioral studies  We further conduct behavioral studies, in which we select 50 test cases, show the first three frames of each to human subjects, and ask them to predict the positions of each ball in future frames. We test 3 subjects per case on Amazon Mechanical Turk. For each model and for humans, we compare how the long-term relative prediction error grows, by measuring the ratio with respect to the error in predicting the first future frame. As shown in Figure 5, the behavior of our models resembles human predictions much more closely than the baseline model does.

4.2 Billiard Tables: Transferring to Real Videos

Data  We also collected videos from YouTube, segmenting them into two-second clips. Some videos are from real billiard competitions, and the others are from computer games with realistic rendering. We use these clips as an out-of-sample test set for evaluating the model's generalization ability.

Setup and Results  Our setup is the same as that in Section 4.1, except that we now re-train the perceptual model on the synthetic data of varied physics, but with flow images as input instead of RGB images. Flow images abstract away appearance changes (color, lighting, etc.), allowing the model to generalize better to real data. We show qualitative results of reconstruction and future prediction in Figure 6, rendering our inferred representation with the graphics software Blender.

(a) Our reconstruction and prediction results given a single frame (marked in red). From top to bottom: ground truth, our results, results from Lerer et al. [2016].

(b) Accuracy (%) of stability prediction on the blocks dataset:

Methods      2 Blocks  3 Blocks  4 Blocks  Mean
Chance          50        50        50       50
Humans          67        62        62       64
PhysNet         66        66        73       68
GoogLeNet       70        70        70       70
VDA (init)      73        74        72       73
VDA (joint)     75        76        73       75
VDA (full)      76        76        74       75

(c) Accuracy (%) of stability prediction when trained on synthetic towers of 2 and 4 blocks, and tested on all block tower sizes:

Methods      2 Blocks  3 Blocks  4 Blocks  Mean
PhysNet         56        68        70       65
GoogLeNet       70        67        71       69
VDA (init)      74        74        67       72
VDA (joint)     75        77        70       74
VDA (full)      76        76        72       75

(d) Our reconstruction and prediction results given a single frame (marked in red).

Figure 7: Results on the blocks dataset [Lerer et al., 2016]. For the quantitative results in (b), we compare three variants of our visual de-animation (VDA) model: the perceptual module trained without fine-tuning (init), joint fine-tuning with REINFORCE (joint), and the full model considering the stability constraint (full). We also compare with PhysNet [Lerer et al., 2016] and GoogLeNet [Szegedy et al., 2015].

4.3 The Blocks World

We now look into a different scenario: block towers. In this experiment, we demonstrate the applicability of our model to explain and reason from a static image, instead of a video.
We focus on reasoning about object states in the 3D world, rather than physical properties such as mass. We also explore how our framework performs with non-differentiable simulation engines, and how physics signals (e.g., stability) can help in physical reasoning, even when only a static image is given.

Data  Lerer et al. [2016] built a dataset of 492 images of real block towers, with ground truth stability values. Each image may contain 2, 3, or 4 blocks of red, blue, yellow, or green color. Though the blocks are the same size, their sizes in each 2D image differ due to the 3D-to-2D perspective transformation. Objects are made of the same material and thus have identical mass and friction.

Figure 8: Examples of predicting hypothetical scenarios and actively engaging with the scene. Left: predictions of the outcome of forces applied to two stable towers. Right: multiple ways to stabilize two unstable towers.

Setup  Here, the physical state of an object (block) consists of its 3D position {x, y, z} and 3D rotation (roll, pitch, yaw, each quantized into 20 bins). Our perceptual model is again a ResNet-18 [He et al., 2015], which takes block silhouettes generated by simple color filters as input, and recovers the object's physical state. For this task, we implement an efficient, non-differentiable, rigid-body simulator to predict whether the blocks are stable. We also implement a graphics engine to render object silhouettes for reconstructing the input. Our likelihood function consists of two terms: the MSE between rendered silhouettes and observations, and the binary cross-entropy between the predicted stability and the ground truth stability.

Our training paradigm resembles the classic wake-sleep algorithm [Hinton et al., 1995]: first, generate 10,000 training images using the simulation engines; second, train the perception module on synthetic data with ground truth physical states; third, fine-tune the perceptual module end-to-end by explaining an additional 100,000 synthetic images without annotations of physical states, but with binary annotations of stability. We use multi-sample REINFORCE [Rezende et al., 2016, Mnih and Rezende, 2016] with 16 samples per input, assuming each position parameter follows a Gaussian distribution and each rotation parameter follows a multinomial distribution (quantized into 20 bins). We observe that this training paradigm helps the framework converge. The other settings are the same as those in Section 4.1.

Results  We show results on two tasks: scene reconstruction and stability prediction. For each task, we compare three variants of our algorithm: the initial system has its perception module trained without fine-tuning; an intermediate system has joint end-to-end fine-tuning, but without considering the physics constraint; and the full system considers both reconstruction and physical stability during fine-tuning.

We show qualitative results on scene reconstruction in Figures 7a and 7d, where we also demonstrate future prediction results by exporting our inferred physical states into Blender. We show quantitative results on stability prediction in Table 7b, where we compare our models with PhysNet [Lerer et al., 2016] and GoogLeNet [Szegedy et al., 2015]. Given only a static image as test input, our algorithms achieve higher prediction accuracy (75% vs. 
70%) ef\ufb01ciently (<10 milliseconds per image).\nOur framework also generalizes well. We test out-of-sample generalization ability, where we train our\nmodel on 2- and 4-block towers, but test it on all tower sizes. We show results in Table 7c. Further, in\nFigure 8, we show examples where our physical scene representation combined with a physics engine\ncan easily make conditional predictions, answering \u201cWhat happens if...\u201d-type questions. Speci\ufb01cally,\nwe show frame prediction of external forces on stable block towers, as well as ways that an agent can\nstabilize currently unstable towers, with the help of rich simulation engines.\n\n5 Discussion\n\nWe propose combining ef\ufb01cient, bottom-up, neural perception modules with rich, generalizable\nsimulation engines for physical scene understanding. Our framework is \ufb02exible and can incorporate\nvarious graphics and physics engines. It performs well across multiple synthetic and real scenarios,\nreconstructing the scene and making future predictions accurately and ef\ufb01ciently. We expect our\nframework to have wider applications in the future, due to the rapid development of scene description\nlanguages, 3D reconstruction methods, simulation engines and virtual environments.\n\n9\n\nInputVDAWhat if?InputFutureStabilizing force\fAcknowledgements\n\nWe thank Michael Chang, Donglai Wei, and Joseph Lim for helpful discussions. This work is\nsupported by NSF #1212849 and #1447476, ONR MURI N00014-16-1-2007, the Center for Brain,\nMinds and Machines (NSF #1231216), Toyota Research Institute, Samsung, Shell, and the MIT\nAdvanced Undergraduate Research Opportunities Program (SuperUROP).\n\nReferences\n\nPulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking:\n\nExperiential learning of intuitive physics. In NIPS, 2016. 2\n\nJimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In\n\nICLR, 2015. 3\n\nJiamin Bai, Aseem Agarwala, Maneesh Agrawala, and Ravi Ramamoorthi. Selectively de-animating video.\n\nACM TOG, 31(4):66, 2012. 3\n\nPeter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene\n\nunderstanding. PNAS, 110(45):18327\u201318332, 2013. 2\n\nPeter W Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu.\n\nnetworks for learning about objects, relations and physics. In NIPS, 2016. 2\n\nInteraction\n\nMarcus A. Brubaker, David J. Fleet, and Aaron Hertzmann. Physics-based person tracking using the anthropo-\n\nmorphic walker. IJCV, 87(1-2):140\u2013155, 2009. 3\n\nMichael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based\n\napproach to learning physical dynamics. In ICLR, 2017. 2, 4\n\nRonan Collobert, Koray Kavukcuoglu, and Cl\u00e9ment Farabet. Torch7: A matlab-like environment for machine\n\nlearning. In BigLearn, NIPS Workshop, 2011. 7\n\nErwin Coumans. Bullet physics engine. Open Source Software: http://bulletphysics. org, 2010. 2, 4\n\nMisha Denil, Pulkit Agrawal, Tejas D Kulkarni, Tom Erez, Peter Battaglia, and Nando de Freitas. Learning to\n\nperform physics experiments via deep reinforcement learning. In ICLR, 2017. 3\n\nSebastien Ehrhardt, Aron Monszpart, Niloy J Mitra, and Andrea Vedaldi. Learning a physical long-term predictor.\n\narXiv:1703.00247, 2017. 2\n\nSM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E Hinton. 
Attend,\n\ninfer, repeat: Fast scene understanding with generative models. In NIPS, 2016. 3\n\nChelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video\n\nprediction. In NIPS, 2016. 2\n\nKaterina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of\n\nphysics for playing billiards. In ICLR, 2016. 2, 5\n\nAbhinav Gupta, Alexei A Efros, and Martial Hebert. Blocks world revisited: Image understanding using\n\nqualitative geometry and mechanics. In ECCV, 2010. 2\n\nJessica B Hamrick, Andrew J Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W Battaglia.\n\nMetacontrol for adaptive imagination-based optimization. In ICLR, 2017. 2\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In\n\nCVPR, 2015. 6, 9\n\nGeoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The \u201cwake-sleep\u201d algorithm for\n\nunsupervised neural networks. Science, 268(5214):1158, 1995. 3, 9\n\nJonathan Huang and Kevin Murphy. Ef\ufb01cient inference in occlusion-aware generative models of images. In\n\nICLR Workshop, 2015. 3\n\nZhaoyin Jia, Andy Gallagher, Ashutosh Saxena, and Tsuhan Chen. 3d reasoning from blocks to stability. IEEE\n\nTPAMI, 37(5):905\u2013918, 2015. 2\n\n10\n\n\fKris M. Kitani, De-An Huang, and Wei-Chiu Ma. Activity forecasting. In Group and Crowd Behavior for\n\nComputer Vision, pages 273\u2013294. Elsevier, 2017. 3\n\nTejas D Kulkarni, Pushmeet Kohli, Joshua B Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic\n\nprogramming language for scene perception. In CVPR, 2015a. 3\n\nTejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse\n\ngraphics network. In NIPS, 2015b. 1\n\nNikolaos Kyriazis and Antonis Argyros. Physically plausible 3d scene tracking: The single actor hypothesis. In\n\nCVPR, 2013. 3\n\nAdam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. In ICML,\n\n2016. 1, 2, 8, 9\n\nWenbin Li, Ales Leonardis, and Mario Fritz. Visual stability prediction for robotic manipulation. In ICRA, 2017.\n\n2\n\nMichael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square\n\nerror. In ICLR, 2016. 3\n\nAndriy Mnih and Danilo J Rezende. Variational inference for monte carlo objectives. In ICML, 2016. 4, 9\n\nRoozbeh Mottaghi, Hessam Bagherinezhad, Mohammad Rastegari, and Ali Farhadi. Newtonian scene under-\n\nstanding: Unfolding the dynamics of objects in static images. In CVPR, 2016a. 2\n\nRoozbeh Mottaghi, Mohammad Rastegari, Abhinav Gupta, and Ali Farhadi. \u201cwhat happens if...\u201d learning to\n\npredict the effect of forces in images. In ECCV, 2016b. 2\n\nLerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, and Abhinav Gupta. The curious robot: Learning\n\nvisual representations via physical interactions. In ECCV, 2016. 2\n\nAnurag Ranjan and Michael J Black. Optical \ufb02ow estimation using a spatial pyramid network. In CVPR, 2017. 6\n\nDanilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess.\n\nUnsupervised learning of 3d structure from images. In NIPS, 2016. 3, 4, 9\n\nMathieu Salzmann and Raquel Urtasun. Physically-based motion models for 3d tracking: A convex formulation.\n\nIn ICCV, 2011. 3\n\nTianjia Shao, Aron Monszpart, Youyi Zheng, Bongjin Koo, Weiwei Xu, Kun Zhou, and Niloy J Mitra. 
Imagining the unseen: Stability-based cuboid arrangements for scene understanding. ACM TOG, 33(6), 2014. 2

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015. 8, 9

Marek Vondrak, Leonid Sigal, and Odest Chadwicke Jenkins. Dynamical simulation priors for human motion tracking. IEEE TPAMI, 35(1):52–65, 2013. 3

Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In ICCV, 2015. 3

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. MLJ, 8(3-4):229–256, 1992. 4

Jiajun Wu, Ilker Yildirim, Joseph J Lim, William T Freeman, and Joshua B Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, 2015. 3

Jiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning physical object properties from unlabeled videos. In BMVC, 2016. 3

Jiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017. 3

Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016. 3

Alan Yuille and Daniel Kersten. Vision as bayesian inference: analysis by synthesis? TiCS, 10(7):301–308, 2006. 3

Renqiao Zhang, Jiajun Wu, Chengkai Zhang, William T Freeman, and Joshua B Tenenbaum. A comparative evaluation of approximate probabilistic simulation and deep neural networks as accounts of human physical scene understanding. In CogSci, 2016. 2

Bo Zheng, Yibiao Zhao, Joey Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Scene understanding by reasoning stability and safety. IJCV, 112(2):221–238, 2015. 2

Song-Chun Zhu and David Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4):259–362, 2007. 3", "award": [], "sourceid": 121, "authors": [{"given_name": "Jiajun", "family_name": "Wu", "institution": "MIT"}, {"given_name": "Erika", "family_name": "Lu", "institution": "University of Oxford"}, {"given_name": "Pushmeet", "family_name": "Kohli", "institution": "DeepMind"}, {"given_name": "Bill", "family_name": "Freeman", "institution": "MIT/Google"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}]}