{"title": "Unsupervised Learning for Physical Interaction through Video Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 64, "page_last": 72, "abstract": "A core challenge for an agent learning to interact with the world is to predict how its actions affect objects in its environment. Many existing methods for learning the dynamics of physical interactions require labeled object information. However, to scale real-world interaction learning to a variety of scenes and objects, acquiring labeled data becomes increasingly impractical. To learn about physical object motion without labels, we develop an action-conditioned video prediction model that explicitly models pixel motion, by predicting a distribution over pixel motion from previous frames. Because our model explicitly predicts motion, it is partially invariant to object appearance, enabling it to generalize to previously unseen objects. To explore video prediction for real-world interactive agents, we also introduce a dataset of 59,000 robot interactions involving pushing motions, including a test set with novel objects. In this dataset, accurate prediction of videos conditioned on the robot's future actions amounts to learning a \"visual imagination\" of different futures based on different courses of action. Our experiments show that our proposed method produces more accurate video predictions both quantitatively and qualitatively, when compared to prior methods.", "full_text": "Unsupervised Learning for Physical Interaction\n\nthrough Video Prediction\n\nChelsea Finn\u2217\nUC Berkeley\n\nIan Goodfellow\n\nOpenAI\n\ncbfinn@eecs.berkeley.edu\n\nian@openai.com\n\nSergey Levine\nGoogle Brain\nUC Berkeley\n\nslevine@google.com\n\nAbstract\n\nA core challenge for an agent learning to interact with the world is to predict\nhow its actions affect objects in its environment. Many existing methods for\nlearning the dynamics of physical interactions require labeled object information.\nHowever, to scale real-world interaction learning to a variety of scenes and objects,\nacquiring labeled data becomes increasingly impractical. To learn about physical\nobject motion without labels, we develop an action-conditioned video prediction\nmodel that explicitly models pixel motion, by predicting a distribution over pixel\nmotion from previous frames. Because our model explicitly predicts motion, it\nis partially invariant to object appearance, enabling it to generalize to previously\nunseen objects. To explore video prediction for real-world interactive agents, we\nalso introduce a dataset of 59,000 robot interactions involving pushing motions,\nincluding a test set with novel objects. In this dataset, accurate prediction of videos\nconditioned on the robot\u2019s future actions amounts to learning a \u201cvisual imagination\u201d\nof different futures based on different courses of action. Our experiments show that\nour proposed method produces more accurate video predictions both quantitatively\nand qualitatively, when compared to prior methods.\n\n1\n\nIntroduction\n\nObject detection, tracking, and motion prediction are fundamental problems in computer vision,\nand predicting the effect of physical interactions is a critical challenge for learning agents acting in\nthe world, such as robots, autonomous cars, and drones. Most existing techniques for learning to\npredict physics rely on large manually labeled datasets (e.g. [18]). 
However, if interactive agents can use unlabeled raw video data to learn about physical interaction, they can autonomously collect virtually unlimited experience through their own exploration. Learning a representation which can predict future video without labels has applications in action recognition and prediction and, when conditioned on the action of the agent, amounts to learning a predictive model that can then be used for planning and decision making.

However, learning to predict physical phenomena poses many challenges, since real-world physical interactions tend to be complex and stochastic, and learning from raw video requires handling the high dimensionality of image pixels and the partial observability of object motion from videos. Prior video prediction methods have typically considered short-range prediction [17], small image patches [22], or synthetic images [20]. Such models follow a paradigm of reconstructing future frames from the internal state of the model. In our approach, we propose a method which does not require the model to store the object and background appearance. Such appearance information is directly available in the previous frame. We develop a predictive model which merges appearance information from previous frames with motion predicted by the model. As a result, the model is better able to predict future video sequences for multiple steps, even involving objects not seen at training time.

∗Work was done while the author was at Google Brain.

To merge appearance and predicted motion, we output the motion of pixels relative to the previous image. Applying this motion to the previous image forms the next frame. We present and evaluate three motion prediction modules. The first, which we refer to as dynamic neural advection (DNA), outputs a distribution over locations in the previous frame for each pixel in the new frame. The predicted pixel value is then computed as an expectation under this distribution. A variant on this approach, which we call convolutional dynamic neural advection (CDNA), outputs the parameters of multiple normalized convolution kernels to apply to the previous image to compute new pixel values. The last approach, which we call spatial transformer predictors (STP), outputs the parameters of multiple affine transformations to apply to the previous image, akin to the spatial transformer network previously proposed for supervised learning [11]. In the case of the latter two methods, each predicted transformation is meant to handle separate objects. To combine the predictions into a single image, the model also predicts a compositing mask over each of the transformations. DNA and CDNA are simpler and easier to implement than STP, and while all models achieve comparable performance, the object-centric CDNA and STP models also provide interpretable internal representations.

Our main contribution is a method for making long-range predictions in real-world videos by predicting pixel motion. When conditioned on the actions taken by an agent, the model can learn to imagine different futures from different actions. To learn about physical interaction from videos, we need a large dataset with complex object interactions. 
We collected a dataset of 59,000 robot\npushing motions, consisting of 1.5 million frames and the corresponding actions at each time step.\nOur experiments using this new robotic pushing dataset, and using a human motion video dataset [10],\nshow that models that explicitly transform pixels from previous frames better capture object motion\nand produce more accurate video predictions compared to prior state-of-the-art methods. The dataset,\nvideo results, and code are all available online: sites.google.com/site/robotprediction.\n\n2 Related Work\n\nVideo prediction: Prior work on video prediction has tackled synthetic videos and short-term\nprediction in real videos. Yuan et al. [30] used a nearest neighbor approach to construct predictions\nfrom similar videos in a dataset. Ranzato et al. proposed a baseline for video prediction inspired by\nlanguage models [21]. LSTM models have been adapted for video prediction on patches [22], action-\nconditioned Atari frame predictions [20], and precipitation nowcasting [28]. Mathieu et al. proposed\nnew loss functions for sharper frame predictions [17]. Prior methods generally reconstruct frames\nfrom the internal state of the model, and some predict the internal state directly, without producing\nimages [23]. Our method instead transforms pixels from previous frames, explicitly modeling motion\nand, in the case of the CDNA and STP models, decomposing it over image segments. We found in\nour experiments that all three of our models produce substantially better predictions by advecting\npixels from the previous frame and compositing them onto the new image, rather than constructing\nimages from scratch. This approach differs from recent work on optic \ufb02ow prediction [25], which\npredicts where pixels will move to using direct optical \ufb02ow supervision. Boots et al. predict future\nimages of a robot arm using nonparametric kernel-based methods [4]. In contrast to this work, our\napproach uses \ufb02exible parametric models, and effectively predicts interactions with objects, including\nobjects not seen during training. To our knowledge, no previous video prediction method has been\napplied to predict real images with novel object interactions beyond two time steps into the future.\n\nThere have been a number of promising methods for frame prediction developed concurrently to\nthis work [16]. Vondrick et al. [24] combine an adversarial objective with a multiscale, feedforward\narchitecture, and use a foreground/background mask similar to the masking scheme proposed here. De\nBrabandere et al. [6] propose a method similar to our DNA model, but use a softmax for sharper \ufb02ow\ndistributions. The probabilistic model proposed by Xue et al. [29] predicts transformations applied to\nlatent feature maps, rather than the image itself, but only demonstrates single frame prediction.\n\nLearning physics: Several works have explicitly addressed prediction of physical interactions,\nincluding predicting ball motion [5], block falling [2], the effects of forces [19, 18], future human\ninteractions [9], and future car trajectories [26]. These methods require ground truth object pose\ninformation, segmentation masks, camera viewpoint, or image patch trackers. 
In the domain of reinforcement learning, model-based methods have been proposed that learn prediction on images [14, 27], but they have either used synthetic images or instance-level models, and have not demonstrated generalization to novel objects nor accurate prediction on real-world videos. As shown by our comparison to LSTM-based prediction designed for Atari frames [20], models that work well on synthetic domains do not necessarily succeed on real images.

[Figure 1 architecture diagram: a stack of 5x5 convolutional LSTM layers processes the 64x64 RGB input down to an 8x8 feature map and back up through stride-2 and deconvolution layers with skip connections; the tiled robot state and action vector are concatenated at the lowest-resolution feature map; 10 5x5 CDNA kernels are produced by a fully connected layer from the smallest middle layer; an 11-channel compositing mask with a channel-wise softmax comes from the final 1x1 convolution; the 10 transformed 64x64x3 images are masked and composited into the RGB prediction.]

Figure 1: Architecture of the CDNA model, one of the three proposed pixel advection models. We use convolutional LSTMs to process the image, outputting 10 normalized transformation kernels from the smallest middle layer of the network and an 11-channel compositing mask from the last layer (including 1 channel for static background). The kernels are applied to transform the previous image into 10 different transformed images, which are then composited according to the masks. The masks sum to 1 at each pixel due to a channel-wise softmax. Yellow arrows denote skip connections.

Video datasets: Existing video datasets capture YouTube clips [12], human motion [10], synthetic video game frames [20], and driving [8]. However, to investigate learning visual physics prediction, we need data that exhibits rich object motion, collisions, and interaction information. We propose a large new dataset consisting of real-world videos of robot-object interactions, including complex physical phenomena, realistic occlusions, and a clear use-case for interactive robot learning.

3 Motion-Focused Predictive Models

In order to learn about object motion while remaining invariant to appearance, we introduce a class of video prediction models that directly use appearance information from previous frames to construct pixel predictions. Our model computes the next frame by first predicting the motions of image segments, then merging these predictions via masking. In this section, we discuss our novel pixel transformation models, and propose how to effectively merge the predicted motion of multiple segments into a single next image prediction. The architecture of the CDNA model is shown in Figure 1. Diagrams of the DNA and STP models are in Appendix B.

3.1 Pixel Transformations for Future Video Prediction

The core of our models is a motion prediction module that predicts objects' motion without attempting to reconstruct their appearance.
This module is therefore partially invariant to appearance and can generalize effectively to previously unseen objects. We propose three motion prediction modules:

Dynamic Neural Advection (DNA): In this approach, we predict a distribution over locations in the previous frame for each pixel in the new frame. The predicted pixel value is computed as an expectation under this distribution. We constrain the pixel movement to a local region, under the regularizing assumption that pixels will not move large distances. This keeps the dimensionality of the prediction low. This approach is the most flexible of the proposed approaches.

Formally, we apply the predicted motion transformation $\hat{m}$ to the previous image prediction $\hat{I}_{t-1}$ for every pixel $(x, y)$ to form the next image prediction $\hat{I}_t$ as follows:

$$\hat{I}_t(x, y) = \sum_{k \in (-\kappa, \kappa)} \sum_{l \in (-\kappa, \kappa)} \hat{m}_{xy}(k, l) \, \hat{I}_{t-1}(x - k, y - l)$$

where $\kappa$ is the spatial extent of the predicted distribution. This can be implemented as a convolution with untied weights. The architecture of this model matches the CDNA model in Figure 1, except that the higher-dimensional transformation parameters $\hat{m}$ are outputted by the last (conv 2) layer instead of the LSTM 5 layer used for the CDNA model.

Convolutional Dynamic Neural Advection (CDNA): Under the assumption that the same mechanisms can be used to predict the motions of different objects in different regions of the image, we consider a more object-centric approach to predicting motion. Instead of predicting a different distribution for each pixel, this model predicts multiple discrete distributions that are each applied to the entire image via a convolution (with tied weights), which computes the expected value of the motion distribution for every pixel. The idea is that pixels on the same rigid object will move together, and therefore can share the same transformation. More formally, one predicted object transformation $\hat{m}$ applied to the previous image $\hat{I}_{t-1}$ produces image $\hat{J}_t$ for each pixel $(x, y)$ as follows:

$$\hat{J}_t(x, y) = \sum_{k \in (-\kappa, \kappa)} \sum_{l \in (-\kappa, \kappa)} \hat{m}(k, l) \, \hat{I}_{t-1}(x - k, y - l)$$

where $\kappa$ is the spatial size of the normalized predicted convolution kernel $\hat{m}$. Multiple transformations $\{\hat{m}^{(i)}\}$ are applied to the previous image $\hat{I}_{t-1}$ to form multiple images $\{\hat{J}^{(i)}_t\}$. These output images are combined into a single prediction $\hat{I}_t$ as described in the next section and shown in Figure 1.

Spatial Transformer Predictors (STP): In this approach, the model produces multiple sets of parameters for 2D affine image transformations, and applies the transformations using a bilinear sampling kernel [11]. More formally, a set of affine parameters $\hat{M}$ produces a warping grid between previous image pixels $(x_{t-1}, y_{t-1})$ and generated image pixels $(x_t, y_t)$:

$$\begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} = \hat{M} \begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}$$

This grid can be applied with a bilinear kernel to form an image $\hat{J}_t$:

$$\hat{J}_t(x_t, y_t) = \sum_k^W \sum_l^H \hat{I}_{t-1}(k, l) \max(0, 1 - |x_{t-1} - k|) \max(0, 1 - |y_{t-1} - l|)$$

where $W$ and $H$ are the image width and height. While this type of operator has been applied previously only to supervised learning tasks, it is well-suited for video prediction. Multiple transformations $\{\hat{M}^{(i)}\}$ are applied to the previous image $\hat{I}_{t-1}$ to form multiple images $\{\hat{J}^{(i)}_t\}$, which are then composited based on the masks. The architecture matches the diagram in Figure 1, but instead of outputting CDNA kernels at the LSTM 5 layer, the model outputs the STP parameters $\{\hat{M}^{(i)}\}$.
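To make the advection step concrete, the following is a minimal NumPy sketch of the CDNA transformation under the definitions above; it is an illustration written for this text, not the released implementation, and the function and variable names are ours. Each normalized κ x κ kernel is swept over the previous frame so that every output pixel is an expectation of nearby input pixels under the predicted motion distribution; the DNA variant differs only in predicting a separate kernel for every pixel location (untied weights). Boundary handling is also illustrative.

```python
import numpy as np

def cdna_transform(prev_image, kernels):
    """Apply CDNA-style pixel advection (illustrative sketch).

    prev_image: (H, W, 3) previous frame.
    kernels:    (N, kappa, kappa) normalized kernels (non-negative, summing to 1).
    Returns N transformed images of shape (N, H, W, 3); each output pixel is a
    weighted average of its kappa x kappa neighborhood in the previous frame.
    """
    H, W, C = prev_image.shape
    N, kappa, _ = kernels.shape
    pad = kappa // 2
    padded = np.pad(prev_image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros((N, H, W, C))
    for n in range(N):
        for y in range(H):
            for x in range(W):
                patch = padded[y:y + kappa, x:x + kappa, :]
                # Expected pixel value under the predicted motion distribution.
                out[n, y, x, :] = np.tensordot(kernels[n], patch,
                                               axes=([0, 1], [0, 1]))
    return out

# The DNA variant predicts one kernel per pixel (shape (H, W, kappa, kappa)),
# i.e. the same operation with untied weights across spatial locations.
```

A real implementation would express this as a batched (depthwise) convolution rather than Python loops; the loops are only for clarity.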
All of these models can focus on learning physics rather than object appearance. Our experiments show that these models are better able to generalize to unseen objects compared to models that reconstruct the pixels directly or predict the difference from the previous frame.

3.2 Composing Object Motion Predictions

CDNA and STP produce multiple object motion predictions, which need to be combined into a single image. The composition of the predicted images $\hat{J}^{(i)}_t$ is modulated by a mask $\Xi$, which defines a weight on each prediction, for each pixel. Thus, $\hat{I}_t = \sum_c \hat{J}^{(c)}_t \circ \Xi_c$, where $c$ denotes the channel of the mask and the element-wise multiplication is over pixels. To obtain the mask, we apply a channel-wise softmax to the final convolutional layer in the model (conv 2 in Figure 1), which ensures that the channels of the mask sum to 1 for each pixel position.

In practice, our experiments show that the CDNA and STP models learn to mask out objects that are moving in consistent directions. The benefit of this approach is two-fold: first, predicted motion transformations are reused for multiple pixels in the image, and second, the model naturally extracts a more object-centric representation in an unsupervised fashion, a desirable property for an agent learning to interact with objects. The DNA model lacks these two benefits, but instead is more flexible as it can produce independent motions for every pixel in the image.

For each model, including DNA, we also include a "background mask" where we allow the models to copy pixels directly from the previous frame. Besides improving performance, this also produces interpretable background masks that we visualize in Section 5. Additionally, to fill in previously occluded regions, which may not be well represented by nearby pixels, we allowed the models to generate pixels from an image, and included it in the final masking step.
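The masking step above can be sketched as follows. This is a minimal illustration under the description in Section 3.2 (a channel-wise softmax mask with one channel per transformed image, plus a background channel and a generated-pixel channel), not the paper's released code; all names are ours, and the exact number of mask channels is an assumption based on the text.

```python
import numpy as np

def composite(transformed, prev_image, generated, mask_logits):
    """Blend candidate images into one prediction with a per-pixel softmax mask.

    transformed: (N, H, W, 3) images produced by the predicted transformations.
    prev_image:  (H, W, 3) static-background candidate copied from the last frame.
    generated:   (H, W, 3) pixels synthesized by the network for occluded regions.
    mask_logits: (H, W, N + 2) unnormalized mask values from the final conv layer.
    """
    # Channel-wise softmax so the mask channels sum to 1 at every pixel.
    logits = mask_logits - mask_logits.max(axis=-1, keepdims=True)
    mask = np.exp(logits)
    mask /= mask.sum(axis=-1, keepdims=True)

    # Stack all candidates: N transformed images + background + generated pixels.
    candidates = np.concatenate(
        [transformed, prev_image[None], generated[None]], axis=0)  # (N+2, H, W, 3)

    # Weighted sum over candidates, i.e. I_t = sum_c J_t^(c) * Xi_c.
    weights = np.moveaxis(mask, -1, 0)[..., None]                  # (N+2, H, W, 1)
    return (weights * candidates).sum(axis=0)                      # (H, W, 3)
```

Taking the softmax across mask channels, rather than spatially, is what makes the compositing weights sum to 1 at every pixel.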
3.3 Action-conditioned Convolutional LSTMs

Most existing physics and video prediction models use feedforward architectures [17, 15] or feedforward encodings of the image [20]. To generate the motion predictions discussed above, we employ stacked convolutional LSTMs [28]. Recurrence through convolutions is a natural fit for multi-step video prediction because it takes advantage of the spatial invariance of image representations, as the laws of physics are mostly consistent across space. As a result, models with convolutional recurrence require significantly fewer parameters and use those parameters more efficiently.

The model architecture is displayed in Figure 1 and detailed in Appendix B. In an interactive setting, the agent's actions and internal state (such as the pose of the robot gripper) influence the next image. We integrate both into our model by spatially tiling the concatenated state and action vector across a feature map, and concatenating the result to the channels of the lowest-dimensional activation map. Note, though, that the agent's internal state (i.e. the robot gripper pose) is only input into the network at the beginning, and must be predicted from the actions in future timesteps. We trained the networks using an l2 reconstruction loss. Alternative losses, such as those presented in [17], could complement this method.

4 Robotic Pushing Dataset

One key application of action-conditioned video prediction is to use the learned model for decision making in vision-based robotic control tasks. Unsupervised learning from video can enable agents to learn about the world on their own, without human involvement, a critical requirement for scaling up interactive learning. In order to investigate action-conditioned video prediction for robotic tasks, we need a dataset with real-world physical object interactions. We collected a new dataset using 10 robotic arms, shown in Figure 2, pushing hundreds of objects in bins, amounting to 57,000 interaction sequences with 1.5 million video frames. Two test sets, each with 1,250 recorded motions, were also collected. The first test set used two different subsets of the objects pushed during training. The second test set involved two subsets of objects, none of which were used during training. In addition to RGB images, we also record the corresponding gripper poses, which we refer to as the internal state, and actions, which correspond to the commanded gripper pose. The dataset is publicly available². Further details on the data collection procedure are provided in Appendix A.

Figure 2: Robot data collection setup (top) and example images captured from the robot's camera (bottom).

5 Experiments

We evaluate our method using the dataset in Section 4, as well as on videos of human motion in the Human3.6M dataset [10]. In both settings, we evaluate our three models described in Section 3, as well as prior models [17, 20]. For CDNA and STP, we used 10 transformers. While we show stills from the predicted videos in the figures, the qualitative results are easiest to compare when the predicted videos can be viewed side-by-side. For this reason, we encourage the reader to examine the video results on the supplemental website². Code for training the model is also available on the website.

Training details: We trained all models using the TensorFlow library [1], optimizing to convergence using ADAM [13] with the suggested hyperparameters. We trained all recurrent models with and without scheduled sampling [3] and report the performance of the model with the best validation error. We found that scheduled sampling improved performance of our models, but did not substantially affect the performance of ablation and baseline models that did not model pixel motion.

²See http://sites.google.com/site/robotprediction

[Figure 3 image: predicted frames at t = 1, 5, 9, 13, 17 for the ground truth (GT), CDNA, FC LSTM, and feedforward multiscale (FF, ms) models, alongside quantitative comparison plots.]

Figure 3: Qualitative and quantitative reconstruction performance of our models, compared with [20, 17]. All models were trained for 8-step prediction, except [17], trained for 1-step prediction.

5.1 Action-conditioned prediction for robotic pushing

Our primary evaluation is on video prediction using our robotic interaction dataset, conditioned on the future actions taken by the robot. In this setting, we pass in two initial images, as well as the initial robot arm state and actions, and then sequentially roll out the model, passing in the future actions and the model's image and state prediction from the previous time step. We trained for 8 future time steps for all recurrent models, and test for up to 18 time steps. We held out 5% of the training set for validation. To quantitatively evaluate the predictions, we measure average PSNR and SSIM, as proposed in [17]. Unlike [17], we measure these metrics on the entire image. We evaluate on two test sets described in Section 4, one with objects seen at training time, and one with previously unseen objects.
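The rollout protocol just described can be written schematically as below. This is an illustrative sketch, not the released training or evaluation code: `model.predict` is a hypothetical single-step interface standing in for any of the three models, consuming the previous frame, the current internal-state estimate, and the commanded action, and returning the predicted next frame and state.

```python
def rollout(model, context_frames, initial_state, future_actions):
    """Action-conditioned multi-step prediction: after the context frames are
    fed in, the model consumes its own image/state predictions together with
    the commanded future actions. All names here are illustrative."""
    frames = list(context_frames)          # e.g. the 2 ground-truth context images
    state = initial_state                  # robot internal state (gripper pose)
    predictions = []
    for action in future_actions:          # e.g. 8 steps at training, 18 at test
        prev_frame = frames[-1]
        next_frame, state = model.predict(prev_frame, state, action)
        predictions.append(next_frame)
        frames.append(next_frame)          # feed the prediction back in
    return predictions

# Example usage (hypothetical): rollout(model, frames[:2], state0, actions[2:20])
# produces an 18-step action-conditioned prediction.
```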
Figure 3 illustrates the performance of our models compared to prior methods. We report the performance of the feedforward multiscale model of [17] using an l1+GDL loss, which was the best performing model in our experiments – full results of the multi-scale models are in Appendix C. Our methods significantly outperform prior video prediction methods on all metrics. The FC LSTM model [20] reconstructs the background and lacks the representational power to reconstruct the objects in the bin. The feedforward multiscale model performs well on 1-step prediction, but performance quickly drops over time, as it is only trained for 1-step prediction. It is worth noting that our models are significantly more parameter-efficient: despite being recurrent, they contain 12.5 million parameters, which is slightly less than the feedforward model with 12.6 million parameters and significantly less than the FC LSTM model, which has 78 million parameters. We found that none of the models suffered from significant overfitting on this dataset. We also report the baseline performance of simply copying the last observed ground truth frame.

In Figure 4, we compare to models with the same stacked convolutional LSTM architecture, but that predict raw pixel values or the difference between previous and current frames. By explicitly modeling pixel motion, our method outperforms these ablations. Note that the model without skip connections is most representative of the model by Xingjian et al. [28]. We show a second ablation in Figure 5, illustrating the benefit of training for longer horizons and of conditioning on the action of the robot. Lastly, we show qualitative results in Figure 6 of changing the action of the arm to examine the model's predictions about possible futures.

Figure 4: Quantitative comparison to models which reconstruct rather than predict motion. Notice that on the novel objects test set, there is a larger gap between models which predict motion and those which reconstruct appearance.

Figure 5: Ablation of DNA without action conditioning, and with different prediction horizons during training.

For all of the models, the prediction quality degrades over time, as uncertainty increases further into the future. We use a mean-squared error objective, which optimizes for the mean pixel values. The model thus encodes uncertainty as blur. Modeling this uncertainty directly through, for example, stochastic neural networks is an interesting direction for future work. Note that prior video prediction methods have largely focused on single-frame prediction, and most have not demonstrated prediction of multiple real-world RGB video frames in sequence. Action-conditioned multi-frame prediction is a crucial ingredient in model-based planning, where the robot could mentally test the outcomes of various actions before picking the best one for a given task.
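For reference, the PSNR metric used in the comparisons above (computed over the entire image, following [17]) has the standard form sketched below. This is a generic definition, not the paper's evaluation code; SSIM is more involved and would typically be taken from an image-processing library rather than reimplemented.

```python
import numpy as np

def psnr(prediction, ground_truth, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and ground-truth frame,
    computed over the entire image (all pixels and channels)."""
    diff = prediction.astype(np.float64) - ground_truth.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)
```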
[Figure 6 image: CDNA predicted frames at t = 1, 3, 5, 7, 9 for three conditions: zero action, the original (1x) action, and 1.5x the original action.]

Figure 6: CDNA predictions from the same starting image, but different future actions, with objects not seen in the training set. By row, the images show the predicted future with zero action (stationary), the original action, and an action 150% larger than the original. Note how the prediction shows no motion with zero action, and with a larger action, predicts more motion, including object motion.

5.2 Human motion prediction

In addition to the action-conditioned prediction, we also evaluate our model on predicting future video without actions. We chose the Human3.6M dataset, which consists of human actors performing various actions in a room. We trained all models on 5 of the human subjects, held out one subject for validation, and held out a different subject for the evaluations presented here. Thus, the models have never seen this particular human subject or any subject wearing the same clothes. We subsampled the video down to 10 fps such that there was noticeable motion in the videos within reasonable time frames. Since the model is no longer conditioned on actions, we fed in 10 video frames and trained the network to produce the next 10 frames, corresponding to 1 second each. Our evaluation measures performance up to 20 timesteps into the future.

The results in Figure 7 show that our motion-predictive models quantitatively outperform prior methods, and qualitatively produce plausible motions for at least 10 timesteps, and start to degrade thereafter. We also show the masks predicted internally by the model for masking out the previous frame, which we refer to as the background mask. These masks illustrate that the model learns to segment the human subject in the image without any explicit supervision.

[Figure 7 image: predicted frames at t = 1, 4, 7, 10, 13 for ground truth (GT) and STP, together with the STP background mask, alongside quantitative comparison plots.]

Figure 7: Quantitative and qualitative results on human motion video predictions with a held-out human subject. All recurrent models were trained for 10 future timesteps.

6 Conclusion & Future Directions

In this work, we develop an action-conditioned video prediction model for interaction that combines appearance information from previous frames with motion predicted by the model. To study unsupervised learning for interaction, we also present a new video dataset with 59,000 real robot interactions and 1.5 million video frames. Our experiments show that, by learning to transform pixels in the initial frame, our model can produce plausible video sequences more than 10 time steps into the future, which corresponds to about one second. In comparisons to prior methods, our method achieves the best results on a number of previously proposed metrics.

Predicting future object motion in the context of a physical interaction is a key building block of an intelligent interactive system. The kind of action-conditioned prediction of future video frames that we demonstrate can allow an interactive agent, such as a robot, to imagine different futures based on the available actions. Such a mechanism can be used to plan for actions to accomplish a particular goal, anticipate possible future problems (e.g.
in the context of an autonomous vehicle),\nand recognize interesting new phenomena in the context of exploration. While our model directly\npredicts the motion of image pixels and naturally groups together pixels that belong to the same object\nand move together, it does not explicitly extract an internal object-centric representation (e.g. as in\n[7]). Learning such a representation would be a promising future direction, particularly for applying\nef\ufb01cient reinforcement learning algorithms that might bene\ufb01t from concise state representations.\n\nAcknowledgments\n\nWe would like to thank Vincent Vanhoucke, Mrinal Kalakrishnan, Jon Barron, Deirdre Quillen, and\nour anonymous reviewers for helpful feedback and discussions. We would also like to thank Peter\nPastor for technical support with the robots.\n\nReferences\n\n[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, et al. Tensor\ufb02ow:\n\nLarge-scale machine learning on heterogeneous systems, 2015. Software from tensor\ufb02ow.org, 2015.\n\n[2] P. W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum. Simulation as an engine of physical scene under-\n\nstanding. Proceedings of the National Academy of Sciences, 110(45), 2013.\n\n8\n\n\f[3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent\n\nneural networks. In Advances in Neural Information Processing Systems, pages 1171\u20131179, 2015.\n\n[4] B. Boots, A. Byravan, and D. Fox. Learning predictive models of a depth camera & manipulator from raw\n\nexecution traces. In International Conference on Robotics and Automation (ICRA), 2014.\n\n[5] M. A. Brubaker, L. Sigal, and D. J. Fleet. Estimating contact dynamics. In International Conference on\n\nComputer Vision (ICCV), 2009.\n\n[6] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic \ufb01lter networks. In Neural Information\n\nProcessing Systems (NIPS). 2016.\n\n[7] S. Eslami, N. Heess, T. Weber, Y. Tassa, K. Kavukcuoglu, and G. E. Hinton. Attend, infer, repeat: Fast\n\nscene understanding with generative models. Neural Information Processing Systems (NIPS), 2016.\n\n[8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International\n\nJournal of Robotics Research (IJRR), 2013.\n\n[9] D.-A. Huang and K. M. Kitani. Action-reaction: Forecasting the dynamics of human interaction. In\n\nEuropean Conference on Computer Vision (ECCV). Springer, 2014.\n\n[10] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive\n\nmethods for 3d human sensing in natural environments. PAMI, 36(7), 2014.\n\n[11] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Neural Information\n\nProcessing Systems (NIPS), 2015.\n\n[12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classi\ufb01cation\n\nwith convolutional neural networks. In Compuer Vision and Pattern Recognition (CVPR), 2014.\n\n[13] D. Kinga and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning\n\nRepresentations (ICLR), 2015.\n\n[14] S. Lange, M. Riedmiller, and A. Voigtlander. Autonomous reinforcement learning on raw visual input data\n\nin a real world application. In International Joint Conference on Neural Networks (IJCNN), 2012.\n\n[15] A. Lerer, S. Gross, and R. Fergus. Learning physical intuition of block towers by example. 
International\n\nConference on Machine Learning (ICML), 2016.\n\n[16] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised\n\nlearning. arXiv preprint arXiv:1605.08104, 2016.\n\n[17] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error.\n\nInternational Conference on Learning Representations (ICLR), 2016.\n\n[18] R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi. Newtonian image understanding: Unfolding\n\nthe dynamics of objects in static images. Computer Vision and Pattern Recognition (CVPR), 2015.\n\n[19] R. Mottaghi, M. Rastegari, A. Gupta, and A. Farhadi. \"What happens if...\" learning to predict the effect of\n\nforces in images. European Conference on Computer Vision (ECCV), 2016.\n\n[20] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks\n\nin atari games. In Neural Information Processing Systems (NIPS), 2015.\n\n[21] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a\n\nbaseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.\n\n[22] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using\n\nlstms. International Conference on Machine Learning (ICML), 2015.\n\n[23] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating the future by watching unlabeled video. CoRR,\n\nabs/1504.08023, 2015.\n\n[24] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Neural Information\n\nProcessing Systems (NIPS). 2016.\n\n[25] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using\n\nvariational autoencoders. In European Conference on Computer Vision (ECCV), 2016.\n\n[26] J. Walker, A. Gupta, and M. Hebert. Patch to the future: Unsupervised visual prediction. In Computer\n\nVision and Pattern Recognition (CVPR), 2014.\n\n[27] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent\n\ndynamics model for control from raw images. In Neural Information Processing Systems (NIPS), 2015.\n\n[28] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo. Convolutional LSTM network: A\nmachine learning approach for precipitation nowcasting. In Neural Information Processing Systems, 2015.\n\n[29] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual dynamics: Probabilistic future frame synthesis\n\nvia cross convolutional networks. In Neural Information Processing Systems (NIPS). 2016.\n\n[30] J. Yuen and A. Torralba. A data-driven approach for event prediction.\n\nIn European Conference on\n\nComputer Vision (ECCV), 2010.\n\n9\n\n\f", "award": [], "sourceid": 46, "authors": [{"given_name": "Chelsea", "family_name": "Finn", "institution": "Google"}, {"given_name": "Ian", "family_name": "Goodfellow", "institution": "Google"}, {"given_name": "Sergey", "family_name": "Levine", "institution": "University of Washington"}]}