{"title": "High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 81, "page_last": 91, "abstract": "Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex inductive biases inside network architectures with highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: finding minimal inductive bias for video prediction while maximizing network capacity. We investigate this question by performing the first large-scale empirical study and demonstrate state-of-the-art performance by learning large models on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling car driving.", "full_text": "High Fidelity Video Prediction with\n\nLarge Stochastic Recurrent Neural Networks\n\nRuben Villegas1,4 Arkanath Pathak3 Harini Kannan2\n\nDumitru Erhan2 Quoc V. Le2 Honglak Lee2\n\n1 University of Michigan\n\n2 Google Research\n\n3 Google\n\n4 Adobe Research\n\nAbstract\n\nPredicting future video frames is extremely challenging, as there are many factors of\nvariation that make up the dynamics of how frames change through time. Previously\nproposed solutions require complex inductive biases inside network architectures\nwith highly specialized computation, including segmentation masks, optical \ufb02ow,\nand foreground and background separation. In this work, we question if such\nhandcrafted architectures are necessary and instead propose a different approach:\n\ufb01nding minimal inductive bias for video prediction while maximizing network\ncapacity. 
We investigate this question by performing the first large-scale empirical study and demonstrate state-of-the-art performance by learning large models on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling car driving1.\n\n1 Introduction\nFrom throwing a ball to driving a car, humans are very good at interacting with objects in the world and anticipating the results of their actions. Teaching agents to do the same has enormous possibilities for training intelligent agents capable of generalizing to many tasks. Model-based reinforcement learning is one such technique that seeks to do this – by first learning a model of the world, and then by planning with the learned model. There has been some recent success with training agents in this manner by first using video prediction to model the world. In particular, video prediction models combined with simple planning algorithms [Hafner et al., 2019] or policy-based learning [Kaiser et al., 2019] for model-based reinforcement learning have been shown to perform as well as or better than model-free methods with far fewer interactions with the environment. Additionally, Ebert et al. [2018] showed that video prediction methods are also useful for robotic control, especially with regard to specifying unstructured goal positions.\nHowever, training an agent to accurately predict what will happen next is still an open problem. Video prediction, the task of generating future frames given context frames, is notoriously hard. There are many spatio-temporal factors of variation present in videos that make this problem very difficult for neural networks to model.
Many methods have been proposed to tackle this problem [Oh et al., 2015,\nFinn et al., 2016, Vondrick et al., 2016, Villegas et al., 2017a, Lotter et al., 2017, Tulyakov et al.,\n2018, Liang et al., 2017, Denton and Birodkar, 2017, Wichers et al., 2018, Babaeizadeh et al., 2018,\nDenton and Fergus, 2018, Lee et al., 2018, Byeon et al., 2018, Yan et al., 2018, Kumar et al., 2018].\nMost of these works propose some type of separation of information streams (e.g., motion/pose and\ncontent streams), specialized computations (e.g., warping, optical \ufb02ow, foreground/background masks,\npredictive coding, etc), additional high-level information (e.g., landmarks, semantic segmentation\nmasks, etc) or are simply shown to work in relatively simpler environments (e.g., Atari, synthetic\nshapes, centered human faces and bodies, etc).\n\n1This work was done while the \ufb01rst author was an intern at Google\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fSimply making neural networks larger has been shown to improve performance in many areas such as\nimage classi\ufb01cation [Real et al., 2018, Zoph et al., 2018, Huang et al., 2018], image generation [Brock\net al., 2019], and language understanding [Devlin et al., 2018, Radford et al., 2019], amongst others.\nParticularly, Brock et al. [2019] recently showed that increasing the capacity of GANs [Goodfellow\net al., 2014] results in dramatic improvements for image generation.\nIn his blog post \"The Bitter Lesson\", Rich Sutton comments on these types of developments by\narguing that the most signi\ufb01cant breakthroughs in machine learning have come from increasing the\ncompute provided to simple models, rather than from specialized, handcrafted architectures [Sutton,\n2019]. For example, he explains that the early specialized algorithms of computer vision (edge\ndetection, SIFT features, etc.) gave way to larger but simpler convolutional neural networks. 
In this work, we seek to answer a similar question: do we really need specialized architectures for video prediction? Or is it sufficient to maximize network capacity on models with minimal inductive bias?\nTo that end, we perform the first large-scale empirical study of the effects of minimal inductive bias and maximal capacity on video prediction. We show that without the need for optical flow, segmentation masks, adversarial losses, landmarks, or any other form of inductive bias, it is possible to generate high quality video by simply increasing the scale of computation. Overall, our experiments demonstrate that: (1) large models with minimal inductive bias tend to improve performance both qualitatively and quantitatively, (2) recurrent models outperform non-recurrent models, and (3) stochastic models perform better than non-stochastic models, especially in the presence of uncertainty (e.g., videos with unknown action or control).\n2 Related Work\nThe task of predicting multiple frames into the future has been studied for several years. Initially, many early methods tried to simply predict future frames in small videos or patches from large videos [Michalski et al., 2014, Ranzato et al., 2014, Srivastava et al., 2015]. This type of video prediction caused rectangular-shaped artifacts when attempting to fuse the predicted patches, since each predicted patch was blind to its surroundings. Then, action-conditioned video prediction models were built with the aim of being used for model-based reinforcement learning [Oh et al., 2015, Finn et al., 2016]. Later, video prediction models started becoming more complex and better at predicting future frames. Lotter et al. [2017] proposed a neural network based on predictive coding. Villegas et al. [2017a] proposed to separate motion and content streams in video input.
Villegas et al. [2017b] proposed to predict future video as landmarks in the future and then use these landmarks to generate frames. Denton and Birodkar [2017] proposed pose and content encoders as separate information streams. However, all of these methods focused on predicting a single future. Unfortunately, real-world video is highly stochastic – that is, there are multiple possible futures given a single past.\nMany methods focusing on the stochastic nature of real-world videos have recently been proposed. Babaeizadeh et al. [2018] build on the optical flow method proposed by Finn et al. [2016] by introducing a variational approach to video prediction where the entire future is encoded into a posterior distribution that is used to sample latent variables. Lee et al. [2018] also build on optical flow and propose an adversarial version of stochastic video prediction where two discriminator networks are used to enable sharper frame prediction. Denton and Fergus [2018] also propose a similar variational approach. In their method, the latent variables are sampled from a prior distribution of the future during inference time, and only frames up to the current time step are used to model the future posterior distribution. Kumar et al. [2018] propose a method based on normalizing flows where the exact log-likelihood can be computed for training.\nIn this work, we investigate whether we can achieve high quality video prediction without the use of the previously mentioned techniques (optical flow, adversarial objectives, etc.) by just maximizing the capacity of a standard neural network. To the best of our knowledge, this work is the first to perform a thorough investigation of the effect of capacity increases on video prediction.\n3 Scaling up video prediction\nIn this section, we present our method for scaling up video prediction networks.
We first consider the Stochastic Video Generation (SVG) architecture presented in Denton and Fergus [2018], a stochastic video prediction model that is entirely made up of standard neural network layers without any special computations (e.g., optical flow). SVG is competitive with other state-of-the-art stochastic video prediction models (SAVP, SV2P) [Lee et al., 2018]; however, unlike SAVP and SV2P, it does not use optical flow, adversarial losses, etc. As such, SVG was a fitting starting point for our investigation.\nTo build our baseline model, we start with the stochastic component that models the inherent uncertainty in future predictions from Denton and Fergus [2018]. We also use shallower encoder-decoders that only have convolutional layers to enable more detailed image reconstruction [Dosovitskiy and Brox, 2016]. A slightly shallower encoder-decoder architecture results in less information lost in the latent state, as the resulting convolutional map from the bottlenecked layers is larger. Then, in contrast to Denton and Fergus [2018], we use a convolutional LSTM architecture, instead of a fully-connected LSTM, to fit the shallow encoder-decoders. Finally, the last difference is that we optimize the l1 loss with respect to the ground-truth frame for all models, as in the SAVP model, instead of using l2 as in SVG. Lee et al. [2018] showed that l1 encourages sharper frame prediction than l2.\nWe optimize our baseline architecture by maximizing the following variational lower bound:\n\n∑_{t=1}^{T} E_{q(z_{≤T}|x_{≤T})}[log p_θ(x_t|z_{≤t}, x_{<t})] − β D_KL(q(z_t|x_{≤t}) || p_ψ(z_t|x_{<t})),\n\nwhere x_t is the frame at time step t, q(z_{≤T}|x_{≤T}) is the approximate posterior distribution, p_ψ(z_t|x_{<t}) is the prior distribution, p_θ(x_t|z_{≤t}, x_{<t}) is the generative distribution, and β regulates the strength of the KL term in the lower bound.
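A minimal sketch (NumPy; illustrative only, not the paper's implementation) of the two ingredients needed to estimate this bound: the closed-form KL divergence between the diagonal-Gaussian posterior and prior, and the reparameterized sampling of z_t used for the expectation term:

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians,
    the regularizer in the variational lower bound."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def reparameterized_sample(mu, logvar, rng):
    """Sample z ~ N(mu, diag(sigma^2)) via z = mu + sigma * eps,
    keeping the sampling step differentiable with respect to mu, sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```

The KL term is zero exactly when posterior and prior match, which is what β trades off against reconstruction quality in the bound above.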
During training time, the frame prediction process at time step t is as follows:\n\nμ_φ(t), σ_φ(t) = LSTM_φ(h_t; M), where h_t = f_enc(x_t; K),\nz_t ∼ N(μ_φ(t), σ_φ(t)),\ng_t = LSTM_θ(h_{t−1}, z_t; M), where h_{t−1} = f_enc(x_{t−1}; K),\nx̂_t = f_dec(g_t; K),\n\nwhere f_enc is an image encoder and f_dec is an image decoder neural network. LSTM_φ and LSTM_θ are LSTMs modeling the posterior and generative distributions, respectively. μ_φ(t) and σ_φ(t) are the parameters of the posterior distribution modeling the Gaussian latent code z_t. Finally, x̂_t is the predicted frame at time step t.\nTo increase the capacity of our baseline model, we use hyperparameters K and M, which denote the factors by which the number of neurons in each layer of the encoder, decoder, and LSTMs is increased. For example, if the number of neurons in an LSTM layer is d, then we scale up to d × M. The same applies to the encoder and decoder networks but using K as the factor. In our experiments we increase both K and M together until we reach the device limits. Because the LSTM has more parameters, we stop increasing the capacity of the LSTM at M = 3 but continue to increase K up to 5. At test time, the same process is followed; however, the posterior distribution is replaced by the Gaussian parameters computed by the prior distribution:\n\nμ_ψ(t), σ_ψ(t) = LSTM_ψ(h_{t−1}; M), where h_{t−1} = f_enc(x_{t−1}; K).\n\nNext, we perform ablative studies on our baseline architecture to better quantify exactly how much each individual component affects the quality of video prediction as capacity increases. First, we remove the stochastic component, leaving behind a fully deterministic architecture with just a CNN-based encoder-decoder and a convolutional LSTM. For this version, we simply disable the prior and posterior networks as described above. Finally, we remove the LSTM component, leaving behind only the encoder-decoder CNN architectures.
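The per-step process above (posterior at training time, prior at test time) can be sketched with toy stand-ins (a hypothetical NumPy setup in which plain linear maps replace the convolutional encoder/decoder and the three LSTMs, and K and M only scale layer widths as described):

```python
import numpy as np

class ToySVG:
    """Illustrative stand-in for the per-step prediction process.

    Linear maps replace the convolutional encoder (f_enc), decoder
    (f_dec), and the posterior (phi), prior (psi), and generative
    (theta) LSTMs. K scales encoder/decoder widths and M scales
    LSTM widths, mirroring how capacity is grown in the paper.
    """

    def __init__(self, frame_dim=16, base_enc=8, base_lstm=8, z_dim=4,
                 K=2, M=3, seed=0):
        rng = np.random.default_rng(seed)
        h, g = base_enc * K, base_lstm * M
        self.W_enc = 0.1 * rng.standard_normal((frame_dim, h))    # f_enc
        self.W_post = 0.1 * rng.standard_normal((h, 2 * z_dim))   # LSTM_phi
        self.W_prior = 0.1 * rng.standard_normal((h, 2 * z_dim))  # LSTM_psi
        self.W_gen = 0.1 * rng.standard_normal((h + z_dim, g))    # LSTM_theta
        self.W_dec = 0.1 * rng.standard_normal((g, frame_dim))    # f_dec
        self.z_dim = z_dim

    def step(self, x_prev, x_t=None, rng=None):
        """Predict x_hat_t from x_{t-1}; use the posterior over z_t when
        the target frame x_t is given (training), else the prior (test)."""
        rng = rng if rng is not None else np.random.default_rng(1)
        h_prev = np.tanh(x_prev @ self.W_enc)
        if x_t is not None:
            stats = np.tanh(x_t @ self.W_enc) @ self.W_post
        else:
            stats = h_prev @ self.W_prior
        mu, logvar = stats[: self.z_dim], stats[self.z_dim:]
        z_t = mu + np.exp(0.5 * logvar) * rng.standard_normal(self.z_dim)
        g_t = np.tanh(np.concatenate([h_prev, z_t]) @ self.W_gen)
        return g_t @ self.W_dec
```

The ablations correspond to fixing z_t to zero (no stochastic component) or dropping the recurrent state entirely (CNN-only encoder-decoder).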
For the CNN-only version, we simply use f_enc and f_dec as the full video prediction network. However, we let f_enc observe the same number of initial context frames as the recurrent counterparts.\nDetails of the devices we use to scale up computation can be found in the supplementary material.\n4 Experiments\nIn this section, we evaluate our method on three different datasets, each with different challenges.\nObject interactions. We use the action-conditioned towel pick dataset from Ebert et al. [2018] to evaluate how our models perform with standard object interactions. This dataset contains a robot arm that is interacting with towel objects. Even though this dataset uses action-conditioning, stochastic video prediction is still required for this task. This is because the motion of the objects is not fully determined by the actions (the movements of the robot arm), but also includes factors such as friction and the objects' current state. For this dataset, we resize the original resolution of 48x64 to 64x64. For evaluation, we use the first 256 videos in the test set as defined by Ebert et al. [2018].\n\nDataset | CNN Biggest (M=3, K=5) | CNN Baseline (M=1, K=1) | LSTM Biggest (M=3, K=5) | LSTM Baseline (M=1, K=1) | SVG' Biggest (M=3, K=5) | SVG' Baseline (M=1, K=1)\nTowel Pick | 199.81 | 281.07 | 100.04 | 206.49 | 93.71 | 189.91\nHuman 3.6M | 1321.23 | 1077.55 | 458.77 | 614.21 | 429.88 | 682.08\nKITTI | 2414.64 | 2906.71 | 1159.25 | 2502.69 | 1217.25 | 2264.91\n\nTable 1: Fréchet Video Distance evaluation (lower is better). We compare the biggest model we were able to train (M=3, K=5) against the baseline models (M=1, K=1) for all model types (SVG', CNN, and LSTM). The biggest recurrent models are significantly better than their small counterparts. Please refer to our supplementary material for plots showing how gradually increasing model capacity results in better performance.\nStructured motion.
We use the Human 3.6M dataset [Ionescu et al., 2014] to measure the ability of\nour models to predict structured motion. This dataset is comprised of humans performing actions\ninside a room (walking around, sitting on a chair, etc.). Human motion is highly structured (i.e., many\ndegrees of freedom), and so, it is dif\ufb01cult to model. We use the train/test split from Villegas et al.\n[2017b]. For this dataset, we resize the original resolution of 1000x1000 to 64x64.\nPartial observability. We use the KITTI driving dataset [Geiger et al., 2013] to measure how our\nmodels perform in conditions of partial observability. This dataset contains driving scenes taken from\na front camera view of a car driving in the city, residential neighborhoods, and on the road. The front\nview camera of the vehicle causes partial observability of the vehicle environment, which requires a\nmodel to generate seen and unseen areas when predicting future frames. We use the train/test split\nfrom Lotter et al. [2017] in our experiments. We extract 30 frame clips and skip every 5 frames from\nthe test set so that the test videos do not signi\ufb01cantly overlap, which gives us 148 test clips in the end.\nFor this dataset, we resize the original resolution of 128x160 to 64x64.\n4.1 Evaluation metrics\nWe perform a rigorous evaluation using \ufb01ve different metrics: Peak Signal-to-Noise Ratio (PSNR),\nStructural Similarity (SSIM), VGG Cosine Similarity, Fr\u00e9chet Video Distance (FVD) [Unterthiner\net al., 2018], and human evaluation from Amazon Mechanical Turk (AMT) workers. 
We perform these evaluations on all models described in Section 3: our baseline (denoted as SVG'), the recurrent deterministic model (denoted as LSTM), and the encoder-decoder CNN model (denoted as CNN). In addition, we present a study comparing the video prediction performance as a result of using skip-connections from every layer of the encoder to every layer of the decoder versus not using skip connections at all (Supplementary A.3), and the effects of the number of context frames (Supplementary A.4).\n4.1.1 Frame-wise evaluation\nWe use three different metrics to perform frame-wise evaluation: PSNR, SSIM, and VGG cosine similarity. PSNR and SSIM perform a pixel-wise comparison between the predicted frames and the ground-truth frames, effectively measuring whether the exact pixels have been generated. VGG cosine similarity has been used in prior work [Lee et al., 2018] to compare frames at a perceptual level. VGGnet [Simonyan and Zisserman, 2015] is used to extract features from the predicted and ground-truth frames, and cosine similarity is computed at the feature level. Similar to Kumar et al. [2018], Babaeizadeh et al. [2018], Lee et al. [2018], we sample 100 future trajectories per video and pick the highest scoring trajectory as the main score.\n4.1.2 Dynamics-based evaluation\nWe use two different metrics to measure the overall realism of the generated videos: FVD and human evaluations. FVD, a recently proposed metric for video dynamics accuracy, uses a 3D CNN trained for video classification to extract a single feature vector from a video. Analogous to the well-known FID [Heusel et al., 2017], it compares the distribution of features extracted from ground-truth videos and generated videos.
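Minimal NumPy sketches of these two automated comparisons. The array shapes are assumptions for illustration; the actual metrics use VGG and 3D-CNN features, and real FVD/FID use full covariance matrices rather than the diagonal simplification below:

```python
import numpy as np

def best_of_k_cosine(gt_feats, sample_feats):
    """Best-of-K frame-wise evaluation: score each sampled trajectory by
    its mean per-frame cosine similarity to the ground truth and keep the
    best one. gt_feats: (T, D); sample_feats: (K, T, D)."""
    gt = gt_feats / np.linalg.norm(gt_feats, axis=-1, keepdims=True)
    s = sample_feats / np.linalg.norm(sample_feats, axis=-1, keepdims=True)
    scores = (s * gt).sum(axis=-1).mean(axis=-1)   # (K,) mean cosine per sample
    best = int(np.argmax(scores))
    return best, float(scores[best])

def frechet_distance_diag(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets of
    shape (N, D), simplified to diagonal covariances for illustration."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sd_a, sd_b = feats_a.std(axis=0), feats_b.std(axis=0)
    return float(((mu_a - mu_b) ** 2).sum() + ((sd_a - sd_b) ** 2).sum())
```

Note the asymmetry in how the 100 samples are used: the frame-wise metrics keep only the best trajectory, while the Fréchet comparison pools all samples into one feature distribution.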
Intuitively, this metric compares the quality of the overall predicted video dynamics with that of the ground-truth videos rather than making a per-frame comparison. For FVD, we also sample 100 future trajectories per video, but in contrast, all 100 trajectories are used in this evaluation metric (i.e., not just the max, as we did for VGG cosine similarity).\n\nDataset | LSTM Biggest (M=3, K=5) | LSTM Baseline (M=1, K=1) | About the same | SVG' Biggest (M=3, K=5) | SVG' Baseline (M=1, K=1) | About the same\nTowel Pick | 90.2% | 9.0% | 0.8% | 68.8% | 25.8% | 5.5%\nHuman 3.6M | 98.7% | 1.3% | 0.0% | 95.8% | 3.4% | 0.8%\nKITTI | 99.3% | 0.7% | 0.0% | 99.3% | 0.7% | 0.0%\n\nTable 2: Amazon Mechanical Turk human worker preference. We compared the biggest and baseline models from LSTM and SVG'. The bigger models are more frequently preferred by humans. We present a full comparison for all large models in Supplementary A.5.\n\nWe also use Amazon Mechanical Turk (AMT) workers to perform human evaluations. The workers are presented with two videos (baseline and largest models) and asked to either select the more realistic video or mark that they look about the same. We choose the videos for both models by selecting the highest scoring videos in terms of the VGG cosine similarity with respect to the ground truth. We use 10 unique workers per video and choose the selection with the most votes as the final answer. Finally, we also show qualitative evaluations on pairs of videos, also selected by using the highest VGG cosine similarity scores for both the baseline and the largest model. We run the human-perception-based evaluation on the best two architectures we scale up.\n4.2 Robot arm\nFor this dataset, we perform action-conditioned video prediction. We modify the baseline and large models to take in actions as additional input to the video prediction model.
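One common way to wire actions into such a model (an assumed illustration; the paper only states that actions become an additional input) is to tile the action vector over the spatial grid and concatenate it to the convolutional feature map as extra channels:

```python
import numpy as np

def concat_action(feature_map, action):
    """Tile an action vector over the spatial grid and concatenate it as
    extra channels (a hypothetical conditioning scheme, not necessarily
    the paper's exact mechanism).
    feature_map: (H, W, C); action: (A,) -> (H, W, C + A)."""
    h, w, _ = feature_map.shape
    tiled = np.broadcast_to(action, (h, w, action.shape[0]))
    return np.concatenate([feature_map, tiled], axis=-1)
```

This keeps the rest of the architecture unchanged: the recurrent core simply sees a wider feature map whose extra channels carry the commanded robot action at each step.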
Action conditioning\ndoes not take away the inherent stochastic nature of video prediction due to the dynamics of the\nenvironment. During training time, the models are conditioned on 2 input frames and predict 10\nframes into the future. During test time, the models predict 18 frames into the future.\nDynamics-based evaluation. We \ufb01rst evaluate the action-conditioned video prediction models\nusing FVD to measure the realism in the dynamics. In Table 1 (top row), we present the results of\nscaling up the three models described in Section 3. Firstly, we see that our baseline architecture\nimproves dramatically at the largest capacity we were able to train. Secondly, for our ablative\nexperiments, we notice that larger capacity improves the performance of the vanilla CNN architecture.\nInterestingly, by increasing the capacity of the CNN architecture, it approaches the performance\nof the baseline SVG\u2019 architecture. However, as capacity increases, the lack of recurrence heavily\naffects the performance of the vanilla CNN architecture in comparison with the models that do have\nan LSTM (Supplementary A.2.1). Both the LSTM model and SVG\u2019 perform similarly well, with\nSVG\u2019 model performing slightly better. This makes sense as the deterministic LSTM model is more\nlikely to produce videos closer to the ground truth; however, the stochastic component is still quite\nimportant as a good video prediction model must be both realistic and capable of handling multiple\npossible futures. Finally, we use human evaluations through Amazon Mechanical Turk to compare\nour biggest models with the corresponding baselines. We asked workers to focus on how realistic\nthe interaction between the robot arm and objects looks. 
As shown in Table 2, the largest SVG' is preferred 68.8% of the time vs 25.8% of the time for the baseline (right), and the largest LSTM model is preferred 90.2% of the time vs 9.0% of the time for the baseline (left).\nFrame-wise evaluation. Next, we use FVD to select the best models from CNN, LSTM, and SVG', and perform frame-wise evaluation on each of these three models. Since models that copy background pixels perfectly can perform well on these frame-wise evaluation metrics, in the supplementary material we also discuss a comparison against a simple baseline where the last observed frame is copied through time. From Figure 1, we can see that the CNN model performs much worse than the models that have recurrent connections. This is a clear indication that recurrence is necessary to predict future frames, and capacity cannot make up for it. Both LSTM and SVG' perform similarly well; however, towards the end, SVG' slightly outperforms LSTM. The full evaluation on all capacities for SVG', LSTM, and CNN is presented in the supplementary material.\nFigure 1: Towel pick per-frame evaluation (higher is better). We compare the best performing models in terms of FVD. For model capacity comparisons, please refer to Supplementary A.2.1.\nQualitative evaluation. In Figure 2, we show example videos from the smallest SVG' model, the largest SVG' model, and the ground truth. The predictions from the small baseline model are blurrier compared to those of the largest model, while the edges of objects in the larger model's predictions stay continuously sharp throughout the entire video. This is clear evidence that increasing the model capacity enables more accurate modeling of the pick-up dynamics.
For more videos, please visit our website https://cutt.ly/QGuCex.\n\n[Figure 2 image: frames at t=5, 6, 9, 12, 15, 18, 20; rows: Ground truth, Biggest model (Ours), Smallest model (Baseline).]\nFigure 2: Robot towel pick qualitative evaluation. Our highest capacity model (middle row) produces better modeling of the robot arm dynamics, as well as object interactions. The baseline model (bottom row) fails at modeling the objects (object blurriness), and the robot arm dynamics are not well modeled (the gripper is open when it should be closed at t=18). For best viewing and more results, please visit our website https://cutt.ly/QGuCex.\n\n4.3 Human activities\nFor this dataset, we perform action-free video prediction. We use a single model to predict all action sequences in the Human 3.6M dataset. During training time, the models are conditioned on 5 input frames and predict 10 frames into the future. At test time, the models predict 25 frames.\nDynamics-based evaluation. We evaluate the predicted human motion with FVD (Table 1, middle row). The performance of the CNN model is poor on this dataset, and increasing the capacity of the CNN does not lead to any increase in performance. We hypothesize that this is because the lack of action conditioning and the many degrees of freedom in human motion make it very difficult to model with a simple encoder-decoder CNN. However, after adding recurrence, both LSTM and SVG' perform significantly better, and both models' performance becomes better as their capacity is increased (Supplementary A.2.2). Similar to Section 4.2, we see that SVG' performs better than LSTM. This is again likely due to the ability to sample multiple futures, leading to a higher probability of matching the ground truth future.
Secondly, in our human evaluations for SVG', 95.8% of the AMT workers agree that the bigger model has more realistic videos in comparison to the smaller model, and for LSTM, 98.7% of the workers agree that the largest LSTM model is more realistic. Our results, especially the strong agreement from our human evaluations, show that high capacity models are better equipped to handle the complex structured dynamics in human videos.\n\nFigure 3: Human 3.6M per-frame evaluation (higher is better). We compare the best performing models in terms of FVD. For model capacity comparisons, please refer to Supplementary A.2.2.\n\n[Figure 4 image: frames at t=8, 11, 14, 17, 20, 23, 26; rows: Ground truth, Biggest model (Ours), Smallest model (Baseline).]\nFigure 4: Human 3.6M qualitative evaluation. Our highest capacity model (middle) produces better modeling of the human dynamics. The baseline model (bottom) is able to keep the human dynamics to some degree, but often the human shape is unrecognizable or constantly vanishing and reappearing. For more videos, please visit our website https://cutt.ly/QGuCex.\n\nFrame-wise evaluation. Similar to the previous per-frame evaluation, we select the best performing models in terms of FVD and perform a frame-wise evaluation. In Figure 3, we can see that the CNN-based model performs poorly against the LSTM and SVG' baselines. The recurrent connections in LSTM and SVG' are necessary to be able to identify the human structure and the action being performed in the input frames. In contrast to Section 4.2, there are no action inputs to guide the video prediction, which significantly affects the CNN baseline. The LSTM and SVG' networks perform similarly at the beginning of the video, while SVG' outperforms LSTM in the last time steps.
This is a result of SVG' being able to model multiple futures, from which we pick the best future for evaluation as described in Section 4.1. We present the full evaluation on all capacities for SVG', LSTM, and CNN in the supplementary material.\nQualitative evaluation. Figure 4 shows a comparison between the smallest and largest stochastic models. In the video generated by the smallest model, the shape of the human is not well-defined at all, while the largest model is able to clearly depict the arms and the legs of the human. Moreover, our large model is able to successfully predict the human's movement throughout all of the frames into the future. The predicted motion is close to the ground-truth motion, providing evidence that being able to model more factors of variation with larger capacity models can enable accurate motion identification and prediction. For more videos, please visit our website https://cutt.ly/QGuCex.\n\n4.4 Car driving\nFor this dataset, we also perform action-free video prediction. During training time, the models are conditioned on 5 input frames and predict 10 frames into the future. At test time, the models predict 25 frames into the future. This video type is the most difficult to predict since it requires the model to be able to hallucinate unseen parts of the video given the observed parts.\n\nFigure 5: KITTI driving per-frame evaluation (higher is better). For model capacity comparisons, please refer to Supplementary A.2.3.\n\n[Figure 6 image; rows: Ground truth, Biggest model (Ours), Smallest model (Baseline).]\nFigure 6: KITTI driving qualitative evaluation. Our highest capacity model (middle) is able to maintain the observed dynamics of driving forward and is able to generate unseen street lines and the moving background.
The baseline (bottom) loses the street lines and the background becomes blurry. For best viewing and more results, please visit our website https://cutt.ly/QGuCex.\n\nDynamics-based evaluation. We see very similar results to the previous dataset when measuring the realism of the videos. For both LSTM and SVG', we see a large improvement in FVD when comparing the baseline model to the largest model we were able to train (Table 1, bottom row). However, we see similarly poor performance for the CNN architecture as in Section 4.3, where capacity does not help. One interesting thing to note is that the largest LSTM model performs better than the largest SVG' model. This is likely related to the architecture design and the data itself. The movement of driving cars is mostly predictable, and so the deterministic architecture becomes highly competitive as we increase the model capacity (Supplementary A.2.3). However, our original premise that increasing model capacity improves network performance still holds. Finally, for human evaluations, we see in Table 2 that the largest capacity SVG' model (right) is preferred by human raters 99.3% of the time, and the largest capacity LSTM model (left) is also preferred by human raters 99.3% of the time.\nFrame-wise evaluation. When we evaluate based on frame-wise accuracy, we see similar, but not exactly the same, behavior as in the experiments of Section 4.3. The CNN architecture performs poorly as expected; however, LSTM and SVG' perform similarly well.\nQualitative evaluation. In Figure 6, we show a comparison between the largest stochastic model and its baseline. The baseline model starts becoming blurry as the predictions move forward in the future, and important features like the lane markings disappear.
However, our biggest capacity model makes very sharp predictions that look realistic in comparison to the ground truth.\n\n5 Higher resolution videos\n\nFinally, we experiment with higher resolution videos. We train SVG' on the Human 3.6M and KITTI driving datasets. These two datasets contain much larger resolution images compared to the Towel pick dataset, enabling us to sub-sample frames to twice the resolution of previous experiments (128x128). We follow the same protocol for the number of input and predicted time steps during training (5 inputs and 10 predictions), and the same protocol for testing (5 inputs and 25 predictions). In contrast to the networks used in the previous experiments, we add three more convolutional layers plus pooling to subsample the input to the same convolutional encoder output resolution as in previous experiments.\nIn Figure 7 we show qualitative results comparing the smallest (baseline) and biggest (Ours) networks. The biggest network we were able to train had a configuration of M=3 and K=3. Higher resolution videos contain more details about the pixel dynamics observed in the frames. This gives the models a denser signal, and so the generated videos become more difficult to distinguish from real videos. Therefore, this result suggests that besides training better and bigger models, we should also move towards larger resolutions.
For more examples of videos, please visit our website: https://cutt.ly/QGuCex.

[Figure 7 shows predicted frames t=8, 11, 14, 17, 20, 23, 26; for each dataset, rows compare Ground truth, Biggest model (Ours), and Smallest model (Baseline).]

Figure 7: Human 3.6M and KITTI driving qualitative evaluation on high resolution videos (frame size of 128x128) with comparison between smallest model and largest model we were able to train (M=3, K=3). For best viewing and more results, please visit our website https://cutt.ly/QGuCex.

6 Conclusion

In conclusion, we provide a full empirical study on the effect of finding minimal inductive bias and increasing model capacity for video generation. We perform a rigorous evaluation with five different metrics to analyze which types of inductive bias are important for generating accurate video dynamics when combined with large model capacity. Our experiments confirm the importance of recurrent connections and modeling stochasticity in the presence of uncertainty (e.g., videos with unknown action or control). We also find that maximizing the capacity of such models improves the quality of video prediction. We hope our work encourages the field to push along similar directions in the future – i.e., to see how far we can get by finding the right combination of minimal inductive bias and maximal model capacity for achieving high quality video prediction.

References

Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy Campbell, and Sergey Levine. Stochastic variational video prediction. In ICLR, 2018.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis.
In ICLR, 2019.

Wonmin Byeon, Qin Wang, Rupesh Kumar Srivastava, and Petros Koumoutsakos. ContextVP: Fully Context-Aware Video Prediction. In ECCV, 2018.

Emily Denton and Vighnesh Birodkar. Unsupervised Learning of Disentangled Representations from Video. In NeurIPS, 2017.

Emily Denton and Rob Fergus. Stochastic Video Generation with a Learned Prior. In ICML, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

Alexey Dosovitskiy and Thomas Brox. Inverting Visual Representations with Convolutional Networks. In CVPR, 2016.

Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control. arXiv preprint arXiv:1812.00568, 2018.

Chelsea Finn, Ian J. Goodfellow, and Sergey Levine. Unsupervised Learning for Physical Interaction through Video Prediction. In NeurIPS, 2016.

Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. IJRR, 2013.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.

Google. Cloud TPUs, 2018. URL https://cloud.google.com/tpu.

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In ICML, 2019.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017.

Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. arXiv preprint arXiv:1811.06965, 2018.

Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI, 2014.

Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model-Based Reinforcement Learning for Atari. arXiv preprint arXiv:1903.00374, 2019.

Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A Flow-Based Generative Model for Video. In ICML, 2018.

Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic Adversarial Video Prediction. arXiv preprint arXiv:1804.01523, 2018.

Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P. Xing. Dual Motion GAN for Future-Flow Embedded Video Prediction. In ICCV, 2017.

William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.

Vincent Michalski, Roland Memisevic, and Kishore Konda. Modeling deep temporal dependencies with recurrent “grammar cells”. In NeurIPS, 2014.

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games. In NeurIPS, 2015.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI, 2019.

Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michaël Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos.
arXiv preprint arXiv:1412.6604, 2014.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised Learning of Video Representations using LSTMs. In ICML, 2015.

Rich Sutton. The Bitter Lesson, 2019. URL http://www.incompleteideas.net/IncIdeas/BitterLesson.html.

Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards Accurate Generative Models of Video: A New Metric & Challenges. arXiv preprint arXiv:1812.01717, 2018.

Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing Motion and Content for Natural Video Sequence Prediction. In ICLR, 2017a.

Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to Generate Long-term Future via Hierarchical Prediction. In ICML, 2017b.

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating Videos with Scene Dynamics. In NeurIPS, 2016.

Nevan Wichers, Ruben Villegas, Dumitru Erhan, and Honglak Lee. Hierarchical Long-term Video Prediction without Supervision. In ICML, 2018.

Xinchen Yan, Akash Rastogi, Ruben Villegas, Kalyan Sunkavalli, Eli Shechtman, Sunil Hadap, Ersin Yumer, and Honglak Lee. MT-VAE: Learning motion transformations to generate multimodal human dynamics. In ECCV, 2018.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition.
In CVPR, 2018.