{"title": "Learning to Decompose and Disentangle Representations for Video Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 517, "page_last": 526, "abstract": "Our goal is to predict future video frames given a sequence of input frames. Despite large amounts of video data, this remains a challenging task because of the high-dimensionality of video frames. We address this challenge by proposing the Decompositional Disentangled Predictive Auto-Encoder (DDPAE), a framework that combines structured probabilistic models and deep networks to automatically (i) decompose the high-dimensional video that we aim to predict into components, and (ii) disentangle each component to have low-dimensional temporal dynamics that are easier to predict. Crucially, with an appropriately specified generative model of video frames, our DDPAE is able to learn both the latent decomposition and disentanglement without explicit supervision. For the Moving MNIST dataset, we show that DDPAE is able to recover the underlying components (individual digits) and disentanglement (appearance and location) as we would intuitively do. We further demonstrate that DDPAE can be applied to the Bouncing Balls dataset involving complex interactions between multiple objects to predict the video frame directly from the pixels and recover physical states without explicit supervision.", "full_text": "Learning to Decompose and Disentangle\n\nRepresentations for Video Prediction\n\nJun-Ting Hsieh\nStanford University\n\njunting@stanford.edu\n\nBingbin Liu\n\nStanford University\n\nbingbin@stanford.edu\n\nDe-An Huang\n\nStanford University\n\ndahuang@cs.stanford.edu\n\nLi Fei-Fei\n\nStanford University\n\nfeifeili@cs.stanford.edu\n\nJuan Carlos Niebles\nStanford University\n\njniebles@cs.stanford.edu\n\nAbstract\n\nOur goal is to predict future video frames given a sequence of input frames. 
Despite large amounts of video data, this remains a challenging task because of the high-dimensionality of video frames. We address this challenge by proposing the Decompositional Disentangled Predictive Auto-Encoder (DDPAE), a framework that combines structured probabilistic models and deep networks to automatically (i) decompose the high-dimensional video that we aim to predict into components, and (ii) disentangle each component to have low-dimensional temporal dynamics that are easier to predict. Crucially, with an appropriately specified generative model of video frames, our DDPAE is able to learn both the latent decomposition and disentanglement without explicit supervision. For the Moving MNIST dataset, we show that DDPAE is able to recover the underlying components (individual digits) and disentanglement (appearance and location) as we intuitively would do. We further demonstrate that DDPAE can be applied to the Bouncing Balls dataset involving complex interactions between multiple objects to predict the video frames directly from the pixels and recover physical states without explicit supervision.

1 Introduction

Our goal is to build intelligent systems that are capable of visually predicting and forecasting what will happen in video sequences. Visual prediction is a core problem in computer vision that has been studied in several contexts, including activity prediction and early recognition [20, 30], human pose and trajectory forecasting [1, 18], and future frame prediction [22, 31, 39, 44].
In particular, the ability to visually hallucinate future frames has enabled applications in robotics [8] and healthcare [26]. However, despite the availability of a large amount of video data, visual frame prediction remains a challenging task because of the high-dimensionality of video frames.

Our key insight into this high-dimensional, continuous sequence prediction problem is to decompose it into sub-problems that can be more easily predicted. Consider the example of predicting digit movements of Moving MNIST in Figure 1: the transformation that converts an entire frame containing two digits into the next frame is high-dimensional and non-linear. Directly learning such a transformation is challenging. On the other hand, if we decompose and understand this video correctly, the underlying dynamics that we must predict are simply the x, y coordinates of each individual digit, which are low-dimensional and easy to model and predict in this case (constant-velocity translation).

The main technical challenge is thus: How do we decompose the high-dimensional video sequence into sub-problems with lower-dimensional temporal dynamics? While the decomposition is seemingly obvious in the example from Figure 1, it is unclear how we can extend this to arbitrary videos. More importantly, how do we discover the decomposition automatically? It is infeasible or even impossible to hand-craft the decomposition for predicting each type of video.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Our key insight is to decompose the video into several components. The prediction of each individual component is easier than directly predicting the whole image sequence. It is important to note that the decomposition is learned automatically without explicit supervision.
While there have been previous works that similarly aim to reduce the complexity of frame prediction via human pose [38, 42] or patch-based models [22, 31, 40], they either require domain-specific external supervision [38, 42] or do not achieve a significant level of dimension reduction using heuristics [31].

We address this challenge by proposing the Decompositional Disentangled Predictive Auto-Encoder (DDPAE), a framework that combines structured probabilistic models and deep networks to automatically (i) decompose the video we aim to predict into components, and (ii) disentangle each component into low-dimensional temporal dynamics that are easy to predict. With an appropriately specified generative model of future frames, DDPAE is able to learn both the video decomposition and the component disentanglement that are effective for video prediction without any explicit supervision on these latent variables. By training a structural generative model of future frames like DDPAE, the aim is not only to obtain good future frame predictions, but also to learn to produce good decomposition and understanding of videos that significantly reduces the complexity of visual frame prediction.

We evaluate DDPAE on two datasets: Moving MNIST [31] and Bouncing Balls [3]. Moving MNIST has been widely used for evaluating video prediction models [15, 31, 43]. We show that DDPAE is able to learn to decompose videos in the Moving MNIST dataset into individual digits, and further disentangles each component into the digit's appearance and its spatial location, which is much easier to predict (Figure 1). This significantly reduces the complexity of frame prediction and leads to strong quantitative and qualitative improvements over baselines that aim to predict the video as a whole [6, 37].
We further demonstrate that DDPAE can be applied to the Bouncing Balls dataset, which has been used mainly for approaches that have access to full physical states (location, velocity, mass) [2, 3, 9]. We show that DDPAE is able to achieve reliable prediction of such complex systems directly from pixels, and to recover physical properties without explicitly modeling the physical states.

2 Related Work

Video Prediction. The task of video prediction has received increasing attention in the community. Early works include prediction on small image patches [28, 31]. Recent common approaches for full frame prediction predict the feature representations that generate future frames [6, 22, 23, 37, 38, 39] in a sequence-to-sequence framework [4, 32], which has been extended to incorporate spatio-temporal recurrence [15, 31, 43]. Instead of directly generating the pixels, transformation-based models focus on predicting the difference/transformation between frames and lead to sharper results [5, 8, 21, 35, 39, 40, 44, 45]. We also aim to predict the transformation, but only for the temporal dynamics of the decomposed and disentangled representation, which is much easier to predict than a whole-frame transformation.

Visual Representation Decomposition. Decomposing the video that we aim to predict into components plays an important role in the success of our method. The idea of visual representation decomposition has also been applied in different contexts, including representation learning [27], physics modeling [3], and scene understanding [7]. In particular, some previous works use methods such as Expectation Maximization to perform perceptual grouping and discover individual objects in videos [11, 12, 36].

A highly related work is Attend-Infer-Repeat (AIR) by Eslami et al. [7], which decomposes images in a variational auto-encoder framework.
Our work goes beyond the image and extends to the temporal dimension, where the model automatically learns the decomposition that is best suited for predicting the future frames. Concurrent to our work, Kosiorek et al. [19] proposed the Sequential Attend-Infer-Repeat (SQAIR), which extends the AIR model and is very similar to our work.

Disentangled Representation. To learn a meaningful decomposition, our DDPAE enforces the components to be disentangled into a representation with low-dimensional temporal dynamics. The idea of disentangled representation has already been explored [6, 34, 37] for video. Denton et al. [6] proposed DRNet, where representations are disentangled into content and pose, and the poses are penalized for encoding semantic information with the use of a discrimination loss. Similarly, MCNet [37] disentangles motion from content using image differences and shares a single content vector in prediction. Note that some videos are hard to directly disentangle. Our work addresses this by decomposing the video so that each component can actually be disentangled.

Variational Auto-Encoder (VAE). Our DDPAE is based on the VAE [17], which provides one solution to the multiple-futures problem [42, 44]. VAEs have been used for image and video generation [7, 13, 28, 29, 33, 41, 42, 44]. Our key contribution is to make the model structural, where the latent representation is decomposed and, more importantly, disentangled. Our network models both motion and content probabilistically, and is regularized by learning transformations in a way similar to [16].

3 Methods

Our goal is to predict K future frames given T input frames.
Our core insight is to combine structured probabilistic models and deep networks to (i) decompose the high-dimensional video into components, and (ii) disentangle each component into low-dimensional temporal dynamics that are easy to predict. First, we take a Bayesian perspective and propose the Decompositional Disentangled Predictive Auto-Encoder (DDPAE) as our formulation in Section 3.1. Next, we discuss our deep parameterization of each of the components in DDPAE in Section 3.2. Finally, we show how we learn the DDPAE by optimizing the evidence lower bound in Section 3.3.

3.1 Decompositional Disentangled Predictive Auto-Encoder

Formally, given an input video x_{1:T} of length T, our goal is to predict the K future frames x̄_{1:K} = x_{(T+1):(T+K)}. For simplicity, in this paper we denote any variable z̄_{1:K} to be the prediction sequence of z from time step T+1 to T+K, i.e., z̄_{1:K} = z_{(T+1):(T+K)}. We assume that each video frame x_t is generated from a corresponding latent representation z_t. In this case, we can formulate video frame prediction p(x̄_{1:K} | x_{1:T}) as:

    p(x̄_{1:K} | x_{1:T}) = ∫∫ p(x̄_{1:K} | z̄_{1:K}) p(z̄_{1:K} | z_{1:T}) p(z_{1:T} | x_{1:T}) dz̄_{1:K} dz_{1:T},   (1)

where p(x̄_{1:K} | z̄_{1:K}) is the frame decoder for generating frames based on latent representations, p(z̄_{1:K} | z_{1:T}) is the prediction model that captures the dynamics of the latent representations, and p(z_{1:T} | x_{1:T}) is the temporal encoder that infers the latent representations given the input video x_{1:T}. From a Bayesian perspective, we model all three as probability distributions.

Our core insight is to decompose the video prediction problem in Eq. (1) into sub-problems that are easier to predict.
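Concretely, the factorization in Eq. (1) is an encode–predict–decode pipeline. The following is a hypothetical sketch with deterministic toy linear maps standing in for the three learned distributions (all names, dimensions, and weights here are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative): T input frames, K predicted frames,
# D-dimensional flattened frames, H-dimensional latents.
T, K, D, H = 10, 10, 64, 8

W_enc = 0.1 * rng.normal(size=(H, D))   # stands in for the temporal encoder p(z_{1:T} | x_{1:T})
W_pred = 0.1 * rng.normal(size=(H, H))  # stands in for the prediction model p(z_{T+1:T+K} | z_{1:T})
W_dec = 0.1 * rng.normal(size=(D, H))   # stands in for the frame decoder p(x_{T+1:T+K} | z_{T+1:T+K})

def predict_frames(x):
    """x: (T, D) input frames -> (K, D) predicted future frames."""
    z = x @ W_enc.T            # encode: one latent per input frame
    z_t = z[-1]                # summarize the input by its last latent
    preds = []
    for _ in range(K):         # roll the latent dynamics forward K steps
        z_t = np.tanh(W_pred @ z_t)
        preds.append(W_dec @ z_t)   # decode each predicted latent to a frame
    return np.stack(preds)

x = rng.normal(size=(T, D))
x_pred = predict_frames(x)
```

The point of Eq. (1) is that the hard conditional p(x̄_{1:K} | x_{1:T}) never has to be modeled monolithically: only the three smaller factors are learned.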
In a simplified case, where each of the components can be predicted independently (e.g., digits in Figure 1), we can use the following decomposition:

    x̄_{1:K} = Σ_{i=1}^{N} x̄^i_{1:K},   x_{1:T} = Σ_{i=1}^{N} x^i_{1:T},   (2)

    p(x̄^i_{1:K} | x^i_{1:T}) = ∫∫ p(x̄^i_{1:K} | z̄^i_{1:K}) p(z̄^i_{1:K} | z^i_{1:T}) p(z^i_{1:T} | x^i_{1:T}) dz̄^i_{1:K} dz^i_{1:T},   (3)

where we decompose the input x_{1:T} into {x^i_{1:T}} and independently predict the future frames {x̄^i_{1:K}}, which will be combined as the final prediction x̄_{1:K}. We will use this independence assumption for the sake of explanation, but we will show later how this can easily be extended to the case where the components are interdependent, which is crucial for capturing interactions between components.

The key technical challenge is thus: How do we learn the decomposition? How do we enforce that each component is actually easier to predict? One can imagine a trivial decomposition, where x^1_{1:T} = x_{1:T} and x^i_{1:T} = 0 for i > 1. This does not simplify the prediction at all, but only keeps the same complexity in a single component. We address this challenge by enforcing the latent representations of each component (z̄^i_{1:K} and z^i_{1:T}) to have low-dimensional temporal dynamics. In other words, the temporal signal to be predicted in each component should be low-dimensional. More specifically, we achieve this by leveraging the disentangled representation [6]: a latent representation z^i_t is disentangled into the concatenation of (i) a time-invariant content vector z^i_{t,C}, and (ii) a time-dependent (low-dimensional) pose vector z^i_{t,P}. The content vector captures the information that is shared across all frames of the component. For example, in the first component of Figure 1, the content vector models the appearance of the digit "9".
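A minimal sketch of the additive composition in Eq. (2), under the toy assumption that each component renders a small square whose only temporal state is its (x, y) position (the renderer and all sizes are made up for illustration):

```python
import numpy as np

H, W = 64, 64

def render_component(cx, cy, size=8):
    """Render one component: a bright square at integer position (cx, cy)."""
    frame = np.zeros((H, W))
    frame[cy:cy + size, cx:cx + size] = 1.0
    return frame

# Each component's temporal dynamics are just its low-dimensional pose (x, y);
# the full frame is the (clipped) sum of per-component frames: x_t = sum_i x_t^i.
poses = [(5, 10), (40, 30)]
components = [render_component(cx, cy) for cx, cy in poses]
frame = np.clip(sum(components), 0.0, 1.0)
```

Predicting the two 2-dimensional poses is far easier than predicting the 4096-dimensional frame directly.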
Formally, we assume the content vector is the same for all frames in both the input and the prediction: z^i_{t,C} = z̄^i_{t,C} = z^i_C. On the other hand, the pose vector z^i_{t,P} is low-dimensional, and captures the location of the digit in Figure 1. This allows us to disentangle the prediction of the decomposed latent representations as follows:

    z^i_t = [z^i_C, z^i_{t,P}],   z̄^i_t = [z^i_C, z̄^i_{t,P}],   p(z̄^i_{1:K} | z^i_{1:T}) = p(z̄^i_{1:K,P} | z^i_{1:T,P}),   (4)

where the prediction p(z̄^i_{1:K} | z^i_{1:T}) is reduced to just predicting the low-dimensional pose vectors, p(z̄^i_{1:K,P} | z^i_{1:T,P}). This is possible since we share the content vector between the input and the prediction. This disentangled representation allows the prediction of each component to focus on the low-dimensional varying pose vectors, and significantly simplifies the prediction task.

Eq. (2)-(4) thus define the proposed Decompositional Disentangled Predictive Auto-Encoder (DDPAE). Note that both the decomposition and the disentanglement are learned automatically without explicit supervision. Our formulation encourages the model to decompose the video into components with low-dimensional temporal dynamics in the disentangled representation. By training this structural generative model of future frames, the hope is to learn to produce good decomposition and disentangled representations of the video that reduce the complexity of frame prediction.

3.2 Model Implementation

We have formulated how we decompose the video prediction problem into sub-problems of disentangled representations that are easier to predict in our DDPAE framework. In this section, we discuss our implementation of each component of the model in Eq. (2)-(4), starting from the generation p(x̄^i_{1:K} | z̄^i_{1:K}), then the inference p(z^i_{1:T} | x^i_{1:T}), and finally the prediction p(z̄^i_{1:K,P} | z^i_{1:T,P}).

Frame Generation Model. In Eq. (3), p(x̄^i_{1:K} | z̄^i_{1:K}) is the frame generation model. We assume conditional independence between the frames: p(x̄^i_{1:K} | z̄^i_{1:K}) = Π_{j=1}^{K} p(x̄^i_j | z̄^i_j). This model is used for both input reconstruction p(x^i_t | z^i_t) and prediction p(x̄^i_t | z̄^i_t). Our frame generation model is flexible and can vary based on the domain. For 2D scenes, we follow work in scene understanding [7] and use an attention-based generative model. Note that our latent representation is disentangled: z̄^i_t = [z^i_C, z̄^i_{t,P}], where z^i_C is the fixed content vector (e.g., the latent representation of the digit), and z̄^i_{t,P} is the pose vector (e.g., the location and scale of the digit). As shown in Figure 2(c), we generate the image x̄^i_t as follows: First, the content vector is decoded to a rectified image ȳ^i_t using deconvolution layers. Next, the pose vector is used to parameterize an inverse spatial transformer T_z^{-1} [14] to warp ȳ^i_t into the generated frame x̄^i_t. The pose vector in this example is a 3-dimensional continuous variable, which significantly simplifies the prediction problem compared to predicting the full frame.

Inference. In Eq. (3), our prediction requires the inference of the latent representations, p(z^i_{1:T} | x^i_{1:T}). Given our generation model p(x^i_t | z^i_t), the true posterior distribution is intractable. Thus, the standard practice is to employ a variational approximation q(z^i_{1:T} | x^i_{1:T}) to the true posterior [17]. Since our latent representations are decomposed and disentangled, we explain our model q in the following two sections: Video Decomposition and Disentangled Representation.

Video Decomposition. The next question is: How do we get the decomposed x^i_{1:T} from x_{1:T}? Eq. (3) assumes that the decomposition is given. Our key observation is that even if we decompose the input x_{1:T} into {x^i_{1:T}} in a separate step, the decomposed video would only be used to infer its respective latent representation through variational approximation. In this case, we can combine the video decomposition with the variational approximation as q(z^i_{1:T} | x_{1:T}), which directly infers the latent representations of each component. We implement q(z^i_{1:T} | x_{1:T}) using an RNN with 2-dimensional recurrence, where one recurrence is for the temporal modeling (1:T) and the other is used to capture the dependencies between components. For instance, in the video in Figure 1, the component of digit "6" needs to know that "9" is already modeled by the first component. Figure 2(a) shows our 2-dimensional recurrence (our input RNN) over both the time steps and the components.

Figure 2: Overview of our model implementation. (a) We use 2D recurrence to implement q(z^i_{1:T} | x_{1:T}) to model both the temporal dynamics and the dependencies between components. (b) The prediction RNN is used only to predict the pose vectors. (c) Our frame generation model generates different images with the same content using an inverse spatial transformer. (d) A single content vector z^i_C is obtained for each component from the input x_{1:T} and the pose vectors z^i_{1:T,P}.

Disentangled Representation. While the 2D recurrence model can directly infer the latent representations, it is not guaranteed to output disentangled representations. We thus design a structural inference model to disentangle the representation. In contrast to frame generation, where the goal is to generate different frames conditioned on the same content vector, the goal here in inference is to revert the process and obtain a single shared content vector z^i_C for different frames, and hence force the variations between frames to be encoded in the pose vectors z^i_{1:T,P}. Thus, we apply the inverse of the structural model in our generation process (see Figure 2(d)). For 2D scenes, this means applying the spatial transformer parameterized by z^i_{t,P} to extract the rectified image y^i_t from the frame x_t. We then use a CNN to encode each y^i_t into a latent representation. Instead of training with a similarity regularization [6], we use another RNN on top of the raw output as pooling to obtain a single content vector z^i_C for each component, inferred from z^i_{1:T,P} and x_{1:T}. Since the same z^i_C is used for each time step in prediction, this forces the decomposition of our model to separate components with different motions to get a good prediction of the sequence.

Pose Prediction. The final component is the pose vector prediction p(z̄^i_{1:K,P} | z^i_{1:T,P}). Since z^i_C is fixed in prediction, we only need to predict the pose vectors. Inspired by [16], instead of directly inferring z^i_{t,P}, we introduce a set of transition variables β^i_t to reparametrize the pose vectors. Given z^i_{t-1,P} and β^i_t, the transition to z^i_{t,P} is deterministic with a linear combination: z^i_{t,P} = f(z^i_{t-1,P}, β^i_t). This allows us to use a meaningful prior for β_t. Therefore, as shown in Figure 2(a) and (b), given an input sequence x_{1:T}, for each component our model infers an initial pose vector z^i_{0,P} and the transition variables β^i_{1:T}, from which we can iteratively obtain z^i_{t,P} at each time step. We use a seq2seq-based [4, 32] model to predict β̄^i_{1:K} (Figure 2(b)). With this RNN-based model, the dependencies between the poses of components can be captured by passing the hidden states across components. This allows the model to learn and predict interactions between components, such as collisions between objects.
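The pose rollout described above can be sketched as follows, assuming the simplest instance of the linear combination, f(z_{t-1,P}, β_t) = z_{t-1,P} + β_t (one possible choice of f; the model's actual parameterization may differ):

```python
import numpy as np

def rollout_poses(z0_P, betas):
    """Iteratively apply z_{t,P} = f(z_{t-1,P}, beta_t); here f is addition."""
    poses, z_P = [], z0_P
    for beta in betas:
        z_P = z_P + beta
        poses.append(z_P)
    return np.stack(poses)

z0_P = np.zeros(3)                                     # 3-dim pose, e.g. (x, y, scale)
betas = np.tile(np.array([0.1, -0.05, 0.0]), (10, 1))  # constant-velocity transitions
poses = rollout_poses(z0_P, betas)
```

With this reparametrization, the prediction RNN only has to output the future transition variables; the poses themselves then follow deterministically.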
Table 1: Results on Moving MNIST (bold for the best and underline for the second best in the original). Our results significantly outperform the baselines.

Model                      | BCE    | MSE
Shi et al. [43]            | 367.2  | -
Srivastava et al. [31]     | 341.2  | -
Brabandere et al. [5]      | 285.2  | -
Patraucean et al. [25]     | 262.6  | -
Ghosh et al. [10]          | 241.8  | 167.9
Kalchbrenner et al. [15]   | 87.6   | -
MCNet [37]                 | 1308.2 | 173.2
DRNet [6]                  | 862.7  | 163.9
Ours w/o Decomposition     | 325.5  | 77.6
Ours w/o Disentanglement   | 296.1  | 65.6
Ours (DDPAE)               | 223.0  | 38.9

Figure 3: DDPAE separates the two digits and obtains good results even when the digits overlap. The bounding boxes of the two components are drawn manually.

3.3 Learning

Our DDPAE framework is based on VAEs [17], and thus we can use the same variational techniques to optimize our model. For a VAE, the assumption is that each data point x is generated from a latent random variable z with p_θ(x|z), where z is sampled from a prior p_θ(z). In our case, the output video x̄_{1:K} is generated from the latent representations z̄^{1:N}_{1:K} of N components, where z̄^i_t is the disentangled representation [z^i_C, z̄^i_{t,P}] (Eq. (4)) of the i-th component, and z̄^i_{1:K,P} is parameterized by the initial pose z^i_{0,P} and the transition variables β^i_{1:(T+K)}. Therefore, in our model, we treat z^{1:N}_C, z^{1:N}_{0,P}, and β^{1:N}_{1:(T+K)} as the underlying random latent variables that generate the data x̄_{1:K}.
We denote z̄ as the combined set of random variables in our model. z̄ is inferred from the input frames, z̄ ~ q_φ(z̄ | x_{1:T}), where q_φ is our inference model explained in Section 3.2, parameterized by φ. The output frames x̄_{1:K} are generated by x̄_{1:K} ~ p_θ(x̄_{1:K} | z̄), where p_θ is our frame generation model parameterized by θ. Moreover, we assume the prior distribution to be p(z̄) = N(μ, diag(σ²)). We jointly optimize θ and φ by maximizing the evidence lower bound (ELBO):

    log p_θ(x̄_{1:K}) ≥ E_q[log p_θ(x̄_{1:K}, z̄) − log q_φ(z̄ | x_{1:T})] = E_q[log p_θ(x̄_{1:K} | z̄)] − KL(q_φ(z̄ | x_{1:T}) || p(z̄)),   (5)

The first term corresponds to the prediction error, and the second term serves as a regularization of the latent variables z̄. With the reparametrization trick, the entire model is differentiable, and the parameters θ and φ can be jointly optimized by standard backpropagation.

4 Experiments

Our goal is to predict a sequence of future frames given a sequence of input frames. The key contribution of our DDPAE is to both decompose and disentangle the video representation to simplify the challenging frame prediction task. First, we evaluate the importance of both the decomposition and disentanglement of the video representation for frame prediction on the widely used Moving MNIST dataset [31]. Next, we evaluate how DDPAE can be applied to videos involving more complex interactions between components on the Bouncing Balls dataset [3, 36].
Finally, we evaluate how DDPAE can generalize and adapt to cases where the optimal number of components is not known a priori, which is important for applying DDPAE to new domains of videos. Code for DDPAE and the experiments is available at https://github.com/jthsieh/DDPAE-video-prediction.

4.1 Evaluating Decompositional Disentangled Video Representation

The key element of DDPAE is learning the decompositional-disentangled representations. We evaluate the importance of both decomposition and disentanglement using the Moving MNIST dataset. Since the digits in the videos follow independent low-dimensional trajectories, our framework significantly simplifies the prediction task from the original high-dimensional pixel prediction. We show that DDPAE is able to learn the decomposition and disentanglement automatically without explicit supervision, which plays an important role in the accurate prediction of DDPAE.

We compare two state-of-the-art video prediction methods without decomposition as baselines: MCNet [37] and DRNet [6]. Both models perform video prediction using disentangled representations, similar to our model with only one component. We use the code provided by the authors of the two papers. For reference, we also list the results of existing work on Moving MNIST, where they use more complicated models such as convolutional LSTM or PixelCNN decoders [15, 25, 43].

Dataset. Moving MNIST is a synthetic dataset consisting of two digits moving independently in a 64 × 64 frame. It has been used in many previous works [6, 12, 15, 24, 31]. For training, each sequence is generated on-the-fly by sampling MNIST digits and generating trajectories with randomly sampled velocity and angle. The test set is a fixed dataset downloaded from [31] consisting of 10,000 sequences of 20 frames, with 10 as input and 10 to predict.

Evaluation Metric. We follow [31] and use the binary cross-entropy (BCE) as the evaluation metric. We also report the mean squared error (MSE) as an additional metric from [10].

Results. Table 1 shows the quantitative results. DDPAE significantly outperforms the baselines without decomposition (MCNet, DRNet) and without disentanglement. For MCNet and DRNet, the latent representations need to contain complicated information about the digits' combined content and motion, and moreover, the decoder has the much harder task of generating two digits at the same time. In fact, [6] specifically stated that DRNet is unable to get good results when the two digits have the same color. In addition, our baseline without disentanglement produces blurry results due to the difficulty of predicting representations.

Our model, on the other hand, greatly simplifies the inference of the latent variables and the decoder by both decomposition and disentanglement, resulting in better prediction. This is also shown in the qualitative results in Figure 3, where DDPAE successfully separates the two digits into two components and only needs to predict the low-dimensional pose vectors. Note that DDPAE can also handle occlusion. Compared to existing works, DDPAE achieves the best result except for the BCE of VPN [15], which may be the result of its more sophisticated image generation process using PixelCNN.

Figure 4: Our model prediction on Bouncing Balls. Note that our model correctly predicts the collision between the two balls in the upper right corner, whereas the baseline model does not.

Figure 5: Accuracy of velocity with time. Top: Relative error in magnitude. Bottom: Cosine similarity.
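The two quantitative metrics above can be sketched as follows; summing over all pixels and frames of a sequence is an assumption here, since the compared papers may normalize differently:

```python
import numpy as np

def bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy between binary frames x and predictions x_hat in (0, 1)."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

def mse(x, x_hat):
    """Squared error between predicted and ground-truth frames, summed."""
    return np.sum((x - x_hat) ** 2)

rng = np.random.default_rng(0)
x = (rng.random((10, 64, 64)) > 0.5).astype(float)  # 10 binary ground-truth frames
x_hat = rng.random((10, 64, 64))                    # predicted frame probabilities
scores = bce(x, x_hat), mse(x, x_hat)
```

A perfect prediction drives the MSE to exactly zero and the BCE to nearly zero, up to the clipping epsilon.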
The main contribution of DDPAE is in the decomposition and disentanglement, which is in principle applicable to other existing models like VPN.

It is worth noting that the ordering of the components is learned automatically by the model. We obtain the final output by adding the components, which is a permutation-invariant operation. The model can learn to generate the components in any order, as long as the final frames are correct. This phenomenon is also observed in many fields, including tracking and object detection.

4.2 Evaluating Interdependent Components

Previously, in Eq. (3), we assumed the components to be independent, i.e., the pose of each component is separately predicted without information about the other components. The independence assumption is not true in most scenarios, as components in a video may interact with each other. Therefore, it is important for us to generalize to interdependent components. In Section 3.2, we explain how our model adds dependencies between components in the prediction RNN. We now evaluate its importance in more complex videos. We evaluate the interdependency on the Bouncing Balls dataset [3]. Bouncing Balls is ideal for evaluating this because (i) it is widely used for methods with access to physical states [2, 3, 9] and (ii) it involves physical interactions between components. One contribution of our DDPAE framework is the ability to achieve complex physical system predictions directly from the pixels, without any physics information and assumptions.

Figure 6: Results of DDPAE trained on a variable number of digits. Only the predicted frames are shown.
Our model is able to correctly handle redundant components.

Dataset. We simulate sequences of 4 balls bouncing in an image with the physics engine code used in [3]. The balls are allowed to bounce off walls and collide with each other. Following the prediction task setting in [3], the balls have the same mass and the maximum velocity is 60 pixels/second (roughly 6 pixels/frame). The size of the original videos is 800 pixels, so we re-scale the videos to 128 × 128. We generated a fixed training set of 50,000 sequences and a test set of 2,000 sequences.

Evaluation Metric. The primary goal of this experiment is to evaluate the importance of modeling the dependencies between components. Therefore, following [3], we evaluate the predicted velocities of the balls. Since our model outputs the spatial transformer of each component at every time step, we can calculate the position p^i_t of the attention region directly and thus the translation between frames. We normalize the positions to be in [0, 1], and define the velocity to be v^i_t = p^i_{t+1} − p^i_t. At every time step, we calculate the relative error in magnitude and the cosine similarity between the predicted and ground-truth velocities, which correspond to the speed and the direction respectively. The final results are averaged over all instances in the test set. Note that the correspondence between components and balls is not known, so we first match each component to a ball by minimum distance.

Results. Figure 4 shows results of our model on Bouncing Balls. Each component captures a single ball correctly. Note that during prediction, a collision occurs between the two balls in the upper right corner in the ground-truth video. Our model successfully predicts that the colliding balls bounce off each other instead of overlapping each other.
On the other hand, our baseline model predicts the balls' motion independently and fails to identify the collision, so the two balls overlap in the predicted video. This shows that DDPAE is able to capture the important dependencies between components when predicting the pose vectors. It is worth noting that predicting the trajectory after a collision is a fundamentally challenging problem for our model, since it depends heavily on the collision surface of the balls, which is very hard to predict accurately. Figure 5 shows the relative error in magnitude and the cosine similarity between the predicted and ground truth velocities at each time step during prediction. The accuracy of the predicted velocities decreases with time, as expected. We compare our model against the baseline model without interdependent components. Figure 5 shows that our model outperforms the baseline on both metrics. The dependencies allow our model to capture the interactions between balls, and hence generate more accurate predictions.

4.3 Evaluating Generalization to Unknown Number of Components

In the previous experiments, the number of objects in the video is known and fixed, and thus we set the number of components in DDPAE to be the same. However, videos may contain an unknown and variable number of objects. We evaluate the robustness of our model in these scenarios with the Moving MNIST dataset. We set the number of components to be 3 for all experiments, and the number of digits to be a subset of {1, 2, 3}. Similar to previous experiments, we generate the training sequences on-the-fly and evaluate on a fixed test set.

[Figure 6 panels: (a) train on 1, 2, or 3 digits; (b) train on 2 digits; (c) train on 1 or 3 digits. Rows: ground truth, the three components, and the final output.]

Figure 6 (a) shows results of our model trained on 1 to 3 digits. The two test sequences have 1 and 3 digits, respectively. 
For sequences with 1 digit, our model learns to set the two redundant components to empty, while for sequences with 3 digits, it correctly separates the 3 digits into 3 components. We observe similar results when we train our model with 2 digits: Figure 6 (b) shows that our model learns to set the extra component to empty.
Next, we train our model with sequences containing 1 or 3 digits, but test with sequences of 2 digits. In this case, the number of digits is unseen during training. Figure 6 (c) shows that our model is able to produce correct results as well. Interestingly, two of the components generate exactly the same outputs. This is reasonable, since we do not set any constraints between components.

5 Conclusion

We presented the Decompositional Disentangled Predictive Auto-Encoder (DDPAE), a video prediction framework that explicitly decomposes and disentangles the video representation and reduces the complexity of future frame prediction. We show that, with an appropriately specified structural model, DDPAE is able to learn both the video decomposition and disentanglement that are effective for video prediction, without any explicit supervision on these latent variables. This leads to strong quantitative and qualitative improvements on the Moving MNIST dataset. We further show that DDPAE is able to achieve reliable prediction directly from the pixels on the Bouncing Balls dataset, which involves complex object interactions, and to recover physical properties without explicitly modeling the physical states.

Acknowledgements

This work was partially funded by Panasonic and Oppo. We thank our anonymous reviewers, John Emmons, Kuan Fang, Michelle Guo, and Jingwei Ji for their helpful feedback and suggestions.

References
[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
[2] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. 
Interaction networks for learning about objects, relations and physics. In NIPS, 2016.
[3] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum. A compositional object-based approach to learning physical dynamics. In ICLR, 2017.
[4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[5] B. De Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016.
[6] E. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In NIPS, 2017.
[7] S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In NIPS, 2016.
[8] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
[9] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik. Learning visual predictive models of physics for playing billiards. In ICLR, 2016.
[10] A. Ghosh, V. Kulharia, A. Mukerjee, V. Namboodiri, and M. Bansal. Contextual RNN-GANs for abstract reasoning diagram generation. In AAAI, 2017.
[11] K. Greff, A. Rasmus, M. Berglund, T. Hao, H. Valpola, and J. Schmidhuber. Tagger: Deep unsupervised perceptual grouping. In NIPS, 2016.
[12] K. Greff, S. van Steenkiste, and J. Schmidhuber. Neural expectation maximization. In NIPS, 2017.
[13] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[14] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[15] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In ICML, 2017.
[16] M. 
Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. In ICLR, 2016.
[17] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
[18] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
[19] A. R. Kosiorek, H. Kim, I. Posner, and Y. W. Teh. Sequential attend, infer, repeat: Generative modelling of moving objects. In NIPS, 2018.
[20] T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
[21] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
[22] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[23] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, 2015.
[24] M. Oliu, J. Selva, and S. Escalera. Folded recurrent neural networks for future video prediction. In ECCV, 2018.
[25] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309, 2015.
[26] C. Paxton, Y. Barnoy, K. Katyal, R. Arora, and G. D. Hager. Visual robot task planning. arXiv preprint arXiv:1804.00062, 2018.
[27] R. Gao, D. Jayaraman, and K. Grauman. Object-centric representation learning from unlabeled videos. In ACCV, November 2016.
[28] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[29] D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.
[30] B. Soran, A. 
Farhadi, and L. Shapiro. Generating notifications for missing actions: Don't forget to turn the lights off! In ICCV, 2015.
[31] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[32] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[33] S. Tulyakov, A. Fitzgibbon, and S. Nowozin. Hybrid VAE: Improving deep generative models using partial observations. arXiv preprint arXiv:1711.11566, 2017.
[34] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993, 2017.
[35] J. Van Amersfoort, A. Kannan, M. Ranzato, A. Szlam, D. Tran, and S. Chintala. Transformation-based models of video sequences. arXiv preprint arXiv:1701.08435, 2017.
[36] S. van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In ICLR, 2018.
[37] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
[38] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. arXiv preprint arXiv:1704.05831, 2017.
[39] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[40] C. Vondrick and A. Torralba. Generating the future with adversarial transformers. In CVPR, 2017.
[41] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.
[42] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In ICCV, 2017.
[43] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. 
Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
[44] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
[45] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In ECCV, 2016.