{"title": "Unsupervised Learning of Disentangled Representations from Video", "book": "Advances in Neural Information Processing Systems", "page_first": 4414, "page_last": 4423, "abstract": "We present a new model DRNET that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos. For the latter, we demonstrate the ability to coherently generate up to several hundred steps into the future.", "full_text": "Unsupervised Learning of Disentangled Representations from Video

Emily Denton
Department of Computer Science
New York University
denton@cs.nyu.edu

Vighnesh Birodkar
Department of Computer Science
New York University
vighneshbirodkar@nyu.edu

Abstract

We present a new model DRNET that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos, demonstrating the ability to coherently generate hundreds of steps into the future.

1 Introduction

Unsupervised learning from video is a long-standing problem in computer vision and machine learning. 
The goal is to learn, without explicit labels, a representation that generalizes effectively to a previously unseen range of tasks, such as semantic classification of the objects present, predicting future frames of the video, or classifying the dynamic activity taking place. There are several prevailing paradigms: the first, known as self-supervision, uses domain knowledge to implicitly provide labels (e.g. predicting the relative position of patches on an object [4] or using feature tracks [36]). This allows the problem to be posed as a classification task with self-generated labels. The second general approach relies on auxiliary action labels, available in real or simulated robotic environments. These can either be used to train action-conditional predictive models of future frames [2, 20] or inverse-kinematics models [1], which attempt to predict actions from current and future frame pairs. The third, and most general, approach uses predictive auto-encoders (e.g. [11, 12, 18, 31]), which attempt to predict future frames from current ones. To learn effective representations, some kind of constraint on the latent representation is required.

In this paper, we introduce a form of predictive auto-encoder which uses a novel adversarial loss to factor the latent representation for each video frame into two components, one that is roughly time-independent (i.e. approximately constant throughout the clip) and another that captures the dynamic aspects of the sequence, thus varying over time. We refer to these as content and pose components, respectively. The adversarial loss relies on the intuition that while the content features should be distinctive of a given clip, individual pose features should not. Thus the loss encourages pose features to carry no information about clip identity. 
Empirically, we find training with this loss to be crucial to inducing the desired factorization.

We explore the disentangled representation produced by our model, which we call Disentangled-Representation Net (DRNET), on a variety of tasks. The first of these is predicting future video frames, something that is straightforward to do using our representation. We apply a standard LSTM model to the pose features, conditioning on the content features from the last observed frame. Despite the simplicity of our model relative to other video generation techniques, we are able to generate convincing long-range frame predictions, out to hundreds of time steps in some instances. This is significantly further than existing approaches that use real video data. We also show that DRNET can be used for classification. The content features capture the semantic content of the video and thus can be used to predict object identity. Alternately, the pose features can be used for action prediction.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Related work

On account of its natural invariances, image data naturally lends itself to an explicit "what" and "where" representation. The capsule model of Hinton et al. [10] performed this separation via an explicit auto-encoder structure. Zhao et al. [40] proposed a multi-layered version, which has similarities to ladder networks [23]. Several weakly supervised approaches have been proposed to factor images into style and content (e.g. [19, 24]). These methods all operate on static images, whereas our approach uses temporal structure to separate the components.

Factoring video into time-varying and time-independent components has been explored in many settings. Classic structure-from-motion methods use an explicit affine projection model to extract a 3D point cloud and camera homography matrices [8]. 
In contrast, Slow Feature Analysis [38] has no model, instead simply penalizing the rate of change in time-independent components and encouraging their decorrelation. Most closely related to ours is Villegas et al. [33], which uses an unsupervised approach to factoring video into content and motion. Their architecture is also broadly similar to ours, but the loss functions differ in important ways. They rely on pixel/gradient space ℓp-norm reconstructions, plus a GAN term [6] that encourages the generated frames to be sharp. We also use an ℓ2 pixel-space reconstruction. However, we apply this pixel-space loss only in combination with a novel adversarial term on the pose features to learn the disentangled representation. In contrast to [33], our forward model acts on latent pose vectors rather than predicting pixels directly.

Other approaches explore general methods for learning disentangled representations from video. Kulkarni et al. [14] show how explicit graphics code can be learned from datasets with systematic dimensions of variation. Whitney et al. [37] use a gating principle to encourage each dimension of the latent representation to capture a distinct mode of variation. Grathwohl et al. [7] propose a deep variational model to disentangle space and time in video sequences.

A range of generative video models, based on deep nets, have recently been proposed. Ranzato et al. [22] adopt a discrete vector quantization approach inspired by text models. Srivastava et al. [31] use LSTMs to generate entire frames. Video Pixel Networks [12] use these models in a conditional manner, generating one pixel at a time in raster-scan order (similar image models include [27, 32]). Finn et al. [5] use an LSTM framework to model motion via transformations of groups of pixels. Cricri et al. [3] use a ladder of stacked-autoencoders. 
Other works predict optical flow fields that can be used to extrapolate motion beyond the current frame, e.g. [17, 39, 35]. In contrast, a single pose vector is predicted in our model, rather than a spatial field.

Chiappa et al. [2] and Oh et al. [20] focus on prediction in video game environments, where known actions at each frame permit action-conditional generative models that give accurate long-range predictions. In contrast to the above works, whose latent representations combine both content and motion, our approach relies on a factorization of the two, with a predictive model only being applied to the latter. Furthermore, we do not attempt to predict pixels directly, instead applying the forward model in the latent space. Chiappa et al. [2], like our approach, produce convincing long-range generations. However, the video game environment is somewhat more constrained than the real-world video we consider, since actions are provided during generation.

Several video prediction approaches have been proposed that focus on handling the inherent uncertainty in predicting the future. Mathieu et al. [18] demonstrate that a loss based on GANs can produce sharper generations than traditional ℓ2-based losses. [34] train a series of models which aim to span possible outcomes and select the most likely one at any given instant. While we considered GAN-based losses, the more constrained nature of our model, and the fact that our forward model does not directly generate in pixel-space, meant that standard deterministic losses worked satisfactorily.

3 Approach

In our model, two separate encoders produce distinct feature representations of content and pose for each frame. They are trained by requiring that the content representation of frame x^t and the pose representation of future frame x^{t+k} can be combined (via concatenation) and decoded to predict the pixels of future frame x^{t+k}. 
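This encode-concatenate-decode constraint can be sketched as follows. This is a minimal illustrative sketch in plain Python, with frames and feature vectors as lists; `E_c`, `E_p` and `D` are hypothetical stand-ins for the trained networks, not the paper's implementation.

```python
# Sketch of the DRNET reconstruction constraint: content of frame x_t plus pose
# of frame x_{t+k}, concatenated and decoded, should reproduce frame x_{t+k}.
# E_c, E_p and D are stand-ins for the trained networks (illustrative only).

def predict_future_frame(x_t, x_tk, E_c, E_p, D):
    h_c = E_c(x_t)        # content: the stationary part, from the current frame
    h_p = E_p(x_tk)       # pose: the time-varying part, from the future frame
    return D(h_c + h_p)   # decode the concatenated representation

def reconstruction_loss(x_t, x_tk, E_c, E_p, D):
    # per-pixel squared (l2) error against the true future frame
    pred = predict_future_frame(x_t, x_tk, E_c, E_p, D)
    return sum((p - q) ** 2 for p, q in zip(pred, x_tk))
```

Note that the decoder never sees the future frame's content features; it must reconstruct the future pixels from the current frame's content and the future frame's pose alone, which is what pressures the two encoders toward the desired split.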
However, this reconstruction constraint alone is insufficient to induce the desired factorization between the two encoders. We thus introduce a novel adversarial loss on the pose features that prevents them from being discriminable from one video to another, thus ensuring that they cannot contain content information. A further constraint, motivated by the notion that content information should vary slowly over time, encourages temporally close content vectors to be similar to one another.

More formally, let x_i = (x_i^1, ..., x_i^T) denote a sequence of T images from video i. We subsequently drop index i for brevity. Let E_c denote a neural network that maps an image x^t to the content representation h_c^t, which captures structure shared across time. Let E_p denote a neural network that maps an image x^t to the pose representation h_p^t, capturing content that varies over time. Let D denote a decoder network that maps a content representation from a frame, h_c^t, and a pose representation from a future time step t + k, h_p^{t+k}, to a prediction of the future frame \tilde{x}^{t+k}. Finally, C is the scene discriminator network that takes pairs of pose vectors (h_1, h_2) and outputs a scalar probability that they came from the same video or not.

The loss function used during training has several terms:

Reconstruction loss: We use a standard per-pixel ℓ2 loss between the predicted future frame \tilde{x}^{t+k} and the actual future frame x^{t+k} for some random frame offset k ∈ [0, K]:

L_reconstruction(D) = ||D(h_c^t, h_p^{t+k}) - x^{t+k}||_2^2   (1)

Note that many recent works on video prediction rely on more complex losses that can capture uncertainty, such as GANs [19, 6].

Similarity loss: To ensure the content encoder extracts mostly time-invariant representations, we penalize the squared error between the content features h_c^t, h_c^{t+k} of neighboring frames k ∈ [0, K]:

L_similarity(E_c) = ||E_c(x^t) - E_c(x^{t+k})||_2^2   (2)

Adversarial loss: We now introduce a novel adversarial loss that exploits the fact that the objects present do not typically change within a video, but they do between different videos. Our desired disentanglement would thus have the content features be (roughly) constant within a clip, but distinct between clips. This implies that the pose features should not carry any information about the identity of objects within a clip.

We impose this via an adversarial framework between the scene discriminator network C and the pose encoder E_p, shown in Fig. 1. The latter provides pairs of pose vectors, either computed from the same video (h_{p,i}^t, h_{p,i}^{t+k}) or from different ones (h_{p,i}^t, h_{p,j}^{t+k}), for some other video j. The discriminator then attempts to classify the pair as being from the same/different video using a cross-entropy loss:

-L_adversarial(C) = log(C(E_p(x_i^t), E_p(x_i^{t+k}))) + log(1 - C(E_p(x_i^t), E_p(x_j^{t+k})))   (3)

The other half of the adversarial framework imposes a loss function on the pose encoder E_p that tries to maximize the uncertainty (entropy) of the discriminator output on pairs of frames from the same clip:

-L_adversarial(E_p) = (1/2) log(C(E_p(x_i^t), E_p(x_i^{t+k}))) + (1/2) log(1 - C(E_p(x_i^t), E_p(x_i^{t+k})))   (4)

Thus the pose encoder is encouraged to produce features that the discriminator is unable to classify as coming from the same clip or not. In so doing, the pose features cannot carry information about object content, yielding the desired factorization. Note that this does assume that the object's pose is not distinctive to a particular clip. While adversarial training is also used by GANs, our setup purely considers classification; there is no generator network, for example.

Overall training objective: During training we minimize the sum of the above losses with respect to E_c, E_p, D and C:

L = L_reconstruction(E_c, E_p, D) + α L_similarity(E_c) + β (L_adversarial(E_p) + L_adversarial(C))   (5)

where α and β are hyper-parameters. The first three terms can be jointly optimized, but the discriminator C is updated while the other parts of the model (E_c, E_p, D) are held constant. The overall model is shown in Fig. 1. Details of the training procedure and model architectures for E_c, E_p, D and C are given in Section 4.1.

Figure 1: Left: The discriminator C is trained with binary cross entropy (BCE) loss to predict if a pair of pose vectors comes from the same (top portion) or different (lower portion) scenes. x_i and x_j denote frames from different sequences i and j. 
The frame offset k is sampled uniformly in the range [0, K]. Note that when C is trained, the pose encoder E_p is fixed. Right: The overall model, showing all terms in the loss function. Note that when the pose encoder E_p is updated, the scene discriminator is held fixed.

Figure 2: Generating future frames by recurrently predicting h_p, the latent pose vector.

3.1 Forward Prediction

After training, the pose and content encoders E_p and E_c provide a representation which enables video prediction in a straightforward manner. Given a frame x^t, the encoders produce h_p^t and h_c^t respectively. To generate the next frame, we use these as input to an LSTM model to predict the next pose features \tilde{h}_p^{t+1}. These are then passed (along with the content features) to the decoder, which generates a pixel-space prediction \tilde{x}^{t+1}:

\tilde{h}_p^{t+1} = LSTM(E_p(x^t), h_c^t),   \tilde{x}^{t+1} = D(\tilde{h}_p^{t+1}, h_c^t)   (6)
\tilde{h}_p^{t+2} = LSTM(\tilde{h}_p^{t+1}, h_c^t),   \tilde{x}^{t+2} = D(\tilde{h}_p^{t+2}, h_c^t)   (7)

Note that while pose estimates are generated in a recurrent fashion, the content features h_c^t remain fixed from the last observed real frame. This relies on the nature of L_reconstruction, which ensured that content features can be combined with future pose vectors to give valid reconstructions.

The LSTM is trained separately from the main model using a standard ℓ2 loss between \tilde{h}_p^{t+1} and h_p^{t+1}. Note that this generative model is far simpler than many other recent approaches, e.g. [12]. This is largely due to the forward model being applied within our disentangled representation, rather than directly on raw pixels.

3.2 Classification

Another application of our disentangled representation is to use it for classification tasks. Content features, which are trained to be invariant to local temporal changes, can be used to classify the semantic content of an image. 
Conversely, a sequence of pose features can be used to classify actions in a video sequence. In either case, we train a two-layer classifier network S on top of either h_c or h_p, with its output predicting the class label y.

4 Experiments

We evaluate our model on both synthetic (MNIST, NORB, SUNCG) and real (KTH Actions) video datasets. We explore several tasks with our model: (i) the ability to cleanly factorize into content and pose components; (ii) forward prediction of video frames using the approach from Section 3.1; (iii) using the pose/content features for classification tasks.

4.1 Model details

We explored a variety of convolutional architectures for the content encoder E_c, pose encoder E_p and decoder D. For MNIST, E_c, E_p and D all use a DCGAN architecture [21] with |h_p| = 5 and |h_c| = 128. The encoders consist of 5 convolutional layers with subsampling. Batch normalization and Leaky ReLUs follow each convolutional layer except the final layer, which normalizes the pose/content vectors to have unit norm. 
The decoder is a mirrored version of the encoder with 5 deconvolutional layers and a sigmoid output layer.

For both NORB and SUNCG, D is a DCGAN architecture while E_c and E_p use a ResNet-18 architecture [9] up until the final pooling layer, with |h_p| = 10 and |h_c| = 128.

For KTH, E_p uses a ResNet-18 architecture with |h_p| = 24. E_c uses the same architecture as VGG16 [29] up until the final pooling layer, with |h_c| = 128. The decoder is a mirrored version of the content encoder with pooling layers replaced by spatial up-sampling. In the style of U-Net [25], we add skip connections from the content encoder to the decoder, enabling the model to easily generate static background features.

In all experiments the scene discriminator C is a fully connected neural network with 2 hidden layers of 100 units. We trained all our models with the Adam optimizer [13] and learning rate η = 0.002. We used β = 0.1 for MNIST, NORB and SUNCG, and β = 0.0001 for KTH experiments. We used α = 1 for all datasets.

For future prediction experiments we train a two-layer LSTM with 256 cells using the Adam optimizer. On MNIST, we train the model by observing 5 frames and predicting 10 frames. On KTH, we train the model by observing 10 frames and predicting 10 frames.

4.2 Synthetic datasets

MNIST: We start with a toy dataset consisting of two MNIST digits bouncing around a 64x64 image. Each video sequence consists of a different pair of digits with independent trajectories. Fig. 3 (left) shows how the content vector from one frame and the pose vector from another generate new examples that transfer the content and pose from the original frames. This demonstrates the clean disentanglement produced by our model. Interestingly, for this data we found it to be necessary to use a different color for the two digits. 
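The strength of the adversarial term discussed next comes from the two-sided objective of Eqs. (3)-(4): the discriminator is trained toward hard same/different targets, while the pose encoder is trained toward maximal discriminator uncertainty. A minimal sketch of the two losses, with a scalar probability standing in for the discriminator output (illustrative only, not the paper's code):

```python
import math

# Sketch of the two adversarial losses (Eqs. 3-4). `prob_same_pair` stands in
# for C(E_p(x_i^t), E_p(x^{t+k})), the discriminator's probability that a pose
# pair comes from the same video; real values come from the trained networks.

def discriminator_loss(prob_same_pair, prob_diff_pair):
    # Eq. (3): binary cross entropy with target 1 for same-video pairs
    # and target 0 for different-video pairs (E_p held fixed).
    return -(math.log(prob_same_pair) + math.log(1.0 - prob_diff_pair))

def pose_encoder_loss(prob_same_pair):
    # Eq. (4): push C toward maximal uncertainty (output 1/2) on same-clip
    # pairs (C held fixed), so pose features cannot reveal clip identity.
    return -0.5 * (math.log(prob_same_pair) + math.log(1.0 - prob_same_pair))
```

`pose_encoder_loss` is minimized exactly when the discriminator outputs 1/2, i.e. when the pose features carry no information the discriminator can use to identify the clip.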
Our adversarial term is so aggressive that it prevents the pose vector from capturing any content information; thus, without a color cue the model is unable to determine which pose information to associate with which digit. In Fig. 3 (right) we perform forward modeling using our representation, demonstrating the ability to generate crisp digits 500 time steps into the future.

Figure 3: Left: Demonstration of content/pose factorization on held out MNIST examples. Each image in the grid is generated using the pose and content vectors h_p and h_c taken from the corresponding images in the top row and first column respectively. The model has clearly learned to disentangle content and pose. Right: Each row shows forward modeling up to 500 time steps into the future, given 5 initial frames. For each generation, note that only the pose part of the representation is being predicted from the previous time step (using an LSTM), with the content vector being fixed from the 5th frame. The generations remain crisp despite the long-range nature of the predictions.

Figure 4: Left: Factorization examples using our DRNET model on held out NORB images. Each image in the grid is generated using the pose and content vectors h_p and h_c taken from the corresponding images in the top row and first column respectively. Center: Examples where DRNET was trained without the adversarial loss term. Note how content and pose are no longer factorized cleanly: the pose vector now contains content information which ends up dominating the generation. Right: factorization examples from Mathieu et al. [19].

Figure 5: Left: Examples of linear interpolation in pose space between the examples x_1 and x_2. Right: Factorization examples on held out images from the SUNCG dataset. Each image in the grid is generated using the pose and content vectors h_p and h_c taken from the corresponding images in the top row and first column respectively. Note how, even for complex objects, the model is able to rotate them accurately.

NORB: We apply our model to the NORB dataset [16], converted into videos by taking sequences of different azimuths while holding object identity, lighting and elevation constant. Fig. 4 (left) shows that our model is able to factor content and pose cleanly on held out data. In Fig. 4 (center) we train a version of our model without the adversarial loss term, which results in a significant degradation: the pose vectors are no longer isolated from content. For comparison, we also show the factorizations produced by Mathieu et al. [19], which are less clean, both in terms of disentanglement and generation quality, than our approach. Table 1 shows classification results on NORB, following the training of a classifier on pose features and also content features. When the adversarial term is used (β = 0.1) the content features perform well. Without the term, content features become less effective for classification.

SUNCG: We use the rendering engine from the SUNCG dataset [30] to generate sequences where the camera rotates around a range of 3D chair models. The dataset consists of 324 different chair models of varying size, shape and color. DRNET learns a clean factorization of content and pose and is able to generate high quality examples of this dataset, as shown in Fig. 5 (right).

4.3 KTH Action Dataset

Finally, we apply DRNET to the KTH dataset [28]. 
This is a simple dataset of real-world videos of people performing one of six actions (walking, jogging, running, boxing, hand-waving, hand-clapping) against fairly uniform backgrounds. In Fig. 6 we show forward generations of different held out examples, comparing against two baselines: (i) the MCNet of Villegas et al. [33], which, to the best of our knowledge, produces the current best quality generations on real-world video; and (ii) a baseline auto-encoder LSTM model (AE-LSTM). This is essentially the same as ours, but with a single encoder whose features thus combine content and pose (as opposed to factoring them in DRNET). It is also similar to [31].

Fig. 7 shows more examples, with generations out to 100 time steps. For most actions this is sufficient time for the person to have left the frame, thus further generations would be of a fixed background.

In Fig. 9 we attempt to quantify the fidelity of the generations by comparing our approach to MCNet [33] using a metric derived from the Inception score [26]. The Inception score is used for assessing generations from GANs and is more appropriate for our scenario than traditional metrics such as PSNR or SSIM (see appendix B for further discussion). The curves show the mean scores of our generations decaying more gracefully than those of MCNet [33]. Further examples and generated movies may be viewed in appendix A and also at https://sites.google.com/view/drnet-paper//.

A natural concern with high capacity models is that they might be memorizing the training examples. We probe this in Fig. 10, where we show the nearest neighbors to our generated frames from the training set. Fig. 8 uses the pose representation produced by DRNET to train an action classifier from very few examples. We extract pose vectors from video sequences of length 24 and train a fully connected classifier on these vectors to predict the action class. 
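A minimal sketch of this kind of pose-based action classification: pool the per-frame pose vectors over the clip, then classify the pooled vector. A nearest-centroid rule stands in here for the fully connected classifier; the names and data are illustrative, not the paper's setup.

```python
# Sketch of action classification from a sequence of pose vectors: average the
# per-frame pose vectors over the clip, then classify the pooled vector.
# A nearest-centroid classifier stands in for the paper's fully connected
# network (illustrative only).

def pool_poses(pose_seq):
    # average each pose dimension across the clip's frames
    n = len(pose_seq)
    return [sum(dims) / n for dims in zip(*pose_seq)]

def nearest_centroid(pooled, centroids):
    # centroids: dict mapping action label -> pooled pose vector
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(pooled, centroids[label]))
```

Because the pose features carry no content information, a classifier like this sees only motion, which is why few labeled clips suffice.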
We compare against an autoencoder baseline, which is the same as ours but with a single encoder whose features thus combine content and pose. We find the factorization significantly boosts performance.

Figure 6: Qualitative comparison between our DRNET model, MCNet [33] and the AE-LSTM baseline. All models are conditioned on the first 10 video frames and generate 20 frames. We display predictions of every 3rd frame. Video sequences are taken from held out examples of the KTH dataset for the classes of walking (top) and running (bottom).

Figure 7: Four additional examples of generations on held out examples of the KTH dataset, rolled out to 100 timesteps.

Table 1: Classification results on the NORB dataset, with/without adversarial loss (β = 0.1 / β = 0), using content or pose representations (h_c, h_p respectively). The adversarial term is crucial for forcing semantic information into the content vectors; without it, performance drops significantly.

Model                 Features   Accuracy (%)
DRNET β=0.1           h_c        93.3
DRNET β=0.1           h_p        60.9
DRNET β=0             h_c        72.6
DRNET β=0             h_p        80.8
Mathieu et al. [19]   -          86.5

Figure 8: Classification of KTH actions from pose vectors with few labeled examples, with autoencoder baseline. N.B. SOA (fully supervised) is 93.9% [15].

Figure 9: Comparison of KTH video generation quality using Inception score. X-axis indicates how far from the conditioned input the start of the generated sequence is.

Figure 10: For each frame generated by DRNET (top row in each set), we show nearest-neighbor images from the training set, based on pose vectors (middle row) and both content and pose vectors (bottom row). 
It is evident that our model is not simply copying examples from the training data. Furthermore, the middle row shows that the pose vector generalizes well, and is independent of background and clothing.

5 Discussion

In this paper we introduced a model based on a pair of encoders that factor video into content and pose. This separation is achieved during training through a novel adversarial loss term. The resulting representation is versatile, in particular allowing for stable and coherent long-range prediction through nothing more than a standard LSTM. Our generations compare favorably with leading approaches, despite being a simple model, e.g. lacking the GAN losses or probabilistic formulations of other video generation approaches. Source code is available at https://github.com/edenton/drnet.

Acknowledgments

We thank Rob Fergus, Will Whitney and Jordan Ash for helpful comments and advice. Emily Denton is grateful for the support of a Google Fellowship.

References

[1] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. arXiv preprint arXiv:1606.07419, 2016.
[2] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed. Recurrent environment simulators. In ICLR, 2017.
[3] F. Cricri, M. Honkala, X. Ni, E. Aksu, and M. Gabbouj. Video ladder networks. arXiv preprint arXiv:1612.01756, 2016.
[4] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In CVPR, pages 1422–1430, 2015.
[5] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In arXiv 1605.07157, 2016.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[7] W. Grathwohl and A. Wilson. Disentangling space and time in video with hierarchical variational auto-encoders. arXiv preprint arXiv:1612.04440, 2016.
[8] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision, 2000.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[10] G. E. Hinton, A. Krizhevsky, and S. Wang. Transforming auto-encoders. In ICANN, 2011.
[11] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[12] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In arXiv:1610.00527, 2016.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[14] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
[15] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[16] Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
[17] C. Liu. 
Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology, 2009.
[18] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv 1511.05440, 2015.
[19] M. Mathieu, J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun. Disentangling factors of variation in deep representations using adversarial training. In Advances in Neural Information Processing Systems 29, 2016.
[20] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.
[21] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In The International Conference on Learning Representations, 2016.
[22] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv 1412.6604, 2014.
[23] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder network. In Advances in Neural Information Processing Systems 28, 2015.
[24] S. Reed, Z. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In NIPS, 2015.
[25] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer International Publishing, 2015.
[26] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv 1606.03498, 2016.
[27] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[28] C. Schuldt, I. Laptev, and B. Caputo. 
Recognizing human actions: A local SVM approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In The International Conference on Learning Representations, 2015.
[30] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[31] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[32] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[33] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
[34] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In arXiv 1609.02612, 2016.
[35] J. Walker, A. Gupta, and M. Hebert. Dense optical flow prediction from a static image. In ICCV, 2015.
[36] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In CVPR, pages 2794–2802, 2015.
[37] W. F. Whitney, M. Chang, T. Kulkarni, and J. B. Tenenbaum. Understanding visual concepts with continuation learning. arXiv preprint arXiv:1502.04623, 2016.
[38] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariance. Neural Computation, 14(4):715–770, 2002.
[39] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
[40] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where auto-encoders. 
In International Conference on Learning Representations, 2016.
", "award": [], "sourceid": 2305, "authors": [{"given_name": "Emily", "family_name": "Denton", "institution": "New York University"}, {"given_name": "Vighnesh", "family_name": "Birodkar", "institution": "New York University"}]}