{"title": "Video Prediction via Selective Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 1705, "page_last": 1715, "abstract": "Most adversarial learning based video prediction methods suffer from image blur, since the commonly used adversarial and regression loss pair work rather in a competitive way than collaboration, yielding compromised blur effect. \n In the meantime, as often relying on a single-pass architecture, the predictor is inadequate to explicitly capture the forthcoming uncertainty.\n Our work involves two key insights:\n (1) Video prediction can be approached as a stochastic process: we sample a collection of proposals conforming to possible frame distribution at following time stamp, and one can select the final prediction from it.\n (2) De-coupling combined loss functions into dedicatedly designed sub-networks encourages them to work in a collaborative way.\n Combining above two insights we propose a two-stage network called VPSS (\\textbf{V}ideo \\textbf{P}rediction via \\textbf{S}elective \\textbf{S}ampling). \n Specifically a \\emph{Sampling} module produces a collection of high quality proposals, facilitated by a multiple choice adversarial learning scheme, yielding diverse frame proposal set. \n Subsequently a \\emph{Selection} module selects high possibility candidates from proposals and combines them to produce final prediction. 
\n Extensive experiments on diverse challenging datasets \n demonstrate the effectiveness of proposed video prediction approach, i.e., yielding more diverse proposals and accurate prediction results.", "full_text": "Video Prediction via Selective Sampling\n\nJingwei Xu, Bingbing Ni\u2217, Xiaokang Yang\nMoE Key Lab of Arti\ufb01cial Intelligence, AI Institute\n\nSJTU-UCLA Joint Research Center on Machine Perception and Inference,\n\nShanghai Jiao Tong University, Shanghai 200240, China\n\nShanghai Institute for Advanced Communication and Data Science\n\nxjwxjw,nibingbing,xkyang@sjtu.edu.cn\n\nAbstract\n\nMost adversarial learning based video prediction methods suffer from image blur,\nsince the commonly used adversarial and regression loss pair work rather in a\ncompetitive way than collaboration, yielding compromised blur effect. In the\nmeantime, as often relying on a single-pass architecture, the predictor is inadequate\nto explicitly capture the forthcoming uncertainty. Our work involves two key\ninsights: (1) Video prediction can be approached as a stochastic process: we sample\na collection of proposals conforming to possible frame distribution at following time\nstamp, and one can select the \ufb01nal prediction from it. (2) De-coupling combined\nloss functions into dedicatedly designed sub-networks encourages them to work\nin a collaborative way. Combining above two insights we propose a two-stage\nframework called VPSS (Video Prediction via Selective Sampling). Speci\ufb01cally a\nSampling module produces a collection of high quality proposals, facilitated by a\nmultiple choice adversarial learning scheme, yielding diverse frame proposal set.\nSubsequently a Selection module selects high possibility candidates from proposals\nand combines them to produce \ufb01nal prediction. 
Extensive experiments on diverse challenging datasets demonstrate the effectiveness of the proposed video prediction approach, i.e., it yields more diverse proposals and more accurate prediction results.

1 Introduction

Video prediction has been receiving increasing research attention in computer vision [3, 29, 10, 8, 19] and has great potential in applications such as future decision making, robot manipulation and autonomous driving [20]. Previous methods [10, 19] trained with a pixel-wise regression loss tend to produce blurry results because they seek the average over possible futures [29]. To enhance generation quality, some works [8, 34] utilize adversarial learning [13] to facilitate the video prediction task, i.e., they add an adversarial loss [13] to the prediction module.

However, the paired regression and adversarial losses still cannot solve the image blur and motion deformation problems in principle. A generator often struggles to balance the adversarial [13] and regression losses during training [18, 2, 5], and thus most likely yields an averaged result. In the worst case, either the adversarial [13] or the regression loss dominates, forcing the other term to fail to play its role. In other words, the two loss functions work in a competitive rather than collaborative way. This leads us to ask: (1) To address the blur issue, is it possible to sample a collection of high-quality proposals conforming to the possible frame distribution at the following time stamp, and select the final prediction from it?
(2) To encourage collaboration between loss functions, is it possible to design dedicated sub-networks for the adversarial and regression losses, respectively?

*Corresponding Author

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Table 1: Competition between the adversarial and regression losses in terms of λ·L_Adv + L_Reg.

Model                        MCNet [34]            DrNet [8]              SAVP [25]
λ                            0.02 / 0.2 / 0.5      1e-4 / 1e-3 / 4e-3     0.01 / 0.1 / 0.3
L_Reg after 50K iterations   0.04 / 0.05 / 0.07    0.07 / 0.09 / 0.12     0.02 / 0.03 / 0.05

Motivated by these issues, we propose a two-stage framework called VPSS, which uses different modules to handle the two losses separately, thereby encouraging collaboration between them instead of competition. As shown in Figure 1, the Sampling module produces multiple high-quality video frame proposals by making use of a multiple-choice adversarial learning scheme, yielding a diverse video prediction set. This module is trained in an adversarial manner [5]. The Selection module selects high-possibility candidates from the proposals and combines them to produce the final prediction, according to the criterion of better position matching. By contrast, the Selection module is trained with a regression loss. We conduct both qualitative and quantitative experiments on diverse datasets, ranging from moving digits and human motion to robotic arm manipulation, including a visualization experiment. Our results clearly indicate higher visual quality and more precise motion prediction, even for complex motion patterns and long-term prediction, significantly outperforming prior arts. Experiments show that the two modules cooperate well with each other under the VPSS framework.

2 Related work

Video Prediction. Much previous work has been done on the video prediction task. Srivastava et al. [33] first propose to use an LSTM [16] to predict future frames. Shi et al.
[32] propose the ConvLSTM architecture, dedicated to preserving spatial information, which is a big step forward on this task. Different from ConvLSTM, DFN [19] generates convolution kernels according to the inputs, which shows more flexibility in modelling motion variation. CDNA [10] combines the advantages of ConvLSTM [32] and DFN [19] to further exploit spatial information. However, all of the above methods suffer from image blur and structure deformation because of the regression loss. To enhance image quality, DrNet [8] and MCNet [34] utilize adversarial training to generate sharper frames, yet find it hard to balance the adversarial and regression losses. Some works [35, 20, 31] further push prediction models toward long-term prediction. More recently, stochastic prediction methods [7, 3, 25] use a learned prior or random noise to achieve prediction in a stochastic way; they generate visually appealing predictions but with stochastic motion (i.e., location variation and shape transformation are random). Note that, different from these stochastic prediction methods, our model first generates multiple proposals and then precisely infers the final outcome from them, which yields high-precision motion prediction.

Competition between Adversarial and Regression Losses. This issue has been widely discussed for image generation and translation tasks [18, 36]. In the image domain it is more commonly treated as the competition between generation diversity (i.e., related to the GAN [13] loss) and visual quality (i.e., related to the regression loss). VAE-GAN [24] first proposes to combine VAE and GAN [13] by replacing element-wise errors with feature-wise errors to better capture the data distribution. Some works try to solve this problem from an information-entropy view, such as InfoGAN [6], ALI [9] and ALICE [26], which mainly try to find a deterministic mapping between the image domain and the latent code domain.
Based on VAE-GAN [24], BEGAN [4] proposes a new equilibrium-enforcing method to pursue fast, stable training and high visual quality. Different from the image generation task, video prediction further requires precise motion prediction along the time axis. Accordingly, the competition in the video domain lies between visual quality and motion prediction precision, yet few methods have been proposed to address this issue.

3 Method

Previous models [34, 8, 25] try to balance the adversarial [13] loss (denoted L_Adv) and the regression loss (denoted L_Reg) with a hyper-parameter λ, i.e., λ·L_Adv + L_Reg. To demonstrate the competition intuitively, we run three open-source codes with their default training setups², altering only λ.

²MCNet: https://github.com/rubenvillegas/iclr2017mcnet; DrNet: https://github.com/edenton/drnet; SAVP: https://github.com/alexlee-gk/video_prediction

Figure 1: The general framework of VPSS. At the Sampling stage, our framework contains a conditional sampling module which produces N high-quality proposals at time stamp t+1 based on inputs at time stamp t. At the Selection stage, we propose a selection module for the final prediction. This module chooses K candidates from the high-quality proposals and fuses them to produce the final prediction.

As shown in Table 1, as λ is progressively enlarged, the converged value of L_Reg keeps increasing. This indicates that it is impractical to force the prediction module to satisfy both sides, which directly motivates us to explore this task from a different view. Specifically, we propose a novel framework called VPSS with two modules, sampling and selection. In this section we discuss this framework and the detailed architecture of both modules.
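The compromise induced by a combined objective can be illustrated with a toy closed-form example (our own sketch, not the paper's actual losses): for a scalar prediction x pulled toward a regression target a and an adversarial "realism" target b, the minimizer of λ(x−b)² + (x−a)² is a weighted average that drifts away from a as λ grows, mirroring the rising converged L_Reg values in Table 1.

```python
# Toy sketch (not the paper's actual losses): a scalar prediction x is pulled
# toward a regression target a and an adversarial "realism" target b.
def combined_optimum(a: float, b: float, lam: float) -> float:
    """Closed-form minimizer of lam*(x - b)**2 + (x - a)**2."""
    return (a + lam * b) / (1.0 + lam)

# As lam grows, the optimum drifts from the regression target a toward b;
# every finite lam yields a compromise between the two targets.
optima = [combined_optimum(0.0, 1.0, lam) for lam in (0.0, 0.5, 2.0)]
```

Any single predictor trained on the summed objective lands somewhere between the two targets; the decoupled design below avoids this by never asking one network to satisfy both.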
Meanwhile, we demonstrate how these two modules cooperate effectively with each other.

3.1 Sampling Module

Previous methods [25, 3] aim to model future uncertainty with temporally correlated noise, which is sampled from a prior distribution and then passed through a recurrent neural network (e.g., GRU, LSTM). However, random-noise-based stochasticity, which lacks correlation with the inputs, is insufficient to model the upcoming uncertainty explicitly. Inspired by previous work [30, 5], we show that video prediction can be approached as a stochastic process: we gather a random collection of high-quality proposals in one shot, with a multiple-choice adversarial learning scheme that encourages diversity within the collection.

More formally, consider single-step forward prediction, where we use the current input (denoted x_t ∈ R^{L×H×C}) to predict the frame at the next time stamp (denoted x_{t+1} ∈ R^{L×H×C}). L, H, C stand for image width, height and channels, respectively. Recall that x_t is first fed into a sampler φ_spl to produce a collection of proposals (denoted X̂_{t+1} = {x̂^1_{t+1}, ..., x̂^N_{t+1}}), where N is the desired number of proposals. For a single prediction the number of final output channels of φ_spl is C, so for N proposals we directly enlarge it to C × N, where each consecutive C-tuple of channels forms a proposal, and we denote the corresponding kernels as W = {W^i}_{i=1}^N.

To guarantee the visual quality of the proposals, we utilize adversarial learning to facilitate the sampling procedure:

L_spl(S, D) = Σ_{t=1}^{T−1} ( E_X[log D(x_{t+1})] + (1/N) E_X[log(1 − D(φ_spl(x_t, W)))] ),   (1)

where D is a discriminator capable of distinguishing sampled images from real images. Naturally, we hope these proposals are diverse enough for further selection.
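The C × N channel trick can be sketched in NumPy as follows (toy sizes of our own choosing; the actual sampler is a convolutional network producing 64×64 RGB frames with N = 5):

```python
import numpy as np

# Toy sizes for illustration; the paper uses 64x64 RGB frames and N = 5.
H, W, C, N = 8, 8, 3, 5

rng = np.random.default_rng(0)
raw = rng.random((H, W, C * N))  # sampler output with C*N channels

# Each consecutive C-tuple of channels forms one proposal x_hat^i.
proposals = [raw[..., i * C:(i + 1) * C] for i in range(N)]
```

Slicing consecutive channel groups lets a single forward pass emit all N proposals at once, rather than running the sampler N times.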
However, using only the same adversarial loss forces all proposals to be identical, which departs from the requirement of diversity. To this end, we propose a kernel generator φ_KGen to achieve this goal.

At time stamp t, the kernel generator takes the previous x_t and x_{t−1} as input, and the output of φ_KGen is ΔW = {ΔW^i}_{i=1}^N. As shown in Figure 1 (black dashed lines), there is a one-to-one correspondence between ΔW^i and x̂^i_{t+1}. We denote the m-th input channel and n-th output channel of ΔW as ΔW(m, n). Similarly, x(n) represents the n-th channel of x, and ΔW(·, n) represents the n-th output channel along with all input channels of ΔW.

Figure 2: Detailed architecture of the Sampling module. The left part is the proposed kernel generator, which produces multiple kernels; the right part is the sampler, which samples a collection of high-quality proposals.

As shown in Figure 2, the kernel generation procedure first performs channel-wise subtraction to obtain temporal variation information, and then encodes it into the current convolution kernel with a CNN φ_W. Meanwhile, we perform a channel-wise softmax [28] along the input channels:

ΔŴ(m, n) = φ_W(x_t(m), x_{t−1}(n)),  m, n = 1, ..., C,   (2)

ΔW(·, n) = Softmax(ΔŴ(·, n)),  n = 1, ..., C,   (3)

where C is the number of channels. The sampling procedure at time stamp t is executed as follows:

ΔW = φ_KGen(x_t, x_{t−1}),  x̂^i_{t+1} = φ_spl(x_t; W^i + ΔW^i),  i = 1, ..., N.   (4)

Through the kernel generator we effectively transform the diversity problem into one of kernel diversity, i.e., diversity of {ΔW^i}_{i=1}^N. Inspired by Guzman-Rivera et al.
[14], we develop a multiple-choice adversarial learning scheme as follows:

L_KGen = Σ_{t=2}^{T−1} E_{X̂_{t+1}}[log D(x̂^k_{t+1})],  k = argmin_i ||x_{t+1} − x̂^i_{t+1}||_1.   (5)

This loss function is a critical component of the sampling module, and the diversity essentially results from the argmin operation in Equation 5. To be specific, at each time stamp we only update the kernel corresponding to the best sampled image, so the N kernels ({ΔW^i}_{i=1}^N) are optimized independently toward different directions throughout training. This asynchronous updating scheme encourages the kernel generator to spread its bets and cover the exploration space of predictions that conform to all possible frame distributions. Intuitively, the kernel generator φ_KGen is pushed to capture motion information from previous inputs and infer the motion direction at the next time stamp without the constraint of a regression loss, while image quality is guaranteed by the adversarial loss (Equation 1). More importantly, different from stochastic prediction methods [3, 7, 25] whose diversity is achieved by random noise, the diversity of our model derives entirely from previous inputs, which is more rational in principle and inherently useful for further selection.

3.2 Selection Module

Given the sampled proposals X̂_{t+1}, the selection module first selects the best K candidates from them under a measure of motion precision. For example, the motion direction usually changes smoothly between frames, which can be treated as a selection criterion. However, it is impractical to hand-craft several such criteria for candidate sifting. Meanwhile, under the problem setting of video prediction, the ground truth at time stamp t+1 is unknown at time stamp t.
Directly comparing with the ground truth is therefore not an option. To tackle this problem, we formulate it as ranking all proposals based on previous inputs. To be specific, for each proposal we use a recurrent neural network to regress a confidence score obtained in a self-supervised manner, and rank the proposals by the regressed values. The top-K candidates are fed into a combiner for further processing.

For clarity, we denote the ground-truth score of the i-th candidate as γ^i, with Γ = {γ^i}_{i=1}^N; the corresponding regressed score is denoted γ̂^i, with Γ̂ = {γ̂^i}_{i=1}^N.

Figure 3: Detailed architecture of the Selection module. The left part is the proposed selector, which selects K candidates from the N proposals; the right part is the combiner, which combines the K candidates into the final prediction.

The most direct way to generate the ground-truth score is to compute the pixel-wise error between a proposal and the ground truth: γ^i = ||x_{t+1} − x̂^i_{t+1}||_1. This works well when the motion pattern is simple (e.g., translation in some direction) and the objects are rigid bodies.
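A minimal sketch of this pixel-wise scoring, and of the winner index k used in Equation 5, on toy data of our own making (the per-pixel mean is used here only to keep the numbers small):

```python
import numpy as np

def pixel_l1_scores(gt, proposals):
    """gamma_i: mean absolute error between the ground truth and each
    proposal (a per-pixel average of the L1 criterion; lower is better)."""
    return np.array([np.abs(gt - p).mean() for p in proposals])

rng = np.random.default_rng(0)
gt = rng.random((8, 8, 3))                                # stand-in for x_{t+1}
proposals = [gt + 0.1 * i * rng.random(gt.shape) for i in range(5)]

scores = pixel_l1_scores(gt, proposals)
winner = int(np.argmin(scores))   # index k: proposal closest to ground truth
top_k = np.argsort(scores)[:2]    # K = 2 candidates passed on to the combiner
```

At training time the same scores serve two roles: the argmin picks which kernel receives the adversarial gradient (sampling stage), and the full ranking supervises the selector (selection stage).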
When encountering real-world video sequences (e.g., human or robot motion with occlusion), however, this measure severely suffers from motion uncertainty and complex structure deformation. To better capture motion and content information, we utilize a pre-trained discriminative model to extract multi-level features from the video sequences and compute the confidence score at the feature level, as is widely done in image translation tasks [18]:

γ^i = Σ_{l=1}^{L} ||ϕ_l(x_{t+1}) − ϕ_l(x̂^i_{t+1})||_1,  i = 1, ..., N,   (6)

where ϕ is the discriminative model (e.g., a detection or pose estimation network) and l stands for its l-th layer. As mentioned above, Γ is treated as the target value of the confidence score, which we predict from previous inputs as follows:

γ̂^i = ψ_slt(x̂^i_{t+1} | x_t, x_{t−1}),  i = 1, ..., N,   (7)

where ψ_slt is essentially a 3-step ConvLSTM network [32] (left-most part of Figure 3) that successively takes x_{t−1}, x_t, x̂^i_{t+1} as input. We train ψ_slt with a regression loss: L_slt = ||Γ̂ − Γ||_1. At time stamp t, we select the top-K candidates X̃ = {x̂^{i_j}, i_j ∈ T(Γ̂)}_{j=1}^K for the final prediction, where T(Γ̂) is the index set of the top-K candidates in Γ̂.

For the final step, we utilize a combiner ψ_comb to compose the candidates X̃ into the final prediction. Correspondingly, ψ_comb is required to fully capture the information contained in X̃. Inspired by recent work on video analogy making [31], we propose a K-stream auto-encoder to capture feature information.
As shown in Figure 3, each candidate is fed into an encoder to extract multi-level information, which is then passed to a decoder to generate K masks M = {m^j}_{j=1}^K that compose the K candidates into the final prediction:

x̃_{t+1} = ψ_comb(X̃) = Conv( Σ_{j=1}^{K} m^j ⊙ x̂^{i_j}_{t+1} ),   (8)

where ⊙ denotes the Hadamard product. As shown in Figure 3, we also use skip connections [18] between the encoders and the decoder to enhance feature sharing, and we keep the spatial resolution of the features equal to that of the images throughout ψ_comb, which proves more efficient at preserving spatial information [21]. We use a regular regression loss to train ψ_comb: L_comb = Σ_{t=2}^{T} ||x_t − x̃_t||_1.

3.3 Design Considerations and Implementation Details

Design Considerations. Notably, our framework contains two modules and each involves two sub-networks. Without a carefully designed training procedure, it easily gets stuck in a bad local minimum with meaningless outputs. We therefore provide several practical considerations dedicated to training stability. First, the sampler φ_spl and selector ψ_slt act as models with strong prior knowledge, and pre-training them stabilizes the subsequent training procedure by a large margin. Second, we utilize curriculum learning [22] along the temporal direction, which effectively flattens the training curve. Last but not least, we decouple the combined loss function into the corresponding sub-networks, i.e., we update φ_KGen, φ_spl, ψ_slt, ψ_comb according to L_KGen, L_spl, L_slt, L_comb, respectively.
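The mask-based composition in Equation 8 can be sketched in NumPy as follows (toy shapes of our own choosing; the final Conv and the learned mask decoder are omitted, with random arrays standing in for both):

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C, K = 8, 8, 3, 2

candidates = [rng.random((H, W, C)) for _ in range(K)]  # top-K candidates
masks = rng.random((K, H, W, 1))                        # decoder outputs m^j

# Hadamard-product composition of Equation 8; each per-pixel mask weights
# one candidate, and the weighted candidates are summed. (The trailing Conv
# of Equation 8 is left out of this sketch.)
composed = sum(m * c for m, c in zip(masks, candidates))
```

The (H, W, 1) mask broadcasts over the C channels, so each spatial location of the output can borrow from a different candidate.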
Table 2: Detailed evaluation setup with different datasets and models.

Datasets            Comparison models        Inputs and outputs         Video resolution
MovingMnist [33]    DFN [19] & SVG [7]       10 inputs for 10 outputs   64x64
RobotPush [10]      CDNA [10] & SV2P [3]     2 inputs for 8 outputs     64x64
Human3.6M [17]      DrNet [8] & MCNet [34]   5 inputs for 10 outputs    64x64

Table 3: Qualitative experiments in terms of reality and similarity assessment.

Model setup            RobotPush [10]        MovingMnist [33]      Human3.6M [17]
                       CDNA/SV2P/Ours        DFN/SVG/Ours          DrNet/MCNet/Ours
Reality                25.4%/43.1%/45.2%     22.5%/29.2%/35.9%     10.8%/27.7%/23.1%
Image Quality          28.2%/33.8%/38.0%     23.5%/30.3%/46.2%     20.0%/39.2%/40.8%
Prediction Accuracy    38.1%/26.6%/35.3%     21.9%/36.8%/41.4%     33.4%/26.0%/40.7%

On one hand, this design scheme prevents competition between the adversarial and regression losses in principle, because the gradients back-propagated from them optimize different networks, which gets
On the other hand, sub-networks are designed\ndedicated for different objectives, e.g., the sampler is required to produce high quality proposals\nwithout requirement of motion accuracy, while the selector should be able to fully capture the motion\ninformation of previous inputs and select out proposals with high motion accuracy. Note that these\nobjectives are complementary with each other, which essentially encourages corporation between\ndifferent sub-networks.\nImplementation Details. We implement the proposed framework with Tensor\ufb02ow library [1]. The\nsampler consists of 3 down-sampling Res-Blocks [15] and 3 up-sampling Res-Blocks, where sampling\noperation is bilinear interpolation with stride 2. Each block is followed by ReLU [12] activation. The\nkernel generator consists of 5 convolution layers with leaky-ReLU [27] (Leaky rate 0.1) and \ufb01nal\nlayer with sigmoid. The selector is a 3-layer ConvLSTM [32] with stride 4. Finally the combiner\nconsists of two parts, i.e., the encoder and decoder. Encoder part is a 4-layer convolution network with\ndown-sampling of stride 2. The decoder is a mirrored version of the encoder with 4 deconvolution\nlayers and a sigmoid output layer. The model \u03d5 is identical to that used in style transfer [11]. As\nmentioned above, we pre-train \u03c6spl and \u03c8slt for 2 epochs. The curriculum learning is essentially\nincreasing the prediction length by 1 every epoch with initial length of 1. In all experiments we train\nall our models with the ADAM optimizer [23] and learning rate \u03b7 = 0.0001, \u03b21 = 0.5, \u03b22 = 0.999.\nIn all experiments we set N = 5 and K = 2. This selection will be further discussed in Section 4.4.\n4 Experiments\n\n4.1 Datasets and Evaluation Setup\n\nDatasets. We evaluate our framework (VPSS) on three diverse datasets: MovingMnist [33], Robot-\nPush [10] and Human3.6M [17], which represent challenges in different aspects. 
MovingMnist [33] contains one bouncing digit; it is treated as a toy example and is suitable for demonstrating the superiority of our framework. RobotPush [10] involves complex robotic motion and has been widely used for video prediction. Human3.6M [17] captures single-human motion, whose challenge lies in motion stochasticity. Notably, in Human3.6M [17] the human subject occupies only a small portion of each frame, so their motion can easily be ignored when training with the regression loss alone.

Evaluation Setup. A long-standing issue in the video prediction and generation tasks is how to ensure fair comparison with other models [25]. We avoid unfair comparison by taking the following two measures: (1) we compare our model only with models whose source code is accessible, and (2) we exactly follow their experimental setups and do not change their training procedures. For the detailed evaluation setup, please refer to Table 2.

4.2 Quantitative Evaluation

To quantitatively evaluate the proposed framework, we compute PSNR and SSIM values for DFN [19], SVG [7] and our model on the MovingMnist dataset [33]. Due to the stochastic predictions of SVG [7], its curves are averaged over 20 samples. Note that although we predict 10 frames during training, we predict 20 frames at test time to validate generalization ability.

Figure 4: Quantitative comparison with different prediction models in terms of PSNR (left), SSIM (middle) and Inception Score (right).

Table 4: Evaluation under PSNR and Inception Score with different combinations of N and K.

(N, K)                     (3, 1)          (4, 1)          (3, 2)          (4, 2)          (5, 2)
(PSNR, Inception Score)    (24.43, 1.85)   (24.96, 1.90)   (25.07, 1.95)   (25.21, 1.98)   (25.13, 2.01)

As illustrated in Figure 4 (left and middle), our model outperforms the other two by a large margin, especially at later time stamps. DFN [19] is trained only under the regression loss, without modelling future uncertainty.
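For reference, the PSNR metric reported in Figure 4 can be computed from the mean squared error as follows (a standard definition, written here for images scaled to [0, 1]; this is our own helper, not the paper's evaluation code):

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mse = np.mean((x - y) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A prediction uniformly 0.5 away from an all-zero target has mse = 0.25,
# giving 10 * log10(1 / 0.25) ≈ 6.02 dB.
example = psnr(np.zeros((4, 4)), np.full((4, 4), 0.5))
```

Higher is better; blurry averaged predictions keep MSE moderate everywhere, which is why PSNR alone can miss the blur problem discussed below.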
Its prediction results degenerate rapidly into blurry frames. SVG [7] models the future with random noise sampled from a learned prior, which effectively prevents blur. However, the correlation between the random noise and the current inputs is quite weak, which is insufficient to predict the future precisely; it is more like a "random guess". Predictions of SVG [7] therefore commonly have high image quality but low prediction precision. By contrast, within the proposed framework both objectives degrade far more gracefully than in SVG [7] and DFN [19]. Our framework first samples high-quality proposals and then combines them into the final prediction, which effectively tackles both problems. This clearly shows that the two-stage framework with dedicatedly designed sub-networks successfully unifies the adversarial and regression losses within the prediction system.

Recently, several works [29, 8, 25] have argued that PSNR and SSIM are not convincing enough to guarantee the quality of video prediction. To this end, we compute the Inception Score for CDNA [10], SV2P [3] and our model on the RobotPush dataset [10]. As shown in Figure 4 (right), compared to CDNA [10] and SV2P [3], our model keeps relatively higher scores throughout the prediction procedure, which mainly benefits from the high-quality proposals of the sampling stage. The results of CDNA [10] and SV2P [3] clearly demonstrate that balancing the loss functions cannot satisfy all requirements.

4.3 Qualitative Evaluation

In the video prediction task, the most convincing way to demonstrate a model's effectiveness is to directly visualize predicted results. We present several samples predicting up to 20 time stamps for all three datasets in Figure 5. As mentioned above, image blur and prediction accuracy are the two key issues we care about. One can clearly observe that DFN [19] and CDNA [10] produce blurry results because of the regression loss.
Meanwhile, for SVG [7] and SV2P [3], image quality is much better, but compared to the ground truth, prediction accuracy is not satisfying (e.g., location prediction on MovingMnist [33]). MCNet [34] and DrNet [8] try to enhance image quality with adversarial learning, but the conflict between the adversarial and regression losses prevents them from meeting both requirements concurrently. By contrast, the proposed two-stage framework achieves both high image quality and precise motion prediction. We further consider that still images are insufficient to fully convey the information contained in video sequences, so we strongly recommend readers to refer to the video results in the supplemental material.

Meanwhile, we provide a subjective experiment for further validation. To be specific, we collect 1270 prediction results for each dataset and ask 40 people to provide subjective assessments of them. The experiment covers three aspects: (1) Taking real video samples as the baseline, which result is more realistic (not conditioned on previous inputs)? (2) Considering image quality and previous inputs, which result is more similar to the ground truth? (3) Considering motion accuracy and previous inputs, which result is more similar to the ground truth? The results are shown in Table 3. In the reality experiment, we observe that results produced by stochastic prediction (SVG [7]) and adversarial learning (ours and MCNet [34]) related methods seem more realistic to humans, which demonstrates

Figure 5: Prediction results on the MovingMnist (A) [33], RobotPush (C) [10] and Human3.6M (D) [17] datasets at time stamps 4, 8, 12, 16, 20.
Note that sub-figure (B) demonstrates the proposals (second row) and candidates (third row) during a complete procedure of predicting a moving digit 0. We strongly recommend readers to refer to more examples in the supplemental material.

their effectiveness in enhancing image quality. A similar effect can be observed even when conditioning on previous inputs (image quality experiment). For the motion accuracy experiment, one can notice a considerable preference drop for SVG [7], MCNet [34] and DrNet [8] compared to the image quality experiment. Notably, our framework achieves most of the highest scores across the three experiments, which mainly benefits from the proposed selective sampling framework.

4.4 Discussion

The previous experiments demonstrate the superiority of the proposed framework. In this section we present further analysis. We first discuss the selection of N and K, then delve deeper into the execution procedure of our framework.

Selection of N and K. As shown in Table 4, we evaluate the choice of N and K in terms of PSNR and Inception Score [10] on the Human3.6M dataset [17]. One can notice that as K increases, prediction accuracy keeps growing (i.e., higher PSNR), while as N increases, image quality improves (i.e., higher Inception Score). However, keeping N and K too high drastically increases model complexity. We choose N = 5 and K = 2 because this combination achieves promising performance while keeping the model at a relatively low complexity level.

Are these proposals rational enough? To examine the rationality of the proposals at the sampling stage, we plot all proposals in a single image for visualization on the MovingMnist dataset [33]. As shown in the second row of Figure 5 (B), the sampler effectively samples N examples based on previous inputs, and the motion direction of all proposals points roughly toward the ground truth.
Meanwhile, one can observe that the image quality of all proposals stays at a relatively high level and hardly degrades along the prediction procedure.
Are these candidates accurate enough? To examine the accuracy of the candidates at the selection stage, we likewise plot all candidates for visualization. As shown in the third row of Figure 5 (B), the selector effectively picks K candidates out of the N proposals. For MovingMnist [33] prediction, the selector filters out the proposals that are likely to be far away from the ground truth. With the help of the combiner, the high-quality candidates are composed into the final prediction, which exhibits high motion accuracy.

[Figure 5 labels: Ours/SV2P/CDNA; Ours/DrNet/MCNet; Ours/DFN/SVG; Proposals/Candidates/Prediction; panels (A)-(D).]

Figure 6: Comparison experiments and ablation study. w/o: removing the corresponding module.

Explanation of ∆W: (1) W is the normal learnable kernel, and ∆W is generated at each time stamp. "Element-wise addition" means matrix addition between ∆W_i and W_i. (2) We update ∆W_i instead of W_i because W_i is treated as a basic kernel that estimates motion at a coarse level, while ∆W_i performs fine-grained prediction on top of W_i. By updating ∆W_i we can narrow the possible distribution down to a more precise and smaller scale. (3) Inspired by DFN [19], we use a softmax to encourage sparsity of ∆W_i, so that we can mimic complex motion dynamics more precisely. (4) We present visual evidence (Fig 6(F)) to show the differences among {∆W_i}_{i=0}^{2} (the 2nd input and 2nd output channel of {∆W_i}_{i=0}^{2}, shown in the upper-right corner of each proposal); they also change gradually as time increases.
Comparison Experiments: As suggested, we compare our model with all 6 baselines on the Human3.6M dataset [17] in terms of PSNR, SSIM and Inception Score (Fig 6(A,B,C)).
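The kernel update in the ∆W explanation above can be sketched as below. This is a minimal single-channel NumPy illustration of computing W_i + softmax(∆W_i) and applying it as a filter, not the authors' implementation; in the actual model ∆W_i is generated by a network at each time stamp and the kernels are multi-channel, and the hand-picked `delta` here is a hypothetical stand-in.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over all entries of x."""
    e = np.exp(x - x.max())
    return e / e.sum()

def effective_kernel(W, delta_W):
    """Coarse base kernel W plus a softmax-normalized, per-time-stamp delta.
    The softmax concentrates mass on a few taps, encouraging sparsity in
    the fine-grained correction, in the spirit of dynamic filters."""
    return W + softmax(delta_W.ravel()).reshape(delta_W.shape)

def apply_kernel(frame, K):
    """Valid 2-D correlation of a single-channel frame with kernel K."""
    kh, kw = K.shape
    h, w = frame.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(frame[i:i + kh, j:j + kw] * K)
    return out

W = np.full((3, 3), 1.0 / 9.0)                # coarse averaging base kernel
delta = np.zeros((3, 3)); delta[0, 0] = 5.0   # hypothetical motion cue
K = effective_kernel(W, delta)                # the (0, 0) tap now dominates
pred = apply_kernel(np.random.rand(8, 8), K)  # (6, 6) filtered output
```

Because softmax(∆W_i) sums to one, the update adds a bounded, mostly concentrated correction on top of the coarse kernel, which matches the intent of narrowing the motion distribution.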
Our model (green line) still performs better than these baselines in terms of both motion accuracy and image quality when the prediction length is extended to 30 (long-term prediction, trained for 10 steps). In particular, it is slightly better than the baseline SVG/best (the best of 10 random samples, light purple line).
Ablation Study: (1) As shown in Fig 6(A,B), removing the selector (K = N) or the combiner (K = 1) drops the prediction accuracy by a large margin (yellow and orange lines) compared to the full model (green line). (2) The results predicted without the selector (Fig 6(D), 2nd row) tend to involve random motion. By contrast, the confidence scores assigned by the selector (3rd row) estimate the distribution of future motion well, i.e., the selector acts as a strong supervisor for motion prediction. (3) The masks from the combiner (Fig 6(E)) combine different parts of the candidates into the final prediction, which helps to refine the predicted motion and improve accuracy. (4) Analysis of N and K: further increasing K brings in proposals with low confidence, which may decrease the prediction accuracy; keeping N high improves performance slightly but drastically increases the model complexity.

5 Conclusion
In this paper we propose a two-stage framework, called VPSS, to study the video prediction task from a novel view. At the sampling stage, our framework contains a conditional sampling module which produces multiple high-quality proposals at each time stamp. At the selection stage, we propose a selection module that produces the final prediction. Extensive experiments on diverse challenging datasets demonstrate the effectiveness of the proposed video prediction framework.

Acknowledgement

This work was supported by the National Key Research and Development Program of China (2016YFB1001003) and NSFC (61527804, U1611461, 61502301, 61521062). The work was partly supported by the State Key Research and Development Program 18DZ2270700.
This work was supported by the SJTU-UCLA Joint Center for Machine Perception and Inference. The work was also partially supported by China's Thousand Youth Talents Plan, STCSM 17511105401, 18DZ2270700 and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China.

[Figure 6 panels: (A) PSNR-Human3.6M, (B) SSIM-Human3.6M, (C) Inception Score-Human3.6M, (D) Confidence Visualization, (E) Mask Visualization, (F) Kernel Visualization, (G) Long-term Prediction at T = 1, 9, 18, 27.]

References

[1] M. Abadi, P. Barham, J. Chen, and Z. Chen. Tensorflow: A system for large-scale machine learning. CoRR, abs/1605.08695, 2016.

[2] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017.

[3] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. CoRR, abs/1710.11252, 2017.

[4] D. Berthelot, T. Schumm, and L. Metz. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017.

[5] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1520–1529, 2017.

[6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P.
Abbeel.\n\nInfogan: Interpretable\nrepresentation learning by information maximizing generative adversarial nets. In Advances in Neural\nInformation Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016,\nDecember 5-10, 2016, Barcelona, Spain, pages 2172\u20132180, 2016.\n\n[7] E. Denton and R. Fergus. Stochastic video generation with a learned prior. CoRR, abs/1802.07687, 2018.\n[8] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In\nAdvances in Neural Information Processing Systems 30: Annual Conference on Neural Information\nProcessing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4417\u20134426, 2017.\n\n[9] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. C. Courville.\n\nAdversarially learned inference. CoRR, abs/1606.00704, 2016.\n\n[10] C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video\nprediction. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural\nInformation Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 64\u201372, 2016.\n\n[11] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In\n2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA,\nJune 27-30, 2016, pages 2414\u20132423, 2016.\n\n[12] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse recti\ufb01er neural networks. In Proceedings of the Fourteenth\nInternational Conference on Arti\ufb01cial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA,\nApril 11-13, 2011, pages 315\u2013323, 2011.\n\n[13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and\nY. Bengio. Generative adversarial nets. 
In Advances in Neural Information Processing Systems 27: Annual\nConference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec,\nCanada, pages 2672\u20132680, 2014.\n\n[14] A. Guzm\u00e1n-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple\nstructured outputs. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on\nNeural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake\nTahoe, Nevada, United States., pages 1808\u20131816, 2012.\n\n[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385,\n\n2015.\n\n[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n[17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive\nmethods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1325\u2013\n1339, 2014.\n\n[18] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks.\nIn 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA,\nJuly 21-26, 2017, pages 5967\u20135976, 2017.\n\n[19] X. Jia, B. D. Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic \ufb01lter networks. In Advances in Neural\nInformation Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016,\nDecember 5-10, 2016, Barcelona, Spain, pages 667\u2013675, 2016.\n\n[20] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan. Predicting scene parsing\nand motion dynamics in the future. In Advances in Neural Information Processing Systems 30: Annual\nConference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA,\npages 6918\u20136927, 2017.\n\n[21] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. 
Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1771–1779, 2017.

[22] F. Khan, X. J. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 1449–1457, 2011.

[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2014.

[24] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1558–1566, 2016.

[25] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. CoRR, abs/1804.01523, 2018.

[26] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. ALICE: towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5501–5509, 2017.

[27] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.

[28] S. Marsland. Machine Learning - An Algorithmic Perspective. Chapman and Hall / CRC machine learning and pattern recognition series. CRC Press, 2009.

[29] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016.

[30] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski.
Plug & play generative networks:\nConditional iterative generation of images in latent space. In 2017 IEEE Conference on Computer Vision\nand Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3510\u20133520, 2017.\n\n[31] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making.\n\nIn Advances in Neural\nInformation Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015,\nDecember 7-12, 2015, Montreal, Quebec, Canada, pages 1252\u20131260, 2015.\n\n[32] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo. Convolutional LSTM network: A machine\nlearning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems\n28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal,\nQuebec, Canada, pages 802\u2013810, 2015.\n\n[33] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using\nlstms. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille,\nFrance, 6-11 July 2015, pages 843\u2013852, 2015.\n\n[34] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video\n\nsequence prediction. CoRR, abs/1706.08033, 2017.\n\n[35] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via\nhierarchical prediction. In Proceedings of the 34th International Conference on Machine Learning, ICML\n2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3560\u20133569, 2017.\n\n[36] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent\nadversarial networks. 
In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2242–2251, 2017.