Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Quality: The paper is technically sound, and its claims are supported by experimental results. The experimental study tests the proposed method on three standard datasets with correct methodologies and evaluations. Model capacity comparisons are covered in the supplementary material; I believe those comparisons are important, and it would be better if the authors included them in the main paper. The paper tries to show that handcrafted features, separation of information sources, and other specialized computation are not necessary for the video prediction task. However, the experiments only compare the performance of simple networks at various capacities. This shows that more capacity yields better results, but it is not clear how other SOTA methods compare to the simple architectures. Varying the capacity of SOTA networks, or simply reporting their performance, would better evaluate the claim.

Clarity: The paper is generally well written and clearly structured.

Originality: To my knowledge, no other paper provides a large-scale study of network capacity on the task of video prediction. The literature review appears complete.

Significance: The results of this paper are important for research on video processing. It is somewhat expected (from observing the same phenomenon in other tasks) that higher capacity yields better performance, but this expectation had not been verified before. This paper is a useful contribution to the literature and has high practical impact. The paper may be better suited for publication at CVPR or ICLR, or could perhaps be published as an application paper at NIPS.

Update: I think this paper is a good contribution to the field and might be helpful for other video processing tasks as well. That said, I do believe comparing to the performance of SOTA networks is important. The author response mentions that a SOTA network has only been evaluated on simpler datasets.
I do encourage the authors to compare to SOTA on the main datasets of the paper, but if the computation and hyperparameter tuning are truly impractical, it is essential to at least report the performance of standard networks at different capacities on the datasets where SOTA has already been evaluated.
This paper attempts to answer the question of whether specialized architectures are necessary to solve the video prediction problem, and hypothesizes that they are not: instead, "standard" neural networks are capable of solving the problem as long as they are large enough. The paper then verifies this hypothesis using "standard" large networks on three datasets corresponding to three different types of video activity. The paper claims to be the first work to ask whether specialized architectures are necessary for video prediction and to provide an empirical study addressing this question with "standard" large networks.

While I am not sure whether this is the first such work on this question (I personally have not seen such work before), the hypothesis is largely in line with the well-known understanding that larger networks tend to generalize better, which is supported by recent literature, including the work cited in this paper's related-work section. Consequently, I am not surprised that the hypothesis was verified in the empirical studies reported in the paper, and thus I do not find the novelty of this work terribly impressive, though I appreciate the effort that went into conducting and reporting the experiments.

Regarding the empirical studies, I have the following two comments/requests. First, the scenes in the three datasets studied (object interaction, human motion, and car driving) are relatively causality-explicit in nature, so there is not much uncertainty in prediction. What if you have a scene with more uncertainty or stochasticity, such as natural scene motion (e.g., a tornado) or events (e.g., crowd gathering, an earthquake)? Second, presumably, prediction accuracy also depends on the length of the history data.
Can you provide such empirical studies on the three datasets?

The above was my original review. I have read the authors' response and am happy with their answers. I still have my reservation regarding the novelty issue, but I am happy to bump the overall score up to 7.
Edit: I read the author rebuttal and the other reviews. I will not change my score. I still encourage the authors to try adding experiments that vary the capacity of the SAVP approach, even if it is computationally difficult. We should explore and understand how the performance of both traditional and more complicated approaches changes with varying capacity.

--------------

Originality
- It is not a new concept that high-capacity models can improve prediction tasks, but the authors apply the idea to video prediction and are the first to demonstrate that such models can perform well on this task.
- The paper may also be the first to perform such a thorough and rigorous evaluation (e.g., a study across three datasets and five metrics) of models at varying capacity levels for this task.
- Related work is extremely well cited throughout the paper.

Quality
- This paper is an empirical study of architecture design, and a strength of the work is indeed its empirical rigor. The authors propose five metrics for evaluation and compare on three datasets. The types of models tested are clearly chosen thoughtfully, based on a careful literature review.
- The authors include human evaluation using AMT, which, in my opinion, greatly increases the quality and usefulness of the evaluations.
- One critique is that there is no direct comparison to some of the techniques that the authors claim are not needed for this task. It would have been nice to see how the performance of architectures that employ optical flow, segmentation tasks, adversarial losses, and landmarks scales with increasing capacity. Did the authors try this? (It is possible I just missed something in the paper.)

Clarity
- The paper was well written and easy to follow.

Significance
The idea that high-capacity models can perform well compared to specialized, handcrafted architectures is not novel, as the authors point out in the introduction.
However, this paper is the first to apply the idea to video prediction and the first to conduct a rigorous experimental study of this type. So, in my opinion, the results are of medium to high significance. If the idea were completely new, the paper would be more significant to me, but I do think this work will help guide researchers toward better architectures for this task.