NeurIPS 2020

Video Frame Interpolation without Temporal Priors


Review 1

Summary and Contributions: The paper looks into the mechanism of frame acquisition that leads to frame blurriness and low frame rate, and proposes a new perspective on joint frame deblurring and frame interpolation for various video exposure scenarios. The state transition model is an interesting and novel way to resolve the joint problem. In detail, the curvilinear representation and optical flow refinement are proposed to achieve better frame quality than the state-of-the-art method. The idea of the state transition model is somewhat novel to the community. Although the paper has provided experimental results on uncertain-time-interval interpolation, it is still unknown how the model will behave on real-world blurry and low-frame-rate videos. Overall, I highly recommend the paper be accepted. Below I summarize the main advantages and flaws of this paper.

Strengths:
+ The paper shows the frame acquisition process when capturing moving objects, depicting the origins of motion blur and low frame rate.
+ The key instant frame state is decomposed from the frame capturing process, which helps the later construction of the neural networks.
+ The experiments are sufficient, and the ablation studies validate the effectiveness of the proposed modules and their applicability to various exposure scenes.

Weaknesses:
- The synthesis of the dataset may not be consistent with the actual blurry video acquisition process. Nevertheless, BIN shares the same inconsistency problem. Did you ever apply your trained model to actual videos? What are the limitations?
- The model size and running speed of the proposed method are not compared with existing methods such as BIN.
- The term "generalized" in the title does not suit the context of this paper well: the opposite of "generalized" would suggest "scenario-specific" to readers, yet is there such a thing as scenario-specific video frame interpolation?

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I think the overall idea should be captured by better terminology than "Generalized Video Frame Interpolation". Accordingly, once a better term is found, the paper should be polished to match.


Review 2

Summary and Contributions: This paper addresses the video frame interpolation problem. It explicitly models the exposure time and readout time to handle scenarios with uncertain exposure times. It presents a curvilinear motion trajectory formula and a novel optical flow refinement network for better results. Experiments show that the proposed method outperforms competitive methods quantitatively on synthetic datasets. Overall, there is sufficient novelty in the problem setting and the proposed solution. However, the experiments were only performed on synthetic datasets, and the synthesis procedure is not physically correct. Without seeing how the proposed method performs on real videos, it is not easy to judge its effectiveness.

Strengths: The paper has good novelty. It observes that direct interpolation between blurry frames leads to inferior results and that a naïve combination of deblurring and interpolation cannot handle blurry videos well. In addition to analyzing the exposure time and readout time explicitly, the proposed method also generalizes quadratic motion to curvilinear motion and designs an optical flow refinement network. In the quantitative evaluation, the proposed method outperforms both methods that simply cascade deblurring and interpolation and those performing the two tasks jointly.

Weaknesses: The procedure for synthesizing frames with different exposure times from a 240fps camera is not physically correct. For example, to obtain a 24fps video with a 1/48-s exposure time, the procedure averages five consecutive frames to simulate the exposure (5/240 = 1/48 s) and discards the next five frames for the readout time (ten source frames give the 1/24-s frame period). However, each of the five averaged frames does not itself have an exposure time of 1/240 second (part of that interval is spent reading out), so the five frames do not span a continuous 1/48-second exposure, and averaging them cannot accurately simulate a frame with a 1/48-s exposure time. Moreover, the experiments were only carried out on synthetic datasets; since the synthesis procedure is not physically correct, it is not easy to judge how the proposed method performs on real videos. Finally, the performance improvement is significant in the 5-5 setting but much less evident in the other two settings. A sketch of the synthesis procedure as I understand it is given below.
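To make the concern concrete, here is a minimal sketch of the described synthesis procedure (the function and variable names are mine, not the paper's; it assumes the 240fps clip is a numpy array of shape (N, H, W, C)):

```python
import numpy as np

def synthesize_24fps(frames_240fps: np.ndarray, n_avg: int = 5, n_skip: int = 5):
    """Average n_avg consecutive frames to mimic exposure (5/240 = 1/48 s),
    then skip n_skip frames to mimic readout, so each output frame consumes
    n_avg + n_skip = 10 source frames, i.e. a 1/24-s period (24fps)."""
    period = n_avg + n_skip
    return [
        frames_240fps[i:i + n_avg].mean(axis=0)
        for i in range(0, len(frames_240fps) - n_avg + 1, period)
    ]
```

The flaw is that each 240fps source frame itself has an exposure shorter than 1/240 s (part of that interval is readout), so the averaged window does not correspond to one continuous 1/48-s exposure.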

Correctness: Aside from the problem with the synthesis procedure, the rest appears correct to me. However, I did not check every step carefully.

Clarity: It takes some effort to comprehend the proposed method. In particular, Section 2 needs improvement.

Relation to Prior Work: It is clear to me.

Reproducibility: Yes

Additional Feedback: POST REBUTTAL COMMENTS: My main concern is the performance on real videos, since the synthesis process seems flawed. The rebuttal shows a real example and points out two real video examples in the supplementary video. The proposed method does outperform other methods significantly in these examples, so I raise my score. However, it would be better if the paper included real examples and a discussion of the synthesis process's limitations.


Review 3

Summary and Contributions: This paper presents a method for interpolating frames that may be blurry and/or have unknown exposure times. The idea is to train networks to recover sharp frames at the beginning and end points in time of the input video frames, to compute multi-frame flow assuming constant acceleration, and then to use this to interpolate the frames. They show that the training and test data need not have the same exposure times, and show good results in comparison to several other methods on a few datasets.

Strengths: The strength of this approach is that it doesn't make as many assumptions as previous work about the exposure times of the input frames, and can effectively deblur and interpolate video frames jointly.

Weaknesses: The main weakness of this paper is that in some cases the improvements over previous work are not that large, in terms of both visual results (Figure 3) and PSNR and SSIM. There is a lot of work in the area of frame interpolation, and I don't think the approach here is groundbreaking. Another, smaller concern is how reasonable the constant-acceleration assumption is in practice; does it hold often? I imagine it may break down in real scenarios. A sketch of the assumption is given below.
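For context, my understanding of the constant-acceleration (quadratic) flow model underlying this family of methods, in a minimal sketch (function and variable names are mine, not the paper's):

```python
import numpy as np

def quadratic_flow(flow_0_to_1: np.ndarray, flow_0_to_m1: np.ndarray, t: float):
    """Flow from frame 0 to time t in (0, 1) under x(t) = x0 + v*t + a*t^2/2.

    With unit frame spacing, v = (flow_0_to_1 - flow_0_to_m1) / 2 and
    a = flow_0_to_1 + flow_0_to_m1, which reproduces both observed flows
    at t = 1 and t = -1.
    """
    v = (flow_0_to_1 - flow_0_to_m1) / 2.0
    a = flow_0_to_1 + flow_0_to_m1
    return v * t + 0.5 * a * t ** 2
```

The paper's curvilinear trajectory generalizes this to uncertain frame intervals, but the concern about whether real motion is well approximated by constant acceleration still applies.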

Correctness: The technical content of the paper appears to be correct.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: The previous work is discussed and compared against well.

Reproducibility: Yes

Additional Feedback: I didn't follow this: "However, there is still limitation in our work, e.g. the proposed trajectory priors can only be used to refine one optical flow". Can you please expand on the limitations of this work?


Review 4

Summary and Contributions: The authors propose a video frame interpolation method that can generalize to different exposure time ratios. The main contributions of the paper are deriving a generalized motion trajectory and proposing an optical flow refinement network trained with the derived constraints. The experimental results show the effectiveness of the method, and the visual quality is promising.

Strengths: The authors refine the motion trajectory computed from uncertain exposure intervals. The derivation looks fine, and the models implementing the new interpolation formula achieve good visual and quantitative performance. Training the restoration model and the optical flow refinement model with synthetic data is shown to generalize easily to different settings and even to real inputs.

Weaknesses: Some of my concerns are below:
1. The claim of proposing a 'generalized' interpolation may be too strong. The real cases that cannot be resolved by the proposed method should also be discussed; I believe the exposure setting could be only one cause of poor generalization ability.
2. It is unclear to me how equation (2) in line 103 is derived. Also, in line 109, how can Eq. (4) be reduced to Eq. (1) given that different frames are used?
3. For the restoration network, it seems the network is simply performing multi-frame deblurring with a two-stage process. What exactly is the function of the two-stage design? Could the authors show some outputs from the different stages? In line 154, the improvement in dB could come only from more parameters in the model, not from the intuitive idea illustrated in the paper; please double-check this or visualize it.
4. Also for the restoration network: in line 147, what is the temporal ambiguity, and why do the authors use four frames rather than just two? In line 143, do the authors mean B1 and B2?
5. More results on real videos should be reported. Currently, the results are reported on synthetic data, and most previous methods do not share the same assumptions as this paper. The results on real videos in the supplementary material look fine, but the related content is missing from the paper, especially a user study.

Correctness: Yes, see above.

Clarity:
1. The related work section could be placed right after the introduction.
2. There are a couple of typos in the paper, e.g.:
- line 55: "almost"; "synthesis" should be "synthesize" (ctrl-F for all such typos across the manuscript)
- line 82: "detailed"
- line 112: "between" --> "among"
- line 153: "temporal"; etc.

Relation to Prior Work: Yes, see above.

Reproducibility: Yes

Additional Feedback: The main concerns about the paper are the details of the derivation and the lack of results on real images. The writing also needs improvement; many typos exist.
------
After reading the rebuttal and the other reviewers' comments, I think some of my concerns are addressed by the additional experiments. I encourage the authors to add more derivations and explanations to the final version. I will increase my rating.