Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This work proposes a method for estimating and using higher-order motion information, i.e. acceleration, in optical flow estimation so that the interpolated frames capture motion more naturally. The idea is interesting and straightforward, and I am surprised that no one has done this before. The work is very well presented, with sufficient experiments, and the supplementary material (SM) is well prepared. The flow reversal layer is somewhat novel, but it is not clear what exactly is learned by the reversal layer. How does the learned layer perform compared to a reversal layer with fixed parameters? I am not sure about the contribution of the adaptive flow filtering: it is hard to follow the argument for why the proposed method is a better way of reducing artifacts, and the results in Table 4 show only marginal improvements from it. It would be great to see the same ablation study on other datasets.
Positive:
- Clear and important motivation and problem.
- Good writing throughout; concise, to the point. Thank you.
- Method is simple.
- Improvement from integration of multi-frame linear flow vs. quadratic flow is tested.
- Clear results on multiple datasets, with ablated components.
- Video: Results from 1m 50s onwards are great. This presentation format with the comparison to other techniques would be good to use for the boat sequence, also.

Room for improvement:
- Method does not directly estimate quadratic flow, but instead estimates linear flow across pairs of frames and combines them.
- SepConv and DVF were not retrained on your 'large motion' clip database. It is known that better training data (more selective to the problem) is important. What effect does this 'large motion' clip selection have on these models? Put another way, what is the difference between the SuperSloMo network trained on regular vs. large motion data?
- Video results: 42 seconds: while impressive, this result shows visible popping/jumping whenever we reach an original video frame. Why?
- Video results: The interpolated frames are noticeably blurrier than the original frames. While this is a problem with these techniques in general (potentially caused by the averaging model for frame synthesis), it is still a limitation and should be stated as such.

Minor:
- Title: consider adding 'temporal' or 'motion' to the title.
- L86: additional comma at the end of the line.
- Figure 4: the frames labeled as GT with the trajectories: the trajectories are estimated and should be labeled as such; there are no GT trajectories.
- L225: please add Eq. (2) not just (2).
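To make the point about combining linear flows concrete, here is a minimal sketch of how two linear flows (frame 0 to frame 1, and frame 0 to frame -1) could be combined into a quadratic displacement at time t. It assumes constant acceleration over the three frames; the function names and the per-pixel tuple representation are illustrative choices, not the authors' implementation.

```python
def quadratic_displacement(f_fwd, f_bwd, t):
    """Displacement from frame 0 to time t under an assumed constant acceleration.

    f_fwd: (dx, dy) flow of one pixel from frame 0 to frame 1
    f_bwd: (dx, dy) flow of the same pixel from frame 0 to frame -1
    """
    out = []
    for d1, dm1 in zip(f_fwd, f_bwd):
        accel = d1 + dm1            # from x(1) + x(-1) = a
        vel = (d1 - dm1) / 2.0      # from x(1) - x(-1) = 2 * v0
        out.append(vel * t + 0.5 * accel * t * t)
    return tuple(out)


def linear_displacement(f_fwd, t):
    """Baseline: constant-velocity model that only uses the 0 -> 1 flow."""
    return tuple(d * t for d in f_fwd)


# Toy motion x(t) = 2t + t^2, so the flow is 3 towards frame 1 and -1 towards frame -1.
quad = quadratic_displacement((3.0, 0.0), (-1.0, 0.0), 0.5)
lin = linear_displacement((3.0, 0.0), 0.5)
```

For this toy motion the quadratic model recovers the true position at t = 0.5, while the linear model overshoots it, which is the gap the linear vs. quadratic ablation measures.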
+ Novel idea
+ Good results
+ Complete evaluation
- Technical section can be streamlined more
- Flow filtering is missing some technical details

I liked the idea and motivation of the paper. It makes intuitive sense and is well explained. The proposed algorithm produces good results, not just numerically but also qualitatively. There is an extensive comparison to prior work and an ablation study. One minor request would be to ablate all possible combinations of components instead of just removing each individual component; it would make the paper stronger. However, the current ablation is sufficient.

My main points of improvement lie in the technical section. First, equation 1 is probably not needed, since most readers will be familiar with velocity and acceleration. Maybe a slightly modified version of equation 3 is fine. If the authors decide to keep equation 1, I'd recommend fixing the notation: the bounds of the integration currently use the same symbol as the differential, which seems wrong. The proposed flow reversal (eq 4) is commonly called splatting in computer graphics. It might be worth a citation, and carefully highlighting the differences (if there are any). If it is just splatting of the flow field (which seems to be the case at first glance), the section can be shortened a bit. Finally, I was missing some information on how the flow filtering works: how is the network set up, what are the inputs, and what are the outputs? Are all inputs fed into a single network (concatenated or separate paths)? Is there any normalization between images and flow fields?
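For reference, the kind of flow splatting I have in mind can be sketched as follows. This is a generic illustration, not a reproduction of eq 4: the Gaussian weighting, `sigma`, `radius`, and the function name are assumptions made for the example, and the paper's layer may differ in exactly these details.

```python
import math


def reverse_flow_by_splatting(flow, sigma=1.0, radius=1):
    """Estimate the flow t -> 0 from flow[y][x] = (dx, dy), the flow 0 -> t.

    Each source pixel pushes its negated flow vector to the pixels around its
    landing position; overlapping contributions are blended with Gaussian
    weights (an illustrative choice) and then normalized.
    """
    h, w = len(flow), len(flow[0])
    acc = [[[0.0, 0.0] for _ in range(w)] for _ in range(h)]
    weight = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            tx, ty = x + dx, y + dy                  # landing position in frame t
            cx, cy = int(round(tx)), int(round(ty))
            for v in range(cy - radius, cy + radius + 1):
                for u in range(cx - radius, cx + radius + 1):
                    if 0 <= u < w and 0 <= v < h:
                        d2 = (tx - u) ** 2 + (ty - v) ** 2
                        wgt = math.exp(-d2 / (sigma ** 2))
                        acc[v][u][0] += wgt * -dx
                        acc[v][u][1] += wgt * -dy
                        weight[v][u] += wgt
    # Pixels that receive no contribution (holes) are left at zero flow here.
    reversed_flow = [[(0.0, 0.0)] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            if weight[v][u] > 0.0:
                reversed_flow[v][u] = (acc[v][u][0] / weight[v][u],
                                       acc[v][u][1] / weight[v][u])
    return reversed_flow


# Uniform motion of one pixel to the right: the reversed flow should point left.
uniform = [[(1.0, 0.0)] * 4 for _ in range(3)]
rev = reverse_flow_by_splatting(uniform)
```

If eq 4 reduces to something like this with fixed weights, a citation to splatting and a shorter section would suffice; if part of the weighting is learned, that difference is what the section should highlight.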