Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper is well written and intuitive. The approach appears technically correct, and the novelty is moderate. The paper does not aim for a significant technical contribution; instead, it tackles the valid problem of video recognition under a limited computational budget. The related work section is short and omits some relevant works, for example SlowFast Networks for Video Recognition and Skip-RNNs. In terms of results, the paper shows the benefit of conditional gating for processing fine features. The experiments are relatively convincing, but I do not believe the improvement is significant. In Table 1, the results for the UNIFORM method are reported for 25 uniformly sampled frames, which costs almost 2x the GFLOPs of the proposed method. However, Fig. 2 shows that, to reach the same accuracy, uniform sampling needs only slightly more frames than the proposed method (~1.3x).
This paper focuses on the video classification task; its goal is to speed up inference for typically heavy video neural networks. The proposed method is a "cascade"-like approach in which a lightweight 2D CNN (e.g. MobileNets) scans through all frames, and an LSTM decides whether to apply a heavier 2D CNN (e.g. ResNets) to each frame. The main innovation of this paper is that, in contrast to previous work, which is reinforcement learning-based, it uses the Gumbel softmax trick, which allows the whole framework to be trained end-to-end. Empirical results on FCVID and ACTIVITYNET confirm that this optimization process outperforms several RL-based baselines. Clarity: The paper is very clearly written and easy to read. Originality: Although the proposed framework is not entirely novel (the high-level ideas of skipping frames and of cascades have been explored by previous work on this topic), the optimization process is novel for this particular application (although it has been explored in other scenarios) and is shown to be critical by the empirical evaluations. Significance: Speeding up video classification networks is important for practical applications. However, the choice of an LSTM to aggregate temporal features may limit the method's practical use, as recent work has shown that 3D ConvNets significantly outperform RNN alternatives on video classification benchmarks. This can also be seen in Table 1, where the uniform baseline is "unreasonably" strong compared with the LSTM on both datasets. (Post rebuttal) I appreciate the authors' rebuttal, which addressed my concerns. I keep my original rating and recommend acceptance of the paper.
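To make the gating step summarized in this review concrete, the following is a minimal sketch of Gumbel-softmax sampling for a binary per-frame decision (skip vs. apply the heavier CNN). It is an illustration under stated assumptions, not the authors' implementation: the function names (`sample_gumbel`, `gumbel_softmax`, `gate_frames`), the two-logit layout, and the temperature value are all hypothetical, and the gate logits are supplied by hand rather than produced by an LSTM. In a real training setup the soft sample would carry gradients while the hard decision is used in the forward pass; this plain-Python version only shows the sampling itself.

```python
import math
import random


def sample_gumbel(rng):
    """Draw Gumbel(0, 1) noise via inverse transform: -log(-log(U))."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed (soft) one-hot sample over the given logits."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def gate_frames(gate_logits, temperature=1.0, seed=0):
    """For each frame, sample a decision from [skip_logit, use_logit].

    Returns a list of (hard, soft) pairs, where hard is 1 if the heavier
    network would be applied to that frame and 0 otherwise, and soft is
    the relaxed sample that a differentiable framework would backpropagate
    through. The logit layout is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    decisions = []
    for logits in gate_logits:
        soft = gumbel_softmax(logits, temperature, rng)
        hard = 1 if soft[1] > soft[0] else 0
        decisions.append((hard, soft))
    return decisions


# Hypothetical logits for three frames: confident skip, confident use, undecided.
example_logits = [[4.0, -4.0], [-4.0, 4.0], [0.0, 0.0]]
for hard, soft in gate_frames(example_logits, temperature=0.5, seed=0):
    print(hard, [round(p, 3) for p in soft])
```

Lower temperatures push the soft sample closer to a one-hot vector, at the cost of higher-variance gradients in a differentiable implementation; the value 0.5 above is arbitrary.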
Originality: The task is not new; as the authors mention in the related work section, much previous literature has addressed this problem. The proposed method is very similar to Low-Latency Video Semantic Segmentation (CVPR 2018), as well as to other literature. Hence the novelty of the paper is limited. Quality: I have one concern about the gating mechanism: how can one verify that it is really working? The gating mechanism is a one-layer MLP that outputs the probability of whether or not to extract fine details from each video frame. It may simply be making a random selection. The performance gap between the proposed method and the uniform baseline is very small, which makes it hard to justify the method's effectiveness. In addition, video classification is usually evaluated on Kinetics, UCF101, or Something-Something. This submission chooses FCVID and ActivityNet, which makes it hard to compare with previous approaches and, again, hard to justify the method's effectiveness. Clarity: This submission is well written and easy to understand.