Reviews: More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

This is a borderline case. The initial scores for this paper were:
6: Marginally above the acceptance threshold.
6: Marginally above the acceptance threshold.
5: Marginally below the acceptance threshold.

Positive points:
+ extensive experimental analysis on three large-scale video datasets (Something-Something, Kinetics, Moments-in-time) showing memory/efficiency/accuracy gains enabled by the proposed approach.
+ several ablations providing insights likely to be useful for the community.
+ simplicity of the method.
+ clear writing.

Negative points:
- the paper combines existing building blocks (Big-Little-Net, TSN, TSM) and hence has somewhat limited novelty.
- “memory/efficiency gains are convincingly demonstrated, but are not substantial enough to be a game-changer in the practice”.
- missing important details.

The authors provide a rebuttal. After seeing the rebuttal and in the follow-up discussion, R1 and R2 maintain their slightly positive rating (6) and R3 upgrades their rating from 5 to 6. The rebuttal addresses some of the concerns, though the concern regarding limited novelty remains. All reviewers have also updated their reviews with post-rebuttal comments:

R1: “Despite the limited originality, I believe that the paper can be a valuable contribution to the community with its simplicity, positive results, and comprehensive experiments.”

R2: “The contribution of this work is mostly empirical. The stronger results compared to more complex models and the promise to release the code imply that this work deserves to be known, even if fairly incremental.”

R3: “Technical novelty of the work is limited, though the good performance on standard benchmarks with lower computation might be valuable. Given the newer results in rebuttal and the promise to release code, I am upgrading my rating to 6.”

The AC is convinced by the positive arguments of the reviewers and recommends acceptance.
The authors are strongly encouraged to incorporate the new results from the rebuttal and the clarifications suggested by the reviewers into the camera-ready version of the paper.

Paper ID:	1336
Title:	More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation