Sun, Dec 8th through Sat, Dec 14th, 2019, at the Vancouver Convention Center
The technical novelty of the paper is moderate: I can see only small differences between the proposed method and conditional BN or dynamic filters. My concern is that there is too much task-specific engineering in terms of finding temporal dimensions, feature types, and hyperparameters. However, empirically, the method shows promising results on three datasets. Some related work is missing, e.g., Gavrilyuk et al., "Actor and Action Video Segmentation from a Sentence."
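The conditional BN / dynamic-filter family the reviewer compares against can be illustrated with a short sketch. This is not the authors' SCDM implementation; it is a generic FiLM-style conditional modulation, with hypothetical shapes and random weights, showing the basic idea of a sentence embedding producing per-channel scale and shift parameters for video features.

```python
import numpy as np

# Generic conditional-modulation sketch (FiLM / conditional BN style),
# NOT the paper's SCDM: a sentence embedding yields per-channel scale
# (gamma) and shift (beta) parameters applied to normalized video
# features. All dimensions and weights below are illustrative.
rng = np.random.default_rng(0)

d_sent, d_feat, T = 16, 8, 4                      # sentence dim, channels, time steps
sentence = rng.standard_normal(d_sent)            # pooled sentence embedding
video = rng.standard_normal((T, d_feat))          # per-timestep video features

# Hypothetical linear maps from the sentence to modulation parameters.
W_gamma = rng.standard_normal((d_feat, d_sent)) * 0.1
W_beta = rng.standard_normal((d_feat, d_sent)) * 0.1

gamma = 1.0 + W_gamma @ sentence                  # scale, initialized near identity
beta = W_beta @ sentence                          # shift

# Normalize per channel over time, then modulate with (gamma, beta).
mean = video.mean(axis=0, keepdims=True)
std = video.std(axis=0, keepdims=True) + 1e-5
modulated = gamma * (video - mean) / std + beta

print(modulated.shape)  # (4, 8): same shape, now sentence-conditioned
```

The reviewer's point is that SCDM differs from this baseline scheme only in details such as where in the temporal hierarchy the modulation is applied.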
This paper proposes a semantic conditioned dynamic modulation (SCDM) mechanism for temporal sentence grounding in videos.

The strengths are as follows:
(1) Technically, the proposed SCDM is novel, and the idea of using sentence information to compose related video contents for temporal grounding is natural and reasonable. Coupling the SCDM mechanism with a hierarchical temporal convolutional network also links video contents at different temporal scales and granularities, and naturally supports the multi-scale video spans required for temporal sentence grounding. Overall, this paper has good originality and quality from a technical standpoint. More detailed comments can be found in the summarized contributions above.
(2) Extensive experiments on three public datasets show that the proposed SCDM significantly outperforms several state-of-the-art methods. The ablation studies and qualitative results are also convincing.
(3) The proposed SCDM is lightweight and concise, the presentation of the paper is clear, and the work is easy to follow.

There are also some minor points:
(1) Overall, SCDM outperforms the other baseline methods. But why does SCDM get lower results at R@5, IoU=0.3 on the TACoS dataset, and at R@5, IoU=0.5 on the Charades-STA dataset? Please provide more explanation.
(2) In SCDM, the authors compute word attention weights over the sentence. However, these word attention weights are not used in the multi-modal fusion procedure.
(3) What happens if SCDM is applied only to some of the temporal convolutional layers rather than all of them? In addition, the authors are encouraged to release the code for SCDM to facilitate further research.
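The word-attention step mentioned in minor point (2) can be sketched generically. This is an assumption-laden illustration, not the paper's actual fusion module: softmax attention weights over word embeddings produce a pooled sentence vector, and the reviewer's suggestion is that these same weights could also be reused during multi-modal fusion. All names and shapes are hypothetical.

```python
import numpy as np

# Generic word-attention pooling sketch (illustrative, not the paper's code):
# a query vector scores each word embedding, softmax turns the scores into
# attention weights, and the weighted sum gives a sentence representation.
rng = np.random.default_rng(1)

n_words, d = 5, 8
words = rng.standard_normal((n_words, d))   # word embeddings
query = rng.standard_normal(d)              # e.g., a video-derived query (hypothetical)

scores = words @ query                      # one relevance score per word
weights = np.exp(scores - scores.max())
weights /= weights.sum()                    # softmax attention weights, sum to 1

sentence = weights @ words                  # attention-pooled sentence vector
print(sentence.shape)  # (8,)
```

Reusing `weights` in the fusion step (rather than discarding them after pooling) is the kind of change the reviewer is asking the authors to consider.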
ORIGINALITY
The originality of the work is somewhat limited, as it ports the SSD approach for object detection to action localization. While I understand that encoding the sentence with the referring expression and adding the temporal dimension is novel, it is not a breakthrough idea either. (After the authors' feedback:) The authors have better explained the novelty of how the sentence is encoded and linked with the video. The proposed approach is more novel than I understood from my original review.

QUALITY
The reported results are state of the art on the presented datasets. However, as there are no comparative results in terms of frames/second, model size, or memory footprint, the results are somewhat incomplete.

CLARITY
The text is a bit tedious to follow, especially when describing the architecture.

SIGNIFICANCE
The significance of the work seems limited beyond setting a new state of the art on three benchmarks.