Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This work seems interesting, but I found the paper very confusing. Here are some comments and questions:
- The Introduction (and Figure 1 in particular) suggests that the paper addresses the joint problem of predicting the location (bounding box) of the object of interest and recognizing the object. However, the results (as far as I understood) only report mAP, i.e., the accuracy of predicting the bounding box of the object of interest. Is the class prediction also tested quantitatively?
- The model seems interesting, but I still do not completely understand the purpose of the Temporal Gear Branch. While the Spatial Gear Branch acts as an object detector, what is the expected contribution of the Temporal Gear Branch? The text seems to suggest that it acts as an attention module, but I do not see exactly how (particularly when the Self-Validation Module is removed in the ablation studies).
- In the qualitative results (Figure 3), it would be useful to also see how the baselines perform, for comparison. I assume the authors made some visualizations; perhaps they can discuss their observations.
== Originality ==
The main contribution of the paper is a novel idea (to my knowledge). Specifically, the self-validation module proposed to combine and update location and class labels is very interesting, and is more original than the typical combination of different task-specific architectures to obtain a new result. I am not an expert in the field, but the related work seems adequate.

== Quality ==
The model proposed in the paper seems technically sound, and each part has a role that is well justified and properly explained. The choices for the "backbones" of the different architecture parts (e.g., SSD300 for the spatial object detector) all make sense, and the loss functions are all well justified. One thing was not clear: does the validation module have any trainable parameters?

The experiments mostly show that the approach achieves a significant improvement over several baseline algorithms of varying complexity. It is interesting to note that the SSD baseline performs comparably to Mr Net, even though it is a much simpler model; the paper should have discussed this point in more depth. Table 2 also contains some interesting results. Specifically, the model that only uses RGB streams performs quite comparably to the full model, and is even superior at AP0.5. Though this is discussed in lines 265-272, the question of whether the optical flow input provides any value is not really addressed. The online detection and EK dataset results are both very interesting and impressive, and provide good evidence for the claims of the paper. Finally, I would be interested in seeing more qualitative results. Specifically, a discussion (perhaps including some images) of the types of mistakes the model typically makes, especially compared to the baseline approaches, would be very useful.

== Clarity ==
For the most part, the paper is well written and easy to follow.
A few relatively minor points:
* please fix grammatical errors (e.g., plural vs. singular nouns)
* the paper does not mention how many parameters the model has to learn
* as asked above, it is not obvious whether the self-validation module has any parameters
- The paper is well written and the model details are clear to the reader.
- The model is evaluated on two well-known datasets and compared against a handful of relevant baselines.
- Results suggest the model improves over, and advances, the previous state of the art.
- I believe it would be interesting for the reader to better understand the effect of the time window on the final computation.
- I believe the reader would benefit from more failure examples in the paper; the examples shown are all successes.