Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper is well-written and easy to follow. Technically it seems to be correct and, more importantly, the method is novel. This paper had the highest quality in my batch and I enjoyed reading it. The problem is well-defined and well-motivated, the distinction from related work is clear, and the solution is intuitive, novel, and effective on a real dataset. Weaknesses are listed in section 5.
• This work claims to be the first to perform space-time factorization in neural graph processing. However, [A][B] use a similar space-time factorization, in which separate spatial and temporal graph convolutions are performed. Considering this, the novelty of this work is weakened.
• The actor types, as well as the spatial layouts in both datasets used, are relatively rigid. More experiments on complex human-object interaction datasets, e.g., Charades, would help demonstrate the scalability of the adopted rigid region-split scheme. It would also be helpful to compare with existing space-time graphical modeling approaches, e.g., [B][C], on such datasets.
• Compared with previous space-time video modeling works, this work differs mainly in two components: it uses message passing instead of graph convolution, and it adopts a rigid region-split scheme rather than explicit object detection. There are no ablation studies on these two modules to shed light on which part actually brings the performance boost.
• How is the number of scales determined? Studies analyzing the correlation between performance and the number of scales are also missing.

[A] Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. Yan et al. AAAI 2018.
[B] Stacked Spatio-Temporal Graph Convolutional Networks for Action Segmentation. Ghosh et al. arXiv preprint.
[C] Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph. Tsai et al. CVPR 2019.

-------------------------
After rebuttal: Thanks to the authors for the comprehensive response. However, it would have been better if the authors had provided in the rebuttal the results that they promised to include in the final paper. I stand by my previous decision of "above acceptance threshold."
1. Although the space-time graph itself is not very novel, the proposed recurrent space-time graph is more principled than previous GCNs applied to videos. It fits within a network rather than being an independent module.
2. The paper is easy to read and understand.
3. The paper has the potential to be applied widely in the video processing community, conditioned on its computational efficiency.
4. I found no obvious faults in this paper. It is evaluated properly.