Sunday, December 8 through Saturday, December 14, 2019, at the Vancouver Convention Center
Summary: In this paper, the authors argue that motion is an effective way to bridge the gap between real and synthetic data. To that end, they use optical flow together with 2D keypoints as input to a network that estimates 3D pose from outdoor images. The system itself is relatively straightforward: they extend HMR with an LSTM and have it process batches of 31 frames containing 2D keypoints and flow. The data is taken from an existing dataset (SURREAL), extended to include realistic background motion and occlusions. The system is evaluated on the 3DPW dataset, where it outperforms other systems trained without real RGB data and performs similarly to the state of the art trained with real RGB. The authors also provide extensive ablation experiments and some qualitative results.

Positive: The paper is well written, reproducible, and the system it describes is simple. It contains a very complete and up-to-date description of the related work. Although the idea of using motion to bridge the gap between synthetic and real data is not new, implementing it in a deep network is, and achieving results comparable to the state of the art on outdoor images is new as well. Apart from comparing against a number of relevant systems, the experiments provide an ablative evaluation that informs the reader about which parts of the algorithm matter most.

Negative: As mentioned above, the idea is not extremely novel, but its application in a working, well-performing deep network is valuable, so this is a minor issue. Apart from that, there are only minor points:
- Which flow method is used? FlowNet is mentioned in the paper (line 62) but TV-L1 in the supplementary material (line 20).
- In the paragraph spanning lines 78-85, it seems that references [59] and [69] are swapped.
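The temporal setup described in the summary (per-frame 2D keypoints concatenated with optical-flow features, processed in 31-frame windows by a recurrent model) can be sketched as follows. This is an illustrative reconstruction for the reader's benefit, not the authors' code; the function names, keypoint count, and flow-descriptor dimensionality are all assumptions.

```python
# Illustrative sketch (NOT the authors' implementation) of the input pipeline
# the review describes: each frame contributes 2D keypoints plus a flow
# descriptor, and the model consumes sliding windows of 31 frames.
# N_KEYPOINTS and FLOW_DIM are assumed values for illustration only.

WINDOW = 31          # temporal window length mentioned in the review
N_KEYPOINTS = 14     # assumed number of 2D joints per frame
FLOW_DIM = 64        # assumed dimensionality of a pooled flow descriptor

def frame_features(keypoints_2d, flow_descriptor):
    """Concatenate 2D keypoints (x, y pairs) with a flow descriptor."""
    assert len(keypoints_2d) == N_KEYPOINTS
    assert len(flow_descriptor) == FLOW_DIM
    feats = []
    for x, y in keypoints_2d:
        feats.extend([x, y])
    feats.extend(flow_descriptor)
    return feats  # length 2 * N_KEYPOINTS + FLOW_DIM

def sliding_windows(sequence, window=WINDOW):
    """Split a video's per-frame feature list into overlapping 31-frame windows,
    each of which would be fed to the recurrent (LSTM) extension of HMR."""
    return [sequence[i:i + window]
            for i in range(len(sequence) - window + 1)]

# Toy usage: a 40-frame video yields 10 overlapping windows of 31 frames each.
video = [frame_features([(0.0, 0.0)] * N_KEYPOINTS, [0.0] * FLOW_DIM)
         for _ in range(40)]
windows = sliding_windows(video)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window is then a (31, D) block of purely image-derived quantities (keypoints and flow), which is the sense in which the system avoids consuming raw synthetic RGB at training time.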
Originality: The approach is a combination of existing approaches and ideas with a slight modification to enable temporal processing. Through citation of a large body of work, the authors make this very clear.

Quality: The paper is of high quality and thoughtfully written. I appreciated the detailed evaluation, with a thorough ablation study of both the model and the dataset.

Clarity: The paper is well and clearly written. The motivation is clearly stated, and the approach clearly described.

Significance: Showing that a model trained on simulated data performs on par with models trained on real data is an important contribution.

Various questions:
1) Fig. 5 is hard to understand and follow. Some images show detections and some don't? Why are some blue boxes empty?
2) The authors mention and compare against the Temporal HMR model; it would be important to clarify the difference from the Motion HMR model, as the two sound very similar.
3) The paper title is very general about 3D pose estimation. While I agree that the insights should be useful for training more general object tracking systems beyond humans, the paper only shows results for human 3D pose tracking. I would strongly encourage modifying the title to reflect the fact that the paper deals only with human 3D pose estimation.
Positives: The paper is well written and includes a thorough literature review. The following paper is also very relevant to the submission: Shrivastava, Ashish, et al. "Learning from simulated and unsupervised images through adversarial training." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

The novelty of the method over this prior work is not major. Still, I believe no one has shown that computing flow on simulated data and using it for training improves over RGB only (although the improvement is quite marginal). The simulation pipeline proposed in the paper seems quite useful; it improves over previous approaches that used simpler compositing operations.

Negatives: My main concern is the generality of the method:
1. It only applies to video data, which is a major limitation.
2. Using only simulated data (flow only), accuracy is still very far from the state of the art. The method requires additional information from RGB images in the form of 2D keypoints to improve the result (in fact, 2D keypoints alone yield almost the same performance), and the keypoint network is trained with supervised real data. This is not consistent with the paper's claim that the method relies entirely on simulated human data.

My second concern is whether the method is a viable solution to the sim2real problem. Flow only is only marginally better than RGB only (105.6 vs. 100.1), which casts doubt on motion computed from simulated data as a viable solution to the sim2real problem.

Overall, the paper has good points, but I believe the negatives above weigh more. I encourage the authors to address these concerns.

------

Revision after rebuttal: Going over my review, together with the other reviews and the rebuttal, I think the paper deserves a marginally-above-average rating, which I will revise to. Nevertheless, I stand by my original review points, particularly about the generality of the method: 1.
To clarify, one of the limitations I stated is the requirement for motion of the humans and/or camera, not merely the restriction to video input. Compared to many of the methods the paper is evaluated against (e.g., in Table 1: Martinez et al., DANN, or HMR), this is a limitation. There is no question that working on video data is an important problem. 2. I agree that the paper is not trying to hide that it uses real-data supervision for keypoint detection. However, this fact still weakens the main claim of the paper, which is that "… motion can be a simple way to bridge a sim2real gap". I appreciate that the paper presents an optimized pipeline combining real and synthetic data to obtain good 3D human pose estimation results; however, in my opinion, "… motion can be a simple way to bridge a sim2real gap" is a strong claim that is not strongly supported by the experiments presented in the paper. In addition, as also noted in the rebuttal, I would recommend rephrasing some of the claims about simulated data, such as in the abstract: "… on par with state-of-the-art methods trained on real 3D sequences despite training only on synthetic humans from the standard SURREAL dataset".