This submission tackles the problem of amodal, category-specific instance mask completion. To this end, the authors propose an interesting 3-stage training process for a variational autoencoder that maps partial masks to full masks, followed by resizing to match object sizes. Reviewers were divided on whether the curriculum training process represents an important contribution; I find it well designed, but it could be more clearly motivated in the text. The method is demonstrated both on the mask completion problem directly and, in combination with instance inpainters, on the instance completion problem in RGB pixel space. In rebuttal experiments, the authors also showed results (Tab 3, Fig 4) indicating that the method can produce diverse predictions in the occluded regions. While not all reviewers were convinced on this point, I found this result very helpful in evaluating the usefulness of the probabilistic predictions. The fact that the deocclusion results improve over a simpler approach is also an important strength.

This paper could still improve significantly in the camera-ready. An important drawback raised in the original reviews, whose treatment in the rebuttal remains somewhat contentious among the reviewers, is the question of why mask completion should be constrained to not use RGB information. The author response, somewhat unintuitively, shows a small drop in performance when RGB data is used; however, this is only one specific instantiation of how RGB information might be incorporated, and it does not rule out that RGB information is in fact valuable under a different approach. Another drawback that the authors have promised to address (but which remains pending) is that the number of categories evaluated in the manuscript is somewhat limited.
Finally, I would also suggest a few improvements: (i) a nearest-neighbor baseline that completes masks by matching to the closest mask in the synthetic dataset; (ii) an additional ablation of the pixel reconstruction loss in stage 2, to answer whether the latent-space loss alone suffices; (iii) explicit results demonstrating that separating training stages 1 and 2 actually improves performance; and (iv) a discussion of hyperparameters, such as the lambdas weighting the various loss terms and the finetuning schedule.
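To make suggestion (i) concrete, the baseline I have in mind could be sketched roughly as follows; the function name and the choice of IoU over the visible region as the matching criterion are my own illustrative assumptions, not anything specified in the submission:

```python
import numpy as np

def nn_mask_completion(partial_mask, train_partial_masks, train_full_masks):
    """Complete a partial instance mask by retrieving the full mask of the
    nearest training example, where "nearest" means highest IoU between the
    query's visible mask and each training partial mask.

    Illustrative sketch only: IoU matching is one possible choice of
    similarity; masks are assumed to be same-size binary arrays.
    """
    q = partial_mask.astype(bool).ravel()
    best_iou, best_idx = -1.0, 0
    for i, p in enumerate(train_partial_masks):
        p = np.asarray(p).astype(bool).ravel()
        inter = np.logical_and(q, p).sum()
        union = np.logical_or(q, p).sum()
        iou = inter / union if union > 0 else 0.0
        if iou > best_iou:
            best_iou, best_idx = iou, i
    # Return the full (amodal) mask paired with the best-matching partial mask.
    return train_full_masks[best_idx]
```

Even a simple retrieval baseline like this would help calibrate how much of the VAE's performance comes from learned shape priors versus memorization of the synthetic mask distribution.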