Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Post-rebuttal: Thanks to the authors for a thoughtful rebuttal. If the authors can find space for the qualitative results, I believe most of the reviewers' concerns have been addressed. I am, however, wary of adding additional experiments on the Pascal 2012 test set, even if they were technically done before the submission deadline.

The motivation for this approach is that per-pixel losses don't explicitly capture the structural differences between the shapes of the predictions and the ground truth. By comparing the mutual information between local patches of the prediction and the ground truth, the authors hope to encourage higher-order similarity. Ultimately, the results on Pascal 2012 and CamVid show that such an approach can lead to small but significant improvements in IoU over prior approaches based on CRFs and pixel affinities, without any added cost during inference. Along the way, they show how to approximately minimize this region-mutual-information-based loss, with appropriate downsampling to train the model efficiently.

One interesting potential advantage of this approach is that, since the loss is based on aggregate statistics over small regions, it may be more robust to small labeling errors. It would be interesting to test this by artificially perturbing the training data in various ways (coarser polygons, random shifts between the ground truth and the image, etc.). Along these lines, it would also be interesting to see the influence of different weighting factors (\lambda) between the cross-entropy and region-mutual-information terms. In particular, I'd be curious whether this loss can work without cross-entropy at all. Further, it would be interesting to consider a loss that is more sensitive to errors in shape or other higher-order errors; for example, something like the Hausdorff distance between object boundaries might be a suitable metric.
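The mechanism described above, lower-bounding the mutual information between corresponding local patches via covariance statistics, can be sketched as follows. This is an illustrative NumPy reconstruction under my own assumptions (the patch size, the log-determinant form of the bound, and all helper names are mine), not the authors' implementation.

```python
# Illustrative sketch (NOT the authors' code): represent each pixel by the
# d = k*k values of its local patch, then lower-bound the mutual information
# between ground-truth and prediction patches via log-determinants of the
# (conditional) covariance, Sigma_{Y|P} = Sigma_Y - Sigma_YP Sigma_P^{-1} Sigma_YP^T.
import numpy as np

def extract_patches(x, k=3):
    """Flatten every k x k patch of a 2-D map into a d-dim row vector (d = k*k)."""
    h, w = x.shape
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append(x[i:i + k, j:j + k].ravel())
    return np.array(rows)  # shape: (num_patches, k*k)

def rmi_loss_sketch(gt, pred, k=3, eps=1e-6):
    """0.5 * logdet of the conditional covariance of GT patches given
    prediction patches; minimizing this maximizes an MI lower bound."""
    Y = extract_patches(gt, k)    # ground-truth patch vectors
    P = extract_patches(pred, k)  # prediction patch vectors
    Y = Y - Y.mean(axis=0)
    P = P - P.mean(axis=0)
    n = Y.shape[0]
    cov_y = Y.T @ Y / n
    cov_p = P.T @ P / n + eps * np.eye(k * k)  # regularize for invertibility
    cov_yp = Y.T @ P / n
    cond = cov_y - cov_yp @ np.linalg.inv(cov_p) @ cov_yp.T
    _, logdet = np.linalg.slogdet(cond + eps * np.eye(k * k))
    return 0.5 * logdet
```

In training one would presumably combine this with cross-entropy as `total = ce + lam * rmi_loss_sketch(gt, pred)`, which is exactly where the \lambda ablation suggested above would plug in.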
1. The main concern with this work is that the benefits of using RMI are not presented well in the paper. I am not sure why practitioners would want to use the RMI loss for semantic image segmentation in practice, given only marginal improvements at additional computational cost.
2. It is not clear whether the assumption in (11) holds in semantic image segmentation experiments; there are no experimental results to support it.
3. I appreciate the detailed equations explaining the proposed RMI loss. However, given Equations 16 and 13, how one would implement the loss efficiently is not clear after reading the paper. It is also unclear whether the proposed approach works well with end-to-end training of deep neural networks at all.
4. It is not clear whether the proposed loss gives improvements with other base models, and how much it contributes compared with other tricks such as atrous convolution and various up-sampling modules.
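On point 3, one plausible ingredient of an efficient implementation is the downsampling step the meta-review mentions: pooling the maps before gathering region statistics shrinks the number of patches, and hence the cost of the covariance computation, by roughly the square of the stride. A minimal sketch, assuming simple average pooling (the paper may use a different scheme):

```python
# Hypothetical downsampling step before region-statistics gathering;
# the choice of average pooling is an assumption, not from the paper.
import numpy as np

def avg_pool(x, s=2):
    """Downsample a 2-D map by averaging non-overlapping s x s blocks."""
    h, w = x.shape
    h2, w2 = h - h % s, w - w % s              # crop to a multiple of the stride
    x = x[:h2, :w2].reshape(h2 // s, s, w2 // s, s)
    return x.mean(axis=(1, 3))
```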
- Originality: I think many people have tried to model the relationships between pixels in different ways, e.g. (as the paper mentions) CRFs. The current proposal of a region-based loss is both novel and intuitive to me.
- Quality: The paper is written to a decent standard. The overall style is biased toward being "technical", with clean mathematical definitions and down-to-earth practical details. The only thing missing for me is qualitative results for the experiments, which are essential here, since the paper focuses on the qualitative difference between a model that accounts for correlations between pixels and one that does not.
- Clarity: Although I haven't checked the mathematical formulations, the paper is clear enough in delivering the intuition behind the solution and the results.
- Significance: Judging from the quantitative results, I think the performance improvement is significant enough for a paper. I do find several things lacking:
-- The results for VOC 2012 are reported on the val set instead of the test set.
-- Qualitative results are lacking.
-- It would be nice to include more SOTA comparisons (I am not up to date on this, but is DeepLab v3+ still the state-of-the-art?).
-- Another important experiment would be to verify the performance of 1) RMI + CRF, 2) CRF + Affinity, and 3) Affinity + RMI. This is to see whether RMI already does the job that CRF/Affinity is doing, or vice versa, i.e. how much of RMI's contribution can be achieved by CRF/Affinity.
-- Another baseline I can think of is a "pyramid" of losses, e.g. applying the same CE or BCE loss to a pyramid of downsampled predictions. Would that help performance? Would it replace the functionality of RMI?
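The "pyramid of losses" baseline suggested above is easy to make concrete: apply the same per-pixel loss at several downsampled scales and sum the terms, so that coarser levels compare region averages rather than individual pixels. A hypothetical NumPy sketch (the function names and the choice of average pooling are mine, not from the paper):

```python
# Hypothetical pyramid-of-losses baseline: sum BCE over downsampled scales.
import numpy as np

def bce(p, y, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over the map."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def downsample(x, s=2):
    """Average-pool a 2-D map over non-overlapping s x s blocks."""
    h, w = x.shape
    h2, w2 = h - h % s, w - w % s
    x = x[:h2, :w2].reshape(h2 // s, s, w2 // s, s)
    return x.mean(axis=(1, 3))

def pyramid_bce(pred, gt, levels=3):
    """Sum BCE over a pyramid; coarser levels compare averaged soft
    predictions against averaged labels, penalizing region-level mismatch."""
    total = 0.0
    for _ in range(levels):
        total += bce(pred, gt)
        pred, gt = downsample(pred), downsample(gt)
    return total
```

Comparing this against RMI would show how much of RMI's benefit comes merely from matching multi-scale averages versus the richer patch statistics it models.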