Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
The proposed method for performing weakly supervised instance segmentation is an original and novel contribution, extending Mask R-CNN with a MIL loss term to enable learning of instance segmentation from only bounding box labels. In terms of quality, the experimental validation is thorough, including comparisons to two state-of-the-art baselines in weakly supervised instance segmentation as well as comparisons to fully supervised instance segmentation methods, which serve as upper bounds. Additionally, ablation studies are given to validate the use of box-level annotations against alternative baseline approaches, as well as to validate the different components of the overall proposed pipeline. The paper is also well written and easy to read. Overall, given the above, this is a nice paper with significance to anyone working on object detection/segmentation. --- After reading the author response, I believe the authors have addressed both my questions and the concerns raised by the other reviewers, and I maintain my initial rating. One comment based on the reviewer discussion: on my initial read, Figure 1 was slightly confusing because the positive and negative bags are illustrated as patches with greater than one pixel of height/width. Perhaps it would be worth adding a small note clarifying that the example bags are for illustrative purposes only, and that in practice single-pixel horizontal/vertical lines are used as bags.
This paper is well organized and the presentation is nice. The ideas are interesting and, I believe, could be helpful for future research in weakly supervised segmentation. The results are also good and convincing. Overall, this paper is well written. However, I still have the following questions. 1) The authors use max pooling over each bag to get the probability of the bag belonging to some category. I want to know how the results change when average pooling is used instead. Recent classification networks mostly exploit 2D average pooling to map a 3D tensor to a 1D feature vector, and according to attention models (e.g., CAM, CVPR'16), average pooling is more effective than max pooling at capturing large-range object attention. 2) This work may perform well for objects of medium or large size (as defined in COCO). My question is how the proposed approach performs on small objects.
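The max-vs-average pooling question above can be made concrete with a small sketch. This is not the paper's implementation; the bag scores below are illustrative values for a single hypothetical bag (one pixel line crossing a ground-truth box).

```python
import numpy as np

# Hypothetical per-pixel foreground scores along one horizontal line
# crossing a ground-truth box (values are illustrative only).
bag_scores = np.array([0.05, 0.10, 0.95, 0.90, 0.15])

# Max pooling: the bag is scored positive if its single most confident
# pixel is positive, regardless of how the other pixels score.
p_max = bag_scores.max()

# Average pooling: every pixel contributes equally, which rewards large
# activated regions but dilutes a single confident detection.
p_avg = bag_scores.mean()

print(p_max)  # -> 0.95
print(p_avg)  # ≈ 0.43
```

The contrast illustrates the reviewer's point: average pooling would push the network to activate broadly inside the box, while max pooling only requires one confident pixel per bag.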
The major issues of this paper are related to the motivation. Although it claims in L50 that this is the first end-to-end trainable algorithm that learns an instance segmentation model from bounding box annotations, this does not explain the value of such a problem well. If the motivation for using bounding box annotations to train instance segmentation is that such boxes are cheaper than boundary annotations, then there should be a study of performance versus annotation effort, e.g., in terms of annotation expense or total annotation time. This would answer whether weakly supervised instance segmentation achieves better performance than fully supervised segmentation given the same amount of annotation time/money. It is also possible that, for the same time/money, fine pixel-level annotation is better for training than coarse bounding box annotation. [R1] performs such a study for semantic segmentation and could serve as a reference. [R1] On the Importance of Label Quality for Semantic Segmentation, CVPR, 2018. Line 85: This is a misleading statement. For some real-world applications, pixel-level annotation is required. Moreover, it should be noted that non-proposal-based instance segmentation can be applied to other, more challenging tasks, such as C. elegans segmentation, where the objects are highly deformable and often entangled with each other. Proposal-based methods cannot handle such cases well, even though they perform well at segmenting oval-shaped objects. Eq. 4: How is epsilon set? Why does using Eq. 4 help enlarge the segment size? Why not train to classify all positive patches with positive labels? Why must an MIL loss be used, given that all the positive patches are, in some sense, positive? Line 100: There is no support for the claimed "efficiency" of the proposed method. In general, the argument for using MIL for weakly supervised instance segmentation is not persuasive.
The paper does not explain well why MIL works so well -- on some metrics it even outperforms the fully supervised Mask R-CNN. Given the results, an in-depth analysis is required to explain this advantage. ------------------------- The rebuttal provides answers to most of my questions. There are still a few concerns. 1) I still have difficulty understanding why MIL works so well with multiple sampled patches as positive/negative bags. From Fig. 1 and Eq. 3, I do not see how the MIL loss forces the model to choose a particular patch for the positive bag. Perhaps a visual demonstration could show this. The authors partially answer this in "The four questions about Eq. 4". 2) The answer provided in the rebuttal studying performance vs. annotation effort (time, money) is very important. That is one of the main motivations for weakly supervised learning -- if one has infinite money, then annotation for fully supervised learning is no problem. If this paper is finally accepted, I strongly suggest the authors include this in the main paper, especially given that the paper is submitted to a machine learning venue and includes little theory. Another minor concern is the claim around L85 that "in general, the applicability of the fully supervised methods 'may' be limited in the real world because of the high annotation cost". The phrase "general application" is ambiguous -- does it refer to production-oriented applications in companies, to the many in-need applications in biology/medical science, or to something else? That is why I would rather see performance vs. annotation cost used to motivate weakly supervised learning instead of this vague statement.
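Concern 1) above, about how the MIL loss "chooses" a patch within a positive bag, can be illustrated with a generic max-pooling MIL sketch (not the paper's Eq. 3; scores and names here are hypothetical). Under max pooling, the bag score is p = max_i s_i, so the gradient of a positive-bag loss -log(p) is non-zero only at the argmax pixel: only the single most confident pixel in each bag gets reinforced, which is the implicit selection mechanism.

```python
import numpy as np

# Hypothetical per-pixel scores within one positive bag.
s = np.array([0.2, 0.7, 0.4])

# Bag score under max pooling.
p = s.max()

# Gradient of the positive-bag loss -log(p) with respect to the
# per-pixel scores: -1/p at the argmax, exactly zero elsewhere.
grad = np.zeros_like(s)
grad[s.argmax()] = -1.0 / p

print(grad)  # only index 1 (the argmax) is non-zero
```

So in this generic formulation the model is never told which pixel to pick; training simply keeps strengthening whichever pixel currently scores highest in each positive bag, and suppressing all pixels in negative bags.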