NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 5679
Title: SNIPER: Efficient Multi-Scale Training

Reviewer 1


The paper presents an image-patch sampling protocol for effectively training the Faster-RCNN object detection model. The key idea is to choose the positive image regions adaptively, based on the positions of the ground-truth object boxes, and to employ a region proposal network to mine hard negative examples. This leads to fixed-size training batches that include object instances originally present at different resolutions. The main benefit of this adaptive approach is that the relatively small size of the training batches allows faster training (including of the batch-norm parameters) thanks to the smaller GPU memory requirements compared to the standard Faster-RCNN large-image training protocol. The paper reports convincing experimental results, the main point being that the proposed approach allows successful training of Faster-RCNN models in shorter time and with more moderate GPU memory requirements.
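To make the adaptive positive-region sampling described above concrete, here is a minimal sketch of one way such a greedy chip-covering step could look. This is my own illustration, not the authors' algorithm: the function name, the 512x512 chip size, the valid size range, and the "center a chip on an uncovered box, then drop every box it fully contains" heuristic are all assumptions chosen for clarity.

```python
import numpy as np

def sample_positive_chips(gt_boxes, image_hw, chip_size=512, scale=1.0,
                          valid_range=(0, 80)):
    """Greedily pick chip_size x chip_size crops (in the rescaled image) that
    cover every ground-truth box whose size at this scale is in valid_range."""
    h, w = int(image_hw[0] * scale), int(image_hw[1] * scale)
    boxes = np.asarray(gt_boxes, dtype=np.float32).reshape(-1, 4) * scale
    sizes = np.sqrt((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]))
    uncovered = list(np.flatnonzero((sizes >= valid_range[0]) &
                                    (sizes < valid_range[1])))
    chips = []
    while uncovered:
        # Greedy step: center a chip on one still-uncovered box, clamped to the image.
        seed = uncovered.pop(0)
        cx = (boxes[seed, 0] + boxes[seed, 2]) / 2
        cy = (boxes[seed, 1] + boxes[seed, 3]) / 2
        x1 = int(np.clip(cx - chip_size / 2, 0, max(w - chip_size, 0)))
        y1 = int(np.clip(cy - chip_size / 2, 0, max(h - chip_size, 0)))
        chips.append((x1, y1, x1 + chip_size, y1 + chip_size))
        # Any remaining valid box fully inside this chip no longer needs its own chip.
        uncovered = [i for i in uncovered
                     if not (boxes[i, 0] >= x1 and boxes[i, 1] >= y1 and
                             boxes[i, 2] <= x1 + chip_size and
                             boxes[i, 3] <= y1 + chip_size)]
    return chips
```

Running such a routine once per scale of the image pyramid would yield the multi-scale positive chips; boxes outside one scale's valid range would be picked up at another scale.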

Reviewer 2


#1. Summary
This paper introduces practical techniques, collectively called SNIPER, for speeding up the training of object detection networks based on Region Proposal Networks (RPNs). The idea is simple: instead of considering numerous object candidate boxes over the entire input image, SNIPER samples sub-images called "chips" and trains detection models to localize objects within those chips. Specifically, positive chips are generated so that they contain ground-truth object boxes within a certain scale range, and negative chips are sub-images that are likely to contain object proposals but no actual ground truth (a sketch of this mining step appears after this review). Since chips are normalized to a common small scale, object detectors can be made scale invariant (as in SNIP [24]) and the training procedure becomes more efficient and faster than before.

#2. Quality
1) Experimental evaluation is weak, and the advantages of SNIPER claimed in the manuscript are not empirically justified. First, SNIPER is compared only with SNIP, its previous version, and current state-of-the-art methods are not listed in the tables. Moreover, even though the speed-up of training is the most important and unique contribution of SNIPER, there is no quantitative analysis of that aspect at all.
2) Its originality is quite limited. Please see #3 below for more details.
3) The quality of presentation is not good enough. Please see #4 below for more details.
>> Thus, the overall quality of this submission seems below the standard of NIPS.

#3. Originality
The originality of SNIPER seems quite limited since it is an extension of SNIP [24]. Specifically, it can be regarded as a set of practices designed to speed up SNIP, which is slower than conventional pipelines for training object detectors due to its extensive use of image pyramids for scale-invariant learning. That is, the component that improves detection accuracy through scale-invariant learning is the contribution of SNIP, not of SNIPER.

#4. Clarity
The paper is readable, but sometimes difficult to follow due to the following issues:
1) The algorithm for drawing chips is not presented. The manuscript simply enumerates a few conditions for being a positive/negative chip, and no detail of the procedure is given.
2) Some techniques adopted by SNIPER are not well justified. For example, it is unclear what the advantages of the strategy introduced in lines 164-165 are, why a ground-truth box that partially overlaps a chip is not ignored, and so on.
3) Previous approaches are often not cited, and their abbreviations are not expanded either. For example: OHEM in line 169, SNIP in line 171, synchronized batch normalization in line 183, training on 128 GPUs in line 184, OpenImages v4 in line 195, and so on.
>> Due to the above issues, this submission does not look ready for publication. These issues also damage the reproducibility of SNIPER.

#5. Significance
1) Strength: The proposed techniques can significantly improve the efficiency of training deep object detection networks.
2) Weakness: The main idea of learning detectors with carefully generated chips is good and reasonable, but it is implemented as a set of simple practical techniques.
3) Weakness: This is an extension of SNIP [24] and focuses mostly on speed-up. Thus its novelty is significantly limited.
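For the negative-chip mining mentioned in the summary above, the following is a rough sketch of how proposals from a lightweight RPN could be turned into negative chips. Again this is an illustration, not the paper's procedure: the center-in-chip coverage test, the min_uncovered_props threshold, centering a chip on a seed proposal, and the lack of clamping to image bounds are all assumptions made for brevity.

```python
import numpy as np

def mine_negative_chips(proposals, positive_chips, chip_size=512,
                        min_uncovered_props=2):
    """Pick chips around RPN proposals that no positive chip covers: regions the
    weak RPN still fires on but that hold no valid ground truth, i.e. likely
    false positives worth keeping as negatives."""
    props = np.asarray(proposals, dtype=np.float32).reshape(-1, 4)
    cx = (props[:, 0] + props[:, 2]) / 2
    cy = (props[:, 1] + props[:, 3]) / 2
    # A proposal counts as "covered" if its center lies inside some positive chip.
    covered = np.zeros(len(props), dtype=bool)
    for x1, y1, x2, y2 in positive_chips:
        covered |= (cx >= x1) & (cx < x2) & (cy >= y1) & (cy < y2)
    remaining = list(np.flatnonzero(~covered))
    neg_chips = []
    while remaining:
        seed = remaining.pop(0)
        x1, y1 = cx[seed] - chip_size / 2, cy[seed] - chip_size / 2
        inside = [i for i in remaining
                  if x1 <= cx[i] < x1 + chip_size and y1 <= cy[i] < y1 + chip_size]
        # Keep the chip only if it absorbs enough uncovered proposals.
        if len(inside) + 1 >= min_uncovered_props:
            neg_chips.append((x1, y1, x1 + chip_size, y1 + chip_size))
        remaining = [i for i in remaining if i not in inside]
    return neg_chips
```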

Reviewer 3


Summary: This paper focuses on the training efficiency of SNIP (and of multi-scale training in general) and proposes to train on pre-cropped chips. The authors propose a greedy strategy to crop positive chips at different scales so as to cover the object bounding boxes, and use an RPN trained on positive chips only to mine negative chips that are likely to contain false positives. By training only on those chips, the proposed method can take advantage of multi-scale training without huge computation. It can also benefit from large-batch training and thus perform efficient batch normalization.

Strength:
+ The proposed method makes all training images the same small size (512x512). Detector training can therefore benefit from a large batch size and single-GPU batch normalization, which removes the overhead of synchronized batch normalization (see the toy sketch after this review).
+ During training, it proposes an effective sampling strategy, which reduces the cost of multi-scale training.
+ Elaborate experiments are carried out on the COCO and OpenImages datasets. Results show that the proposed method obtains better performance than SNIP when batch normalization is used.

Weakness:
- Long-range context may be helpful for object detection, as shown in [a, b]. For example, the sofa in Figure 1 may help detect the monitor. But in SNIPER, images are cropped into chips, so the detector cannot benefit from long-range context. Is there any idea to address this?
- The writing should be improved. Some points in the paper are unclear to me:
1. In line 121, the authors say that partially overlapped ground-truth instances are cropped. Is there any threshold for the partial overlap? In the lower-left panel on the right side of Figure 1, there is a sofa whose bounding box partially overlaps the chip but is not shown in a red rectangle.
2. In line 165, the authors state that a large object may generate a valid small proposal after being cropped. This is a follow-up to the previous question. In the upper-left panel on the right side of Figure 1, I would imagine the corner of the sofa would make some very small proposals valid and labelled as sofa. Does that distract the training process, since there may be too little information to classify such a small proposal as sofa?
3. Are the negative chips fixed after being generated by the lightweight RPN, or are they updated while the RPN is trained at a later stage? Would alternating between generating negative chips and training the network help performance?
4. What are the r^i_{min}'s, r^i_{max}'s and n in line 112?
5. In the last line of Table 3, the AP50 is claimed to be 48.5. Is this a typo?

[a] Wang et al. Non-local Neural Networks. In CVPR 2018.
[b] Hu et al. Relation Networks for Object Detection. In CVPR 2018.

-----
The authors' response addressed most of my questions. After reading the response, I would like to keep my overall score. I think the proposed method is useful for object detection by enabling BN and improving the speed, and I vote for acceptance. The writing issues should be fixed in later versions.
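To make the batch-normalization strength above concrete: below is a toy PyTorch sketch (my own illustration, not the authors' training code) of why fixed 512x512 chips permit ordinary per-GPU BatchNorm with a reasonably large batch, whereas variable-size full images typically force a per-GPU batch of only one or two and hence synchronized BN. The stub backbone and the batch of 16 chips are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

# Stub backbone: any BN-bearing network would do.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(16),          # plain per-GPU batch norm; no synchronization needed
    nn.ReLU(inplace=True),
)

# Because every chip has the same 512x512 resolution, many of them stack into
# one dense batch tensor, giving BN enough samples on a single GPU.
chips = [torch.rand(3, 512, 512) for _ in range(16)]   # 16 chips on one GPU
batch = torch.stack(chips)                              # shape: (16, 3, 512, 512)
features = backbone(batch)
```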