NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
The paper reads very well and manages to present both the challenges of NAS and the proposed idea in a very understandable form (although English grammar and spelling could be improved). The paper's main idea is to constrain the search space of NAS to the dilation factor of convolutions, so that the effective receptive field of units in the network can be varied while keeping the network weights fixed (or at least allowing the weights to be re-used and smoothly varied during the optimization). This idea is very attractive from a computational point of view, since it allows the notoriously expensive NAS process to make faster progress by avoiding the need for ImageNet pre-training after every architecture change. On the flip side, the proposed NATS method only explores part of the potential search space of neural architecture variations, so the longer-term impact will depend on how restrictive this choice of search space is: do we lose potential object detection performance by only exploring the dilation factor of convolutions, and would other architecture variations achieve larger improvements? This question is very hard to answer, since (if the paper is correct) no alternative NAS methods are available that can cope with the overhead of ImageNet pre-training.

Apart from this issue, the paper presents a very convincing experimental validation. It demonstrates consistent performance improvements across different backbones (ResNet50, ResNet101, ResNeXt101) and in combination with different detector heads (Faster-RCNN, Mask-RCNN, Cascade-RCNN, RetinaNet). An ablation study explores the influence of different numbers of channel groups (Tab. 2), channels per group (Tab. 3), and different dilation densities and aspect ratios (Tab. 5).

Questions:
- What is the computational cost of performing the optimizations presented in Tab. 4 and 5? Does each NATS row of the table correspond to 20 GPU days' worth of computation (L204), or were there larger variations for the different backbone architectures?
- As stated in L193, the architecture transformation search is performed for 25 epochs in total, and the architecture parameters are deliberately not updated for the first 10 epochs to achieve better convergence. I was surprised to see such a relatively low number of epochs here. Although one epoch corresponds to a pass over the full COCO training set, I would have expected more epochs to be necessary for optimization. Is this low number a restriction in practice?
- What was actually the outcome of the optimization? That is, in the optimized network architectures, which modifications ended up being selected? Is it possible to derive some general insights into what makes a good architecture for object detection? In particular, was there an observable trend in how dilation factors varied across the different layers of the network (and did this change for deeper architectures)?

Update: The rebuttal cleared up my remaining questions. I think this is good work that should be published. In the long run, the point R5 brought up about the necessity of pre-training would indeed be interesting to explore. However, even as it is, the paper already provides a very useful point of reference that future papers on NAS for object detection can refer to and compete against. I keep my vote to Accept.
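To make the weight-reuse property highlighted in this review concrete, here is a minimal sketch, assuming a PyTorch-style API: a convolution's weight tensor is independent of its dilation rate, so the same (pre-trained) kernel can be evaluated under several dilations without any reshaping. The tensors below are random stand-ins; this is an illustration of the property, not the authors' code.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)       # dummy feature map
weight = torch.randn(64, 64, 3, 3)   # stands in for a pre-trained 3x3 kernel

for dilation in (1, 2, 3):
    # padding = dilation keeps the spatial size of a 3x3 conv unchanged, so the
    # same weights can be evaluated (or smoothly mixed) under every dilation.
    y = F.conv2d(x, weight, padding=dilation, dilation=dilation)
    print(dilation, tuple(y.shape))   # (1, 64, 56, 56) for every dilation
```

Because the output shapes match, outputs under different dilations can be compared or interpolated directly, which is what makes the smooth weight re-use during the search plausible.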
Reviewer 2
- The proposed search space is new, and the model transformation method for practical NAS on the object detection task is interesting. However, the search pipeline is largely based on previous work on gradient-based NAS, which makes the novelty of the proposed method limited.
- The paper is well written and easy to follow. The authors clearly outline the problem they are trying to solve, and the experiments look reasonable.
- Experiments on various detectors clearly show the effectiveness of the proposed search method and search space.
- The efficiency of the discovered models is not discussed, although it is mentioned in line 63: "and keep the inference times almost the same". The log file provided in the supplementary material is also not clear enough. Since adding extra paths and dilated convolutions may introduce considerable computational overhead, it is important to provide a detailed runtime report on inference speed to further validate the efficiency.

Update: The rebuttal addressed my concern about the efficiency of the discovered models. Although I still think the technical contribution of this paper is limited, I agree with the other reviewers that the proposed practical NAS method for object detection can be a new and promising direction for future research. I think this is a good paper overall. Therefore, I would like to revise my score to 6.
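The kind of runtime report asked for above could, for instance, be produced with a sketch like the following, assuming PyTorch; `mean_latency_ms`, `baseline`, and `searched` are hypothetical stand-ins rather than anything from the paper or its supplementary material.

```python
import time
import torch

def mean_latency_ms(model, inp, warmup=10, iters=50):
    """Average forward-pass time in milliseconds over `iters` runs."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm up caches / lazy initialization
            model(inp)
        if inp.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(inp)
        if inp.is_cuda:
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0

# Stand-in models: a plain 3x3 conv vs. the same conv with dilation 2.
baseline = torch.nn.Conv2d(3, 64, 3, padding=1)
searched = torch.nn.Conv2d(3, 64, 3, padding=2, dilation=2)
inp = torch.randn(1, 3, 224, 224)
print(mean_latency_ms(baseline, inp), mean_latency_ms(searched, inp))
```

A real report would of course time the full baseline and searched backbones rather than single layers, ideally on the same GPU and input resolution used for the mAP numbers.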
Reviewer 3
- Originality: This paper addresses an important problem - NAS for object detection. In particular, it identifies the need for pre-training as a key obstacle towards NAS for detection. This is a new problem that has not been considered previously in NAS and object detection research.

- Quality: This paper has a good structure and a thorough literature survey. However, in some cases it seems to have omitted related works, especially in object detection. More specifically, this work claims object detection must use "pre-training". While this is mostly true, prior works on getting rid of pre-training should be acknowledged, for example: [1] Zhu et al., "ScratchDet: Training Single-Shot Object Detectors from Scratch", CVPR 19 oral. In the discussion, the paper also omits that Auto-Deeplab [22] essentially argues that running NAS from scratch on segmentation can match the performance of using a pre-trained model. While this may not be the case for object detection (although that is hard to believe, since segmentation is similar to object detection in many aspects), the current presentation, which seems to assert "pre-training" as "absolutely necessary", trivializes a more subtle topic.

- Clarity: The paper has good use of language in general. The motivation, literature survey and experiments sections are clear and informative. However, there are major issues in the presentation of the methodology:
  - Lines 148-149: the meaning of "i" is not clearly defined, although it is likely the channel index.
  - Equation 2: C_i^g and C_{out} are not defined.
  - Equation 4: "I^g" is not well defined. Also, ind_i is described in vague language rather than as a precise mathematical statement, which is essential for such a critical part of the presentation.
  - Equation 5: it is not clear how "I^g" affects "y^g"; this may again be due to the lack of a precise definition of "I^g".
  In general, with the help of Figures 2 and 3 and the simplicity of the method, the core idea - assigning different dilation rates at a per-channel level - is clear. However, the presentation is not clear enough for a reader to re-implement the method.

- Significance: The paper addresses an important question. The improvement is convincing, suggesting that this is a practical method for NAS in object detection. There are a few advantages of this method that make it likely to inspire future work: (1) It is quite efficient, as shown in the comparison in Table 1, although the efficiency is derived from the continuous relaxation in DARTS and is thus not an original contribution. (2) It does not introduce additional FLOPs, since it focuses on dilation operators. (3) It can use pre-trained models, which again stems from the use of dilation. However, there are important claims that are not justified, which limits the paper's significance. The paper proposes "channel-wise" search as an important contribution, yet no direct comparison with "path-level" search is provided. This should be possible, as the proposed method essentially adds degrees of freedom to the architecture parameters so that they become channel-wise. The lack of ablations in this regard limits the significance of this work.

UPDATE: The rebuttal fails to convince me that the discussion of pre-training is fair and complete. This is important but not critical to me. The clarification of the methodology and the comparisons with path-level search are, on the other hand, critical, and I see a much better case there. I revise my score up to 6.
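For concreteness, the reading of the method that this review converges on - per-channel-group dilation choices relaxed continuously in the spirit of DARTS, with all dilations sharing one kernel - might be sketched as follows. This is an interpretation under stated assumptions, not the paper's reference implementation; the class name `ChannelwiseDilationMix` and its hyper-parameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelwiseDilationMix(nn.Module):
    """Each channel group mixes the outputs of one shared 3x3 kernel applied
    under several dilation rates, weighted by a per-group softmax over
    architecture parameters (DARTS-style continuous relaxation)."""

    def __init__(self, channels, groups=4, dilations=(1, 2, 3)):
        super().__init__()
        assert channels % groups == 0
        self.groups, self.dilations = groups, dilations
        # One shared kernel; dilation does not change its shape.
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)
        # Architecture parameters: one logit per (channel group, dilation rate).
        self.alpha = nn.Parameter(torch.zeros(groups, len(dilations)))

    def forward(self, x):
        # Candidate outputs under each dilation, all reusing the same weight.
        outs = [F.conv2d(x, self.weight, padding=d, dilation=d)
                for d in self.dilations]                  # each: N x C x H x W
        probs = F.softmax(self.alpha, dim=-1)             # groups x num_dilations
        mixed_groups = []
        for g in range(self.groups):
            # Slice this group's channels from every candidate and mix them.
            chunks = [o.chunk(self.groups, dim=1)[g] for o in outs]
            mixed_groups.append(sum(p * c for p, c in zip(probs[g], chunks)))
        return torch.cat(mixed_groups, dim=1)

layer = ChannelwiseDilationMix(channels=64)
print(layer(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```

A path-level baseline, for comparison, would use a single softmax shared by all channels (alpha of shape `(1, len(dilations))`), which is exactly the ablation this review asks for.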