Review for NeurIPS paper: Auto Learning Attention

NeurIPS 2020

Auto Learning Attention

Review 1

Summary and Contributions: The paper proposes a NAS method for attention modules. Unlike traditional NAS, which focuses on searching connections and operations, this paper searches for high order attention. By modifying DARTS, the proposed method can complete the search within 1 hour. The results show improvements over various baselines on several tasks.

Strengths: 1. The idea is straightforward, and the results are solid and indeed surpass the baselines. 2. Searching for attention is an unexplored area of NAS, so the scope of the paper is novel. 3. The method improves results across different tasks, which shows the generality of the proposed method.

Weaknesses: 1. The search space is quite limited; important types of attention such as non-local cannot be incorporated into it. 2. The baseline comparisons are not fair. It is not clear whether the performance improvement comes from the group split and high order attention or from the searched architecture. The authors should provide 1) a random search baseline and 2) SE and CBAM baselines with group split and high order attention. 3. More ablation analyses on the choice of k are needed to demonstrate the balance between speed and performance. 4. The proposed method increases the FLOPs by about 20%. It is necessary to provide comparisons with baselines under the same FLOPs, obtained by scaling the width of the network.
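As a rough illustration of the FLOPs-matched comparison requested in point 4: for a plain convolutional network, FLOPs grow approximately quadratically with channel width, because both the input and output channels of each convolution are scaled. Under that assumption (an approximation that ignores the stem and classifier layers, and not something taken from the paper), the width multiplier a baseline would need in order to absorb a given FLOPs overhead can be estimated as below. The function name is hypothetical.

```python
import math


def width_multiplier_for_flops(overhead: float) -> float:
    """Estimate the channel-width multiplier that raises FLOPs by `overhead`.

    Assumes FLOPs scale quadratically with width, since both the input and
    output channels of each convolution are scaled together.
    `overhead` is relative, e.g. 0.20 for a 20% increase in FLOPs.
    """
    if overhead <= -1.0:
        raise ValueError("overhead must be greater than -1")
    return math.sqrt(1.0 + overhead)


# Width multiplier for a baseline that should match a 20% FLOPs increase.
multiplier = width_multiplier_for_flops(0.20)
print(f"width multiplier: {multiplier:.3f}")
```

Under this approximation, a 20% FLOPs overhead corresponds to widening the baseline by a little under 10%, which is the kind of scaled baseline the comparison would need.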

Correctness: Yes

Clarity: Mostly clear to me, with some abuse of notation. For example, the meaning of k differs between Eqns. 1 and 2. It would be better to separate the group index and the node index explicitly.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: It would be interesting to see results on stronger backbones, such as ResNet101 and ResNet152. Also, for the detection experiments, the COCO baselines are too low by current standards. I suggest the authors try stronger baselines such as RetinaNet/FasterRCNN with ResNet50/101-FPN. ================================ The rebuttal does not change my opinion of this paper. In particular, the ablation analyses show that the random baseline is indeed comparable with the proposed method, and that the performance improvement largely comes from group splitting. Thus I recommend rejection of this paper.

Review 2

Summary and Contributions: This paper presents a neural architecture search approach to search for an attention module, which can be plugged into various backbone networks. It proposes a new attention module called high order group attention and efficiently searches for its architecture via a differentiable method. Experiments show that the searched attention module works with various backbones and outperforms hand-designed attention on several vision tasks.

Strengths: - The idea of architecture search for attention modules is novel. Researchers have actively explored Neural Architecture Search as well as attention modules, but this paper is the first attempt to combine them. - The proposed high order attention module is interesting. Previous work has sufficiently explored first order and second order attention modules, including spatial and channel-wise attention modules, but higher order attention modules have not been explored to the best of my knowledge. - The approach produces good results on ImageNet classification using ResNet 18/34/50 backbones.

Weaknesses: - In Section 3.1, the logic of extending HOGA beyond second order is not consistent with the extension from first order to second order; i.e., second order attention creates one more intermediate state U compared to the first order attention module. However, in going from the second order to a higher order attention module, although intermediate states U0, U1 … are created, each is only part of the intermediate feature (concatenating them forms U with full channel resolution). In this way, it seems we could regard the higher order attention module as a special form of second order attention module. - The paper does not clearly explain the intuition as to why different channel groups should have different attention mechanisms; i.e., in what specific way the network can benefit from the proposed channel group specific attention module. - Experiments are not solid enough: 1. There are no ablation studies on the effect of the number of parameters, so it is not clear whether the performance gain is due to the proposed approach or to the additional parameters. 2. Although there is good performance on ImageNet classification with ResNet50/34/18, there are no results with larger models like ResNet101/152. 3. There are no results using strong object detection frameworks; the current SSD framework is relatively weak (e.g. Faster RCNN would be a stronger, more standard approach); it is not clear whether the improvements would be retained with a stronger base framework. - The proposed approach requires more FLOPs than the baselines; i.e., any performance gain comes with a large computational overhead (this is particularly pronounced in Table 3). - Table 3 shows ResNet32/56 but L222 refers to ResNet34/50, which is confusing.

Correctness: Yes, overall they are correct.

Clarity: Some parts are unclear, as listed under weaknesses.

Relation to Prior Work: Yes, the proposed work has two main related fields, attention mechanism and neural architecture search, and both are clearly discussed.

Reproducibility: Yes

Additional Feedback: The idea of architecture search for attention modules is novel and interesting, and the paper shows good performance gains on ImageNet classification. However, there are issues with the experiments and unclear explanations, as stated under "weaknesses". Thus, I currently lean toward a borderline reject for this paper. ===== Post-rebuttal comments: I read the other reviews and the rebuttal. The rebuttal addressed some of my issues regarding the experiments, but did not adequately address my concerns regarding unclear explanations. Therefore, I maintain my original "Marginally below the acceptance threshold" rating.

Review 3

Summary and Contributions: The paper considers searching for an attention module for visual recognition architectures. In particular, the authors define an attention module parametrization that covers a large number of attention module variants (including SE and CBAM). They then search for the optimal structure using a DARTS variant while keeping the rest of the network structure / backbone fixed. The resulting module achieves consistent improvements across datasets (CIFAR, ImageNet) and tasks (COCO detection and pose estimation).

Strengths: 1. The studied problem is interesting and relevant for the community. 2. Although the search is performed using a single backbone (ResNet-20) and a single dataset (CIFAR-10), the found module transfers to different architectures (e.g. ResNet-56), datasets (e.g. ImageNet), and tasks (e.g. COCO detection). 3. The found module achieves consistent and solid improvements (~1 point) across all tested datasets (CIFAR-10, CIFAR-100, ImageNet) and tasks (COCO detection and key points). 4. The baselines are reasonable (vanilla, SE, CBAM), and the training settings are clearly explained (Appendix 1) and fair (same across methods). 5. Qualitative visualization results are nice (e.g. Figure 4).

Weaknesses: 1. It is not very clear how exactly the attention module is attached to the backbone ResNet-20 architecture when performing the search. How many attention modules are used? Where are they placed? After each block? After each stage? It would be good to clarify this. 2. Similar to the above, it would be good to provide more details on how the attention modules are added to the tested architectures. I assume they are added following the SE paper, but it would be good to clarify. 3. Related to the above, how is the complexity of the added module controlled? Is there a tunable channel weight similar to SE? It would be good to clarify this. 4. In Table 3, the additional complexity of the found module is ~5-15% in terms of parameters and FLOPs. It is not clear if this is actually negligible. It would be good to perform comparisons where the complexity matches more closely. 5. In Table 3, it seems that the gains decrease for larger models. It would be good to show results with larger and deeper models (ResNet-101 and ResNet-152) to see if the gains transfer. 6. Similar to the above, it would be good to show results for different model types (e.g. ResNeXt or MobileNet) to see if the module transfers to them. All current experiments use ResNet models. 7. It would be good to discuss and report how the searched module affects the training time, inference time, and memory usage (compared to vanilla baselines and other attention modules). 8. It would be interesting to see the results of searching for the module using a different backbone (e.g. ResNet-56) or a different dataset (e.g. CIFAR-100) and to compare both the performance and the resulting module. 9. The current search space for the attention module consists largely of existing attention operations as basic ops. It would be interesting to consider a richer / less specific set of operators.

Correctness: The method is generally well-evaluated. However, it would be good to address the relevant points from the weaknesses section (4-7).

Clarity: The paper is generally well-written. However, it would be good to address the relevant points from the weaknesses section (1-3).

Relation to Prior Work: The paper discusses prior work clearly.

Reproducibility: No

Additional Feedback: Minor: - L184: Where does the number 91 come from? I believe COCO has 80 categories. - Table 3: I assume ResNet-56 is a typo and should be ResNet-50. Updated review: I thank the authors for the response. The rebuttal does not address my concerns regarding clarity and addresses some of my concerns about the empirical evaluation. Overall, I maintain my original recommendation.

Review 4

Summary and Contributions: The paper applies a neural architecture search algorithm (DARTS) to determine attention modules for various backbone networks. In particular, the authors first define high order group attention (HOGA), which can be represented as a directed acyclic graph (DAG), and present an attention search space that includes typical attentional operations such as SE and CBAM. The authors demonstrate that the searched attention module can be applied to various backbones as a plug-and-play component and outperforms previous attention modules on many vision tasks.

Strengths: It is the first attempt to apply NAS to automate attention module design. The authors define the novel concept of High Order Group Attention (HOGA), which generalizes previous attention modules (for example, SE and CBAM can be seen as first-order and second-order attention, respectively). The experiments are solid, and the proposed HOGA module performs better than existing SOTA approaches (e.g., SE and CBAM) on various benchmark datasets.

Weaknesses: The experimental comparisons are mainly conducted on ResNet backbones (differing only in depth). However, it is well known that feature representations can differ across backbone types, and the results can change with the choice of backbone. To show that the searched attention module is robust to such backbone variations, it is necessary to apply the module to other backbones such as WideResNet, Inception, DenseNet, and ResNeXt.

Correctness: The idea of continuous relaxation (equation 13) to ease the NAS optimization process is correct and good.
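For readers unfamiliar with the relaxation mentioned here, the following is a minimal sketch of the standard DARTS-style continuous relaxation: the discrete choice among candidate operations on an edge is replaced by a softmax-weighted sum of all candidates, and the strongest candidate is kept after the search. This is the generic DARTS formulation and is assumed, not confirmed, to correspond to the paper's equation 13. The candidate operations and all names below are toy placeholders.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over architecture parameters."""
    peak = max(logits)
    exps = [math.exp(a - peak) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x: float, ops: Sequence[Callable[[float], float]],
             alphas: Sequence[float]) -> float:
    """Relaxed edge output: softmax(alphas)-weighted sum of every candidate op."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


def discretize(alphas: Sequence[float]) -> int:
    """After the search, keep the index of the op with the largest alpha."""
    return max(range(len(alphas)), key=lambda i: alphas[i])


# Toy scalar stand-ins for candidate operations (hypothetical, for illustration).
candidate_ops = [lambda x: x, lambda x: 0.0, lambda x: 2.0 * x]
alphas = [0.0, 0.0, 0.0]  # equal alphas give every op the same weight

print(mixed_op(1.0, candidate_ops, alphas))
print(discretize([0.1, 2.0, -1.0]))
```

Because the mixture is differentiable in the alphas, they can be optimized by gradient descent alongside the network weights, which is what makes the search cheap compared with sampling discrete architectures.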

Clarity: It is well written and easy to understand.

Relation to Prior Work: Some important attention modules are missing. It is necessary to compare HOGA with the following recent attention modules. [1] GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (ICCV 2019) [2] Attention Augmented Convolutional Networks (ICCV 2019) [3] Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks (NeurIPS 2018)

Reproducibility: Yes

Additional Feedback: