NeurIPS 2020

Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search


Review 1

Summary and Contributions: This paper is an impressive upgrade of DNA (currently the best-performing NAS method), and it provides an interesting and promising new solution to the timely model ranking problem in NAS. I like this paper.

--------------post-rebuttal---------

The authors have answered all of my questions, and I am satisfied with the response. As we know, the development of NAS has slowed down in the past months due to inaccurate architecture rating, which leads to randomness in NAS results. This paper provides an exciting and promising new solution to the timely topic of architecture rating in NAS, and I think it deserves acceptance. I thank R3 for the insightful discussion on why knowledge distillation can improve architecture rating. How to guarantee accurate architecture ranking is in fact an active research topic that needs more discussion from the community in order to re-accelerate the development of NAS. DNA [6], FairNAS, and PCNAS all conclude that a smaller search space can lead to better architecture rating, and I have empirical experiments supporting this conclusion as well. R3's experiment of sampling models between --min_flops and --max_flops could also support it. I also have a theoretical finding which proves that narrowing the search space can enhance the supernet's generalization ability and result in sound architecture rating. The block-wise NAS with knowledge distillation in [6] modularizes the search space into smaller sub-spaces and improves architecture rating. FairNAS shows that if the architectures in the search space can be fairly and fully trained, architecture rating can be improved. I suspect that the knowledge distillation in this paper ensures that the architectures in the search space are sufficiently trained, which leads to a good architecture ranking. I appreciate the potential explanation from R3. To sum up, I think this is a good paper and I keep my high rating unchanged.

Strengths: This paper provides an interesting and promising new solution to the timely model ranking problem of NAS.

Weaknesses: As has been shown by several ICLR papers, the problem of architecture rating (ranking) is the main obstacle in NAS, and many existing NAS methods are not even better than random architecture selection. The ineffectiveness of NAS is due to inaccurate architecture ranking. Recognizing this, DNA addressed the architecture rating problem by using knowledge distillation. This paper improves on DNA by removing the third-party teacher, which is a great step. I like this paper very much. I have several minor comments on this paper. First, can the authors provide a theoretical analysis to show why distilling prioritized paths can lead to accurate architecture rating? Second, what is the validation set used by this paper? As we know, using the validation set to train the matching network would lead to an unfair comparison with competing methods. Third, DNA does not use AutoAugment, so the comparison with DNA is a bit unfair; please make this detail known in the text.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Please refer to "Weaknesses".


Review 2

Summary and Contributions: The paper proposes a one-shot NAS method aiming to address the problem of insufficient training of the supernet in traditional one-shot NAS. It proposes to maintain a 'high-performing network board' that keeps the best subnetworks found during the search on the fly. For each sampled subnetwork, the method finds the best-matched teacher subnetwork from the board through a meta-network and performs knowledge distillation on the sampled subnetwork using that teacher. Experiments show that the method achieves superior performance compared with previous NAS methods.
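As a concrete reference for the distillation step described in the summary above, the following is a minimal sketch of the kind of loss typically involved: hard cross-entropy on the labels plus a softened KL term against the matched teacher's logits. The function name, temperature T, and weight alpha are my own illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a distillation loss of the kind described above.
# The name, temperature T, and weight alpha are illustrative assumptions,
# not the paper's exact formulation.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=1.0):
    # Hard loss: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft loss: KL divergence between softened student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * hard + alpha * soft

# Toy usage with random logits for a batch of 8 samples and 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels).item())
```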

Strengths:
++ The motivation is sound and the method is intuitive.
++ The teacher network selection from the prioritized paths using a meta-network is novel and shows great benefit for search quality.
++ The experiments are extensive.
++ The paper is well written in general.
++ Code is provided.

Weaknesses:
-- Applying knowledge distillation (KD) to NAS is not novel, even for KD between subnetworks; see, for example, [1].
-- Due to the two paths and the meta-network, the search time and GPU memory requirements are not advantageous for the proposed method.
-- Apart from the 'random match' baseline, it would be good to show the performance of a 'random' baseline in Table 1, i.e. randomly sampled architectures.
-- The input to the meta-network is the difference between the sampled subnetwork and the subnetworks on the prioritized board. Did the authors try other types of input?
-- What is the structure of the meta-network? (An illustrative sketch of one possible structure follows this list.)
[1] FasterSeg: Searching for Faster Real-time Semantic Segmentation
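To make the last two questions concrete, here is one purely hypothetical structure consistent with the description that the meta-network's input is the difference between two architecture encodings: a small MLP mapping the difference vector to a matching score. The encoding length, layer sizes, and class name are my own assumptions for illustration, not the paper's actual meta-network.

```python
# Hypothetical illustration of a matching meta-network of the kind asked about above:
# a small MLP scoring the difference between two architecture encodings.
# The encoding length, layer sizes, and names are assumptions, not the paper's design.
import torch
import torch.nn as nn

class MatchingNet(nn.Module):
    def __init__(self, encoding_dim=20, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(encoding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, student_enc, teacher_enc):
        # Score how well a board (teacher) architecture matches the sampled student.
        return self.mlp(student_enc - teacher_enc).squeeze(-1)

# Toy usage: score one sampled architecture against a board of 10 candidates.
matcher = MatchingNet()
student = torch.randn(1, 20)          # encoding of the sampled subnetwork
board = torch.randn(10, 20)           # encodings of the prioritized paths
scores = matcher(student, board)      # broadcasts to 10 matching scores
best_teacher_idx = scores.argmax().item()
print(scores.shape, best_teacher_idx)
```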

Correctness: The claims are correct. The method seems correct, supported by extensive experiments.

Clarity: The paper clearly explains its motivation and methodology, although some details are missing, e.g. the structure of the meta network.

Relation to Prior Work: The differences between this work and most of its related works are clearly discussed. But I think reference [1] should be added and discussed as well.

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: In this paper, the authors propose a distillation method in which subnetworks within the supernet can teach each other during training. Prioritized paths are used to boost the training of the subnetworks. SOTA results are achieved.

Strengths: This article is neatly and clearly written, and the results are strong. The authors introduce a meta-network to select good architectures and use their logits as soft supervision; this point is novel. This paper belongs to the NAS area.

Weaknesses: The search space is not the same as that of the Google publications but is similar to Once-for-All: the SE ratio is 0.25 in this paper's code, the expansion rates are {4, 6}, and the maximum depth is 5 in every stage, which is slightly different. Thus, please report #params in Tab. 1.

L120: In this paper, the authors use 2K images as the validation set (L212) and use the validation loss to train the meta-network M. The authors claim that this step is time-consuming (L159), so how many iterations in total are used for updating M in this paper?

The Kendall rank is important, and I would prefer more results. BigNAS and Once-for-All behave like single-stage NAS, where distillation aims at improving performance; in this paper, the meta-network should solve the ranking problem, and this point should be emphasized. I could not directly find an answer to the question of why KD improves ranking. First, the authors use 30 architectures on ImageNet100; EcoNAS and RegNet might be cited, because they also use a few models to observe the phenomenon. In my experience, 30 randomly sampled models are almost always middle-sized with middle FLOPs, which cannot cover the global space, but this is enough to demonstrate something. Denoting the Kendall rank over all 30 models as k@30, I would prefer to also see k@10 and k@20, i.e. the rank over the 10 and 20 architectures with the highest accuracies. Even if the overall Kendall rank is not that high, a higher rank on the superior architectures would be better. (A toy sketch of computing k@N appears at the end of this Weaknesses section.)

Maintaining a set of architectures on the Pareto front is similar to CARS. In CARS, the authors observed the small model trap: according to Eq. 3, after initialization, only smaller models with higher accuracies will replace larger models with lower accuracies. Then where does a large model come from? For example, if the 600M model is not in the initialized set B, where would it come from? According to Eq. 3, will the architectures in B become smaller at every update?

L216: If the prioritized path board size is set to 10, how are the architectures in Tab. 5 selected in a single run? Or is the desired FLOPs set via --flops_minimum and --flops_maximum and the search run multiple times?

Please report the EfficientNet-b0 baseline under the training strategy introduced in L182-186. EfficientNet-b0 should be 77.3 in Tab. 4 and Tab. 5; this paper uses AutoAugment and EMA, so the comparisons should be made fair. Tab. 7: please compare with recent SOTA detectors, EfficientDet and Hit-Detector.

After reading this paper, I want an answer to one question: why does distillation help to rank the architectures? I know that distillation improves accuracy for single-stage NAS, e.g. [8, 9]; [6] indicates that different student networks prefer different teacher networks; [7] uses a pre-trained EfficientNet-B7 as supervision because the soft labels produced by B7 may lead to higher accuracy than one-hot labels. These all make sense because the largest subnetwork or the pre-trained network is strong enough. In this paper, 10 architectures are used to boost the training of the others. The subnetworks may improve their performance under soft labels, but all the searched architectures are retrained from scratch! How do the authors ensure that the subnetworks with high accuracies under soft labels (search) can still achieve high performance with one-hot labels (retrain)? Why does supervision from randomly initialized subnetworks help ranking? My guess at the answer is: 1. Soft supervision helps training and makes subnetworks converge faster than hard supervision for the same number of iterations, so we need teacher-student training (needs a figure or related works). 2. Different students prefer different teachers [6]; even the largest subnetwork [8] is not the best for every student, so we need many teachers (needs experiments). 3. Thus, we need to maintain a set of superior architectures (teachers). 4. The superior architectures are used to accelerate the training of different students, so the supernet converges faster (needs experiments). 5. The meta-network is used to match them. 6. The ranking would be higher for a better-converged model, as we all know. 7. Thus, the method accelerates the convergence of the whole supernet, which indirectly improves ranking. This is my guess, but I cannot find the whole logic chain in this paper. The search space is also different, so I cannot say whether the accuracy gain comes from a better search algorithm rather than a better search space.

[1] EfficientDet: Scalable and Efficient Object Detection
[2] EcoNAS: Finding Proxies for Economical Neural Architecture Search
[3] CARS: Continuous Evolution for Efficient Neural Architecture Search
[4] RegNet: Designing Network Design Spaces
[5] Hit-Detector: Hierarchical Trinity Architecture Search for Object Detection
[6] Search to Distill: Pearls are Everywhere but not the Eyes
[7] Blockwisely Supervised NAS with Knowledge Distillation
[8] BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models
[9] Once-for-All: Train One Network and Specialize It for Efficient Deployment

After rebuttal: I have read the comments from all reviewers and the author feedback. There are still some concerns. 1. [minor] Search space. The search space is different: EfficientNet uses an SE ratio of 0.04-0.05, not the 0.25 stated in the rebuttal, and the comparison considers FLOPs and accuracy without #params. Nearly all NAS papers have their own space, so this is not a big deal. 2. [major] Selection. There actually is a small model trap; if not, why use the min_flops parameter? The authors know that smaller models have lower accuracy and that a smaller model may replace a larger model according to the update equation in the paper. 3. [major] Efficiency. The method is somewhat inefficient: it needs 12 GPU days to search for one model, whereas other methods, like SPOS or OFA mentioned in this paper, search once for numerous models. 4. [minor] EfficientNet-b0 in Tab. 4 and 5. According to https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet, the baseline using AA is 77.1. I am not sure why the baseline here is lower, as all the tricks are used; I have reimplemented it and can achieve 76.8 w/o AA. 5. [major] Novelty. This is not the first paper that introduces KD to NAS, and I still don't know why KD helps ranking; I can only guess that KD helps convergence. This paper uses KD in searching and does not use KD in retraining. Another point: the models on the board are between --min_flops and --max_flops. I have examined this: if SPOS is trained by sampling models between --min_flops and --max_flops, the ranking is better than when sampling models from the global space. So sampling from a subspace also helps to rank the models in that subspace; this may be another factor that improves ktau besides the KD. Overall, the authors addressed some of my concerns, and I will raise my score to reach a consensus.
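As a concrete reference for the k@N metric requested above, the following toy sketch computes Kendall's tau restricted to the N architectures with the highest ground-truth (retrained) accuracies. The accuracy and score numbers are made up for illustration only; SciPy is assumed to be available.

```python
# Toy sketch of the k@N metric discussed above: Kendall's tau restricted to the
# N architectures with the highest ground-truth (retrained) accuracies.
# The numbers below are made up for illustration only.
from scipy.stats import kendalltau

def kendall_at_n(true_accs, predicted_scores, n):
    # Indices of the n architectures with the highest true accuracy.
    top = sorted(range(len(true_accs)), key=lambda i: true_accs[i], reverse=True)[:n]
    # Kendall's tau between true and predicted rankings on that subset.
    tau, _ = kendalltau([true_accs[i] for i in top],
                        [predicted_scores[i] for i in top])
    return tau

true_accs = [75.1, 76.3, 74.8, 77.0, 75.9, 76.8, 74.2, 75.5, 76.1, 73.9]
supernet_scores = [60.2, 62.1, 59.8, 62.5, 61.0, 61.8, 59.0, 60.5, 61.4, 58.7]
print(kendall_at_n(true_accs, supernet_scores, 5))   # k@5 on this toy set
print(kendall_at_n(true_accs, supernet_scores, 10))  # k@10 (all models here)
```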

Correctness: The concerns are detailed in weaknesses.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: This paper clearly discusses its relation to previous works.

Reproducibility: Yes

Additional Feedback: Please refer to correctness and weakness.


Review 4

Summary and Contributions: This paper considers the problem of architecture search for computer vision under certain constraints (e.g. FLOPs), specifically on image classification. Unlike gradient-based architecture search (e.g. DARTS), this paper considers one-shot neural architecture search, where the idea is to sample and train one subnetwork (a single path within the hypernetwork) at a time. Technically, the authors consider two ideas to improve training efficiency and performance: distillation, and training a meta network for picking the teacher network (this differs from existing work that uses pre-trained models as teachers). Distillation: during training, a caching mechanism (the path board in the paper) is maintained to store the best subnetworks so far, and these are further used to distill information to the new subnetwork, potentially accelerating the training process; the path board is updated on the fly by better subnetworks. The meta network is designed to replace the evolution approach of previous work; the goal is to train a matching network that can assign a teacher (from the path board) for distilling information to the new subnetwork. Update (post-rebuttal and discussion with other reviewers): there was broad consensus that this paper makes a good contribution, so I maintain my positive score.
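As an illustration of the caching mechanism described above, a prioritized path board can be thought of as a fixed-size buffer of the best-performing paths seen so far, where a newly evaluated path replaces the worst entry when it performs better. This is my own minimal sketch under that assumption, not the authors' implementation (it also omits the FLOPs bounds used in the paper).

```python
# Minimal sketch of a 'path board' caching mechanism of the kind described above:
# a fixed-size buffer keeping the best subnetworks (by validation score) seen so far.
# This is an illustrative assumption, not the authors' implementation.

class PathBoard:
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.entries = []  # list of (validation_score, path_encoding)

    def maybe_insert(self, score, path):
        # Fill the board first; afterwards replace the worst entry if the new path is better.
        if len(self.entries) < self.capacity:
            self.entries.append((score, path))
        else:
            worst_idx = min(range(len(self.entries)), key=lambda i: self.entries[i][0])
            if score > self.entries[worst_idx][0]:
                self.entries[worst_idx] = (score, path)

    def paths(self):
        return [p for _, p in self.entries]

# Toy usage: insert a few (score, path) pairs and inspect the surviving paths.
board = PathBoard(capacity=3)
for score, path in [(0.61, "A"), (0.58, "B"), (0.64, "C"), (0.66, "D"), (0.55, "E")]:
    board.maybe_insert(score, path)
print(board.paths())  # ['A', 'D', 'C'] -- 'B' replaced by 'D', 'E' never enters
```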

Strengths: The paper is clearly written and easy to understand. The ablation study clearly demonstrates the effectiveness of the proposed methods, and superior performance is shown on ImageNet and COCO detection when comparing with strong baselines under different FLOP constraints, e.g. EfficientNet and MobileNet. Utilizing other subnetworks is more convincing than the concurrent works that use pre-trained models as teachers for distillation.

Weaknesses: The novelty is somewhat limited. The work leans more towards engineering than a principled idea: it includes several small components, each of which plays some role in improving performance.

Correctness: Yes, the authors have experimentally shown the effectiveness of the proposed methods, but it remains somewhat unclear to me why the weights trained to optimize one path can be re-used for other paths; I suspect this is a question for all of these one-shot architecture search approaches.

Clarity: The paper is well written and easy to understand.

Relation to Prior Work: This paper is built on the previous approach SPOS [1]; the authors have clearly illustrated this and shown that the proposed approach outperforms SPOS in terms of both performance and search cost. [1] Guo et al., "Single Path One-Shot Neural Architecture Search with Uniform Sampling," arXiv:1904.00420, 2019.

Reproducibility: Yes

Additional Feedback: The authors have provided code for reproducing their experiments; although I did not try it myself, I assume the code works.