NeurIPS 2020

Theory-Inspired Path-Regularized Differential Network Architecture Search


Review 1

Summary and Contributions: This paper addresses an important problem in previous gradient-based NAS methods (e.g., DARTS): the discovered architectures are often dominated by skip connections, which usually leads to poor performance. The authors provide a theoretical analysis of the intrinsic reasons behind this issue, revealing that architectures with more skip connections converge faster in DARTS. Inspired by this, the authors propose a novel path-regularized method, including a differential group-structured sparse binary gate on each operation and a path-depth-wise regularization, to better train the gradient-based NAS problem. Experiments on CIFAR-10/100 and ImageNet demonstrate the promising performance of the proposed method.

Strengths: This paper provides a theoretical analysis of why DARTS prefers skip connections in the discovered architecture, and then proposes a path-regularized method to solve this issue. The paper is well written and is valuable in terms of both theoretical analysis and algorithm implementation. Specifically, it shows theoretically that the convergence rate of gradient-based NAS algorithms (e.g., DARTS) depends much more heavily on the weights of skip connections than on those of other operation types; this makes architectures with more skip connections converge faster than other candidates, and such architectures are therefore selected by DARTS. The authors address this issue by avoiding unfair operation competition (between skip connections and other operations, such as convolutions) and encouraging exploration in the search space, implemented by a differential group-structured sparse binary gate and a path-depth-wise regularization, respectively. Code is provided.
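For readers unfamiliar with such gates, the following is a minimal, hypothetical sketch of a per-operation differentiable binary gate combined with a group-wise sparsity penalty; the hard-concrete parameterization, the skip/non-skip grouping, and all names here are my own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BinaryGate(nn.Module):
    """Relaxed (differentiable) binary gate for one candidate operation.

    A minimal sketch based on a hard-concrete relaxation of a Bernoulli
    variable; the exact parameterization used in PR-DARTS may differ.
    """
    def __init__(self, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(1))  # gate logit (architecture parameter)
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            # reparameterized sample so gradients flow to log_alpha
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # stretch and clip toward {0, 1} while keeping gradients
        s = s * (self.zeta - self.gamma) + self.gamma
        return s.clamp(0.0, 1.0)

def group_sparsity_penalty(gates_skip, gates_nonskip, lam_skip, lam_nonskip):
    """Penalize skip gates and non-skip gates as two separate groups, so the two
    groups do not compete directly for the same softmax mass (illustrative only)."""
    g_skip = torch.sqrt(sum(g() ** 2 for g in gates_skip) + 1e-12)
    g_conv = torch.sqrt(sum(g() ** 2 for g in gates_nonskip) + 1e-12)
    return lam_skip * g_skip + lam_nonskip * g_conv
```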

Weaknesses: 1. Can the authors clarify the choice of the learning rate \eta? For example, \eta should be on the order of 1/\sqrt{m}/h^3/\lambda, where m denotes the channel number, h the depth, and \lambda the smallest eigenvalue of the data matrix. In most cases this setting gives a very small learning rate, which is not consistent with practical settings. 2. I suggest the authors move more derivations to the supplement to avoid using too much vertical spacing. This would also help better express the key ideas in the main text.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I have read the authors' rebuttal, which addressed my concerns well with further clarified technical details; therefore, I keep my initial rating of strong accept.


Review 2

Summary and Contributions: This paper aims to resolve the issue of skip connections dominating the architectures found by DARTS-based NAS methods. Specifically, the authors first theoretically analyze the intrinsic reasons for the dominating skip connections by studying their convergence behavior in a DARTS cell. Based on the theoretical results, the authors propose a path-regularized DARTS method that regularizes the competition between skip connections and other operations. It is worth noting that the proposed method is theoretically guaranteed. However, the experimental part could be further improved to make it more convincing.

Strengths: 1. The authors theoretically prove that DARTS prefers skip connections to other types of operations (i.e., convolution and zero), since more skip connections lead to faster convergence. 2. Based on the theoretical analysis, the authors propose a path-regularized DARTS method (namely PR-DARTS), which aims to avoid direct competition between skip connections and other types of operations. 3. All components in the proposed PR-DARTS are theory-inspired and have theoretical guarantees. Experimental results also demonstrate the effectiveness of PR-DARTS.

Weaknesses: 1. The authors theoretically prove that “more skip connections lead to faster convergence” and “shallow cells benefit from a faster convergence rate than deep cells”. Is there any experimental evidence to verify these claims?
2. In the theoretical analysis of Section 3.2, the authors do not consider the pooling operation, even though it is also dominated by skip connections. Do pooling operations also have a slower convergence rate than skip connections?
3. In lines 175-178, the authors mention that the skip connection in the shared path and the convolution in the private path can benefit the Gram matrix singularity of networks, so the convergence rate can be greatly improved. It would be stronger to provide empirical results verifying these claims.
4. Do depth-wise separable convolutions and standard convolutions share the same convergence results described in Theorem 1?
5. In Section 3.2, the authors assume that the activation function is $\mu$-Lipschitz and $\rho$-smooth. However, in operations.py, the authors take ReLU as the activation function, which is not $\rho$-smooth, so the assumption may not hold. More discussion of this issue should be provided (see the note after this list).
6. Several state-of-the-art NAS methods should be compared in the paper, including [1] CARS: Continuous Evolution for Efficient Neural Architecture Search, CVPR 2020; [2] Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search, ICML 2020; [3] Neural Architecture Search in A Proxy Validation Loss Landscape, ICML 2020.
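To make point 5 precise, the standard definitions behind the assumption, and why ReLU violates the second one, are as follows (my notation, not necessarily the paper's):

```latex
% An activation \sigma is \mu-Lipschitz and \rho-smooth if for all x, y:
\[
|\sigma(x)-\sigma(y)| \le \mu\,|x-y|,
\qquad
|\sigma'(x)-\sigma'(y)| \le \rho\,|x-y| .
\]
% ReLU(x) = \max(0, x) is 1-Lipschitz, but its derivative \mathbf{1}\{x > 0\}
% jumps from 0 to 1 at x = 0, so no finite \rho satisfies the second condition:
% ReLU is not \rho-smooth. A smooth surrogate such as softplus,
% \sigma(x) = \log(1 + e^{x}), satisfies both conditions with \mu = 1 and \rho = 1/4.
```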

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: -------After Rebuttal--------------- The authors have addressed all my concerns. This paper is of good quality and I still recommend accepting it.


Review 3

Summary and Contributions: This paper provides a theoretical perspective on the degenerate architectures searched in the DARTS space, where skip connections dominate. Based on that, it proposes a differential group-structured sparse binary gate to reduce the direct competition between skip and non-skip connections. Furthermore, it proposes a regularization on cell depth to favor deeper models. Experiments in the DARTS space on CIFAR-10/100 and ImageNet confirm the effectiveness of the proposed regularizations. =============Post rebuttal================ After reviewing the authors' response, I confirm this is a strong submission and keep my original rating.

Strengths: - It contains quite thorough derivations of the relationship between supernet convergence and the architecture parameters of different types of operators (e.g., skip, zero, and conv layers), which show that skip connections tend to get larger weights. - The use of an individual binary gate at each candidate operator is interesting, and Theorem 3 is not only sound in theory but also provides a solid foundation for designing the regularization loss on group sparsity and model depth. - Experiments on CIFAR-10/100 and ImageNet are properly designed, and the results are quite competitive.

Weaknesses: I am mostly happy with both the theoretical and experimental aspects of this work and did not find any obvious weakness.

Correctness: Yes.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: This paper deals with the problem of network architecture search. Building on the previous work DARTS, it proposes theoretical analyses and corresponding regularization methods to solve the dominating skip-connection problem in DARTS and improve its performance. The contributions of this paper are as follows: (1) it provides rigorous theoretical analyses and explanations of why skip connections dominate in DARTS; (2) based on the theory, it proposes a regularized group-structured sparse binary gate method with corresponding theoretical analyses to alleviate the dominating skip-connection problem; (3) it then proposes a path-depth-wise regularizer with corresponding theoretical analyses to encourage searching deep architectures.

Strengths: The strengths of this paper are as follows: (1) Rigorous theoretical analysis, which has rarely been investigated before, offering theoretical reasons for the dominating skip-connection behavior. (2) A convincing modification of DARTS based on a great combination of intuitive explanation, theoretical analysis, and detailed empirical results and evaluations. (3) The contribution of the theoretical analysis is novel and fundamental. (4) This paper is relevant to the NeurIPS community since NAS is a key issue in artificial intelligence.

Weaknesses: The weaknesses of this paper are as follows: (1) The explanation of Theorem 1 is somewhat intuitive, and some points are confusing: 1. The illustration of \lambda_s by the connection path X(0)→X(1)→···→X(s)→X(h−1). From my understanding, \alpha_{t,2}^{(s)} should represent the arrow from X(t) to X(s), while \alpha_{s,3}^{(h−1)} should represent the arrow from X(s) to X(h−1). So I believe that for a single \lambda_s, the connection paths should be X(0)→X(s), X(1)→X(s), ..., and X(s)→X(h−1); the illustration of \lambda_s by a single connection path is therefore unclear. 2. The explanation about the residual information learned by the convolution layers is unclear as well. (2) The main theoretical analysis is based on the mean-square-error loss assumption, which might not be practical in reality.

Correctness: The claims and methods are mostly correct. However, there might be some over-claims in some parts. In most parts, the paper conducts theoretical analysis to prove that some settings (like dominating skip connections and shallow networks) enjoy faster convergence rates, but it does not rigorously show the phenomenon described in the paper, that “some cells that fast decay the loss Fval(W, β) at the current iteration have competitive advantage over other cells that reduce Fval(W, β) slowly currently but can achieve superior final performance”. This is an intuitive and reasonable explanation, since such a setting might lead the network to be trapped in some local minima, but the descriptions in the paper appear to be over-claimed.

Clarity: The paper is mostly well written. However, there is room for improvement: (1) As mentioned in the weaknesses, it might be helpful to detail the explanation from Theorem 1 to the reasons for the dominating skip connections; some explanations in the paper might be confusing for the audience. (2) It might help readers understand the theorems if brief summaries (or key steps) of the proofs were discussed in the main paper rather than only in the appendix, since the proofs in the appendix are relatively long and readers might lose patience checking all of them.

Relation to Prior Work: The paper clearly discusses the differences between itself and previous contributions. The theoretical analysis of DARTS's problem is novel.

Reproducibility: Yes

Additional Feedback: Some questions/suggestions:
(1) Is the mean-square-error loss assumption necessary for the whole theoretical analysis? Can any other loss (e.g., cross-entropy) satisfy the theorems in the paper? (See the note after this list.)
(2) How do you choose the hyperparameters \lambda_1, \lambda_2, \lambda_3 in the experiments? They are important for the experiments.
(3) Some additional baseline experiments could be conducted to extend the ablation study, such as an explicit regularization function without the independent stochastic gates, which might show the importance of competition or cooperation among operations.
(4) It would be clearer to bold the best experimental results in all tables.
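For concreteness on question (1), the two losses being contrasted are, in my own notation (the paper's exact formulation may differ):

```latex
% Mean-square-error loss assumed in the analysis:
\[
F(W,\beta) \;=\; \tfrac{1}{2}\sum_{i=1}^{n}\bigl\|f_{W,\beta}(x_i)-y_i\bigr\|_2^2 .
\]
% Cross-entropy alternative, with softmax outputs p_i = softmax(f_{W,\beta}(x_i))
% and one-hot labels y_i:
\[
F(W,\beta) \;=\; -\sum_{i=1}^{n} y_i^{\top}\log p_i .
\]
```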