Review for NeurIPS paper: Does Unsupervised Architecture Representation Learning Help Neural Architecture Search?

NeurIPS 2020

Review 1

Summary and Contributions: The authors propose a simple yet effective unsupervised architecture representation learning method for neural architecture search. They demonstrate that pre-training architecture representations helps to build a latent space that is smoother w.r.t. architecture performance. In addition, the pre-trained architecture representations considerably benefit the downstream architecture search.

Strengths: (1) This paper is well written and easy to follow. The authors have clearly presented their approach. (2) The motivation of this paper is easy to understand. (3) The authors have provided a clear implementation of their approach, along with the code needed to re-implement it. The reviewer has checked their code and re-run their experiments. (4) The authors have conducted extensive experiments to confirm their contributions.

Weaknesses: (1) How is the variational graph isomorphism autoencoder designed? (2) In the reinforcement learning module, how did the authors define the observation? (3) How did the authors train the searched network architecture? Was it trained from scratch, or in some other way?

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes, the authors have clarified the differences from previous papers.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper learns architecture embeddings without access to the architectures' training accuracies. To this end, the authors rely on a reconstruction objective that perturbs the adjacency matrix by making it undirected, encodes it along with the operations into a code vector, and then tries to recover the original directed adjacency matrix and operations from the code vector. These code vectors are then used to augment architecture search methods (Bayesian optimization and reinforcement learning) in three search spaces (NAS-Bench-101, NAS-Bench-201, and NASNet). The methods that use the pretrained embeddings achieve slightly better performance than other competitive baselines.
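The reconstruction objective summarized above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`symmetrize`, `bce`, `reconstruction_loss`), the toy 4-node cell, and the use of binary cross-entropy for both the adjacency matrix and the one-hot operations are all hypothetical choices, and the encoder and decoder networks themselves are omitted.

```python
import numpy as np

def symmetrize(adj):
    """Turn a directed 0/1 adjacency matrix into an undirected one (A + A^T, clipped to {0, 1})."""
    return np.clip(adj + adj.T, 0, 1)

def bce(target, prob, eps=1e-9):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def reconstruction_loss(adj, ops_onehot, adj_prob, ops_prob):
    """Assumed loss: recover the original *directed* adjacency matrix and the operations."""
    return bce(adj, adj_prob) + bce(ops_onehot, ops_prob)

# Hypothetical toy cell: 4 nodes, 3 operation types.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 0]])
ops = np.eye(3)[[0, 1, 1, 2]]        # one-hot operation per node

encoder_input = symmetrize(adj)      # what the encoder would see
# Stand-ins for decoder outputs: a perfect guess and an uninformative one.
perfect = reconstruction_loss(adj, ops, adj.astype(float), ops)
uniform = reconstruction_loss(adj, ops, np.full(adj.shape, 0.5), np.full(ops.shape, 0.5))
```

In this sketch the encoder only ever sees `encoder_input`, while the loss is computed against the directed `adj`, which matches the reviewer's description that the model must recover edge directions from the code vector.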

Strengths: The authors have an interesting hypothesis: that pretrained embeddings learned purely from the structure of the networks can help with architecture search. The experimental results in Section 4.2 are compelling, showing that the proposed algorithms consistently yield improvements in the architectures found across the three search spaces.

Weaknesses: Despite the experimental exploration in Section 4.1, it is not clear to me what the pretrained embeddings are capturing about the architectures and why they ought to be good for architecture search. There is no supervision coming from performance, so why would the embeddings be useful? The performance improvements are modest.

Correctness: The claims and methods appear to be correct.

Clarity: Yes, although a few aspects could be improved (see feedback).

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: Why does the color scale in Figure 4 contain jumps (red and black)? It wasn't clear to me what the authors were trying to convey in this figure (that the colors are somehow better separated when using reconstruction embeddings than when using supervised embeddings?). Observations 1-5 in Sections 4.1 and 4.2 could be named more aptly to reflect the nature of the observation in each paragraph. Is the supervised learning approach one that simply tries to predict \hat A and \hat X directly, without going through the variational autoencoder? This should be written down more explicitly (e.g., in a sentence); I couldn't find an explicit discussion of this part of the model. I also didn't find it clear how the pretrained embeddings are used with BO and RL for architecture search (e.g., where are these embeddings passed to the model? How are they used for intermediate states of the controller in RL? Is the code vector computed from just the partial adjacency matrix filled in so far?). Perhaps this could be explained more explicitly. Would fine-tuning the embeddings during search, based on the performance of the architectures, improve the results further? --- I've read the author response and updated my score accordingly.

Review 3

Summary and Contributions: This paper and some recent work [1] [2] share a similar motivation: the way each architecture is encoded may have a significant effect on the performance of NAS algorithms. This paper studies the effect of architecture representations on NAS through an extensive empirical study. It shows that by incorporating some architecture properties into pre-training (in this work, structural similarity), networks in the given search space with similar accuracy are clustered together rather than scattered randomly. This improves the predictive performance of the latent architecture embeddings, which may further benefit neural architecture search. [1] Graph Structure of Neural Networks. ICML 2020. [2] A Study on Encodings for Neural Architecture Search, arXiv 2007.04965.

Strengths: This paper shows that optimization on a smoothly changing performance surface in the latent architecture representation space can lead to a more efficient downstream architecture sampling process, while joint optimization of architecture representation learning and search fails to do so. To construct such a space, it pre-trains the architecture representation with a structure-level reconstruction loss using standard variational Bayesian methods. In this way, the graph structure of the network is embedded in the latent space, under the assumption that similar network structures may have similar accuracies. Figure 4 in the paper validates this assumption: in most cases, clustering similar network structures groups networks with similar performance together rather than scattering them randomly, and modelling on such a latent space would be easier for different downstream architecture search methods than joint optimization of architecture representations and search. The NAS results using two representative search methods on different search spaces are aligned with the paper's main claim.
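For reference, a structure-level reconstruction loss trained with standard variational Bayesian methods usually takes the form of an evidence lower bound. The expression below is a sketch of that generic objective, not a formula quoted from the paper: A and X are taken to be the adjacency and operation matrices (following the \hat A and \hat X notation used in Review 2), Z is the latent code, and the prior p(Z) is assumed to be a standard normal.

```latex
\mathcal{L}(\phi, \theta)
  = \mathbb{E}_{q_\phi(Z \mid A, X)}\bigl[\log p_\theta(A, X \mid Z)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(Z \mid A, X) \,\|\, p(Z)\bigr)
```

The first term rewards accurate reconstruction of the architecture from its code, and the second term is the KL regularizer whose effect Review 4 asks the authors to justify.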

Weaknesses: Like this work, many recent NAS approaches perform architecture search in a continuous representation space, and gradient descent (GD) is one of the most commonly used search methods in that setting. However, this work only considers RL and BO as the search algorithms. I am curious to know how competitive GD is compared to RL and BO. In Figure 4, I am not sure why the right plot has more blank space than the left plot. An explanation is needed to help readers understand the insights behind those plots. There are many different ways to describe the similarity between two neural networks. While this work focuses on structural similarity, I suggest the authors take a look at the concurrent works [1, 2], which focus on graph relation similarity and computation similarity, respectively, and include a discussion of them. Lastly, this work adopts an approach similar to many current NAS methods, where the search space is based on a cell or a ResNet block with a relatively small number of nodes. The authors therefore choose reconstruction of the cell/block as the learning objective. This is technically reasonable but somewhat simplifies the learning task. I would suggest looking at recent progress on unsupervised learning in the NLP domain [3]. [3] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, ICLR 2020

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: This work shares similar motivations with [1, 2] and a concurrent unsupervised NAS approach [4]. The paper provides a short discussion of [4] but lacks a discussion of [1] [2]. [4] Are Labels Necessary for Neural Architecture Search? ECCV 2020.

Reproducibility: Yes

Additional Feedback: Add extra experiments using GD and compare it with RL and BO. Provide a better explanation of Figure 4 (right). Add a discussion on how this work differs from [1] [2].

Review 4

Summary and Contributions: The authors propose a graph VAE framework for NAS. They argue that the proposed approach is scalable and benefits architecture sampling. They conduct a series of experiments to support these claims.

Strengths: The proposed idea is straightforward and reasonable. Graph VAE models have been validated to be effective in many graph-related tasks. The experiments are of good quality. Both the qualitative and quantitative results look significant and convincing.

Weaknesses: My major concern is the novelty of this work. Although graph VAEs may not have been used in NAS before, similar ideas have been proposed for other graph tasks, e.g. https://arxiv.org/pdf/1802.03480.pdf and https://grlearning.github.io/papers/118.pdf . Also, the effectiveness of the KL term is not fully justified.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: