Review for NeurIPS paper: On the Modularity of Hypernetworks

NeurIPS 2020

On the Modularity of Hypernetworks

Review 1

Summary and Contributions: The authors compare the computational complexity of two conditioning models for neural networks: embedding-based and hypernetwork-based. The paper develops an extension of optimal nonlinear approximation theory to neural nets, and to the aforementioned conditioning models in particular. In a series of elegant theorems, the authors prove advantages of the hypernetwork approach over embeddings: modularity and reduced complexity as the network size increases.

Strengths: The paper presents estimates of network complexity for the two conditioning methods in a highly technical and compelling elaboration of optimal nonlinear approximation theory. The conclusions have potentially substantial impact. The paper is well organized and relatively lucid, given the density of the presented details. It offers a guide for understanding the advantages of hypernetworks and thus sheds light on modular meta-learning methods in general. The authors make an effort to present their argumentation clearly.

Weaknesses: The choice of the embedding network seems a bit ad hoc and not necessarily optimal. First, embeddings usually refer to mappings computed by shallow nets (such as word2vec), since in many applications adding more layers does not improve embedding performance. On the other hand, deeper networks can help learn highly performant *representations*, but that is often achieved using additional loss functions, such as triplet loss. Would these observations change the conclusions of the paper? The paper is extremely dense in technical detail, and it would be helpful if the authors found a way to make their approach more accessible and present it in the introduction. To be fair, a number of theorems are preceded by intuitive explanations, but this is not done systematically. Some symbols are used without being defined first. The abstract is surprisingly terse and not illuminating. Broader context, as in the conclusion section, would help readers appreciate the significance of this work.

Correctness: The technical approach and methodology appear to be sound and solid.

Clarity: The paper is well written, but a substantial amount of technical detail could go to the supplemental material, and more effort could be made to provide intuitive clarifications.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper theoretically compares and analyses the complexity of embedding-based models and the hypernetworks, when the aim is to model a function that has two different inputs: one that is the input from the dataset and one that is a conditioning signal. In particular, the authors compare the minimal parameter complexity needed to obtain a certain error in each of the two alternatives.
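The two model families compared in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the `tanh` activation, and all function and parameter names are assumptions chosen for illustration. The embedding-based model concatenates an embedding of the conditioning signal I with the input x and applies a network with fixed weights, while the hypernetwork uses a network f to produce the weights of the primary network g from I.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def linear(params, v):
    """Apply a dense layer: W v + b."""
    w, b = params
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]


def tanh_vec(v):
    return [math.tanh(t) for t in v]


def embedding_model(params, x, cond):
    """Embedding-based model: embed I, concatenate with x, apply a fixed-weight net."""
    e = tanh_vec(linear(params["embed"], cond))
    hidden = tanh_vec(linear(params["g_hidden"], x + e))
    return linear(params["g_out"], hidden)[0]


def hypernetwork_model(params, x, cond, hidden_dim):
    """Hypernetwork: f maps I to all weights of the primary network g, then g(x)."""
    x_dim = len(x)
    theta = linear(params["f_out"], tanh_vec(linear(params["f_hidden"], cond)))
    # Unpack theta into g's first-layer weights, first-layer biases, output weights.
    k = hidden_dim * x_dim
    w1 = [theta[r * x_dim:(r + 1) * x_dim] for r in range(hidden_dim)]
    b1 = theta[k:k + hidden_dim]
    w2 = theta[k + hidden_dim:k + 2 * hidden_dim]
    hidden = tanh_vec(linear((w1, b1), x))
    return sum(wi * hi for wi, hi in zip(w2, hidden))


rng = random.Random(0)
x_dim, cond_dim, emb_dim, hidden_dim = 3, 2, 4, 5  # illustrative sizes only
emb_params = {
    "embed": init_linear(cond_dim, emb_dim, rng),
    "g_hidden": init_linear(x_dim + emb_dim, hidden_dim, rng),
    "g_out": init_linear(hidden_dim, 1, rng),
}
# Number of weights the primary network g needs for one conditioning signal.
n_primary = hidden_dim * x_dim + 2 * hidden_dim
hyper_params = {
    "f_hidden": init_linear(cond_dim, 8, rng),
    "f_out": init_linear(8, n_primary, rng),
}

x, cond = [0.1, -0.2, 0.3], [1.0, 0.5]
y_embedding = embedding_model(emb_params, x, cond)
y_hyper = hypernetwork_model(hyper_params, x, cond, hidden_dim)
```

In this sketch the primary network g has no trainable parameters of its own; everything it uses is emitted by f, which is the structural difference the complexity comparison is about.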

Strengths: The discussed problem is very interesting and gives an interesting insight about the reason for the good results obtained by the hypernetworks in many tasks. Moreover, the proposed analysis deeply explores the theory related to the two considered model types.

Weaknesses: The main weakness of the paper is the presentation. In particular, the first part of the manuscript is not pleasant to read, and the flow of the discussion is hard to follow. In fact, the problem tackled in the proposed work is not clearly presented until the end of the introduction. The first part of the paper also contains many references to papers and theorems discussed in subsequent sections, which makes the discussion even harder to follow. In this regard, the manuscript refers to [5] as a fundamental starting point for the proposed analysis, but the discussion of what [5] proposes is very limited (a few lines in Section 1.1) and should be significantly extended. Moreover, in the introduction, some important claims are made without any supporting reference (e.g. “Modularity suggests that the primary network g whose weights are given by f(I) has the same minimal complexity as required for g_I′ in the worst case.”, “However, in hypernetworks, often the layer before the output one is a bottleneck”). In Section 1.1 the hypernetwork is defined as a model “using one RNN to generate the weights of another RNN, called the primary network, which performed the actual task”. In my opinion that is not completely correct, since in [David Ha, Andrew Dai, and Quoc V Le. Hypernetworks] the model is defined for both CNNs and RNNs/LSTMs in order to “relax” the weight-sharing constraint, so it is not limited to sequential models. Note also that in the base case the meta-network (the one that generates the weights for the primary network) takes x_t and h_{t-1} as input, and not a conditioning signal. As for Sections 2, 3, and 4, the lack of readability is also due to the mathematical notation, which is not always clear. In the first part of Section 3 the concept of “accuracy” is used without any further explanation.
The same applies to the "N-width framework" (discussed in [5]) and to BV() and C^1 in Theorem 1. Furthermore, the authors should indicate that all the proofs are reported in the supplementary material. In general, also taking into account the amount of information reported in the supplementary material (and its relevance), in my opinion this work is more suitable for publication in a journal than in a conference, where the number of allowed pages is limited.

Correctness: The theoretical part seems correct, even if some points have to be clarified. As for the experiments, the reported results (in particular the “synthetic experiments” and “experiments on Real-world Dataset”) are questionable, since the two models (hypernetwork and embedding-based network) were tested using similar settings, but it is not clear how the authors chose the hyper-parameters of the models. For instance, a particular learning rate, regularization, dropout rate, etc. can significantly impact the capability of a model. It is therefore important to report how these hyper-parameters were validated, in order to ensure a fair comparison between the two models (using the same hyper-parameters for both models is not a fair way to perform the experiments). This is also needed in order to ascribe the observed phenomena to the hyper-parameters varied in the tests (number of layers, embedding dimension). It should also be stated that some figures (Fig. 4) are reported in the supplementary material and not in the paper.
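The validation protocol requested here can be sketched as follows: each model is tuned separately over the same hyper-parameter grid, and only the best configurations are compared. This is a minimal illustration, not the authors' procedure. The `tune` helper, the grid values, and the two validation-error functions are hypothetical stand-ins; in practice each function would train the corresponding model and return its error on a held-out validation set.

```python
import itertools


def tune(validation_error, grid):
    """Return (best_config, best_error) over the Cartesian product of the grid."""
    names = sorted(grid)
    best_config, best_error = None, float("inf")
    for values in itertools.product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        err = validation_error(config)
        if err < best_error:
            best_config, best_error = config, err
    return best_config, best_error


# Synthetic stand-ins (assumptions, not measured results): each mimics
# "train the model with this config and return its validation error".
def embedding_val_error(cfg):
    return (cfg["lr"] - 0.01) ** 2 + cfg["dropout"]


def hypernet_val_error(cfg):
    return (cfg["lr"] - 0.001) ** 2 + abs(cfg["dropout"] - 0.1)


grid = {"lr": [0.1, 0.01, 0.001], "dropout": [0.0, 0.1]}
best_embedding, err_embedding = tune(embedding_val_error, grid)
best_hypernet, err_hypernet = tune(hypernet_val_error, grid)
```

With these synthetic error functions the two models prefer different configurations, which is the reason a shared configuration can bias the comparison.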

Clarity: No, the paper is complex to read, and some parts turn out to be very confusing.

Relation to Prior Work: The literature review is exhaustive, but some important works that are fundamental to understanding the proposed analysis (in particular [5]) should be discussed more deeply.

Reproducibility: Yes

Additional Feedback: --- Update after author response --- Since my review differs significantly from the ones provided by the other reviewers, I would like to discuss the points that the authors correctly highlighted in their response. The first one is the clarity of the paper. I have re-read the manuscript and, honestly, the first part in particular (Section 1) is very confusing to me. My feeling is that the many references to concepts and theorems discussed in subsequent sections make the flow of the discussion very hard to follow. In this regard (and related to the "Prior work" point), only lines 83 to 87 are dedicated to explaining the idea proposed in [5] and the extension of it proposed in the paper. In my view, a deeper explanation is required of a base work that is crucial to understanding one of the central contributions of the paper (as stated in the introduction). As for the methodology of the experiments, the authors' response still does not explain how the hyper-parameters were chosen. They report an interesting study in which a single hyper-parameter (the learning rate) is varied, which is very useful in terms of analysis. But, in order to make a fair comparison between the models, full validation of each model's hyper-parameters is required (the interaction between the various hyper-parameters could significantly influence the results). The authors show the results of tests varying one specific hyper-parameter while using a very specific set of values for the others. That introduces a bias (how much are the reported results influenced by the other hyper-parameters? Are they valid in general, or only for a specific subset of hyper-parameter values?). In general, the authors said in their response that they will address many points related to the presentation in the next version of the paper.
For this reason, and taking into account the other reviews, I raise my overall score from 5 to 6. But, in my opinion, the paper does not deserve a higher score because:
- A paper that discusses complex topics, like this one, should be presented very clearly, and to me this is not the case.
- A good part of the contribution of this paper lies in the proofs, which are in the supplementary material. This paper has 18 pages of supplementary material, so "shrinking" it into an 8-page conference paper is not very reasonable, in my humble opinion.
- The experimental methodology is not completely fair/correct.

Review 3

Summary and Contributions: The main contribution of the paper is a theoretical analysis showing that the overall number of trainable parameters in a hypernetwork is much smaller than the number of trainable parameters of a standard neural network. The theoretical analysis is supported by a set of experiments that confirm the theoretical part.

Strengths: The paper addresses an interesting topic: the advantages of hypernets over conditioning. It provides a theoretical justification that, under certain conditions, the target model can be smaller than the model with a conditioning factor by orders of magnitude. To my knowledge, it is the first paper that addresses this aspect of hypernets.

Weaknesses: In the theoretical considerations the authors assume approximation of a function y: X x I -> R. Currently, hypernetworks are successfully applied to generative models, such as flows, where the mapping is I x R^K -> R^K. The question arises whether the theory scales to a multidimensional representation of y. Some empirical evaluation for such cases would also be beneficial. The authors validate Assumption 1 in the experimental part, but Assumption 2 is not investigated in the experiments. From a practical point of view, this assumption may not be satisfied in some corner cases. The conclusions of the theoretical considerations are supported by the experimental part and are consistent with the intuition about conditioning vs. hypernets. However, from a practical perspective, what matters most is the generalization capability of both approaches on unseen cases. It would be beneficial to evaluate whether the lower number of parameters required for training corresponds to better generalization of the model to unseen cases.

Correctness: I did not study the proofs in the supplement carefully, but I did not notice any errors in the theoretical or practical methodology.

Clarity: Paper is clear and easy to follow.

Relation to Prior Work: The authors extend the theory of Ronald A. DeVore, Ralph Howard, and Charles Micchelli (Optimal nonlinear approximation. Manuscripta Math, 1989) and adapt it to theoretical considerations about the optimal capabilities of hypernets vs. conditional models.

Reproducibility: Yes

Additional Feedback: ############# UPDATE ############# The authors answered all of my concerns in the rebuttal. In general, I am satisfied with the provided answers. The example used for the generalization case is a bit unfortunate, as it does not show any difference between hypernets and conditioning, but the authors promised to elaborate more in the revised version. I keep voting for acceptance.

Review 4

Summary and Contributions: The paper aims at explaining the success of hypernetworks. It compares hypernetworks with embedding methods, focusing on the complexity, expressed as the number of trainable parameters. In the adopted theoretical model, hypernetworks have significantly lower complexity, as they manifest a certain degree of modularity.

Strengths: The paper presents theoretical analysis of the problem, and follows with experimental evidence that supports the claim.

Weaknesses: The experimental part is hard to follow and poorly ordered: technical details are mixed with the description of the experiments, and high-level overviews are lacking to begin with. Figure 4 is missing, and Figure 1 is not referenced in the text (are these the same figure?)

Correctness: The theoretical framework is adequate. My theoretical understanding of neural networks is limited and I'm unable to verify the correctness.

Clarity: The paper is written clearly with a coherent narrative. I would suggest placing the definitions of crucial symbols on their own centered lines, especially those used in the later parts of the paper. In the experimental part, frequently inlined technical details make the reading difficult.

Relation to Prior Work: The prior contributions seem adequate.

Reproducibility: Yes

Additional Feedback: