NeurIPS 2020

Network size and size of the weights in memorization with two-layers neural networks

Review 1

Summary and Contributions: Given an arbitrary dataset of N points, the paper shows the existence (via an explicit construction) of two-layer NN architectures that can memorize the dataset. The contribution is threefold: 1. The authors memorize using an almost optimal (up to log factors and multiplicative constants) number of ReLU neurons arranged in a two-layer network. 2. They extend to real-valued outputs instead of the binary outputs dealt with in prior work. 3. They also provide a construction (using complex numbers) that can both memorize the dataset and have small weights. ====== Post rebuttal and discussion ===== I am happy with the authors' answers to my questions. I am increasing my score accordingly!

Strengths: The main strength of the paper lies in its technical ideas and proof techniques. I found the idea of using finite differencing to express gradients, and correspondingly hyperplanes, to embed real-valued labels quite interesting. Similarly, using complex numbers to cancel out higher-order moments is fascinating. I enjoyed reading the proofs, and I think they can be of general interest in the design of NN architectures and non-linear activations.
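To make the finite-differencing idea concrete (my own paraphrase of the construction around the paper's Eq. (3), not the authors' exact statement): a scaled difference of two neurons with nearby first-layer weights approximates a directional-derivative neuron,

```latex
f_{u,v,b}(x) \;=\; \lim_{\delta \to 0}
\frac{\sigma\big(\langle u + \delta v, x\rangle + b\big)
      - \sigma\big(\langle u, x\rangle + b\big)}{\delta}
\;=\; \langle v, x\rangle \, \sigma'\big(\langle u, x\rangle + b\big),
```

so each such pair of neurons realizes a linear function \langle v, x\rangle modulated by the gate \sigma'(\langle u, x\rangle + b) (for ReLU, \sigma' is a step function), which is what lets real-valued labels be embedded via hyperplanes.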

Weaknesses: In the following, I am grouping together both weaknesses and questions. I would appreciate some feedback on them:
1. It is not clear whether SGD or GD would be able to find the optimal set of weights for the architectures developed in the paper. I think it is important that the construction be learnable using gradient-based methods (to avoid dependence on d). Do you think that is the case here?
2. The authors use the word "boosting" in multiple places (abstract, in the discussion before Lemma 1, etc.). The boosting analogy is not clear to me. What is the underlying game?
3. Notion of error: This work uses a multiplicative error model to define memorization (e.g., in Lemma 1). Why? The gap between the multiplicative and the additive model can be as bad as a factor of n, i.e., \eps_{additive} = \eps_{multiplicative} \times n. Would the provided constructions be optimal in the additive model of memorization error as well?
4. The proof of Lemma 1 does not seem to be complete (and there is no full proof in the appendix). I was not able to see it through. Can you please elaborate?
5. The proof of Proposition 1 was very cute!
6. Please elaborate on the similarities/differences between the current work and Ji and Telgarsky 2020.
7. In Proposition 2, it is not clear why the activation should be non-polynomial. Please elaborate.
8. The authors do not give a clear definition of the NTK regime anywhere in the paper or the appendix. I had to skip this section while reviewing because I could not understand what the term means.
9. What is the role of the parameter m in Theorem 4?
10. Is the bound on W(f) in Eq. (21) in Theorem 6 optimal?

Correctness: The proof and technical ideas seem to be correct.

Clarity: The paper was clear (except for Section 4 where the background was not well established). Please take a look at my comments in the weakness section for more details.

Relation to Prior Work: Discussed well. Would appreciate more comments comparing this work to Ji and Telgarsky 2020.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper considers memorization of data using neural networks, aiming for the smallest total weight rather than just the smallest number of neurons. Indeed, the paper (approximately) memorizes n well-separated data points with total weight of order \sqrt{n}, and shows that this is optimal.

Strengths: The question considered in the paper is interesting and relevant, and the results are both elegant and sharp in most parameters of interest. The main technical ideas are (1) incrementally building a network one neuron at a time in order to iteratively reduce the approximation error, and (2) complex weights, which make high-order Taylor-expansion terms disappear in expectation (these terms arise in the analysis of the correlation between a neuron and the data). The fact that the expectation under a complex perturbation of the weights is zeroed out exactly is quite nice; yet, as the paper points out, this is not possible for ReLU, since it has no holomorphic extension. This difficulty is overcome through a sequence of representation results in which polynomial activations are used but then expressed in terms of ReLUs. I am confident that this paper will be of interest to the NeurIPS community.
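As a toy illustration of the cancellation mechanism (a sketch only, not the paper's actual construction; the function name `moment` is mine): every positive moment of a uniform point on the complex unit circle vanishes, so a Taylor term c_k z^k disappears in expectation, whereas a real symmetric perturbation leaves even moments intact.

```python
import cmath

def moment(k, m=64):
    """Average of z^k over the m-th roots of unity (uniform on the circle).

    For 1 <= k < m this is a geometric sum that cancels to exactly 0,
    so any Taylor term c_k * z^k vanishes in expectation.
    """
    return sum(cmath.exp(2j * cmath.pi * j * k / m) for j in range(m)) / m

# Odd AND even moments alike cancel under a complex perturbation.
for k in range(1, 8):
    assert abs(moment(k)) < 1e-12

# A real symmetric perturbation cannot do this: the average of t^2 over
# t in {-1, +1} is 1, not 0, so second-order Taylor terms survive.
assert sum(t**2 for t in (-1.0, 1.0)) / 2 == 1.0
```

This is exactly the feature that has no ReLU analogue, since ReLU admits no holomorphic extension along which such complex perturbations make sense.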

Weaknesses: No major weaknesses, aside from (I suppose) trying to determine optimal dependence on epsilon in addition to n.

Correctness: All of the proofs appear to be correct.

Clarity: The paper is exceptionally well-written.

Relation to Prior Work: There is a thorough comparison to prior work, with one exception. Conceptually, there are two main ideas in the paper: 1. complex weights, and 2. the iterative construction. A somewhat related COLT 2020 paper by Bresler and Nagaraj also used complex weights (via the discrete Fourier transform), and its main idea, as far as I could tell, was a similar iterative construction. If we unpack their Theorem 1 (p. 7) to determine the total weight of their construction, it seems they have |F(\xi)| = O(\sqrt{n}) and \|\xi\| = O(\sqrt{d}), so O(\sqrt{nd}) total weight. Their result is worse in the dependence on dimension but allows for more general data points; assuming what I wrote is correct, it would be appropriate to mention this and give a comparison. It seems important to explain precisely the relationship between the ideas in that paper and yours, especially surrounding the iterative construction of networks. It would also be good to explain how the NTK ideas relate to those of the paper by Daniely.
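Spelling out the arithmetic in my reading of their Theorem 1 (to be verified by the authors):

```latex
W \;\lesssim\; |F(\xi)| \cdot \|\xi\|
\;=\; O(\sqrt{n}) \cdot O(\sqrt{d})
\;=\; O\big(\sqrt{nd}\big),
```

compared with the O(\sqrt{n}) total weight achieved in the present paper, i.e., a \sqrt{d} gap in the dependence on dimension.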

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This work studies how many neurons and how large a total weight a neural net requires to memorize training data. One result constructs a ReLU neural network with $4n/d$ neurons that exactly memorizes the data, where $n$ is the number of samples and $d$ is the dimension. Another result shows that a ReLU neural network in the NTK regime can approximately memorize the data with $O(n/d \log(1/\epsilon))$ neurons, where $\epsilon$ is the approximation error. On the other hand, the authors point out that the total weight of the neural networks in these memorization results could be very large, while existing theoretical results suggest that networks with small total weight may generalize better. The authors therefore construct a neural net with (small) optimal total weight for memorizing data.

Strengths: The paper considers several settings of memorizing training data. The number of neurons and the size of the total weights are essentially optimal. The claims seem to be sound and the analysis seems to be non-trivial.

Weaknesses: I think the authors should compare this paper with [Bresler and Nagaraj 2020] in detail (e.g., contributions, approach to network construction), because the techniques used in the two papers seem quite similar. For example, the technique of iteratively constructing sub-networks to reduce the residual error (Lemma 1 in this paper) also appears in [Bresler and Nagaraj 2020]. Furthermore, the complex-weights trick in Section 5 of this paper also appears in [Bresler and Nagaraj 2020]; specifically, Lemma 9 of this paper (replacing a complex neuron with ReLUs) seems to have a counterpart there (their Lemma 4: replacing sinusoids by ReLUs). I also have a concern about the writing (see below). I am happy to increase my score if the authors respond well. Ref: Guy Bresler and Dheeraj Nagaraj. "A Corrective View of Neural Networks: Representation, Memorization and Learning." COLT 2020.

Correctness: Seems to be correct.

Clarity: The writing is a little sloppy in places. Here are some examples, which I hope the authors can address:
- Line 415 in the supplementary: "...where we used the fact that (\alpha u - c) \perp u ...". But \alpha does not seem to be defined.
- Lines 540-541 in the supplementary: "...the expression C_m will denote a constant depending only on m, whose value may change between different appearances...". I see that C_m is used in the inequalities on lines 550 and 551, but it is unclear whether it really depends only on $m$. Does C_m depend on $n$ or $\gamma$? I suggest that the authors write down what C_m really is.
- Minor: line 175: Is it 2[n/2d] or 2[n/d]?

Relation to Prior Work: See *weakness*.

Reproducibility: Yes

Additional Feedback: === After rebuttal === I have read the response. I hope that the authors will polish the writing in the analysis as they promised.

Review 4

Summary and Contributions: This work studies the expressiveness of two-layer (mostly ReLU) networks, focusing on how many hidden neurons are sufficient to interpolate the training data and on the optimality of the construction in terms of weight size. The first contribution extends the result of Baum, 1988 from binary to arbitrary labels: n/d neurons are sufficient to fit arbitrary labels. The paper then extends this result to NTK and harmonic networks.

Strengths: I have some concerns about correctness (see below), but if they are addressed, the results on the expressiveness of NTK and harmonic networks are new and interesting.

Weaknesses: The "wide-spread" assumption (i.e., data points cannot be too close to each other) is restrictive. To my knowledge, most practical datasets, for example MNIST, do not satisfy it.

Correctness: Some important proof steps are incomplete (the proof of Lemma 1, the proof of the extension of Proposition 3, ...), so I have concerns about correctness. I give more details in the additional feedback section. That said, it could be that I missed something, so I am happy to be proven wrong and willing to change my stance if my concerns are properly addressed. I do not think the term "memorization", which is used throughout the paper, is the right terminology; I think it should be reserved for a learning algorithm rather than a construction. The term "interpolation" would be more appropriate.

Clarity: I find the paper hard to follow at times, and it is hard to distinguish the main contributions from existing results. The introduction is filled with theoretical statements and proofs that I believe should be moved to the main technical section; the introduction should probably focus on high-level ideas and a summary of the results (via informal statements and discussion), clearly distinguished from related work. In my opinion, Section 2 should be moved earlier. It is also not clear which formal result Theorem 1 is an informal version of.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Clarification questions:
1. In Theorem 1, what does the first inequality say? If there is any i such that the squared loss is > 1, then it cannot hold for arbitrarily small epsilon.
2. I do not follow how to construct g in the proof of Lemma 1 (lines 69-70). On line 86, the sign of b_l is not consistent with Eq. (1). Expanding the inline equality (at the end of the same line) would be helpful for the reader.
3. What is f_{u,v,b} on line 180? Is it defined in Equation (3)? Moreover, (3) is not mathematically precise: shouldn't the LHS of (3) involve \lim_{\delta \to 0}?
4. Can the authors explain why it is possible to assume f_{u,v,b}(x_i) \in F_2(ReLU) on line 184?
5. Is there any intuition for why the dependence on 1/\gamma is only logarithmic?
Minor comments: Line 157: for a fixed training set X, why is there randomness in H^\infty? The bibliography style does not follow the NeurIPS format.
========== AFTER REBUTTAL =========== The authors have addressed my concerns and clarified some confusion. I am increasing my score accordingly.