NeurIPS 2020

Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient


Review 1

Summary and Contributions: This paper considers the strong lottery ticket hypothesis -- a conjecture that a randomly initialized neural network can be pruned (i.e., only have weights removed) so as to approximate some target function well. The authors show that when the target function is a fully-connected neural network, such a pruning exists with high probability whenever the randomly initialized network has twice as many layers and a width that is a factor log(d*l/epsilon) larger than the target network. Here, d is the width of the target network, l is its depth, and epsilon is the desired accuracy. This improves over the best/only known result on this problem (Malach et al. 2020), which showed that this can be achieved with a width that is a poly(d, l, 1/epsilon) factor larger than the target. The improvement is obtained by essentially reusing the proof of Malach et al., but fixing a key step where polynomial factors were lost, by appealing to known results on random subset sum problems. The authors also provide a lower bound showing that approximating a two-layer network with a pruned network of constant depth requires a log(1/epsilon) factor blowup in width.
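
To make the key step concrete, here is a minimal sketch (my own illustration, not the authors' code) of the random subset sum phenomenon behind the fix: with only n = O(log(1/epsilon)) i.i.d. uniform[-1,1] samples, essentially every target in [-1,1] can be hit within epsilon by some subset sum. The brute-force search below is for illustration only.

```python
import itertools
import random

def closest_subset_sum(samples, target):
    """Brute-force: find the subset of `samples` whose sum is closest to `target`."""
    best_subset, best_err = (), abs(target)  # empty subset as the baseline
    for r in range(1, len(samples) + 1):
        for subset in itertools.combinations(samples, r):
            err = abs(sum(subset) - target)
            if err < best_err:
                best_subset, best_err = subset, err
    return best_subset, best_err

random.seed(0)
n = 16  # roughly C * log(1/epsilon) samples suffice with high probability
samples = [random.uniform(-1, 1) for _ in range(n)]
for target in (0.3, -0.71, 0.999):
    _, err = closest_subset_sum(samples, target)
    print(f"target={target:+.3f}  best subset-sum error={err:.2e}")
```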

Strengths: 1. Tightest known bounds on the strong lottery ticket hypothesis. 2. Accompanying lower bounds showing that a log(1/epsilon) blowup in the number of parameters is required when the target network is a two-layer network. 3. The improvement over Malach et al. is simple and understandable.

Weaknesses: It is hard to find a glaring weakness here. This paper identifies the lossy step in the only known result on the strong lottery ticket hypothesis and fixes it by appealing to a previous result on random subset sum problems.

Correctness: The paper appears to be correct.

Clarity: The paper is clear and well-written.

Relation to Prior Work: This paper adequately explains the previous state of the art on the problem and the improvements provided in the paper.

Reproducibility: Yes

Additional Feedback: After reading the author response and the other reviews, I still feel that my score is appropriate.


Review 2

Summary and Contributions: This paper studies the lottery ticket hypothesis. The authors theoretically show that any fully-connected target network can be approximated by pruning a random one that is over-parameterized by only a logarithmic factor. The key idea is to connect the problem to SUBSETSUM. Experiments are also provided to support the theoretical results.

Strengths: The paper improves the previous result from a polynomial factor of over-parameterization to a logarithmic factor. The authors also provide a lower bound on the required over-parameterization, which indicates that their upper bound is optimal up to logarithmic factors. These results improve our understanding of the lottery ticket hypothesis, which is currently a very interesting topic.

Weaknesses: The construction of the subnetwork given by this paper relies on solving SUBSETSUM, which is an NP-hard problem in general, so it might be time-consuming in practice. It would be interesting to find a more efficient way to find such a subnetwork, as the authors mention.
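
That said, the instances that arise here are small and only need to be solved to accuracy epsilon, so a pseudo-polynomial dynamic program (roughly n/epsilon reachable states rather than 2^n subsets) may already suffice in practice. The sketch below is my own illustration under that assumption, not the authors' implementation.

```python
import random

def approx_subset_sum(samples, target, eps):
    """Track reachable sums on an eps-grid; return a subset whose sum is near target.
    Rounding each value to the grid can miss the optimum, but the reported error is exact."""
    scale = 1.0 / eps
    reachable = {0: ()}                      # scaled sum -> one subset achieving it
    for x in samples:
        step = int(round(x * scale))
        updates = {}
        for s, subset in reachable.items():
            t = s + step
            if t not in reachable:
                updates[t] = subset + (x,)
        reachable.update(updates)
    goal = int(round(target * scale))
    best = min(reachable, key=lambda s: abs(s - goal))
    subset = reachable[best]
    return subset, abs(sum(subset) - target)

random.seed(1)
samples = [random.uniform(-1, 1) for _ in range(32)]
subset, err = approx_subset_sum(samples, 0.37, eps=1e-3)
print(f"kept {len(subset)} of {len(samples)} weights, error {err:.1e}")
```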

Correctness: The claims and experiments seem to be correct, from my point of view.

Clarity: The paper overall is well-written and easy to understand.

Relation to Prior Work: The authors discussed related works and explained several differences between prior works and the current paper.

Reproducibility: Yes

Additional Feedback: In the experiments, when taking the network structure in Figure 1 and pruning it either by solving SUBSETSUM directly or by another method (e.g., edge-popup in Experiment 2), I was wondering whether the resulting subnetworks would be similar, and whether the number of remaining parameters would differ.

=============================================================

After rebuttal: Thanks for the response; I would like to keep my score.


Review 3

Summary and Contributions: This paper studies the strong lottery ticket hypothesis, which states that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. The paper establishes a connection between pruning random networks and random instances of the SUBSETSUM problem, and uses this connection to derive a new approach that exponentially improves the previous bounds on the amount of over-parameterization. The paper complements this positive result with a lower bound for two-layer neural networks and with experiments.
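
For readers who want the connection spelled out, here is a toy sketch (my own, and only a simplification of the construction as I understood it; the full proof handles signed inputs and all layers of the target network): on nonnegative inputs, pruning a random two-layer ReLU block so that it computes w*x reduces to a subset sum problem over products of the random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Random two-layer block with n hidden units: x -> sum_i v_i * relu(u_i * x).
n = 16
u = rng.uniform(-1, 1, n)
v = rng.uniform(-1, 1, n)

# For x >= 0 and u_i > 0, v_i * relu(u_i * x) = (v_i * u_i) * x, so keeping a
# 0/1 mask s realizes (sum_i s_i * v_i * u_i) * x: choosing the mask is exactly
# a subset sum problem over the products v_i * u_i.
candidates = np.where(u > 0, v * u, 0.0)   # units with u_i <= 0 output 0 on x >= 0

w = 0.42                                    # target weight to approximate
best_mask, best_err = np.zeros(n, dtype=int), abs(w)
for bits in range(1 << n):                  # brute force, for illustration only
    mask = (bits >> np.arange(n)) & 1
    err = abs(mask @ candidates - w)
    if err < best_err:
        best_mask, best_err = mask, err

pruned = lambda x: (best_mask * v) @ relu(np.outer(u, x))
x = np.linspace(0, 1, 5)
print("max |pruned(x) - w*x| on a grid:", np.max(np.abs(pruned(x) - w * x)))
```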

Strengths: 1. Pruning is a very hot and important topic in DL. 2. This paper is very well-written. 3. The theoretical finding is very interesting. The use of random instances of the SUBSETSUM problem is novel in this literature, and I believe this connection may be useful in other compression problems in ML. 4. The exponential improvement over the existing result is significant.

Weaknesses: The lower bound only applies to two-layer neural networks, though I think the lower bound is not the focus of this paper.

Correctness: Yes.

Clarity: Yes. I especially appreciate the intuitive explanation of the proof.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Thanks for the response, and I'll keep my score.

--------------------------------------------

Does the proof apply to other activation functions, such as softplus? Is [1] a concurrent work? If so, I think the author(s) should point it out more explicitly. Line 485: "xof" -> "of"


Review 4

Summary and Contributions: The paper tightens the bound on how over-parameterized a network has to be for a subset of its weights to approximate a target network. The work is a solid theoretical advance with respect to the lottery ticket hypothesis. The central idea is simple and clever, especially because it leverages an obscure, 20+ year-old theorem that yields an excellent improvement on the bound.

Strengths: Simple, elegant idea and extension of the pruning approach for bounding the size of the approximating network under the lottery ticket hypothesis. My background in this area is not that strong and my understanding of the proof was only superficial, so others should evaluate the technical aspects to be sure.

Weaknesses: I wonder whether it's important to experiment with training the target network on other problems besides MNIST.

Correctness: As far as I understood them, yes.

Clarity: The paper is very clearly written. I understood everything up to Section 3.2, and I followed the key proof starting around line 189, but I didn't understand it deeply. My lack of comprehension is due to my lack of background in this area, and in general I think an expert will find it clear.

Relation to Prior Work: Yes, it builds directly on a recent result bounding the network size with polynomial depth/layer-size complexity, offers an explanation for why the previous result has the complexity it does, and motivates the present work by pointing to the Lueker paper and explaining how the subset sum strategy provides an advantage.

Reproducibility: Yes

Additional Feedback: