NeurIPS 2020

On Convergence and Generalization of Dropout Training

Review 1

Summary and Contributions: This paper gives test error guarantees for training a two-layer network with ReLU activations and dropout as an explicit regularizer. The proof extends the work of Ji and Telgarsky (2019) that uses the NTK approximation and is based on a margin assumption in the NTK feature space.

Strengths: This work studies an important problem of understanding the theoretical properties of dropout. The paper is mostly clearly written and most of the proofs in the paper seem correct.

Weaknesses: I think there is a major technical mistake in the paper. To prove their main result, the authors rely on an inequality presented after line 259 which says that the expected zero-one *test* error is at most the expected *training* error. This should not be correct because the test error is usually higher than the training error. There is no generalization analysis in the paper, e.g., using Rademacher complexity as in Ji and Telgarsky (2019). The proof of this inequality is given after line 561 in the supplementary, but there is no explanation and it seems incorrect. Am I missing something? Another issue with the paper is that it seems that the guarantees hold for the initialization and only "worsen" with the gradient updates. Therefore, if we consider the algorithm which initializes W_1 and then satisfies W_t = W_1 for all t, it has the same guarantees as SGD. Is this correct? Where in the proofs does SGD have an advantage over the latter algorithm? If it does not, then this is a clear limitation of the result and it does not explain the benefits of training with SGD. In the discussion, the authors claim that following Nagarajan et al., implicit bias may not explain the generalization of SGD. I think that this statement is incorrect. Nagarajan et al. show that uniform convergence bounds may not explain generalization, this is not the same. The implicit bias of SGD implies good generalization (because this is what we see in practice), and it may be possible to explain it and derive generalization guarantees by other methods (not uniform convergence bounds). Other minor comments: 1. I think it would be good to emphasize the differences between the results on SGD with dropout and SGD without dropout. We should expect better guarantees for dropout because it generalizes better in practice. Is this the case in this paper? 2. In Page 8 it is not clear which Lemma is used to prove which Lemma. This should be written more clearly. 3. In Algorithm 1, what is p? 4. Typos: throughout the paper miss-classification should be misclassification. Line 122: "min" is missing. Line 271: ">0" is missing.

Correctness: There seems to be a major mistake in the proof of the main result. See weaknesses above.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: -------------- Post author response --------------------- I have read the response, other reviews and relevant parts of the paper again. There is no technical mistake. Thank you for clarifying my misunderstanding. I updated the score.

Review 2

Summary and Contributions: I have read the authors' response and appreciate their explanation. I improve my score but it would be nice if the authors include the clarifications as response to my comments in the final version. ============================================================= This work studies convergence rates of Dropout guaranteeing epsilon-suboptimality in the test error. In addition, it shows that dropout implicitly compresses the network.

Strengths: As far as I know, there has not been any work that studies properties of dropout in such theoretical settings.

Weaknesses: I have to admit that I am not qualified to follow all theoretical results in this paper. However, it could have been easier for someone like me to follow if the insights from each theoretical result were discussed more and maybe even add a toy example for motivation.

Correctness: -I would appreciate if the authors had discussed iteration complexity in practice. We know that applying dropout in neural networks come with a negligible cost because it is just a matter of masking weights. - The logistic loss equation in line 123 is wrong. Also, the loss function \ell takes two parameters in line 122 and one parameter in line 123. - As I mentioned, I was not able to follow the derivation of the theoretical results but I wonder if the main reason for those results is the max-norm instead of dropout.

Clarity: - I find paragraph in lines 148-151 rather abrupt. It would have been nice if the authors gave a bit background on how NTK is related to this problem. - I cannot follow the notation in lines 167-169. What is k? Is it iteration number? What do you mean by linearization? Why \nabla g_t is called features? - In basic facts, the claim in line 174 is not obvious to me but it is also because I was not able to follow the notation. - The misclassification error used in theorem 4.3 is confusing. \mathcal{R} takes all weights of the neural network as inputs. Why does it take qW_t in this theorem?

Relation to Prior Work: The authors discussed prior work in detail.

Reproducibility: No

Additional Feedback: - The paper focuses on a binary classification problem. I would like to know what happens for more complicated problems and networks.

Review 3

Summary and Contributions: The paper studies the convergence and compression properties of neural networks trained with dropouts. They make two important and novel contributions. 1. Convergence rates (non-asymptotic) for two layer neural networks with ReLU activation units. 2. Compression - They show that dropout supports compression. In particular, there exists a sub-network that can generalize as well as the whole network. This has interesting relations to the lottery ticket hypothesis.

Strengths: The paper tackles a very interesting problem which is not well understood in the deep learning community. This should be interesting to the nips community.

Weaknesses: In Assumption 1, (q, gamma)-margin: \phi (line 185) is not defined. Also, since assumption 1 is crucial to the proof and statement of all the results in the paper. It might have been helpful, to have more discussion around some of the questions below. why is assumption 1 necessary or why do we expect this to hold on real datasets or some example datasets where such a property holds.

Correctness: The claims seem to be correct.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I have read the rebuttal and other reviews. I will keep my score unchanged.

Review 4

Summary and Contributions: This paper gives a proof on the dropout sub-optimalities on the two layer RELU neural networks. Assumptions of the analysis include only the first layer is trainable, overparameterization and data separation.

Strengths: This draft gives a proof on the convergence of the dropout training for the two layer neural overparameterized network. In particular given some margin assumptions, the mis-classification error decreases with a speed of roughly O(1/T).

Weaknesses: The work looks fine to me. I understand the assumption that only the first layer is trainable is used a lot in previous work. It will be better if we can make both layers trainable.

Correctness: Overall the arguments look fine to me, but I did not go through the details.

Clarity: the draft is written fine.

Relation to Prior Work: yes the references seem adequate to me.

Reproducibility: Yes

Additional Feedback: I suggest we ignore the reproducibility question for the theory paper. ================ I have read the authors' response. I will keep my ratings unchanged.