NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5904
Title: Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates

Reviewer 1

This work applies concepts from PAC-Bayes-style generalization bounds to the estimation of local mutual information in neural networks and its use in bounding generalization (similar to Xu and Raginsky). Unfortunately, there are numerous spelling and grammar mistakes and, more problematically, critical notation in the main theorems is not defined anywhere. Additionally, important quantities are glossed over rather than discussed thoroughly. As a result, the theoretical sections are quite difficult to follow. It is not clear to me how the information bounds are used; the authors seem to end up focusing instead on KL-based bounds, which are more reminiscent of PAC-Bayes. As for the experimental results, the improvement over non-data-dependent bounds is to be expected. The figure should be relabeled ("Numerical Results" is not informative), and the y-axis should be labeled.

Edit: Thanks for answering my questions. The authors seem willing to put in the effort to properly revise the paper, so I tend towards acceptance.
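For context, the Xu–Raginsky-style bound I have in mind (stated from memory, so the exact constants should be checked against their paper) controls the expected generalization gap via the mutual information between the sample $S$ and the learned weights $W$, assuming the loss is $\sigma$-subgaussian:

```latex
% Expected-generalization bound in the style of Xu and Raginsky
% (from memory; assumes the loss \ell(w, Z) is \sigma-subgaussian
%  under the data distribution, with n the sample size).
\left| \, \mathbb{E}\big[ R_D(W) - \hat{R}_S(W) \big] \, \right|
  \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S; W)}.
```

The paper under review replaces this mutual information with data-dependent, subsample-based quantities, which is where my confusion about how the information bounds are actually used arises.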

Reviewer 2

This looks like a good set of theoretical contributions. I found the paper a fair bit harder to read than the previous papers on information-theoretic generalization, and I have very limited knowledge of the PAC-Bayes literature, so overall my confidence is at most “medium”. Although I would typically focus on the theoretical contributions, in this case the experimental validation seems to play quite an important role, since without it it is unclear how much improvement there is (if any) over existing bounds. This leads me to the main issue that comes to mind: comparing the “key quantities” in the generalization bounds is considerably less convincing than comparing the bounds themselves. Why can’t the latter be done? (If it is somehow impossible, this ought to be explained in the paper.) It should also be established that your generalization bounds are non-trivial, since a comparison of the form “trivial” vs. “a few orders of magnitude above trivial” would be somewhat less valuable.

Aside from that, my main suggestion would be to make the paper (including the appendix) more accessible to less experienced readers. Some specific suggestions:

- Define all notation unless it is very standard (e.g., \calx^{[T]}, \calM_1(\calW), #S_{J_t}), and be careful of undefined symbols even if their meaning is clear (e.g., W, and similarly \bar{W} and \bar{S} on p13).
- Include citations/references even for fairly standard formulas (e.g., sample variances in finite populations on p5, the chain rule for KL divergence in (53), and the Donsker–Varadhan formula, used many times in the appendix, which should be added as a display equation).
- For better visibility, and to make all the relevant assumptions clear, perhaps state (16) and (26) as “Corollary 1” and “Corollary 2”?
- Do not skip non-trivial details in the appendix. For example, the “reduces to” claim leading to (34) may not be immediate for typical readers.
- Check for inconsistencies between the main body and the appendix. For example, Proposition 2.4 is written in terms of P_T(omega), but P_T then has no argument in the appendix, which is confusing.

Other comments:

- The reference list seems a bit too short given the amount of related work (on information-theoretic generalization, PAC-Bayes, Langevin algorithms, etc.).
- It is not acceptable to define key quantities/notations (e.g., the loss function, the mu_X notation) only in the appendix and then use them in the main body.
- Before (32), I don’t see how the “RHS is constant in J”. What if W depends more strongly on certain indices than others?
- It is nice to have the flaw outlined in Appendix C formally documented. I think it is worth adding (if I’m not mistaken) that, as an alternative to your fix on p13, Russo and Zou’s approach also leads to the same result without any errors.
- Has [11] been published? If so, please update the reference. If not, consider switching to the AISTATS reference.

Minor comments:

- The second-last paragraph before Section 1.1 is hard to follow.
- Section 1.1: Add “We” at the start of each dot point to form proper sentences.
- Section 3.1.1: Too many “and let”s in a row.
- Typo: “These bound elucidates”.
- p12: Mention the Radon–Nikodym derivative for less experienced readers.

=== POST-REBUTTAL: Thank you for the responses. While I am a bit reluctant as to whether the required changes are minor enough for acceptance without re-review, I have upgraded my score to 7. If accepted, the authors should very carefully address the reviewer comments, including much more strongly highlighting (i.e., not buried in an appendix) the numerical and analytical examples where the *generalization bound* is improved and non-vacuous.
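For concreteness, the Donsker–Varadhan variational formula I would like to see stated as a display equation is (as I recall it):

```latex
% Donsker-Varadhan variational representation of the KL divergence
% (for probability measures Q << P on a common space; the supremum is
%  over measurable f with \mathbb{E}_P[e^{f}] < \infty).
\mathrm{KL}(Q \,\|\, P)
  \;=\; \sup_{f}\, \Big\{ \mathbb{E}_{Q}[f] \;-\; \log \mathbb{E}_{P}\big[ e^{f} \big] \Big\}.
```

Since both the mutual-information and PAC-Bayes arguments in the appendix invoke this repeatedly, displaying it once with a citation would help less experienced readers considerably.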

Reviewer 3

The paper studies data-dependent generalization bounds. Novel abstract results are presented that extend the existing mutual-information and PAC-Bayes bounds, and that scale with the mutual information or KL divergence associated with a random uniform subsample of the original dataset. Based on this framework, an improved generalization bound for the Langevin algorithm is developed, which achieves an $O(1/n)$ rate and depends only on the trace of the covariance of the stochastic gradients.

The proof idea is simple but very interesting. The mutual-information method and PAC-Bayes bounds are both based on the Donsker–Varadhan variational formula. Choosing one of the measures to be based on a random subsample brings the ideas of leave-one-out analysis to this literature. This type of extension is nontrivial and can potentially lead to a better understanding of data-dependent generalization bounds. The examples of Langevin dynamics and SGLD also make solid progress in this direction.

That being said, the current presentation impedes readers' understanding of the techniques, and the writing needs a lot of improvement. For example, in Theorem 2.1 it is not clear what the function $\psi_{R_D(\tilde{W}) - \hat{R}_{S_J^C}(\tilde{W})}(\lambda)$ means. The results about Langevin dynamics and SGLD should be formally stated as corollaries or theorems. Besides, the authors should also include more intuition about the meaning of the subsample $J$ and the auxiliary random variable $X$.
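If, as I suspect, the undefined $\psi$ denotes the centered cumulant generating function, then it should be stated explicitly alongside Theorem 2.1, e.g.:

```latex
% Presumed meaning of \psi (my guess; the authors should confirm):
% the centered cumulant generating function of a random variable X.
\psi_X(\lambda) \;:=\; \log \mathbb{E}\!\left[ e^{\lambda \left( X - \mathbb{E}[X] \right)} \right],
\qquad \lambda \in \mathbb{R},
```

so that $\psi_{R_D(\tilde{W}) - \hat{R}_{S_J^C}(\tilde{W})}(\lambda)$ would be the centered cumulant generating function of the generalization gap. If this reading is correct, a one-line definition would resolve the ambiguity; if not, the need for a definition is all the greater.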