Paper ID: | 676 |
---|---|
Title: | Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence |

Theory-wise, the authors neglected to discuss several prior works, some of which suggest theories opposite to theirs. For example:

- "Don't Decay the Learning Rate, Increase the Batch Size", ICLR'18, seems to support a constant batch size/lr ratio empirically.

Experiment-wise, I have many reservations about how thorough and convincing the current experiments are.

- Only {ResNet-110, VGG-19} and CIFAR-10/100 are examined, albeit each with a large variety of lr/batch size settings. One would like to see more variety in models and data (such as ImageNet models).
- Where are the actual accuracy numbers achieved by those models? After changing the training protocol, are the results still competitive with the SOTA numbers reported for the same model/dataset? Practically, a reduced generalization gap does not automatically yield a better test-set result.
- All SGD training techniques, such as momentum, are disabled. Also, both batch size and learning rate are held constant, without annealing. I wonder how these choices affect the final achievable accuracy.
- I believe the authors confused the definitions of S_BS and S_LR on lines 198-200.

--- after rebuttal ---

After reading the comments and the authors' rebuttal, I am satisfied with the responses. The paper theoretically verifies that the ratio of batch size to learning rate is positively related to the generalization error. Specifically, it verifies some very recent empirical findings, e.g., "Don't Decay the Learning Rate, Increase the Batch Size", ICLR 2018, which empirically states that increasing the batch size and decaying the learning rate are quantitatively equivalent. I think the theoretical result is novel and timely and would interest many readers in the deep learning community. I value the theoretical contribution and thus would like to increase my score and vote for accepting the submission.
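The remark above that a reduced generalization gap does not automatically give a better test-set result can be made concrete with a small arithmetic sketch. The two runs and their accuracies below are made-up illustrative numbers, not results from the paper, and the `Run` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One training run; all numbers used below are hypothetical."""
    name: str
    train_acc: float
    test_acc: float

    @property
    def gap(self) -> float:
        # Generalization gap measured as train accuracy minus test accuracy.
        return self.train_acc - self.test_acc


# Made-up values chosen only to illustrate the reviewer's point.
runs = [
    Run("run_a", train_acc=0.99, test_acc=0.93),  # gap about 0.06
    Run("run_b", train_acc=0.90, test_acc=0.88),  # gap about 0.02
]

smallest_gap = min(runs, key=lambda r: r.gap)
best_test = max(runs, key=lambda r: r.test_acc)

print("smallest gap:", smallest_gap.name)
print("best test accuracy:", best_test.name)
```

In this toy case the run with the smaller gap is not the run with the higher test accuracy, which is why reporting absolute accuracies alongside the gap would be informative.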

The PAC bound in this paper is very similar to that in London 2017, which also derives an O(1/N^2) generalization bound. (London, B., "A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent", NeurIPS 2017.)

In the experiments, the authors seem to suggest that a larger learning rate is always better, using the Pearson correlation coefficient to argue that there is a positive correlation between test accuracy and learning rate. Obviously, this reasoning breaks down at some point, since a large enough learning rate keeps you from learning at all, and it is strange that the authors did not experiment with learning rates large enough to see this effect.

Specific comments:

- The equations in 5 are only true if we haven't seen training data batch n. If we are cycling through the training data, then they do not hold.
- The authors assume that the stochastic gradients are Gaussian distributed. I don't know if this is a good assumption; I can certainly think of simple contrived examples where the gradient is either +d or -d for some large value d.
- The authors should cite two other important papers in this area, which give a much more detailed analysis of this subject: Smith and Le, "A Bayesian Perspective on Generalization and Stochastic Gradient Descent", 2018, and Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size", 2018.
- The experiments show a relationship between the batch size, learning rate, and test accuracy, but that in itself is not very surprising. Computing Pearson correlations does not seem appropriate because the relationship is not linear, and the p-values are overkill.
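The two concerns about Pearson correlation raised above (the relationship is not linear, and it must break down at large learning rates) can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the learning-rate grid, the two curve shapes, and the helper names `pearson`, `ranks`, and `spearman` are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def ranks(v):
    """Ranks 0..n-1 of the entries of v (assumes there are no ties)."""
    order = np.argsort(v)
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(len(v))
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical learning-rate grid (log-spaced), not the paper's settings.
lr = np.logspace(-3, 0, 20)

# Synthetic curve 1: monotone but strongly nonlinear (saturating).
saturating = 1.0 - np.exp(-20.0 * lr)

# Synthetic curve 2: rises, peaks near lr = 0.1, then collapses.
hump = lr * np.exp(-10.0 * lr)

p_sat, s_sat = pearson(lr, saturating), spearman(lr, saturating)
p_hump, s_hump = pearson(lr, hump), spearman(lr, hump)

# Restricting the grid to the rising part hides the collapse entirely.
small = lr <= 0.1
p_hump_small = pearson(lr[small], hump[small])

print(f"saturating: pearson={p_sat:.3f}, spearman={s_sat:.3f}")
print(f"hump (full grid): pearson={p_hump:.3f}, spearman={s_hump:.3f}")
print(f"hump (lr <= 0.1 only): pearson={p_hump_small:.3f}")
```

On the saturating curve the rank correlation is perfect while Pearson is not, so a rank-based statistic may be the more natural summary of a monotone trend. On the hump curve, a positive correlation measured only over small learning rates says nothing about what happens beyond the peak.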

This paper presents a novel strategy for training deep neural networks with SGD in order to achieve good generalization ability. In particular, the authors suggest keeping the ratio of batch size to learning rate from becoming too large. Extensive theoretical and empirical evidence is provided in the paper.

Pros.

1. In general this paper is very well written and clearly organized. This paper brings new insights on training deep models to the community. The strategy has been well justified using both theoretical analysis and empirical evaluations.
2. Theoretical analysis is provided. In particular, a generalization bound for SGD is derived, which has a positive correlation with the ratio of batch size to learning rate. Tightness of the bound is also discussed.
3. A large number of popular deep models are trained and analyzed in the experiments. The Pearson correlation coefficient and the corresponding p-values clearly support the proposed strategy.

Cons.

1. Existing generalization bounds for SGD are discussed in Section 5. It would be helpful if the authors could discuss whether the existing bounds involve learning rate and batch size.
2. A few typos should be corrected in the final version.
   - Page 2: in terms of
   - Theorem 2: hold
   - Section 4.1: the elements in S_BS and S_LR

-----

The authors have provided detailed and reasonable responses in their rebuttal. I still believe that this paper presents important theoretical results and brings new insights to the community. Thus, I vote for acceptance.