{"title": "Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence", "book": "Advances in Neural Information Processing Systems", "page_first": 1143, "page_last": 1152, "abstract": "Deep neural networks have received dramatic success based on the optimization method of stochastic gradient descent (SGD). However, it is still not clear how to tune hyper-parameters, especially batch size and learning rate, to ensure good generalization. This paper reports both theoretical and empirical evidence of a training strategy that we should control the ratio of batch size to learning rate not too large to achieve a good generalization ability. Specifically, we prove a PAC-Bayes generalization bound for neural networks trained by SGD, which has a positive correlation with the ratio of batch size to learning rate. This correlation builds the theoretical foundation of the training strategy. Furthermore, we conduct a large-scale experiment to verify the correlation and training strategy. We trained 1,600 models based on architectures ResNet-110, and VGG-19 with datasets CIFAR-10 and CIFAR-100 while strictly control unrelated variables. Accuracies on the test sets are collected for the evaluation. Spearman's rank-order correlation coefficients and the corresponding $p$ values on 164 groups of the collected data demonstrate that the correlation is statistically significant, which fully supports the training strategy.", "full_text": "Control Batch Size and Learning Rate to Generalize\n\nWell: Theoretical and Empirical Evidence\n\nUBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering\n\nFengxiang He Tongliang Liu Dacheng Tao\n\nThe University of Sydney, Darlington, NSW 2008, Australia\n\n{fengxiang.he, tongliang.liu, dacheng.tao}@sydney.edu.au\n\nAbstract\n\nDeep neural networks have received dramatic success based on the optimization\nmethod of stochastic gradient descent (SGD). 
However, it is still not clear how to tune hyper-parameters, especially batch size and learning rate, to ensure good generalization. This paper reports both theoretical and empirical evidence for a training strategy: to achieve good generalization, the ratio of batch size to learning rate should not be too large. Specifically, we prove a PAC-Bayes generalization bound for neural networks trained by SGD that is positively correlated with the ratio of batch size to learning rate. This correlation builds the theoretical foundation of the training strategy. Furthermore, we conduct a large-scale experiment to verify the correlation and the training strategy. We trained 1,600 models based on the ResNet-110 and VGG-19 architectures on the CIFAR-10 and CIFAR-100 datasets while strictly controlling unrelated variables. Accuracies on the test sets are collected for the evaluation. Spearman's rank-order correlation coefficients and the corresponding p values on 164 groups of the collected data demonstrate that the correlation is statistically significant, which fully supports the training strategy.\n\n1 Introduction\n\nThe recent decade saw dramatic success of deep neural networks [9] based on the optimization method of stochastic gradient descent (SGD) [2, 32]. How to tune the hyper-parameters of SGD so that neural networks generalize well is an interesting and important problem. Some works have addressed strategies for tuning hyper-parameters [5, 10, 14, 15] and the generalization ability of SGD [4, 11, 19, 26, 27]. 
However, solid evidence is still lacking for training strategies regarding the hyper-parameters of neural networks.\n\nIn this paper, we present both theoretical and empirical evidence for a training strategy for deep neural networks:\n\nWhen employing SGD to train deep neural networks, we should keep the batch size from being too large and the learning rate from being too small, in order to make the networks generalize well.\n\nThis strategy gives a guide for tuning the hyper-parameters that helps neural networks achieve good test performance once the training error has become small. It is derived from the following property:\n\nThe generalization ability of deep neural networks has a negative correlation with the ratio of batch size to learning rate.\n\nAs theoretical evidence, we prove a novel PAC-Bayes [24, 25] upper bound for the generalization error of deep neural networks trained by SGD. The proposed generalization bound has a positive correlation with the ratio of batch size to learning rate, which suggests a negative correlation between the generalization ability of neural networks and the ratio. This result builds the theoretical foundation of the training strategy.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Scatter plots of accuracy on the test set against the ratio of batch size to learning rate. Each point represents a model. In total, 1,600 points are plotted.\n\nOn the empirical side, we conduct extensive systematic experiments, strictly controlling unrelated variables, to investigate the influence of batch size and learning rate on the generalization ability. Specifically, we trained 1,600 neural networks based on two popular architectures, ResNet-110 [12, 13] and VGG-19 [28], on two standard datasets, CIFAR-10 and CIFAR-100 [16]. The accuracies on the test set of all the networks are collected for analysis. 
Since the training error is almost the same across all the networks (it is almost 0), the test accuracy is an informative index of the generalization ability. Evaluation is then performed on 164 groups of the collected data. The Spearman's rank-order correlation coefficients and the corresponding p values [31] demonstrate that the correlation is statistically significant (the probability that the correlation is spurious is smaller than 0.005), which fully supports the training strategy.\n\nThe rest of this paper is organized as follows. Section 2 recalls the preliminaries of generalization and SGD. Sections 3 and 4 respectively present the theoretical and empirical evidence for the training strategy. Section 5 reviews the related works. Section 6 concludes the paper. Appendix A presents additional background and preliminaries. Appendix B provides the proofs omitted from the main text.\n\n2 Preliminaries\n\nGeneralization bound for stochastic algorithms. Formally, machine learning algorithms are designed to select the hypothesis function F_θ with the lowest expected risk R under the loss function l from a hypothesis class {F_θ | θ ∈ Θ ⊆ R^d}, where θ is the parameter of the hypothesis and d is the dimension of the parameter θ. For many stochastic algorithms, such as SGD, we usually use a distribution to express the output parameter. Suppose the parameter follows a distribution Q; the expected risks in terms of θ and Q are respectively defined as:\n\nR(θ) = E_{(X,Y)∼D} l(F_θ(X), Y),   (1)\n\nR(Q) = E_{θ∼Q} E_{(X,Y)∼D} l(F_θ(X), Y).   (2)\n\nHowever, the expected risk R is not available from the data, since we do not know the formulation of the latent distribution D of the data. Practically, we use the empirical risk R̂ to estimate the expected risk R, which is defined as:\n\nR̂(θ) = (1/|T|) Σ_{i=1}^{|T|} l(F_θ(X_i), Y_i),   (3)\n\nR̂(Q) = E_{θ∼Q} [(1/|T|) Σ_{i=1}^{|T|} l(F_θ(X_i), Y_i)],   (4)\n\nwhere all (X_i, Y_i) constitute the training sample T.\n\nEquivalently, the empirical risk R̂ is the error of the algorithm on the training data, while the expected risk R is the expectation of the error on test data or unseen data. Therefore, the difference between them is an informative index of the generalization ability of the algorithm, which is called the generalization error. The upper bound of the generalization error (usually called the generalization bound) expresses how large the generalization error can possibly be. Therefore, the generalization bound is also an important index of the generalization ability of an algorithm.\n\nStochastic gradient descent. To optimize the expected risk (eq. 1), a natural tool is gradient descent (GD). Specifically, the gradient of eq. (1) in terms of the parameter θ and the corresponding update equation are defined as follows:\n\ng(θ(t)) ≜ ∇_{θ(t)} R(θ(t)) = ∇_{θ(t)} E_{(X,Y)} l(F_{θ(t)}(X), Y),   (5)\n\nθ(t + 1) = θ(t) − η g(θ(t)),   (6)\n\nwhere θ(t) is the parameter at iteration t and η > 0 is the learning rate.\n\nStochastic gradient descent (SGD) uses mini batches of the training sample to estimate the gradient g(θ). Let S be the indices of a mini batch, in which all indices are independently and identically distributed (i.i.d.) drawn from {1, 2, …
, N}, where N is the training sample size. Then, similarly to the full gradient, the iteration of SGD on the mini batch S is defined as follows:\n\nĝ_S(θ(t)) = ∇_{θ(t)} R̂(θ(t)) = (1/|S|) Σ_{n∈S} ∇_{θ(t)} l(F_{θ(t)}(X_n), Y_n),   (7)\n\nθ(t + 1) = θ(t) − η ĝ_S(θ(t)),   (8)\n\nwhere R̂(θ) = (1/|S|) Σ_{n∈S} l(F_θ(X_n), Y_n) is the empirical risk on the mini batch and |S| is the cardinality of the set S. For brevity, we write l(F_θ(X_n), Y_n) = l_n(θ) in the rest of this paper. Also, suppose that in step i the distribution of the parameter is Q_i, the initial distribution is Q_0, and the convergent distribution is Q. SGD is then used to find Q from Q_0 through the sequence of Q_i.\n\n3 Theoretical Evidence\n\nIn this section, we develop the theoretical foundations for the training strategy. The main ingredient is a PAC-Bayes generalization bound for deep neural networks trained with SGD. The generalization bound has a positive correlation with the ratio of batch size to learning rate. This correlation suggests the presented training strategy.\n\n3.1 A Generalization Bound for SGD\n\nBoth l_n(θ) and R̂(θ) are unbiased estimators of the expected risk R(θ), while ∇_θ l_n(θ) and ĝ_S(θ) are both unbiased estimators of the gradient g(θ) = ∇_θ R(θ):\n\nE[l_n(θ)] = E[R̂(θ)] = R(θ),   (9)\n\nE[∇_θ l_n(θ)] = E[ĝ_S(θ)] = g(θ) = ∇_θ R(θ),   (10)\n\nwhere the expectations are taken over the corresponding examples (X, Y).\n\nA common assumption (see, e.g., [7, 23]) is that the gradients {∇_θ l_n(θ)} calculated from individual data points are i.i.d. draws from a Gaussian distribution centred at g(θ) = ∇_θ R(θ):\n\n∇_θ l_n(θ) ∼ N(g(θ), C),   (11)\n\nwhere the covariance C is a constant matrix for all θ. As covariance matrices are positive semi-definite, for brevity, we suppose that C can be factorized as C = BB^T. This assumption can be justified by the central limit theorem when the sample size N is large enough compared with the batch size |S|. Since deep learning is usually used to process large-scale data, the assumption approximately holds in real-life cases. Therefore, the stochastic gradient is also drawn from a Gaussian distribution centred at g(θ):\n\nĝ_S(θ) = (1/|S|) Σ_{n∈S} ∇_θ l_n(θ) ∼ N(g(θ), (1/|S|) C).   (12)\n\nSGD uses the stochastic gradient ĝ_S(θ) to iteratively update the parameter θ in order to minimize the function R(θ):\n\nΔθ(t) = θ(t + 1) − θ(t) = −η ĝ_S(θ(t)) = −η g(θ) + (η/√|S|) B W, W ∼ N(0, I).   (13)\n\nIn this paper, we only consider the case where the batch size |S| and the learning rate η are constant. Eq. (13) describes a stochastic process well known as the Ornstein-Uhlenbeck process [33].\n\nFurthermore, we assume that the loss function in the local region around the minimum is convex and twice differentiable:\n\nR(θ) = (1/2) θ^T A θ,   (14)\n\nwhere A is the Hessian matrix around the minimum and is a positive semi-definite matrix. This assumption has been primarily demonstrated by empirical works (see [18, p. 1, Figures 1(a) and 1(b) and p. 6, Figures 4(a) and 4(b)]). Without loss of generality, we assume that the global minimum of the objective function R(θ) is 0 and is achieved at θ = 0. General cases can be obtained by translation operations, which would not change the geometry of the objective function or the corresponding generalization ability. From the results for the Ornstein-Uhlenbeck process, eq. 
(13) has an analytic stationary distribution:\n\nq(θ) = M exp{−(1/2) θ^T Σ^{-1} θ},   (15)\n\nwhere M is the normalizer [8].\n\nApproximating SGD by a continuous-time stochastic process dates back to works by [17, 21]. For a detailed justification, please refer to a recent work [see 23, pp. 6-8, Section 3.2].\n\nWe then obtain a generalization bound for SGD as follows.\n\nTheorem 1. For any positive real δ ∈ (0, 1), with probability at least 1 − δ over a training sample set of size N, we have the following inequality for the distribution Q of the output hypothesis function of SGD:\n\nR(Q) ≤ R̂(Q) + √( [(η/|S|) tr(CA^{-1}) − 2 log(det(Σ)) − 2d + 4 log(1/δ) + 4 log N + 8] / (8N − 4) ),   (16)\n\nand\n\nΣA + AΣ = (η/|S|) C,   (17)\n\nwhere A is the Hessian matrix of the loss function around the local minimum, C = BB^T is the covariance matrix of the gradients calculated from single sample points, and d is the dimension of the parameter θ (the network size).\n\nThe proof of this generalization bound has two parts: (1) utilize results from stochastic differential equations (SDEs) to find the stationary solution of the latent Ornstein-Uhlenbeck process (eq. 13), which expresses the iterative update of SGD; and (2) adapt the PAC-Bayes framework to obtain the generalization bound based on the stationary distribution. A detailed proof is given in Appendix B.1.\n\n3.2 A Special Case of the Generalization Bound\n\nIn this subsection, we study a special case with two more assumptions, in order to further understand the influence of the gradient fluctuation on our proposed generalization bound.\n\nAssumption 1. The matrices A and Σ are symmetric.\n\nAssumption 1 can be interpreted as saying that both the local geometry around the global minimum and the stationary distribution are homogeneous in every dimension of the parameter space. 
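In one dimension, the stationary behaviour described by eqs. (13), (15), and (17) can be checked numerically: the scalar solution of eq. (17) predicts a stationary variance of ηc/(2a|S|), where a is the scalar Hessian and c the per-example gradient-noise variance. Below is a minimal simulation of this check; the parameter values are illustrative choices, not values from the paper.

```python
import random

# Scalar Ornstein-Uhlenbeck view of SGD on R(theta) = a * theta^2 / 2.
# Per-example gradient noise has variance c; a mini batch of size S
# averages it down to c / S (eq. 12). The stationary variance predicted
# by Sigma*A + A*Sigma = (eta/|S|) C (eq. 17) is eta * c / (2 * a * S).
random.seed(0)
a, c = 1.0, 1.0    # Hessian and gradient-noise scale (illustrative)
eta, S = 0.01, 10  # learning rate and batch size (illustrative)

theta, samples = 0.0, []
for t in range(500_000):
    noise = random.gauss(0.0, (c / S) ** 0.5)  # mini-batch gradient noise
    theta -= eta * (a * theta + noise)         # one SGD step, eq. (13)
    if t > 50_000:                             # discard burn-in
        samples.append(theta)

empirical_var = sum(x * x for x in samples) / len(samples)
predicted_var = eta * c / (2 * a * S)          # scalar solution of eq. (17)
print(empirical_var, predicted_var)
```

For small η·a the two variances agree closely; note that doubling |S| or halving η halves the predicted variance, which is exactly the role the ratio |S|/η plays in the bound.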
Similar assumptions are also used in a recent work [14]. This assumption implies that the product ΣA of the matrices A and Σ is also symmetric. Based on Assumption 1, we can further obtain the following theorem.\n\nTheorem 2. When Assumption 1 holds, under all the conditions of Theorem 1, the stationary distribution of SGD has the following generalization bound:\n\nR(Q) ≤ R̂(Q) + √( [(η/(2|S|)) tr(CA^{-1}) + d log(2|S|/η) − log(det(CA^{-1})) − d + 2 log(1/δ) + 2 log N + 4] / (4N − 2) ).   (18)\n\nA detailed proof is given in Appendix B.2 in the supplementary materials.\n\nIntuitively, our generalization bound links the generalization ability of deep neural networks trained by SGD with three factors:\n\nLocal geometry around the minimum. The determinant of the Hessian matrix A expresses the local geometry of the objective function around the local minimum. Specifically, the magnitude of det(A) expresses the sharpness of the local minimum. Many works suggest that sharp local minima relate to poor generalization ability [15, 10].\n\nGradient fluctuation. The covariance matrix C (or equivalently the matrix B) expresses the fluctuation of the gradient estimates obtained from individual data points, which is the source of gradient noise. A recent intuition for the advantage of SGD is that it introduces noise into the gradient, so that it can jump out of bad local minima.\n\nHyper-parameters. The batch size |S| and the learning rate η adjust the fluctuation of the gradient. Specifically, under the following assumption, our generalization bound has a positive correlation with the ratio of batch size to learning rate.\n\nAssumption 2. The network size is large enough:\n\nd > tr(CA^{-1}) η / (2|S|),   (19)\n\nwhere d is the number of parameters, C expresses the magnitude of the individual gradient noise, A is the Hessian matrix around the global minimum, η is the learning rate, and |S| is the batch size.\n\nThis assumption is justified by the fact that the network sizes of neural networks are usually extremely large, a property also called overparametrization [6, 3, 1]. We can obtain the following corollary by combining Theorem 2 and Assumption 2.\n\nCorollary 1. When all conditions of Theorem 2 and Assumption 2 hold, the generalization bound of the network has a positive correlation with the ratio of batch size to learning rate.\n\nThe proof is given in Appendix B.3.\n\nThe corollary reveals the negative correlation between the generalization ability and the ratio. This property further motivates the training strategy that we should keep the ratio from being too large in order to achieve good generalization when training deep neural networks with SGD.\n\n4 Empirical Evidence\n\nTo evaluate the training strategy from the empirical aspect, we conduct extensive systematic experiments to investigate the influence of the batch size and the learning rate on the generalization ability of deep neural networks trained by SGD. To deliver rigorous results, our experiments strictly control all unrelated variables. The empirical results show that there is a statistically significant negative correlation between the generalization ability of the networks and the ratio of the batch size to the learning rate, which builds a solid empirical foundation for the training strategy.\n\nFigure 2: Curves of test accuracy against batch size (a) and learning rate (b). 
The four rows are respectively for (1) ResNet-110 trained on CIFAR-10, (2) ResNet-110 trained on CIFAR-100, (3) VGG-19 trained on CIFAR-10, and (4) VGG-19 trained on CIFAR-100. Each curve is based on 20 networks.\n\n4.1 Implementation Details\n\nTo make the empirical results apply as generally as possible, our experiments are conducted with two popular architectures, ResNet-110 [12, 13] and VGG-19 [28], on two standard datasets, CIFAR-10 and CIFAR-100 [16], which can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html. The separations of the training sets and the test sets are the same as in the official version.\n\nWe trained 1,600 models with 20 batch sizes, S_BS = {16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 272, 288, 304, 320}, and 20 learning rates, S_LR = {0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20}. All additional training techniques for SGD, such as momentum, are disabled. Also, both the batch size and the learning rate are constant in our experiments. Every model with a specific pair of batch size and learning rate is trained for 200 epochs, and the test accuracies of all 200 epochs are collected for analysis. We select the highest accuracy on the test set to express the generalization ability of each model, since the training error is almost the same across all models (they are all nearly 0).\n\nThe collected data is then used to investigate three correlations: (1) the correlation between the generalization ability of the networks and the batch size, (2) the correlation between the generalization ability and the learning rate, and (3) the correlation between the generalization ability and the ratio of batch size to learning rate, where the first two are preparations for the final one. 
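The full factorial design above (20 batch sizes × 20 learning rates × 2 architectures × 2 datasets) can be generated programmatically; a minimal sketch (the dictionary keys are illustrative, not from the paper's code):

```python
from itertools import product

# The 20 batch sizes and 20 learning rates listed above.
batch_sizes = list(range(16, 321, 16))                       # 16, 32, ..., 320
learning_rates = [round(0.01 * k, 2) for k in range(1, 21)]  # 0.01, ..., 0.20

configs = [
    {"arch": arch, "dataset": ds, "batch_size": bs, "lr": lr}
    for arch, ds, bs, lr in product(
        ["ResNet-110", "VGG-19"], ["CIFAR-10", "CIFAR-100"],
        batch_sizes, learning_rates,
    )
]
print(len(configs))  # 1,600 models in total
```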
Specifically, we calculate the Spearman's rank-order correlation coefficients (SCCs) and the corresponding p values of 164 groups of the collected data to investigate the statistical significance of the correlations. Almost all results demonstrate that the correlations are statistically significant (p < 0.005)¹. The p values of the correlation between the test accuracy and the ratio are all lower than 10^-180 (see Table 3).\n\nThe architectures of our models follow a popular implementation of ResNet-110 and VGG-19². Additionally, our experiments are conducted on a computing cluster with NVIDIA® Tesla™ V100 16GB GPUs and Intel® Xeon® Gold 6140 CPUs at 2.30GHz.\n\n¹The definition of “statistically significant” has various versions, such as p < 0.05 and p < 0.01. This paper uses a more rigorous one (p < 0.005).\n\n²See Wei Yang, https://github.com/bearpaw/pytorch-classification, 2017.\n\n4.2 Empirical Results on the Correlation\n\nCorrelation between generalization ability and batch size. With the learning rate fixed to an element of S_LR, we train ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100 with the 20 batch sizes of S_BS. The plots of test accuracy against batch size are illustrated in Figure 2a. We show 1/4 of all plots due to space limitations; the rest are in the supplementary materials. We then calculate the SCCs and the p values, shown in Table 1, where bold p values refer to statistically significant observations and underlined ones to those that are not significant (likewise in Table 2). The results clearly show that there is a statistically significant negative correlation between the generalization ability and the batch size.\n\n[Table 1: SCC and p values of batch size to test accuracy for each learning rate (LR) in S_LR, for ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100.]\n\nCorrelation between generalization ability and learning rate. With the batch size fixed to an element of S_BS, we train ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100 with the 20 learning rates of S_LR. The plots of test accuracy against learning rate are illustrated in Figure 2b, which includes 1/4 of all plots due to space limitations; the rest are in the supplementary materials. We then calculate the SCCs and the p values, shown in Table 2. The results clearly show that there is a statistically significant positive correlation between the learning rate and the generalization ability of SGD.\n\n[Table 2: SCC and p values of learning rate to test accuracy for each batch size (BS) in S_BS, for ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100.]\n\nCorrelation between generalization ability and ratio of batch size to learning rate. We plot the test accuracies of ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100 against the ratio of batch size to learning rate in Figure 1; in total, 1,600 points are plotted. Additionally, we perform Spearman's rank-order correlation test on all the accuracies of ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100. The SCC and p values, given in Table 3, show that the correlation between the ratio and the generalization ability is statistically significant. Each test is performed on 400 models. 
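Spearman's rank-order correlation coefficient used in these tests is the Pearson correlation of the ranks. A minimal pure-Python sketch (without the tie handling or p-value computation that the full analysis requires; the sample data are illustrative, not values from the paper):

```python
def spearman_scc(xs, ys):
    """Spearman rank-order correlation coefficient (no ties assumed)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n + 1) / 2                    # mean of the ranks 1..n
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # same for ry when no ties
    return cov / var

# Test accuracy falling as the ratio |S|/eta grows gives an SCC of -1
# for strictly monotone data; these numbers are purely illustrative.
ratios = [80, 160, 320, 640, 1280, 2560]
accuracies = [93.1, 92.8, 92.0, 90.5, 88.2, 84.9]
print(spearman_scc(ratios, accuracies))  # -1.0
```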
The results strongly support the training strategy.\n\nTable 3: SCC and p values of the ratio of batch size to learning rate to test accuracy. The SCCs are −0.97 for ResNet-110 on CIFAR-10, −0.98 for ResNet-110 on CIFAR-100, −0.98 for VGG-19 on CIFAR-10, and −0.94 for VGG-19 on CIFAR-100, with p values of 5.3 × 10^-293, 6.2 × 10^-291, 3.3 × 10^-235, and 6.1 × 10^-180.\n\n5 Related Work\n\nPrevious experiments have addressed the influence of batch size and learning rate on the generalization ability; however, the influence of hyper-parameters on generalization is still under debate. [15] uses experiments to show that large-batch training leads to sharp local minima, which generalize poorly, while small batches lead to flat minima, which make SGD generalize well; however, only two values of the batch size are investigated. [10] proposes the Linear Scaling Rule to maintain the generalization ability of SGD: “When the minibatch size is multiplied by k, multiply the learning rate by k”. [14] conducts experiments showing that the dynamics, geometry, and generalization depend on the ratio |S|/η, but both the batch size and the learning rate take only two values, which considerably limits the generality of the results. Meanwhile, [5] finds that most current notions of sharpness/flatness are ill-defined. [30] proves that for linear neural networks the batch size has an optimal value when the learning rate is fixed. [29] suggests that increasing the batch size during the training of neural networks can achieve the same result as decaying the learning rate when the ratio of batch size to learning rate remains the same; this result is consistent with ours.\n\nSome generalization bounds for algorithms trained by SGD have been proposed, but they have not led to hyper-parameter tuning strategies. [26] studies the generalization of stochastic gradient Langevin 
[26] studies the generalization of stochastic gradient Langevin\n\ndynamics (SGLD), and proposes an O (1/N ) generalization bound and an O\u21e31/pN\u2318 generalization\n\nbound respectively via the stability and the PAC-Bayesian theory. [27] studies the generalization of all\niterative and noisy algorithms and gives a generalization error bound based on the mutual information\nbetween the input data and the output hypothesis. As exemplary cases, it gives generalization\nbounds for some special cases like SGLD. [4] proves a trade-off between convergence and stability\nfor all iterative algorithms respectively under convex smooth setting and strong convex smooth\nsetting. Considering the equivalence between stability and generalization, it also gives a trade-off\nbetween the convergence and the generalization. Under the same assumptions, [4] gives an O (1/N )\ngeneralization bound. [20] proves O(1/N ) bounds for SGD when the loss function is Lipschitz\ncontinuous and smooth. [22] also gives a PAC-Bayes generalization bound for SGD based on the KL\ndivergence of posterior Q and the prior P which is somewhat premature.\n\n6 Conclusion\n\nThis work presents a training strategy for stochastic gradient descent (SGD) in order to achieve a\ngood generalization ability: we should control the ratio of batch size to learning rate not too large\nwhile tuning the hyper-parameters. This strategy is based on a negative correlation between the\ngeneralization ability of networks with the ratio of batch size to learning rate, which is proved from\nboth theoretical and empirical aspects in this paper. As the theoretical evidence, we prove a novel\nPAC-Bayes upper bound for the generalization error of algorithms trained by SGD. The bound has a\npositive correlation with the ratio of batch size to learning rate, which suggests a negative correlation\nbetween the generalization ability the ratio. 
For the empirical evidence, we trained 1,600 models\nbased on ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-110 and collected the accuracy on\nthe test sets, while strictly control other variables. Spearman\u2019s order-rank correlation coef\ufb01cients\nand the corresponding p values are then calculated on 164 groups of the collected data. The results\ndemonstrate that the correlation is statistically signi\ufb01cant, which is in full agreement with the training\nstrategy.\n\n8\n\n\fAcknowledgments\nThis work was supported in part by Australian Research Council Projects FL-170100117,\nDP180103424, and DE190101473.\n\nReferences\n[1] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-\n\nparameterization. In International Conference on Machine Learning, 2019.\n\n[2] L. Bottou. Online learning and stochastic approximations. On-line learning in neural networks,\n\n17(9):142, 1998.\n\n[3] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD learns over-parameterized\nnetworks that provably generalize on linearly separable data. In International Conference on\nLearning Representations, 2018.\n\n[4] Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization\n\nalgorithms. arXiv preprint arXiv:1804.01619, 2018.\n\n[5] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. In\n\nInternational Conference on Machine Learning, 2017.\n\n[6] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-\nparameterized neural networks. In International Conference on Learning Representations,\n2019.\n\n[7] W. E. A proposal on machine learning via dynamical systems. Communications in Mathematics\n\nand Statistics, 5(1):1\u201311, 2017.\n\n[8] C. W. Gardiner. Handbook of stochastic methods, volume 3. springer Berlin, 1985.\n[9] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.\n[10] P. Goyal, P. Doll\u00e1r, R. Girshick, P. 
Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.\n[11] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2015.\n[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, 2016.\n[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016.\n[14] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv e-prints, 1711.04623, 2017.\n[15] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.\n[16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.\n[17] H. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.\n[18] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, 2018.\n[19] J. Lin, R. Camoriano, and L. Rosasco. Generalization properties and implicit regularization for multiple passes SGM. In International Conference on Machine Learning, 2016.\n[20] T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, 2017.\n[21] L. Ljung, G. Pflug, and H. Walk. Stochastic Approximation and Optimization of Random Systems, volume 17.
Birkhäuser, 2012.\n[22] B. London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, 2017.\n[23] S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.\n[24] D. A. McAllester. PAC-Bayesian model averaging. In Annual Conference on Computational Learning Theory, 1999.\n[25] D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.\n[26] W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Annual Conference on Learning Theory, 2018.\n[27] A. Pensia, V. Jog, and P.-L. Loh. Generalization error bounds for noisy, iterative algorithms. In IEEE International Symposium on Information Theory, 2018.\n[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.\n[29] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le. Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations, 2018.\n[30] S. L. Smith and Q. V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018.\n[31] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 100(3/4):441–471, 1987.\n[32] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.\n[33] G. E. Uhlenbeck and L. S. Ornstein. On the theory of the Brownian motion.
Physical Review, 36(5):823, 1930.