{"title": "The Effect of Network Width on the Performance of Large-batch Training", "book": "Advances in Neural Information Processing Systems", "page_first": 9302, "page_last": 9309, "abstract": "Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however it besets the convergence of the algorithm and the generalization performance.\n\nIn this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.", "full_text": "The Effect of Network Width on the Performance of\n\nLarge-batch Training\n\n1Department of Computer Sciences,\n\n2Department of Electrical and Computer Engineering\n\nLingjiao Chen1 , Hongyi Wang1 , Jinman Zhao1,\nParaschos Koutris, 1 Dimitris Papailiopoulos2\n\nUniversity of Wisconsin-Madison\n\nAbstract\n\nDistributed implementations of mini-batch stochastic gradient descent (SGD) suf-\nfer from communication overheads, attributed to the high frequency of gradient\nupdates inherent in small-batch training. Training with large batches can reduce\nthese overheads; however it besets the convergence of the algorithm and the gen-\neralization performance. In this work, we take a \ufb01rst step towards analyzing how\nthe structure (width and depth) of a neural network affects the performance of\nlarge-batch training. 
We present new theoretical results which suggest that\u2013for a fixed number of parameters\u2013wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.\n\n1 Introduction\n\nDistributed implementations of stochastic optimization algorithms have become the standard in large-scale model training [1, 2, 3, 4, 5, 6, 7]. Most machine learning frameworks, including Tensorflow [1], MxNet [4], and Caffe2 [7], implement variants of mini-batch SGD as their default distributed training algorithm. During a distributed iteration of mini-batch SGD, a parameter server (PS) stores the global model, and P compute nodes evaluate a total of B gradients; B is commonly referred to as the batch size. Once the PS receives the sum of these B gradients from every compute node, it applies them to the global model and sends the model back to the compute nodes, where a new distributed iteration begins.\nThe main premise of a distributed implementation is speedup gains, i.e., how much faster training takes on P compute nodes vs. 1. In practice, these gains usually saturate beyond a few tens of compute nodes [6, 8, 9]. This is because communication becomes the bottleneck, i.e., for a fixed batch of B examples, as the number of compute nodes increases, these nodes will eventually spend more time communicating gradients to the PS rather than computing them. To mitigate this bottleneck, a plethora of recent work has studied low-precision training and gradient sparsification, e.g., [10, 11, 12].\nAn alternative approach to alleviate these overheads is to increase the batch size B, since B directly controls the communication-computation ratio. 
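Since B directly controls the communication-computation ratio, a minimal cost model can make this concrete. The sketch below is purely illustrative and not from the paper; the constants t_grad and t_comm are hypothetical:

```python
# Hedged sketch: a toy per-iteration cost model for distributed mini-batch SGD.
# t_grad (seconds per gradient) and t_comm (seconds per aggregation round)
# are hypothetical constants, not measurements from the paper.
def time_per_example(B, P, t_grad=1e-3, t_comm=0.5):
    """Each of P workers computes B/P gradients, then one fixed-cost
    communication round aggregates them at the parameter server."""
    compute = (B / P) * t_grad
    return (compute + t_comm) / B  # wall-clock seconds per training example

# Larger batches amortize the fixed communication cost over more examples.
for B in (32, 256, 4096):
    print(f"B={B:5d}  sec/example={time_per_example(B, P=8):.6f}")
```

Under this simple model, the per-example cost falls as B grows because the fixed communication round is shared by more examples, which is exactly why large-batch training is attractive for distributed speedups.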
Recent work develops sophisticated methods that enable large-batch training on state-of-the-art models and data sets [13, 14, 15]. At the same time, several studies suggest that large-batch training can affect the generalizability of the models [16], can slow down convergence [17, 18, 19], and is more sensitive to hyperparameter mis-tuning [20].\nSeveral theoretical results [21, 18, 22, 19, 17] suggest that, when the batch size B becomes larger than a problem-dependent threshold B*, the total number of iterations to converge significantly increases, rendering the use of larger B a less viable option. Some of these studies, implicitly or explicitly, indicate that the threshold B* is controlled by the similarity of the gradients in the batch.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn particular, [19] shows that the measure of gradient diversity directly controls the relationship of B and the convergence speed of mini-batch SGD. Gradient diversity measures the similarity of concurrently processed gradients, and [19] shows theoretically and experimentally that the higher the diversity, the more amenable a problem is to fast large-batch training, and by extension to speedup gains in a distributed setting.\nA large volume of work has focused on how the structure of neural networks can affect the complexity or capacity [23, 24, 25] of the model, its representation efficiency [26], and its prediction accuracy [27, 28]. 
However, there is little work towards understanding how the structure of a neural network affects its amenability to distributed speedup gains.\nIn this work, through analyzing the gradient diversity of different network architectures, we take a step towards addressing the following question: How does the structure of a neural network affect its amenability to fast large-batch training?\nOur contribution We establish a theoretical connection between the structure (depth and width) of neural networks and their gradient diversity, which is an indicator of how large the batch size can become without slowing down the speed of convergence [19]. In particular, we prove how gradient diversity varies as a function of width and depth for two types of networks: 2-layer fully-connected linear and non-linear neural networks, and multi-layer fully-connected linear neural networks. Our theoretical analysis indicates that, perhaps surprisingly, gradient diversity increases monotonically as width increases and depth decreases. On a high level, wider networks provide more space for the gradients to become diverse. This result suggests that wider and shallower networks are more amenable to fast large-batch training compared to deeper ones. Figure 1 provides an illustrative example of this phenomenon.\nWe provide extensive experimental results that support our theoretical findings. We present experiments on fully-connected and residual networks on CIFAR10, MNIST, EMNIST, Gisette, and synthetic datasets. In our experimental setting, we fix the number of network parameters, vary the depth and width, and measure (after tuning the step size) how many passes over the data it takes to reach an accuracy of \u03b5 with batch size B. We observe that for all networks there exists a threshold B*, and setting the batch size larger than the threshold leads to slower convergence. 
The observed threshold B* becomes smaller when the network becomes deeper, validating our theoretical result that deeper networks are less amenable to fast large-batch training.\nTo summarize the main message of our work, communication bottlenecks in distributed mini-batch SGD can be partially overcome not only by designing communication-efficient algorithms, but also by optimizing the architecture of the neural network at hand in order to enable large-batch training.\n\nFigure 1: Impact of neural network structure on amenability to large-batch training. This is for fully-connected models with ReLUs on MNIST. For each fully-connected network, we vary the batch size and measure the number of epochs to converge to 96% accuracy on MNIST. Wider and shallower networks require fewer epochs to converge than narrower and deeper ones, which suggests that the former are more suitable to scale out to more compute nodes.\n\n2 Related Work\n\nMini-batch The choice of an optimal batch size has been studied for non-strongly convex models [21], least square regression [22], and SVMs [29]. Other works propose methods that automatically choose the batch size on the fly [30, 31]. Mini-batch algorithms can be combined with accelerated gradient descent [32] or with dual coordinate descent [33, 34]. Mini-batch proximal algorithms are presented in [35]. While previous work mainly focuses on (strongly) convex models, or specific models (e.g., least square regression, SVMs), our work studies how neural network structure can affect the optimal batch size.\nGradient Diversity Previous work indicates that mini-batch SGD can achieve better convergence rates by increasing the diversity of gradient batches, e.g., using stratified sampling [36], Determinantal Point Processes [37], or active sampling [38]. 
The notion of similarity between gradients and how it affects convergence performance has been studied in several papers [17, 18, 19]. A formal definition and analysis of gradient diversity is given in [19], which establishes the connection between gradient diversity and maximum batch size for convex and nonconvex models. To the best of our knowledge, none of the existing works relates gradient diversity (and thus the optimal batch size) to the structure of a neural network.\nWidth vs Depth in Artificial Neural Networks There has been an increasing interest and debate on the qualities of deep versus wide neural networks. [23] suggests that deep networks have larger complexity than wide networks and thus may be able to obtain better models. [26] proves that deep networks can approximate sum-product functions more efficiently than wide networks. Meanwhile, [39] shows that a class of wide ResNets can achieve at least as high accuracy as deep ResNets. [40] presents two classes of networks, one shallow and one deep, that achieve similar prediction error for saliency prediction. In fact, [41] shows that well-designed shallow neural networks can outperform many deep neural networks. More recently, [27] shows that using a dense structure, wider yet shallower networks can significantly improve the accuracy compared to deeper networks. In addition, [42] shows that larger width leads to a better optimization landscape. 
While previous work has mainly studied the effect of network structure on prediction accuracy, we focus on its effect on the optimal choice of batch size for distributed computation.\n\n3 Setup and Preliminaries\n\nIn this section, we present the necessary background and problem setup.\nMini-batch SGD The process of training a model from data can be cast as an optimization problem known as empirical risk minimization (ERM):\n\n$\\min_{w} \\frac{1}{n} \\sum_{i=1}^{n} \\ell(w; (x_i, y_i))$\n\nwhere $x_i \\in \\mathbb{R}^m$ represents the ith data point, n is the total number of data points, $w \\in \\mathbb{R}^d$ is a parameter vector or model, and $\\ell(\\cdot;\\cdot)$ is a loss function that measures the prediction accuracy of the model on each data point. One way to approximately solve the above ERM is through mini-batch stochastic gradient descent (SGD), which operates as follows:\n\n$w_{(k+1)B} = w_{kB} - \\gamma \\sum_{\\ell=kB}^{(k+1)B-1} \\nabla f_{s_\\ell}(w_{kB}),$ (3.1)\n\nwhere each index $s_\\ell$ is drawn uniformly at random from [n] with replacement. We use w with subscript kB to denote the model we obtain after k distributed iterations, i.e., a total number of kB gradient updates. In related studies there is often a normalization factor included in the batch computation, but here we subsume that in the step size $\\gamma$.\nGradient diversity and speed of convergence Gradient diversity measures the degree to which individual gradients of the loss function are different from each other.\nDefinition 1 (Gradient Diversity [19]). 
We refer to the following ratio as gradient diversity:\n\n$\\Delta_S(w) := \\frac{\\sum_{i=1}^{n} \\|\\nabla f_i(w)\\|_2^2}{\\|\\sum_{i=1}^{n} \\nabla f_i(w)\\|_2^2} = \\frac{\\sum_{i=1}^{n} \\|\\nabla f_i(w)\\|_2^2}{\\sum_{i=1}^{n} \\|\\nabla f_i(w)\\|_2^2 + \\sum_{i \\neq j} \\langle \\nabla f_i(w), \\nabla f_j(w) \\rangle}.$\n\nThe gradient diversity $\\Delta_S(w)$ is large when the inner products between the gradients taken with respect to different data points are small. Equipped with the notion of gradient diversity, we define a batch size bound $B_S(w)$ for each data set S and each w as follows:\n\n$B_S(w) := n \\cdot \\Delta_S(w).$\n\nThe following result [19] uses the notion of gradient diversity to capture the convergence rate of mini-batch SGD.\nLemma 1. [Theorem 3 in [19], Informal] Suppose $B \\le \\delta \\cdot n\\Delta_S(w) + 1$ for all w in each iteration. If serial SGD achieves an \u03b5-suboptimal solution after T gradient updates, then using the same step-size as serial SGD, mini-batch SGD with batch size B can achieve a $(1 + \\frac{\\delta}{2})\\varepsilon$-suboptimal solution after the same number of gradient updates/data passes (i.e., T/B iterations).\n\nThe above result is true for both convex and non-convex problems, and its main message is that mini-batch SGD does not suffer from speedup saturation as long as the batch size is smaller than $n \\cdot \\Delta_S(w)$ (up to a constant factor). 
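Definition 1 and the bound $B_S(w) = n \cdot \Delta_S(w)$ are straightforward to evaluate numerically. Below is a small sketch (not the authors' code) that computes gradient diversity from a stack of per-example gradients and checks the definition's two extremes:

```python
import numpy as np

def gradient_diversity(grads):
    """Definition 1: sum of squared per-example gradient norms, divided by
    the squared norm of the gradient sum. grads is an (n, d) array whose
    i-th row is grad f_i(w)."""
    sum_sq_norms = float(np.sum(grads ** 2))
    norm_of_sum_sq = float(np.sum(grads.sum(axis=0) ** 2))
    return sum_sq_norms / norm_of_sum_sq

# Sanity checks matching the definition's extremes:
# identical gradients give diversity 1/n, orthonormal gradients give 1.
same = np.ones((8, 3))
ortho = np.eye(8)
print(gradient_diversity(same))   # 1/8
print(gradient_diversity(ortho))  # 1.0

# Batch-size bound from the text: B_S(w) = n * Delta_S(w).
n = 8
print("bound for orthonormal gradients:", n * gradient_diversity(ortho))
```

The two extremes illustrate the message of Lemma 1: with identical gradients the safe batch size collapses to a constant, while with mutually orthogonal gradients it scales with n.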
Moreover, [19] also shows that this is a worst-case optimal bound, i.e., (roughly) if the batch size is larger than n times the gradient diversity, there exists some model such that the convergence rate of mini-batch SGD is slower than that of serial SGD.\nThe main theoretical question that we study in this work is the following: how does gradient diversity change as neural networks\u2019 structure (depth and width) varies?\nFully-connected Neural Networks We consider both linear and non-linear fully connected networks, with $L \\ge 2$ layers. We denote by $K_\\ell$ the width (number of nodes) of the $\\ell$-th layer, where $\\ell \\in \\{0, \\ldots, L\\}$. The first layer corresponds to the input of dimension d, hence $K_0 = d$. The last layer corresponds to the single output of the neural network, hence $K_L = 1$. The weights of the edges that connect the $\\ell$ and $\\ell-1$ layers, where $\\ell \\in \\{1, \\ldots, L\\}$, are represented by the matrix $W_\\ell \\in \\mathbb{R}^{K_\\ell \\times K_{\\ell-1}}$. For the sake of simplicity, we will express the collection of weights (i.e., the model) as $w = (W_1, W_2, \\ldots, W_L)$.\nA general neural network (NN) with $L \\ge 2$ layers can be described as a collection of matrices $W_1, \\ldots, W_L$, where $W_\\ell \\in \\mathbb{R}^{K_\\ell \\times K_{\\ell-1}}$, together with a (generally nonlinear) activation function $\\sigma(\\cdot)$. The output of a NN on input data point $x_i$ is then defined as $\\hat{y}_i = W_L \\cdot \\sigma(\\cdots \\sigma(W_2 \\cdot \\sigma(W_1 \\cdot x_i)))$. There are different types of activation that we study, i.e., $\\tanh(x)$, the softsign function $\\frac{x}{1+|x|}$, $\\arctan(x)$, and the ReLU function $\\max\\{0, x\\}$. For linear neural networks (LNNs), we denote $W = \\prod_{\\ell=1}^{L} W_\\ell = W_L \\cdot W_{L-1} \\cdots W_1$. 
We will also write $W_{\\ell,p,q}$ to denote the element in the p-th row and q-th column of matrix $W_\\ell$.\nThe output of the neural network with input $x_i$ is defined as $\\hat{y}_i$. Throughout the theory part of this paper, we will use the square loss function to measure the error, which we denote for the i-th data point as $f_i = \\frac{1}{2}(\\hat{y}_i - y_i)^2$. Further, we assume that the data is achievable, i.e., there exists $W^*_{\\ell,p,q}$ such that the loss function is 0 on each data point when $W_{\\ell,p,q} = W^*_{\\ell,p,q}$.\n\n4 Main Results\n\nIn this section, we present a theoretical analysis on how structural properties of a neural network, and in particular the depth and width, influence the gradient diversity, and hence the convergence rate of mini-batch SGD for varying batch size B. All proofs are left to the Appendix.\nIn the following derivations, we will assume that the labels $\\{y_1, \\ldots, y_n\\}$ of the n data points are realizable, i.e., there exists a network of L layers that on input $x_i$ outputs $y_i$. Our results are presented as probabilistic statements, and hold for almost all weight matrices.\nWarmup: 2-Layer Linear Neural Networks Our first result concerns the case of a simple 2-layer linear neural network with one hidden layer. To simplify notation, we will denote the width of the hidden layer with $K = K_1$. Further, $\\Theta(\\cdot)$ and $\\Omega(\\cdot)$ are used in their standard meaning. The main result can be stated as follows:\nTheorem 1. Consider a 2-layer LNN. Let the weights $W_{l,p,q}$, $W^*_{l,p,q}$ for $l \\in \\{1, 2\\}$ and $x_i$ be independently drawn random variables, such that their k-th order moments for $k \\le 4$ are bounded in a positive interval. Then, with arbitrary constant probability, the following holds:\n\n$B_S(w) \\ge \\frac{\\Theta(nKd)}{\\Theta(Kn + dn + Kd)}.$\n\nFor sufficiently large n, the above ratio on the batch size scales like $\\frac{\\Theta(Kd)}{\\Theta(K+d)}$. 
This ratio is always increasing as a function of the width of the hidden layer, which implies that larger width allows for a larger batch size.\n2-Layer Nonlinear Neural Networks As a next step in our theoretical analysis, we analyze general 2-layer NNs with a nonlinear activation function $\\sigma$.\nTheorem 2. Consider a 2-layer NN with a monotone activation function $\\sigma$ such that for every x we have $-\\sigma(x) = \\sigma(-x)$, and both $|\\sigma(x)|$ and $\\sup_x\\{x\\sigma'(x)\\}$ are bounded. Let the weights $W_{l,p,q}$, $W^*_{l,p,q}$ for $l \\in \\{1, 2\\}$ and $x_i$ be i.i.d. random variables from $N(0, 1)$. Then, with high probability, the following holds:\n\n$\\frac{\\mathbb{E}[n \\sum_{i=1}^{n} \\|\\nabla f_i\\|_2^2]}{\\mathbb{E}[\\|\\sum_{i=1}^{n} \\nabla f_i\\|_2^2]} \\ge \\Omega\\left(\\frac{Kd^2}{Kd + K + d}\\right),$\n\nwhere the expectation is over $W_2$, $W^*_2$.\nWe should remark here that the above bound is weaker than the one obtained for the case of 2-layer LNNs, since it bounds the ratio of the expectations, and not the expectation of the ratio (the batch size bound). Nevertheless, we conjecture that the batch size bound concentrates, and thus the above theorem can approximate the batch size bound well.\nAnother remark is that several commonly used activation functions in NNs, such as tanh, arctan, and the softsign function, satisfy the assumptions of the above theorem. The same trends can be observed here as in the case of 2-layer LNNs: (i) larger width leads to a larger gradient diversity, and thus faster convergence of distributed mini-batch SGD, and (ii) the ratio can never exceed $\\Omega(d)$.\nMultilayer Linear Neural Networks We generalize here our result for 2-layer LNNs to general multilayer LNNs of arbitrary depth $L \\ge 2$. Below is our main result.\nTheorem 3. Let the weight values $W_{l,p,q}$ for $l \\in \\{1, \\ldots, L\\}$ and $x_i$ be independently drawn random variables from $N(0, 1)$. Consider a multilayer LNN where $f_i = \\frac{1}{2}(W x_i - W^* x_i)^2 = \\frac{1}{2}(\\prod_{\\ell=1}^{L} W_\\ell x_i - \\prod_{\\ell=1}^{L} W^*_\\ell x_i)^2$. Assuming that $K_\\ell \\ge 2$ for every $\\ell \\in \\{0, \\ldots, L-1\\}$, and that n is sufficiently large, then we have:\n\n$\\rho = \\frac{\\mathbb{E}[n \\sum_{i=1}^{n} \\|\\nabla f_i\\|_2^2]}{\\mathbb{E}[\\|\\sum_{i=1}^{n} \\nabla f_i\\|_2^2]} \\ge \\frac{L}{\\sum_{\\varphi=1}^{L-1} \\frac{L-\\varphi}{K_\\varphi - 1} + \\frac{2L}{d-1}}.$ (4.1)\n\nAgain, note that the above bound is weaker than the one obtained for the case of 2-layer LNNs, since it bounds the ratio of the expectations, and not the expectation of the ratio. It is believed that the denominator and numerator should concentrate around their expectations (as was the case in Theorem 1) and thus the ratio of the expectations reflects the expectation of the ratio. Whether this can be proved remains an interesting open question.\nWe next discuss the implications of Theorem 3 on the convergence rate of mini-batch SGD. To analyze the behavior of the bound, consider the simple case where all the hidden layers ($\\ell = 1, \\ldots, L-1$) have exactly the same width K. In this case, the ratio in Eq. (4.1) becomes:\n\n$\\rho \\ge \\frac{1}{\\frac{L-1}{2(K-1)} + \\frac{2}{d-1}} = \\Theta\\left(\\frac{dK}{dL + K}\\right).$\n\nThere are three takeaways from the above bound. First, by increasing the width K of the LNN, the ratio increases as well, which implies that the convergence rate increases. Second, the effect of the depth L is the opposite: by increasing the depth, the ratio decreases. Third, the ratio can never exceed $\\Theta(d)$, but it can be arbitrarily small. Suppose now that we fix the total number of weights in the LNN, and then start increasing the width of each layer (which means that the depth will decrease). In this case, the ratio will also increase.\nWe conclude by noting that the same behavior of the bound w.r.t. 
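To see the equal-width bound numerically, the sketch below evaluates the equal-width form of Theorem 3's bound for the width/depth pairs shown in Figure 1 (21/1, 19/5, 17/10) with MNIST input dimension d = 784. Treating the figure's "Depth" as the number of hidden layers, so that L = depth + 1, is our assumption, and the constants hidden by the Θ(·) notation are dropped:

```python
# Hedged numeric sketch of the equal-width bound from Theorem 3:
#   rho >= 1 / ((L - 1) / (2 * (K - 1)) + 2 / (d - 1)).
# Assumption: Figure 1's "Depth" counts hidden layers, so L = depth + 1.
def rho_lower_bound(K, L, d):
    return 1.0 / ((L - 1) / (2 * (K - 1)) + 2.0 / (d - 1))

d = 784  # MNIST input dimension
configs = [(21, 1), (19, 5), (17, 10)]  # (width, depth) pairs from Figure 1
for K, depth in configs:
    r = rho_lower_bound(K, depth + 1, d)
    print(f"K={K:2d} depth={depth:2d}  rho_bound={r:6.2f}")
# Wider/shallower configurations get a larger bound, hence tolerate larger batches.
```

The ordering of these bounds matches the ordering of the convergence curves in Figure 1: the widest, shallowest network has the largest bound.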
width and depth can be observed if we drop the simplifying assumption that all layers have the same width.\n\n5 Experiments\n\nIn this section, we provide empirical results on how the structure of a neural network (width and depth) impacts its amenability to large-batch training using various datasets and network architectures. Our main findings are three-fold:\n\n1. For all neural networks we used, there exists a threshold B*, such that using a batch size larger than this threshold induces slower convergence;\n\n2. The threshold of wider neural networks is often larger than that of deeper ones;\n\n3. When using the same large batch size, almost all wider neural networks need much fewer epochs to converge compared to their deeper counterparts.\n\nThose findings validate our theoretical analysis and suggest that wider neural networks are indeed more amenable to large-batch training and thus more suitable to scale out.\n\nDataset: Synthetic | MNIST | Cifar10 | EMNIST | Gisette\n# datapoints: 10,000 | 70,000 | 60,000 | 131,600 | 6,000\nModel: linear FC | FC/LeNet | ResNet-18/34 | FC | FC\n# Classes: +\u221e | 10 | 10 | 47 | 2\n# Parameters: 16k | 16k / 431k | 11m / 21m | 16k | 262k\nConverged Accuracy: 10^{-12} (loss) | 96% / 98% | 95% | 65% | 95%\n\nTable 1: The datasets used and their associated learning models and hyper-parameters.\n\nImplementation and Setup We implemented our experimental pipeline in Keras [43], and conducted all experiments on p2.xlarge instances on Amazon EC2. All results reported are averaged from 5 independent runs.\nDatasets and Networks Table 1 summarizes the datasets and networks used in the experiments. In the synthetic dataset, all data points were independently drawn from $N(0, 1)$ as described by our theory results. A deep linear fully connected neural network (FC) whose weights were generated from $N(0, 1)$ independently was used to produce the true labels. 
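The synthetic setup just described (inputs drawn i.i.d. from N(0, 1), labels produced by a random deep linear FC) can be sketched as follows; n = 10,000 follows Table 1, while the sizes d, width, and depth here are hypothetical illustrative choices:

```python
import numpy as np

# Sketch of the synthetic regression data described above. n follows Table 1;
# d, width, and depth are hypothetical choices for illustration.
rng = np.random.default_rng(0)
n, d, width, depth = 10_000, 20, 21, 3

X = rng.normal(size=(n, d))  # data points drawn i.i.d. from N(0, 1)

# Ground-truth deep linear FC with N(0, 1) weights produces the labels.
dims = [d] + [width] * depth + [1]
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

H = X.T
for W in Ws:  # forward pass of the linear network
    H = W @ H
y = H.ravel()

# Because the network is linear, the labels are exactly realizable by the
# collapsed matrix W_L ... W_1, matching the achievability assumption above.
print(X.shape, y.shape)
```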
The task on the synthetic data is a regression task. We train linear FCs on the synthetic dataset. The real-world datasets we used include MNIST [44], EMNIST [45], Gisette [46], and CIFAR-10 [47], with appropriate networks ranging from linear, to non-linear fully connected ones, and to LeNet [48] and ResNet [28].\nFor each network, we fix the total number of parameters and vary its depth/number of layers L and width K. For fully connected networks and LeNet, we vary the depth L from 1 to 10 and change K accordingly to ensure the total number of parameters is approximately fixed. More precisely, we fix the total number of parameters p, and solve the following equation for K:\n\n$d_{in} \\times K + (L-1) \\times K^2 + K \\times d_{out} = p,$\n\nwhere $d_{in}$ is the dimension of the data and $d_{out}$ is the size of the output. For ResNet, we vary two parameters separately. We first vary the width and depth of the fully connected layers without changing the residual blocks. Next, we fix the fully connected layers and change the number of blocks and convolution filters in each chunk, where we refer to each group of stacked building blocks in a residual function described in [28] as a chunk. For the ResNet-18/34 architecture, we use $[s_1, s_2, s_3, s_4]$ to denote a particular structure, where $s_1$ represents the number of blocks stacked in the first chunk, $s_2$ is the number of blocks stacked in the second chunk, etc. For varying depths, we incrementally increase or decrease one block in each chunk and adjust the number of convolutional filters in each block to meet the fixed number of parameters requirement.\nFor each combination of depth and width of a NN architecture, we train the model by setting a constant threshold on training accuracy for classification tasks, or on loss for regression tasks. We then train the NN for a variety of batch sizes, in the range $2^i$ for $i \\in \\{5, \\ldots, 12\\}$. 
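The parameter-count equation above can be solved for K in closed form (the positive root of a quadratic when L > 1). A small sketch, using p = 16k and MNIST-like d_in = 784, d_out = 10 as in Table 1; the authors' exact widths may differ slightly, e.g., because bias terms are ignored here:

```python
import math

# Solve d_in*K + (L-1)*K^2 + K*d_out = p for the positive root K.
# p = 16k, d_in = 784, d_out = 10 follow the MNIST FC setting of Table 1;
# the paper's exact widths may differ slightly (bias terms are ignored).
def hidden_width(p, L, d_in, d_out):
    if L == 1:  # no K^2 term when there is a single hidden layer
        return p / (d_in + d_out)
    a, b = L - 1, d_in + d_out
    return (-b + math.sqrt(b * b + 4 * a * p)) / (2 * a)

for L in (1, 4, 7, 10):
    print(f"L={L:2d}  K~{hidden_width(16_000, L, 784, 10):.1f}")
```

The resulting widths shrink slowly as depth grows, consistent with the (K, L) pairs that appear in the figure legends.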
We tune the step size in the following way: (i) for all learning rates \u03b7 from a grid of candidate values, we run the training process with \u03b7 for 2 passes over the data; and then (ii) we choose \u02c6\u03b7 which leads to the lowest training loss after two epochs. An epoch represents a full pass over the data.\nExperimental Results We first verify whether gradient diversity reflects the amenability to large-batch training. For each linear FC network with fixed width and depth, we measure its gradient diversity every ten epochs and compute the average. Figure 2(a) shows how the averaged gradient diversity varies as depth/width changes, while Figure 2(b) presents the largest batch size to converge for each network within a pre-set number of epochs. Both of them increase as the width K of the networks increases. In fact, as shown in Figure 2(c), the largest batch size that does not impact the convergence rate grows monotonically w.r.t. the gradient diversity. This validates our theoretical analysis that gradient diversity can be used to capture the amenability to large-batch training.\n\n(a) Gradient Diversity\n\n(b) Largest Batch Size\n\n(c) Diversity vs Batch Size\n\nFigure 2: The effect of gradient diversity for linear FCs trained on the synthetic dataset for a regression task. (a) Gradient diversity for different width/depth. (b) Largest batch size to converge to loss 10^{-12}, within a pre-set number (i.e., 14) of epochs. (c) Largest batch size vs. gradient diversity.\n\n(a) Synthetic, Linear FC\n\n(b) MNIST, FC\n\n(c) EMNIST, FC\n\n(d) Gisette, FC\n\n(e) MNIST, LeNet\n\n(f) Cifar10, ResNet18, FC (g) Cifar10, ResNet18, Res (h) Cifar10, ResNet34, Res\nFigure 3: Number of epochs needed to converge to the same loss / accuracy given in Table 1. K represents width, and L depth. In (f) we fix the residual blocks of ResNet 18 and only vary the fully-connected parts. 
In (g) and (h), we fix the fully connected layers and vary the residual blocks of ResNet 18 and ResNet 34.\n\nNext, we study the number of epochs needed to converge when different batch sizes are used for real-world datasets. First, for almost all network architectures, there exists a batch size threshold, such that using a batch size larger than this requires more epochs for convergence, consistent with the observations in [19]. For example, in Figure 3(b), when the batch size is smaller than 256, the FC network with width K = 17 and depth L = 10 needs a small number (2 to 3) of epochs to converge. But when the batch size becomes larger than 256, the number of epochs necessary for convergence increases significantly, e.g., it takes 50 epochs to converge when the batch size is 4096. Moreover, we observe that the threshold increases as width increases. Again as shown in Figure 3(b), the batch-size threshold for the FC network with L = 10 is 256, but goes up to 1024 with L = 1. Furthermore, when using the same large batch size, wider networks tend to require fewer epochs to converge than the deeper ones. In Figure 3(c), for instance, using the same batch size of 4096, the number of epochs required to converge decreases from 211 to 9 as the width K increases from 17 to 21. Those trends are similar for all FC networks we used in the experiments.\nWhen it comes to ResNets and LeNet, the trends are not always as sharp. This is expected since our theoretical analysis does not cover such cases, but the main trend can still be observed. For example, as shown in Figures 3(e) and 3(f), for a fixed batch size, increasing the width almost always leads to a decrease in the number of epochs for convergence. Figure 4 depicts the exact number of epochs to converge for each network architecture, plotted as a heatmap. 
It is interesting to see that for ResNet, there is a small fraction of cases where an increase of depth can also reduce the number of epochs for convergence.\nIn many practical applications, only a reasonable and limited number of data passes is performed due to time and resource constraints. Thus, we also study how the structure of a network affects the largest possible batch size to converge within a fixed number of epochs/data passes to a pre-specified accuracy. As shown in Figure 5, neural networks with larger width K usually allow much larger batch sizes to converge within a small, pre-set number of total epochs. This is especially beneficial in\n\n(a) Synthetic, Linear FC\n\n(b) MNIST, Linear FC\n\n(c) EMNIST, FC\n\n(d) Gisette, FC\n\n(e) MNIST, LeNet\n\n(f) Cifar10, ResNet18, FC (g) Cifar10, ResNet18, Res (h) Cifar10, ResNet34, Res\n\nFigure 4: Heatmap on number of epochs needed to converge to loss / accuracy defined in Table 1. 
We report the log10 of the epochs for (a) and the actual epochs for the others.\n\n(a) Synthetic, Linear FC\n\n(b) MNIST, FC\n\n(c) EMNIST, FC\n\n(d) Gisette, FC\n\n(e) MNIST on LeNet\n\n(f) Cifar10, ResNet18, FC (g) Cifar10, ResNet18, Res (h) Cifar10, ResNet34, Res\n\nFigure 5: Largest possible batch size to converge within a fixed number of epochs.\n\nthe scenarios of large-scale distributed learning, since increasing the batch size can result in more speedup gains due to a reduction in the total amount of communication. Finally, we should note that the largest batch size differs among different networks, as well as among different datasets. This is because gradient diversity is both data-dependent and model-dependent.\n\n6 Conclusion\n\nIn this paper, we study how the structure of a neural network affects the performance of large-batch training. Through the lens of gradient diversity, we quantitatively connect a network\u2019s amenability to larger batches during training with its depth and width. Extensive experimental results, along with theoretical analysis, demonstrate that for a large class of neural networks, increasing width leads to larger gradient diversity and thus allows for larger-batch training, which is beneficial for distributed computation.\nIn the future, we plan to explore how a particular structure, e.g., convolutional filters, residual blocks, etc., affects gradient diversity. From a practical perspective, we argue that it is important to consider the architecture of a network with regard to its amenability for speedups in a distributed setting. Hence, we plan to explore how one can fine-tune a network so that large-batch training is enabled, and communication bottlenecks are minimized. 
Another direction is to quantitatively study how the generalization error is affected.\n", "award": [], "sourceid": 5672, "authors": [{"given_name": "Lingjiao", "family_name": "Chen", "institution": "University of Wisconsin-Madison"}, {"given_name": "Hongyi", "family_name": "Wang", "institution": "University of Wisconsin-Madison"}, {"given_name": "Jinman", "family_name": "Zhao", "institution": "University of Wisconsin-Madison"}, {"given_name": "Dimitris", "family_name": "Papailiopoulos", "institution": "UW-Madison"}, {"given_name": "Paraschos", "family_name": "Koutris", "institution": "University of Wisconsin-Madison"}]}