{"title": "Collaborative Learning for Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1832, "page_last": 1841, "abstract": "We introduce collaborative learning in which multiple classifier heads of the same network are simultaneously trained on the same training data to improve generalization and robustness to label noise with no extra inference cost. It acquires the strengths from auxiliary training, multi-task learning and knowledge distillation. There are two important mechanisms involved in collaborative learning. First, the consensus of multiple views from different classifier heads on the same example provides supplementary information as well as regularization to each classifier, thereby improving generalization. Second, intermediate-level representation (ILR) sharing with backpropagation rescaling aggregates the gradient flows from all heads, which not only reduces training computational complexity, but also facilitates supervision to the shared layers. The empirical results on CIFAR and ImageNet datasets demonstrate that deep neural networks learned as a group in a collaborative way significantly reduce the generalization error and increase the robustness to label noise.", "full_text": "Collaborative Learning for Deep Neural Networks\n\nGuocong Song\nPlayground Global\nPalo Alto, CA 94306\nsonggc@gmail.com\n\nWei Chai\nGoogle\n\nMountain View, CA 94043\n\nchaiwei@google.com\n\nAbstract\n\nWe introduce collaborative learning in which multiple classi\ufb01er heads of the\nsame network are simultaneously trained on the same training data to improve\ngeneralization and robustness to label noise with no extra inference cost. It acquires\nthe strengths from auxiliary training, multi-task learning and knowledge distillation.\nThere are two important mechanisms involved in collaborative learning. 
First, the\nconsensus of multiple views from different classi\ufb01er heads on the same example\nprovides supplementary information as well as regularization to each classi\ufb01er,\nthereby improving generalization. Second, intermediate-level representation (ILR)\nsharing with backpropagation rescaling aggregates the gradient \ufb02ows from all heads,\nwhich not only reduces training computational complexity, but also facilitates\nsupervision to the shared layers. The empirical results on CIFAR and ImageNet\ndatasets demonstrate that deep neural networks learned as a group in a collaborative\nway signi\ufb01cantly reduce the generalization error and increase the robustness to\nlabel noise.\n\n1\n\nIntroduction\n\nWhen training deep neural networks, we must confront the challenges of general nonconvex opti-\nmization problems. Local gradient descent methods that most deep learning systems rely on, such\nas variants of stochastic gradient descent (SGD), have no guarantee that the optimization algorithm\nwill converge to a global minimum. It is well known that an ensemble of multiple instances of a\ntarget neural network trained with different random seeds generally yields better predictions than\na single trained instance. However, an ensemble of models is too computationally expensive at\ninference time. To keep the exact same computational complexity for inference, several training\ntechniques have been developed by adding additional networks in the training graph to boost accuracy\nwithout affecting the inference graph, including auxiliary training [19], multi-task learning [4, 3],\nand knowledge distillation [10]. Auxiliary training is introduced to improve the convergence of deep\nnetworks by adding auxiliary classi\ufb01ers connected to certain intermediate layers [19]. However,\nauxiliary classi\ufb01ers require speci\ufb01c new designs for their network structures in addition to the target\nnetwork. 
Furthermore, it was later found [20] that auxiliary classifiers do not yield obvious improvements in convergence or accuracy. Multi-task learning is an approach that learns multiple related tasks simultaneously so that knowledge obtained from each task can be reused by the others [4, 3, 21]. However, it is not useful for a single-task use case. Knowledge distillation is introduced to facilitate training a smaller network by transferring knowledge from another high-capacity model, so that the smaller one obtains better performance than when trained using labels only [10]. However, distillation is not an end-to-end solution because it has two separate training phases, which consume more training time.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we propose a framework of collaborative learning that trains several classifier heads of the same network simultaneously on the same training data to cope with the above challenges. The method combines the advantages of auxiliary training, multi-task learning, and knowledge distillation: it appends the exact same network as the target one in the training graph for a single task, shares intermediate-level representations (ILRs), learns from the outputs of other heads (peers) besides the ground-truth labels, and keeps the inference graph unchanged. Experiments have been performed with several popular deep neural networks on different datasets to benchmark performance, and the results demonstrate that collaborative learning provides significant accuracy improvement for image classification problems in a generic way. There are two major mechanisms from which collaborative learning benefits: 1) The consensus of multiple views from different classifier heads on the same data provides supplementary information and regularization to each classifier.
2) Besides the computational-complexity reduction afforded by ILR sharing, backpropagation rescaling aggregates the gradient flows from all heads in a balanced way, which leads to additional performance enhancement. The per-layer network weight distribution shows that ILR sharing reduces the number of "dead" filter weights caused by the vanishing-gradient issue in the bottom layers, thereby enlarging the network capacity.
The major contributions are summarized as follows. 1) Collaborative learning provides a new training framework: for any given model architecture, the proposed collaborative training method can potentially improve accuracy, with no extra inference cost, no need to design another model architecture, and minimal hyperparameter re-tuning. 2) We introduce ILR sharing into co-distillation, which not only improves training time/memory efficiency but also reduces generalization error. 3) Backpropagation rescaling, which we propose to avoid gradient explosion when the number of heads is large, is also shown to improve accuracy when the number of heads is small. 4) Collaborative learning is demonstrated to be robust to label noise.

2 Related work

In addition to auxiliary training, multi-task learning, and distillation mentioned before, we list other related work as follows.
General label smoothing. Label smoothing replaces the hard values (1 or 0) in one-hot labels for a classifier with smoothed values, and is shown to reduce the vulnerability to noisy or incorrect labels in datasets [20]. It regularizes the model and relaxes the confidence in the labels. Temporal ensembling forms a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs to improve the performance of semi-supervised learning [14].
However, it is hard to scale to a large dataset, since temporal ensembling requires memorizing the smoothed label of each data example.
Two-way distillation. Co-distillation of two instances of the same neural network is studied in [2] with a focus on training speed-up in a distributed learning environment. Two-way distillation between two networks, which may share the same architecture or use different ones, is also studied in [23]. Each network alternately optimizes its own parameters. However, the developed algorithms are far from optimized. First, when different classifiers have different architectures, each should have a different weight associated with its loss function to balance the injected backpropagation error flows. Second, multiple copies of the target network proportionally increase the memory consumption on graphics processing units (GPUs) and the training time.
Self-distillation/born-again neural networks. Self-distillation is a form of distillation in which the student network is identical to the teacher in terms of the network graph. Furthermore, the distillation process can be performed consecutively several times. At each consecutive step, a new identical model is initialized from a different random seed and trained under the supervision of the earlier generation. At the end of the procedure, additional gains can be achieved with an ensemble of the multiple student generations [7].
However, multiple self-distillation processes multiply the total training time proportionally; an ensemble of multiple student generations increases the inference time accordingly as well.
In comparison, the major goal of this paper is to improve the accuracy of a target network without changing its inference graph, emphasizing both accuracy and training efficiency.

3 Collaborative learning

The framework of collaborative learning consists of three major parts: the generation of a population of classifier heads in the training graph, the formulation of the learning objective, and optimization

(a) Target network  (b) Multiple instances  (c) Simple ILR sharing  (d) Hierarchical ILR sharing

Figure 1: Multiple head patterns for training. Three colors represent subnets g1, g2, and g3 in (1).

for learning a group of classifiers collaboratively. We describe the details of each part in the following subsections.

3.1 Generation of training graph

Similar to auxiliary training [19], we add several new classifier heads into the original network graph during training. At inference time, only the original network is kept and all added parts are discarded. Unlike auxiliary training, each classifier head here has a network identical to the original one in terms of graph structure. This approach leads to advantages over auxiliary training in terms of engineering-effort minimization. First, it does not require designing additional networks for the auxiliary classifiers. Second, because of the structural symmetry of all heads, no additional per-head loss weights are needed to balance the injected backpropagation error flows: an equal weight for each head's objective is optimal for training.
Figure 1 illustrates several patterns for creating a group of classifiers in the training graph. Figure 1 (a) is a target network to train.
The network can be expressed as z = g(x; θ), where g is determined by the graph architecture and θ represents the network parameters. To better explain the following patterns, we assume the network g can be represented as a cascade of three functions or subnets,

g(x; θ) = g3(g2(g1(x; θ1); θ2); θ3)    (1)

where θ = [θ1, θ2, θ3] and θi includes all parameters of subnet gi. In Figure 1 (b), each head is just a new instance of the original network. The output of head h is z(h) = g(x; θ(h)), where θ(h) is an instance of the network parameters for head h. Another pattern allows all heads to share ILRs in the same low layers, as shown in Figure 1 (c). This structure is very similar to multi-task learning [4, 3], in which different supervised tasks share the same input as well as some ILRs. However, collaborative learning has the same supervised task for all heads. It can be expressed as z(h) = g3(g2(g1(x; θ1); θ2^(h)); θ3^(h)), where there is only one instance of θ1 shared by all heads. Furthermore, multiple heads can take advantage of multiple hierarchical ILRs, as shown in Figure 1 (d). The hierarchy is similar to a binary tree in which the branches at the same level are copies of each other. For inference, we just need to keep one head with its dependent nodes and discard the rest. Therefore, the inference graph is identical to the original graph g.
It is shown in [17, 5] that the training memory size is roughly proportional to the number of layers/operations. With the multi-instance pattern, the number of parameters in the whole training graph is proportional to the number of heads. Obviously, ILR sharing can proportionally reduce the memory consumption and speed up training, compared to multiple instances without sharing.
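As a concrete sketch of the two basic patterns, the toy NumPy code below builds H heads either as full instances (Figure 1 (b)) or with a single shared g1 (Figure 1 (c)) and compares parameter counts. This is a hypothetical, NumPy-only illustration; the subnet shapes and helper names (`make_subnet`, `forward`) are our own and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_subnet(n_in, n_out):
    """A toy dense layer standing in for a subnet g1, g2, or g3."""
    return rng.standard_normal((n_in, n_out)) * 0.1

def forward(Ws, x):
    """Cascade g3(g2(g1(x))): ReLU between subnets, linear logits at the end."""
    for W in Ws[:-1]:
        x = np.maximum(x @ W, 0.0)
    return x @ Ws[-1]

H = 4          # number of classifier heads
d, m = 32, 10  # input dimension, number of classes

# Pattern (b): multiple full instances -- every head owns its theta1, theta2, theta3.
instances = [[make_subnet(d, 64), make_subnet(64, 64), make_subnet(64, m)]
             for _ in range(H)]

# Pattern (c): simple ILR sharing -- one shared theta1, per-head theta2 and theta3.
shared_g1 = make_subnet(d, 64)
heads = [[shared_g1, make_subnet(64, 64), make_subnet(64, m)] for _ in range(H)]

n_params = lambda nets: sum(W.size for net in nets for W in net)
# theta1 appears in every head of `heads` but is the same array, so count it once.
n_shared = shared_g1.size + sum(W.size for net in heads for W in net[1:])
```

With sharing, the training graph stores (H − 1) fewer copies of θ1; the saving grows with the size of the shared trunk, which is why the paper places split points after entire residual/dense blocks.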
It is more interesting that the empirical results and analysis in Section 4 will demonstrate that ILR sharing is able to boost the classification accuracy as well.

3.2 Learning objectives

The main idea of collaborative learning is that each head learns from the ground-truth labels and also from the whole population throughout the training process. We focus on multi-class classification problems in this paper. For head h, the classifier's logit vector is represented as z = [z1, z2, . . . , zm]^T for m classes. The associated softmax with temperature T is defined as

σi(z(h); T) = exp(z(h)_i / T) / Σ_{j=1}^{m} exp(z(h)_j / T)    (2)

When T = 1, (2) is just the normal softmax function. Using a higher value of T produces a softer probability distribution over classes. The loss function for head h is proposed as

L(h) = β Jhard(y, z(h)) + (1 − β) Jsoft(q(h), z(h))    (3)

where β ∈ (0, 1]. The objective function with regard to a ground-truth label, Jhard, is just the classification loss, i.e., the cross entropy between a one-hot encoding of the label y and the softmax output with temperature 1: Jhard(y, z(h)) = − Σ_{i=1}^{m} yi log(σi(z(h); 1)). The soft label of head h is proposed to be a consensus of all other heads' predictions:

q(h) = σ( (1 / (H − 1)) Σ_{j≠h} z(j); T )

which combines the multiple views on the same data and contains additional information beyond the ground-truth label.
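The softened softmax and the consensus soft label above can be sketched in NumPy. This is a self-contained, hypothetical illustration (the logit values are arbitrary; the function names are ours, not the paper's):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Softmax with temperature T, as in (2); z is a logit vector."""
    e = np.exp((z - z.max()) / T)  # max-shift for numerical stability
    return e / e.sum()

def consensus_soft_label(logits, h, T=2.0):
    """Soft label q(h): softened softmax of the mean logits of all other heads."""
    others = [z for j, z in enumerate(logits) if j != h]
    return softmax_T(np.mean(others, axis=0), T)

# Toy example with H = 3 heads and m = 4 classes.
logits = [np.array([2.0, 0.5, -1.0, 0.1]),
          np.array([1.5, 1.0, -0.5, 0.0]),
          np.array([2.2, 0.2, -0.8, 0.3])]
q0 = consensus_soft_label(logits, h=0)  # consensus of heads 1 and 2
```

Raising T flattens the consensus distribution, so the non-argmax classes retain mass that carries the "dark knowledge" each head distills from its peers.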
The objective function with regard to the soft label is the cross entropy between the soft label and the softmax output with a certain temperature, i.e.,

Jsoft(q(h), z(h)) = − Σ_{i=1}^{m} q(h)_i log(σi(z(h); T))

which can be regarded as a distance measure between the average prediction of the population and the prediction of each head [10]. Minimizing this objective aims at transferring the information from the soft label to the logits and regularizing the training network.

3.3 Optimization for a group of classifier heads

In addition to performance optimization, another design criterion for collaborative learning is to keep the hyperparameters of the training algorithm, e.g. the type of SGD, regularization, and learning-rate schedule, the same as those used in individual learning. Thus, collaborative learning can simply be put on top of individual learning. The optimization here is mainly designed to account for the new concepts involved in collaborative learning, namely a group of classifiers and ILR sharing.
Simultaneous SGD. Since multiple heads are involved in optimization, it seems straightforward to alternately update the parameters associated with each head one by one. This algorithm is used in both [23, 2]. In fact, alternating optimization is popular in generative adversarial networks [8], in which a generator and discriminator are updated alternately. However, alternating optimization has the following shortcomings. In terms of speed, it is slow because each head needs to recalculate a new prediction after updating its parameters.
In terms of convergence, recent work [15, 16] reveals that simultaneous SGD converges faster and achieves better performance than the alternating one. Therefore, we propose to apply SGD and update all parameters in the training graph simultaneously according to the total loss, which is the sum of each head's loss plus the regularization Ω(θ):

L = Σ_{h=1}^{H} L(h) + λ Ω(θ)    (4)

We suggest keeping the same regularization and hyperparameters as in individual training when applying collaborative learning. It is important in practice to avoid unnecessary hyperparameter search when introducing a new training approach. The effectiveness of simultaneous SGD will be validated in Section 4.1.
Backpropagation rescaling. First, we describe an important stability issue with ILR sharing. Assume that there are H heads sharing subnet g1(·; θ1), as shown in Figure 2 (a), in which θ1 and θ2^(h) represent the parameters of g1 and those of g2 associated with head h, respectively.

(a) No rescaling  (b) Backprop rescaling. Operation I is described in (5).

Figure 2: No rescaling vs backpropagation rescaling

The output of the shared layers, x1, is fed to all corresponding heads. However, the backward graph becomes a many-to-one connection. According to (4), the backpropagation input for the shared layers is ∇x1 L = Σ_{h=1}^{H} ∇x1 L(h). It is not hard to discover an issue: the variance of ∇x1 L grows as the number of heads grows. Assume that the gradient of each head's loss has limited variance, i.e., Var((∇x1 L(h))_i) < ∞, where i indexes the elements of a vector. We should make the system stable, i.e., Var((∇x1 L)_i) < ∞, even when H → ∞.
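The variance growth just described, and the effect of scaling the aggregated backward flow by 1/H, can be illustrated numerically. The toy simulation below assumes i.i.d. unit-variance per-head gradients (our own simplifying assumption, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregated_grad_var(H, n_trials=2000, rescale=False):
    """Empirical variance of one element of the backprop input to the shared
    layers, with per-head gradients drawn i.i.d. with unit variance."""
    g = rng.standard_normal((n_trials, H))  # per-head gradients for one element
    agg = g.sum(axis=1)                     # many-to-one sum in the backward graph
    if rescale:
        agg /= H                            # scale the backward flow by 1/H
    return agg.var()

var_sum_2, var_sum_16 = aggregated_grad_var(2), aggregated_grad_var(16)
var_mean_2, var_mean_16 = (aggregated_grad_var(2, rescale=True),
                           aggregated_grad_var(16, rescale=True))
```

Without rescaling, the variance grows roughly linearly in H (about 2 vs 16 here); with the 1/H factor it stays bounded for any H, which is the stability property the rescaling operation is designed to guarantee.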
Unfortunately, the backpropagation flow of Figure 2 (a) is unstable in the asymptotic sense due to the sum of all gradient flows.
Note that simple loss scaling, i.e., L = (1/H) Σ_h L(h), brings another problem: very slow learning w.r.t. θ2^(h). For a fixed learning rate η, the SGD update is θ2^(h) ← θ2^(h) − η (1/H) ∇_{θ2^(h)} L(h), and η (1/H) ∇_{θ2^(h)} L(h) → 0 when H → ∞.
Therefore, backpropagation rescaling is proposed to achieve two goals at the same time: to normalize the backpropagation flow into subnet g1 and to keep that in subnet g2 the same as in the single-classifier case. The solution is to add a new operation I(·) between g1 and g2, shown in Figure 2 (b):

I(x) = x,   ∇x I = 1/H    (5)

The backpropagation input for the shared layers then becomes

∇x1 L = (1/H) Σ_{h=1}^{H} ∇x1 L(h)    (6)

The variance of (6) is then always limited, which is proven in Section 1 of the Supplementary material. Backpropagation rescaling is essential for ILR sharing to achieve better performance by simply reusing a training configuration well tuned in individual learning. Its effect on classification accuracy will be validated in Section 4.1.
Balance between hard and soft loss objectives. We follow the suggestion in [10] that the backpropagation flow from each soft objective should be multiplied by T², since the magnitudes of the gradients produced by the soft targets scale as 1/T². This ensures that the relative contributions of the hard and soft targets remain roughly unchanged when tuning T.

3.4 Robustness to label noise

In supervised learning, it is hard to completely avoid confusion during network training, whether due to incorrect labels or to data augmentation.
For example, random cropping is a very important data-augmentation technique when training an image classifier. However, the entire labeled object, or a large portion of it, occasionally gets cut off, which really challenges the classifier. Since multiple views on the same example produce a diversity of predictions, collaborative learning is by nature more robust to label noise than individual learning, which will be validated in Section 4.1.

4 Experiments

We evaluate the performance of collaborative learning on various network architectures for several datasets, with analysis of important and interesting observations. We use T = 2 and β = 0.5 for all experiments. In addition, the performance of any model trained with collaborative learning is evaluated using the first classifier head, without head selection. All experiments are conducted with TensorFlow [1].

4.1 CIFAR Datasets

The two CIFAR datasets, CIFAR-10 and CIFAR-100, consist of colored natural images with 32x32 pixels [13] and have 10 and 100 classes, respectively. We conduct empirical studies on the CIFAR-10 dataset with ResNet-32, ResNet-110 [9], and DenseNet-40-12 [11]. ResNets and DenseNets for CIFAR are all designed with three building blocks (residual or dense blocks). For simple ILR sharing, the split point is just after the first block. For hierarchical sharing, the two split points are located after the first and second blocks, respectively. Refer to Section 2 in the Supplementary material for the detailed training setup.

Table 1: Test errors (%) on CIFAR-10.
All experiments are run 5 times, except those of DenseNet-40-12, which are run 3 times.

                                                        | ResNet-32   | ResNet-110  | DenseNet-40-12
Individual     | Single instance                        | 6.66 ± 0.21 | 5.56 ± 0.16 | 5.26 ± 0.08
learning       | Label smoothing (0.05)                 | 6.83 ± 0.14 | 5.66 ± 0.08 | 5.40 ± 0.04
Collaborative  | 2 instances                            | 6.19 ± 0.17 | 5.21 ± 0.14 | 5.11 ± 0.15
learning       | 4 instances                            | 6.16 ± 0.17 | 5.16 ± 0.13 | 5.00 ± 0.05
               | 2 heads w/ simple ILR sharing          | 5.97 ± 0.07 | 5.15 ± 0.14 | 5.04 ± 0.10
               | 4 heads w/ hierarchical ILR sharing    | 5.86 ± 0.13 | 4.98 ± 0.12 | 4.86 ± 0.12

Classification results. All results are summarized in Table 1. It can be concluded from Table 1 that, for a given training-graph pattern, the more classifier heads, the lower the generalization error. More importantly, ILR sharing reduces not only GPU memory consumption and training time but also the generalization error considerably.
Simultaneous vs alternating optimization. We repeat an experiment that was performed in [23]. It is just a special case of collaborative learning in which we train two instances of ResNet-32 on CIFAR-100 with T = 1, β = 0.5. The only difference is that we replace the alternating optimization [23] with the simultaneous one. Table 2 shows that, relative to the corresponding baseline, simultaneous optimization provides an additional 1%+ accuracy gain over the alternating one. With T = 2, simultaneous optimization gains another 1% boost.
Thus, simultaneous optimization substantially outperforms the alternating one in terms of accuracy and speed.

Table 2: Alternating optimization [23] vs simultaneous optimization (ours): test errors (%) of ResNet-32 on CIFAR-100.

                              | Single instance (baseline) | Head 1 of two instances | Head 2 of two instances
[23]                          | 31.01                      | 29.25                   | 28.81
Collaborative learning, T = 1 | 30.52 ± 0.35               | 27.64 ± 0.36            | 27.48 ± 0.37
Collaborative learning, T = 2 | –                          | 26.32 ± 0.26            | 26.36 ± 0.27

Backpropagation rescaling. Backpropagation rescaling was argued to be necessary for ILR sharing on theoretical grounds in Section 3.3. We confirm this with experiments on the CIFAR-10 dataset. To train a ResNet-32, we use a simple ILR-sharing topology with four heads, with the split point located after the first residual block. The results in Table 3 provide evidence that backpropagation rescaling clearly outperforms the alternatives, no scaling and loss scaling. While no scaling suffers from overly large gradients in the shared layers, loss scaling results in too small a factor for updating the parameters of the independent layers. We suggest backpropagation rescaling for all multi-head learning problems beyond collaborative learning.

Table 3: Impact of backprop rescaling. Four heads based on ResNet-32 share the low layers up to the first residual block. With no scaling, the factor for each head's loss is one. With loss scaling, the factor for each head's loss is 1/4.

Error (%) of ResNet-32 | No scaling 6.04 ± 0.17 | Loss scaling 6.09 ± 0.24 | Backprop rescaling 5.82 ± 0.08

Figure 3: Test error on CIFAR-10 with label noise. Noise level is the percentage of corrupted labels over the whole training set. The noisy labels are randomly regenerated every epoch.

Noisy label robustness. In this experiment, we aim to validate the noisy-label resistance of collaborative learning on the CIFAR-10 dataset with ResNet-32.
Assume that a portion of the labels, whose percentage is called the noise level, are corrupted with a uniform distribution over the label set. The partition of images into corrupted and clean is fixed for all runs; the noisy labels are randomly regenerated every epoch. The results in Figure 3 validate that the test error rates of all collaborative-learning setups are substantially lower than the baseline, and the accuracy gain becomes larger at a considerably higher noise level. This is well expected, since the consensus formed by a group is able to mitigate the effect of noisy labels without knowledge of the noise distribution. Another observation is that 4 heads with hierarchical ILR sharing, which consistently provides the lowest error rate at relatively low noise levels, seems worse at a high noise level. We conjecture that the diversity of predictions matters more than better ILR sharing in this scenario. Collaborative learning provides flexibility to trade off the diversity of predictions from the group against additional supervision and regularization for the common layers.

4.2 ImageNet Dataset

The ILSVRC 2012 classification dataset consists of 1.2 million images for training and 50,000 for validation [6]. We evaluate how collaborative learning helps improve the performance of the ResNet-50 network. Following the notation in [9], we consider two heads sharing ILRs up to the "conv3_x" block for simple ILR sharing. For hierarchical sharing with four heads, the two split points are located after the "conv3_x" and "conv4_x" blocks, respectively. Refer to Section 3 in the Supplementary material for the detailed training setup.
Classification error vs training computing resources (GPU memory consumption and training time). Classification error on ImageNet is particularly important because many state-of-the-art computer vision systems derive image features or architectures from ImageNet classification models.
For instance, a more accurate classifier typically leads to a better object-detection model built on that classifier [12]. Table 4 summarizes the performance of various training-graph patterns with ResNet-50 on ImageNet.

Table 4: Validation errors of ResNet-50 on ImageNet. Label smoothing, distillation, and collaborative learning all leave the inference memory size and running time unchanged.

                                                            | Top-1 error | Top-5 error | Training time | Memory
Individual     | Baseline                                  | 23.47       | 6.83        | 1x            | 1x
learning       | Label smoothing (0.1)                     | 23.34       | 6.80        | 1x            | 1x
Distillation   | From ensemble of two ResNet-50s           | 22.65       | 6.34        | 3.42x         | 1.05x
Collabor-      | 2 instances                               | 22.81       | 6.45        | 2x            | 2x
ative          | 2 heads w/ simple ILR sharing             | 22.70       | 6.37        | 1.4x          | 1.32x
learning       | 4 heads w/ hierarchical ILR sharing       | 22.29       | 6.21        | 1.75x         | 1.5x

Figure 4: Per-layer weight distribution in trained ResNet-50. Following the notation in [9], the two split points in the hierarchical sharing with four heads are located after the "conv3_x" and "conv4_x" blocks, respectively.

As mentioned in Section 3.1, collaborative learning brings some extra training cost since it generates more classifier heads during training, and ILR sharing is designed for training speedup and memory-consumption reduction. We have measured GPU memory consumption and training time and listed them in Table 4. Similar to the CIFAR results, two heads with simple ILR sharing and four heads with hierarchical ILR sharing reduce the validation top-1 error rate significantly in this case, from 23.47% with the baseline to 22.70% and 22.29%, respectively. Note that increasing the training time of individual learning does not improve accuracy [22]. Since the convolution filters are shared across the spatial domain in deep convolutional networks, the memory consumed by storing the intermediate feature maps is much higher than that consumed by the model parameters during training [17].
Therefore, ILR sharing is especially computationally efficient for deep convolutional networks because only one copy of the shared layers is kept. Compared to distillation¹, collaborative learning can achieve a lower error rate with much less training time, in an end-to-end way.
Model weight distribution and mechanisms of ILR sharing. We have plotted the statistical distribution of each layer's weights of the trained ResNet-50 in Figure 4, including the baseline, the distilled version, and the version trained with hierarchical ILR sharing. Refer to Section 5 in the Supplementary material for more results with other training configurations. The first finding is that the weight distribution of the baseline has a very large spike near zero in the bottom layers. We conjecture that the gradients to many weights may vanish to values so small that the weight-decay term takes the dominant effect, eventually causing near-zero "dead" values². Compared to distillation, ILR sharing more effectively helps reduce the number of "dead" weights, thereby improving the accuracy. The second finding is that collaborative learning makes the weight distribution more concentrated around zero overall. Note that we also report per-layer model-weight standard deviations in Table 1 of the Supplementary material to further support this claim. The results indicate that the consensus of multiple views on the same data provides additional regularization.
ILR sharing is somewhat related to the concept of hint training [18], in which a teacher transfers its knowledge to a student network by using not only the teacher's predictions but also an ILR. In collaborative learning, ILR sharing can be regarded as an extreme case in which the ILRs of two separate classifier heads converge to exactly the same one by forcing them to match.

¹Training time of distillation is analyzed in Section 4 of the Supplementary material.
It is reported in [18] that using hints can outperform distillation. To a certain extent, this provides indirect evidence for the possibility of accuracy improvement from ILR sharing.
Again, the two hyperparameters β and T are fixed in all of our experiments. It is possible that more extensive hyperparameter searches may further improve the performance on specific datasets. We evaluate the impact of the hyperparameters β and T and of the split-point locations for ResNet-32 on CIFAR-10 in Section 6 of the Supplementary material.

5 Conclusion

We have proposed a framework of collaborative learning that trains a deep neural network within a group of classifiers generated from the target network. The consensus of multiple views from different classifier heads on the same example provides supplementary information as well as regularization to each classifier, thereby improving generalization. By aggregating the gradient flows from all heads in a balanced way, ILR sharing with backpropagation rescaling not only lowers the training computational cost but also facilitates supervision of the shared layers. Empirical results have also validated the advantages of simultaneous optimization and backpropagation rescaling in group learning. Overall, collaborative learning provides a flexible and powerful end-to-end training approach for deep neural networks to achieve better performance. Collaborative learning also opens up several possibilities for future work. The mechanisms of group collaboration and noisy-label resistance imply that it may be beneficial to semi-supervised learning. Furthermore, other machine learning tasks, such as regression, may take advantage of collaborative learning as well.

Acknowledgement

We would like to thank Qiqi Yan for many helpful discussions.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I.
Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton. Large scale distributed neural network training through online distillation. In International Conference on Learning Representations (ICLR), 2018.

[3] J. Baxter. Learning internal representations. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, COLT '95, pages 311–320. ACM, 1995.

²We ran another experiment in which the weight decay was reduced by half in the first three layers to verify our hypothesis. Refer to Section 5 in the Supplementary material for more details.

[4] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 41–48. Morgan Kaufmann, 1993.

[5] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv, abs/1604.06174, 2016.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[7] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born again neural networks. In International Conference on Machine Learning (ICML), July 2018.

[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[10] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

[11] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[12] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[13] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[14] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations (ICLR), 2017.

[15] L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems (NIPS), 2017.

[16] V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems (NIPS), 2017.

[17] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.

[18] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In International Conference on Learning Representations (ICLR), 2015.

[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[21] Y. Yang and T. Hospedales. Deep multi-task representation learning: A tensor factorisation approach. In International Conference on Learning Representations (ICLR), 2017.

[22] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.

[23] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. arXiv, abs/1706.00384, 2017.