{"title": "Knowledge Distillation by On-the-Fly Native Ensemble", "book": "Advances in Neural Information Processing Systems", "page_first": 7517, "page_last": 7527, "abstract": "Knowledge distillation is effective for training small and generalisable network models that meet low-memory and fast-execution requirements. Existing offline distillation methods rely on a strong pre-trained teacher, which enables favourable knowledge discovery and transfer but requires a complex two-phase training procedure. Online counterparts address this limitation at the price of lacking a high-capacity teacher. In this work, we present an On-the-fly Native Ensemble (ONE) learning strategy for one-stage online distillation. Specifically, ONE only trains a single multi-branch network while simultaneously establishing a strong teacher on-the-fly to enhance the learning of the target network. Extensive evaluations show that ONE improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets: CIFAR10, CIFAR100, SVHN, and ImageNet, whilst offering computational efficiency advantages.", "full_text": "Knowledge Distillation by On-the-Fly Native Ensemble\n\nXu Lan1, Xiatian Zhu2, and Shaogang Gong1\n\n1Queen Mary University of London\n\n2Vision Semantics Limited\n\nAbstract\n\nKnowledge distillation is effective for training small and generalisable network models that meet low-memory and fast-execution requirements. Existing offline distillation methods rely on a strong pre-trained teacher, which enables favourable knowledge discovery and transfer but requires a complex two-phase training procedure. Online counterparts address this limitation at the price of lacking a high-capacity teacher. In this work, we present an On-the-fly Native Ensemble (ONE) learning strategy for one-stage online distillation. 
Specifically, ONE only trains a single multi-branch network while simultaneously establishing a strong teacher on-the-fly to enhance the learning of the target network. Extensive evaluations show that ONE improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets: CIFAR10, CIFAR100, SVHN, and ImageNet, whilst offering computational efficiency advantages.\n\n1 Introduction\n\nDeep neural networks have achieved impressive success in many computer vision tasks [1, 2, 3, 4, 5, 6, 7, 8]. However, the performance advantages often come at the cost of training and deploying resource-intensive networks with large depth and/or width [9, 4, 2]. This leads to the necessity of developing compact yet discriminative models. Knowledge distillation [10] is one general meta-solution among others such as parameter binarisation [11, 12] and filter pruning [13]. The distillation process begins with training a high-capacity teacher model (or an ensemble of models), followed by learning a smaller student model which is encouraged to match the teacher\u2019s predictions [10] or feature representations [14, 15]. While this strategy improves the student model by aligning it with a pre-trained teacher, it requires a longer training process, significant extra computational cost and a large memory footprint (for a heavy teacher) in a more complex multi-phase training procedure. These are commercially unattractive [16].\nTo simplify the above distillation training process, simultaneous distillation algorithms [17, 16] have been developed to perform online knowledge teaching in a one-phase training procedure. Instead of pre-training a static teacher model, these methods simultaneously train a set of (typically two) student models which learn from each other in a peer-teaching manner. 
This approach merges the training processes of the teacher and student models, and uses the peer network to provide the teaching knowledge. Beyond the original understanding of distillation that requires the teacher model to be larger than the student, this online distilling strategy can improve the performance of any-capacity models, leading to a more generically applicable technique. Such a peer-teaching strategy sometimes even outperforms teacher-based offline distillation. The plausible reason is that the large teacher model tends to overfit the training data, therefore providing less extra knowledge beyond the manually labelled annotations [16].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nHowever, the existing online distillation methods have a number of drawbacks: (1) Each peer-student model may only provide limited extra information, resulting in suboptimal distillation; (2) Training multiple students causes a significant increase of computational cost and resource burdens; (3) They require asynchronous model updating, which notoriously demands carefully ordering the operations of label prediction and gradient back-propagation across networks. We consider that all these weaknesses are due to the lack of an appropriate teacher role in the online distillation process.\nIn this work, we propose a novel online knowledge distillation method that is not only more efficient (lower training cost) but also more effective (higher model generalisation improvement) as compared to previous alternative methods. In training, the proposed approach constructs a multi-branch variant of a given target network by adding auxiliary branches, creates a native ensemble teacher model from all branches on-the-fly, and simultaneously learns each branch plus the teacher model subject to the same target label constraints. 
Each branch is trained with two objective loss terms: a conventional softmax cross-entropy loss which matches the ground-truth label distributions, and a distillation loss which aligns to the teacher\u2019s prediction distributions. Compared with creating a set of student networks, a multi-branch single model is more efficient to train whilst achieving superior generalisation performance and avoiding asynchronous model updates. In test, we simply convert the trained multi-branch model back to the original (single-branch) network architecture by removing the auxiliary branches, therefore not increasing test-time cost. In doing so, we derive an On-the-Fly Native Ensemble (ONE) teacher-based simultaneous distillation training approach that not only eliminates the computationally expensive need for pre-training the teacher model in an isolated stage, as in offline counterparts, but also further improves the quality of online distillation. Extensive experiments on four benchmarks (CIFAR10/100, SVHN, and ImageNet) show that the proposed ONE distillation method enables training more generalisable target models in a one-phase process than the alternative strategies of offline learning a larger teacher network or simultaneously distilling peer students, the previous state-of-the-art techniques for training small target models.\n\n2 Related Work\n\nKnowledge Distillation. There are existing attempts at knowledge transfer between varying-capacity network models [18, 10, 14, 15]. Hinton et al. [10] distilled knowledge from a large teacher model to improve a small target net. The rationale is to take advantage of extra supervision provided by the teacher model during the training of the target model, beyond a conventional supervised learning objective such as the cross-entropy loss subject to the training data labels. 
The extra supervision is typically extracted from a pre-trained powerful teacher model in the form of class posterior probabilities [10], feature representations [14, 15], or inter-layer flow (the inner product of feature maps) [19]. Recently, knowledge distillation has been exploited to distil easy-to-train large networks into harder-to-train small networks [15], to transfer knowledge within the same network [20, 21], and to transfer high-level semantics across layers [8]. Earlier distillation methods often take an offline learning strategy, requiring at least two phases of training. The more recently proposed deep mutual learning [17] overcomes this limitation by conducting an online distillation in one-phase training between two peer student models. Anil et al. [16] further extended this idea to accelerate the training of large scale distributed neural networks.\nHowever, the existing online distillation methods lack a strong \u201cteacher\u201d model, which limits the efficacy of knowledge discovery. As with the offline counterpart, multiple networks need to be trained, which is computationally expensive. We overcome both limitations by designing a new online distillation training algorithm characterised by simultaneously learning a teacher on-the-fly and the target net, as well as performing batch-wise knowledge transfer in a one-phase training procedure.\nMulti-branch Architectures. Multi-branch based neural networks have been widely exploited in computer vision tasks [3, 22, 4]. For example, ResNet [4] can be thought of as a category of two-branch networks where one branch is the identity mapping. Recently, \u201cgrouped convolution\u201d [23, 24] has been used as a replacement for standard convolution in constructing multi-branch net architectures. These building blocks are usually utilised as templates to build deeper networks to gain stronger model capacities. 
Despite sharing the multi-branch principle, our ONE method is fundamentally different from such existing methods since our objective is to improve the training quality of any target network, not to propose a new multi-branch building block. In other words, our method is a meta network learning algorithm, independent of the network architecture design.\n\nFigure 1: Overview of online distillation training of ResNet-110 by the proposed On-the-fly Native Ensemble (ONE). With ONE, we start by reconfiguring the target network, adding m auxiliary branches on shared low-level layers. All branches together with the shared layers make individual models, all of which are then used to construct a stronger teacher model. During the mini-batch training process, we employ the teacher to assemble the knowledge of branch models on-the-fly, which is in turn distilled back to all branches to enhance the model learning in a closed-loop form. In test, auxiliary branches are discarded or kept according to the deployment efficiency requirement.\n\n3 Knowledge Distillation by On-the-Fly Native Ensemble\n\nWe formulate an online distillation training method based on a concept of On-the-fly Native Ensemble (ONE). For ease of understanding, we take ResNet-110 [4] as an example. It is straightforward to apply ONE to other network architectures. For model training, we have access to n labelled training samples D = {(x_i, y_i)}_{i=1}^{n}, each belonging to one of C classes y_i \u2208 Y = {1, 2, ..., C}. The network \u03b8 outputs a probabilistic class posterior p(c|x, \u03b8) for a sample x over a class c as:\n\np(c|x, \u03b8) = f_sm(z) = exp(z_c) / \u2211_{j=1}^{C} exp(z_j), c \u2208 Y (1)\n\nwhere f_sm denotes the softmax function and z is the logits (unnormalised log probabilities) output by the network \u03b8. 
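As a concrete numerical companion to the class posterior above and the cross-entropy objective it feeds into, here is a minimal plain-Python sketch (the function names `softmax` and `cross_entropy` are ours for illustration, not from the paper's PyTorch implementation):

```python
import math

def softmax(logits):
    # Eq (1): p(c|x) = exp(z_c) / sum_j exp(z_j)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Eq (2): the Dirac delta keeps only the ground-truth class,
    # so the loss reduces to -log p(y|x).
    return -math.log(softmax(logits)[label])
```

The probabilities sum to one, and the loss shrinks as the logit of the true class grows relative to the others.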
To train a multi-class classification model, we typically adopt the Cross-Entropy (CE) measurement between the predicted and ground-truth label distributions as the objective loss function:\n\nL_ce = \u2212 \u2211_{c=1}^{C} \u03b4_{c,y} log( p(c|x, \u03b8) ) (2)\n\nwhere \u03b4_{c,y} is the Dirac delta which returns 1 if c is the ground-truth label, and 0 otherwise. With the CE loss, the network is trained to predict the correct class label in a principle of maximum likelihood. To further enhance the model generalisation, we concurrently distil extra knowledge from an on-the-fly native ensemble (ONE) teacher to each branch in training.\nOn-the-Fly Native Ensemble. An overview of the ONE architecture is depicted in Fig 1. ONE consists of two components: (1) m auxiliary branches with the same configuration (Res4X block and an individual classifier), each of which serves as an independent classification model with shared low-level stages/layers. This is because low-level features are largely shared across different network instances, and sharing them reduces the training cost. (2) A gating component which learns to ensemble all (m+1) branches to build a stronger teacher model. 
It is constructed by one fully connected (FC) layer followed by batch normalisation, ReLU activation, and softmax, using the same input features as the branches.\nOur ONE method is built on a multi-branch design specifically for model training, with several merits: (1) It enables creating a strong teacher model without training a set of networks at a high computational cost; (2) It introduces a multi-branch simultaneous learning regularisation which benefits model generalisation (Fig 3(a)); (3) It avoids the tedious need for asynchronous updates between multiple networks.\nUnder this network reconfiguration, we add a separate CE loss L_ce^i to each branch, which simultaneously learns to predict the same ground-truth class label of a training sample. While sharing most layers, each branch can be considered as an independent multi-class classifier in that all of them independently learn high-level semantic representations. Consequently, taking the ensemble of all branches (classifiers) can make a stronger teacher model. One common way of ensembling models is to average individual predictions. This ignores the diversity and the varying importance of the member models of an ensemble. We therefore learn to ensemble with a gating component as:\n\nz_e = \u2211_{i=0}^{m} g_i \u00b7 z_i (3)\n\nwhere g_i is the importance score of the i-th branch\u2019s logits z_i, and z_e is the logits of the ONE teacher. In particular, we denote the original branch as i = 0 for indexing convenience. We train the ONE teacher model with the CE loss L_ce^e (Eq (2)), the same as the branches.\nKnowledge Distillation. Given the teacher\u2019s logits of each training sample, we distil this knowledge back into all branches in a closed-loop form. To facilitate knowledge transfer, we compute soft probability distributions at a temperature of T for the individual branches and the ONE teacher as:\n\n\u02dcp_i(c|x, \u03b8_i) = exp(z_i^c / T) / \u2211_{j=1}^{C} exp(z_i^j / T), c \u2208 Y (4)\n\n\u02dcp_e(c|x, \u03b8_e) = exp(z_e^c / T) / \u2211_{j=1}^{C} exp(z_e^j / T), c \u2208 Y (5)\n\nwhere i denotes the branch index, i = 0, ..., m, and \u03b8_i and \u03b8_e are the parameters of the branch and teacher models respectively. Higher values of T lead to more softened distributions.\nTo quantify the alignment between the individual branches and the teacher in their predictions, we use the Kullback-Leibler divergence from the branches to the teacher, written as:\n\nL_kl = \u2211_{i=0}^{m} \u2211_{j=1}^{C} \u02dcp_e(j|x, \u03b8_e) log( \u02dcp_e(j|x, \u03b8_e) / \u02dcp_i(j|x, \u03b8_i) ) (6)\n\nOverall Loss Function. We obtain the overall loss function for online distillation training by the proposed ONE as:\n\nL = \u2211_{i=0}^{m} L_ce^i + L_ce^e + T^2 \u00b7 L_kl (7)\n\nwhere L_ce^i and L_ce^e are the conventional CE loss terms associated with the i-th branch and the ONE teacher, respectively. The gradient magnitudes produced by the soft targets \u02dcp are scaled by 1/T^2, so we multiply the distillation loss term by a factor of T^2 to ensure that the relative contributions of the ground-truth and teacher probability distributions remain roughly unchanged. Note, the ONE objective function is not ensemble learning since (1) its loss terms correspond to models with different roles, and (2) conventional ensemble learning typically trains member models independently.\nModel Training and Deployment. 
The model optimisation and deployment details are summarised in Alg 1. Unlike the two-phase offline distillation training, the target network and the ONE teacher are trained simultaneously and collaboratively, with the knowledge distillation from the teacher to the target being conducted in each mini-batch and throughout the whole training procedure. Since there is one multi-branch network rather than multiple networks, we only need to carry out the same stochastic gradient descent through the (m + 1) branches, and train the whole network until convergence, as in standard single-model incremental batch-wise training.\n\nAlgorithm 1 Knowledge Distillation by On-the-Fly Native Ensemble\n1: Input: Labelled training data D; training epoch number \u03c4; auxiliary branch number m;\n2: Output: Trained target CNN model \u03b8_0, and auxiliary models {\u03b8_i}_{i=1}^{m};\n3: /* Training */\n4: Initialisation: t = 1; randomly initialise {\u03b8_i}_{i=0}^{m};\n5: while t \u2264 \u03c4 do\n6: Compute the predictions of all individual branches {p_i}_{i=0}^{m} (Eq (1));\n7: Compute the teacher logits (Eq (3));\n8: Compute the soft targets of all the branch and teacher models (Eqs (4)-(5));\n9: Distil knowledge from the teacher back to all the branch models (Eq (6));\n10: Compute the final ONE loss function (Eq (7));\n11: Update the model parameters {\u03b8_i}_{i=0}^{m} by a SGD algorithm.\n12: end\n13: /* Testing */\n14: Single model deployment: Use \u03b8_0;\n15: Ensemble deployment (ONE-E): Use {\u03b8_i}_{i=0}^{m}.\n\nThere is no complexity of asynchronously updating among different networks, as is required in deep mutual learning [17].\nOnce the model is trained, we simply remove all the auxiliary branches and obtain the original network architecture for deployment. 
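For a single training sample, the per-step objective computed in lines 6-10 of Alg 1 can be sketched in plain Python. This is a simplified sketch under our own naming (`one_loss`, fixed gate scores); the paper's actual implementation uses PyTorch with a learned gating component and mini-batch SGD:

```python
import math

def softmax(z, T=1.0):
    # temperature-softened softmax (Eqs (4)-(5)); T=1 recovers Eq (1)
    m = max(z)
    e = [math.exp((v - m) / T) for v in z]
    s = sum(e)
    return [x / s for x in e]

def one_loss(branch_logits, gates, label, T=3.0):
    """Sketch of the ONE objective (Eq (7)) for one sample.
    branch_logits: m+1 logit vectors; gates: importance scores g_i."""
    C = len(branch_logits[0])
    # Eq (3): gated ensemble logits of the on-the-fly teacher
    z_e = [sum(g * z[c] for g, z in zip(gates, branch_logits)) for c in range(C)]

    # CE terms (Eq (2)) for every branch and for the teacher
    ce = sum(-math.log(softmax(z)[label]) for z in branch_logits)
    ce += -math.log(softmax(z_e)[label])

    # Eq (6): KL from each branch to the teacher at temperature T
    p_e = softmax(z_e, T)
    kl = 0.0
    for z in branch_logits:
        p_i = softmax(z, T)
        kl += sum(pe * math.log(pe / pi) for pe, pi in zip(p_e, p_i))

    # Eq (7): scale the distillation term by T^2
    return ce + T * T * kl
```

With identical branches and gates summing to one, the KL term vanishes and the loss reduces to the sum of the CE terms, which is a handy sanity check.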
Hence, our ONE method does not increase the test-time cost. However, if the computation budget is less constrained and model performance matters more, we can deploy it as an ensemble model with all the trained branches, denoted as \u201cONE-E\u201d.\n\n4 Experiments\n\nDatasets. We used four multi-class categorisation benchmark datasets in our evaluations (Fig 2). (1) CIFAR10 [25]: A natural images dataset that contains 50,000/10,000 training/test samples drawn from 10 object classes (60,000 images in total). Each class has 6,000 images sized at 32\u00d732 pixels. (2) CIFAR100 [25]: A similar dataset to CIFAR10 that also contains 50,000/10,000 training/test images but covering 100 fine-grained classes. Each class has 600 images. (3) SVHN: The Street View House Numbers (SVHN) dataset consists of 73,257/26,032 standard training/test images and an extra set of 531,131 training images. We used all the training data without data augmentation, as in [26, 27]. (4) ImageNet: The 1,000-class dataset from ILSVRC 2012 [28] provides 1.2 million images for training, and 50,000 for validation.\n\nFigure 2: Example images from (a) CIFAR, (b) SVHN, and (c) ImageNet.\n\nPerformance Metrics. We adopted the common top-n (n=1, 5) classification error rate. To measure the computational cost of model training and test, we used the criterion of floating point operations (FLOPs). For any network trained by ONE, we reported the average performance of all branch outputs with standard deviation.\nExperiment Setup. We implemented all networks and model training procedures in PyTorch. For all datasets, we adopted the same experimental settings as [29, 23] for fair comparisons. We used SGD with Nesterov momentum and set the momentum to 0.9. We deployed a standard learning rate schedule that drops from 0.1 to 0.01 at 50% of training and to 0.001 at 75%. For the training budget, we set 300/40/90 epochs for CIFAR/SVHN/ImageNet, respectively. 
We adopted a 3-branch ONE (m = 2) design unless stated otherwise. We separated the last block of each backbone net from the parameter sharing (except on ImageNet, where we separated the last 2 blocks to give more learning capacity to the branches) without extra structural optimisation (see ResNet-110 for example in Fig 1). Following [10], we set T = 3 in all the experiments. Cross-validation of this parameter T may give better performance, but at the cost of extra model tuning.\n\nMethod | CIFAR10 | CIFAR100 | SVHN | Params\nResNet-32 [4] | 6.93 | 31.18 | 2.11 | 0.5M\nResNet-32 + ONE | 5.99\u00b10.05 | 26.61\u00b10.06 | 1.83\u00b10.05 | 0.5M\nResNet-110 [4] | 5.56 | 25.33 | 2.00 | 1.7M\nResNet-110 + ONE | 5.17\u00b10.07 | 21.62\u00b10.26 | 1.76\u00b10.07 | 1.7M\nResNeXt-29(8\u00d764d) [23] | 3.69 | 17.77 | 1.83 | 34.4M\nResNeXt-29(8\u00d764d) + ONE | 3.45\u00b10.04 | 16.07\u00b10.08 | 1.70\u00b10.03 | 34.4M\nDenseNet-BC(L=190, k=40) [30] | 3.32 | 17.53 | 1.73 | 25.6M\nDenseNet-BC(L=190, k=40) + ONE | 3.13\u00b10.07 | 16.35\u00b10.05 | 1.63\u00b10.05 | 25.6M\n\nTable 1: Evaluation of our ONE method on CIFAR and SVHN. Metric: Error rate (%).\n\n4.1 Evaluation of On-the-Fly Native Ensemble\n\nResults on CIFAR and SVHN. Table 1 compares the top-1 error rate performances of four varying-capacity state-of-the-art network models trained by the conventional and our ONE learning algorithms. We have these observations: (1) All the different networks benefit from the ONE training algorithm, with small models in particular achieving larger performance gains. 
This suggests a generic superiority of our method for online knowledge distillation from the on-the-fly teacher to the target student model. (2) All individual branches have similar performances, indicating that they have reached sufficient agreement and exchanged their respective knowledge well through the proposed ONE teacher model during training.\n\nMethod | Top-1 | Top-5\nResNet-18 [4] | 30.48 | 10.98\nResNet-18 + ONE | 29.45\u00b10.23 | 10.41\u00b10.12\nResNeXt-50 [23] | 22.62 | 6.29\nResNeXt-50 + ONE | 21.85\u00b10.07 | 5.90\u00b10.05\nSeNet-ResNet-18 [31] | 29.85 | 10.72\nSeNet-ResNet-18 + ONE | 29.02\u00b10.17 | 10.13\u00b10.12\n\nTable 2: Evaluation of our ONE method on ImageNet. Metric: Error rate (%).\n\nResults on ImageNet. Table 2 shows the comparative performances on the 1,000-class ImageNet. It is shown that the proposed ONE learning algorithm again yields more effective training and more generalisable models in comparison to the vanilla SGD. This indicates that our method is generically applicable in large scale image classification settings.\n\nTarget Network | ResNet-32 | ResNet-110\nMetric | Error (%) | TrCost | TeCost | Error (%) | TrCost | TeCost\nKD [10] | 28.83 | 6.43 | 1.38 | N/A | N/A | N/A\nDML [17] | 29.03\u00b10.22* | 2.76 | 1.38 | 24.10\u00b10.72 | 10.10 | 5.05\nONE | 26.61\u00b10.06 | 2.28 | 1.38 | 21.62\u00b10.26 | 8.29 | 5.05\n\nTable 3: Comparison with knowledge distillation methods on CIFAR100. \u201c*\u201d: Reported results. TrCost/TeCost: Training/test cost, in units of 10^8 FLOPs. 
Red/Blue: Best and second best results.\n\nNetwork | ResNet-32 | ResNet-110\nMetric | Error (%) | TrCost | TeCost | Error (%) | TrCost | TeCost\nSnapshot Ensemble [32] | 27.12 | 1.38 | 6.90 | 23.09* | 5.05 | 25.25\n2-Net Ensemble | 26.75 | 2.76 | 2.76 | 22.47 | 10.10 | 10.10\n3-Net Ensemble | 25.14 | 4.14 | 4.14 | 21.25 | 15.15 | 15.15\nONE-E | 24.63 | 2.28 | 2.28 | 21.03 | 8.29 | 8.29\nONE | 26.61 | 2.28 | 1.38 | 21.62 | 8.29 | 5.05\n\nTable 4: Comparison with ensembling methods on CIFAR100. \u201c*\u201d: Reported results. TrCost/TeCost: Training/test cost, in units of 10^8 FLOPs. Red/Blue: Best and second best results.\n\n4.2 Comparison with Distillation Methods\n\nWe compared our ONE method with two representative distillation methods: Knowledge Distillation (KD) [10] and Deep Mutual Learning (DML) [17]. For the offline competitor KD, we used a large network ResNet-110 as the teacher and a small network ResNet-32 as the student. For the online methods DML and ONE, we evaluated their performances using either ResNet-32 or ResNet-110 as the target student model. We observed from Table 3 that: (1) ONE outperforms both KD (offline) and DML (online) distillation methods in error rate, validating the performance advantages of our method over alternative algorithms when applied to different CNN models. (2) ONE takes the least model training cost and the same test cost as the others, therefore giving the most cost-effective solution.\n\n4.3 Comparison with Ensembling Methods\n\nTable 4 compares the performances of our multi-branch (3 branches) based model ONE-E and standard ensembling methods. It is shown that ONE-E not only yields the best test error but also enables the most efficient deployment with the lowest test cost. These advantages are achieved at the second lowest training cost. 
Whilst Snapshot Ensemble takes the least training cost, its generalisation capability is unsatisfactory and it has the notable drawback of a much higher deployment cost.\nIt is worth noting that ONE (without branch ensemble) already comprehensively outperforms a 2-Net Ensemble in terms of error rate, training and test cost. Compared with a 3-Net Ensemble, ONE approaches its generalisation capability whilst having larger model training and test efficiency advantages.\n\nConfiguration | Full | W/O Online Distillation | W/O Sharing Layers | W/O Gating\nONE | 21.62\u00b10.26 | 24.73\u00b10.20 | 22.45\u00b10.52 | 22.26\u00b10.23\nONE-E | 21.03 | 21.84 | 20.57 | 21.79\n\nTable 5: Model component analysis on CIFAR100. Network: ResNet-110.\n\n4.4 Model Component Analysis\n\nTable 5 shows the benefits of the individual ONE components on CIFAR100 using ResNet-110. We have these observations: (1) Without online distillation (Eq (6)), the target network suffers a performance drop of 3.11% (24.73-21.62) in test error rate. This performance drop validates the efficacy and quality of the ONE teacher in terms of its performance superiority over the individual branch models. This can be seen more clearly in Fig 3: the ONE teacher fits the training data better and generalises better to the test data. Due to the closed-loop design, the ONE teacher also mutually benefits from distillation, reducing its error rate from 21.84% to 21.03%. With distillation, the target model effectively approaches the ONE teacher (Fig 3(a) vs 3(b)) on both training and test error performance, indicating the success of teacher knowledge transfer. Interestingly, even without distillation, ONE still achieves better generalisation than the vanilla algorithm. This suggests that our multi-branch design brings a positive regularisation effect by concurrently and jointly learning the shared low-level layers subject to more diverse high-level representation knowledge. 
(2) Not sharing the low-level layers not only increases the training cost (an 83% increase), but also leads to weaker performance (a 0.83% error rate increase). The plausible reason is a lack of the multi-branch regularisation effect as indicated in Fig 3(a). (3) Using an average ensemble of branches without gating (Eq (3)) causes a performance decrease of 0.64% (22.26-21.62). This suggests the benefit of adaptively exploiting the branch diversity in forming the ONE teacher.\n\nBranch # | 1 | 2 | 3 | 4 | 5\nError (%) | 31.18 | 27.38 | 26.68 | 26.58 | 26.52\n\nTable 6: Benefit of adding branches to ONE on CIFAR100. Network: ResNet-32.\n\nThe main experiments use 3 branches in ONE. Table 6 shows that ONE scales well with more branches: the ResNet-32 model generalisation improves on CIFAR100 as branches are added during training, hence its performance advantage over the independently trained network (31.18% error rate).\n\n(a) ONE without online distillation\n\n(b) Full ONE model\n\nFigure 3: Effect of online distillation. Network: ResNet-110.\n\n4.5 Model Generalisation Analysis\n\nWe aim to give insights on why an ONE-trained network yields a better generalisation capability. A few previous studies [33, 34] demonstrate that the width of a local optimum is related to model generalisation. A general understanding is that the surfaces of training and test error largely mirror each other, and it is favourable to converge the models to broader optima in training. As such, a trained model remains approximately optimal even under small perturbations at test time. Next, we exploited this criterion to examine the quality of the model solutions \u03b8_v, \u03b8_m, \u03b8_o discovered by the vanilla, DML and ONE training algorithms respectively. 
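The width-of-optimum probe used in this criterion (perturb a solution along random unit-length directions with growing magnitude and re-measure its error) can be sketched as follows; `robustness_curve`, `unit_direction` and `loss_fn` are our illustrative names, not from the paper:

```python
import math
import random

def unit_direction(dim, rng):
    # a random direction vector v normalised to unit length
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def perturb(theta, d, v):
    # theta_*(d, v) = theta_* + d * v  (Sec 4.5)
    return [t + d * x for t, x in zip(theta, v)]

def robustness_curve(theta, loss_fn, magnitudes, n_dirs=5, seed=0):
    """For each magnitude d, average the loss over n_dirs random directions."""
    rng = random.Random(seed)
    dirs = [unit_direction(len(theta), rng) for _ in range(n_dirs)]
    return [sum(loss_fn(perturb(theta, d, v)) for v in dirs) / n_dirs
            for d in magnitudes]
```

On a quadratic bowl loss the perturbed value grows as d^2, so a flatter (wider) optimum produces a curve that rises more slowly, which is exactly how the robustness curves are read.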
This analysis was conducted on CIFAR100 using ResNet-110.\nSpecifically, to test the width of a local optimum, we added small perturbations to the solutions as \u03b8_*(d, v) = \u03b8_* + d \u00b7 v, * \u2208 {v, m, o}, where v is a uniformly distributed direction vector with unit length, and d \u2208 [0, 5] controls the change magnitude. At each magnitude scale, we further randomly sampled 5 different direction vectors to disturb the solutions. We then tested the robustness of all the perturbed models in training and test error rates. The training error was quantified as the cross-entropy measurement between the predicted and ground-truth label distributions.\nWe observed in Fig 4 that: (1) The robustness of each solution against parameter perturbation appears to indicate the width of the local optima as: \u03b8_v < \u03b8_m < \u03b8_o. That is, ONE seems to find the widest local minimum among the three, and is therefore more likely to generalise better than the others.\n(2) Compared with DML, vanilla and ONE found deeper local optima with lower training errors. This indicates that DML may get stuck in training, sacrificing the vanilla\u2019s capability of exploring more generalisable solutions in exchange for the ability to identify wider optima. In contrast, our method further improves the capability of identifying wider minima over DML whilst maintaining the original exploring quality.\n\n(a) Robustness on training data\n\n(b) Robustness on test data\n\nFigure 4: Robustness test of ResNet-110 solutions found by ONE, DML, and vanilla training algorithms on CIFAR100. 
Each curve corresponds to a specific perturbation direction v.\n\n4.6 Variance Analysis on ONE\u2019s Branches\n\nWe analysed the variance of ONE\u2019s branches over the training epochs in comparison to the conventional ensemble method. We used ResNet-32 as the base net and tested on CIFAR100. We quantified the model variance by the average prediction differences on training samples between every two models/branches in Euclidean space. Figure 5 shows that a 3-Net Ensemble involves larger inter-model variances than ONE with 3 branches throughout the training process. This means that the branches of ONE have higher correlations, due to the proposed learning constraint from the distillation loss that enforces them to align to the same teacher prediction, which probably hurts the ensemble performance. However, in mean generalisation capability (another fundamental aspect in ensemble learning), ONE\u2019s branches (average error rate 26.61\u00b10.06%) are much superior to the individual models of a conventional ensemble (31.07\u00b10.41%), leading to a stronger ensembling performance.\n\nFigure 5: Model variance during training.\n\n5 Conclusion\n\nIn this work, we presented a novel On-the-fly Native Ensemble (ONE) strategy for improving deep network learning through online knowledge distillation in a one-stage training procedure. With ONE, we can more discriminatively learn both small and large networks at less computational cost, beyond the conventional offline alternatives that are typically formulated to learn better small models alone. Our method is also superior to existing online counterparts due to its unique capability of constructing a high-capacity online teacher to more effectively mine knowledge from the training data and supervise the target network concurrently. 
Extensive experiments on four image classification benchmarks show that a variety of deep networks all benefit from the ONE approach. Significantly, smaller networks obtain larger performance gains, making our method especially well suited to low-memory and fast-execution scenarios.

Acknowledgements

This work was partly supported by the China Scholarship Council, Vision Semantics Limited, the Royal Society Newton Advanced Fellowship Programme (NA150459), and the Innovate UK Industrial Challenge Project on Developing and Commercialising Intelligent Video Analytics Solutions for Public Safety (98111-571149).

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv e-print, 2015.

[3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[5] Ross Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, 2015.

[6] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[7] Wei Li, Xiatian Zhu, and Shaogang Gong. Person re-identification by deep joint learning of multi-loss classification. 
In International Joint Conference on Artificial Intelligence, 2017.

[8] Xu Lan, Xiatian Zhu, and Shaogang Gong. Person search by multi-scale matching. In European Conference on Computer Vision, 2018.

[9] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv e-print, 2016.

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv e-print, 2015.

[11] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.

[12] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.

[13] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.

[14] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, 2014.

[15] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv e-print, 2014.

[16] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. Large scale distributed neural network training through online distillation. In International Conference on Learning Representations, 2018.

[17] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. arXiv e-print, 2017.

[18] Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD, 2006.

[19] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. 
A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[20] Xu Lan, Xiatian Zhu, and Shaogang Gong. Self-referenced deep learning. In Asian Conference on Computer Vision, 2018.

[21] Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. arXiv e-print, 2018.

[22] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[23] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[24] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger. CondenseNet: An efficient densenet using learned group convolutions. arXiv e-print, 2017.

[25] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.

[26] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv e-print, 2016.

[27] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.

[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[29] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. 
In European Conference on Computer Vision, 2016.

[30] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[31] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv e-print, 2017.

[32] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get M for free. In International Conference on Learning Representations, 2017.

[33] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv e-print, 2016.

[34] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv e-print, 2016.