{"title": "Learning Efficient Object Detection Models with Knowledge Distillation", "book": "Advances in Neural Information Processing Systems", "page_first": 742, "page_last": 751, "abstract": "Despite significant accuracy improvement in convolutional neural networks (CNN) based object detectors, they often require prohibitive runtimes to process an image for real-time applications. State-of-the-art models often use very deep networks with a large number of floating point operations. Efforts such as model compression learn compact models with fewer number of parameters, but with much reduced accuracy. In this work, we propose a new framework to learn compact and fast ob- ject detection networks with improved accuracy using knowledge distillation [20] and hint learning [34]. Although knowledge distillation has demonstrated excellent improvements for simpler classification setups, the complexity of detection poses new challenges in the form of regression, region proposals and less voluminous la- bels. We address this through several innovations such as a weighted cross-entropy loss to address class imbalance, a teacher bounded loss to handle the regression component and adaptation layers to better learn from intermediate teacher distribu- tions. We conduct comprehensive empirical evaluation with different distillation configurations over multiple datasets including PASCAL, KITTI, ILSVRC and MS-COCO. Our results show consistent improvement in accuracy-speed trade-offs for modern multi-class detection models.", "full_text": "Learning Ef\ufb01cient Object Detection Models with\n\nKnowledge Distillation\n\nGuobin Chen1,2 Wongun Choi1 Xiang Yu1 Tony Han2 Manmohan Chandraker1,3\n3University of California, San Diego\n\n2University of Missouri\n\n1NEC Labs America\n\nAbstract\n\nDespite signi\ufb01cant accuracy improvement in convolutional neural networks (CNN)\nbased object detectors, they often require prohibitive runtimes to process an image\nfor real-time applications. 
State-of-the-art models often use very deep networks\nwith a large number of \ufb02oating point operations. Efforts such as model compression\nlearn compact models with fewer parameters, but with much reduced\naccuracy. In this work, we propose a new framework to learn compact and fast\nobject detection networks with improved accuracy using knowledge distillation [20]\nand hint learning [34]. Although knowledge distillation has demonstrated excellent\nimprovements for simpler classi\ufb01cation setups, the complexity of detection poses\nnew challenges in the form of regression, region proposals and less voluminous\nlabels. We address this through several innovations such as a weighted cross-entropy\nloss to address class imbalance, a teacher bounded loss to handle the regression\ncomponent and adaptation layers to better learn from intermediate teacher\ndistributions. We conduct comprehensive empirical evaluation with different distillation\ncon\ufb01gurations over multiple datasets including PASCAL, KITTI, ILSVRC and\nMS-COCO. Our results show consistent improvement in accuracy-speed trade-offs\nfor modern multi-class detection models.\n\n1\n\nIntroduction\n\nRecent years have seen a tremendous increase in the accuracy of object detection, relying on deep\nconvolutional neural networks (CNNs). This has made visual object detection an attractive possibility\nfor domains ranging from surveillance to autonomous driving. However, speed is a key requirement in\nmany applications, which fundamentally contends with demands on accuracy. Thus, while advances\nin object detection have relied on increasingly deeper architectures, they are associated with an\nincrease in computational expense at runtime. But it is also known that deep neural networks are\nover-parameterized to aid generalization. 
Thus, to achieve faster speeds, some prior works explore\nnew structures such as fully convolutional networks, or lightweight models with fewer channels and\nsmall \ufb01lters [22, 25]. While impressive speedups are obtained, they are still far from real-time, with\ncareful redesign and tuning necessary for further improvements.\nDeeper networks tend to have better performance under proper training, since they have ample\nnetwork capacity. Tasks such as object detection for a few categories might not necessarily need\nthat model capacity. In that direction, several works in image classi\ufb01cation use model compression,\nwhereby weights in each layer are decomposed, followed by layer-wise reconstruction or \ufb01ne-tuning\nto recover some of the accuracy [9, 26, 41, 42]. This results in signi\ufb01cant speed-ups, but there is\noften a gap between the accuracies of original and compressed models, which is especially large\nwhen using compressed models for more complex problems such as object detection. On the other\nhand, seminal works on knowledge distillation show that a shallow or compressed model trained\nto mimic the behavior of a deeper or more complex model can recover some or all of the accuracy\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fdrop [3, 20, 34]. However, those results are shown only for problems such as classi\ufb01cation, using\nsimpler networks without strong regularization such as dropout.\nApplying distillation techniques to multi-class object detection, in contrast to image classi\ufb01cation, is\nchallenging for several reasons. 
First, the performance of detection models suffers more degradation\nwith compression, since detection labels are more expensive and thereby usually less voluminous.\nSecond, knowledge distillation is proposed for classi\ufb01cation assuming each class is equally important,\nwhereas that is not the case for detection where the background class is far more prevalent. Third,\ndetection is a more complex task that combines elements of both classi\ufb01cation and bounding box\nregression. Finally, an added challenge is that we focus on transferring knowledge within the same\ndomain (images of the same dataset) with no additional data or labels, as opposed to other works that\nmight rely on data from other domains (such as high-quality and low-quality image domains, or\nimage and depth domains).\nTo address the above challenges, we propose a method to train fast models for object detection with\nknowledge distillation. Our contributions are four-fold:\n\u2022 We propose an end-to-end trainable framework for learning compact multi-class object detection\nmodels through knowledge distillation (Section 3.1). To the best of our knowledge, this is the \ufb01rst\nsuccessful demonstration of knowledge distillation for the multi-class object detection problem.\n\u2022 We propose new losses that effectively address the aforementioned challenges. 
In particular, we\npropose a weighted cross entropy loss for classi\ufb01cation that accounts for the imbalance in the\nimpact of misclassi\ufb01cation for background class as opposed to object classes (Section 3.2), a\nteacher bounded regression loss for knowledge distillation (Section 3.3) and adaptation layers for\nhint learning that allows the student to better learn from the distribution of neurons in intermediate\nlayers of the teacher (Section 3.4).\n\u2022 We perform comprehensive empirical evaluation using multiple large-scale public benchmarks.\nOur study demonstrates the positive impact of each of the above novel design choices, resulting in\nsigni\ufb01cant improvement in object detection accuracy using compressed fast networks, consistently\nacross all benchmarks (Sections 4.1 \u2013 4.3).\n\u2022 We present insights into the behavior of our framework by relating it to the generalization and\n\nunder-\ufb01tting problems (Section 4.4).\n\n2 Related Works\n\nCNNs for Detection. Deformable Part Model (DPM) [14] was the dominant detection framework\nbefore the widespread use of Convolutional Neural Networks (CNNs). Following the success of\nCNNs in image classi\ufb01cation [27], Girshick et al. proposed RCNN [24] that uses CNN features\nto replace handcrafted ones. Subsequently, many CNN based object detection methods have been\nproposed, such as Spatial Pyramid Pooling (SPP) [19], Fast R-CNN [13], Faster-RCNN [32] and\nR-FCN [29], that unify various steps in object detection into an end-to-end multi-category framework.\nModel Compression. CNNs are expensive in terms of computation and memory. Very deep networks\nwith many convolutional layers are preferred for accuracy, while shallower networks are also widely\nused where ef\ufb01ciency is important. Model compression in deep networks is a viable approach to\nspeed up runtime while preserving accuracy. Denil et al. 
[9] demonstrate that neural networks are\noften over-parametrized and removing redundancy is possible. Subsequently, various methods [5, 7,\n10, 15, 17, 30] have been proposed to accelerate the fully connected layer. Several methods based on\nlow-rank decomposition of the convolutional kernel tensor [10, 23, 28] are also proposed to speed\nup convolutional layers. To compress the whole network, Zhang et al. [41, 42] present an algorithm\nusing asymmetric decomposition and additional \ufb01ne-tuning. In similar spirit, Kim et al. [26] propose\none-shot whole network compression that achieves around 1.8 times improvement in runtime without\nsigni\ufb01cant drop in accuracy. We will use methods presented in [26] in our experiments. In addition, a\npruning based approach has been proposed [18], but it is challenging to achieve runtime speed-up\nwith a conventional GPU implementation. Additionally, both weights and input activations can be\nquantized [18] and binarized [21, 31] to lower the computational expense.\nKnowledge Distillation. Knowledge distillation is another approach to retain accuracy with model\ncompression. Bucila et al. [3] propose an algorithm to train a single neural network by mimicking\nthe output of an ensemble of models. Ba and Caruana [2] adopt the idea of [3] to compress deep\n\nFigure 1: The proposed learning scheme for visual object detection using Faster-RCNN, which mainly\nconsists of a region proposal network (RPN) and a region classi\ufb01cation network (RCN). Both networks\nuse a multi-task loss to jointly learn the classi\ufb01er and bounding-box regressor. We employ the \ufb01nal output of the\nteacher\u2019s RPN and RCN as the distillation targets, and apply the intermediate layer outputs as hints. Red arrows\nindicate the backpropagation pathways.\n\nnetworks into shallower but wider ones, where the compressed model mimics the \u2018logits\u2019. Hinton\net al. 
[20] propose knowledge distillation as a more general case of [3], which uses the predictions\nof the teacher model as \u2018soft labels\u2019, further proposing a temperature cross entropy loss instead of the L2\nloss. Romero et al. [34] introduce a two-stage strategy to train deep networks. In their method, the\nteacher\u2019s middle layer provides \u2018hints\u2019 to guide the training of the student model.\nOther researchers [16, 38] explore distillation for transferring knowledge between different domains,\nsuch as high-quality and low-quality images, or RGB and depth images. In a draft manuscript\nconcurrent with our work, Shen et al. [36] consider the effect of distillation and hint frameworks\nin learning a compact object detection model. However, they formulate the detection problem as\na binary classi\ufb01cation task applied to pedestrians, which might not scale well to the more general\nmulti-category object detection setup. Unlike theirs, our method is designed for multi-category object\ndetection. Further, while they use external region proposals, we demonstrate distillation and hint\nlearning for both the region proposal and classi\ufb01cation components of a modern end-to-end object\ndetection framework [32].\n\n3 Method\n\nIn this work, we adopt Faster-RCNN [32] as the object detection framework. Faster-RCNN\nis composed of three modules: 1) a shared feature extractor built from convolutional layers, 2) a\nregion proposal network (RPN) that generates object proposals, and 3) a classi\ufb01cation and regression\nnetwork (RCN) that returns the detection score as well as a spatial adjustment vector for each object\nproposal. Both the RPN and RCN use the output of 1) as features; the RCN also takes the result of the RPN\nas input. 
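The dataflow among these three modules can be sketched as follows (a minimal pure-Python sketch of the pipeline wiring only; `backbone`, `rpn` and `rcn` are hypothetical stand-in callables, not the actual Faster-RCNN implementation):

```python
def faster_rcnn_forward(image, backbone, rpn, rcn):
    """Wiring of the three Faster-RCNN modules described above.

    backbone, rpn and rcn are stand-in callables for 1) shared feature
    extraction, 2) the region proposal network, and 3) the classification /
    regression network. Illustrative only.
    """
    features = backbone(image)             # 1) shared convolutional features
    proposals = rpn(features)              # 2) RPN proposals from those features
    detections = rcn(features, proposals)  # 3) RCN scores + box adjustments
    return detections
```

Both the RPN and RCN consume the shared features, and the RCN additionally consumes the RPN's proposals, mirroring the description above.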
In order to achieve highly accurate object detection results, it is critical to learn strong\nmodels for all the three components.\n\n3.1 Overall Structure\n\nWe learn strong but ef\ufb01cient student object detectors by using the knowledge of a high capacity teacher\ndetection network for all the three components. Our overall learning framework is illustrated in Figure\n1. First, we adopt hint based learning [34] (Sec.3.4) that encourages the feature representation of\nthe student network to be similar to that of the teacher network. Second, we learn stronger classi\ufb01cation\nmodules in both RPN and RCN using the knowledge distillation framework [3, 20]. In order to handle\nthe severe category imbalance issue in object detection, we apply a weighted cross entropy loss for the\ndistillation framework. Finally, we transfer the teacher\u2019s regression output as a form of upper bound,\nthat is, if the student\u2019s regression output is better than that of the teacher, no additional loss is applied.\nOur overall learning objective can be written as follows:\n\nL_RCN = (1/N) \u2211_i L_cls^RCN + \u03bb (1/N) \u2211_j L_reg^RCN\n\nL_RPN = (1/M) \u2211_i L_cls^RPN + \u03bb (1/M) \u2211_j L_reg^RPN\n\nL = L_RPN + L_RCN + \u03b3 L_Hint\n\n(1)\nwhere N is the batch-size for RCN and M for RPN. Here, L_cls denotes the classi\ufb01er loss function\nthat combines the hard softmax loss using the ground truth labels and the soft knowledge distillation\nloss [20] of (2). Further, L_reg is the bounding box regression loss that combines the smoothed L1 loss [13]\nand our newly proposed teacher bounded L2 regression loss of (4). 
Finally, L_Hint denotes the hint\nbased loss function that encourages the student to mimic the teacher\u2019s feature response, expressed as\n(6). In the above, \u03bb and \u03b3 are hyper-parameters to control the balance between different losses. We\n\ufb01x them to be 1 and 0.5, respectively, throughout the experiments.\n\n3.2 Knowledge Distillation for Classi\ufb01cation with Imbalanced Classes\n\nConventional use of knowledge distillation has been proposed for training classi\ufb01cation networks,\nwhere predictions of a teacher network are used to guide the training of a student model. Suppose we\nhave a dataset {x_i, y_i}, i = 1, 2, ..., n where x_i \u2208 I is the input image and y_i \u2208 Y is its class label.\nLet t be the teacher model, with P_t = softmax(Z_t / T) its prediction and Z_t the \ufb01nal score output. Here,\nT is a temperature parameter (normally set to 1). Similarly, one can de\ufb01ne P_s = softmax(Z_s / T) for the\nstudent network s. The student s is trained to optimize the following loss function:\n\nL_cls = \u00b5 L_hard(P_s, y) + (1 \u2212 \u00b5) L_soft(P_s, P_t)\n\n(2)\nwhere L_hard is the hard loss using ground truth labels used by Faster-RCNN, L_soft is the soft loss\nusing the teacher\u2019s prediction and \u00b5 is the parameter to balance the hard and soft losses. It is known that\na deep teacher can better \ufb01t the training data and perform better in test scenarios. The soft labels\ncontain information about the relationship between different classes as discovered by the teacher. By\nlearning from soft labels, the student network inherits such hidden information.\nIn [20], both hard and soft losses are cross entropy losses. But unlike simpler classi\ufb01cation\nproblems, the detection problem needs to deal with a severe imbalance across different categories, that\nis, the background dominates. In image classi\ufb01cation, the only possible errors are misclassi\ufb01cations\nbetween \u2018foreground\u2019 categories. 
In object detection, however, failing to discriminate between\nbackground and foreground can dominate the error, while misclassi\ufb01cations\nbetween foreground categories are relatively rare. To address this, we adopt a class-weighted cross\nentropy as the distillation loss:\n\nL_soft(P_s, P_t) = \u2212\u2211_c w_c P_t(c) log P_s(c)\n\n(3)\n\nwhere we use a larger weight for the background class and a relatively small weight for the other classes.\nFor example, we use w_0 = 1.5 for the background class and w_i = 1 for all the others in experiments\non the PASCAL dataset.\nWhen P_t is very similar to the hard label, with the probability for one class very close to 1 and most\nothers very close to 0, the temperature parameter T is introduced to soften the output. Using a higher\ntemperature will force t to produce softer labels so that the classes with near-zero probabilities will\nnot be ignored by the cost function. This is especially pertinent to simpler tasks, such as classi\ufb01cation\non small datasets like MNIST. But for harder problems where the prediction error is already high, a\nlarger value of T introduces more noise which is detrimental to learning. Thus, lower values of T are\nused in [20] for classi\ufb01cation on larger datasets. For even harder problems such as object detection,\nwe \ufb01nd using no temperature parameter at all (equivalent to T = 1) in the distillation loss works the\nbest in practice (see supplementary material for an empirical study).\n\n3.3 Knowledge Distillation for Regression with Teacher Bounds\n\nIn addition to the classi\ufb01cation layer, most modern CNN based object detectors [26, 29, 32, 33] also\nuse bounding-box regression to adjust the location and size of the input proposals. Often, learning a\ngood regression model is critical to ensure good object detection accuracy [13]. 
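The classi\ufb01cation loss of Eqs. (2)-(3) can be sketched in pure Python for a single proposal (an illustrative sketch only: the logits, the weight vector w and the default mu = 0.5 below are toy values we chose, and a real implementation would operate on batched tensors):

```python
import math

def softmax(z, T=1.0):
    # Temperature-scaled softmax: softmax(Z / T); T = 1 recovers the plain softmax.
    e = [math.exp(v / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def soft_loss(p_s, p_t, w):
    # Class-weighted soft cross entropy of Eq. (3): -sum_c w_c * P_t(c) * log P_s(c).
    return -sum(wc * pt * math.log(ps) for wc, pt, ps in zip(w, p_t, p_s))

def cls_loss(student_logits, teacher_logits, y, w, mu=0.5, T=1.0):
    # Eq. (2): mu * hard cross entropy on ground truth + (1 - mu) * soft loss.
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    hard = -math.log(p_s[y])  # hard loss with the ground-truth label y
    return mu * hard + (1.0 - mu) * soft_loss(p_s, p_t, w)
```

With w = [1.5, 1.0, ...] (index 0 being background, weighted higher as in the PASCAL setting above), background errors in the teacher's soft prediction contribute more to the loss.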
Unlike distillation\nfor discrete categories, the teacher\u2019s regression outputs can provide very wrong guidance to the\nstudent model, since the real valued regression outputs are unbounded. In addition, the teacher may\nprovide a regression direction that is contradictory to the ground truth direction. Thus, instead of using\nthe teacher\u2019s regression output directly as a target, we exploit it as an upper bound for the student to\nachieve. The student\u2019s regression vector should be as close to the ground truth label as possible in\ngeneral, but once the quality of the student surpasses that of the teacher by a certain margin, we\ndo not provide additional loss for the student. We call this the teacher bounded regression loss, L_b,\nwhich is used to formulate the regression loss, L_reg, as follows:\n\nL_b(R_s, R_t, y) = ||R_s \u2212 y||_2^2  if ||R_s \u2212 y||_2^2 + m > ||R_t \u2212 y||_2^2, and 0 otherwise\n\nL_reg = L_sL1(R_s, y_reg) + \u03bd L_b(R_s, R_t, y_reg),\n\n(4)\nwhere m is a margin, y_reg denotes the regression ground truth label, R_s is the regression output of\nthe student network, R_t is the prediction of the teacher network and \u03bd is a weight parameter (set as 0.5 in\nour experiments). Here, L_sL1 is the smooth L1 loss as in [13]. The teacher bounded regression loss\nL_b only penalizes the network when the error of the student is larger than that of the teacher. Note\nthat although we use the L2 loss inside L_b, any other regression loss such as L1 or smoothed L1 can be\ncombined with L_b. Our combined loss encourages the student to be close to or better than the teacher in\nterms of regression, but does not push the student too much once it reaches the teacher\u2019s performance.\n\n3.4 Hint Learning with Feature Adaptation\n\nDistillation transfers knowledge using only the \ufb01nal output. In [34], Romero et al. 
demonstrate that\nusing the intermediate representation of the teacher as a hint can help the training process and improve\nthe \ufb01nal performance of the student. They use the L2 distance between feature vectors V and Z:\n\nL_Hint(V, Z) = ||V \u2212 Z||_2^2\n\n(5)\nwhere Z represents the intermediate layer we selected as the hint in the teacher network and V represents\nthe output of the guided layer in the student network. We also evaluate the L1 loss:\n\nL_Hint(V, Z) = ||V \u2212 Z||_1\n\n(6)\n\nWhile applying hint learning, it is required that the number of neurons (channels, width and height)\nshould be the same between corresponding layers in the teacher and student. In order to match the\nnumber of channels in the hint and guided layers, we add an adaptation layer after the guided layer whose\noutput size is the same as the hint layer. The adaptation layer matches the scale of neurons to make\nthe norm of features in the student close to the teacher\u2019s. A fully connected layer is used as the adaptation layer\nwhen both hint and guided layers are fully connected layers. When the hint and guided layers are\nconvolutional layers, we use 1 \u00d7 1 convolutions to save memory. Interestingly, we \ufb01nd that having\nan adaptation layer is important to achieve effective knowledge transfer even when the number\nof channels in the hint and guided layers is the same (see Sec. 4.3). The adaptation layer can also\nmatch the difference when the norms of features in hint and guided layers are different. When the\nhint or guided layer is convolutional and the resolution of hint and guided layers differs (for example,\nVGG16 and AlexNet), we follow the padding trick introduced in [16] to match the number of outputs.\n\n4 Experiments\n\nIn this section, we \ufb01rst introduce the teacher and student CNN models and datasets that are used in the\nexperiments. The overall results on various datasets are shown in Sec.4.1. 
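The teacher bounded term of Eq. (4) and the hint loss of Eq. (5) can be sketched as follows (a pure-Python sketch over plain lists; the function names and the default margin m = 0 are our own choices, and the smooth L1 term of L_reg is omitted for brevity):

```python
def l2_sq(a, b):
    # Squared L2 distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bounded_reg_loss(r_s, r_t, y, m=0.0):
    # Teacher bounded term L_b of Eq. (4): penalize the student only when its
    # squared error (plus margin m) exceeds the teacher's squared error.
    if l2_sq(r_s, y) + m > l2_sq(r_t, y):
        return l2_sq(r_s, y)
    return 0.0

def hint_loss_l2(v, z):
    # Hint loss of Eq. (5): squared L2 distance between the student's guided
    # feature v (after the adaptation layer) and the teacher's hint feature z.
    return l2_sq(v, z)
```

Once the student's box error drops below the teacher's, `bounded_reg_loss` returns 0 and the smooth L1 term against the ground truth is the only remaining regression signal.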
We apply our methods to\nsmaller networks and lower quality inputs in Sec.4.2. Sec.4.3 describes ablation studies for three\ndifferent components, namely classi\ufb01cation/regression, distillation and hint learning. Insights obtained\nfor distillation and hint learning are discussed in Sec.4.4. We refer the readers to supplementary\nmaterial for further details.\nDatasets We evaluate our method on several commonly used public detection datasets, namely,\nKITTI [12], PASCAL VOC 2007 [11], MS COCO [6] and the ImageNet DET benchmark (ILSVRC\n2014) [35]. Among them, KITTI and PASCAL are relatively small datasets that contain fewer object\ncategories and labeled images, whereas MS COCO and ILSVRC 2014 are large scale datasets.\nSince KITTI and ILSVRC 2014 do not provide ground-truth annotations for test sets, we use the\ntraining/validation splits introduced by [39] and [24] for analysis. For all the datasets, we follow the\nPASCAL VOC convention to evaluate various models by reporting mean average precision (mAP) at\nIoU = 0.5. For the MS COCO dataset, besides the PASCAL VOC metric, we also report its own metric,\nwhich evaluates mAP averaged for IoU \u2208 [0.5 : 0.05 : 0.95] (denoted as mAP[.5, .95]).\n\nStudent (Model Info) | Teacher | PASCAL | KITTI | ILSVRC | COCO@.5 | COCO@[.5,.95]\nTucker (11M / 47ms) | - | 54.7 | 49.3 | 20.6 | 25.4 | 11.8\nTucker | AlexNet | 57.6 (+2.9) | 51.4 (+2.1) | 23.6 (+1.3) | 26.5 (+1.2) | 12.3 (+0.5)\nTucker | VGGM | 58.2 (+3.5) | 51.4 (+2.1) | 23.9 (+1.6) | 26.4 (+1.1) | 12.2 (+0.4)\nTucker | VGG16 | 59.4 (+4.7) | 53.7 (+4.4) | 24.4 (+2.1) | 28.3 (+2.9) | 12.6 (+0.8)\nAlexNet (62M / 74ms) | - | 57.2 | 55.1 | 27.3 | 32.5 | 15.8\nAlexNet | VGGM | 59.2 (+2.0) | 56.3 (+1.2) | 28.7 (+1.4) | 33.4 (+0.9) | 16.0 (+0.2)\nAlexNet | VGG16 | 60.1 (+2.9) | 58.3 (+3.2) | 30.1 (+2.8) | 35.8 (+3.3) | 16.9 (+1.1)\nVGGM (80M / 86ms) | - | 59.8 | 56.7 | 31.1 | 33.6 | 16.1\nVGGM | VGG16 | 63.7 (+3.9) | 58.6 (+2.3) | 34.0 (+2.9) | 37.2 (+3.6) | 17.3 (+1.2)\nVGG16 (138M / 283ms) | - | 70.4 | 59.2 | 35.6 | 45.1 | 24.2\n\nTable 1: Comparison of student models associated with different teacher models across four datasets, in terms\nof mean Average Precision (mAP). Rows with a blank (-) teacher indicate the model is without distillation, serving\nas baselines. The Model Info column reports the number of parameters and speed (per image, on GPU).\n\nModels The teacher and student models de\ufb01ned in our experiments are standard CNN architectures,\nwhich consist of regular convolutional layers, fully connected layers, ReLU, dropout layers and\nsoftmax layers. We choose several popular CNN architectures as our teacher/student models, namely,\nAlexNet [27], AlexNet with Tucker Decomposition [26], VGG16 [37] and VGGM [4]. We use two\ndifferent settings for the student and teacher pairs. In the \ufb01rst set of experiments, we use a smaller\nnetwork (that is, fewer parameters) as the student and use a larger one for the teacher (for example,\nAlexNet as student and VGG16 as teacher). 
In the second set of experiments, we use a smaller input\nimage size for the student model and a larger input image size for the teacher, while keeping the\nnetwork architecture the same.\n\n4.1 Overall Performance\n\nTable 1 shows mAP for four student models on four object detection datasets, with different\narchitectures for teacher guidance. For student models without a teacher\u2019s supervision, we train them to\nthe best numbers we could achieve. Not surprisingly, larger or deeper models with more parameters\nperform better than smaller or shallower models, while smaller models run faster than larger ones.\nThe performance of student models improves signi\ufb01cantly with distillation and hint learning over all\ndifferent pairs and datasets, despite architectural differences between teacher and student. For a\nstudent model of \ufb01xed scale (number of parameters), training from scratch or \ufb01ne-tuning on its own\nis not an optimal choice. Getting aid from a better teacher yields larger improvements approaching\nthe teacher\u2019s performance. A deeper model as teacher leads to better student performance, which\nsuggests that the knowledge transferred from better teachers is more informative. Notice that the\nTucker model trained with VGG16 achieves signi\ufb01cantly higher accuracy than AlexNet on the\nPASCAL dataset, even though the model size is about 5 times smaller. This observation may support\nthe hypothesis that CNN based object detectors are highly over-parameterized. On the contrary,\nwhen the dataset is much larger, it becomes much harder to outperform more complicated\nmodels. This suggests that it is worth having even higher capacity models for such large scale datasets.\nIn terms of ef\ufb01ciency, the AlexNet student is about 3 times faster than its VGG16 teacher on the\nKITTI dataset. 
For more detailed runtimes, please refer to the supplementary material.\nFurther, similar to [38], we investigate another student-teacher mode: the student and teacher share\nexactly the same network structure, while the input for the student is down-scaled and the input for the\nteacher remains high resolution. Recent works [1] report that image resolution critically affects\nobject detection performance. On the other hand, downsampling the input size quadratically reduces\nconvolutional resources and speeds up computation. In Table 2, by scaling input sizes to half on the\nPASCAL VOC dataset for the student and using the original resolution for the teacher, we get almost\nthe same accuracy as the high-resolution teacher while being about two times faster.1\n\nModel | High-res teacher (mAP, speed) | Low-res baseline (mAP, speed) | Low-res distilled student (mAP, speed)\nAlexNet | 57.2, 1,205 / 74 ms | 53.2, 726 / 47 ms | 56.7 (+3.5), 726 / 47 ms\nTucker | 54.7, 663 / 41 ms | 48.6, 430 / 29 ms | 53.5 (+4.9), 430 / 29 ms\n\nTable 2: Comparison of high-resolution teacher models (trained on images with 688 pixels) and low-resolution\nstudent models (trained on 344 pixels input), on PASCAL. We report mAP and speed (CPU / GPU, in ms)\nof different models. The low-resolution models are about 2 times faster than the corresponding\nhigh-resolution models, while achieving almost the same accuracy when our distillation method is used.\n\nFLOPS (%) | 20 | 25 | 30 | 37.5 | 45\nFinetune | 30.3 | 49.3 | 51.4 | 54.7 | 55.2\nDistillation | 35.5 (+5.2) | 55.4 (+6.1) | 56.8 (+5.4) | 59.4 (+4.7) | 59.5 (+4.3)\n\nTable 3: Compressed AlexNet performance evaluated on PASCAL. We compare the model \ufb01ne-tuned with the\nground truth and the model trained with our full method. We vary the compression ratio by FLOPS.\n\n4.2 Speed-Accuracy Trade-off in Compressed Models\n\nIt is feasible to select CNN models from a wide range of candidates to strike a balance between\nspeed and accuracy. 
However, off-the-shelf CNN models still may not meet one\u2019s computational\nrequirements. Designing new models is one option, but it often requires signi\ufb01cant labor towards\ndesign and training. More importantly, trained models are often designed for speci\ufb01c tasks, and speed\nand accuracy trade-offs may change when facing a different task, whereby one may as well train a\nnew model for the new task. In all such situations, distillation becomes an attractive option.\nTo understand the speed-accuracy trade-off in object detection with knowledge distillation, we vary the\ncompression ratio (the ranks of weight matrices) of AlexNet with Tucker decomposition. We measure\nthe compression ratio using the FLOPS of the CNN. Experiments in Table 3 show that the accuracy drops\ndramatically when the network is compressed too much; for example, when the compressed size is 20%\nof the original, accuracy drops from 57.2% to only 30.3%. However, for the squeezed networks, our\ndistillation framework is able to recover large amounts of the accuracy drop. For instance, at 37.5%\ncompression, the original squeezed net only achieves 54.7%. In contrast, our proposed method lifts\nit up to 59.4% with a deep teacher (VGG16), which is even better than the uncompressed AlexNet\nmodel (57.2%).\n\n4.3 Ablation Study\n\nAs shown in Table 4, we compare different strategies for distillation and hint learning to highlight the\neffectiveness of our proposed novel losses. We choose VGG16 as the teacher model and Tucker as\nour student model for all the experiments in this section. Other choices re\ufb02ect similar trends. Recall\nthat proposal classi\ufb01cation and bounding box regression are the two main tasks in the Faster-RCNN\nframework. 
Traditionally, classi\ufb01cation is associated with cross entropy loss, denoted as CLS in\nTable 4, while bounding box regression is regularized with L2 loss, denoted as L2.\nTo prevent the classes with small probability being ignored by the objective function, soft label with\nhigh temperature, also named weighted cross entropy loss, is proposed for the proposal classi\ufb01cation\ntask in Sec.3.2. We compare the weighted cross entropy loss de\ufb01ned in (3), denoted as CLS-W in\nTable 4, with the standard cross entropy loss (CLS), to achieve slightly better performance on both\nPASCAL and KITTI datasets.\nFor bounding box regression, directly parroting to teacher\u2019s output will suffer from labeling noise. An\nimprovement is proposed through (4) in Sec.3.3, where the teacher\u2019s prediction is used as a boundary\nto guide the student. Such strategy, denoted as L2-B in Table 4, improves over L2 by 1.3%. Note that\na 1% improvement in object detection task is considered very signi\ufb01cant, especially on large-scale\ndatasets with voluminous number of images.\n\n1Ideally, the convolutional layers should be about 4 times faster. However, due to the loading overhead and\n\nthe non-proportional consumption from other layers, this speed up drops to around 2 times faster.\n\n7\n\n\fBaseline L2 L2-B CLS CLS-W Hints Hints-A L2-B+CLS-W L2-B+CLS-W+Hints-A\n\nPASCAL\nKITTI\n\n54.7\n49.3\n\n54.6 55.9 57.4\n48.5 50.1 50.8\n\n57.7\n51.3\n\n56.9\n50.3\n\n58\n52.1\n\n58.4\n51.7\n\n59.4\n53.7\n\nTable 4: The proposed method component comparison, i.e., bounded L2 for regression (L2-B, Sec.3.3) and\nweighted cross entropy for classi\ufb01cation (CLS-W, Sec.3.2) with respect to traditional methods, namely, L2\nand cross entropy (CLS). Hints learning w/o adaptation layer (Hints-A and Hints) are also compared. 
All comparisons take VGG16 as the teacher and Tucker as the student, with evaluations on PASCAL and KITTI.

                   Baseline  Distillation  Hint  Distillation + Hint
PASCAL  Trainval   79.6      78.3          80.9  83.5
        Test       54.7      58.4          58.0  59.4
COCO    Train      45.3      45.4          47.1  49.6
        Val        25.4      26.1          27.8  28.3

Table 5: Performance of distillation and hint learning on different datasets with the Tucker and VGG16 pair.

Moreover, we find that the adaptation layer proposed in Sec. 3.4 is critical for hint learning. Even when teacher and student layers have the same number of neurons, they are almost never in the same feature space; if they were, a student whose subsequent structure matched the teacher's would achieve results identical to the teacher. Thus, directly matching a student layer to a teacher layer [2, 3] is unlikely to perform well. Instead, we add an adaptation layer that maps the student layer's feature space to that of the corresponding teacher layer. Penalizing the distance between student and teacher features is then well-defined, since they lie in the same space, which the results in Table 4 support. With the adaptation layer, hint learning (Hints-A) shows a 1.1% advantage over hint learning without it (Hints). Our full method (L2-B+CLS-W+Hints-A) outperforms the variant without adaptive hint learning (L2-B+CLS-W) by 1.0%, which again indicates the significant advantage of hint learning with adaptation.

4.4 Discussion

In this section, we provide further insights into distillation and hint learning. Table 5 compares the accuracy of the Tucker model learned with VGG16 on the trainval and test splits of the PASCAL and COCO datasets.
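The adaptation-layer hint loss described above can be sketched as follows. This is an illustrative reconstruction rather than the paper's code: the adaptation layer is modeled as a learned channel-mixing matrix (equivalent to a 1x1 convolution), and the hint penalty is a plain L2 distance computed in the teacher's feature space.

```python
import numpy as np

def adaptation_hint_loss(student_feat, teacher_feat, W_adapt):
    """L2 hint loss after an adaptation layer.

    student_feat: (N, C_s, H, W) intermediate student feature map
    teacher_feat: (N, C_t, H, W) corresponding teacher feature map
    W_adapt:      (C_t, C_s) learned channel-mixing matrix, i.e. a 1x1
                  convolution that maps student features into the
                  teacher's feature space before matching.
    """
    # Project student channels into the teacher's feature space: (N, C_t, H, W).
    adapted = np.einsum('ts,nshw->nthw', W_adapt, student_feat)
    # Penalize the distance in the shared (teacher) feature space.
    return 0.5 * np.mean((adapted - teacher_feat) ** 2)
```

In training, `W_adapt` would be learned jointly with the student by backpropagating this loss, so the penalty is only well-defined once both features live in the same space, which is the point of the adaptation layer.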
In general, distillation mostly improves the generalization capability of the student, while hint learning helps improve both training and testing accuracy.
Distillation improves generalization: Similar to the image classification case discussed in [20], structural relationships also exist among the labels in object detection. For example, 'Car' shares more visual characteristics with 'Truck' than with 'Person'. Such structural information is not available in the ground truth annotations. Thus, injecting relational information learned by a high-capacity teacher model into a student helps the generalization capability of the detection model. In Table 5, applying distillation alone yields a consistent improvement in testing accuracy.
Hint helps both learning and generalization: We notice that under-fitting is a common problem in object detection even with CNN-based models (see the low training accuracy of the baselines). Unlike simple classification settings, where (near) perfect training accuracy is easy to achieve [40], the training accuracy of detectors is still far from perfect; the learning algorithm appears to suffer from the saddle point problem [8]. In contrast, hints may provide effective guidance around this problem by supervising an intermediate layer directly. As a result, the model trained with hint learning achieves noticeable improvement in both training and testing accuracy.
Finally, combining distillation and hint learning improves both training and test accuracy significantly compared to the baseline. Table 5 empirically verifies consistent trends on both the PASCAL and MS COCO object detection datasets.
We believe our methods can also be extended to other tasks that face similar generalization or under-fitting problems.

5 Conclusion

We propose a novel framework for learning compact and fast CNN-based object detectors with knowledge distillation. Highly complex detector models are used as teachers to guide the learning of efficient student models. Combining knowledge distillation and the hint framework with our newly proposed loss functions, we demonstrate consistent improvements over various experimental setups. Notably, the compact models trained with our framework execute significantly faster than the teachers with almost no accuracy compromise on the PASCAL dataset. Our empirical analysis also reveals an under-fitting issue in object detector learning, which could provide useful insight for further advances in the field.

Acknowledgments This work was conducted as part of Guobin Chen's internship at NEC Labs America in Cupertino.

References
[1] K. Ashraf, B. Wu, F. N. Iandola, M. W. Moskewicz, and K. Keutzer. Shallow networks for high-accuracy road object-detection. CoRR, abs/1606.01561, 2016.
[2] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654-2662, 2014.
[3] C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535-541. ACM, 2006.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[5] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015.
[6] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P.
Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
[7] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pages 2857-2865, 2015.
[8] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933-2941, 2014.
[9] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148-2156, 2013.
[10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269-1277, 2014.
[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[12] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354-3361. IEEE, 2012.
[13] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[14] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5, 2012.
[15] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[16] S. Gupta, J. Hoffman, and J.
Malik. Cross modal distillation for supervision transfer. arXiv preprint arXiv:1507.00448, 2015.
[17] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. arXiv preprint arXiv:1602.01528, 2016.
[18] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346-361. Springer, 2014.
[20] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[21] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29, pages 4107-4115. Curran Associates, Inc., 2016.
[22] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.
[23] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[24] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[25] K. Kim, Y. Cheon, S. Hong, B. Roh, and M. Park. PVANET: Deep but lightweight neural networks for real-time object detection. CoRR, abs/1608.08021, 2016.
[26] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[29] Y. Li, K. He, J. Sun, et al. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379-387, 2016.
[30] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442-450, 2015.
[31] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542. Springer, 2016.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[33] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[34] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[36] J. Shen, N.
Vesdapunt, V. N. Boddeti, and K. M. Kitani. In teacher we trust: Learning compressed models for pedestrian detection. arXiv preprint arXiv:1612.00478, 2016.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] J.-C. Su and S. Maji. Cross quality distillation. arXiv preprint arXiv:1604.00433, 2016.
[39] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3D voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1903-1911, 2015.
[40] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[41] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. 2015.
[42] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984-1992, 2015.