{"title": "Paraphrasing Complex Network: Network Compression via Factor Transfer", "book": "Advances in Neural Information Processing Systems", "page_first": 2760, "page_last": 2769, "abstract": "Many researchers have sought ways of model compression to reduce the size of a deep neural network (DNN) with minimal performance degradation in order to use DNNs in embedded systems. Among the model compression methods, a method called knowledge transfer is to train a student network with a stronger teacher network. In this paper, we propose a novel knowledge transfer method which uses convolutional operations to paraphrase teacher's knowledge and to translate it for the student. This is done by two convolutional modules, which are called a paraphraser and a translator. The paraphraser is trained in an unsupervised manner to extract the teacher factors which are defined as paraphrased information of the teacher network. The translator located at the student network extracts the student factors and helps to translate the teacher factors by mimicking them. We observed that our student network trained with the proposed factor transfer method outperforms the ones trained with conventional knowledge transfer methods.", "full_text": "Paraphrasing Complex Network: Network\n\nCompression via Factor Transfer\n\nJangho Kim\n\nSeongUk Park\n\nNojun Kwak\n\nSeoul National University\n\nSeoul National University\n\nSeoul National University\n\nSeoul, Korea\n\nkjh91@snu.ac.kr\n\nSeoul, Korea\n\nswpark0703@snu.ac.kr\n\nSeoul, Korea\n\nnojunk@snu.ac.kr\n\nAbstract\n\nMany researchers have sought ways of model compression to reduce the size of\na deep neural network (DNN) with minimal performance degradation in order\nto use DNNs in embedded systems. Among the model compression methods, a\nmethod called knowledge transfer is to train a student network with a stronger\nteacher network. 
In this paper, we propose a novel knowledge transfer method which uses convolutional operations to paraphrase the teacher's knowledge and to translate it for the student. This is done by two convolutional modules, called a paraphraser and a translator. The paraphraser is trained in an unsupervised manner to extract the teacher factors, which are defined as paraphrased information of the teacher network. The translator, located at the student network, extracts the student factors and helps to translate the teacher factors by mimicking them. We observed that our student network trained with the proposed factor transfer method outperforms the ones trained with conventional knowledge transfer methods.

1 Introduction

In recent years, deep neural networks (DNNs) have shown remarkable capabilities in various computer vision and pattern recognition tasks such as image classification, object detection, localization and segmentation. Although many researchers have studied DNNs for application in various fields, high-performance DNNs generally require a vast amount of computational power and storage, which makes them difficult to use in embedded systems that have limited resources. Given the size of the devices typically deployed, tremendous GPU computation is generally unavailable in real-world applications.

To deal with this problem, many researchers have studied DNN structures that make networks smaller and more efficient so that they are applicable to embedded systems. These studies can be roughly classified into four categories: 1) network pruning, 2) network quantization, 3) building efficient small networks, and 4) knowledge transfer. First, network pruning reduces network complexity by pruning the redundant and non-informative weights of a pretrained model [26, 17, 7].
Second, network quantization compresses a pretrained model by reducing the number of bits used to represent its weight parameters [20, 27]. Third, Iandola et al. [13] and Howard et al. [11] proposed efficient small network models which fit into restricted resources. Finally, knowledge transfer (KT) methods transfer a large model's information to a smaller network [22, 30, 10].

Among the four approaches, in this paper we focus on the last one, knowledge transfer. Previous studies such as attention transfer (AT) [30] and knowledge distillation (KD) [10] have achieved meaningful results in the field of knowledge transfer; their loss functions can be collectively summarized as the difference between the attention maps or the softened distributions of the teacher and the student networks. These methods directly transfer the teacher network's softened distribution [10] or its attention map [30] to the student network, inducing the student to mimic the teacher.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Overview of the factor transfer. In the teacher network, feature maps are transformed to the 'teacher factors' by a paraphraser. The number of feature maps of the teacher network (m) is resized to the number of feature maps of the teacher factors (m × k) by a paraphrase rate k. The feature maps of the student network are also transformed to the 'student factors', with the same dimension as that of the teacher factors, using a translator. The factor transfer (FT) loss is used to minimize the difference between the teacher and the student factors while training the translator that generates the student factors. Factors are drawn in blue.
Note that the paraphraser is already trained in an unsupervised manner with a reconstruction loss before the factor transfer.

While these methods provide fairly good performance improvements, directly transferring the teacher's outputs overlooks the inherent differences between the teacher network and the student network, such as the network structure, the number of channels, and the initial conditions. Therefore, we need to re-interpret the output of the teacher network to resolve these differences. From the perspective of a teacher and a student, we asked whether simply providing the teacher's knowledge directly, without any explanation, might be insufficient for teaching the student. In other words, when teaching a child, the teacher should not use his/her own terms, because the child cannot understand them. If, on the other hand, the teacher translates those terms into simpler ones, the child will understand much more easily.

In this respect, we sought ways for the teacher network to deliver more understandable information to the student network, so that the student comprehends that information more easily. To address this problem, we propose a novel knowledge transfer method that leads both the student and teacher networks to make transportable features, which we call 'factors' in this paper. Contrary to conventional methods, our method does not simply compare the output values of the networks directly, but trains neural networks that can extract good factors and then matches these factors. The neural network that extracts factors from the teacher network is called a paraphraser, while the one that extracts factors from the student network is called a translator. We trained the paraphraser in an unsupervised way, expecting it to extract knowledge different from what can be obtained with a supervised loss term.
At the student side, we trained the student network with the translator to assimilate the factors extracted from the paraphraser. The overview of our proposed method is provided in Figure 1. With various experiments, we succeeded in training student networks to perform better than networks of the same architecture trained by conventional knowledge transfer methods.

Our contributions can be summarized as follows:
• We propose the usage of a paraphraser as a means of extracting meaningful features (factors) in an unsupervised manner.
• We propose a convolutional translator on the student side that learns the factors of the teacher network.
• We experimentally show that our approach effectively enhances the performance of the student network.

2 Related Works

A wide variety of methods have been studied to use conventional networks more efficiently. In network pruning and quantization approaches, Srinivas et al. [26] proposed a data-free pruning method to remove redundant neurons. Han et al. [7] removed redundant connections and then used Huffman coding to quantize the weights. Gupta et al. [6] reduced floating-point operations and memory usage by using a 16-bit fixed-point representation. There are also many studies that directly train convolutional neural networks (CNNs) using binary weights [3, 4, 20].

(a) Knowledge Distillation (b) Attention Transfer (c) Factor Transfer (Proposed)

Figure 2: The structure of (a) KD [10], (b) AT [30] and (c) the proposed method FT. Unlike KD and AT, our method does not directly compare the softened distribution (KD) or the attention map (AT), which is defined as the sum of the feature maps, of the teacher and the student networks. Instead, we extract factors from both the teacher and the student and minimize the difference between them.
However, network pruning methods require many iterations to converge, and the pruning threshold is manually set according to the targeted amount of accuracy degradation. Furthermore, the accuracy of binary-weight networks is very poor, especially for large CNNs. There are also many ways to directly design efficient small networks, such as SqueezeNet [13], MobileNet [11] and CondenseNet [12], which show a vast reduction in the number of parameters compared to the original networks at the cost of some accuracy. In addition, there are methods that design a network using a reinforcement learning algorithm, such as MetaQNN [2] and Neural Architecture Search [33]; with reinforcement learning, the network itself searches for an efficient structure without human assistance. However, these methods focus only on performance without considering the number of parameters, and they take a large amount of GPU memory and time to learn.

Another approach is 'knowledge transfer', a method of training a student network with a stronger teacher network. Knowledge distillation (KD) [10] is the early work on knowledge transfer for deep neural networks. The main idea of KD is to shift knowledge from a teacher network to a student network by learning the class distribution via a softened softmax. The student network can capture not only the information provided by the true labels, but also the information from the teacher. Yim et al. [28] defined the flow of solution procedure (FSP) matrix, calculated as the Gram matrix of feature maps from two layers, in order to transfer knowledge. In FitNet [22], the student network is designed to be thinner and deeper than the teacher network, and hints from the teacher network improve the performance of the student network by making it learn the intermediate representations of the teacher. FitNet attempts to mimic the intermediate activation maps directly from the teacher network.
However, this can be problematic since there are significant capacity differences between the teacher and the student. Attention transfer (AT) [30], in contrast to FitNet, trains a less deep student network such that it mimics the attention maps of the teacher network, which are summations of the activation maps along the channel dimension. Therefore, an attention map for a layer has only the spatial dimensions. Figure 2 visually shows the differences among KD [10], AT [30] and the proposed method, factor transfer (FT). Unlike the other methods, our method does not directly compare the teacher and student networks' softened distributions or attention maps.

As shown in Figure 1, our paraphraser is similar to the convolutional autoencoder [18] in that it is trained in an unsupervised manner using a reconstruction loss and convolution layers. Hinton et al. [9] proved that autoencoders produce compact representations of images that contain enough information for reconstructing the original images. In [16], a stacked autoencoder achieved great results on the MNIST dataset with a greedy layer-wise approach. Many studies show that autoencoder models can learn meaningful, abstract features and thus achieve better classification results on high-dimensional data such as images [19, 24]. The architecture of our paraphraser differs from convolutional autoencoders in that its convolution layers do not downsample the spatial dimension of the input, since the paraphraser takes the already sufficiently downsampled feature maps of the teacher network as input.

3 Proposed Method

It is said that if one fully understands a thing, one should be able to explain it by oneself. Correspondingly, if the student network can be trained to replicate the extracted information, this implies that the student network is well informed of that knowledge.
In this section, we define the output of the paraphraser's middle layer as the 'teacher factors' of the teacher network; for the student network, we use a translator made up of several convolution layers to generate 'student factors', which are trained to replicate the teacher factors, as shown in Figure 1. With these modules, our knowledge transfer process consists of two main steps: 1) in the first step, the paraphraser is trained by a reconstruction loss, and the teacher factors are then extracted from the teacher network by the paraphraser; 2) in the second step, these teacher factors are transferred to the student factors so that the student network learns from them.

3.1 Teacher Factor Extraction with Paraphraser

ResNet architectures [8] consist of stacked residual blocks, and [30] calls each stack of residual blocks a 'group'. In this paper, we will likewise denote each stack of convolutional layers as a 'group'. Yosinski et al. [29] verified that lower-layer features are more general while higher-layer features have greater specificity. Since the teacher network and the student network focus on the same task, we extract factors from the feature maps of the last group, as can be seen in Figure 1, because the last layer of a trained network must contain enough information for the task.

In order to extract the factors from the teacher network, we train the paraphraser in an unsupervised way by applying a reconstruction loss between the input feature maps x and the output feature maps P(x) of the paraphraser. The unsupervised training encourages the factors to be more meaningful, extracting a different kind of knowledge from what can be obtained with the supervised cross-entropy loss function.
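As a rough illustration of this first stage, the reconstruction objective can be sketched in NumPy. This is a minimal sketch under assumed shapes: the 1x1 channel-mixing "paraphraser" below is an illustrative stand-in for the paper's convolutional paraphraser, with untrained placeholder weights.

```python
import numpy as np

def paraphrase_channels(m, k):
    """Number of factor channels when a teacher with m feature maps
    is paraphrased at rate k (e.g. m=64, k=0.5 -> 32 channels)."""
    return int(m * k)

def reconstruction_loss(x, encoder, decoder):
    """L_rec = ||x - P(x)||^2 with P = decoder(encoder(.)).
    x: feature maps of shape (channels, height, width)."""
    factors = encoder(x)      # teacher factors F_T, shape (m*k, H, W)
    recon = decoder(factors)  # P(x), back to shape (m, H, W)
    return float(np.sum((x - recon) ** 2))

# Toy 1x1-convolution paraphraser: channel mixing only, the spatial
# size is kept (the paraphraser does not downsample spatially).
rng = np.random.default_rng(0)
m, k, H, W = 8, 0.5, 4, 4
f = paraphrase_channels(m, k)                 # 4 factor channels
W_enc = 0.1 * rng.standard_normal((f, m))     # placeholder weights
W_dec = 0.1 * rng.standard_normal((m, f))
encoder = lambda x: np.einsum('fm,mhw->fhw', W_enc, x)
decoder = lambda z: np.einsum('mf,fhw->mhw', W_dec, z)

x = rng.standard_normal((m, H, W))            # fake last-group features
loss = reconstruction_loss(x, encoder, decoder)
```

In the paper this loss is minimized over the paraphraser's convolution and transposed-convolution weights; here the weights are fixed random placeholders, so only the shapes and the loss computation carry over.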
This approach can also be found in EBGAN [32], which uses an autoencoder as the discriminator to give the generator a different kind of knowledge than a binary output provides.

The paraphraser uses several convolution layers to produce the teacher factors F_T, which are further processed by a number of transposed convolution layers in the training phase. Most convolutional autoencoders are designed to downsample the spatial dimension in order to increase the receptive field. On the contrary, the paraphraser maintains the spatial dimension while adjusting the number of factor channels, because it uses the feature maps of the last group, which already have a sufficiently reduced spatial dimension. If the teacher network produces m feature maps, we resize the number of factor channels to m × k. We refer to the hyperparameter k as the paraphrase rate.

To extract the teacher factors, an adequately trained paraphraser is needed. The reconstruction loss function used for training the paraphraser is quite simple:

L_rec = ||x - P(x)||^2,    (1)

where the paraphraser network P(·) takes x as input. After training the paraphraser, it can extract the task-specific features (teacher factors), as can be seen in the supplementary material.

3.2 Factor Transfer with Translator

Once the teacher network has extracted the factors, which are the paraphrased teacher's knowledge, the student network should be able to absorb and digest them in its own way. In this paper, we name this procedure 'factor transfer'. As depicted in Figure 1, while training the student network, we insert the translator right after the last group of the student's convolutional layers. The translator is trained jointly with the student network so that the student network can learn the paraphrased information from the teacher network.
Here, the translator plays the role of a buffer that relieves the student network from the burden of directly learning the output of the teacher network, by rephrasing the feature maps of the student network.

The student network is trained with the translator using the sum of two loss terms, i.e. the classification loss and the factor transfer loss:

L_student = L_cls + β L_FT,    (2)

L_cls = C(S(I_x), y),    (3)

L_FT = || F_T / ||F_T||_2 - F_S / ||F_S||_2 ||_p.    (4)

With (4), the student's translator is trained to output student factors that mimic the teacher factors. Here, F_T and F_S denote the teacher and the student factors, respectively. We set the dimension of F_S to be the same as that of F_T. We also apply an l2 normalization on the factors, as in [30]. In this paper, the performance using the l1 loss (p = 1) is reported; the performance difference between the l1 (p = 1) and l2 (p = 2) losses is minor (see the supplementary material), so we consistently used the l1 loss for all experiments.

In addition to the factor transfer loss (4), the conventional classification loss (3) is also used to train the student network, as in (2). Here, β is a weight parameter and C(S(I_x), y) denotes the cross entropy between the ground-truth label y and the softmax output S(I_x) of the student network for an input image I_x, a commonly used term for classification tasks.

The translator takes the output features of the student network and, through (2), sends the gradient back to the student network, which lets the student network absorb and digest the teacher's knowledge in its own way. Note that unlike the training of the teacher's paraphraser, the student network and its translator are trained simultaneously in an end-to-end manner.

4 Experiments

In this section, we evaluate the proposed FT method on several datasets. First, we verify the effectiveness of FT through experiments with the CIFAR-10 [14] and CIFAR-100 [15] datasets, both of which are basic image classification datasets, because many works on the knowledge transfer problem used CIFAR in their base experiments [22, 30]. Then, we evaluate our method on the ImageNet LSVRC 2015 [23] dataset. Finally, we apply our method to object detection with the PASCAL VOC 2007 [5] dataset.

To verify our method, we compare the proposed FT with several knowledge transfer methods such as KD [10] and AT [30]. There are several important hyperparameters that need to be kept consistent. For KD, we fix the temperature of the softened softmax to 4 as in [10], and for the β of AT, we set it to 10^3 following [30]. In all experiments, AT used multiple group losses. Like AT, the β of FT is set to 10^3 for ImageNet and PASCAL VOC 2007; however, we set it to 5 × 10^2 for CIFAR-10 and CIFAR-100 because a large β hinders convergence.

We conduct experiments for different k values from 0.5 to 4. To show the effectiveness of the proposed paraphraser architecture, we also used two convolutional autoencoders as paraphrasers, since the autoencoder is well known for extracting good features that contain compressed information for reconstruction. One is an undercomplete convolutional autoencoder (CAE); the other is an overcomplete regularized autoencoder (RAE), which imposes an l1 penalty on the factors to learn the needed size of the factors by itself [1].
Details of these autoencoders and the overall implementation of the experiments are explained in the supplementary material.

In some experiments, we also tested KD in combination with AT or FT, because KD transfers output knowledge while AT and FT deliver knowledge from intermediate blocks, so the two kinds of methods can be combined into one (KD+AT or KD+FT).

4.1 CIFAR-10

The CIFAR-10 dataset consists of 50K training images and 10K testing images with 10 classes. We conducted several experiments on CIFAR-10 with various network architectures, including ResNet [8], Wide ResNet (WRN) [31] and VGG [25], and set up four conditions to test various situations. First, we used ResNet-20 and ResNet-56, which are used in the CIFAR-10 experiments of [8]; this condition covers the case where the teacher and the student networks have the same width (number of channels) but different depths (numbers of blocks). Secondly, we experimented with different types of residual networks using ResNet-20 and WRN-40-1. Thirdly, we examined the effect on knowledge transfer of the absence of the shortcut connections that exist in residual blocks, using VGG-13 and WRN-46-4. Lastly, we used WRN-16-1 and WRN-16-2 to test the applicability of knowledge transfer methods to architectures with the same depth but different widths.

Table 1: Mean classification error (%) on the CIFAR-10 dataset (5 runs). All the numbers are the results of our implementation. AT and KD are implemented according to [30].

Student / Teacher                        Student  KD    AT    FT    AT+KD  FT+KD  Teacher
ResNet-20 (0.27M) / ResNet-56 (0.85M)    7.78     7.19  7.13  6.85  7.04   6.89   6.39
ResNet-20 (0.27M) / WRN-40-1 (0.56M)     7.78     7.09  7.34  6.85  6.95   7.00   6.84
VGG-13 (9.4M) / WRN-46-4 (10M)           5.99     5.71  5.54  4.84  4.65   5.30   4.44
WRN-16-1 (0.17M) / WRN-16-2 (0.69M)      8.62     7.64  8.10  7.64  7.59   7.52   6.27

FT by paraphrase rate and paraphraser type:
Student / Teacher                        k=0.5  k=0.75  k=1   k=2   k=4   CAE   RAE
ResNet-20 / ResNet-56                    6.92   6.85    6.89  6.87  7.08  7.07  7.24
ResNet-20 / WRN-40-1                     7.05   7.16    7.04  6.85  7.05  7.33  7.26
VGG-13 / WRN-46-4                        5.09   4.84    5.04  5.01  4.98  5.53  5.85
WRN-16-1 / WRN-16-2                      7.83   7.64    7.74  7.87  7.95  8.48  8.00

Table 2: Median classification error (%) on the CIFAR-10 dataset (5 runs). The first 6 columns are from Table 1 of [30], while the last two columns are from our implementation.

Student / Teacher                        Student  AT    F-ActT  KD    AT+KD  Teacher  FT (k=0.5)  Teacher (ours)
WRN-16-1 (0.17M) / WRN-40-1 (0.56M)      8.77     8.25  8.62    8.39  8.01   6.58     8.12        6.55
WRN-16-2 (0.69M) / WRN-40-2 (2.2M)       6.31     5.85  6.24    6.08  5.71   5.23     5.51        5.09

In the first experiment, we wanted to show that our algorithm is applicable to various networks. The results of FT and the other knowledge transfer algorithms can be found in Table 1. In the table, the 'Student' column provides the performance of the student network trained from scratch, and the 'Teacher' column provides the performance of the pretrained teacher network. The numbers in parentheses are the network parameter counts in millions. The performances of AT and KD are better than that of 'Student' trained from scratch, and which of the two is better depends on the type of network used. For FT, we chose the best performance among the different k values shown in the bottom rows of the table. The proposed FT consistently shows better performance than AT and KD, regardless of the type of network used.

For the hybrid knowledge transfer methods AT+KD and FT+KD, we obtained the interesting result that AT and KD create some sort of synergy: in all cases, AT+KD performed better than standalone AT or KD.
It sometimes performed even better than FT, but the FT model trained together with KD loses its power in some cases.

As stated in Section 3.1, to check whether having a paraphraser per group in FT is beneficial, we trained a ResNet-20 student with paraphrasers and translators attached to group1, group2 and group3, using ResNet-56 as the teacher network with k = 0.75. The classification error was 7.01%, which is 0.06% higher than that of the single FT loss on the last group. This indicates that the combined FT loss does not improve the performance, so we used the single FT loss throughout the paper. In terms of paraphrasing the information of the teacher network, the paraphraser that maintains the spatial dimension outperformed the autoencoder-based methods that use a CAE or RAE.

As a second experiment, we compared FT with transferring FitNets-style hints, which use full activation maps, as in [30]. Table 2 shows the results, which verify that using the paraphrased information is more beneficial than directly using the full activation maps (full feature maps): in the table, FT gives a better accuracy improvement than full-activation transfer (F-ActT). Note that we trained a teacher network from scratch for factor transfer (the last column) under the same experimental environment as [30], because there is no pretrained model of the teacher networks.

4.2 CIFAR-100

For further analysis, we applied our algorithm to a more difficult task to demonstrate the generality of the proposed FT, adopting the CIFAR-100 dataset. CIFAR-100 contains the same number of images as CIFAR-10, 50K (train) and 10K (test), but has 100 classes, with only 500 images per class. Since the training dataset is more complicated, we expected the number of blocks (depth) of a network to have a much larger impact on classification performance, because deeper and stronger networks will better learn the boundaries between classes.
Thus, the experiments on CIFAR-100 were designed to observe the changes depending on the depths of the networks. The teacher network was fixed as ResNet-110, and the two networks ResNet-20 and ResNet-56, which have the same width (number of channels) as the teacher but different depths (numbers of blocks), were used as student networks. As can be seen in Table 3, we obtained the impressive result that the student network ResNet-56 trained with FT even outperforms the teacher network.

Table 3: Mean classification error (%) on the CIFAR-100 dataset (5 runs). All the numbers are from our implementation.

Student / Teacher                         Student  KD     AT     FT     AT+KD  FT+KD  Teacher
ResNet-56 (0.85M) / ResNet-110 (1.73M)    28.04    27.96  27.28  25.62  28.01  26.93  26.91
ResNet-20 (0.27M) / ResNet-110 (1.73M)    31.24    33.14  31.04  29.08  32.19  34.78  26.91

FT by paraphrase rate and paraphraser type:
Student / Teacher                         k=0.5  k=0.75  k=1    k=2    k=4    CAE    RAE
ResNet-56 / ResNet-110                    25.62  25.78   25.85  25.63  25.87  26.41  26.29
ResNet-20 / ResNet-110                    29.20  29.25   29.28  29.19  29.08  29.84  30.11

Table 4: Left: ablation study with and without the paraphraser (k = 0.5) and the translator (mean classification error (%) of 5 runs). Right: effect of the number of layers in the paraphraser.

Paraphraser  Translator  CIFAR-10  CIFAR-100      Number of layers in paraphraser  CIFAR-10  CIFAR-100
Yes          No          6.18      27.61          1 layer [0.07M]                  6.09      27.07
No           Yes         6.12      27.39          2 layers [0.22M]                 5.99      27.03
Yes          Yes         5.71      26.91          3 layers [0.26M]                 5.71      26.91
Student (WRN-40-1 [0.6M])  7.02    28.81          Teacher (WRN-40-2 [2.2M])        4.96      24.10
The student ResNet-20 did not work as well, but it also outperformed the other knowledge transfer methods. Additionally, in line with the experimental results in [30], we obtained the consistent result that KD suffers from the depth gap between the teacher and the student: in the case of training ResNet-20, its accuracy is even worse than that of the student network trained from scratch. For this dataset, the hybrid methods (AT+KD and FT+KD) were worse than standalone AT or FT. This also indicates that KD is not suitable for situations where the depth difference between the teacher and the student networks is large.

4.3 Ablation Study

In the introduction, we described how the teacher network provides more paraphrased information to the student network via factors, and described the need for a translator to act as a buffer that helps the student network better understand the factors. To further analyze the role of the factors, we performed an ablation experiment on the presence or absence of the paraphraser and the translator. The result is shown in Table 4. The student network and the teacher network are selected to have different numbers of output channels; one can adjust the numbers of student and teacher factors with the paraphrase rate k of the paraphraser. As described above, since the roles of the paraphraser (producing F_T with an unsupervised training loss) and the translator (trained jointly with the student network to ease the learning of factor transfer) are not the same, we can confirm that the synergy of the two modules maximizes the performance of the student network. We also report the performance for different numbers of layers in the paraphraser: as the number of layers increases, the performance also increases.

4.4 ImageNet

The ImageNet dataset is an image classification dataset which consists of 1.2M training images and 50K validation images with 1,000 classes.
We conducted large-scale experiments on ImageNet LSVRC 2015 in order to show the potential of our method to transfer even more complex and detailed information. We chose ResNet-18 as the student network and ResNet-34 as the teacher network, the same as in [30], and validated the performance based on the top-1 and top-5 error rates, as shown in Table 5.

Table 5: Top-1 and Top-5 classification error (%) on the ImageNet dataset. All the numbers are from our implementation.

Method        Network    Top-1  Top-5
Student       ResNet-18  29.91  10.68
KD            ResNet-18  33.83  12.55
AT            ResNet-18  29.36  10.23
FT (k = 0.5)  ResNet-18  28.57  9.71
Teacher       ResNet-34  26.73  8.57

As can be seen in Table 5, FT consistently outperforms the other methods. KD, again, suffers from the depth difference problem, as already confirmed in the other experiments. Just adding the FT loss lowers the Top-1 error of the student network (ResNet-18) on ImageNet by about 1.34%.

4.5 Object Detection

In this experiment, we verify the generality of FT by applying it to a detection task rather than classification. We used the Faster-RCNN pipeline [21] with the PASCAL VOC 2007 dataset [5] for object detection, with PASCAL VOC 2007 trainval as training data and PASCAL VOC 2007 test as testing data.

Table 6: Mean average precision on the PASCAL VOC 2007 test dataset.

Method                  mAP
Student (VGG-16)        69.5
FT (VGG-16, k = 0.5)    70.3
Teacher (ResNet-101)    75.0

Figure 3: Factor transfer applied to the Faster-RCNN framework.

Instead of using our own ImageNet-FT-pretrained model as the backbone network for detection, we applied our method to transfer knowledge about object detection itself.
Here, we hypothesized that since the factors are extracted in an unsupervised manner, they not only connote the core knowledge for classification, but can also convey other types of representations. In Faster-RCNN, the shared convolution layers contain knowledge of both classification and localization, so we applied factor transfer to the last of the shared convolution layers. Figure 3 shows where we applied FT in the Faster-RCNN framework. We set VGG-16 as the student network and ResNet-101 as the teacher network. Both networks were fine-tuned on the PASCAL VOC 2007 dataset from ImageNet-pretrained models. For FT, we used the ImageNet-pretrained VGG-16 model and fixed the layers before the conv3 layer during the training phase. Then, through the factor transfer, the gradient from the L_FT loss back-propagates to the student network via the student translator.

As can be seen in Table 6, we obtained a performance enhancement of 0.8 in mAP (mean average precision) by training Faster-RCNN with VGG-16 this way. As mentioned earlier, we believe that the later the layer to which factor transfer is applied, the larger the performance gain. However, due to the limits of the VGG-type backbone network we used, we could not apply FT anywhere other than the backbone network. Experiments on cases where FT can be applied to later layers, such as the region proposal network (RPN) or other types of detection networks, are left as future work.

4.6 Discussion

In this section, we compare FitNet [22] and FT. FitNet transfers information of an intermediate layer while FT uses the last layer, and the purpose of the regressor in FitNet is somewhat different from that of our translator. More specifically, Romero et al. [22] argued that giving hints from a deeper layer over-regularizes the student network; on the contrary, we chose the deeper layer to provide more specific information, as mentioned above.
Moreover, FitNet does not use a paraphraser. Note that FitNet is in fact a two-stage algorithm: it first initializes the student weights with hints and then trains the student network using knowledge distillation.

5 Conclusion

In this work, we propose factor transfer, a novel method for knowledge transfer. Unlike previous methods, we introduce factors, which contain paraphrased information of the teacher network extracted by the paraphraser. There are two main reasons why the student can understand information from the teacher network more easily with factor transfer than with other methods. One reason is that the factors can relieve the inherent differences between the teacher and student networks. The other is that the student's translator helps the student network understand the teacher factors by mimicking them. A downside of the proposed method is that factor transfer requires training a paraphraser to extract factors and adds the parameters of the paraphraser and the translator. However, training the paraphraser converges very quickly, and the additional parameters are not needed once the student network has been trained. In our experiments, we showed the effectiveness of factor transfer on various image classification datasets. We also verified that factor transfer can be applied to domains other than classification. We believe that our method will help further research in knowledge transfer.

Acknowledgments

This work was supported by the Next-Generation Information Computing Development Program through the NRF of Korea (2017M3C4A7077582) and the ICT R&D program of MSIP/IITP, Korean Government (2017-0-00306).

References

[1] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution.
The Journal of Machine Learning Research, 15(1):3563-3593, 2014.

[2] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.

[3] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123-3131, 2015.

[4] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[6] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737-1746, 2015.

[7] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.

[9] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[11] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam.
Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[12] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CoRR, abs/1711.09224, 2017.

[13] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

[14] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research).

[15] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-100 (Canadian Institute for Advanced Research).

[16] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on Machine learning, pages 473-480. ACM, 2007.

[17] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2554-2564. IEEE, 2016.

[18] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52-59. Springer, 2011.

[19] Wing WY Ng, Guangjun Zeng, Jiangjun Zhang, Daniel S Yeung, and Witold Pedrycz. Dual autoencoders features for imbalance classification problem. Pattern Recognition, 60:875-889, 2016.

[20] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542.
Springer, 2016.\n\n[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object\ndetection with region proposal networks. In Advances in neural information processing systems,\npages 91\u201399, 2015.\n\n[22] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and\n\nYoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.\n\n[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng\nHuang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei.\nImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision\n(IJCV), 115(3):211\u2013252, 2015.\n\n[24] Hoo-Chang Shin, Matthew R Orton, David J Collins, Simon J Doran, and Martin O Leach.\nStacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot\nstudy using 4d patient data. IEEE transactions on pattern analysis and machine intelligence,\n35(8):1930\u20131943, 2013.\n\n[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\n\nimage recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[26] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks.\n\narXiv preprint arXiv:1507.06149, 2015.\n\n[27] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional\nneural networks for mobile devices. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, pages 4820\u20134828, 2016.\n\n[28] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation:\nFast optimization, network minimization and transfer learning. In The IEEE Conference on\nComputer Vision and Pattern Recognition (CVPR), 2017.\n\n[29] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in\ndeep neural networks? 
In Advances in neural information processing systems, pages 3320-3328, 2014.

[30] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.

[31] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[32] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

[33] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.