{"title": "Discrimination-aware Channel Pruning for Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 875, "page_last": 886, "abstract": "Channel pruning is one of the predominant approaches for deep model compression. Existing pruning methods either train from scratch with sparsity constraints on channels, or minimize the reconstruction error between the pre-trained feature maps and the compressed ones. Both strategies suffer from some limitations: the former kind is computationally expensive and difficult to converge, whilst the latter kind optimizes the reconstruction error but ignores the discriminative power of channels. To overcome these drawbacks, we investigate a simple-yet-effective method, called discrimination-aware channel pruning, to choose those channels that really contribute to discriminative power. To this end, we introduce additional losses into the network to increase the discriminative power of intermediate layers and then select the most discriminative channels for each layer by considering the additional loss and the reconstruction error. Last, we propose a greedy algorithm to conduct channel selection and parameter optimization in an iterative way. Extensive experiments demonstrate the effectiveness of our method. 
For example, on ILSVRC-12, our pruned ResNet-50 with 30% reduction of channels even outperforms the original model by 0.39% in top-1 accuracy.", "full_text": "Discrimination-aware Channel Pruning for Deep Neural Networks

Zhuangwei Zhuang1∗, Mingkui Tan1∗†, Bohan Zhuang2∗, Jing Liu1∗, Yong Guo1, Qingyao Wu1, Junzhou Huang3,4, Jinhui Zhu1†
{z.zhuangwei, seliujing, guo.yong}@mail.scut.edu.cn, jzhuang@uta.edu
{mingkuitan, qyw, csjhzhu}@scut.edu.cn, bohan.zhuang@adelaide.edu.au
1South China University of Technology, 2The University of Adelaide, 3University of Texas at Arlington, 4Tencent AI Lab

Abstract

Channel pruning is one of the predominant approaches for deep model compression. Existing pruning methods either train from scratch with sparsity constraints on channels, or minimize the reconstruction error between the feature maps of a pre-trained model and those of the compressed one. Both strategies have limitations: the former is computationally expensive and difficult to converge, whilst the latter optimizes the reconstruction error but ignores the discriminative power of channels. In this paper, we investigate a simple-yet-effective method called discrimination-aware channel pruning (DCP) to choose the channels that genuinely contribute to discriminative power. To this end, we introduce additional discrimination-aware losses into the network to increase the discriminative power of intermediate layers, and then select the most discriminative channels for each layer by considering both the additional loss and the reconstruction error. Finally, we propose a greedy algorithm to conduct channel selection and parameter optimization iteratively. Extensive experiments demonstrate the effectiveness of our method. 
For example, on ILSVRC-12, our pruned ResNet-50 with 30% reduction of channels outperforms the baseline model by 0.39% in top-1 accuracy.

1 Introduction

Since 2012, convolutional neural networks (CNNs) have achieved great success in many computer vision tasks, e.g., image classification [21, 41], face recognition [37, 42], object detection [35, 36], image generation [7, 3] and video analysis [38, 47]. However, deep models often have a huge number of parameters and a very large model size, which incurs not only a heavy memory requirement but also an unbearable computation burden. As a result, deep learning methods are hard to apply on hardware devices with limited storage and computation resources, such as cell phones. To address this problem, model compression is an effective approach, which aims to reduce model redundancy without significant degradation in performance.

Recent studies on model compression mainly fall into three categories, namely, quantization [34, 54], sparse or low-rank compression [10, 11], and channel pruning [27, 28, 51, 49]. Network quantization seeks to reduce the model size by quantizing float weights into low-bit weights (e.g., 8 bits or even 1 bit). However, the training is very difficult due to the introduction of quantization errors. Making sparse connections can reach a high compression rate in theory, but it may generate irregular convolutional kernels which need sparse matrix operations to accelerate the computation. In contrast, channel pruning reduces the model size and speeds up the inference by removing redundant channels directly, so little additional effort is required for fast inference. On top of channel pruning, other compression methods such as quantization can be applied.

∗Authors contributed equally. †Corresponding author.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
In fact, pruning redundant channels often helps to improve the efficiency of quantization and achieve more compact models.

Identifying the informative (or important) channels, also known as channel selection, is a key issue in channel pruning. Existing works have exploited two strategies, namely, training-from-scratch methods, which directly learn the importance of channels with sparsity regularization [1, 27, 48], and reconstruction-based methods [14, 16, 24, 28]. Training-from-scratch methods are very difficult to train, especially for very deep networks on large-scale datasets. Reconstruction-based methods conduct channel pruning by minimizing the reconstruction error of feature maps between the pruned model and a pre-trained model [14, 28]. These methods suffer from a critical limitation: an actually redundant channel may be mistakenly kept in order to minimize the reconstruction error of feature maps. Consequently, these methods can result in a noticeable drop in accuracy on more compact and deeper models such as ResNet [13] on large-scale datasets.

In this paper, we aim to overcome the drawbacks of both strategies. First, in contrast to existing methods [14, 16, 24, 28], we assume and highlight that an informative channel, no matter where it is, should have discriminative power; otherwise it should be deleted. Based on this intuition, we propose to find the channels with true discriminative power for the network. Specifically, relying on a pre-trained model, we add multiple additional losses (i.e., discrimination-aware losses) evenly into the network. For each stage, we first fine-tune using one additional loss and the final loss to improve the discriminative power of intermediate layers. Then, we conduct channel pruning for each layer involved in the considered stage by considering both the additional loss and the reconstruction error of feature maps. 
In this way, we strike a balance between the discriminative power of channels and the feature map reconstruction.

Our main contributions are summarized as follows. First, we propose a discrimination-aware channel pruning (DCP) scheme for compressing deep models with the introduction of additional losses. DCP is able to find the channels with true discriminative power. DCP prunes and updates the model stage-wise using a proper discrimination-aware loss and the final loss. As a result, it is not sensitive to the initial pre-trained model. Second, we formulate the channel selection problem as an ℓ2,0-norm constrained optimization problem and propose a greedy method to solve it. Extensive experiments demonstrate the superior performance of our method, especially on deep ResNets. On ILSVRC-12 [4], when pruning 30% of the channels of ResNet-50, DCP improves the original ResNet model by 0.39% in top-1 accuracy. Moreover, when pruning 50% of the channels of ResNet-50, DCP outperforms ThiNet [28], a state-of-the-art method, by 0.81% and 0.51% in top-1 and top-5 accuracy, respectively.

2 Related studies

Network quantization. In [34], Rastegari et al. propose to quantize the parameters of a network into {+1, −1}. The proposed BWN and XNOR-Net can achieve accuracy comparable to their full-precision counterparts on large-scale datasets. In [55], high-precision weights, activations and gradients in CNNs are quantized to low bit-width versions, which brings great benefits for reducing resource requirements and power consumption on hardware devices. By introducing zero as a third quantized value, ternary weight networks (TWNs) [23, 56] can achieve higher accuracy than binary neural networks. Explorations on quantization [54, 57] show that quantized networks can even outperform the full-precision networks when quantized to values with more bits, e.g., 4 or 5 bits.

Sparse or low-rank connections. 
To reduce the storage requirements of neural networks, Han et al. suggest that neurons with zero input or output connections can be safely removed from the network [12]. With the help of ℓ1/ℓ2 regularization, weights are pushed to zero during training. Subsequently, the compression rate of AlexNet can reach 35× with the combination of pruning, quantization, and Huffman coding [11]. Considering that the importance of parameters changes during weight pruning, Guo et al. propose dynamic network surgery (DNS) in [10]. Training with sparsity constraints [40, 48] has also been studied to reach higher compression rates.

Deep models often contain many correlations among channels. To remove such redundancy, low-rank approximation approaches have been widely studied [5, 6, 19, 39]. For example, Zhang et al. speed up VGG by 4× with negligible performance degradation on ImageNet [53]. However, low-rank approximation approaches are unable to remove redundant channels that do not contribute to the discriminative power of the network.

Figure 1: Illustration of discrimination-aware channel pruning. Here, L_S^p denotes the discrimination-aware loss (e.g., cross-entropy loss) at layer L_p, L_M denotes the reconstruction loss, and L_f denotes the final loss. For the p-th stage, we first fine-tune the pruned model by L_S^p and L_f, then conduct channel selection for each layer in {L_{p-1}+1, ..., L_p} with L_S^p and L_M.

Channel pruning. Compared with network quantization and sparse connections, channel pruning removes both channels and the related filters from the network. Therefore, it is well supported by existing deep learning libraries, with little additional effort required. The key issue of channel pruning is to evaluate the importance of channels. Li et al. measure the importance of channels by calculating the sum of absolute values of weights [24]. Hu et al. 
de\ufb01ne average percentage of zeros (APoZ)\nto measure the activation of neurons [16]. Neurons with higher values of APoZ are considered\nmore redundant in the network. With a sparsity regularizer in the objective function, training-\nbased methods [1, 27] are proposed to learn the compact models in the training phase. With the\nconsideration of ef\ufb01ciency, reconstruction-methods [14, 28] transform the channel selection problem\ninto the optimization of reconstruction error and solve it by a greedy algorithm or LASSO regression.\n\n3 Proposed method\nLet {xi, yi}N\ni=1 be the training samples, where N indicates the number of samples. Given an L-layer\nCNN model M, let W \u2208 Rn\u00d7c\u00d7hf\u00d7zf be the model parameters w.r.t. the l-th convolutional layer (or\nblock), as shown in Figure 1. Here, hf and zf denote the height and width of \ufb01lters, respectively; c\nand n denote the number of input and output channels, respectively. For convenience, hereafter we\nomit the layer index l. Let X \u2208 RN\u00d7c\u00d7hin\u00d7zin and O \u2208 RN\u00d7n\u00d7hout\u00d7zout be the input feature maps\nand the involved output feature maps, respectively. Here, hin and zin denote the height and width\nof the input feature maps, respectively; hout and zout represent the height and width of the output\nfeature maps, respectively. Moreover, let Xi,k,:,: be the feature map of the k-th channel for the i-th\nsample. Wj,k,:,: denotes the parameters w.r.t. the k-th input channel and j-th output channel. The\noutput feature map of the j-th channel for the i-th sample, denoted by Oi,j,:,:, is computed by\n\nOi,j,:,: =(cid:80)c\n\nk=1 Xi,k,:,: \u2217 Wj,k,:,:,\n\n(1)\n\n(2)\n\nwhere \u2217 denotes the convolutional operation.\nGiven a pre-trained model M, the task of Channel Pruning is to prune those redundant channels in\nW to save the model size and accelerate the inference speed in Eq. (1). 
In order to choose channels, we introduce a variant of the ℓ2,0-norm, ||W||_{2,0} = Σ_{k=1}^{c} Ω(Σ_{j=1}^{n} ||W_{j,k,:,:}||_F), where Ω(a) = 1 if a ≠ 0, Ω(a) = 0 if a = 0, and ||·||_F represents the Frobenius norm. To induce sparsity, we can impose an ℓ2,0-norm constraint on W:

||W||_{2,0} = Σ_{k=1}^{c} Ω(Σ_{j=1}^{n} ||W_{j,k,:,:}||_F) ≤ κ_l,   (2)

where κ_l denotes the desired number of channels at layer l. Equivalently, given a predefined pruning rate η ∈ (0, 1) [1, 27], it follows that κ_l = ⌈ηc⌉.

3.1 Motivations

Given a pre-trained model M, existing methods [14, 28] conduct channel pruning by minimizing the reconstruction error of feature maps between the pre-trained model M and the pruned one. Formally, the reconstruction error can be measured by the mean squared error (MSE) between the feature maps of the baseline network and those of the pruned one:

L_M(W) = (1 / 2Q) Σ_{i=1}^{N} Σ_{j=1}^{n} ||O^b_{i,j,:,:} − O_{i,j,:,:}||_F^2,   (3)

where Q = N · n · h_out · z_out and O^b_{i,j,:,:} denotes the feature maps of the baseline network. Reconstructing feature maps can preserve most of the information in the learned model, but it has two limitations. First, the pruning performance is strongly affected by the quality of the pre-trained model M. If the baseline model is not well trained, the pruning performance can be very limited. 
Second, to achieve the minimal reconstruction error, some channels in intermediate layers may be mistakenly kept, even though they are actually not relevant to the discriminative power of the network. This issue becomes even more severe as the network gets deeper.

In this paper, we seek to do channel pruning by keeping those channels that really contribute to the discriminative power of the network. In practice, however, it is very hard to measure the discriminative power of channels due to the complex operations (such as ReLU activation and Batch Normalization) in CNNs. One may consider a channel important if the final loss L_f would sharply increase without it, but this is not practical when the network is very deep. In fact, in deep models the shallow layers often have little discriminative power due to the long propagation path. To increase the discriminative power of intermediate layers, one can introduce additional losses at the intermediate layers of deep networks [43, 22, 8]. In this paper, we insert P discrimination-aware losses {L_S^p}_{p=1}^{P} evenly into the network, as shown in Figure 1. Let {L_1, ..., L_P, L_{P+1}} be the layers at which we put the losses, with L_{P+1} = L being the final layer. For the p-th loss L_S^p, we consider doing channel pruning for layers l ∈ {L_{p−1} + 1, ..., L_p}, where L_{p−1} = 0 if p = 1. It is worth mentioning that we could add one loss to each layer of the network, i.e., L_l = l; however, this would be computationally expensive yet unnecessary.

3.2 Construction of discrimination-aware loss

The construction of the discrimination-aware loss L_S^p is very important in our method. As shown in Figure 1, each loss uses the output of layer L_p as its input feature maps. To make the computation of the loss feasible, we impose an average pooling operation over the feature maps. 
Moreover, to accelerate convergence, we apply batch normalization [18, 9] and ReLU [29] before the average pooling. In this way, the input feature maps for the loss at layer L_p, denoted by F^p(W), can be computed by

F^p(W) = AvgPooling(ReLU(BN(O^p))),   (4)

where O^p represents the output feature maps of layer L_p. Let F^{(p,i)} be the feature maps w.r.t. the i-th example. The discrimination-aware loss w.r.t. the p-th loss is formulated as

L_S^p(W) = −(1/N) Σ_{i=1}^{N} Σ_{t=1}^{m} I{y^{(i)} = t} log( exp(θ_t^T F^{(p,i)}) / Σ_{k=1}^{m} exp(θ_k^T F^{(p,i)}) ),   (5)

where I{·} is the indicator function, θ ∈ R^{n_p×m} denotes the classifier weights of the fully connected layer, n_p denotes the number of input channels of the fully connected layer, and m is the number of classes. Note that other losses, such as the angular softmax loss [26], can also be used as the additional loss.

In practice, since a pre-trained model contains very rich information about the learning task, similar to [28] we also seek to reconstruct the feature maps of the pre-trained model. By considering both the cross-entropy loss and the reconstruction error, we obtain the joint loss function

L(W) = L_M(W) + λ L_S^p(W),   (6)

where λ balances the two terms.

Proposition 1 (Convexity of the loss function) Let W be the model parameters of a considered layer. Given the mean squared loss and the cross-entropy loss defined in Eqs. (3) and (5), the joint loss function L(W) is convex w.r.t. W.3

3The proof can be found in Section S1 in the supplementary material.

Finally, the optimization problem for discrimination-aware channel pruning can be formulated as

min_W L(W),  s.t. ||W||_{2,0} ≤ κ_l,   (7)

where κ_l < c is the number of channels to be selected. 
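As an illustration of how these quantities fit together, the following NumPy sketch (ours, with hypothetical helper names) computes the ℓ2,0 channel count used in Eqs. (2) and (7) and the joint loss of Eq. (6); as a simplification, it takes the classifier logits θᵀF^{(p,i)} as given rather than recomputing the pooled features:

```python
import numpy as np

def l20_norm(W):
    """Eqs. (2)/(7): number of input channels with nonzero filter weights.

    W has shape (n, c, h_f, z_f). The joint Frobenius norm over axes (0, 2, 3)
    is zero exactly when sum_j ||W_{j,k}||_F is zero, so the count agrees with
    the paper's Omega-based definition.
    """
    per_channel = np.sqrt((W ** 2).sum(axis=(0, 2, 3)))
    return int(np.count_nonzero(per_channel))

def joint_loss(O_base, O_pruned, logits, labels, lam=1.0):
    """Eq. (6): L = L_M + lambda * L_S^p for one layer/stage.

    O_base, O_pruned: (N, n, h_out, z_out) feature maps; logits: (N, m); labels: (N,).
    """
    Q = np.prod(O_base.shape)                          # N * n * h_out * z_out
    L_M = ((O_base - O_pruned) ** 2).sum() / (2 * Q)   # Eq. (3), MSE term
    # Eq. (5): softmax cross-entropy over the classifier logits.
    z = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    L_S = -log_probs[np.arange(len(labels)), labels].mean()
    return L_M + lam * L_S
```

With λ = 0 this reduces to pure feature-map reconstruction, and with the reconstruction term dropped it reduces to a pure discrimination objective, matching the two extremes studied in the ablation of Section 5.2.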
In our method, the sparsity of W can either be determined by a predefined pruning rate (see Section 3) or adjusted automatically by the stopping conditions in Section 3.5. We explore both effects in Section 4.

3.3 Discrimination-aware channel pruning

By introducing P losses {L_S^p}_{p=1}^{P} at intermediate layers, the proposed discrimination-aware channel pruning (DCP) method proceeds as shown in Algorithm 1. Starting from a pre-trained model, DCP updates the model M and performs channel pruning in (P + 1) stages. Algorithm 1 is called discrimination-aware in the sense that an additional loss and the final loss are considered to fine-tune the model. Moreover, the additional loss is used to select channels, as discussed below. In contrast to GoogLeNet [43] and DSN [22], Algorithm 1 does not use all the losses at the same time. In fact, at each stage we consider only two losses, i.e., L_S^p and the final loss L_f.

Algorithm 1 Discrimination-aware channel pruning (DCP)
Input: Pre-trained model M, training data {x_i, y_i}_{i=1}^{N}, and parameters {κ_l}_{l=1}^{L}.
for p ∈ {1, ..., P + 1} do
  Construct loss L_S^p at layer L_p as in Figure 1.
  Learn θ and fine-tune M with L_S^p and L_f.
  for l ∈ {L_{p−1} + 1, ..., L_p} do
    Do channel selection for layer l using Algorithm 2.
  end for
end for

Algorithm 2 Greedy algorithm for channel selection
Input: Training data, model M, parameters κ_l and ε.
Output: Selected channel subset A and model parameters W_A.
Initialize A ← ∅ and t = 0.
while (stopping conditions are not achieved) do
  Compute gradients of L w.r.t. W: G = ∂L/∂W.
  Find the channel k = argmax_{j ∉ A} {||G_j||_F}.
  Let A ← A ∪ {k}.
  Solve Problem (8) to update W_A.
  Let t ← t + 1.
end while

At each stage of Algorithm 1, e.g., the p-th stage, we first construct the additional loss L_S^p and put it at layer L_p (see Figure 1). After that, we learn the model parameters θ w.r.t. L_S^p and fine-tune the model M at the same time with both the additional loss L_S^p and the final loss L_f. During fine-tuning, all the parameters in M are updated.4 With fine-tuning, the parameters regarding the additional loss can be well learned. Besides, fine-tuning is essential to compensate for the accuracy loss caused by previous pruning and to suppress the accumulated error. After fine-tuning with L_S^p and L_f, the discriminative power of layers l ∈ {L_{p−1} + 1, ..., L_p} can be significantly improved. Then, we can perform channel selection for the layers in {L_{p−1} + 1, ..., L_p}.

3.4 Greedy algorithm for channel selection

Due to the ℓ2,0-norm constraint, directly optimizing Problem (7) is very difficult. To address this issue, following general greedy methods [25, 2, 52, 45, 46], we propose a greedy algorithm to solve Problem (7). Specifically, we first remove all the channels and then select those channels that really contribute to the discriminative power of the deep network. Let A ⊂ {1, ..., c} be the index set of the selected channels, where A is empty at the beginning. As shown in Algorithm 2, the channel selection method proceeds in two steps. First, we select the most important channels of the input feature maps. At each iteration, we compute the gradients G_j = ∂L/∂W_j, where W_j denotes the parameters for the j-th input channel. We choose the channel k = argmax_{j ∉ A} {||G_j||_F} as an active channel and put k into A. 
Second, once A is determined, we optimize W w.r.t. the selected channels by minimizing the following problem:

min_W L(W),  s.t. W_{A^c} = 0,   (8)

where W_{A^c} denotes the submatrix indexed by A^c, the complement of A. We apply stochastic gradient descent (SGD) to address the problem in Eq. (8), updating W_A by

W_A ← W_A − γ ∂L/∂W_A,   (9)

where W_A denotes the submatrix indexed by A, and γ denotes the learning rate.

4The details of the fine-tuning algorithm are given in Section S2 in the supplementary material.

Note that when optimizing Problem (8), W_A is warm-started from the fine-tuned model M. As a result, the optimization can be completed very quickly. Moreover, since we only consider the model parameters W of one layer, we do not need all the data for the optimization. To trade off efficiency against performance, we randomly sample a subset of images from the training data.5 Last, since we use SGD to update W_A, the learning rate γ should be carefully adjusted to achieve an accurate solution. The following stopping conditions can then be applied, which also help to determine the number of channels to be selected.

3.5 Stopping conditions

Given a predefined parameter κ_l in Problem (7), Algorithm 2 stops once ||W||_{2,0} > κ_l. In practice, however, the parameter κ_l is hard to determine. Since L is convex, L(W_t) decreases monotonically with the iteration index t in Algorithm 2. We can therefore adopt the following stopping condition:

|L(W_{t−1}) − L(W_t)| / L(W_0) ≤ ε,   (10)

where ε is a tolerance value. If this condition is met, the algorithm stops and the number of selected channels is determined automatically as ||W_t||_{2,0}. An empirical study of the tolerance value ε is given in Section 5.3.

4 Experiments

In this section, we empirically evaluate the performance of DCP. Several state-of-the-art methods are adopted as baselines, including ThiNet [28], Channel Pruning (CP) [14] and Slimming [27]. Besides, to investigate the effectiveness of the proposed method, we include the following variants for study. DCP: DCP with a predefined pruning rate η. DCP-Adapt: we prune each layer with the stopping conditions in Section 3.5. WM: we shrink the width of a network by a fixed ratio and train it from scratch, which is known as the width-multiplier [15]. WM+: based on WM, we evenly insert additional losses into the network and train it from scratch. Random DCP: relying on DCP, we choose channels randomly instead of using the gradient-based strategy in Algorithm 2.

Datasets. We evaluate the performance of the various methods on three datasets: CIFAR-10 [20], ILSVRC-12 [4], and LFW [17]. CIFAR-10 consists of 50k training samples and 10k testing images in 10 classes. ILSVRC-12 contains 1.28 million training samples and 50k testing images for 1000 classes. LFW [17] contains 13,233 face images from 5,749 identities.

4.1 Implementation details

We implement the proposed method in PyTorch [32]. Based on the pre-trained model, we apply our method to select the informative channels. In practice, we decide the number of additional losses according to the depth of the network (see Section S4 in the supplementary material). Specifically, we insert 3 losses into ResNet-50 and ResNet-56, and 2 additional losses into VGGNet and ResNet-18. We fine-tune the whole network with the selected channels only. We use SGD with Nesterov momentum [30] for the optimization. The momentum and weight decay are set to 0.9 and 0.0001, respectively. We set λ to 1.0 in our experiments by default. 
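To make the selection procedure of Algorithm 2 and the stopping condition of Eq. (10) concrete, here is a self-contained NumPy sketch on a toy least-squares problem. The quadratic loss stands in for the joint loss L of Eq. (6), each coordinate of w plays the role of one channel, and all names and hyper-parameters are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def greedy_channel_selection(X, y, kappa, eps=1e-3, lr=0.1, inner_steps=200):
    """Algorithm 2 on a toy problem: L(w) = ||X w - y||^2 / (2N).

    Each coordinate of w plays the role of one channel. Returns the
    selected index set A and the optimized sparse weights w.
    """
    N, c = X.shape
    w = np.zeros(c)
    A = []
    loss = lambda v: np.sum((X @ v - y) ** 2) / (2 * N)
    L_prev = L0 = loss(w)
    while len(A) < kappa:
        # Step 1: pick the unselected channel with the largest gradient norm.
        G = X.T @ (X @ w - y) / N
        candidates = [j for j in range(c) if j not in A]
        k = max(candidates, key=lambda j: abs(G[j]))
        A.append(k)
        # Step 2: solve Problem (8) by gradient descent restricted to A.
        for _ in range(inner_steps):
            G = X.T @ (X @ w - y) / N
            w[A] -= lr * G[A]          # Eq. (9): update W_A only
        # Stopping condition, Eq. (10).
        L_t = loss(w)
        if abs(L_prev - L_t) / L0 <= eps:
            break
        L_prev = L_t
    return A, w
```

On synthetic data whose target depends on only a few channels, this sketch tends to pick the informative channels first and then halts once the relative loss decrease falls below ε, which is how DCP-Adapt determines the per-layer channel count automatically.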
On CIFAR-10, we \ufb01ne-tune 400 epochs using a mini-batch size of\n128. The learning rate is initialized to 0.1 and divided by 10 at epoch 160 and 240. On ILSVRC-12,\nwe \ufb01ne-tune the network for 60 epochs with a mini-batch size of 256. The learning rate is started at\n0.01 and divided by 10 at epoch 36, 48 and 54, respectively. The source code of our method can be\nfound at https://github.com/SCUT-AILab/DCP.\n\n4.2 Comparisons on CIFAR-10\n\nWe \ufb01rst prune ResNet-56 and VGGNet on CIFAR-10. The comparisons with several state-of-the-art\nmethods are reported in Table 1. From the results, our method achieves the best performance under\nthe same acceleration rate compared with the previous state-of-the-art. Moreover, with DCP-Adapt,\nour pruned VGGNet outperforms the pre-trained model by 0.58% in testing error, and obtains 15.58\u00d7\n\n5We study the effect of the number of samples in Section S5 in the supplementary material.\n\n6\n\n\fTable 1: Comparisons on CIFAR-10. \"-\" denotes that the results are not reported.\n\nModel\n\nVGGNet\n\n(Baseline 6.01%)\n\nResNet-56\n\n(Baseline 6.20%)\n\n#Param. \u2193\n#FLOPs \u2193\nErr. gap (%)\n#Param. \u2193\n#FLOPs \u2193\nErr. gap (%)\n\nSliming\n\n[27]\n\nCP\n[14]\n\nThiNet\n[28]\n1.92\u00d7 1.92\u00d7 8.71\u00d7 1.92\u00d7 1.92\u00d7\n2.00\u00d7 2.00\u00d7 2.04\u00d7 2.00\u00d7 2.00\u00d7\n+0.38\n+0.14\n+0.11\n1.97\u00d7 1.97\u00d7\n1.97\u00d7\n1.99\u00d7 1.99\u00d7\n1.99\u00d7\n+0.82\n+0.56\n+0.45\n\nWM WM+ Random\nDCP\n1.92\u00d7\n2.00\u00d7\n+0.14\n1.97\u00d7\n1.99\u00d7\n+0.63\n\n+0.32\n-\n2\u00d7\n+1.0\n\n+0.19\n\n-\n-\n-\n\nDCP\n1.92\u00d7\n2.00\u00d7\n-0.17\n1.97\u00d7\n1.99\u00d7\n+0.31\n\nDCP-Adapt\n15.58\u00d7\n2.86\u00d7\n-0.58\n3.37\u00d7\n1.89\u00d7\n-0.01\n\nreduction in model size. Compared with random DCP, our proposed DCP reduces the performance\ndegradation of VGGNet by 0.31%, which implies the effectiveness of the proposed channel selection\nstrategy. 
Besides, we also observe that the inserted additional losses bring performance gains to the networks. With additional losses, WM+ of VGGNet outperforms WM by 0.27% in testing error. Nevertheless, our method shows much better performance than WM+. For example, our pruned VGGNet with DCP-Adapt outperforms WM+ by 0.69% in testing error.

Pruning MobileNet v1 and MobileNet v2 on CIFAR-10. We apply DCP to prune recently developed compact architectures, i.e., MobileNet v1 and MobileNet v2, and evaluate the performance on CIFAR-10. We report the results in Table 2. With additional losses, WM+ of MobileNet v1 outperforms WM by 0.26% in testing error. Moreover, our pruned models achieve a 0.41% improvement over MobileNet v1 and a 0.22% improvement over MobileNet v2 in testing error. Note that Random DCP incurs performance degradation on both MobileNet v1 and MobileNet v2, by 0.30% and 0.57%, respectively.

Table 2: Performance of pruning 30% of the channels of MobileNet v1 and MobileNet v2 on CIFAR-10.

MobileNet v1 (baseline err. 6.04%):
| Method | WM | WM+ | Random DCP | DCP |
| #Param. ↓ | 1.43× | 1.43× | 1.43× | 1.43× |
| #FLOPs ↓ | 1.75× | 1.75× | 1.75× | 1.75× |
| Err. gap (%) | +0.48 | +0.22 | +0.30 | -0.41 |

MobileNet v2 (baseline err. 5.53%):
| Method | WM | WM+ | Random DCP | DCP |
| #Param. ↓ | 1.31× | 1.31× | 1.31× | 1.31× |
| #FLOPs ↓ | 1.36× | 1.36× | 1.36× | 1.36× |
| Err. gap (%) | +0.45 | +0.40 | +0.57 | -0.22 |

4.3 Comparisons on ILSVRC-12

To verify the effectiveness of the proposed method on large-scale datasets, we further apply our method to ResNet-50 to achieve 2× acceleration on ILSVRC-12. We report the single-view evaluation in Table 3. Our method outperforms ThiNet [28] by 0.81% and 0.51% in top-1 and top-5 error, respectively. Compared with channel pruning [14], our pruned model achieves a 0.79% improvement in top-5 error. 
Compared with WM+, which leads to a 2.41% increase in top-1 error, our method results in only a 1.06% increase in top-1 error.

Table 3: Comparisons on ILSVRC-12 with ResNet-50. The top-1 and top-5 error (%) of the pre-trained model are 23.99 and 7.07, respectively. "-" denotes that the results are not reported.

| Method | ThiNet [28] | CP [14] | WM | WM+ | DCP |
| #Param. ↓ | 2.06× | - | 2.06× | 2.06× | 2.06× |
| #FLOPs ↓ | 2.25× | 2× | 2.25× | 2.25× | 2.25× |
| Top-1 gap (%) | +1.87 | - | +2.81 | +2.41 | +1.06 |
| Top-5 gap (%) | +1.12 | +1.40 | +1.62 | +1.28 | +0.61 |

4.4 Experiments on LFW

We further conduct experiments on LFW [17], a standard benchmark dataset for face recognition. We use CASIA-WebFace [50] (which consists of 494,414 face images from 10,575 individuals) for training. With the same settings as in [26], we first train SphereNet-4 (which contains 4 convolutional layers) from scratch. Then, we adopt our method to compress the pre-trained SphereNet model. Since the fully connected layer occupies 87.65% of the parameters of the model, we also prune the fully connected layer to reduce the model size.

Table 4: Comparisons of prediction accuracy, #Param. and #FLOPs on LFW. We report the ten-fold cross-validation accuracy of the different models.

| Method | FaceNet [37] | DeepFace [44] | VGG [31] | SphereNet-4 [26] | DCP (prune 50%) | DCP (prune 65%) |
| #Param. | 140M | 120M | 133M | 12.56M | 5.89M | 4.06M |
| #FLOPs | 1.6B | 19.3B | 11.3B | 164.61M | 45.15M | 24.16M |
| LFW acc. (%) | 99.63 | 97.35 | 99.13 | 98.20 | 98.30 | 98.02 |

We report the results in Table 4. With a pruning rate of 50%, our method speeds up SphereNet-4 by 3.66× with a 0.1% improvement in ten-fold validation accuracy. 
Compared with huge networks, e.g.,\nFaceNet [37], DeepFace [44], and VGG [31], our pruned model achieves comparable performance\nbut has only 45.15M FLOPs and 5.89M parameters, which is suf\ufb01cient to be deployed on embedded\nsystems. Furthermore, pruning 65% channels in SphereNet-4 results in a more compact model, which\nrequires only 24.16M FLOPs with the accuracy of 98.02% on LFW.\n\n5 Ablation studies\n\n5.1 Performance with different pruning rates\n\nTo study the effect of using different pruning rates \u03b7, we prune 30%, 50%, and 70% channels of\nResNet-18 and ResNet-50, and evaluate the pruned models on ILSVRC-12. Experimental results\nare shown in Table 5. Here, we only report the performance under different pruning rates, while the\ndetailed model complexity comparisons are provided in Section S8 in the supplementary material.\nFrom Table 5, in general, performance of the pruned models goes worse with the increase of pruning\nrate. However, our pruned ResNet-50 with pruning rate of 30% outperforms the pre-trained model,\nwith 0.39% and 0.14% reduction in top-1 and top-5 error, respectively. Besides, the performance\ndegradation of ResNet-50 is smaller than that of ResNet-18 with the same pruning rate. For example,\nwhen pruning 50% of the channels, while it only leads to 1.06% increase in top-1 error for ResNet-50,\nit results in 2.29% increase of top-1 error for ResNet-18. One possible reason is that, compared to\nResNet-18, ResNet-50 is more redundant with more parameters, thus it is easier to be pruned.\n\nTable 5: Comparisons on ResNet-18 and ResNet-\n50 with different pruning rates. 
We report the top-1 and top-5 error (%) on ILSVRC-12.

Network     η               Top-1/Top-5 err.
ResNet-18   0% (baseline)   30.36/11.02
            30%             30.79/11.14
            50%             32.65/12.40
            70%             35.88/14.32
ResNet-50   0% (baseline)   23.99/7.07
            30%             23.60/6.93
            50%             25.05/7.68
            70%             27.25/8.87

Table 6: Pruning results on ResNet-56 with different λ on CIFAR-10.

λ               Training err.   Testing err.
0 (LM only)     7.96            12.24
0.001           7.61            11.89
0.005           6.86            11.24
0.01            6.36            11.00
0.05            4.18            9.74
0.1             3.43            8.87
0.5             2.17            8.11
1.0             2.10            7.84
1.0 (LS only)   2.82            8.28

5.2 Effect of the trade-off parameter λ

We prune 30% of the channels of ResNet-56 on CIFAR-10 with different values of λ, and report the training error and testing error without fine-tuning in Table 6. From the table, the performance of the pruned model improves as λ increases; a larger λ puts more emphasis on the additional loss (see Equation (6)). This demonstrates the effectiveness of the discrimination-aware strategy for channel selection. It is worth mentioning that both the reconstruction error and the cross-entropy loss contribute to better performance of the pruned model, which strongly supports our motivation to select important channels using both LS and LM. Since the network achieves the best result when λ is set to 1.0, we use this value to initialize λ in our experiments.

5.3 Effect of the stopping condition

To explore the effect of the stopping condition discussed in Section 3.5, we test different values of the tolerance ε. Here, we prune VGGNet on CIFAR-10 with ε ∈ {0.1, 0.01, 0.001}. Experimental results are shown in Table 7. In general, a smaller ε leads to a more rigorous stopping condition, and hence more channels are selected. As a result, the performance of the pruned model improves as ε decreases.
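The behavior of such a tolerance-based stopping rule can be sketched as follows (a simplified illustration, not the paper's exact algorithm: the importance scores and the toy loss are hypothetical placeholders). Channels are added greedily until the loss improvement from the next channel falls below a fraction ε of the initial loss, so a smaller ε keeps selecting longer and retains more channels:

```python
def select_channels(channel_scores, loss_fn, eps=0.01):
    """Greedy channel selection with a tolerance-based stopping rule.

    Channels are added in order of a (hypothetical) importance score;
    selection stops once adding the next channel reduces the loss by
    less than a fraction eps of the initial loss.
    """
    ranked = sorted(range(len(channel_scores)),
                    key=lambda c: channel_scores[c], reverse=True)
    selected = []
    prev_loss = loss_fn(selected)
    threshold = eps * abs(prev_loss)  # smaller eps -> stricter stopping rule
    for c in ranked:
        cur_loss = loss_fn(selected + [c])
        if prev_loss - cur_loss <= threshold:
            break  # improvement too small: stop selecting
        selected.append(c)
        prev_loss = cur_loss
    return selected


# Toy illustration: the "loss" is the total importance left unselected,
# so high-score channels give large loss drops.
scores = [5.0, 3.0, 0.1, 0.001]
toy_loss = lambda sel: sum(scores) - sum(scores[c] for c in sel)
print(select_channels(scores, toy_loss, eps=0.1))    # -> [0, 1]
print(select_channels(scores, toy_loss, eps=0.001))  # -> [0, 1, 2]
```

The toy run mirrors the ablation: decreasing ε from 0.1 to 0.001 admits an extra channel, i.e., a more rigorous stopping condition selects more channels.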
This experiment demonstrates the usefulness and effectiveness of the stopping condition for automatically determining the pruning rate.

Table 7: Effect of ε on channel selection. We prune VGGNet and report the testing error on CIFAR-10. The testing error of the baseline VGGNet is 6.01%.

Loss   ε       Testing err. (%)   #Param. ↓   #FLOPs ↓
L      0.1     12.68              152.25×     27.39×
       0.01    6.63               31.28×      5.35×
       0.001   5.43               15.58×      2.86×

5.4 Visualization of feature maps

We visualize the feature maps w.r.t. the pruned/selected channels of the first block (i.e., res-2a) in ResNet-18 in Figure 2. From the results, we observe that the feature maps of the pruned channels (see Figure 2(b)) are less informative than those of the selected ones (see Figure 2(c)). This indicates that the proposed DCP selects channels with strong discriminative power for the network. More visualization results can be found in Section S10 in the supplementary material.

Figure 2: Visualization of the feature maps of the pruned/selected channels of res-2a in ResNet-18: (a) input image, (b) feature maps of the pruned channels, (c) feature maps of the selected channels.

6 Conclusion

In this paper, we have proposed a discrimination-aware channel pruning method for the compression of deep neural networks. We formulate the channel pruning/selection problem as a sparsity-induced optimization problem that considers both the reconstruction error and the discriminative power of channels. Moreover, we propose a greedy algorithm to solve the optimization problem. Experimental results on benchmark datasets show that the proposed method outperforms several state-of-the-art methods by a large margin at the same pruning rate. Our DCP method provides an effective way to obtain more compact networks. Even for compact network designs such as MobileNet v1&v2, DCP can still improve performance by removing redundant channels; in particular, DCP improves MobileNet v2 on CIFAR-10 while removing 30% of its channels.
In the future, we will incorporate the per-layer computational cost into the optimization, and combine our method with other model compression strategies (such as quantization) to further reduce model size and inference cost.

Acknowledgements

This work was supported by National Natural Science Foundation of China (NSFC) (61876208, 61502177 and 61602185), Recruitment Program for Young Professionals, Guangdong Provincial Scientific and Technological funds (2017B090901008, 2017A010101011, 2017B090910005), Fundamental Research Funds for the Central Universities D2172480, Pearl River S&T Nova Program of Guangzhou 201806010081, CCF-Tencent Open Research Fund RAGR20170105, and Program for Guangdong Introducing Innovative and Entrepreneurial Teams 2017ZT07X183.

References

[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In NIPS, pages 2270–2278, 2016.

[2] S. Bahmani, B. Raj, and P. T. Boufounos. Greedy sparsity-constrained optimization. JMLR, 14(Mar):807–841, 2013.

[3] J. Cao, Y. Guo, Q. Wu, C. Shen, J. Huang, and M. Tan. Adversarial learning with local coordinate coding. In ICML, volume 80, pages 707–715, 2018.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.

[5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, pages 1269–1277, 2014.

[6] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets.
In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[8] Y. Guo, M. Tan, Q. Wu, J. Chen, A. V. D. Hengel, and Q. Shi. The shallow end: Empowering shallower deep-convolutional networks through auxiliary outputs. arXiv preprint arXiv:1611.01773, 2016.

[9] Y. Guo, Q. Wu, C. Deng, J. Chen, and M. Tan. Double forward propagation for memorized batch normalization. In AAAI, 2018.

[10] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In NIPS, pages 1379–1387, 2016.

[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.

[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, pages 1135–1143, 2015.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[14] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In ICCV, pages 1389–1397, 2017.

[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[16] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

[17] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.

[19] M. Jaderberg, A.
Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

[20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Tech Report, 2009.

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[22] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, pages 562–570, 2015.

[23] F. Li, B. Zhang, and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

[24] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In ICLR, 2017.

[25] J. Liu, J. Ye, and R. Fujimaki. Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In ICML, pages 503–511, 2014.

[26] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, pages 212–220, 2017.

[27] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In ICCV, pages 2736–2744, 2017.

[28] J.-H. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, pages 5058–5066, 2017.

[29] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.

[30] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In SMD, volume 27, pages 372–376, 1983.

[31] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.

[32] A. Paszke, S. Gross, S. Chintala, and G. Chanan.
Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration, 2017.

[33] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[34] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542, 2016.

[35] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.

[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

[37] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.

[38] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.

[39] V. Sindhwani, T. Sainath, and S. Kumar. Structured transforms for small-footprint deep learning. In NIPS, pages 3088–3096, 2015.

[40] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse neural networks. In CVPRW, pages 455–462, 2017.

[41] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, pages 2377–2385, 2015.

[42] Y. Sun, D. Liang, X. Wang, and X. Tang. Deepid3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.

[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.

[44] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.

[45] M. Tan, I.
W. Tsang, and L. Wang. Towards ultrahigh dimensional feature selection for big data. JMLR, 15(1):1371–1429, 2014.

[46] M. Tan, I. W. Tsang, and L. Wang. Matching pursuit lasso part i: Sparse recovery over big dictionary. TSP, 63(3):727–741, 2015.

[47] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36, 2016.

[48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, pages 2074–2082, 2016.

[49] J. Ye, X. Lu, Z. Lin, and J. Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124, 2018.

[50] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.

[51] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis. Nisp: Pruning networks using neuron importance score propagation. In CVPR, pages 9194–9203, 2018.

[52] X. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In ICML, pages 127–135, 2014.

[53] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. TPAMI, 38(10):1943–1955, 2016.

[54] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. In ICLR, 2017.

[55] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[56] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. In ICLR, 2017.

[57] B. Zhuang, C. Shen, M. Tan, L. Liu, and I. Reid.
Towards effective low-bitwidth convolutional neural networks. In CVPR, pages 7920–7928, 2018.