{"title": "Frequency-Domain Dynamic Pruning for Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1043, "page_last": 1053, "abstract": "Deep convolutional neural networks have demonstrated their power in a variety of applications. However, their storage and computational requirements have largely restricted their deployment on mobile devices. Recently, pruning of unimportant parameters has been used for both network compression and acceleration. Considering that there is spatial redundancy within most filters in a CNN, we propose a frequency-domain dynamic pruning scheme to exploit the spatial correlations. The frequency-domain coefficients are pruned dynamically in each iteration, and different frequency bands are pruned discriminatively, given their different importance to accuracy. Experimental results demonstrate that the proposed scheme outperforms previous spatial-domain counterparts by a large margin. Specifically, it achieves a compression ratio of 8.4× and a theoretical inference speed-up of 9.2× for ResNet-110, while the accuracy is even better than that of the reference model on CIFAR-10.", "full_text": "Frequency-Domain Dynamic Pruning for Convolutional Neural Networks\n\nZhenhua Liu1, Jizheng Xu2, Xiulian Peng2, Ruiqin Xiong1\n\n1Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University\n\n2Microsoft Research Asia\n\nliu-zh@pku.edu.cn, jzxu@microsoft.com, xipe@microsoft.com, rqxiong@pku.edu.cn\n\nAbstract\n\nDeep convolutional neural networks have demonstrated their power in a variety of applications. However, their storage and computational requirements have largely restricted their deployment on mobile devices. Recently, pruning of unimportant parameters has been used for both network compression and acceleration. 
Considering that there is spatial redundancy within most filters in a CNN, we propose a frequency-domain dynamic pruning scheme to exploit the spatial correlations. The frequency-domain coefficients are pruned dynamically in each iteration, and different frequency bands are pruned discriminatively, given their different importance to accuracy. Experimental results demonstrate that the proposed scheme outperforms previous spatial-domain counterparts by a large margin. Specifically, it achieves a compression ratio of 8.4× and a theoretical inference speed-up of 9.2× for ResNet-110, while the accuracy is even better than that of the reference model on CIFAR-10.\n\n1 Introduction\n\nIn recent years, convolutional neural networks have performed well in a variety of artificial intelligence tasks, including image classification, face recognition, natural language processing and speech recognition. Since convolutional neural networks tend to grow ever deeper, which means more storage requirements and floating-point operations, many works have been devoted to simplifying and accelerating deep neural networks.\nSome works performed structural sparsity approximation, which alters large sub-networks or layers into shallow ones. Jaderberg et al. [1] proposed to construct a low-rank basis of filters by exploiting cross-channel or filter redundancy. [2] took the nonlinear units into account and minimized the reconstruction error of nonlinear responses, subject to a low-rank constraint. [3] and [4] employed tensor decomposition and Tucker decomposition to simplify convolutional neural networks, respectively.\nSince operations in high precision are much more time-consuming than those on fixed-point values, Courbariaux et al. [5] proposed to constrain activations to +1 and −1. [6] proposed XNOR-Networks, which compute a scaling factor applied to both binary weights and binary inputs. 
[7] proposed HORQ networks, which recursively compute the quantized residual to reduce the information loss. Methods in [8–11] employed ternary or fixed-point values to compress and accelerate convolutional neural networks.\nSome researchers also employed quantization to reduce the computation of CNNs. [12] utilized k-means clustering to identify shared weights and forced all weights that fell into the same cluster to share the same value. [13] employed product quantization to implement efficient inner-product computation. [14] extended the quantization method into the frequency domain, using a hash function to randomly group frequency parameters into hash buckets; all parameters assigned to the same hash bucket share a single value learned with standard backpropagation. [15] decomposed the representations of convolutional filters in the frequency domain into common parts (i.e., cluster centers) shared by similar filters and their individual private parts (i.e., individual residuals). [16] revived a principled regularization method based on soft weight-sharing.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nBesides decomposition and quantization, network pruning is also a widely studied and efficient approach. By pruning the near-zero connections and retraining the pruned network, both the network storage and computation can be reduced. Han et al. [17] showed that network pruning can compress AlexNet and VGG-16 by 9× and 13×, respectively, with negligible accuracy loss on ImageNet. [18] proposed a dynamic network surgery (DNS) method to reduce the network complexity. 
Compared with pruning methods that accomplish this task in a greedy way, they incorporated connection splicing into the surgery to avoid incorrect pruning and turned pruning into a continual network maintenance process. They compressed the parameters in LeNet-5 and AlexNet by 108× and 17.7×, respectively. To further accelerate deep convolutional neural networks, [19] proposed to conduct channel pruning by LASSO-regression-based channel selection and least-squares reconstruction. [16] and [20] process kernel weights in the spatial domain and achieve both pruning and quantization in one training procedure. [21], [22] and [23] prune nodes or filters from a Bayesian point of view or with L0-norm regularization.\nThe pruning methods mentioned above all operate in the spatial domain. Actually, due to the local smoothness of images, most filters in a CNN tend to be smooth, i.e., there are spatial redundancies. In this paper, we try to fully exploit this spatial correlation and propose a frequency-domain network pruning approach. First we show that a convolution or an inner product can be implemented by a DCT-domain multiplication. We then apply dynamic pruning to the DCT coefficients of network filters, since the dynamic method achieves very good performance among the spatial pruning approaches. Moreover, due to the varying importance of different frequency bands, we compress them at different rates. Experimental results show that the proposed scheme outperforms previous spatial-domain counterparts by a large margin on several datasets, with no or negligible accuracy loss. Specifically, the proposed algorithm achieves an accuracy gain for ResNet on CIFAR-10, while achieving an impressive compression and acceleration of the network.\nThe rest of this paper is organized as follows. In Section 2, we introduce the proposed band-adaptive frequency-domain dynamic pruning scheme. 
The theoretical analysis of computational complexity is presented in Section 3. Section 4 shows the experimental results on several datasets. Section 5 concludes the paper.\n\n2 Frequency-Domain Network Pruning\n\nIn this section, we first show how a spatial-domain convolution or inner product can be implemented by multiplication in the frequency domain. Here we use the 2-D DCT for spatial redundancy removal. Then the proposed frequency-domain dynamic pruning method is introduced. Further, the band-adaptive rate allocation strategy is explained, which prunes different frequency bands discriminatively according to their importance.\n\n2.1 Frequency-Domain CNN\n\nWe first consider a convolutional layer of a CNN with the input tensor I ∈ R^{cin×win×hin} and convolutional filters W ∈ R^{cin×d×d×cout}. For the weight tensor W, the spatial support of each kernel filter is d × d. There are cin input channels and cout output feature maps. We unfold each group of cin filters into a 1-D vector with a size of (cin × d × d) × 1. Then the weight tensor W is reshaped to a (cin × d × d) × cout matrix W (see Fig. 1).\nLet O ∈ R^{cout×wout×hout} denote the output of the convolutional layer I ∗ W, where wout = ⌊(win + 2p − d)/s⌋ + 1 and hout = ⌊(hin + 2p − d)/s⌋ + 1; p and s are the padding and stride parameters, respectively. The input tensor I can be reshaped into a (wout × hout) × (cin × d × d) matrix I, where each row is the unfolded version of a sub-tensor in I with the same size as a group of cin filters. Then the convolution can be implemented by a matrix multiplication between I and W. The output is given by O = I · W, where O has a shape of (wout × hout) × (cout). 
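Under the shapes defined above, this unfold-and-multiply view of convolution can be sketched in NumPy (a minimal illustration; the function name and the loop-based unfolding are ours, not the paper's implementation):

```python
import numpy as np

def im2col_conv(inp, W, stride=1, pad=0):
    """Convolution as unfold + matrix multiplication (O = I . W).

    inp: input tensor of shape (c_in, h_in, w_in)
    W:   weight tensor of shape (c_in, d, d, c_out)
    Returns the output tensor of shape (c_out, h_out, w_out).
    """
    c_in, h_in, w_in = inp.shape
    _, d, _, c_out = W.shape
    x = np.pad(inp, ((0, 0), (pad, pad), (pad, pad)))
    h_out = (h_in + 2 * pad - d) // stride + 1
    w_out = (w_in + 2 * pad - d) // stride + 1
    # Each row of I is one unfolded (c_in * d * d) input patch.
    I = np.stack([x[:, i * stride:i * stride + d, j * stride:j * stride + d].ravel()
                  for i in range(h_out) for j in range(w_out)])
    # Each column of Wm is one unfolded group of c_in kernel filters.
    Wm = W.reshape(c_in * d * d, c_out)
    O = I @ Wm                               # shape (h_out * w_out, c_out)
    return O.T.reshape(c_out, h_out, w_out)  # fold each column back to a map
```

The channel-major `ravel()` of each patch matches the channel-major reshape of `W`, so the row/column inner products pair up corresponding weights and inputs.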
The matrix O can be reshaped to the output tensor O ∈ R^{cout×wout×hout} by folding each column into a wout × hout feature map.\n\nFigure 1: This figure shows the process of convolution in frequency domain.\n\nSuppose w = [w_1^T, w_2^T, ..., w_cin^T]^T is a column vector in W, where w_k (k ∈ [1, cin]) represents the 1-D form of a kernel filter. Then the 2-D DCT whose transformation size is d × d can be applied to each kernel filter. The operation can be seen as a matrix multiplication y_k = B · w_k, where B is the Kronecker tensor product of the 2-D DCT transformation matrix and itself. The shape of B is (d × d) × (d × d). After obtaining each sub-vector of coefficients y_k, the coefficient matrix Y is generated by reconnecting the sub-vectors and regrouping the vectors y. The input coefficient matrix X can be obtained using the same scheme, except that we apply the 2-D DCT to the row sub-vectors of I. The shapes of X and Y are the same as those of I and W, respectively.\nIn fact, the 2-D DCT transform of W can be viewed as a matrix multiplication, i.e., Y = A · W, where A is the Kronecker tensor product of a unit matrix E of size cin × cin and B, i.e., the block-diagonal matrix\n\nA = diag(B, B, ..., B)    (1)\n\nwith cin copies of B on the diagonal. On the other hand, we can derive X^T = A · I^T. The computations of X^T_{i,j} and Y_{i,j} are shown in Eqs. 2 and 3, respectively:\n\nX^T_{i,j} = Σ_{k=1}^{cin} Σ_{ℓ=1}^{d²} A_{i,(k−1)·d²+ℓ} · I^T_{(k−1)·d²+ℓ,j}    (2)\n\nY_{i,j} = Σ_{k=1}^{cin} Σ_{ℓ=1}^{d²} A_{i,(k−1)·d²+ℓ} · W_{(k−1)·d²+ℓ,j}    (3)\n\nSince the basis of the 2-D DCT is orthonormal, we can easily derive that both B and A are also orthonormal matrices. 
As shown in Figure 1, the output matrix O can be computed by directly multiplying X and Y, the proof of which is given as follows:\n\nX · Y = (A · I^T)^T · (A · W) = I · A^T · A · W = I · (A^T · A) · W = I · W = O\n\nIn this way, the convolution in the spatial domain is realized by matrix multiplication in the frequency domain.\nAs for the fully-connected layers, the weights can be viewed as a matrix of shape cin × cout and the input is a vector of shape 1 × cin. The same scheme as for convolutional layers can be directly applied to implement an inner product in the frequency domain. For those fully-connected layers whose inputs are the outputs of convolutional layers, the 2-D DCT size can be set as the size of the feature map in the previously connected convolutional layer. As for other fully-connected layers, the 2-D DCT size can be decided according to the size of the input vector. In this paper, we do not apply the transformation to the latter kind of fully-connected layers, since the correlations among their weights are not as strong.\n\n2.2 Frequency-Domain Dynamic Network Pruning (FDNP)\n\nAs mentioned in Section 2.1, we can obtain the transform coefficients of the input X and the weight filters Y in matrix form. In this section, we show how the proposed frequency-domain dynamic network pruning approach works. In the spatial domain, the filters in a CNN are mostly smooth due to the local pixel smoothness in natural images. In the frequency domain, this leads to components with large magnitudes in the low frequency bands and small magnitudes in the high frequency bands. 
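This concentration is easy to check numerically: building B as the Kronecker product of the d × d DCT matrix with itself, as above, and transforming a smooth kernel puts most of the energy in the low-frequency corner (a sketch; the Gaussian-shaped test filter and the helper name are hypothetical):

```python
import numpy as np

def dct_matrix(d):
    """Orthonormal type-II DCT matrix C of size d x d (C @ C.T = I)."""
    n = np.arange(d)
    C = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / d)
    C[0, :] *= np.sqrt(1.0 / d)
    C[1:, :] *= np.sqrt(2.0 / d)
    return C

d = 5
C = dct_matrix(d)
B = np.kron(C, C)  # 2-D DCT acting on a flattened d x d kernel, as in the text

# A smooth, Gaussian-like kernel standing in for a typical learned filter.
u = np.arange(d)
w = np.exp(-((u[:, None] - 2) ** 2 + (u[None, :] - 2) ** 2) / 4.0)

Y = (B @ w.ravel()).reshape(d, d)  # y = B . w, reshaped to the (u, v) grid
# Magnitudes fall off quickly away from the low-frequency (top-left) corner.
```

For row-major flattening, `np.kron(C, C) @ w.ravel()` equals the flattened separable transform `C @ w @ C.T`, which is why a single matrix B implements the whole 2-D DCT.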
The coefficients in the frequency domain are sparser than those in the spatial domain, so they have more potential to be pruned while retaining the same crucial information.\nIn order to represent a sparse model with part of its parameters pruned away, we utilize a mask matrix T whose values are binary to indicate the states of parameters, i.e., whether they are currently pruned or not. Then the optimization problem can be described as\n\nmin_{Y,T} L(D⁻¹(Y ⊗ T))  s.t.  T_{i,j} = f(Y_{i,j}),    (4)\n\nwhere L(·) is the loss function, D⁻¹ denotes the inverse 2-D DCT, ⊗ represents the Hadamard product operator, and f(·) is the discriminative function which satisfies f(Y_{i,j}) = 1 if coefficient Y_{i,j} seems to be important in the current layer and 0 otherwise. Following the dynamic method in [18], we set two thresholds a and b to decide the mask values of the coefficients for each layer, and to properly evaluate the importance of each coefficient, its absolute value is used. With the function f(·) below, coefficients are not pruned permanently and have a chance to return during the training process:\n\nf(Y^{t+1}_{i,j}) = 0 if |Y^{t+1}_{i,j}| < a;  T^t_{i,j} if a ≤ |Y^{t+1}_{i,j}| ≤ b;  1 if |Y^{t+1}_{i,j}| > b    (5)\n\nThe thresholds a and b are set according to the distribution of the coefficients in each layer, i.e.,\n\na = 0.9 · (µ + γ · σ)    (6)\nb = 1.1 · (µ + γ · σ)    (7)\n\nin which µ and σ are the mean value and the standard deviation of all coefficients in one layer, respectively. 
γ denotes the compression rate of each layer, by which the number of remaining coefficients in each layer is determined.\n\n2.3 Band Adaptive Frequency-Domain Dynamic Network Pruning (BA-FDNP)\n\nAs components of different frequencies tend to be of different magnitudes, their importance to the spatial structure of a filter varies. Thus we allow varying compression rates for different frequencies. The frequencies are partitioned into 2d − 1 regions after analyzing the distribution of the coefficients, where d is the transformation size. The compression rate γ_k is set for the kth frequency region, where k = u + v is the index of a frequency region. A smaller γ_k introduces a lower threshold, so fewer parameters are pruned. Since lower frequency components tend to be of higher importance, we commonly assign a lower γ_k to low-frequency regions with small indices (u, v). Correspondingly, the high frequencies with large indices (u, v) have magnitudes near zero, so larger compression rates fit them better.\nWe set the compression rate of the kth frequency region γ_k with a parameterized function, i.e., γ_k = g(·). We adopt the beta distribution in the experiment:\n\ng(x; λ, ω) = x^{λ−1}(1 − x)^{ω−1},    (8)\n\nwhere x = (k + 1)/2d, k ∈ [0, 2d − 2].\n\nAs we mentioned before, we expect a smaller compression rate for low-frequency components due to their higher importance. In the experiment, we make g(·) positively related to x by adjusting the values of λ and ω.\n\n2.4 Training Pruning Networks in Frequency Domain\n\nThe key point of training pruning networks in the frequency domain is the updating scheme of the weight coefficient matrix Y. 
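The band-adaptive rate allocation of Section 2.3 can be sketched as follows (a minimal illustration; the function name is ours, k runs over the 2d − 1 regions so that x stays strictly below 1, and λ = 1.0, ω = 0.8 are the values reported in Section 4):

```python
import numpy as np

def band_rates(d, lam=1.0, omega=0.8):
    """Per-band compression rates via Eq. 8: gamma_k = g(x; lam, omega)
    with x = (k + 1) / (2d), one rate per frequency region k = u + v.
    A d x d kernel has 2d - 1 such regions (k ranges over 0 .. 2d - 2)."""
    k = np.arange(2 * d - 1)
    x = (k + 1) / (2.0 * d)
    return x ** (lam - 1) * (1 - x) ** (omega - 1)

# With lam = 1.0 and omega = 0.8, g is increasing in x, so higher-frequency
# bands receive larger gamma_k and are therefore pruned more aggressively.
gammas = band_rates(3)  # d = 3 -> 5 frequency regions
```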
Suppose w is a kernel filter in W, and y denotes the corresponding coefficients after the 2-D DCT transformation, whose shape is d × d. Since the 2-D DCT is a linear transformation, the gradient in the frequency domain is merely the 2-D DCT transformation of the gradient in the spatial domain:\n\n∂L/∂y = D(∂L/∂w),    (9)\n\nwhere L is the total loss of the network. The proof is shown in the supplementary material. Then, inspired by the method of Lagrange multipliers and gradient descent, we can obtain a straightforward updating procedure of the filter parameters Y in the frequency domain:\n\nY_{(u,v)} ← Y_{(u,v)} − β · ∂L(Y ⊗ T)/∂(Y_{(u,v)} T_{(u,v)}),    (10)\n\nin which β is a positive learning rate. To enable the return of improperly pruned parameters, we update not only the non-zero coefficients, but also the ones corresponding to zero entries of T.\nThe procedure of training an L-layer BA-FDNP network can be divided into three phases: feed-forward, back-propagation and coefficient update. In the feed-forward phase, the input and weight filters are transformed into the frequency domain to complete the computation as shown in Section 2.1. During back-propagation, after computing the standard gradient ∂L/∂W in the spatial domain, the 2-D DCT is directly used to obtain the gradient ∂L/∂Y in the frequency domain. Then we apply the dynamic pruning method after updating the coefficients. Note that we update not only the remaining coefficients, but also the ones considered temporarily unimportant, so improperly pruned parameters have a chance to return. Repeating these steps iteratively, one can train a BA-FDNP CNN in the frequency domain. 
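A single iteration of this procedure, restricted to one layer, can be sketched as follows (an illustrative sketch, not the authors' code: grad_W_fn and dct_fn are hypothetical stand-ins for the framework's backward pass and the per-kernel 2-D DCT, and µ and σ are taken over coefficient magnitudes, one reading of the text's "all coefficients"):

```python
import numpy as np

def dynamic_prune_step(Y, T, grad_W_fn, dct_fn, beta, gamma):
    """One coefficient-update plus dynamic re-masking step for one layer.

    Y: frequency-domain coefficient matrix; T: binary mask of the same shape.
    """
    grad_W = grad_W_fn(Y * T)          # spatial gradient under the masked weights
    grad_Y = dct_fn(grad_W)            # Eq. 9: dL/dY is the 2-D DCT of dL/dW
    Y = Y - beta * grad_Y              # Eq. 10: update ALL coefficients, even pruned ones
    mag = np.abs(Y)
    mu, sigma = mag.mean(), mag.std()  # per-layer coefficient statistics
    a = 0.9 * (mu + gamma * sigma)     # Eq. 6
    b = 1.1 * (mu + gamma * sigma)     # Eq. 7
    # Eq. 5: prune below a, splice back above b, keep the old state in between.
    T = np.where(mag < a, 0.0, np.where(mag > b, 1.0, T))
    return Y, T
```

Because the mask is recomputed from the freshly updated coefficients, a coefficient zeroed in one iteration can re-enter the network in a later one, which is the "dynamic" part of the scheme.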
The process of the proposed algorithm is detailed in the supplementary material.\n\n3 Computational Complexity\n\nConsider a convolutional layer with W ∈ R^{cin×d×d×cout} as the weight tensor, and let Y denote the compressed coefficients in this layer. Suppose η is the ratio of non-zero elements in Y; the number of multiplications in the convolution operations is then η·cin·d²·cout·wout·hout in the frequency domain. In addition, each sub-input feature map with a size of d × d costs 2d · d² multiplications due to the separable 2-D DCT transform, so the additional computational cost of the 2-D DCT in one layer is 2d · d²·cin·wout·hout (see Fig. 1). Compared to the original CNN, the theoretical inference speed-up of the proposed scheme is\n\nr_s = (cin·d²·cout·wout·hout) / (2d·d²·cin·wout·hout + η·cin·d²·cout·wout·hout) = cout / (2d + η·cout).    (11)\n\nSuppose ξ is the ratio of non-zero elements in the compressed weights of a spatial-domain pruning method. The inference speed-up of the spatial-domain pruning method is\n\nr_s' = (cin·d²·cout·wout·hout) / (ξ·cin·d²·cout·wout·hout) = 1/ξ.    (12)\n\nEqs. 11 and 12 give the inference speed-up of one layer; the inference speed-up of the whole network also depends on the size of the output feature map in each layer, besides the compression rate of each layer. Although our scheme incurs an additional transformation cost compared to spatial-domain pruning methods, a larger compression ratio can be acquired, i.e., fewer non-zero elements are left in the compressed parameters, due to the sparser representation of the weight filters. 
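These two speed-up expressions are easy to compare numerically (a sketch; the function names are ours):

```python
def freq_speedup(c_out, d, eta):
    """Eq. 11: per-layer speed-up of frequency-domain pruning; eta is the
    ratio of non-zero DCT coefficients, and the 2d term in the denominator
    is the amortized cost of the separable 2-D DCT."""
    return c_out / (2 * d + eta * c_out)

def spatial_speedup(xi):
    """Eq. 12: per-layer speed-up of spatial-domain pruning, xi = ratio
    of non-zero weights."""
    return 1.0 / xi

# Example: a 3x3 convolution with 256 output channels. Keeping 5% of the
# frequency coefficients outruns keeping 10% of the spatial weights, even
# though the DCT itself costs extra multiplications.
```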
We will show the detailed results in Section 4.\n\nTable 1: Compression results comparison of our methods with [17] and [18] on LeNet-5.\n\nLeNet-5 | Parameters | Top-1 Accuracy | Iterations | Compression\nReference | 431K | 99.07% | 10K | 1×\nPruned [17] | 34.5K | 99.08% | 10K | 12.5×\nPruned [18] | 4.0K | 99.09% | 16K | 108×\nFDNP (ours) | 3.3K | 99.07% | 20K | 130×\nBA-FDNP (ours) | 2.8K | 99.08% | 20K | 150×\n\nTable 2: Comparison of the percentage of remaining parameters in each layer after applying [17], [18] and our methods on LeNet-5.\n\nLayer | Params. | Params.% [17] | Params.% [18] | Params.% (FDNP) | Params.% (BA-FDNP)\nconv1 | 0.5K | 66% | 14.2% | 12.4% | 12%\nconv2 | 25K | 12% | 3.1% | 2.5% | 2.1%\nfc1 | 400K | 8% | 0.7% | 0.6% | 0.5%\nfc2 | 5K | 19% | 4.3% | 3.5% | 3.6%\nTotal | 431K | 8% | 0.9% | 0.77% | 0.67%\n\n4 Experimental Results\n\nIn this section, we conduct comprehensive experiments on three benchmark datasets: MNIST, ImageNet and CIFAR-10. LeNet, AlexNet and ResNet are tested on these three datasets, respectively. We mainly compare our schemes with [17] and [18], which are spatial-domain pruning methods. The results of LeNet and AlexNet for the compared methods are taken directly from the papers; since they did not report results for ResNet, we train the compressed ResNet models ourselves as described in those papers.\nThe training processes are all performed with the Caffe framework [24]. A pre-trained model is obtained before applying pruning, and unless mentioned otherwise, the learning policy during fine-tuning is the same as the one used to obtain the pre-trained model. The momentum and weight decay are set to 0.9 and 0.0001 in all experiments. Since LeNet and AlexNet have only a few layers and the kernel size of each layer is different, we set the compression rate γ of each layer manually. On the other hand, we set the same γ for every layer in ResNet. 
In the BA-FDNP scheme, the hyperparameters λ and ω are set to 1.0 and 0.8, respectively.\n\n4.1 LeNet-5 on MNIST\n\nWe first apply our scheme on MNIST with LeNet-5. MNIST is a benchmark image classification dataset of handwritten digits from 0 to 9, and LeNet-5 is a conventional neural network which consists of four learnable layers, including two convolutional layers and two fully-connected layers. It was designed by LeCun et al. [25] for document recognition and has 431K learnable parameters. The learning rate is set to 0.1 initially and reduced by 10 times every 4K iterations during training. We use the "xavier" initialization method and train a reference model whose top-1 accuracy is 99.07% with 10K iterations.\nWhile compressing LeNet-5 with FDNP and BA-FDNP, the batch size is set to 64 and the maximal number of iterations is increased to 15K. The comparison of our proposed schemes with [17] and [18] is shown in Table 1. The network parameters of LeNet-5 are reduced by factors of 130× and 150× with FDNP and BA-FDNP, respectively, which is much better than [17] and [18], while the classification accuracies are as good.\nTo better demonstrate the advantage of our schemes, we make layer-by-layer comparisons among [17], [18] and our schemes in Table 2. There is a considerable improvement in every layer due to the transformation, and we can see that the performance benefits from different compression rates for different frequency bands.\n\nTable 3: Compression results comparison of our methods with [17] and [18] on AlexNet.\n\nAlexNet | Top-1 Acc. | Top-5 Acc. | Parameters | Iterations | Compression\nReference | 56.58% | 79.88% | 61M | 45K | 1×\nPruned [17] | 57.23% | 80.33% | 6.8M | 480K | 9×\nPruned [18] | 56.91% | 80.01% | 3.45M | 70K | 17.7×\nFDNP (ours) | 56.84% | 80.02% | 2.9M | 70K | 20.9×\nBA-FDNP (ours) | 56.82% | 79.96% | 2.7M | 70K | 22.6×\n\nTable 4: Comparison of the percentage of remaining parameters in each layer after applying [17], [18] and our methods for AlexNet.\n\nLayer | Params. | Params.% [17] | Params.% [18] | Params.% (FDNP) | Params.% (BA-FDNP)\nconv1 | 35K | 84% | 53.8% | 40.7% | 42.3%\nconv2 | 307K | 38% | 40.6% | 35.1% | 34.6%\nconv3 | 885K | 35% | 29.0% | 28.6% | 24.6%\nconv4 | 664K | 37% | 32.3% | 29.9% | 27.7%\nconv5 | 443K | 37% | 32.5% | 26.7% | 23.8%\nfc1 | 38M | 9% | 3.7% | 3.4% | 3.0%\nfc2 | 17M | 9% | 6.6% | 4.7% | 4.8%\nfc3 | 4M | 25% | 4.6% | 3.9% | 3.8%\nTotal | 61M | 11% | 5.7% | 4.8% | 4.4%\n\n4.2 AlexNet on ImageNet\n\nWe further examine the performance of our scheme on the ILSVRC-2012 dataset, which has 1.2 million training images and 50K validation images. AlexNet is adopted as the inference network; it has five convolutional layers and three fully-connected layers. After 450K iterations of training, a reference model with 61 million well-learned parameters is generated. While performing FDNP and BA-FDNP, the convolutional layers and fully-connected layers are pruned separately, which is also the practice in [18]. We run 350K iterations for the convolutional layers and 350K iterations for the fully-connected layers, respectively. While the weight coefficients of the convolutional layers are being pruned, those of the fully-connected layers are updated as well but not pruned, and vice versa. We use a learning rate of 0.01 and reduce it by 10 times every 100K iterations. 
The batch size is set to 32 and we use the "gaussian" initialization method for training.\nAs mentioned in Section 2.1, FDNP and BA-FDNP are applied to the convolutional layers as well as the first fully-connected layer, whose input is the output feature map of a convolutional layer; the other fully-connected layers are pruned in the spatial domain using the method in [18]. Table 3 shows the comparison of our schemes with [17] and [18] on AlexNet. Our FDNP and BA-FDNP methods achieve 20.9× and 22.6× compression ratios, which are better than those of the spatial-domain pruning methods. Besides, the classification accuracies of our compressed models are still comparable with those of the compared methods and better than the reference model.\nWe compare the percentage of remaining parameters in each layer of AlexNet after applying [17], [18] and our methods in Table 4. Our methods prune more parameters in every single layer. Although we utilize the method in [18] for the last two fully-connected layers, our compressed model achieves larger compression ratios on these two layers due to the greater expressive capacity of the other layers.\n\n4.3 ResNet on CIFAR-10\n\nTo further demonstrate the effectiveness of our scheme, we apply it to the modern neural network ResNet [26] on CIFAR-10. CIFAR-10 is also a classification benchmark dataset, containing a training set of 50K images and a test set of 10K images. During training, we use the same data augmentation as in [27], which contains flipping and translation. 
ResNet-20 and ResNet-110 are evaluated in this experiment.\n\nFigure 2: This figure shows the accuracies, compression rates and theoretical inference speed-up of ResNet-20 under different γ.\n\nFigure 3: This figure shows the accuracies, compression rates and theoretical inference speed-up of ResNet-110 under different γ.\n\nThe learning rate is 0.1 and reduced by 10 times every 40K iterations. The "msra" initialization method is adopted in this experiment. After 100K iterations of training, the top-1 accuracies of the two reference models are 91.86% and 93.41%, respectively.\nWe apply the method in [18], FDNP and BA-FDNP to the reference models of ResNet-20 and ResNet-110 separately. The batch size is set to 100 and the maximal number of iterations is set to 150K. While applying FDNP and BA-FDNP, we employ the spatial-domain dynamic pruning method in [18] to compress the convolutional layers whose kernel sizes are 1 × 1.\nFigs. 2 and 3 show the performance of [18] and our proposed schemes under different γ. It can be seen that a larger γ prominently improves the compression ratio, at the cost of slightly decreased accuracy. Under the same conditions, FDNP achieves larger compression ratios and inference speed-up rates while the accuracies are nearly the same as or even better than those of [18]. When we compress the model using different compression rates for different frequency bands, the performance is even better. While keeping the accuracies the same as those of the reference models, our BA-FDNP scheme can compress ResNet-20 and ResNet-110 by factors of 6.5× and 8.4×, respectively. In the meantime, the theoretical inference speed-up ratios are 6.4× and 9.2×, respectively, which means both the storage requirements and the FLOPs can be well reduced. 
Interestingly, the speed-up ratio of ResNet-110 is even larger than its compression ratio, despite the additional computational cost of the 2-D DCT transformation. We attribute this to our scheme pruning more coefficients in the layers that have larger output feature maps.\nFigure 4(a) shows the number of pruned and remaining parameters of each layer in ResNet-20 when γ is set to 1.2. The proportions of remaining parameters differ across layers even though we set the same compression rate. The energy histogram of the coefficients before and after BA-FDNP is shown in Figure 4(b). The energy of the coefficients is more concentrated in the lower frequency bands after pruning, since more high-frequency coefficients are pruned. This result suggests that the lower frequency components tend to be of higher importance.\n\nFigure 4: This figure shows (a) the number of pruned and remaining parameters of each layer in ResNet-20 after applying BA-FDNP, (b) the energy histogram of coefficients in each band before and after BA-FDNP in ResNet-20.\n\n5 Conclusion\n\nIn this paper, we propose a novel approach to compress convolutional neural networks by dynamically pruning the unimportant weight coefficients in the frequency domain. We first give an implementation of CNNs in the frequency domain. The coefficients can be efficiently pruned since they are sparser after the 2-D DCT transformation, and many spatial-domain pruning methods can be applied. Moreover, we set different compression rates for different frequency bands due to their varying importance. Our BA-FDNP scheme achieves 8.4× compression and 9.2× acceleration for ResNet-110 without any loss of accuracy, which outperforms previous pruning methods by a considerable margin. 
In the future, we will consider exploiting the correlations among different channels and employing a 3-D transform to further compress and accelerate convolutional neural networks. Besides, quantization and Huffman coding can also be applied to the coefficients in the frequency domain.\n\nAcknowledgements This work was partly supported by the National Key Research and Development Program of China (2017YFB1002203), the National Natural Science Foundation of China (61772041), the Beijing Natural Science Foundation (4172027), and also by the Cooperative Medianet Innovation Center. This work was done when Z. Liu was with Microsoft Research Asia.\n\nReferences\n\n[1] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," British Machine Vision Conference, 2014.\n\n[2] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1984–1992.\n\n[3] C. Tai, T. Xiao, Y. Zhang, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," International Conference on Learning Representations, 2016.\n\n[4] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," International Conference on Learning Representations, 2016.\n\n[5] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," arXiv preprint arXiv:1602.02830, 2016.\n\n[6] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 
525\u2013542.\n\n9\n\n\f[7] Z. Li, B. Ni, W. Zhang, X. Yang, and W. Gao, \u201cPerformance guaranteed network acceleration\nvia high-order residual quantization,\u201d in IEEE International Conference on Computer Vision,\n2017, pp. 2603\u20132611.\n\n[8] P. Wang and J. Cheng, \u201cFixed-point factorized networks,\u201d computer vision and pattern\n\nrecognition, pp. 4012\u20134020, 2016.\n\n[9] C. Zhu, S. Han, H. Mao, and W. J. Dally, \u201cTrained ternary quantization,\u201d international\n\nconference on learning representations, 2016.\n\n[10] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey, \u201cTernary neural\n\nnetworks with \ufb01ne-grained quantization,\u201d arXiv preprint arXiv:1705.01462, 2017.\n\n[11] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, \u201cIncremental network quantization: Towards\n\nlossless cnns with low-precision weights,\u201d arXiv preprint arXiv:1702.03044, 2017.\n\n[12] S. Han, H. Mao, and W. J. Dally, \u201cDeep compression: Compressing deep neural networks\nwith pruning, trained quantization and huffman coding,\u201d international conference on learning\nrepresentations, 2016.\n\n[13] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, \u201cQuantized convolutional neural networks for\n\nmobile devices,\u201d computer vision and pattern recognition, pp. 4820\u20134828, 2016.\n\n[14] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, \u201cCompressing convolutional\n\nneural networks in the frequency domain.,\u201d in KDD, 2016, pp. 1475\u20131484.\n\n[15] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu, \u201cCnnpack: packing convolutional neural networks\nin the frequency domain,\u201d in Advances in Neural Information Processing Systems, 2016, pp.\n253\u2013261.\n\n[16] Karen Ullrich, Edward Meeds, and Max Welling, \u201cSoft weight-sharing for neural network\n\ncompression,\u201d arXiv preprint arXiv:1702.04008, 2017.\n\n[17] S. Han, J. Pool, J. Tran, and W. J. 
Dally, \u201cLearning both weights and connections for ef\ufb01cient\n\nneural networks,\u201d neural information processing systems, pp. 1135\u20131143, 2015.\n\n[18] Y. Guo, A. Yao, and Y. Chen, \u201cDynamic network surgery for ef\ufb01cient dnns,\u201d in Advances In\n\nNeural Information Processing Systems, 2016, pp. 1379\u20131387.\n\n[19] Y. He, X. Zhang, and J. Sun, \u201cChannel pruning for accelerating very deep neural networks,\u201d in\n\nInternational Conference on Computer Vision (ICCV), 2017, vol. 2, p. 6.\n\n[20] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov, \u201cVariational dropout sparsi\ufb01es deep\n\nneural networks,\u201d arXiv preprint arXiv:1701.05369, 2017.\n\n[21] Christos Louizos, Karen Ullrich, and Max Welling, \u201cBayesian compression for deep learning,\u201d\n\nin Advances in Neural Information Processing Systems, 2017, pp. 3288\u20133298.\n\n[22] Christos Louizos, Max Welling, and Diederik P Kingma, \u201cLearning sparse neural networks\n\nthrough l_0 regularization,\u201d arXiv preprint arXiv:1712.01312, 2017.\n\n[23] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov, \u201cStructured\nbayesian pruning via log-normal multiplicative noise,\u201d in Advances in Neural Information\nProcessing Systems, 2017, pp. 6775\u20136784.\n\n[24] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick,\nSergio Guadarrama, and Trevor Darrell, \u201cCaffe: Convolutional architecture for fast feature\nembedding,\u201d in Proceedings of the 22nd ACM international conference on Multimedia. ACM,\n2014, pp. 675\u2013678.\n\n[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, \u201cGradient-based learning applied to document\n\nrecognition,\u201d Proceedings of the IEEE, vol. 86, no. 11, pp. 2278\u20132324, 1998.\n\n10\n\n\f[26] K. He, X. Zhang, S. Ren, and J. 
Sun, \u201cDeep residual learning for image recognition,\u201d in\nProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.\n770\u2013778.\n\n[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, \u201cIdentity mappings in deep residual\n\nnetworks,\u201d in European conference on computer vision. Springer, 2016, pp. 630\u2013645.\n\n11\n\n\f", "award": [], "sourceid": 560, "authors": [{"given_name": "Zhenhua", "family_name": "Liu", "institution": "Peking University"}, {"given_name": "Jizheng", "family_name": "Xu", "institution": "Bytedance Inc."}, {"given_name": "Xiulian", "family_name": "Peng", "institution": "Microsoft Research"}, {"given_name": "Ruiqin", "family_name": "Xiong", "institution": "Peking University"}]}