{"title": "Synaptic Strength For Convolutional Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 10149, "page_last": 10158, "abstract": "Convolutional Neural Networks(CNNs) are both computation and memory inten-sive which hindered their deployment in mobile devices. Inspired by the relevantconcept in neural science literature, we propose Synaptic Pruning: a data-drivenmethod to prune connections between input and output feature maps with a newlyproposed class of parameters called Synaptic Strength. Synaptic Strength is de-signed to capture the importance of a connection based on the amount of informa-tion it transports. Experiment results show the effectiveness of our approach. OnCIFAR-10, we prune connections for various CNN models with up to96%, whichresults in significant size reduction and computation saving. Further evaluation onImageNet demonstrates that synaptic pruning is able to discover efficient modelswhich is competitive to state-of-the-art compact CNNs such as MobileNet-V2andNasNet-Mobile. Our contribution is summarized as following: (1) We introduceSynaptic Strength, a new class of parameters for CNNs to indicate the importanceof each connections. (2) Our approach can prune various CNNs with high com-pression without compromising accuracy. 
(3) Further investigation shows that the proposed Synaptic Strength is a better indicator for kernel pruning than the previous approach, in both empirical results and theoretical analysis.", "full_text": "Synaptic Strength For Convolutional Neural Network\n\nChen Lin\n\nSenseTime Research\n\nlinchen@sensetime.com\n\nWei Wu\n\nSenseTime Research\n\nwuwei@sensetime.com\n\nZhao Zhong \u2217\nNLPR, CASIA\n\nUniversity of Chinese Academy of Sciences\n\nzhao.zhong@nlpr.ia.ac.cn\n\nJunjie Yan\n\nSenseTime Research\n\nyanjunjie@sensetime.com\n\nAbstract\n\nConvolutional Neural Networks (CNNs) are both computation and memory intensive, which hinders their deployment on mobile devices. Inspired by the relevant concept in the neuroscience literature, we propose Synaptic Pruning: a data-driven method to prune connections between input and output feature maps with a newly proposed class of parameters called Synaptic Strength. Synaptic Strength is designed to capture the importance of a connection based on the amount of information it transports. Experimental results show the effectiveness of our approach. On CIFAR-10, we prune up to 96% of connections for various CNN models, which results in significant size reduction and computation saving. Further evaluation on ImageNet demonstrates that synaptic pruning is able to discover efficient models which are competitive with state-of-the-art compact CNNs such as MobileNet-V2 and NasNet-Mobile. Our contributions are summarized as follows: (1) We introduce Synaptic Strength, a new class of parameters for CNNs that indicates the importance of each connection. (2) Our approach can prune various CNNs at high compression rates without compromising accuracy. 
(3) Further investigation shows that the proposed Synaptic Strength is a better indicator for kernel pruning than the previous approach, in both empirical results and theoretical analysis.\n\n1 Introduction\n\nIn recent years, Convolutional Neural Networks (CNNs) have gradually become dominant in the computer vision community. Despite their good performance, CNNs have a huge number of parameters, resulting in high resource demands for storage and computation. Modern CNNs can reach hundreds of millions of parameters and billions of operations, which makes them difficult to deploy. To alleviate the aforementioned problem, various methods have been proposed to increase the efficiency of CNNs. These include knowledge distillation [12, 28], low-rank decomposition [24, 7], network quantization/binarization [38, 5, 4, 27] and weight pruning [9]. Recent work shows that there exists a large amount of redundancy in CNNs [9, 13]. Accordingly, with appropriate schemes we can reduce the model size without compromising accuracy.\n\nMeanwhile, in the neuroscience literature, tremendous redundancy is believed to exist in the human brain [2]. A process of synapse elimination called synaptic pruning, which removes unnecessary neuronal structures, occurs from birth to adolescence. The pruning process is believed to be essential to the flexibility required for the adaptive capabilities of the developing mind [6]. The key element of the brain's pruning procedure is the synaptic strength. Synaptic pruning in the brain follows a \"use it or lose it\"\n\n\u2217This work was done while Zhao Zhong was an intern at SenseTime Research.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: The analogy between neural synapses and CNNs' connections. The normalized and rectified input feature channel (green) is regarded as the information from the axon (left).
The unified kernel (blue) is regarded as the processing operator on the dendrite (right). Synaptic Strength is defined as the combined scaling factor, which makes it a good indicator of connection importance.\n\nprinciple, which is achieved by increasing synaptic strength if the synapse is used, and decreasing it if not [31]. Inspired by this mechanism, we propose a new class of parameters in convolution layers called Synaptic Strength, as shown in Figure 1. The main idea of Synaptic Strength is to represent how much information the connection provides to the final result. This is achieved by imposing normalization on both the weight and the input data. The analogy between biological neural networks and CNNs is built by regarding a single channel of the feature map produced by an intermediate convolution layer as a single neuron. Suppose a feature channel is produced by a filter containing C kernels applied to the feature map in the previous layer (which also contains C channels). The kernels can be seen as C synapses connected with C previous neurons. Thus the removal of a synapse in a biological neural network is analogous to disconnecting an input channel from its corresponding kernel(s). \"Disconnecting\" can be easily achieved by zeroing out the kernel. Thus our synaptic pruning produces kernel-level sparse CNNs.\n\nWe evaluate the proposed method on CIFAR-10 and ImageNet. Experiments show our approach can achieve much higher compression rates with less impact on performance compared with existing pruning methods. On CIFAR-10, we can remove up to 96% of synaptic connections without decreasing accuracy. On ImageNet, our pruned ResNet-50 achieves competitive or even better efficiency compared with state-of-the-art compact models [29, 39] in terms of accuracy and parameters. For completeness, we discuss the practicability of leveraging kernel-level sparsity for acceleration. 
In particular, we point out that Winograd convolution [18] is ideal for accelerating kernel-level sparse models. We also compare our approach with recently proposed Winograd-native pruning methods [20, 21].\n\n2 Related works\n\nNetwork Pruning The idea of network pruning originated from LeCun et al. [34]. After deep learning started to thrive, Han et al. [9] proposed to prune weights with small absolute value in a model trained with L2 regularization, using an iterative pruning and finetuning process. Their models achieve a high sparsity rate at the weight level. They reduce the model size by a large margin, while run-time speed-up requires specially designed hardware [10]. Recent works [19, 26, 11] focus on the idea of pruning entire filters, and different filter importance indicators have been proposed. Anwar and Sung [1] first proposed to prune parameters at the kernel level, achieving storage saving and acceleration in MRI tasks. Mao et al. [25] apply the method proposed in [9] at different granularities to explore the tradeoff between regularity and accuracy. Wen et al. [32] propose to learn structured sparsity. They use group lasso regularization to prune entire rows or columns of the weight matrix.\n\n[Figure 1 labels: r, the i-th kernel of the j-th filter; \u03b3, the i-th input channel; s = r \u00b7 \u03b3, the scale factor; unified kernel; L2-norm; Synaptic Strength]\n\nLiu et al. [22] utilize L1 regularization on scale factors in the batch-normalization layer to learn an importance indicator for channel pruning. Huang and Wang [15] add sparsity regularization and a modified Accelerated Proximal Gradient on scaling factors in the training stage. They remove unimportant parts of the CNN based on scaling-factor values. Our synapse pruning also works in an end-to-end manner, which is closely related to these ideas. 
Besides the L1 and L2 norms, [23] propose a practical method for L0-norm regularization of CNNs. To achieve this, they include a collection of non-negative stochastic gates.\n\nNeural architecture learning Optimizing network architecture has been explored intensively in the literature. Stanley and Miikkulainen [30] proposed to optimize network topologies and parameter weights at the same time through an evolution strategy, starting with a minimal neural network. Recently, Zhong et al. [37] and Zoph and Le [39] adopted reinforcement learning for neural architecture search; each trains a massive number of different neural networks and treats test accuracy as the reward. Differing from these approaches, our synapse pruning starts with an existing human-designed CNN and learns a compact structure through the \"use it or lose it\" principle, so the proposed method can be regarded as an architecture learning method.\n\nIrregular connection patterns Recently proposed efficient CNN architectures aim to reduce both computation and size. These architectures usually adopt special connection patterns between input and output feature channels. Non-dense connection patterns include depth-wise convolution [13, 29], group convolution [33], and interleaved group connections [35, 36]. These predefined architectures achieve competitive accuracy with better efficiency. Huang et al. [14] try to learn the input layers for group convolution in the training stage, imposing a constraint on the group size. In contrast, synapse pruning has no restriction on the connection pattern at all. We argue that this gives more flexibility to the model.\n\nFigure 2: We introduce a new class of parameters called Synaptic Strength associated with connections in convolutional layers (middle). Sparsity regularization is applied on these parameters to automatically identify useless connections. 
Connections with small Synaptic Strength will be pruned to produce a compact model with kernel-level sparsity.\n\n3 Method\n\n3.1 Biological Analogy for CNNs\n\nIn biology, synaptic strength is a measure of the connectivity between an axon and a dendrite. Synaptic strength is a changing attribute which represents the activeness of the connection. If a connection is hardly used by the overall processing procedure, its synaptic strength is decreased, which finally causes disconnection. In order to adopt the same mechanism for modern CNNs, two questions should be answered: (1) how to realize the same mechanism in CNNs, and (2) how to define the usefulness of a certain connection.\n\nTo answer the first question, we provide an illustration of a convolution layer in Figure 2 (left). This convolution layer takes 3 input feature channels (green). It contains 2 filters, each of which produces a single output feature channel (blue) by adding up the 2D-convolution results produced by each kernel inside the filter and the kernel's corresponding input feature channel. Here, we analogize a single output feature channel to a biological neuron with 3 dendrites, each of which relates itself to an axon (input feature channel) through 2D-convolution. The kernel for the 2D-convolution decides how the dendrite will process the input. Figure 1 is the close-up version. The \"axon\" (left) delivers information, which is the input feature channel, to the receiving \"dendrite\" (right). 
The \"dendrite\" perform\nconvolution utilizing the kernel. For the second question, we introduce a new class of parameters to\nrealize the function of synaptic strength.\n\n3.2 De\ufb01ne Synaptic Strength for CNNs\n\nSuppose the input feature map contains C channels. The convolution layer has K \ufb01lters, each of\nwhich is consist of C kernels one-to-one assigned to its input feature channel. This convolution\noperation generates K output feature channels. In general, we have that\n\nxout\nk = f\n\nc \u2217 kk,c + bk\nxin\n\n(1)\n\n(cid:32) C(cid:88)\n\nc=1\n\n(cid:33)\n\nwhere f represents the activation function and xin\nrepresents the j-th channel among the input feature\nj\nmap. kk,c represents the c-th kernel inside the k-th \ufb01lter. bk is the bias. Batched Normalization(BN)\nhas been adopted by most modern CNNs as a standard approach for fast convergence and better\ngeneralization [16]. We assume the target model perform batch normalization after a convolution\nlayer, before the non-linearity. Particularly, in the training stage, BN layer normalize the activation\ndistribution using mini-batch statistics. Suppose xin and xout is the input and output of a BN layer,\nB denotes the mini-batch data samples. BN perform normalization by:\n; xout = \u03b3 \u00b7 BN (x) + \u03b2\n\nBN (x) =\n\n(2)\n\n(cid:112)\u03c32\n\nxin \u2212 \u00b5B\nB + \u0001\n\nwhere \u00b5B and \u03c3B are the mean and standard deviation values computed across each elements in\nxin over B. The normalized activation N (x) is linear transformed by learnable af\ufb01ne transformation\nparameterized by \u03b3 and \u03b2(scale and shift). We also assume that the non-linearity f is homogeneous,\nwhich indicates that f (a \u00b7 x) = a \u00b7 f (x) for any scalar a. The de\ufb01nition of Synaptic Strength\ncould be derived with two modi\ufb01cation to the original model. First, we discard channel scaling\nfactor \u03b3 from BN layers. Second, we reparameterize each individual kernel k. 
We decompose the kernel as k = r \u00b7 k', where r = ||k|| denotes the Frobenius norm of the kernel and k' = k/||k|| is the normalized (unit) kernel. Utilizing Equations 1 and 2, we show that the modified model has the same representation capacity. Without loss of generality, we consider a sub-module consisting of \"BN-f-Conv\". Due to the normalization in the BN layer, the bias term in the convolution layer is redundant, and thus it has been discarded from the model:\n\nx_k^out = \u2211_{c=1}^C f( \u03b3_c \u00b7 BN(x_c^in) + \u03b2_c ) \u2217 k_{k,c}    (3)\n\n= \u2211_{c=1}^C \u03b3_c \u00b7 f( BN(x_c^in) + \u03b2_c/\u03b3_c ) \u2217 (r_{k,c} \u00b7 k'_{k,c})    (4)\n\n= \u2211_{c=1}^C \u03b3_c \u00b7 r_{k,c} \u00b7 f( BN(x_c^in) + \u03b2_c/\u03b3_c ) \u2217 k'_{k,c}    (5)\n\nEquations 3 to 5 show that the function of our modified model remains identical to the original model. We define Synaptic Strength as the product of BN's scaling factor and the Frobenius norm of the kernel:\n\ns_{k,c} = \u03b3_c \u00b7 r_{k,c}    (6)\n\n3.3 Intuition and Analysis\n\nThe importance of a connection should be measured by the amount of information it provides. However, estimating information entropy for each dendrite is not efficient. Variance, which is also an uncertainty measure, can be regarded as an acceptable compromise. Synaptic Strength is explicitly designed to be a good indicator of the variance of the intermediate feature produced by a single kernel, which is immediately reduced by summation across the C input channels. As shown in Equations 3 to 5, the data distribution coming through BN without the scale factor is convolved with the kernel. 
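The reparameterization behind Equations 3 to 6 can be checked numerically. Below is a minimal NumPy sketch (toy 1D shapes and variable names are our own) verifying that, for a positive \u03b3 and a homogeneous non-linearity such as ReLU, the reparameterized form matches the original, and computing the resulting Synaptic Strength of Equation 6:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Toy single-channel check of the identity behind Eqs. 3-5:
#   f(gamma * BN(x) + beta) * k  ==  gamma * r * (f(BN(x) + beta/gamma) * k')
# for gamma > 0 and a homogeneous non-linearity f.
rng = np.random.default_rng(0)
bn_x = rng.standard_normal(8)   # BN output without the scale factor
gamma, beta = 1.7, -0.3         # BN affine parameters (scale and shift)
k = rng.standard_normal(3)      # a 1D stand-in for a 2D kernel
r = np.linalg.norm(k)           # r = ||k||, the kernel's norm
k_unit = k / r                  # k' = k / ||k||, the unit kernel

original = np.convolve(relu(gamma * bn_x + beta), k, mode="valid")
reparam = gamma * r * np.convolve(relu(bn_x + beta / gamma), k_unit, mode="valid")
assert np.allclose(original, reparam)

s = gamma * r                   # Synaptic Strength, Eq. 6
```

Because the two forms are identical, training the reparameterized model changes nothing functionally; it merely exposes s = \u03b3 \u00b7 r as an explicit, prunable parameter.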
At inference time, BN normalizes each data example using a smoothed (moving-average) version of the batch mean and variance, which should be close to the overall statistics of the data distribution. By integrating the scale factor into Synaptic Strength, we force the variance of the output of the BN layer to stay close to 1. Thus Synaptic Strength controls the variance coming from the data. On the other hand, convolution with the kernel also affects the variance of the feature. We restrict the kernel to lie on the unit sphere, so the Synaptic Strength parameters are forced to represent the product of data variance and kernel norm, which makes them a good indicator of information.\n\n3.4 Optimization\n\nIn addition to the classification loss, we impose sparsity-inducing regularization on the Synaptic Strength s. After training, we prune the synapses whose strength is smaller than a threshold \u03c4. Finally, we finetune the pruned network. Precisely, the training objective function is given by\n\nL = (1/N) \u2211_{i=1}^N l(y_i, f(x_i, W)) + \u03bb \u2211_{s \u2208 S} g(s)    (7)\n\nwhere x_i, y_i denote the i-th training example and label, W denotes all the trainable weights in the model, l is the classification loss, the second term is the sparsity-inducing regularization, and S is the set of Synaptic Strength parameters. \u03bb is a scalar factor which controls the scale of the sparsity constraint. We choose g(s) = |s| in our experiments and use sub-gradient descent due to the non-smooth point at 0.\n\n4 Experiments\n\nDataset In order to evaluate the effectiveness of synapse pruning, we experiment with CIFAR-10 and ImageNet. The CIFAR-10 dataset contains 50,000 training examples and 10,000 test examples. Each example contains a single object drawn from 10 classes at resolution 32 \u00d7 32. In each experiment, we perform standard data augmentation including random flips and random crops. 
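The sparsity term of the objective in Equation 7, with g(s) = |s| and its sub-gradient, can be sketched as follows (a minimal NumPy sketch; the function name and the list-of-arrays layout are our own, not from the paper):

```python
import numpy as np

def l1_penalty_and_subgrad(strengths, lam):
    """Return lambda * sum|s| (the second term of Eq. 7) and its
    sub-gradient, taking 0 at the non-smooth point s = 0.

    strengths: list of per-layer Synaptic Strength arrays of shape (K, C).
    """
    penalty = lam * sum(np.abs(s).sum() for s in strengths)
    subgrads = [lam * np.sign(s) for s in strengths]  # sign(0) == 0
    return penalty, subgrads
```

In training, this sub-gradient is simply added to the classification-loss gradient of each Synaptic Strength before the SGD update.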
The ImageNet dataset is a large-scale image recognition benchmark which contains 1.2 million images for training and 50,000 for validation; each image belongs to one of 1,000 classes. Both top-1 and top-5 single-center-crop accuracies are reported.\n\n4.1 Training Detail\n\nBaseline We train all models we are going to prune from scratch as baseline models. All networks are optimized with Stochastic Gradient Descent (SGD) with momentum 0.9 and weight decay 10^\u22124. For CIFAR-10 models, we train with batch size 128 for 240 epochs in total. The initial learning rate is set to 0.1 and divided by 10 at the beginning of epochs 120 and 180. For ImageNet models, we train with batch size 256 for 100 epochs in total. The initial learning rate is set to 0.1 and divided by 10 at the beginning of epochs 30, 60 and 90.\n\nGuidelines for Picking \u03bb For CIFAR-10 models, we pick a different sparsity regularization rate \u03bb, as defined in Section 3.4, for each architecture. We pick \u03bb equal to 10^\u22124 for VGGNet, and 10^\u22125 and 5 \u00d7 10^\u22126 for ResNet-18 and DenseNet-40 respectively. Other settings remain identical to the CIFAR-10 baseline models. For ImageNet models, more flexibility is required for fitting the training data, so the sparsity constraint rate is set to 10^\u22126. The rest of the settings are the same as the baseline routine. Our empirical experience suggests that a higher regularization rate leads to more pruned kernels at no cost in the pruning procedure, but if the regularization is too strong, the accuracy before pruning is compromised. A basic strategy is to start with a small \u03bb and enlarge it until the performance of the model starts to decrease.\n\nPruning and finetune Since Synaptic Strength represents the information extracted by its owner kernel from the corresponding input channel, we apply a simple pruning strategy: remove all synaptic connections with strength under a threshold \u03c4. 
The threshold for pruning is decided by the desired sparsity k%: the value of the least k% of synaptic strengths in the network becomes the threshold. Empirically, we prune the network using different values of k% and choose the best model considering its size and accuracy. For simplicity, we create a mask which indicates the remaining kernels. For deployment, Block Compressed Row Storage could be applied to save memory space. Finetuning should be applied if an accuracy drop is observed after pruning, which takes 50 epochs for CIFAR-10 models and 20 epochs for ImageNet models.\n\nTable 1: Errors and pruning ratio on CIFAR-10\n\nModel | Error(%) | Kernels | Pruned(%) | Flops | Pruned(%)\nVGG base | 6.66 | 2,224,320 | 0.00 | 398M | 0.00\nVGG pruned | 6.23 | 88,972 | 96.00 | 94M | 76.38\nResNet-18 base | 6.45 | 1,392,832 | 0.00 | 555M | 0.00\nResNet-18 pruned A | 5.06 | 113,436 | 90.00 | 137M | 75.32\nResNet-18 pruned B | 5.80 | 64,477 | 95.00 | 88M | 84.14\nDenseNet-40 base | 5.24 | 226,728 | 0.00 | 282M | 0.00\nDenseNet-40 pruned | 5.58 | 45,345 | 80.00 | 71M | 74.82\n\n4.2 Results on CIFAR\n\nCompared to baseline The motivation of our synapse pruning is to minimize the computation and storage cost of CNNs. Each of our pruned models achieves up to 95% sparsity taking the 2D kernel as the unit, while still maintaining accuracy similar to the baselines. The parameter saving in convolution layers and the FLOPs reduction are shown in Table 1. For VGGNet and ResNet-18 we pruned 96% and 90% of the total synapses with a slight increase in test accuracy, which we attribute to the regularization effect provided by the L1 loss. Even for DenseNet-40, a relatively compact model which conducts intensive feature reuse between layers, synapse pruning can still remove the majority of synapses (about 80%) with a 0.34% loss in accuracy. For computation, synapse pruning reduces up to 80% of FLOPs. Notably, the proportion of FLOPs reduction is less than the kernel reduction due to differing kernel and input sizes. 
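The global-threshold step described above (take \u03c4 as the value separating the least k% of strengths, then mask the kernels below it) can be sketched as follows (a NumPy sketch with our own function name; the actual pipeline then zeroes the masked kernels and finetunes):

```python
import numpy as np

def prune_masks(strengths, sparsity):
    """Global kernel pruning: the threshold tau is the `sparsity`-quantile
    of all Synaptic Strengths across the network, so the least `sparsity`
    fraction of kernels is removed.

    strengths: list of per-layer (K, C) Synaptic Strength arrays.
    Returns boolean masks marking the kernels that survive.
    """
    flat = np.concatenate([s.ravel() for s in strengths])
    tau = np.quantile(flat, sparsity)
    return [s > tau for s in strengths]
```

For example, pruning at 50% sparsity keeps only the kernels whose strength exceeds the network-wide median.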
It is possible to adjust \u03bb based on the amount of computation per layer to alleviate this problem. Table 1 shows the resulting models after pruning compared to the baselines on CIFAR-10. Figure 3 (left) visualizes the number of connections we prune relative to the baseline model.\n\nFigure 3: (Left) Remaining kernels of ResNet-18 pruned B. (Right) Comparison with other compression methods on CIFAR-10:\n\nModel | Error(%) | Params | Flops\nVGG [22] | 6.20 | 2.30M | 195M\nVGG ours | 6.23 | 0.80M | 94M\nResNet-56 [19] | 6.94 | 0.73M | 90M\nResNet-101 [19] | 6.45 | 1.68M | 213M\nResNet-164 [22] | 5.27 | 1.21M | 124M\nResNet-18 ours | 5.06 | 1.01M | 137M\nResNet-18 ours | 5.56 | 0.49M | 88M\nCondenseNet light-94 | 5.00 | 0.33M | 122M\nDenseNet-40 [22] | 5.19 | 0.66M | 190M\nDenseNet-40 ours | 5.58 | 0.21M | 72M\n\nCompared with other methods As shown in Figure 3 (right), compared to state-of-the-art filter-level weight pruning methods, synapse pruning achieves similar accuracy with roughly 1/3 of the parameters and at most 1/2 of the FLOPs for all three models. 
Compared to the recently proposed compact model CondenseNet, our model saves 30% of parameters and 50% of computation with an accuracy drop of 0.58%. CondenseNet [14] also takes advantage of irregular connection patterns, using a special indexing layer followed by group convolutions, which introduces extra computation burdens compared with filter-level pruning.\n\n[Figure 3 (left): per-layer bar chart of total vs. remaining kernels for ResNet-18 pruned B]\n\nTable 2: Compare with existing pruning based methods for ResNet-50 on ImageNet\n\nModel | Top1 error(%) | Top5 error(%) | Parameters\nResNet-50 | 24.70 | 7.80 | 25.6M\nResNet-50 [15] | \u223c26.80 | - | \u223c16.5M\nResNet-50 [17] | 31.58 | 11.70 | 8.66M\nResNet-50 [25] | - | 7.93 | \u223c10.22M\nResNet-50 ours | 25.32 | 7.20 | 5.9M\n\nTable 3: Compare with state-of-the-art compact models on ImageNet\n\nModel | Top1 error(%) | Top5 error(%) | Parameters\nShuffleNet 2\u00d7 | 29.10 | 10.2 | 5.3M\nCondenseNet | 26.20 | 8.3 | 4.8M\nNasNet-A Mobile | 26.00 | 8.4 | 5.3M\nMobileNet-v2 1.4\u00d7 | 25.30 | - | 6.9M\nResNet-50 pruned ours | 25.32 | 7.2 | 5.9M\n\n4.3 Results on ImageNet\n\nTo further evaluate the performance of synapse pruning on a larger dataset, we apply our method to the ImageNet dataset with ResNet-50. Our pruned model removes about 87% of connections compared to the base model with only a 0.6% accuracy drop. As shown in Table 2, we obtain better accuracy with fewer parameters, maintaining the lowest error rate compared to existing methods. Furthermore, we compare the pruned model with state-of-the-art compact models in terms of parameter count, as shown in Table 3.\n\n4.4 Analysis\n\nIn this section, we first compare the robustness of the proposed method and SSL [32]. Furthermore, ablation studies are performed by (1) excluding \u03b3 from Synaptic Strength and (2) omitting the kernel reparameterization. 
Finally, we analyze the effect of the hyper-parameter \u03bb, the sparsity regularization rate (see Equation 7). All experiments are performed with ResNet-18 on CIFAR-10.\n\nSensitivity Wen et al. [32] adopted group-LASSO regularization to push the weights in predefined groups towards zero, which is applicable to kernel pruning. Kernel pruning is based on a global threshold across all layers: kernels whose mean absolute value is less than the threshold are pruned. The motivation of this experiment is to compare the robustness of the two approaches at different pruning rates. In order to alleviate the performance gap generated by randomness in training, we plot the accuracy drop (caused by pruning) against sparsity instead of raw accuracy. Finetuning is performed after each pruning procedure. The results are summarized in Figure 4(a). The plotted connection sparsities are 70%, 80%, 90%, 95% and 97.5%. Our approach starts to outperform SSL from 90% sparsity, and from 90% to 97.5% the gap between the two methods becomes larger. If we constrain the accuracy drop to be no more than 1%, the proposed method can produce a model with roughly 2\u00d7 fewer kernels.\n\nAblation There are two modifications to the original model in order to perform synaptic pruning: (1) discarding the scale factor \u03b3 from the preceding batch-norm layer and (2) normalizing the kernel and explicitly parameterizing the kernel's L2-norm. We train two variants of synaptic pruning, each omitting one of the aforementioned modifications, to show their necessity. We refer to the model trained without discarding the scale factor as \"non fix \u03b3\" and the other as \"non kernel-norm\". The plotted connection sparsities are 60%, 70%, 80% and 90%. From the accuracy-sparsity curves shown in Figure 4(b), the full version of the proposed method gets the highest accuracy at pruning rates greater than 80%. 
Omitting either of these modifications degrades the performance. In order to highlight the disparity, finetuning is not performed.\n\nFigure 4: (A) The accuracy drop-sparsity curve. Compared to SSL, synaptic pruning better preserves accuracy at high pruning rates, which shows that our Synaptic Strength is a more accurate indicator of connection importance. (B) Ablation study modifying the definition of Synaptic Strength. Excluding either component leads to performance degradation, which shows the optimality of our approach.\n\nTable 4: Compare with winograd-domain sparse models on CIFAR-10\n\nModel | Error(%) | Density\nVGG-nagadomi base | 6.30 | 100%\nVGG-nagadomi | 7.40 (+1.1) | 25%\nVGG base | 6.20 | 100%\nVGG ours | 6.23 (+0.03) | 4%\n\n5 Practicability\n\nConvolution in CNNs is commonly computed by GEneral Matrix Multiplication (GEMM). The GEMM computation routine requires lowering the weight tensor and input tensor to 2D matrices [3]. Using this method, the weight tensor of a kernel-sparse layer is transformed into a \"strip\"-sparse matrix, which can be accelerated using block-sparse matrix multiplication algorithms on GPU [8].\n\nOn another track, Winograd convolution, recently adopted for convolutional computation [18], optimizes 2D convolution using the Winograd decomposition and takes a summation over multiple 2D convolutions. Several works have been proposed to prune individual weights directly in the Winograd domain, since fine-grained sparsity is not preserved after the Winograd transform is applied. However, kernel-level sparsity remains unchanged after transformation into the Winograd domain. Thus we compare our method with the Winograd direct pruning method proposed by [21] in Table 4. 
Our method achieves a significantly higher rate of sparsity (about 8\u00d7) with almost no drop in accuracy.\n\n6 Conclusion\n\nInspired by the synaptic pruning mechanism inside the human brain, we introduced a new class of parameters called Synaptic Strength. We show that we can achieve high pruning rates at almost no cost in performance using Synaptic Strength. Further analysis shows that Synaptic Strength is a better indicator of importance compared with the existing method. We will continue to investigate our method by exploring efficient inference methods for kernel-sparse CNNs.\n\nReferences\n\n[1] Sajid Anwar and Wonyong Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016.\n\n[2] Gal Chechik, Isaac Meilijson, and Eytan Ruppin. Synaptic pruning in development: A computational account. 10:1759\u201377, 11 1998.\n\n[3] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.\n\n[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123\u20133131, 2015.\n\n[5] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.\n\n[6] Fergus Craik and Ellen Bialystok. Cognition through the lifespan: Mechanisms of change. 10:131\u20138, 04 2006.\n\n[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. Advances in Neural Information Processing Systems, 2014.\n\n[8] Scott Gray, Alec Radford, and Diederik P. 
Kingma. Gpu kernels for block-sparse weights.\n2017. URL https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/\nblocksparsepaper.pdf.\n\n[9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections\nfor ef\ufb01cient neural network. In Advances in Neural Information Processing Systems, pages\n1135\u20131143, 2015.\n\n[10] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J\nDally. Eie: ef\ufb01cient inference engine on compressed deep neural network. In Proceedings of the\n43rd International Symposium on Computer Architecture, pages 243\u2013254. IEEE Press, 2016.\n\n[11] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural\n\nnetworks. arXiv preprint arXiv:1707.06168, 2017.\n\n[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.\n\narXiv preprint arXiv:1503.02531, 2015.\n\n[13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias\nWeyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Ef\ufb01cient convolutional neural\nnetworks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.\n\n[14] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger. Condensenet: An\nef\ufb01cient densenet using learned group convolutions. arXiv preprint arXiv:1711.09224, 2017.\n\n[15] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks.\n\narXiv preprint arXiv:1707.01213, 2017.\n\n[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training\n\nby reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[17] Jianxin Wu Jian-Hao Luo and Weiyao Lin. Thinet: A \ufb01lter level pruning method for deep neural\n\nnetwork compression. arXiv preprint arXiv:1707.06342, 2017.\n\n[18] Andrew Lavin and Scott Gray. 
Fast algorithms for convolutional neural networks, 2016.\n\n[19] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning \ufb01lters for\n\nef\ufb01cient convnets. arXiv preprint arXiv:1608.08710, 2016.\n\n[20] Sheng Li, Jongsoo Park, and Ping Tak Peter Tang. Enabling sparse winograd convolution by\n\nnative pruning. arXiv preprint arXiv:1702.08597, 2017.\n\n[21] Xingyu Liu, Jeff Pool, Song Han, and William J Dally. Ef\ufb01cient sparse-winograd convolutional\n\nneural networks. arXiv preprint arXiv:1802.06367, 2018.\n\n9\n\n\f[22] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang.\nLearning ef\ufb01cient convolutional networks through network slimming. In 2017 IEEE Interna-\ntional Conference on Computer Vision, pages 2755\u20132763. IEEE, 2017.\n\n[23] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks\nthrough l0 regularization. In International Conference on Learning Representations, 2018. URL\nhttps://openreview.net/forum?id=H1Y8hhg0b.\n\n[24] A. Vedaldi M. Jaderberg and A. Zisserman. Speeding up convolutional neural networks with\n\nlow rank expansions. BMVC, 2014.\n\n[25] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally.\n\nExploring the granularity of sparsity in convolutional neural networks, 2017.\n\n[26] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional\n\nneural networks for resource ef\ufb01cient inference. arXiv preprint arXiv:1611.06440, 2016.\n\n[27] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet\nclassi\ufb01cation using binary convolutional neural networks. In European Conference on Computer\nVision, pages 525\u2013542. Springer, 2016.\n\n[28] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and\n\nYoshua Bengio. Fitnets: Hints for thin deep nets. 
arXiv preprint arXiv:1412.6550, 2014.\n\n[29] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.\nInverted residuals and linear bottlenecks: Mobile networks for classi\ufb01cation, detection and\nsegmentation. arXiv preprint arXiv:1801.04381, 2018.\n\n[30] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting\n\ntopologies. Evolutionary computation, 10(2):99\u2013127, 2002.\n\n[31] Pierre Vanderhaeghen and Hwai-Jong Cheng. Guidance molecules in axon pruning and cell\n\ndeath. Cold Spring Harbor perspectives in biology, 2(6):a001859, 2010.\n\n[32] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li.\n\nLearning struc-\ntured sparsity in deep neural networks.\nIn D. D. Lee, M. Sugiyama, U. V. Luxburg,\nI. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29,\npages 2074\u20132082. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/\n6504-learning-structured-sparsity-in-deep-neural-networks.pdf.\n\n[33] Saining Xie, Ross Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. Aggregated residual\ntransformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR),\n2017 IEEE Conference on, pages 5987\u20135995. IEEE, 2017.\n\n[34] John S Denker Yann LeCun and Sara A Solla. Optimal brain damage. page 598\u2013605, 1990.\n\n[35] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions for deep\nneural networks. CoRR, abs/1707.02725, 2017. URL http://arxiv.org/abs/1707.02725.\n\n[36] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shuf\ufb02enet: An extremely ef\ufb01cient\n\nconvolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.\n\n[37] Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with q-learning.\n\nCoRR, abs/1708.05552, 2017. URL http://arxiv.org/abs/1708.05552.\n\n[38] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. 
Incremental network quanti-\nzation: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044,\n2017.\n\n[39] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv\n\npreprint arXiv:1611.01578, 2016.\n\n10\n\n\f", "award": [], "sourceid": 6523, "authors": [{"given_name": "CHEN", "family_name": "LIN", "institution": "SenseTime"}, {"given_name": "Zhao", "family_name": "Zhong", "institution": "CASIA (Institute of Automation Chinese Academy of Sciences)"}, {"given_name": "Wu", "family_name": "Wei", "institution": "Sensetime"}, {"given_name": "Junjie", "family_name": "Yan", "institution": "Sensetime Group Limited"}]}
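To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of Synaptic-Strength-style connection pruning: each input-to-output kernel of a convolutional layer is associated with a scalar strength, and kernels whose strength magnitude falls outside the top fraction are zeroed out. The `strength` values here are random stand-ins for learned parameters, and the helper name `prune_by_strength` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# One 3x3 kernel per (output channel, input channel) connection.
out_c, in_c, k = 8, 8, 3
kernels = rng.standard_normal((out_c, in_c, k, k))
# Hypothetical learned Synaptic Strength, one scalar per connection.
strength = rng.standard_normal((out_c, in_c))

def prune_by_strength(kernels, strength, keep_ratio=0.25):
    """Zero out kernels whose |strength| is not among the top keep_ratio.

    Returns the pruned kernels and the binary connection mask.
    (Ties at the threshold may keep slightly more connections.)
    """
    flat = np.abs(strength).ravel()
    n_keep = max(1, int(round(keep_ratio * flat.size)))
    thresh = np.sort(flat)[-n_keep]  # magnitude of the n_keep-th largest strength
    mask = (np.abs(strength) >= thresh).astype(kernels.dtype)
    # Broadcast the per-connection mask over each kernel's spatial dims.
    return kernels * mask[:, :, None, None], mask

pruned, mask = prune_by_strength(kernels, strength, keep_ratio=0.25)
sparsity = 1.0 - mask.mean()
print(f"connection sparsity: {sparsity:.2f}")
```

In a real training setup the strengths would be learned jointly with the weights under a sparsity-inducing penalty, and pruned connections would be skipped at inference time rather than merely zeroed.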