{"title": "Runtime Neural Pruning", "book": "Advances in Neural Information Processing Systems", "page_first": 2181, "page_last": 2191, "abstract": "In this paper, we propose a Runtime Neural Pruning (RNP) framework which prunes the deep neural network dynamically at the runtime. Unlike existing neural pruning methods which produce a fixed pruned model for deployment, our method preserves the full ability of the original network and conducts pruning according to the input image and current feature maps adaptively. The pruning is performed in a bottom-up, layer-by-layer manner, which we model as a Markov decision process and use reinforcement learning for training. The agent judges the importance of each convolutional kernel and conducts channel-wise pruning conditioned on different samples, where the network is pruned more when the image is easier for the task. Since the ability of network is fully preserved, the balance point is easily adjustable according to the available resources. Our method can be applied to off-the-shelf network structures and reach a better tradeoff between speed and accuracy, especially with a large pruning rate.", "full_text": "Runtime Neural Pruning\n\nJi Lin\u2217\n\nDepartment of Automation\n\nTsinghua University\n\nYongming Rao\u2217\n\nDepartment of Automation\n\nTsinghua University\n\nlin-j14@mails.tsinghua.edu.cn\n\nraoyongming95@gmail.com\n\nJiwen Lu\n\nDepartment of Automation\n\nTsinghua University\n\nlujiwen@tsinghua.edu.cn\n\nJie Zhou\n\nDepartment of Automation\n\nTsinghua University\n\njzhou@tsinghua.edu.cn\n\nAbstract\n\nIn this paper, we propose a Runtime Neural Pruning (RNP) framework which\nprunes the deep neural network dynamically at the runtime. Unlike existing neural\npruning methods which produce a \ufb01xed pruned model for deployment, our method\npreserves the full ability of the original network and conducts pruning according to\nthe input image and current feature maps adaptively. 
The pruning is performed in a bottom-up, layer-by-layer manner, which we model as a Markov decision process and train with reinforcement learning. The agent judges the importance of each convolutional kernel and conducts channel-wise pruning conditioned on different samples, pruning the network more when the image is easier for the task. Since the ability of the network is fully preserved, the balance point is easily adjustable according to the available resources. Our method can be applied to off-the-shelf network structures and reaches a better tradeoff between speed and accuracy, especially with a large pruning rate.\n\n1 Introduction\n\nDeep neural networks have been proven effective in various areas. Despite this great success, the capability of deep neural networks comes at the cost of huge computational burdens and large power consumption, which is a major challenge for real-time deployment, especially on embedded systems. To address this, several neural pruning methods have been proposed [11, 12, 13, 25, 38] to reduce the parameters of convolutional networks while achieving competitive or even slightly better performance. However, these works mainly focus on reducing the number of network weights, which has limited effect on speeding up the computation. More specifically, fully connected layers have been shown to be more redundant and contribute more to the overall pruning rate, while convolutional layers are the most computationally dense part of the network. Moreover, such pruning strategies usually lead to an irregular network structure, i.e., with partial sparsity in convolution kernels, which requires special algorithms for acceleration and makes it hard to harvest actual computational savings. A surprisingly effective approach to trade accuracy for size and speed is to simply reduce the number of channels in each convolutional layer. For example, Changpinyo et al. 
[27] proposed a method to speed up the network by deactivating connections between filters in convolutional layers, achieving a better tradeoff between accuracy and speed.\n\n*Indicates equal contribution.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nAll the methods above prune the network in a fixed way, obtaining a static model for all input images. However, some input samples are clearly easier to recognize and can be handled by simple and fast models, while other samples are more difficult and require more computational resources. This property is not exploited in previous neural pruning methods, where input samples are treated equally. Since some of the weights are lost during the pruning process, the pruned network permanently loses the ability to handle some hard cases. We argue that preserving the whole ability of the network and pruning it dynamically according to the input image achieves a better speed-accuracy tradeoff than static pruning methods, without harming the upper-bound ability of the network.\n\nIn this paper, we propose a Runtime Neural Pruning (RNP) framework that prunes the neural network dynamically at runtime. Different from existing methods that produce a fixed pruned model for deployment, our method preserves the full ability of the original network and prunes it according to the input image and current feature maps. More specifically, we model the pruning of each convolutional layer as a Markov decision process (MDP) and train an agent with reinforcement learning to learn the best pruning policy. Since the whole ability of the original network is preserved, the balance point can be easily adjusted according to the available resources, so one single trained model can be adapted to various devices, from embedded systems to large data centers. 
Experimental results on the CIFAR [22] and ImageNet [36] datasets show that our framework successfully learns to allocate different amounts of computational resources to different input images, and achieves much better performance at the same cost.\n\n2 Related Work\n\nNetwork pruning: There have been several works on network pruning, an effective way to reduce network complexity. For example, Hanson and Pratt [13] introduced hyperbolic and exponential biases to the pruning objective. Optimal Brain Damage [25] and Optimal Brain Surgeon [14] pruned networks with second-order derivatives of the objective. Han et al. [11, 12] iteratively pruned near-zero weights to obtain a pruned network with no loss of accuracy. Other works exploited more complicated regularizers: [27, 44] introduced structured sparsity regularizers on the network weights, and [32] applied them to the hidden units. [17] pruned neurons based on the network output. Anwar et al. [2] considered channel-wise and kernel-wise sparsity, and proposed particle filters to decide the importance of connections and paths. Another line of work focuses on deactivating subsets of connections inside a fixed network architecture. LeCun et al. [24] removed connections between the first two convolutional feature maps in a uniform manner. The depth multiplier method [16] reduces the number of filters in each convolutional layer by a uniform factor. These methods produce a static model for all samples, failing to exploit the differing properties of input images. Moreover, most of them produce irregular network structures after pruning, which makes it hard to harvest actual computational savings directly.\n\nDeep reinforcement learning: Reinforcement learning [29] aims to enable an agent to learn its behavior from experience. Unlike conventional machine learning methods, reinforcement learning is supervised through the reward signals of actions. 
Deep reinforcement learning [31] is the combination of deep learning and reinforcement learning, and has been widely used in recent years. For example, Mnih et al. [31] combined reinforcement learning with CNNs and achieved human-level performance on Atari games. Caicedo et al. [8] introduced reinforcement learning for active object localization. Zhang et al. [45] employed reinforcement learning for vision-based control in robotics. Reinforcement learning has also been adopted for feature selection to build fast classifiers [4, 15, 21].\n\nDynamic network: Dynamic network structures and executions have been studied in previous works [7, 28, 33, 39, 40]. Some input-dependent execution methods rely on a pre-defined strategy. Cascade methods [26, 28, 39, 40] relied on manually selected thresholds to control execution. Dynamic Capacity Network [1] used a specially designed method to calculate a saliency map to control execution. Other conditional computation methods activate part of a network under a learned policy. Bengio et al. [6] introduced Stochastic Times Smooth neurons as gaters for conditional computation within a deep neural network, producing a sparse binary gater computed as a function of the input. [5] selectively activated the output of a fully connected neural network according to a control policy parametrized as the sigmoid of an affine transformation of the last activation. Liu et al. [30] proposed Dynamic Deep Neural Networks (D2NN), a feed-forward deep neural network that allows selective execution with self-defined topology, where the control policy is learned using single-step reinforcement learning.\n\nFigure 1: Overall framework of our RNP. RNP consists of two sub-networks: the backbone CNN network and the decision network. 
The convolution kernels of the backbone CNN are dynamically pruned according to the output Q-values of the decision network, conditioned on the state formed from the last calculated feature maps.\n\n3 Runtime Neural Pruning\n\nThe overall framework of our RNP is shown in Figure 1. RNP consists of two sub-networks: the backbone CNN network and the decision network, which decides how to prune the convolution kernels conditioned on the input image and current feature maps. The backbone CNN can be any kind of CNN structure. Since convolutional layers are the most computationally dense layers in a CNN, we focus on pruning convolutional layers in this work, leaving the fully connected layers as a classifier.\n\n3.1 Bottom-up Runtime Pruning\n\nWe denote the backbone CNN with m convolutional layers as C, with convolutional layers C1, C2, ..., Cm, whose kernels are K1, K2, ..., Km, respectively, with numbers of channels ni, i = 1, 2, ..., m. These convolutional layers produce feature maps F1, F2, ..., Fm as shown in Figure 1, of size ni × H × W, i = 1, 2, ..., m. The goal is to find and prune the redundant convolutional kernels in Ki+1, given feature maps Fi, i = 1, 2, ..., m − 1, to reduce computation while preserving maximum performance.\n\nTaking the i-th layer as an example, we state our goal as the following objective:\n\nmin_{Ki+1, h}  E_{Fi}[ Lcls(conv(Fi, K[h(Fi)])) + Lpnt(h(Fi)) ],   (1)\n\nwhere Lcls is the loss of the classification task, Lpnt is a penalty term representing the tradeoff between speed and accuracy, h(Fi) is the conditional pruning unit that produces a list of indexes of selected kernels according to the input feature map, K[·] is the indexing operation for kernel pruning, and conv(x1, x2) is the convolutional operation for input feature map x1 and kernel x2. 
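To make the indexing operation K[h(Fi)] concrete, the following is a minimal NumPy sketch of channel-wise runtime pruning for a 1×1 convolution. The function name, the grouping scheme, and the zero-padding of pruned channels are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pruned_conv1x1(x, weight, action, k=4):
    """Channel-wise runtime pruning sketch, standing in for conv(F_i, K[h(F_i)]).
    x: (C_in, H, W); weight: (C_out, C_in). Action `action` (0-indexed) keeps
    only the first (action + 1) of k output-channel groups."""
    n_out = weight.shape[0]
    keep = (n_out // k) * (action + 1)          # channels actually computed
    y = np.einsum('oc,chw->ohw', weight[:keep], x)
    out = np.zeros((n_out,) + x.shape[1:])      # pruned channels stay zero, so the
    out[:keep] = y                              # next layer's input shape is unchanged
    return out

x = np.random.rand(8, 4, 4)
w = np.random.rand(16, 8)
y = pruned_conv1x1(x, w, action=1)              # compute 2 of the 4 channel groups
```

Only the kept slice of the kernel tensor participates in the arithmetic, which is why pruning whole channel groups (rather than scattered weights) yields a standard dense convolution that hardware can accelerate directly.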
Note that after pruning, our framework infers through standard convolutional layers, which can easily be accelerated with GPU neural network libraries such as cuDNN [9].\n\nTo solve the optimization problem in (1), we divide it into two sub-problems, over {K} and over h, and adopt an alternating training strategy that solves each sub-problem independently with a neural network optimizer such as RMSprop [42].\n\nFor an input sample, there are m pruning decisions to be made in total. A straightforward idea is to use the optimized decisions under a certain penalty to supervise the decision network. However, for a backbone CNN with m layers, the time complexity of collecting this supervision signal is O(∏_{i=1}^{m} ni), which is intractable for prevalent very deep architectures such as VGG [37] and ResNet [3]. To simplify the training problem, we employ two strategies: 1) model the network pruning as a Markov decision process (MDP) [34] and train the decision network by reinforcement learning; 2) redefine the pruning action to reduce the number of decisions.\n\n3.2 Layer-by-layer Markov Decision Process\n\nThe decision network has an encoder-RNN-decoder structure, where the encoder E embeds feature map Fi into a fixed-length code, the RNN R aggregates codes from previous stages, and the decoder D outputs the Q-value of each action. We formulate the key elements of the Markov decision process (MDP) on top of the decision network, so as to adopt deep Q-learning in our RNP framework, as follows.\n\nState: Given feature map Fi, we first extract a dense feature embedding pFi of length ni by global pooling, as commonly done in [10, 35]. Since the number of channels differs between convolutional layers, the length of pFi varies. 
To address this, we use the encoder E (a fully connected layer) to project the pooled feature into a fixed-length embedding E(pFi). The embeddings E(pFi) from different layers are aggregated in a bottom-up way by an RNN structure, which produces a latent code R(E(pFi)) regarded as the embedded state information for reinforcement learning. The decoder (also a fully connected layer) produces the Q-value of each decision.\n\nAction: The pruning actions are defined in an incremental way. For convolution kernel Ki with ni output channels, we determine which output channels are calculated and which are pruned. To simplify the process, we group the output feature maps into k sets, denoted F'1, F'2, ..., F'k. One extreme case is k = ni, where each single output channel forms a set. The actions a1, a2, ..., ak are defined as follows: taking action ai means calculating the feature map groups F'1, F'2, ..., F'i, for i = 1, 2, ..., k. Hence the lower-indexed feature map groups are calculated more often, and the higher-indexed groups are calculated only when the sample is difficult enough. In particular, the first feature map group is always calculated, and we call it the base feature map group. Since we have no state information for the first convolutional layer, it is not pruned, leaving m − 1 actions to take in total. Though the definitions of the actions are rather simple, they can easily be extended to more complicated network structures. 
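The encoder-RNN-decoder pipeline described above can be sketched at the shape level as follows. Weights are random and dimensions are illustrative assumptions; this shows only how variable-width pooled features become a fixed-length state and k Q-values, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

class DecisionNet:
    """Shape-level sketch of the decision network: global pooling ->
    per-layer encoder E (FC) -> RNN aggregation R -> shared decoder D
    emitting one Q-value per pruning action."""
    def __init__(self, emb=32, k=4):
        self.emb, self.k = emb, k
        self.h = np.zeros(emb)                              # RNN hidden state
        self.W = rng.normal(scale=0.1, size=(emb, 2 * emb)) # RNN cell weights
        self.D = rng.normal(scale=0.1, size=(k, emb))       # shared decoder
        self.E = {}                                         # one encoder per layer width n_i

    def step(self, feat):                                   # feat: (n_i, H, W)
        p = feat.mean(axis=(1, 2))                          # global average pooling -> (n_i,)
        n = len(p)
        if n not in self.E:                                 # project to fixed length
            self.E[n] = rng.normal(scale=0.1, size=(self.emb, n))
        e = self.E[n] @ p
        self.h = np.tanh(self.W @ np.concatenate([e, self.h]))  # bottom-up aggregation
        return self.D @ self.h                              # Q-values for the k actions

net = DecisionNet()
q1 = net.step(rng.normal(size=(16, 8, 8)))   # state after an early layer's feature maps
q2 = net.step(rng.normal(size=(32, 4, 4)))   # a wider layer reuses the same RNN state
```

The per-width encoder dictionary mirrors the paper's point that pFi has a different length per layer, while the recurrent state lets later pruning decisions depend on everything computed so far.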
For architectures such as Inception [41] and ResNet [3], we define the action on the unit of a single block by sharing the pruning rate inside the block, which is more scalable and avoids dealing with their sophisticated internal structures.\n\nReward: The reward of taking action ai at the t-th step is defined as:\n\nrt(ai) = −αLcls + (i − 1) × p,  if inference terminates (t = m − 1);\nrt(ai) = (i − 1) × p,  otherwise (t < m − 1),   (2)\n\nwhere p is a negative penalty that can be set manually. The reward is set according to the loss of the original task: we take the negative loss −αLcls as the final reward, so that the better a task is completed, the higher (i.e., the closer to 0) the final reward of the chain. α is a hyper-parameter that rescales Lcls into a proper range, since Lcls varies considerably across network structures and tasks. Actions that calculate more feature maps, i.e., with higher i, incur a higher penalty due to the extra computation. For t = 1, ..., m − 2, the reward consists only of the computation penalty, while at the last step the chain receives the final reward −αLcls, assessing how well the pruned network completes the task.\n\nThe key step of the Markov decision model is to choose the best action at a given state, in other words, to find the optimal decision policy. Following Q-learning [31, 43], we define Q(ai, st) as the expected value of taking action ai at state st, so the policy is π = argmax_{ai} Q(ai, st). The optimal action-value function can then be written as:\n\nQ(st, ai) = max_π E[ rt + γ rt+1 + γ² rt+2 + ... | π ],   (3)\n\nwhere γ is the discount factor of Q-learning, providing a tradeoff between the immediate reward and the prediction of future rewards. 
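The reward of Eq. (2) translates directly into code. A small sketch (argument names are ours; the classification loss is supplied from the backbone's forward pass):

```python
def reward(t, i, m, p=-0.1, alpha=1.0, cls_loss=None):
    """Reward of Eq. (2) for taking action a_i (1-indexed) at step t of the
    m - 1 pruning steps. Every step pays the computation penalty (i - 1) * p
    (p is negative); the terminal step additionally receives -alpha * L_cls."""
    r = (i - 1) * p
    if t == m - 1:                # inference terminates at the last step
        r += -alpha * cls_loss
    return r
```

With p = -0.1, choosing the third channel group at an intermediate step costs -0.2, while the cheapest action (i = 1) is free of penalty; only the terminal step mixes in the task reward.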
We use the decision network to approximate the expected Q-value Q*(st, ai), with all decoders sharing parameters and outputting a k-length vector whose entries are the Q* values of the corresponding actions. If the estimation is optimal, we have Q*(st, ai) = Q(st, ai) exactly.\n\nFollowing the Bellman equation [3], we adopt the mean squared error (MSE) as the training criterion to keep the decision network self-consistent, and rewrite the objective of the sub-problem over h in optimization problem (1) as:\n\nmin_θ Lre = E[ r(st, ai) + γ max_{ai} Q(st+1, ai) − Q(st, ai) ]²,   (4)\n\nwhere θ denotes the weights of the decision network. In our framework, a series of states is created for a given input image. Training uses an ε-greedy strategy that selects a random action with probability ε and follows π with probability 1 − ε, while inference is conducted greedily. The backbone CNN and the decision network are trained alternately. Algorithm 1 details the training procedure.\n\nAlgorithm 1 Runtime neural pruning for solving optimization problem (1)\nInput: training set with labels {X}\nOutput: backbone CNN C, decision network D\n1: initialize: train C in the normal way, or initialize C with a pre-trained model\n2: for i ← 1, 2, ..., M do\n3:   // train decision network\n4:   for j ← 1, 2, ..., N1 do\n5:     Sample a random minibatch from {X}\n6:     Forward and sample ε-greedy actions {st, at}\n7:     Compute the corresponding rewards {rt}\n8:     Backward the Q-values of each stage and generate ∇θLre\n9:     Update θ using ∇θLre\n10:   end for\n11:   // fine-tune backbone CNN\n12:   for k ← 1, 2, ..., N2 do\n13:     Sample a random minibatch from {X}\n14:     Forward and calculate Lcls after runtime pruning by D\n15:     Backward and generate ∇C Lcls\n16:     Update C using ∇C Lcls\n17:   end for\n18: end for\n19: return C and D\n\nIt is worth noting that during the training of the agent, we manually set a fixed penalty for the different actions and reach a balanced status. During deployment, we can adjust this balance by compensating the output Q* of each action with relative penalties, switching between different operating points of accuracy and computation cost, since the penalty is input-independent. Thus one single model can be deployed to different systems according to the available resources.\n\n4 Experiments\n\nWe conducted experiments on three datasets, CIFAR-10, CIFAR-100 [22] and ILSVRC2012 [36], to show the effectiveness of our method. For CIFAR-10, we used a network with four 3 × 3 convolutional layers. For CIFAR-100 and ILSVRC2012, we used the VGG-16 network. On CIFAR, we compared RNP with naive channel reduction baselines; on ILSVRC2012, we compared RNP with recent state-of-the-art network pruning methods.\n\n4.1 Implementation Details\n\nWe trained RNP in an alternating manner, where the backbone CNN and the decision network were trained iteratively. To help training converge faster, we first initialized the CNN with random pruning, where decisions were made randomly. Then we fixed the CNN parameters and trained the decision network, regarding the backbone CNN as an environment in which the agent takes actions and receives rewards. We then fixed the decision network and fine-tuned the backbone CNN following the policy of the decision network, which helps the CNN specialize in the specific task. The initialization was trained with SGD, with an initial learning rate of 0.01 decayed by a factor of 10 after 120 and 160 epochs, for 200 epochs in total. 
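The ε-greedy sampling and the Bellman target used inside the decision-network loop of Algorithm 1 can be sketched as follows. This is a minimal sketch with our own function names; the actual training backpropagates the MSE of Eq. (4) through the decision network rather than computing targets in isolation:

```python
import numpy as np

def q_target(r, q_next, gamma=0.99, terminal=False):
    """Bellman target r + gamma * max_a' Q(s', a') that the MSE of Eq. (4)
    regresses Q(s_t, a_i) toward; at the terminal step the target is just r."""
    return r if terminal else r + gamma * float(np.max(q_next))

def eps_greedy(q, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy action."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))
```

Annealing `eps` from 1.0 toward 0.1, as in the implementation details, moves the agent from pure exploration of pruning combinations toward mostly exploiting its learned policy.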
The remaining training was conducted with RMSprop [42] at a learning rate of 1e-6. For the ε-greedy strategy, the hyper-parameter ε was annealed linearly from 1.0 to 0.1 at the beginning of training and fixed at 0.1 thereafter.\n\nFor most experiments we set the number of convolutional groups to k = 4, a tradeoff between performance and complexity: increasing k enables more pruning combinations, but also makes reinforcement learning harder due to the enlarged action space. Since each action is taken conditioned on the current feature maps, the first convolutional layer is not pruned, leaving m − 1 decisions in total, which form a decision sequence. During training we set the penalty for extra feature map calculation to p = −0.1, which is adjusted at deployment. The scale factor α was set such that the average αLcls is approximately 1, to make relative differences more significant. For experiments on the VGG-16 model, we define the actions on the unit of a single block by sharing the pruning rate inside the block, as mentioned in Section 3.2, to simplify implementation and accelerate convergence.\n\nFor the vanilla baseline comparisons on CIFAR, we evaluated normal neural networks with the same amount of computation. More specifically, we calculated the average number of multiplications of every convolutional layer and rounded it up to the nearest number of channels with the same computation, which yields an identical network topology with reduced convolutional channels. We trained the vanilla baseline networks with SGD until convergence. All experiments were implemented with a modified Caffe toolbox [20].\n\n4.2 Intuitive Experiments\n\nTo give an intuitive understanding of our framework, we first conducted a simple experiment to show the effectiveness and underlying logic of RNP. 
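The deployment-time penalty adjustment described above (compensating each action's Q-value with a relative penalty before taking the greedy action) can be sketched as follows; the trained Q-values here are made-up numbers for illustration:

```python
import numpy as np

def compensate(q, extra_p):
    """Shift action a_i's Q-value by (i - 1) * extra_p before the greedy argmax.
    A more negative extra_p biases the policy toward cheaper actions, so a single
    trained model can trade accuracy for speed at deployment without retraining."""
    q = np.asarray(q, dtype=float)
    return q + np.arange(len(q)) * extra_p

q = [1.00, 1.20, 1.30, 1.35]          # hypothetical Q-values for the k = 4 actions
a_accurate = int(np.argmax(compensate(q, 0.0)))   # keep the trained balance point
a_fast = int(np.argmax(compensate(q, -0.2)))      # stronger penalty, prune more
```

Because the penalty term is input-independent, this shift is a constant per action and never needs to touch the network weights, which is what makes the balance point adjustable per device.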
We considered a 3-category classification problem consisting of male faces, female faces and background samples. Intuitively, separating male from female faces is a much more difficult task than separating faces from background and needs more detailed attention, so more resources should be allocated to face images than to background images. In other words, a good tradeoff for RNP is to prune the network more when dealing with background images and to keep more convolutional channels when the input is a face image.\n\nTo validate this idea, we constructed a 3-category dataset from the Labeled Faces in the Wild [18] dataset, which we refer to as LFW-T. More specifically, we randomly cropped 3000 images each of male and female faces, plus 3000 background images randomly cropped from LFW. We used the attributes from [23] as labels for male and female faces. All images were resized to 32 × 32 pixels. We held out 2000 images for testing and used the rest for training. For this experiment, we designed a 3-layer convolutional network with two fully connected layers; all convolutional kernels are 3 × 3, with 32, 32 and 64 output channels, respectively. We followed the training protocol above with p = −0.1, and focused on the differences between classes.\n\nThe original network achieved 91.1% accuracy. By adjusting the penalty, we reached a point of the accuracy-computation tradeoff where computation (multiplications) was reduced by a factor of 2 while the accuracy was even slightly higher, 91.75%. We examined the average computation of the different classes by counting the multiplications in the convolutional layers; the results are shown in Figure 2. Over the whole network, RNP allocated roughly twice as much computation to face images as to background images, which clearly demonstrates the effectiveness of RNP. 
However, since the first convolutional layer and the fully connected layers are not pruned, we also studied the pruning of a single convolutional layer, the last convolutional layer conv3, to obtain the absolute pruning ratio. The results are shown in the right panel of Figure 2: for this layer, the computation for face images is almost 5 times that for background images. These differences show that RNP is able to discover the relative difficulty of different inputs and exploit this property to prune the network accordingly.\n\n4.3 Results\n\nCIFAR-10 & CIFAR-100: For CIFAR-10 and CIFAR-100, we used a four-layer convolutional network and the VGG-16 network, respectively. The goal of these two experiments is to compare RNP with vanilla baseline networks whose number of convolutional channels is reduced directly from the start. The fully connected layers of standard VGG-16 are too redundant for CIFAR-100, so we removed one fully connected layer and set the inner dimension to 512; the modified VGG-16 was easier to converge and actually slightly outperformed the original model on CIFAR-100. The results are shown in Figure 3.\n\nFigure 2: The average multiplication counts of different classes in our intuitive experiment, for both the whole network (left) and the fully pruned convolutional layer conv3 (right). RNP focuses on face images by preserving more convolutional channels, while pruning the network more on background images, reaching a good tradeoff between accuracy and speed.\n\nFigure 3: The results on CIFAR-10 (left) and CIFAR-100 (right). For the vanilla curve, the rightmost point is the full model and the leftmost is the 1/4 model. RNP consistently outperforms naive channel reduction models by a very large margin.\n\n
We see that the vanilla baseline suffers a steep drop in accuracy when the computation saving exceeds about 2.5×, while RNP consistently outperforms the baseline and achieves competitive performance even at very large computation saving rates.\n\nILSVRC2012: We compared RNP with recent state-of-the-art neural pruning methods [19, 27, 46] on the ImageNet dataset using the VGG-16 model, which won 2nd place in the ILSVRC2014 challenge. We evaluated the top-5 error with single-view testing on the ILSVRC2012-val set and trained the RNP model on the ILSVRC2012-train set. The view is the center 224 × 224 region cropped from resized images whose shorter side is 256, following [46].\n\nTable 1: Comparison of the increase in top-5 error on ILSVRC2012-val (%) with recent state-of-the-art methods, using the 10.1% top-5 error model as the reference.\n\nSpeed-up | 3× | 4× | 5× | 10×\nJaderberg et al. [19] ([46]'s implementation) | 2.3 | 9.7 | 29.7 | -\nAsymmetric [46] | - | 3.84 | - | -\nFilter pruning [27] (our implementation) | 3.2 | 8.6 | 14.6 | -\nOurs | 2.32 | 3.23 | 3.58 | 4.89\n\n[Figures 2 and 3 plot per-class average multiplications (conv3 original: 1.180M mults; whole network original: 4.950M mults) and accuracy vs. multiplications curves for RNP and the vanilla baseline.]\n\nFigure 4: Visualization of the original images and the feature maps of the four convolutional groups, respectively. 
The presented feature maps are the averages of the corresponding convolutional groups.\n\nTable 2: GPU inference time under different theoretical speed-up ratios on the ILSVRC2012-val set.\n\nSpeed-up solution | Increase in top-5 error (%) | Mean inference time (ms)\nVGG-16 (1×) | 0 | 3.26 (1.0×)\nOurs (3×) | 2.32 | 1.38 (2.3×)\nOurs (4×) | 3.23 | 1.07 (3.0×)\nOurs (5×) | 3.58 | 0.880 (3.7×)\nOurs (10×) | 4.89 | 0.554 (5.9×)\n\nRNP was fine-tuned from the publicly available model², which achieves 10.1% top-5 error on the ILSVRC2012-val set. Results are shown in Table 1, where speed-up is the theoretical speed-up ratio computed from the complexity. RNP achieves performance similar to the other methods at relatively small speed-up ratios, and outperforms them by a significant margin at large speed-up ratios. We further ran experiments at an even larger ratio (10×) and found that RNP suffers only a slight additional drop (1.31% relative to 5×), far better than the other methods' results even at the 5× setting.\n\n4.4 Analysis\n\nAnalysis of feature maps: Since we define the actions in an incremental way, the lower-indexed convolutional channel groups are calculated more often (in particular, the base group is always calculated). The higher-indexed convolutional groups are increments to the lower-indexed ones, so the different convolutional groups may function like \"low-frequency\" and \"high-frequency\" filters. We visualized the different functions of the convolutional groups by computing the average feature maps produced by each group. Specifically, we took CIFAR-10 as an example and visualized the feature maps of conv2 with k = 4. 
The results are shown in Figure 4. From the figure, we see that the base convolutional group has the highest activations to the input images and describes the overall appearance of the object well, while the higher-indexed convolutional groups have sparse activations and can be considered compensations to the base group. So the underlying logic of RNP is to judge when it is necessary to compensate the base convolutional group with the higher-order ones: if the task is easy, RNP prunes the high-order feature maps for speed; otherwise it brings in more computation to pursue accuracy.\n\nRuntime analysis: One advantage of RNP is its convenience for deployment, which makes it easy to harvest actual savings in computation time. We therefore measured the actual inference time of VGG-16 on the ILSVRC2012-val set under GPU acceleration, on a Titan X (Pascal) GPU with batch size 64. Table 2 shows the GPU inference time of the different settings; we see that RNP's speed-ups generalize well to GPU.\n\n²http://www.robots.ox.ac.uk/~vgg/research/very_deep/\n\n5 Conclusion\n\nIn this paper, we have proposed a Runtime Neural Pruning (RNP) framework to prune neural networks dynamically. Since the ability of the network is fully preserved, the balance point is easily adjustable according to the available resources. Our method can be applied to off-the-shelf network structures and reaches a better tradeoff between speed and accuracy. Experimental results demonstrated the effectiveness of the proposed approach.\n\nAcknowledgements\n\nWe would like to thank Song Han, Huazhe (Harry) Xu, Xiangyu Zhang and Jian Sun for their generous help and insightful advice. This work is supported by the National Natural Science Foundation of China under Grant 61672306 and the National 1000 Young Talents Plan Program. 
The corresponding author of this work is Jiwen Lu.

References

[1] Amjad Almahairi, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, and Aaron Courville. Dynamic capacity networks. arXiv preprint arXiv:1511.07838, 2015.

[2] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015.

[3] Richard Bellman. Dynamic programming and Lagrange multipliers. PNAS, 42(10):767–769, 1956.

[4] Djalel Benbouzid, Róbert Busa-Fekete, and Balázs Kégl. Fast classification using sparse decision DAGs. arXiv preprint arXiv:1206.6387, 2012.

[5] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.

[6] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[7] Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for fast test-time prediction. arXiv preprint arXiv:1702.07811, 2017.

[8] Juan C Caicedo and Svetlana Lazebnik. Active object localization with deep reinforcement learning. In ICCV, pages 2488–2496, 2015.

[9] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.

[10] Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint arXiv:1612.02297, 2016.

[11] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.
arXiv preprint arXiv:1510.00149, 2015.

[12] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, pages 1135–1143, 2015.

[13] Stephen José Hanson and Lorien Y Pratt. Comparing biases for minimal network construction with back-propagation. In NIPS, pages 177–185, 1989.

[14] Babak Hassibi, David G Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon. NIPS, pages 164–164, 1993.

[15] He He, Jason Eisner, and Hal Daume. Imitation learning by coaching. In NIPS, pages 3149–3157, 2012.

[16] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[17] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

[18] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[19] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[21] Sergey Karayev, Tobias Baumgartner, Mario Fritz, and Trevor Darrell. Timely object recognition.
In NIPS, pages 890–898, 2012.

[22] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[23] Neeraj Kumar, Alexander Berg, Peter N Belhumeur, and Shree Nayar. Describable visual attributes for face verification and image search. PAMI, 33(10):1962–1977, 2011.

[24] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[25] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In NIPS, pages 598–605, 1990.

[26] Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, and Bart Dhoedt. The cascading neural network: building the internet of smart things. Knowledge and Information Systems, pages 1–24, 2017.

[27] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[28] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In CVPR, pages 5325–5334, 2015.

[29] Michael L Littman. Reinforcement learning improves behaviour from evaluative feedback. Nature, 521(7553):445–451, 2015.

[30] Lanlan Liu and Jia Deng. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. arXiv preprint arXiv:1701.00299, 2017.

[31] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[32] Kenton Murray and David Chiang. Auto-sizing neural networks: With applications to n-gram language models.
arXiv preprint arXiv:1508.05051, 2015.

[33] Augustus Odena, Dieterich Lawson, and Christopher Olah. Changing model behavior at test-time using reinforcement learning. arXiv preprint arXiv:1702.07780, 2017.

[34] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[35] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

[36] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.

[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[38] Nikko Ström. Phoneme probability estimation with dynamic sparsely connected artificial neural networks. The Free Speech Journal, 5:1–41, 1997.

[39] Chen Sun, Manohar Paluri, Ronan Collobert, Ram Nevatia, and Lubomir Bourdev. ProNet: Learning to propose object-specific boxes for cascaded neural networks. In CVPR, pages 3485–3493, 2016.

[40] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In CVPR, pages 3476–3483, 2013.

[41] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.

[42] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.

[43] Christopher JCH Watkins and Peter Dayan. Q-learning.
Machine Learning, 8(3-4):279–292, 1992.

[44] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, pages 2074–2082, 2016.

[45] Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, and Peter Corke. Towards vision-based deep reinforcement learning for robotic motion control. arXiv preprint arXiv:1511.03791, 2015.

[46] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. PAMI, 38(10):1943–1955, 2016.