{"title": "Network Pruning via Transformable Architecture Search", "book": "Advances in Neural Information Processing Systems", "page_first": 760, "page_last": 771, "abstract": "Network pruning reduces the computation costs of an over-parameterized network without performance damage. Prevailing pruning algorithms pre-define the width and depth of the pruned networks, and then transfer parameters from the unpruned network to pruned networks. To break the structure limitation of the pruned networks, we propose to apply neural architecture search to search directly for a network with flexible channel and layer sizes. The number of the channels/layers is learned by minimizing the loss of the pruned networks. The feature map of the pruned network is an aggregation of K feature map fragments (generated by K networks of different sizes), which are sampled based on the probability distribution. The loss can be back-propagated not only to the network weights, but also to the parameterized distribution to explicitly tune the size of the channels/layers. Specifically, we apply channel-wise interpolation to keep the feature map with different channel sizes aligned in the aggregation procedure. The maximum probability for the size in each distribution serves as the width and depth of the pruned network, whose parameters are learned by knowledge transfer, e.g., knowledge distillation, from the original networks. Experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate the effectiveness of our new perspective of network pruning compared to traditional network pruning algorithms. Various searching and knowledge transfer approaches are conducted to show the effectiveness of the two components. 
Code is at: https://github.com/D-X-Y/NAS-Projects", "full_text": "Network Pruning via Transformable Architecture Search\n\nXuanyi Dong\u2020\u2021\u2217, Yi Yang\u2020\n\n\u2020ReLER, CAI, University of Technology Sydney, \u2021Baidu Research\nxuanyi.dong@student.uts.edu.au; yi.yang@uts.edu.au\n\nAbstract\n\nNetwork pruning reduces the computation costs of an over-parameterized network without performance damage. Prevailing pruning algorithms pre-define the width and depth of the pruned networks, and then transfer parameters from the unpruned network to pruned networks. To break the structure limitation of the pruned networks, we propose to apply neural architecture search to search directly for a network with flexible channel and layer sizes. The number of the channels/layers is learned by minimizing the loss of the pruned networks. The feature map of the pruned network is an aggregation of K feature map fragments (generated by K networks of different sizes), which are sampled based on the probability distribution. The loss can be back-propagated not only to the network weights, but also to the parameterized distribution to explicitly tune the size of the channels/layers. Specifically, we apply channel-wise interpolation to keep the feature map with different channel sizes aligned in the aggregation procedure. The maximum probability for the size in each distribution serves as the width and depth of the pruned network, whose parameters are learned by knowledge transfer, e.g., knowledge distillation, from the original networks. Experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate the effectiveness of our new perspective of network pruning compared to traditional network pruning algorithms. Various searching and knowledge transfer approaches are conducted to show the effectiveness of the two components. 
Code is at: https://github.com/D-X-Y/NAS-Projects.\n\n1 Introduction\n\nDeep convolutional neural networks (CNNs) have become wider and deeper to achieve high performance on different applications [17, 22, 48]. Despite their great success, it is impractical to deploy them to resource-constrained devices, such as mobile devices and drones. A straightforward solution to address this problem is using network pruning [29, 12, 13, 20, 18] to reduce the computation cost of over-parameterized CNNs. A typical pipeline for network pruning, as indicated in Fig. 1(a), is achieved by removing the redundant filters and then fine-tuning the slashed networks, based on the original networks. Different criteria for the importance of the filters are applied, such as the L2-norm of the filter [30], the reconstruction error [20], and a learnable scaling factor [32]. Lastly, researchers apply various fine-tuning strategies [30, 18] for the pruned network to efficiently transfer the parameters of the unpruned networks and maximize the performance of the pruned networks.\n\nFigure 1: A comparison between the typical pruning paradigm and the proposed paradigm. (a) The Traditional Pruning Paradigm: train a large CNN T; prune filters to get a small CNN S; fine-tune the CNN S; obtain an efficient CNN S. (b) The Proposed Pruning Paradigm: train a large CNN T; search for the width and depth of CNN S; transfer knowledge from T to S; obtain an efficient CNN S.\n\n\u2217This work was done when Xuanyi Dong was a research intern at Baidu Research.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTraditional network pruning approaches compact networks effectively while maintaining accuracy. However, their network structure is intuitively designed, e.g., pruning 30% of the filters in each layer [30, 18], predicting the sparsity ratio [15] or leveraging regularization [2]. 
The accuracy of the pruned network is upper-bounded by the hand-crafted structures or rules for structures. To break this limitation, we apply Neural Architecture Search (NAS) to turn the design of the architecture structure into a learning procedure, and propose a new paradigm for network pruning as explained in Fig. 1(b). Prevailing NAS methods [31, 48, 8, 4, 40] optimize the network topology, while the focus of this paper is automating the network size. In order to satisfy the requirements and make a fair comparison with previous pruning strategies, we propose a new NAS scheme termed Transformable Architecture Search (TAS). TAS aims to search for the best size of a network instead of its topology, regularized by the minimization of the computation cost, e.g., floating point operations (FLOPs). The parameters of the searched/pruned networks are then learned by knowledge transfer [21, 44, 46]. TAS is a differentiable searching algorithm, which can search for the width and depth of the networks effectively and efficiently. Specifically, different candidates of channels/layers are attached with a learnable probability. The probability distribution is learned by back-propagating the loss generated by the pruned networks, whose feature map is an aggregation of K feature map fragments (outputs of networks of different sizes) sampled based on the probability distribution. These feature maps of different channel sizes are aggregated with the help of channel-wise interpolation. The maximum probability for the size in each distribution serves as the width and depth of the pruned network. In experiments, we show that the searched architecture with parameters transferred by knowledge distillation (KD) outperforms previous state-of-the-art pruning methods on CIFAR-10, CIFAR-100 and ImageNet. 
We also test different knowledge transfer approaches on architectures generated by traditional hand-crafted pruning approaches [30, 18] and a random architecture search approach [31]. Consistent improvements on different architectures demonstrate the generality of knowledge transfer.\n\n2 Related Studies\n\nNetwork pruning [29, 33] is an effective technique to compress and accelerate CNNs, and thus allows us to deploy efficient networks on hardware devices with limited storage and computation resources. A variety of techniques have been proposed, such as low-rank decomposition [47], weight pruning [14, 29, 13, 12], channel pruning [18, 33], dynamic computation [9, 7] and quantization [23, 1]. They lie in two modalities: unstructured pruning [29, 9, 7, 12] and structured pruning [30, 20, 18, 33].\n\nUnstructured pruning methods [29, 9, 7, 12] usually enforce the convolutional weights [29, 14] or feature maps [7, 9] to be sparse. The pioneers of unstructured pruning, LeCun et al. [29] and Hassibi et al. [14], investigated the use of second-derivative information to prune the weights of shallow CNNs. After deep networks emerged in 2012 [28], Han et al. [12, 13, 11] proposed a series of works to obtain highly compressed deep CNNs based on L2 regularization. After this development, many researchers explored different regularization techniques to improve the sparsity while preserving the accuracy, such as L0 regularization [35] and output sensitivity [41]. Since these unstructured methods make a big network sparse instead of changing the whole structure of the network, they need dedicated designs for dependencies [11] and specific hardware to speed up the inference procedure.\n\nStructured pruning methods [30, 20, 18, 33] target the pruning of convolutional filters or whole layers, and thus the pruned networks can be easily developed and applied. 
Early works in this field [2, 42] leveraged a group Lasso to enable structured sparsity of deep networks. After that, Li et al. [30] proposed the typical three-stage pruning paradigm (training a large network, pruning, re-training). These pruning algorithms regard filters with a small norm as unimportant and tend to prune them, but this assumption does not hold in deep nonlinear networks [43]. Therefore, many researchers focus on better criteria for identifying the informative filters. For example, Liu et al. [32] leveraged an L1 regularization; Ye et al. [43] applied an ISTA penalty; and He et al. [19] utilized a geometric median-based criterion. In contrast to previous pruning pipelines, our approach allows the number of channels/layers to be explicitly optimized so that the learned structure has high performance and low cost.\n\nBesides the criteria for informative filters, the importance of the network structure was suggested in [33]. Some methods implicitly find a data-specific architecture [42, 2, 15] by automatically determining the pruning and compression ratio of each layer.\n\nFigure 2: Searching for the width of a pruned CNN from an unpruned three-layer CNN. Each convolutional layer is equipped with a learnable distribution over the size of the channels in this layer, indicated by p_i on the left side. The feature map for every layer is built sequentially by the layers, as shown on the right side. For a specific layer, K (2 in this example) feature maps of different sizes are sampled according to the corresponding distribution and combined by channel-wise interpolation (CWI) and a weighted sum. This aggregated feature map is fed as input to the next layer.\n\nIn contrast, we explicitly discover the architecture using NAS. Most previous NAS algorithms [48, 8, 31, 40] automatically discover the topology structure of a neural network, while we focus on searching for the depth and width of a neural network. 
Reinforcement learning (RL)-based methods [48, 3] and evolutionary algorithm-based methods [40] can search for networks with flexible width and depth; however, they require huge computational resources and cannot be directly used on large-scale target datasets. Differentiable methods [8, 31, 4] dramatically decrease the computation costs, but they usually assume that the number of channels in different searching candidates is the same. TAS is a differentiable NAS method, which is able to efficiently search for a transformable network with flexible width and depth.\n\nNetwork transformation [5, 10, 3] also studies the depth and width of networks. Chen et al. [5] manually widened and deepened a network, and proposed Net2Net to initialize the larger network. Ariel et al. [10] proposed a heuristic strategy to find a suitable width of networks by alternating between shrinking and expanding. Cai et al. [3] utilized an RL agent to grow the depth and width of CNNs, while our TAS is a differentiable approach that can not only enlarge but also shrink CNNs.\n\nKnowledge transfer has been proven to be effective in the literature of pruning. The parameters of the networks can be transferred from the pre-trained initialization [30, 18]. Minnehan et al. [37] transferred the knowledge of an uncompressed network via a block-wise reconstruction loss. In this paper, we apply a simple KD approach [21] to perform knowledge transfer, which achieves robust performance for the searched architectures.\n\n3 Methodology\n\nOur pruning approach consists of three steps: (1) training the unpruned large network by a standard classification training procedure; (2) searching for the depth and width of a small network via the proposed TAS; (3) transferring the knowledge from the unpruned large network to the searched small network by a simple KD approach [21]. 
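Step (3) uses standard temperature-scaled distillation [21]. As a minimal illustrative sketch (our own code operating on plain logit lists, not the authors' released training script; `kd_loss` and its arguments are names we chose):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; T > 1 softens the distribution.
    m = max(x / T for x in logits)
    es = [math.exp(x / T - m) for x in logits]
    s = sum(es)
    return [e / s for e in es]

def kd_loss(student_logits, teacher_logits, label, lam=0.9, T=4.0):
    # Cross-entropy with the true label, plus cross-entropy between the
    # T-softened teacher and student distributions, mixed by lambda.
    ce = -math.log(softmax(student_logits)[label])
    qt = softmax(teacher_logits, T)
    qs = softmax(student_logits, T)
    match = -sum(t * math.log(s) for t, s in zip(qt, qs))
    return lam * ce + (1.0 - lam) * match
```

The paper's reported settings (Sec. 4.1) correspond to lam = 0.9 and T = 4 on CIFAR, and lam = 0.5 and T = 4 on ImageNet.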
We will introduce the background, show the details of TAS, and explain the knowledge transfer procedure.\n\n3.1 Transformable Architecture Search\n\nNetwork channel pruning aims to reduce the number of channels in each layer of a network. Given an input image, a network takes it as input and produces the probability over each target class. Suppose X and O are the input and output feature tensors of the l-th convolutional layer (we take a 3-by-3 convolution as an example); this layer calculates the following procedure:\n\nO_j = \u2211_{k=1}^{c_in} X_{k,:,:} \u2217 W_{j,k,:,:} where 1 \u2264 j \u2264 c_out,  (1)\n\nwhere W \u2208 R^{c_out\u00d7c_in\u00d73\u00d73} indicates the convolutional kernel weight, c_in is the number of input channels, and c_out is the number of output channels. W_{j,k,:,:} corresponds to the k-th input channel and the j-th output channel. \u2217 denotes the convolution operation. Channel pruning methods reduce c_out, and consequently, c_in in the next layer is also reduced.\n\nSearch for width. We use parameters \u03b1 \u2208 R^{|C|} to indicate the distribution of the possible number of channels in one layer, indicated by C with max(C) \u2264 c_out. 
The probability of choosing the j-th candidate for the number of channels can be formulated as:\n\np_j = exp(\u03b1_j) / \u2211_{k=1}^{|C|} exp(\u03b1_k) where 1 \u2264 j \u2264 |C|.  (2)\n\nHowever, the sampling operation in the above procedure is non-differentiable, which prevents us from back-propagating gradients through p_j to \u03b1_j. Motivated by [8], we apply Gumbel-Softmax [26, 36] to soften the sampling procedure and optimize \u03b1:\n\n\u02c6p_j = exp((log(p_j) + o_j)/\u03c4) / \u2211_{k=1}^{|C|} exp((log(p_k) + o_k)/\u03c4)  s.t.  o_j = \u2212log(\u2212log(u)), u \u223c U(0, 1),  (3)\n\nwhere U(0, 1) denotes the uniform distribution between 0 and 1, and \u03c4 is the softmax temperature. When \u03c4 \u2192 0, \u02c6p = [\u02c6p_1, ..., \u02c6p_j, ...] becomes one-hot, and the Gumbel-softmax distribution drawn from \u02c6p becomes identical to the categorical distribution. When \u03c4 \u2192 \u221e, the Gumbel-softmax distribution becomes a uniform distribution over C. The feature map in our method is defined as the weighted sum of the original feature map fragments with different sizes, where the weights are \u02c6p. Feature maps of different sizes are aligned by channel-wise interpolation (CWI) so that the weighted sum can be computed. To reduce the memory cost, we select a small subset with indexes I \u2286 [|C|] for aggregation instead of using all candidates. Additionally, the weights are re-normalized based on the probability of the selected sizes, which is formulated as:\n\n\u02c6O = \u2211_{j\u2208I} [exp((log(p_j) + o_j)/\u03c4) / \u2211_{k\u2208I} exp((log(p_k) + o_k)/\u03c4)] \u00d7 CWI(O_{1:C_j,:,:}, max(C_I))  s.t.  I \u223c T_\u02c6p,  (4)\n\nwhere T_\u02c6p indicates the multinomial probability distribution parameterized by \u02c6p. The proposed CWI is a general operation to align feature maps with different sizes. 
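To make the sampling above concrete, here is a minimal pure-Python sketch of Eq. (3)'s Gumbel-softmax relaxation and Eq. (4)'s re-normalized weights over a sampled subset I (an illustrative sketch under our own function names, not the released implementation):

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax, as in Eq. (2) applied to the alphas.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gumbel_noise(n):
    # n i.i.d. Gumbel(0, 1) samples: o = -log(-log(u)), u ~ U(0, 1).
    out = []
    for _ in range(n):
        u = min(max(random.random(), 1e-10), 1.0 - 1e-10)  # avoid log(0)
        out.append(-math.log(-math.log(u)))
    return out

def gumbel_softmax(alphas, tau):
    # Eq. (3): perturb log-probabilities with Gumbel noise, re-softmax at
    # temperature tau; tau -> 0 approaches a one-hot categorical sample.
    p = softmax(alphas)
    o = gumbel_noise(len(p))
    return softmax([(math.log(pj) + oj) / tau for pj, oj in zip(p, o)])

def renormalized_weights(alphas, tau, I):
    # Eq. (4)'s coefficients: the same perturbed logits, but the softmax
    # normalizes only over the sampled subset of candidate indexes I.
    p = softmax(alphas)
    o = gumbel_noise(len(p))
    return softmax([(math.log(p[j]) + o[j]) / tau for j in I])
```

Each selected fragment O_{1:C_j,:,:} is then aligned by CWI and summed with these weights.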
It can be implemented in many ways, such as a 3D variant of the spatial transformer network [25] or an adaptive pooling operation [16]. In this paper, we choose the 3D adaptive average pooling operation [16] as CWI\u00b2, because it brings no extra parameters and negligible extra costs. We use Batch Normalization [24] before CWI to normalize the different fragments. Fig. 2 illustrates the above procedure taking |I| = 2 as an example.\n\nDiscussion w.r.t. the sampling strategy in Eq. (4). This strategy aims to reduce the memory cost and training time to an acceptable amount by back-propagating gradients only for the sampled architectures instead of all architectures. Compared to sampling via a uniform distribution, the applied sampling method (sampling based on probability) weakens the gradient differences caused by per-iteration sampling after multiple iterations.\n\nSearch for depth. We use parameters \u03b2 \u2208 R^L to indicate the distribution of the possible number of layers in a network with L convolutional layers. We utilize a similar strategy to sample the number of layers following Eq. (3) and allow \u03b2 to be differentiable in the same way as \u03b1, using the sampling distribution \u02c6q_l for the depth l. We then calculate the final output feature of the pruned network as an aggregation over all possible depths, which can be formulated as:\n\nO_out = \u2211_{l=1}^{L} \u02c6q_l \u00d7 CWI(\u02c6O_l, C_out),  (5)\n\nwhere \u02c6O_l indicates the output feature map via Eq. (4) at the l-th layer, and C_out indicates the maximum sampled channel among all \u02c6O_l. The final output feature map O_out is fed into the last classification layer to make predictions. 
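The adaptive-average-pooling form of CWI reduces, per spatial position, to averaging a window of input channels for each output channel. A minimal sketch over a single channel vector (illustrative only; the paper applies this per (h, w) position of full C×H×W tensors, and `cwi_1d` is our own name):

```python
import math

def cwi_1d(a, c_out):
    # Channel-wise interpolation via adaptive average pooling:
    # output channel i averages input channels s..e-1, where
    # s = floor(i * C / c_out) and e = ceil((i + 1) * C / c_out).
    c = len(a)
    out = []
    for i in range(c_out):
        s = (i * c) // c_out
        e = math.ceil((i + 1) * c / c_out)
        out.append(sum(a[s:e]) / (e - s))
    return out
```

For example, `cwi_1d([1.0, 2.0, 3.0, 4.0], 2)` averages channel pairs into `[1.5, 3.5]`, while `c_out` equal to the input width returns the input unchanged, so fragments of any sampled width can be summed elementwise after alignment.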
In this way, we can back-propagate gradients to both the width parameters \u03b1 and the depth parameters \u03b2.\n\n\u00b2The formulation of the selected CWI: suppose B = CWI(A, C_out), where B \u2208 R^{C_out\u00d7H\u00d7W} and A \u2208 R^{C\u00d7H\u00d7W}; then B_{i,h,w} = mean(A_{s:e\u22121,h,w}), where s = \u230ai\u00d7C/C_out\u230b and e = \u2308(i+1)\u00d7C/C_out\u2309. We tried other forms of CWI, e.g., bilinear and trilinear interpolation. They obtain similar accuracy but are much slower than our choice.\n\nSearching objectives. The final architecture A is derived by selecting the candidate with the maximum probability, encoded by the architecture parameters A, consisting of an \u03b1 for each layer and \u03b2. The goal of our TAS is to find an architecture A with the minimum validation loss Lval after it is trained by minimizing the training loss Ltrain:\n\nA = arg min_A Lval(\u03c9\u2217_A, A)  s.t.  \u03c9\u2217_A = arg min_\u03c9 Ltrain(\u03c9, A),  (6)\n\nwhere \u03c9\u2217_A indicates the optimized weights of A. The training loss is the cross-entropy classification loss of the networks. Prevailing NAS methods [31, 48, 8, 4, 40] optimize A over network candidates with different topologies, while our TAS searches over candidates with the same topology but smaller width and depth. As a result, the validation loss in our search procedure includes not only the classification validation loss but also a penalty for the computation cost:\n\nLval = \u2212log(exp(z_y) / \u2211_{j=1}^{|z|} exp(z_j)) + \u03bb_cost \u00d7 Lcost,  (7)\n\nwhere z is a vector denoting the output logits of the pruned network, y indicates the ground-truth class of the corresponding input, and \u03bb_cost is the weight of Lcost. The cost loss encourages the computation cost of the network (e.g., FLOPs) to converge to a target R, so that the cost can be dynamically adjusted by setting different R. 
We use a piecewise computation cost loss:\n\nLcost = { log(Ecost(A)), if Fcost(A) > (1+t)\u00d7R;  0, if (1\u2212t)\u00d7R < Fcost(A) < (1+t)\u00d7R;  \u2212log(Ecost(A)), if Fcost(A) < (1\u2212t)\u00d7R },  (8)\n\nwhere Ecost(A) computes the expectation of the computation cost based on the architecture parameters A. Specifically, it is the weighted sum of the computation costs of all candidate networks, where the weights are the sampling probabilities. Fcost(A) indicates the actual cost of the searched architecture, whose width and depth are derived from A. t \u2208 [0, 1] denotes a toleration ratio, which slows down the speed at which the searched architecture changes. Note that we use FLOPs to evaluate the computation cost of a network, and FLOPs can readily be replaced by other metrics, such as latency [4].\n\nWe show the overall algorithm in Alg. 1. During searching, we forward the network using Eq. (5) to make both the weights and the architecture parameters differentiable. We alternately minimize Ltrain on the training set to optimize the pruned network's weights and Lval on the validation set to optimize the architecture parameters A. After searching, we pick the number of channels with the maximum probability as the width and the number of layers with the maximum probability as the depth. The final searched network is constructed from the selected width and depth. This network is then optimized via KD, whose details we introduce in Sec. 3.2.\n\nAlgorithm 1 The TAS Procedure\nInput: split the training set into two disjoint sets: Dtrain and Dval\n1: while not converged do\n2:   Sample batch data Dt from Dtrain\n3:   Calculate Ltrain on Dt to update the network weights\n4:   Sample batch data Dv from Dval\n5:   Calculate Lval on Dv via Eq. (7) to update A\n6: end while\n7: Derive the searched network from A\n8: Randomly initialize the searched network and optimize it by KD via Eq. (10) on the training set\n\n3.2 Knowledge Transfer\n\nKnowledge transfer is important for learning a robust pruned network, and we employ a simple KD algorithm [21] on the searched network architecture. This algorithm encourages the predictions z of the small network to match the soft targets from the unpruned network via the following objective:\n\nLmatch = \u2212\u2211_{i=1}^{|z|} [exp(\u02c6z_i/T) / \u2211_{j=1}^{|z|} exp(\u02c6z_j/T)] \u00d7 log(exp(z_i/T) / \u2211_{j=1}^{|z|} exp(z_j/T)),  (9)\n\nwhere T is a temperature, and \u02c6z indicates the logit output vector from the pre-trained unpruned network. Additionally, it uses a softmax with cross-entropy loss to encourage the small network to predict the true targets. The final objective of KD is as follows:\n\nLKD = \u2212\u03bb log(exp(z_y) / \u2211_{j=1}^{|z|} exp(z_j)) + (1\u2212\u03bb) Lmatch  s.t.  0 \u2264 \u03bb \u2264 1,  (10)\n\nwhere y indicates the true target class of the corresponding input, and \u03bb is a loss weight to balance the standard classification loss and the soft matching loss. After we obtain the searched network (Sec. 3.1), we first pre-train the unpruned network and then optimize the searched network by transferring from the unpruned network via Eq. (10).\n\nFigure 3: The impact of different choices to make the architecture parameters differentiable. (a) The FLOPs of the searched network over epochs when we do not constrain the FLOPs (\u03bbcost = 0). (b) The mean discrepancy over epochs when we do not constrain the FLOPs (\u03bbcost = 0). (c) The FLOPs of the searched network over epochs when we constrain the FLOPs (\u03bbcost = 2). (d) The mean discrepancy over epochs when we constrain the FLOPs (\u03bbcost = 2).\n\n4 Experimental Analysis\n\nWe introduce the experimental setup in Sec. 4.1. We evaluate different aspects of TAS in Sec. 
4.2, such as hyper-parameters, sampling strategies, different transfer methods, etc. Lastly, we compare TAS with other state-of-the-art pruning methods in Sec. 4.3.\n\n4.1 Datasets and Settings\n\nDatasets. We evaluate our approach on CIFAR-10, CIFAR-100 [27] and ImageNet [6]. CIFAR-10 contains 50K training images and 10K test images with 10 classes. CIFAR-100 is similar to CIFAR-10 but has 100 classes. ImageNet contains 1.28 million training images and 50K test images with 1000 classes. We use the typical data augmentation for these three datasets. On CIFAR-10 and CIFAR-100, we randomly crop a 32\u00d732 patch with 4 pixels of padding on each border, and we also apply random horizontal flipping. On ImageNet, we use the typical random resized crop, random changes of brightness/contrast/saturation, and random horizontal flipping for data augmentation. During evaluation, we resize the image to 256\u00d7256 and center crop a 224\u00d7224 patch.\n\nThe search setting. We search the number of channels over {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} of the original number in the unpruned network. We search the depth within each convolutional stage. We sample |I| = 2 candidates in Eq. (4) to reduce the GPU memory cost during searching. We set R according to the FLOPs of the compared pruning algorithms and set \u03bbcost to 2. 
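The target R and \u03bbcost chosen above feed the piecewise penalty of Eq. (8), which can be sketched as (illustrative; `e_cost` and `f_cost` stand in for the expected and actual FLOP counts Ecost(A) and Fcost(A)):

```python
import math

def cost_loss(e_cost, f_cost, R, t=0.05):
    # Eq. (8): zero penalty inside the tolerated band [(1-t)R, (1+t)R];
    # push the expected cost down when the searched network is too big,
    # and up when it is too small.
    if f_cost > (1.0 + t) * R:
        return math.log(e_cost)
    if f_cost < (1.0 - t) * R:
        return -math.log(e_cost)
    return 0.0
```

The default t = 0.05 here mirrors the 5% toleration ratio reported in this section.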
We optimize the weights via SGD and the architecture parameters via Adam. For the weights, we start the learning rate from 0.1 and reduce it by the cosine scheduler [34]. For the architecture parameters, we use a constant learning rate of 0.001 and a weight decay of 0.001. On both CIFAR-10 and CIFAR-100, we train the model for 600 epochs with a batch size of 256. On ImageNet, we train ResNets [17] for 120 epochs with a batch size of 256. The toleration ratio t is always set to 5%. The \u03c4 in Eq. (3) is linearly decayed from 10 to 0.1.\n\nTraining. For CIFAR experiments, we use SGD with a momentum of 0.9 and a weight decay of 0.0005. We train each model for 300 epochs, start the learning rate at 0.1, and reduce it by the cosine scheduler [34]. We use a batch size of 256 and 2 GPUs. When using KD on CIFAR, we use \u03bb of 0.9 and the temperature T of 4 following [46]. For ResNet models on ImageNet, we follow most hyper-parameters of CIFAR, but use a weight decay of 0.0001. We use 4 GPUs to train the model for 120 epochs with a batch size of 256. When using KD on ImageNet, we set \u03bb as 0.5 and T as 4.\n\nTable 1: The accuracy on CIFAR-100 when pruning about 40% FLOPs of ResNet-32.\nMethod | FLOPs | Accuracy\nPre-defined | 41.1 MB | 68.18%\nPre-defined w/ Init | 41.1 MB | 69.34%\nPre-defined w/ KD | 41.1 MB | 71.40%\nRandom Search | 42.9 MB | 68.57%\nRandom Search w/ Init | 42.9 MB | 69.14%\nRandom Search w/ KD | 42.9 MB | 71.71%\nTAS\u2020 | 42.5 MB | 68.95%\nTAS\u2020 w/ Init | 42.5 MB | 69.70%\nTAS\u2020 w/ KD (TAS) | 42.5 MB | 72.41%\n\n4.2 Case Studies\n\nIn this section, we evaluate different aspects of our proposed TAS. We also compare it with different searching algorithms and knowledge transfer methods to demonstrate the effectiveness of TAS.\n\nThe effect of different strategies to differentiate \u03b1. We apply our TAS on CIFAR-100 to prune ResNet-56. 
We try two different aggregation methods, i.e., using our proposed CWI to align feature maps or not. We also try two different kinds of aggregation weights, i.e., Gumbel-softmax sampling as in Eq. (3) (denoted as \u201csample\u201d in Fig. 3) and vanilla softmax as in Eq. (2) (denoted as \u201cmixture\u201d in Fig. 3). Therefore, there are four different strategies, i.e., with/without CWI combined with Gumbel-softmax/vanilla softmax. Suppose we do not constrain the computational cost; then the architecture parameters should be optimized to find the maximum width and depth, because such a network has the maximum capacity and should yield the best performance on CIFAR-100. We try all four strategies with and without the computational cost constraint and show the results in Fig. 3c and Fig. 3a. When we do not constrain the FLOPs, our TAS successfully finds that the best architecture should have the maximum width and depth, whereas the other three strategies fail. When we use the FLOP constraint, we can successfully keep the computational cost within the target range. We also investigate the discrepancy between the highest probability and the second highest probability in Fig. 3d and Fig. 3b. Theoretically, a higher discrepancy indicates that the model is more confident in selecting a certain width, while a lower discrepancy means that the model is confused and does not know which candidate to select. As shown in Fig. 3d, as training proceeds, our TAS becomes more confident in selecting the suitable width. In contrast, strategies without CWI cannot optimize the architecture parameters, and \u201cmixture with CWI\u201d shows a worse discrepancy than ours.\n\nTable 2: Results of different configurations when pruning ResNet-32 on CIFAR-10 with one V100 GPU. \u201c#SC\u201d indicates the number of selected channels. \u201cH\u201d indicates hours.\n\nComparison w.r.t. 
structure generated by different methods in Table 1. \u201cPre-de\ufb01ned\u201d means\npruning a \ufb01xed ratio at each layer [30]. \u201cRandom Search\u201d indicates an NAS baseline used in [31].\n\u201cTAS\u2020\u201d is our proposed differentiable searching algorithm. We make two observations: (1) searching\ncan \ufb01nd a better structure using different knowledge transfer methods; (2) our TAS is superior to the\nNAS random baseline.\n\n#SC Search Time Memory Train Time FLOPs Accuracy\n|I|=1\n0.71 H 23.59 MB 89.85%\n|I|=2\n0.84 H 38.95 MB 92.98%\n|I|=3\n0.67 H 39.04 MB 92.63%\n|I|=5\n0.60 H 37.08 MB 93.18%\n|I|=8\n0.81 H 38.28 MB 92.65%\n\n2.83 H\n3.83 H\n4.94 H\n7.18 H\n10.64 H\n\n1.5GB\n2.4GB\n3.4GB\n5.1GB\n7.3GB\n\n7\n\n\fTable 3: Comparison of different pruning algorithms for ResNet on CIFAR. \u201cAcc\u201d = accuracy,\n\u201cFLOPs\u201d = FLOPs (pruning ratio), \u201cTAS (D)\u201d = searching for depth, \u201cTAS (W)\u201d = searching for\nwidth, \u201cTAS\u201d = searching for both width and depth.\n\nDepth Method\n\nCIFAR-10\n\nFLOPs\n\nCIFAR-100\n\nFLOPs\n\nTAS\n\nPrune Acc Acc Drop\n64.66% 2.87% 2.73E7 (33.1%)\n64.37% 3.25% 2.43E7 (42.2%)\n66.86% 0.76% 2.43E7 (42.2%)\n64.81% 3.88% 2.19E7 (46.2%)\n68.08% 0.61% 1.92E7 (52.9%)\n68.90% -0.21% 2.24E7 (45.0%)\n67.39% 2.69% 4.32E7 (37.5%)\n68.37% 1.40% 4.03E7 (41.5%)\n68.52% 1.25% 4.03E7 (41.5%)\n66.94% 3.66% 4.08E7 (41.0%)\n71.74% -1.12% 3.80E7 (45.0%)\n72.41% -1.80% 4.25E7 (38.5%)\n\nPrune Acc Acc Drop\n91.68% 1.06% 2.61E7 (36.0%)\nLCCL [7]\nSFP [18]\n90.83% 1.37% 2.43E7 (42.2%)\nFPGM [19] 91.09% 1.11% 2.43E7 (42.2%)\n90.97% 1.91% 2.19E7 (46.2%)\nTAS (D)\n92.31% 0.57% 1.99E7 (51.3%)\nTAS (W)\n92.88% 0.00% 2.24E7 (45.0%)\n90.74% 1.59% 4.76E7 (31.2%)\nLCCL [7]\nSFP [18]\n92.08% 0.55% 4.03E7 (41.5%)\nFPGM [19] 92.31% 0.32% 4.03E7 (41.5%)\n91.48% 2.41% 4.08E7 (41.0%)\nTAS (D)\nTAS (W)\n92.92% 0.96% 3.78E7 (45.4%)\n93.16% 0.73% 3.50E7 (49.4%)\nPFEC [30]\n93.06% -0.02% 9.09E7 (27.6%)\nLCCL [7]\n92.81% 1.54% 
7.81E7 (37.9%)\nAMC [15]\n91.90% 0.90% 6.29E7 (50.0%)\nSFP [18]\n93.35% 0.56% 5.94E7 (52.6%)\nFPGM [19] 93.49% 0.42% 5.94E7 (52.6%)\n93.69% 0.77% 5.95E7 (52.7%)\n93.44% 0.19% 1.66E8 (34.2%)\nLCCL[7]\n93.30% 0.20% 1.55E8 (38.6%)\nPFEC [30]\n71.28% 2.86% 1.21E8 (52.3%)\nSFP [18]\n92.97% 0.70% 1.21E8 (52.3%)\n72.55% 1.59% 1.21E8 (52.3%)\nFPGM [19] 93.85% -0.17% 1.21E8 (52.3%)\n73.16% 1.90% 1.20E8 (52.6%)\n94.33% 0.64% 1.19E8 (53.0%)\n94.09% 0.45% 1.79E8 (27.40%) 75.26% 0.41% 1.95E8 (21.3%)\n94.00% 1.47% 1.78E8 (28.10%) 77.76% 0.53% 1.71E8 (30.9%)\n\n68.79% 2.61% 5.94E7 (52.6%)\n69.66% 1.75% 5.94E7 (52.6%)\n72.25% 0.93% 6.12E7 (51.3%)\n70.78% 2.01% 1.73E8 (31.3%)\n\n68.37% 2.96% 7.63E7 (39.3%)\n\n\u2212\n\u2212\n\n\u2212\n\n\u2212\n\u2212\n\n\u2212\n\n\u2212\n\u2212\n\n\u2212\n\nLCCL[7]\n\nTAS\n\nTAS\n\nTAS\n\nTAS\n\n20\n\n32\n\n56\n\n110\n\n164\n\nComparison w.r.t. different knowledge transfer methods in Table 1. The \ufb01rst line in each block\ndoes not use any knowledge transfer method. \u201cw/ Init\u201d indicates using pre-trained unpruned network\nas initialization. \u201cw/ KD\u201d indicates using KD. From Table 1, knowledge transfer methods can\nconsistently improve the accuracy of pruned network, even if a simple method is applied (Init).\nBesides, KD is robust and improves the pruned network by more than 2% accuracy on CIFAR-100.\nSearching width vs. searching depth. We try (1) only searching depth (\u201cTAS (D)\u201d), (2) only\nsearching width (\u201cTAS (W)\u201d), and (3) searching both depth and width (\u201cTAS\u201d) in Table 3. Results of\nonly searching depth are worse than results of only searching width. If we jointly search for both\ndepth and width, we can achieve better accuracy with similar FLOP than both searching depth and\nsearching width only.\nThe effect of selecting different numbers of architecture samples I in Eq. (4). We compare\ndifferent numbers of selected channels in Table 2 and did experiments on a single NVIDIA Tesla\nV100. 
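The sampled aggregation studied in this ablation can be sketched concretely. The following is a minimal NumPy illustration of our reading of Eq. (3)-(4): sample |I| candidate widths from the softmax of the architecture parameters α, re-normalize the selected probabilities, align the feature-map fragments with channel-wise interpolation (CWI), and sum them. The candidate widths, the use of plain linear interpolation for CWI, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cwi(feature, out_channels):
    """Channel-wise interpolation: resize a (C, H, W) fragment to
    (out_channels, H, W) by interpolating along the channel axis.
    (Plain linear interpolation here, as an illustrative stand-in.)"""
    src = np.linspace(0.0, 1.0, feature.shape[0])
    dst = np.linspace(0.0, 1.0, out_channels)
    return np.apply_along_axis(lambda v: np.interp(dst, src, v), 0, feature)

def sampled_aggregation(fragments, alpha, num_samples, rng):
    """Sample |I| candidate widths from softmax(alpha), re-normalize the
    selected probabilities (cf. Eq. (4)), CWI-align the corresponding
    fragments, and return their weighted sum (cf. Eq. (3))."""
    p = softmax(alpha)
    idx = rng.choice(len(fragments), size=num_samples, replace=False, p=p)
    w = p[idx] / p[idx].sum()          # re-normalized probabilities
    c_max = int(max(fragments[i].shape[0] for i in idx))
    aligned = [cwi(fragments[i], c_max) for i in idx]
    return sum(wi * f for wi, f in zip(w, aligned))

rng = np.random.default_rng(0)
widths = [4, 8, 12, 16]                # illustrative channel candidates
frags = [rng.standard_normal((c, 5, 5)) for c in widths]
alpha = np.zeros(len(widths))          # learnable architecture parameters
out = sampled_aggregation(frags, alpha, num_samples=2, rng=rng)
print(out.shape)                       # (max sampled width, 5, 5)

# Degenerate case |I| = 1: the single re-normalized weight is identically
# 1.0 no matter what alpha is, so alpha receives zero gradient.
p = softmax(np.array([3.0, -1.0, 0.5, 0.0]))
assert np.allclose(p[[2]] / p[[2]].sum(), 1.0)
```

With |I| > 1 the re-normalized weights depend on α through the selected probabilities, which is what lets the loss back-propagate into the width distribution.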
The searching time and the GPU memory usage increase linearly with |I|. When |I|=1, the re-normalized probability in Eq. (4) becomes a constant scalar of 1, so the gradients of the parameters α become 0 and the search fails. When |I|>1, the performance for different |I| is similar.

The speedup gain. As shown in Table 2, TAS can finish the searching procedure for ResNet-32 in about 3.8 hours on a single V100 GPU. If we use an evolution strategy (ES) or random searching methods, we need to train networks with many different candidate configurations one by one and then evaluate them to find the best; in this way, much more computational cost is required compared to our TAS. A possible solution to accelerate ES or random searching methods is to share the parameters of networks with different configurations [39, 45], which is beyond the scope of this paper.

Table 4: Comparison of different pruning algorithms for different ResNets on ImageNet.

Model     | Method        | Top-1 Prune Acc | Acc Drop | Top-5 Prune Acc | Acc Drop | FLOPs  | Prune Ratio
ResNet-18 | LCCL [7]      | 66.33% | 3.65% | 86.94% | 2.29% | 1.19E9 | 34.6%
ResNet-18 | SFP [18]      | 67.10% | 3.18% | 87.78% | 1.85% | 1.06E9 | 41.8%
ResNet-18 | FPGM [19]     | 68.41% | 1.87% | 88.48% | 1.15% | 1.06E9 | 41.8%
ResNet-18 | TAS           | 69.15% | 1.50% | 89.19% | 0.68% | 1.21E9 | 33.3%
ResNet-50 | SFP [18]      | 74.61% | 1.54% | 92.06% | 0.81% | 2.38E9 | 41.8%
ResNet-50 | CP [20]       | −      | −     | 90.80% | 1.40% | 2.04E9 | 50.0%
ResNet-50 | Taylor [38]   | 74.50% | 1.68% | −      | −     | 2.25E9 | 44.9%
ResNet-50 | AutoSlim [45] | 76.00% | −     | −      | −     | 3.00E9 | 26.6%
ResNet-50 | FPGM [19]     | 75.50% | 0.65% | 92.63% | 0.21% | 2.36E9 | 42.2%
ResNet-50 | TAS           | 76.20% | 1.26% | 93.07% | 0.48% | 2.31E9 | 43.5%

4.3 Compared to the state-of-the-art

Results on CIFAR in Table 3. We prune different ResNets on both CIFAR-10 and CIFAR-100. Most previous algorithms perform poorly on CIFAR-100, while our TAS consistently outperforms them by more than 2% accuracy in most cases.
On CIFAR-10, our TAS outperforms the state-of-the-art algorithms on ResNet-20, 32, 56, and 110. For example, TAS obtains 72.25% accuracy by pruning ResNet-56 on CIFAR-100, which is higher than the 69.66% of FPGM [19]. For pruning ResNet-32 on CIFAR-100, we obtain higher accuracy and fewer FLOPs than the unpruned network. We obtain a slightly worse performance than LCCL [7] on ResNet-164, because there are 8^163 × 18^3 candidate network structures when pruning ResNet-164. It is challenging to search over such a huge search space, and the very deep network suffers from over-fitting on CIFAR-10 [17].
Results on ImageNet in Table 4. We prune ResNet-18 and ResNet-50 on ImageNet. For ResNet-18, it takes about 59 hours to search for the pruned network on 4 NVIDIA Tesla V100 GPUs. Training the unpruned ResNet-18 costs about 24 hours, and thus the searching time is acceptable. With more machines and an optimized implementation, we could finish TAS in less time. We show competitive results compared to other state-of-the-art pruning algorithms. For example, TAS prunes ResNet-50 by 43.5% FLOPs, and the pruned network achieves 76.20% accuracy, which is 0.7% higher than FPGM. Similar improvements can be found when pruning ResNet-18. Note that we directly apply the hyper-parameters tuned on CIFAR-10 to prune models on ImageNet, and thus TAS could potentially achieve better results by carefully tuning the hyper-parameters on ImageNet.
Our proposed TAS is a preliminary work for the new network pruning pipeline. This pipeline can be improved by designing a more effective searching algorithm and knowledge transfer method.
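The knowledge transfer component uses standard knowledge distillation [21]. A rough NumPy sketch of that objective follows; the temperature T, the weight λ, and all names here are illustrative assumptions, not the paper's hyper-parameters:

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled, numerically stable softmax over the last axis
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.9):
    """Cross-entropy on the true labels plus a temperature-softened KL term
    pulling the pruned (student) network toward the unpruned teacher."""
    p_s = softmax(student_logits)
    ce = -np.mean(np.log(p_s[np.arange(len(labels)), labels] + 1e-12))
    q_t = softmax(teacher_logits, T)       # soft targets from the teacher
    q_s = softmax(student_logits, T)
    kl = np.mean(np.sum(q_t * (np.log(q_t + 1e-12) - np.log(q_s + 1e-12)),
                        axis=-1))
    return (1.0 - lam) * ce + lam * (T ** 2) * kl

rng = np.random.default_rng(0)
student = rng.standard_normal((8, 10))                  # pruned network logits
teacher = student + 0.1 * rng.standard_normal((8, 10))  # unpruned network
labels = rng.integers(0, 10, size=8)
print(kd_loss(student, teacher, labels))
```

When the teacher logits equal the student logits, the KL term vanishes and only the down-weighted cross-entropy remains, which is one way to sanity-check an implementation.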
We hope that future work exploring these two components will yield powerful compact networks.

5 Conclusion

In this paper, we propose a new paradigm for network pruning, which consists of two components. For the first component, we propose to apply NAS to search for the best depth and width of a network. Since most previous NAS approaches focus on the network topology instead of the network size, we name this new NAS scheme Transformable Architecture Search (TAS). Furthermore, we propose a differentiable TAS approach to efficiently and effectively find the most suitable depth and width of a network. For the second component, we propose to optimize the searched network by transferring knowledge from the unpruned network. In this paper, we apply a simple KD algorithm to perform knowledge transfer, and evaluate other transfer approaches to demonstrate the effectiveness of this component. Our results show that new efforts focusing on searching and transferring may lead to new breakthroughs in network pruning.

References
[1] M. Alizadeh, J. Fernández-Marqués, N. D. Lane, and Y. Gal. An empirical study of binary neural networks' optimisation. In International Conference on Learning Representations (ICLR), 2019.
[2] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 2270–2278, 2016.
[3] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In AAAI Conference on Artificial Intelligence (AAAI), pages 2787–2794, 2018.
[4] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.
[5] T. Chen, I. Goodfellow, and J. Shlens.
Net2net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations (ICLR), 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[7] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5840–5848, 2017.
[8] X. Dong and Y. Yang. Searching for a robust neural architecture in four GPU hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1761–1770, 2019.
[9] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1039–1048, 2017.
[10] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1586–1595, 2018.
[11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In The ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 243–254, 2016.
[12] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2015.
[13] S. Han, J. Pool, J. Tran, and W. Dally.
Learning both weights and connections for efficient neural network. In The Conference on Neural Information Processing Systems (NeurIPS), pages 1135–1143, 2015.
[14] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In The Conference on Neural Information Processing Systems (NeurIPS), pages 164–171, 1993.
[15] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 183–202, 2018.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(9):1904–1916, 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[18] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang. Soft filter pruning for accelerating deep convolutional neural networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2234–2240, 2018.
[19] Y. He, P. Liu, Z. Wang, and Y. Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2019.
[20] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1389–1397, 2017.
[21] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In The Conference on Neural Information Processing Systems Workshop (NeurIPS-W), 2014.
[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger.
Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.
[23] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research (JMLR), 18(1):6869–6898, 2017.
[24] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In The International Conference on Machine Learning (ICML), pages 448–456, 2015.
[25] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 2017–2025, 2015.
[26] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.
[27] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.
[29] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In The Conference on Neural Information Processing Systems (NeurIPS), pages 598–605, 1990.
[30] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), 2017.
[31] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.
[32] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming.
In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2736–2744, 2017.
[33] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2018.
[34] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.
[35] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations (ICLR), 2018.
[36] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.
[37] B. Minnehan and A. Savakis. Cascaded projection: End-to-end network compression and acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10715–10724, 2019.
[38] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11264–11272, 2019.
[39] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In The International Conference on Machine Learning (ICML), pages 4092–4101, 2018.
[40] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
[41] E. Tartaglione, S. Lepsøy, A. Fiandrotti, and G. Francini. Learning sparse neural networks via sensitivity-driven regularization. In The Conference on Neural Information Processing Systems (NeurIPS), pages 3878–3888, 2018.
[42] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li.
Learning structured sparsity in deep neural networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 2074–2082, 2016.
[43] J. Ye, X. Lu, Z. Lin, and J. Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations (ICLR), 2018.
[44] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4133–4141, 2017.
[45] J. Yu and T. Huang. Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728, 2019.
[46] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017.
[47] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(10):1943–1955, 2016.
[48] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.