{"title": "TETRIS: TilE-matching the TRemendous Irregular Sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 4115, "page_last": 4125, "abstract": "Compressing neural networks by pruning weights with small magnitudes can significantly reduce the computation and storage cost. Although pruning makes the model smaller, it is difficult to get a practical speedup on modern computing platforms such as CPUs and GPUs due to the irregularity. Structural pruning has attracted a lot of research interest as a way to make sparsity hardware-friendly. Increasing the sparsity granularity can lead to better hardware utilization, but it compromises the sparsity attainable while maintaining accuracy.\n\nIn this work, we propose a novel method, TETRIS, to achieve both better hardware utilization and higher sparsity. Just like a tile-matching game, we cluster the irregularly distributed weights with small values into structured groups by reordering the input/output dimensions, and then structurally prune them. Results show that it achieves sparsity comparable to irregular element-wise pruning with negligible accuracy loss. The experiments also show ideal speedup, proportional to the sparsity, on GPU platforms. 
Our proposed method provides a new solution toward algorithm and architecture co-optimization for accuracy-efficiency trade-off.", "full_text": "TETRIS: TilE-matching the TRemendous Irregular\n\nSparsity\n\nYu Ji1,2,3\n\nLing Liang3\n\nLei Deng3 Youyang Zhang1 Youhui Zhang1,2\u2217 Yuan Xie3\n\n{jiy15,zhang-yy15}@mails.tsinghua.edu.cn,zyh02@tsinghua.edu.cn\n\n1Department of Computer Science and Technology, Tsinghua University\n\n2Beijing Innovation Center for Future Chip\n\n{lingliang,leideng,yuanxie}@ece.ucsb.edu\n\n3Department of Electrical and Computer Engineering, University of California, Santa Barbara\n\nAbstract\n\nCompressing neural networks by pruning weights with small magnitudes can\nsigni\ufb01cantly reduce the computation and storage cost. Although pruning makes\nthe model smaller, it is dif\ufb01cult to get a practical speedup in modern computing\nplatforms such as CPU and GPU due to the irregularity. Structural pruning has\nattracted a lot of research interest to make sparsity hardware-friendly. Increasing the\nsparsity granularity can lead to better hardware utilization, but it will compromise\nthe sparsity for maintaining accuracy.\nIn this work, we propose a novel method, TETRIS, to achieve both better hardware\nutilization and higher sparsity. Just like a tile-matching game2, we cluster the\nirregularly distributed weights with small value into structured groups by reordering\nthe input/output dimension and structurally prune them. Results show that it\ncan achieve comparable sparsity with the irregular element-wise pruning and\ndemonstrate negligible accuracy loss. The experiments also show ideal speedup,\nwhich is proportional to the sparsity, on GPU platforms. 
Our proposed method\nprovides a new solution toward algorithm and architecture co-optimization for the\naccuracy-efficiency trade-off.\n\n1 Introduction\n\nDeep neural networks (DNNs) have achieved great success in a wide spectrum of applications, such\nas computer vision [1, 2, 3], speech recognition [4, 5], and language translation [6]. However, the\nhuge memory overhead and intensive computation requirement limit the execution performance\non cloud platforms and also impede porting onto edge devices with constraints on resources\nand energy. Model compression by pruning can significantly reduce the storage and computation\ncost [7, 8, 9, 10, 11, 12]. These methods achieve impressive compression rates by pruning weights with\nsmall magnitudes and then retraining the model to recover the accuracy.\nAlthough pruning makes the model smaller, it requires specialized sparse BLAS libraries or customized\nhardware [13, 14, 15, 16, 17, 18] to accelerate its execution. On general computing platforms (e.g.,\nCPU and GPU), the performance of the pruned sparse model may be even worse than that of the original\ndense one if the sparsity is not sufficiently high [19, 20]. This is because these general computing\n\n\u2217Corresponding Author\n2 A tile-matching game is a type of game where the player manipulates tiles in order to make them disappear\naccording to a matching criterion. Tetris is one of the most famous tile-matching games. Our approach does the\nsame thing: it clusters the unimportant items and structurally prunes them.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fplatforms are usually optimized for contiguous data access and computation. Sparse weight matrices\nlose the regular structure of dense matrices, so extra computation is needed to decode the sparse format.\nIn addition, sparsity also creates irregular memory accesses, which are extremely slow in modern memory\nsystems. 
Sparsity at this fine-grained granularity is not hardware-friendly.\nRecent studies on structured sparsity [21, 22, 19, 23, 24, 20] have yielded better performance\nimprovements. They usually group weight elements into small dense regions and prune them at the\ngranularity of groups. Different kinds of grouping methods have been proposed to achieve better\nacceleration and higher sparsity. Many of them use coarse-grained groups such as channel-wise\npruning and row/column-wise pruning [21, 22, 19, 23, 24]. These approaches shrink the size\nof some dimensions, and the remaining operations keep a dense structure of smaller size. However, they\nusually sacrifice sparsity in order to maintain accuracy. Some other studies [20] introduce very\nsmall groups with limited numbers of adjacent elements. They can achieve sparsity similar to\nelement-wise pruning, but the performance increase is still far from the ideal expectation. Here\nideal performance means that the reduction in execution time is proportional to the removed\noperations.\nSimply increasing the sparsity granularity can help improve hardware utilization but compromises the\nsparsity. Instead of carefully making trade-offs between the two, we propose a reordering method\nto achieve both better hardware utilization and higher sparsity. The key idea is to cluster elements\nwith small magnitudes closer together by reordering the input and output dimensions before pruning at a\ncoarse granularity. This transforms irregularly distributed small elements into dense regions, which can\nbe removed for coarse-grained sparsity. Our method achieves sparsity comparable to irregular\nfine-grained pruning and maintains the model accuracy to a great extent. Meanwhile, we achieve\nsignificant performance improvement due to the coarse-grained pruning granularity. 
The overhead\nintroduced at runtime to reorder the input and output dimensions is negligible compared to the saved time.\nIt is worth noting that our approach is orthogonal to other structured pruning methods. By reordering\nthe input and output dimensions, we can always increase the sparsity of pruned networks generated\nby those pruning methods. In this paper, we take block sparsity [25, 24] as a case study, but it is\npossible to extend the approach to other structured sparsity patterns.\n\n2 Related Work\n\nHan et al. [8, 9] first proposed the deep compression and pruning approach, which removed more than 90%\nof the parameters of AlexNet and VGG16. However, it is very difficult to enjoy the benefits for speedup\non GPU due to the irregular sparsity [19, 20]. Even when deep compression is implemented through\nsparse matrix encoding on specialized accelerators [13, 14], the indexing overhead is still costly.\nMoreover, the compressed model leads to irregular memory access patterns, which are extremely slow in\nmodern memory systems. It is difficult to leverage the sparsity in NNs for performance improvement.\nSubsequent studies approached structured sparsity from different perspectives. The\nstudies on medium-grained sparsity (e.g. row/column level) [19, 26] produced a more regular\npattern via L2-norm group-lasso optimization. However, this row or column sparsity is still not\nthe favorite grain of general computing platforms. Coarser-grained sparsity (e.g. filter level) was\nobtained [10, 11, 12, 21, 22] by formulating and solving various optimization problems, such\nas filter scaling, group lasso, importance prediction, lower-rank forcing, etc. Nevertheless, the\nsparsity generated by these coarse-grained pruning methods is usually compromised in order to\nmaintain accuracy. 
They achieved sparsity similar to element-wise pruning, but the performance\nimprovement was still limited.\n\n3 Sparsity Granularity\n\nPruning methods try to find a boolean mask tensor M that marks the pruned elements (1 for\npruned elements; 0 for preserved elements) of a given weight tensor W, such that the norm of the\nremoved values is minimized (the preserved weights are W \u21d0 (1 \u2212 M) \u2299 W) under certain structural constraints.\nHere \u2299 represents element-wise multiplication. In this paper, we denote any pruning method for\nfinding M as a mapping P(\u00b7) with M = P(W). The model is then retrained to recover accuracy, with\nthe pruned weights held at zero. The generation of M is usually done by partitioning\n\n\f(a) Speedup vs. sparsity under different block sizes on\nthe conv4-2 layer from VGG16.\n\n(b) Accuracy vs. sparsity under different block sizes\non VGG16.\n\nFigure 1: Trade-offs between sparsity degree and hardware utilization with different pruning granularity.\n\nthe tensor into many dense regions and then selecting unimportant regions to prune. The size of the\ndense region is the granularity of sparsity.\nThe granularity of sparsity affects both the sparsity degree and the hardware utilization. We test\nthe execution performance under different block sizes and sparsities. The blocksparse library [25], an\nopen-source GPU kernel for block sparsity, is used for the evaluation on a Titan V GPU. Figure 1a\nshows the result of a 512 \u00d7 512 \u00d7 3 \u00d7 3 convolution on a 28 \u00d7 28 \u00d7 512 input feature map, which\nis the most computation-intensive layer in VGG16 [1] for ImageNet [27] recognition. The baseline\nis the dense computation in Pytorch [28] with cuBLAS as the backend. The block sparsity is along the\nchannel dimensions (512 here). Deep Compression [9] reported a sparsity of 73% for that layer, so\nwe take this sparsity for detailed analysis. 
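The mask-generation pipeline described above (partition W into dense regions, score each region, prune the least important ones) can be sketched for the 2-D block case. This is a toy illustration under the paper's mask convention (1 marks a pruned element), not the paper's implementation; `block_prune_mask` is a hypothetical helper and all sizes are arbitrary:

```python
import numpy as np

def block_prune_mask(W, block=16, sparsity=0.5):
    """Toy block-wise P(W): returns a mask that is 1 on PRUNED blocks.
    Blocks are scored by their L1 norm; the smallest ones are pruned.
    Assumes both dimensions of W are divisible by `block`."""
    m, n = W.shape
    # Per-block L1 importance: partition into (block x block) dense regions.
    imp = np.abs(W).reshape(m // block, block, n // block, block).sum(axis=(1, 3))
    k = int(round(sparsity * imp.size))      # number of blocks to prune
    M_blocks = np.zeros_like(imp, dtype=bool)
    if k > 0:
        idx = np.argsort(imp, axis=None)[:k]  # k least-important blocks
        M_blocks.flat[idx] = True
    # Expand the block-level mask back to element granularity.
    return np.kron(M_blocks, np.ones((block, block), dtype=bool))

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
M = block_prune_mask(W, block=16, sparsity=0.5)
W_pruned = (1 - M) * W                        # keep only the preserved elements
assert M.mean() == 0.5 and W_pruned[M].sum() == 0
```

The per-block L1 norm plays the role of the importance measure here, matching the block-sparsity baseline the paper adopts later in its experiments.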
At this sparsity level, we can see that when the block size\nis less than 32, no practical performance improvement is gained. In contrast, when we set the block\nsize to 32 or larger, the speedup approaches the ideal case. Therefore, the sparsity granularity, i.e.\nthe block size in this paper, should be sufficiently large to fully utilize the hardware.\nOn the other hand, when we increase the sparsity granularity too much, the accuracy drops\nsignificantly. We also use VGG16 as an example to test the relationship between accuracy and\nsparsity under different block sizes. Since the dimensions of the first few layers are not large enough,\nwe enforce at least two blocks for those small layers in our test. For example, for the first convolution\nlayer with a kernel dimension of 64 \u00d7 3 \u00d7 3 \u00d7 3, we use a block size of 32 \u00d7 3 \u00d7 3 \u00d7 3. As shown in\nFigure 1b, even with a small block size of 8 \u00d7 8, the accuracy still drops significantly.\nAlthough we use block sparsity as an example, the trade-off between accuracy and hardware utilization\n(determined by the sparse pattern and sparsity) is prevalent. The unimportant elements are usually\nirregularly distributed, and grouping them into regular patterns greatly restricts the flexibility\nof selecting those unimportant elements, which in turn compromises the sparsity attainable while maintaining accuracy.\n\n4 Reordering Irregular Sparsity\n\nInstead of struggling to achieve a good trade-off among sparsity, hardware utilization, and\naccuracy by selecting a proper grouping granularity, we propose an orthogonal approach that\nclusters unimportant elements together to form regular structures. 
It enables structured pruning\nalgorithms to achieve both higher sparsity and lower accuracy loss.\nLet W \u2208 R^{m\u00d7n} and B \u2208 R^n denote the weight matrix and bias vector of a fully-connected (FC)\nlayer, where the numbers of input and output neurons are m and n, respectively. If we feed in a batch of inputs\nX \u2208 R^{N\u00d7m} with N samples, we get Y = \u03c3(XW + B), where \u03c3 is an element-wise activation\noperation. Now we introduce two permutations \u03b1 and \u03b2 for the two dimensions of W:\n\n\u03b1 = (1, 2, . . . , m \u21a6 a_1, a_2, . . . , a_m),   \u03b2 = (1, 2, . . . , n \u21a6 b_1, b_2, . . . , b_n). (1)\n\nThen, the layer computation can be rewritten as\n\nY[I; \u03b2] = \u03c3(X[I; \u03b1] W[\u03b1; \u03b2] + B[\u03b2]), (2)\n\nwhere W[\u03b1; \u03b2] denotes the matrix in which the rows and columns are reordered according to\npermutations \u03b1 and \u03b2, and I is the unit permutation.\n\n\fFigure 2: Reordering to cluster elements with similar magnitude for structured pruning.\n\nAfter reordering, we can first apply \u03b1 on X and feed it into a new fully-connected layer with weight\nmatrix W[\u03b1; \u03b2] and bias B[\u03b2] to get Y[\u03b2]. Then we only need to apply the inverse permutation \u03b2^{\u22121}\nto get the original result Y. The overhead introduced during runtime is only the permutation of the input and output. 
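The reordering identity above is easy to check numerically. A minimal NumPy sketch (all sizes are arbitrary, ReLU stands in for the activation, and the batch-first X·W layout is used so that the shapes line up):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 6, 4, 3                     # input dim, output dim, batch size (illustrative)
W = rng.standard_normal((m, n))       # weight matrix
b = rng.standard_normal(n)            # bias
X = rng.standard_normal((N, m))       # a batch of inputs
relu = lambda z: np.maximum(z, 0.0)   # element-wise activation

alpha = rng.permutation(m)            # permutation of the input dimension
beta = rng.permutation(n)             # permutation of the output dimension

Y = relu(X @ W + b)                                        # original layer
Y_perm = relu(X[:, alpha] @ W[alpha][:, beta] + b[beta])   # reordered layer

# The reordered layer computes exactly Y with its output columns permuted by beta,
# so applying the inverse permutation recovers the original result.
assert np.allclose(Y_perm, Y[:, beta])
assert np.allclose(Y_perm[:, np.argsort(beta)], Y)
```

Only the input and output permutations survive at runtime; the weight and bias reordering is done once, offline.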
But it provides us\nthe flexibility to choose a good \u03b1 and \u03b2 that cluster irregularly distributed small elements together, so\nthat we gain more sparsity and better hardware utilization with less accuracy loss.\nFor convolution, we can reorder the weights similarly:\n\nY[I; \u03b2; I; I] = \u03c3(W[\u03b2; \u03b1; I; I] \u2297 X[I; \u03b1; I; I] + B[\u03b2]), (3)\n\nwhere W is the weight kernel, X is the input feature map, Y is the output feature map, and B\nis the bias vector. \u03b1 and \u03b2 are permutations on the input-channel and output-channel dimensions,\nrespectively.\nAs shown in Figure 2, we can leverage the flexibility of permutation and properly reorder the indices\nof each dimension to cluster elements according to their magnitudes, which enables better\nsparsity for structured pruning algorithms.\n\n4.1 Reordering Algorithm\n\nFor generality, we assume that the weight tensor has d dimensions. A pruning method P that\ngenerates a mask M = P(W) usually tries to find a good mask so that the norm of the pruned values\nis minimized:\n\nP(W) = arg min_M \u2016M \u2299 W\u2016   s.t. M satisfies structural constraints. (4)\n\nThe norm used here depends on the pruning algorithm.\nIn this paper, we introduce a new degree of freedom, permutations of the weight tensor, to further minimize\nthe objective. We have the opportunity to choose the best permutations \u03b1_1, . . . , \u03b1_d to minimize the\npruned values, i.e.\n\nmin_{M, \u03b1_i, i \u2208 \u2126} \u2016M \u2299 W[\u03b1_1; . . . ; \u03b1_d]\u2016   s.t. M satisfies structural constraints. (5)\n\nThe dimensions that can be reordered are denoted by the set \u2126; the others are fixed to the unit permutation.\nSolving the minimization problem analytically is difficult. The target is to generate permutations\nthat cluster unimportant elements into structured dense regions. 
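On a tiny example, objective (5) can even be brute-forced, which makes the point concrete: adding permutations can only lower the pruned norm, since the unit permutation is always among the candidates. `pruned_norm` is a hypothetical helper and the 4 × 4 size is purely illustrative:

```python
import itertools
import numpy as np

def pruned_norm(W, M, perms):
    """L1 norm of the masked entries of W[alpha_1; ...; alpha_d] (objective of Eq. (5))."""
    Wp = W
    for axis, p in enumerate(perms):
        Wp = np.take(Wp, p, axis=axis)
    return np.abs(M * Wp).sum()

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 4))
M = np.zeros((4, 4))
M[:2, :] = 1                      # structural constraint: two whole rows are pruned
ident = np.arange(4)

base = pruned_norm(W, M, [ident, ident])            # no reordering
best = min(pruned_norm(W, M, [np.array(p), ident])  # brute-force row permutations
           for p in itertools.permutations(range(4)))
assert best <= base
```

Brute force is only feasible at this toy scale; the paper's EM/greedy search below is the practical substitute.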
This is similar to a k-means\nproblem that can be solved iteratively by the expectation-maximization (EM) algorithm.\n\n\u2022 E-Step: Fix the permutations \u03b1_i = \u03b1_i^{(t)} and generate a mask using the given pruning method P, i.e.\n\nM^{(t+1)} = arg min_M \u2016M \u2299 W[\u03b1_1^{(t)}; . . . ; \u03b1_d^{(t)}]\u2016 = P(W[\u03b1_1^{(t)}; . . . ; \u03b1_d^{(t)}])   s.t. M satisfies structural constraints. (6)\n\n\u2022 M-Step: Fix the mask M^{(t+1)} and find optimized permutations \u03b1_i^{(t+1)} such that the masked values are minimized, i.e.\n\n\u03b1_1^{(t+1)}, . . . , \u03b1_d^{(t+1)} = arg min_{\u03b1_i, i \u2208 \u2126} \u2016M^{(t+1)} \u2299 W[\u03b1_1; . . . ; \u03b1_d]\u2016. (7)\n\nWe can use unit permutations as the initial configuration for all dimensions and run the above EM\nalgorithm iteratively until it converges. However, the M-Step is itself an optimization problem. Since\nthe permutations of different dimensions are highly coupled, it is difficult to generate optimized\npermutations for all dimensions at once. We use an alternating minimization (AM) scheme that optimizes\nthe dimensions separately and iteratively. Each time, we fix the other dimensions and optimize only\none dimension D, which can be described as\n\narg min_{\u03b1_D} \u2016M \u2299 W[\u03b1_1; . . . ; \u03b1_d]\u2016. (8)\n\nThe possible permutations still have a large search space. 
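The swap-gain machinery that makes this search tractable (Eqs. (9)-(11), described next) can be sketched for the row dimension of a 2-D weight; both function names are hypothetical and the example is deliberately tiny:

```python
import numpy as np

def swap_gain_matrix(W, M):
    """S[i, j] = sum_k |W[i, k]| * M[j, k]  (Eq. (9): contract over the other dimension).
    G[i, j] = S[i, i] + S[j, j] - S[i, j] - S[j, i]  (Eqs. (10)-(11)): the decrease
    in pruned magnitude obtained by swapping rows i and j of W."""
    S = np.abs(W) @ M.T
    L = np.diag(S)
    G = L[:, None] + L[None, :] - S - S.T
    return S, G

def greedy_reorder_rows(W, M, eps=1e-9):
    """Greedily swap the row pair with the largest gain until no swap helps.
    perm[i] gives the original index of the row now at position i."""
    W, perm = W.copy(), np.arange(W.shape[0])
    while True:
        _, G = swap_gain_matrix(W, M)
        i, j = np.unravel_index(np.argmax(G), G.shape)
        if G[i, j] <= eps:
            return W, perm
        W[[i, j]] = W[[j, i]]
        perm[[i, j]] = perm[[j, i]]

# The mask prunes row 0; moving the small row under the mask shrinks the pruned mass.
W = np.array([[5.0, 5.0], [1.0, 1.0]])
M = np.array([[1.0, 1.0], [0.0, 0.0]])
Wr, perm = greedy_reorder_rows(W, M)
assert (M * np.abs(Wr)).sum() == 2.0 and perm.tolist() == [1, 0]
```

Each accepted swap strictly decreases the pruned L1 mass, so the loop terminates; in the 2-row example the first swap moves the small row under the mask and the search stops.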
We start from the unit permutation and\ngreedily swap, at each step, the pair of indices that most decreases the objective (8), until convergence.\nFinding the index pair depends on the exact form of the norm in the pruning algorithm P. For\nexample, if the L1 norm is employed, we use the absolute value of W as its importance tensor. For\nthe L2 norm, the importance tensor is the element-wise square of W. Without loss of generality, we use\nthe L1 norm as an example. We first contract the importance tensor of W with M along all dimensions\nexcept the D-th dimension, i.e.\n\nS_ij = \u2211_{k_1, ..., k_{D\u22121}, k_{D+1}, ..., k_d} |W_{k_1, ..., k_{D\u22121}, i, k_{D+1}, ..., k_d}| M_{k_1, ..., k_{D\u22121}, j, k_{D+1}, ..., k_d}. (9)\n\nThe result S is a square matrix whose size equals that of the current dimension D. The element S_ij\nis the total masked value obtained when the i-th slice of W is masked by the j-th slice of M. Thus,\nthe decrease of the objective function gained by swapping the i-th and the j-th slices is G_ij, as shown in\nEquation (10), which can also be written in the matrix form (11), where L is the diagonal vector of S:\n\nG_ij = S_ii + S_jj \u2212 S_ij \u2212 S_ji, (10)\n\nG = L + L^T \u2212 S \u2212 S^T. (11)\n\nThen, we only need to find the maximum element G_ij of G and swap the i-th and j-th slices in\nthe D-th dimension of W. The procedure is described in Algorithm 1.\n\nAlgorithm 1 Reordering algorithm\nInput: pruning method P, d-order weight tensor W, reordering dimension set \u2126\nOutput: permutations \u03b1_i\nInitialize all \u03b1_i = I\nrepeat\n  M = P(W)\n  for D \u2208 \u2126 do\n    repeat\n      Compute the tensor contraction S = |W| M over dimensions {1, . . . , D \u2212 1, D + 1, . . . , d}\n      L = diagonal(S)\n      G = L + L^T \u2212 S \u2212 S^T\n      i, j = arg max(G)\n      Swap the i-th and j-th slices in the D-th dimension of W\n      Swap the i-th and j-th indices of \u03b1_D\n    until max(G) \u2264 \u03b5, where \u03b5 is a small enough positive number\n  end for\nuntil convergence\n\n\f4.2 Pruning Overhead Optimization\n\nThe algorithm includes three nested loops, in which the inner loop takes more iterations to\nconverge than the other two. The inner loop contains a tensor contraction, which is computationally\nintensive. Although we only need to run the pruning algorithm once before fine-tuning, the overhead\nis still too large. Fortunately, there is large computational redundancy, so we can reuse many\nintermediate results from the previous iteration. At each iteration of the inner loop, the only\nupdate is the swap of the i-th and j-th slices in the D-th dimension of W. Thus, for S, we only need to\nswap the i-th and j-th rows without recomputing the tensor contraction. Consequently, we can move\nthe computationally intensive tensor contraction to the outer loop.\nThe rest of the computations in the inner loop are all element-wise or max operations,\nwhose computational complexity is proportional to the size of the matrix S. However, for some large\nlayers, the size of the corresponding S is n^2, and it is still time-consuming to perform element-wise\nor max operations over S. For example, the first fully-connected layer of VGG16 [1] has\nan input of size 25088, and our algorithm consumes more than 2 hours on it. We can further optimize\nthese O(n^2) operations as follows: (i) Since the update on S is only a swap of two rows i and j, G only\nneeds its i-th and j-th rows and columns recomputed, which is O(n) complexity. (ii) To optimize\nfinding the maximum value in G, we maintain two vectors of size n that record\nthe maximum value in each row of G and their indices. 
Each time we recompute the i-th and\nj-th rows and columns, we first recompute the maximum values of the i-th and j-th rows, which\nis O(n) complexity. For the remaining rows, if the original maximum value is not in the i-th or j-th\ncolumn, we only need to compare the new values in those two columns with the original maximum\nvalue of each row, which is also O(n) complexity. Otherwise, we recompute the maximum value of\nthose rows; the complexity is O(nr), where r is the number of elements in the i-th or j-th column\nthat hold the maximum values of their rows. In practice, r is usually far less than n. With these\noptimizations, we can prune the entire large VGG16 model in less than 1 minute on a Titan V GPU,\nwhich is negligible compared to the fine-tuning time.\n\n4.3 Runtime Overhead Optimization\n\nThe overhead introduced in the inference phase is that we have to reorder the input and output of\nall layers according to the generated permutations. This is a pure data-movement operation from one\ndense tensor to another, which can fully utilize the GPU bandwidth. It only takes about 4% of the\ncomputation time of the normal layers. Considering the benefits it brings, this small overhead is\nnegligible.\nIn addition, since the layers between two adjacent weighted layers are usually activation functions\nor pooling operations along the kernel dimensions (not the permuted channel dimensions),\nwe can merge the output permutation of the previous layer with the input permutation of the next\none. Thus, on average, each layer only requires one reordering data movement. With the optimized\nruntime overhead, we are able to bring the speedup of these layers close to the ideal case.\n\n5 Experiments\n\nOur reordering algorithm is general: it is orthogonal to existing structural pruning algorithms. 
Thus, we use one typical structural\npruning algorithm, block sparsity, as a case study to show how our algorithm can improve its sparsity\nand granularity.\nBlock sparsity first partitions the weight tensor into small blocks with a grid and then prunes the\nweight tensor at the granularity of blocks. In our experiment, we use the L1 norm of each block to\nmeasure the importance of the block and prune the blocks with smaller importance values. Block\nsparsity without reordering is our baseline algorithm.\nWe implement our reordering and pruning method in Pytorch [28]. The reordering algorithm\ngenerates a mask and permutations for each weighted layer. We reorder the mask according to the\ninverse permutations to generate a permuted mask and use it to filter the elements of the original\nweights, so that we do not need to reorder the inputs and outputs during retraining. We test our method\non three networks of different scales: LeNet on MNIST, VGG14 on CIFAR-10, and VGG16 on\nImageNet. The first two models are trained from scratch to get the baseline accuracy, and the last\n\n\f(a) Original kernel\n\n(b) After reordering\n\n(c) After pruning\n\nFigure 3: Reordering and pruning.\n\n(a) LeNet on MNIST\n\n(b) VGG14 on CIFAR-10\n\nFigure 4: Sparsity vs. accuracy under different block sizes for LeNet on MNIST and VGG14 on\nCIFAR-10.\n\nmodel is obtained from torchvision [29]. For retraining the pruned VGG16, we use a learning rate of\n0.001 and retrain for 20 epochs. In all of our tests, if the block size is larger than the channel dimension\nof a layer, we reduce the block size for that layer to ensure that each layer has at least two\nblocks.\nFigure 3 visualizes the process of reordering and pruning the second convolutional layer of VGG16.\nWe sum over the kernel dimensions to visualize it as a 2D image. Figure 3a shows the original weights.\nAfter reordering, the elements are distributed as in Figure 3b. 
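The permuted-mask trick used for retraining rests on a simple identity: masking the original weights with the back-permuted mask and then reordering gives the same tensor as reordering first and masking in the reordered space. A NumPy sketch with arbitrary sizes (the mask convention is 1 = pruned, as in Section 3):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 8, 8
W = rng.standard_normal((m, n))
alpha = rng.permutation(m)            # permutations found by the reordering step
beta = rng.permutation(n)
M = rng.random((m, n)) < 0.5          # mask computed in the REORDERED space

# Permute the mask back to the original space with the inverse permutations.
inv_a, inv_b = np.argsort(alpha), np.argsort(beta)
M_orig = M[inv_a][:, inv_b]

# Masking the original weights and then reordering equals reordering then masking,
# so retraining can run on (W * ~M_orig) without touching inputs or outputs.
left = (W * ~M_orig)[alpha][:, beta]
right = W[alpha][:, beta] * ~M
assert np.allclose(left, right)
```

This is why, during fine-tuning, only the weights need to carry the permutation information; the data path of the network stays unchanged.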
Then, we can prune the blocks with\nsmall values to get the final weights in Figure 3c.\n\n5.1 Models on MNIST and CIFAR-10\n\nWe first run experiments on one small model, LeNet on MNIST, and one medium model, VGG14 on\nCIFAR-10. The accuracies of the original LeNet and VGG14 are 99.3% and 93.96%, respectively.\nFigure 4 shows the relationship between sparsity and accuracy under different block sizes. From the\nfigure, we can see that the block size has little impact on accuracy, thanks to our reordering method. We\nachieve about 10\u00d7 and 3\u00d7 speedup on the two models, respectively.\nNote that, for simplicity, we set the same pruning rate for all layers, which is not the optimal\nconfiguration. The accuracy can be improved by setting a lower pruning rate for some sensitive\nlayers (e.g. the first and the last layer). However, searching the hyper-parameter space to improve\nthe sparsity-accuracy curve is orthogonal to our approach. Our contribution is to make the curves\nof larger block sizes closer to those of smaller block sizes. In this way, we enable great performance\nimprovement with only a little accuracy degradation.\n\n5.2 VGG16 on ImageNet\n\nWe also test our approach on a large-scale model, VGG16 on the ImageNet dataset. For simplicity, we\nuse the pruning rates from Deep Compression [9] as the basic configuration. They can prune 92% of\n\n\f(a) Accuracy vs. model size reduction\n\n(b) Accuracy vs. computation reduction\n\nFigure 5: Top-1 accuracy vs. sparsity under different block sizes for VGG16 on ImageNet. 
Without\nour reordering method, the accuracy in all cases drops by 2.3% \u223c 6.0% compared to the 1 \u00d7 1 case.\nWith reordering, the accuracy drop shrinks to 0.1% \u223c 2.4%.\n\nFigure 6: Top-1 accuracy vs. speedup under different sparsities and block sizes for VGG16 on\nImageNet. For each block size, we test computation sparsities from 0% to 68%. For the best block\nsize configuration, 32 \u00d7 32, the accuracy decreases from 73.36% to 71.004%. Compared to the\n1 \u00d7 1 case, which decreases to 72.074%, the accuracy drop is small. But the speedup increases to\n2.35\u00d7 over the dense baseline, while the performance of the 1 \u00d7 1 case is much worse than the\nbaseline.\n\nthe model size and 68% of the operations of VGG16. In addition, we test configurations in which\nthe pruning rates of all layers gradually decrease by 10%, 20%, and 30%.\nFigure 5 shows the relationship between sparsity and accuracy. Figure 5a reports the sparsity in terms of\nmodel size and Figure 5b the sparsity in terms of computation. From the perspective of acceleration,\nthe sparsity of computation is more relevant. Compared to the baseline without reordering, whose\naccuracy drops significantly, our approach can keep the curve of block-wise pruning close to the\nelement-wise case (the 1 \u00d7 1 case).\n\n5.3 Speedup vs. granularity\n\nThe granularity of sparsity affects the speedup in two ways: coarser granularity leads to better\nhardware utilization, which pushes the practical speedup closer to the ideal case, but it impairs\nthe pruning rate. For example, in Deep Compression [9], the pruning rate for the first layer of VGG16\nis 42%. However, when the block size increases to 32, the whole layer consists of only two blocks.\nWe can then prune either nothing or 50% of the weights. 
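A quick back-of-the-envelope view of why speedup is so sensitive to the pruning rate: the ideal speedup is 1/(1 − s) for computation sparsity s, so each lost slice of pruning rate costs a disproportionate share of the speedup (the sparsity values below are only illustrative):

```python
# Ideal speedup is inversely proportional to the remaining computation.
def ideal_speedup(sparsity):
    return 1.0 / (1.0 - sparsity)

# Dropping the pruning rate 10% at a time erodes the speedup faster and faster.
for s in (0.68, 0.58, 0.48):
    print(f"computation sparsity {s:.0%}: ideal speedup {ideal_speedup(s):.2f}x")
```

The practical speedups in Figure 6 sit below this ideal curve, but they inherit the same sensitivity to the pruning rate.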
For such layers, we have to decrease the\nblock size, which leads to smaller sparsity and less speedup.\nIn Figure 6, we plot the relationship between accuracy and speedup under different block\nsizes for VGG16. As the block size increases, the speedup first increases until the block size reaches\n32, then it begins to decrease. This critical point is caused by two factors. Typically, for most\nconvolutional layers in VGG16, we can almost make full use of the hardware when the block size is\nat least 32. If we keep increasing the block size, the sparsity may decrease because of the limited\n\n\fFigure 7: Speedup of the convolutional layers in VGG16 with different block sizes based on the Blocksparse\nlibrary [25]. The baseline is the dense implementation in Pytorch based on cuBLAS.\n\npruning rate required to maintain accuracy. Since the speedup is inversely proportional to the amount\nof remaining computation, a small change in the pruning rate may lead to a significant decrease in\nspeedup.\nDifferent layers have different critical points. Figure 7 shows the speedup of different convolutional\nlayers in VGG16 under different block sizes; the pruning rate configuration is the same as\nthat of Deep Compression. Layers with fewer channels achieve their best speedup when\nthe block size is set to 32. For layers with more channels, the critical point is 64.\n\n6 Conclusion\n\nCoarse-grained sparsity is usually beneficial for achieving higher speedup on parallel hardware, but\nit usually sacrifices sparsity or accuracy compared to fine-grained sparsity. 
In this paper, we\npresent a method to reorder irregular \ufb01ne-grained sparsity to structured coarse-grained sparsity to\nbridge the gap between the large sparsity we can gain from models and the poor practical speedup. It\ncan also help the \ufb01ne-grained pruning methods to achieve the ideal execution acceleration.\n\nAcknowledgement\n\nThis research was collaborative work of Tsinghua University and University of California, Santa\nBarbara. Thanks for the support from Beijing Innovation Center for Future Chip, Science and\nTechnology Innovation Special Zone project, and the National Science Foundations (NSF) under grant\nnumbers 1725447 and 1730309. We also thank OpenAI for their open-source library, blocksparse.\n\nReferences\n[1] K. Simonyan and A. Zisserman, \u201cVery deep convolutional networks for large-scale image\n\nrecognition,\u201d arXiv preprint arXiv:1409.1556, 2014.\n\n[2] K. He, X. Zhang, S. Ren, and J. Sun, \u201cDeep residual learning for image recognition,\u201d in\nProceedings of the IEEE conference on computer vision and pattern recognition, pp. 770\u2013778,\n2016.\n\n[3] J. Redmon and A. Farhadi, \u201cYolo9000: better, faster, stronger,\u201d arXiv preprint, vol. 1612, 2016.\n[4] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, \u201cConvolutional neural\nnetworks for speech recognition,\u201d IEEE/ACM Transactions on audio, speech, and language\nprocessing, vol. 22, no. 10, pp. 1533\u20131545, 2014.\n\n[5] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper,\nB. Catanzaro, Q. Cheng, G. Chen, et al., \u201cDeep speech 2: End-to-end speech recognition in\nenglish and mandarin,\u201d in International Conference on Machine Learning, pp. 173\u2013182, 2016.\n[6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao,\nK. 
Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[7] A. Ardakani, C. Condo, and W. J. Gross, “Sparsely-connected neural networks: Towards efficient VLSI implementation of deep neural networks,” arXiv preprint arXiv:1611.01427, 2016.
[8] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
[9] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[10] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763, IEEE, 2017.
[11] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
[12] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.
[13] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254, IEEE, 2016.
[14] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y.
Wang, et al., “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84, ACM, 2017.
[15] Y. Lu, L. Gong, C. Xu, F. Sun, Y. Zhang, C. Wang, and X. Zhou, “A high-performance FPGA accelerator for sparse neural networks: Work-in-progress,” in Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion, p. 12, ACM, 2017.
[16] A. Page, A. Jafari, C. Shea, and T. Mohsenin, “SPARCNet: A hardware accelerator for efficient deployment of sparse convolutional networks,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, p. 31, 2017.
[17] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 27–40, ACM, 2017.
[18] C.-Y. Lin and B.-C. Lai, “Supporting compressed-sparse activations and weights on SIMD-like accelerator for sparse convolutional neural networks,” in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 105–110, IEEE, 2018.
[19] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
[20] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, “Scalpel: Customizing DNN pruning to the underlying hardware parallelism,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 548–560, ACM, 2017.
[21] J.-H. Luo, J. Wu, and W.
Lin, “ThiNet: A filter level pruning method for deep neural network compression,” arXiv preprint arXiv:1707.06342, 2017.
[22] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li, “Coordinating filters for faster deep neural networks,” CoRR, abs/1703.09746, 2017.
[23] X. Sun, X. Ren, S. Ma, and H. Wang, “meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting,” arXiv preprint arXiv:1706.06197, 2017.
[24] S. Narang, E. Undersander, and G. Diamos, “Block-sparse recurrent neural networks,” arXiv preprint arXiv:1711.02782, 2017.
[25] S. Gray, A. Radford, and D. P. Kingma, “GPU kernels for block-sparse weights,” 2017.
[26] W. Wen, Y. He, S. Rajbhandari, W. Wang, F. Liu, B. Hu, Y. Chen, and H. Li, “Learning intrinsic sparse structures within long short-term memory,” arXiv preprint arXiv:1709.05027, 2017.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR09, 2009.
[28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” 2017.
[29] S. Marcel and Y. Rodriguez, “Torchvision: The machine-vision package of Torch,” in Proceedings of the 18th ACM International Conference on Multimedia, pp.
1485–1488, ACM, 2010.