{"title": "Learning Structured Sparsity in Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2074, "page_last": 2082, "abstract": "High demand for computation resources severely hinders deployment of large-scale Deep Neural Networks (DNN) in resource constrained devices. In this work, we propose a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation. Experimental results show that SSL achieves on average 5.1X and 3.1X speedups of convolutional layer computation of AlexNet against CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about twice speedups of non-structured sparsity; (3) regularize the DNN structure to improve classification accuracy. The results show that for CIFAR-10, regularization on layer depth reduces a 20-layer Deep Residual Network (ResNet) to 18 layers while improves the accuracy from 91.25% to 92.60%, which is still higher than that of original ResNet with 32 layers. For AlexNet, SSL reduces the error by ~1%.", "full_text": "Learning Structured Sparsity in Deep Neural\n\nNetworks\n\nWei Wen\n\nUniversity of Pittsburgh\n\nwew57@pitt.edu\n\nChunpeng Wu\n\nUniversity of Pittsburgh\n\nchw127@pitt.edu\n\nYandan Wang\n\nUniversity of Pittsburgh\n\nyaw46@pitt.edu\n\nYiran Chen\n\nUniversity of Pittsburgh\n\nyic52@pitt.edu\n\nHai Li\n\nUniversity of Pittsburgh\n\nhal66@pitt.edu\n\nAbstract\n\nHigh demand for computation resources severely hinders deployment of large-scale\nDeep Neural Networks (DNN) in resource constrained devices. In this work, we\npropose a Structured Sparsity Learning (SSL) method to regularize the structures\n(i.e., \ufb01lters, channels, \ufb01lter shapes, and layer depth) of DNNs. 
SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN's evaluation. Experimental results show that SSL achieves on average 5.1× and 3.1× speedups of convolutional layer computation of AlexNet against CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about twice those of non-structured sparsity; (3) regularize the DNN structure to improve classification accuracy. The results show that for CIFAR-10, regularization on layer depth reduces a 20-layer Deep Residual Network (ResNet) to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still higher than that of the original ResNet with 32 layers. For AlexNet, SSL reduces the error by ~1%.

1 Introduction

Deep neural networks (DNN), especially deep Convolutional Neural Networks (CNN), have achieved remarkable success in visual tasks [1][2][3][4][5] by leveraging large-scale networks that learn from a huge volume of data. Deployment of such big models, however, is computation-intensive. To reduce computation, many studies have been performed to compress the scale of DNNs, including sparsity regularization [6], connection pruning [7][8] and low rank approximation [9][10][11][12][13]. Sparsity regularization and connection pruning, however, often produce non-structured random connectivity and thus irregular memory access that adversely impacts practical acceleration on hardware platforms. Figure 1 depicts the practical layer-wise speedup of AlexNet, which is non-structurally sparsified by ℓ1-norm regularization. Compared to the original model, the accuracy loss of the sparsified model is controlled within 2%. Because of the poor data locality associated with the scattered weight distribution, the achieved speedups are either very limited or even negative, even when the actual sparsity is high, say, >95%. 
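To make the bookkeeping behind Figure 1 concrete, here is a minimal sketch (a hypothetical numpy/scipy example, not the authors' code) that measures sparsity as the ratio of zeros in a non-structurally pruned matrix and stores it in CSR, the format used for the cuSPARSE baseline:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# A weight matrix pruned to >95% sparsity, with the surviving weights
# scattered at random positions (no structure), as l1-norm pruning tends to do.
w = rng.standard_normal((256, 1024))
mask = rng.random(w.shape) < 0.04          # keep roughly 4% of the weights
w_sparse = w * mask

sparsity = float(np.mean(w_sparse == 0))    # ratio of zeros
print(f"sparsity = {sparsity:.3f}")

# CSR storage: the column indices of the survivors are irregular, which is
# exactly the scattered access pattern that hurts practical speedup.
csr = csr_matrix(w_sparse)
print("stored nonzeros:", csr.nnz)
```

The matrix shape and the 4% keep rate are made-up illustration values; the point is only that high sparsity with unstructured nonzero positions yields irregular CSR column indices.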
We define sparsity as the ratio of zeros in this paper. In recently proposed low rank approximation approaches, the DNN is trained first and then each trained weight tensor is decomposed and approximated by a product of smaller factors. Finally, fine-tuning is performed to restore the model accuracy. Low rank approximation is able to achieve practical speedups because it coordinates model parameters in dense matrices and avoids the locality problem of non-structured sparsity regularization. However, low rank approximation can only obtain the compact structure within each layer, and the structures of the layers are fixed during fine-tuning, such that costly reiterations of decomposing and fine-tuning are required to find an optimal weight approximation for speedup and accuracy retention.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Evaluation speedups of AlexNet on GPU platforms and the sparsity. conv1 refers to convolutional layer 1, and so forth. Baseline is profiled by GEMM of cuBLAS. The sparse matrices are stored in the Compressed Sparse Row (CSR) format and accelerated by cuSPARSE.

Inspired by the facts that (1) there is redundancy across filters and channels [11]; (2) shapes of filters are usually fixed as cuboids, but enabling arbitrary shapes can potentially eliminate unnecessary computation imposed by this fixation; and (3) depth of the network is critical for classification, but deeper layers cannot always guarantee a lower error because of the exploding gradient and degradation problems [5], we propose the Structured Sparsity Learning (SSL) method to directly learn a compressed structure of deep CNNs by group Lasso regularization during training. 
SSL is a generic regularization that adaptively adjusts multiple structures in a DNN, including the structures of filters, channels, and filter shapes within each layer, and the structure of depth beyond the layers. SSL combines structure regularization (on the DNN for classification accuracy) with locality optimization (on memory access for computation efficiency), offering not only well-regularized big models with improved accuracy but also greatly accelerated computation (e.g., 5.1× on CPU and 3.1× on GPU for AlexNet). Our source code can be found at https://github.com/wenwei202/caffe/tree/scnn.

2 Related works

Connection pruning and weight sparsifying. Han et al. [7][8] reduced the parameters of AlexNet and VGG-16 using connection pruning. Since most of the reduction is achieved on fully-connected layers, no practical speedups of convolutional layers are observed, for the same issue shown in Figure 1. However, convolution is more costly, and many new DNNs use fewer fully-connected layers (e.g., only 3.99% of the parameters of ResNet-152 [5] are in fully-connected layers), so compression and acceleration of convolutional layers become essential. Liu et al. [6] achieved >90% sparsity of convolutional layers in AlexNet with 2% accuracy loss, and bypassed the issue of Figure 1 by hardcoding the sparse weights into the program. In this work, we also focus on convolutional layers. Compared to the previous techniques, our method coordinates sparse weights in adjacent memory space and achieves higher speedups. Note that hardware and program optimizations based on our method can further boost system performance, which is not covered in this paper due to space limitations.

Low rank approximation. Denil et al. [9] predicted 95% of the parameters in a DNN by exploiting the redundancy across filters and channels. Inspired by it, Jaderberg et al. [11] achieved a 4.5× speedup on CPUs for scene text character recognition and Denton et al. 
[10] achieved 2× speedups for the first two layers in a larger DNN. Both works used Low Rank Approximation (LRA) with ~1% accuracy drop. [13][12] improved and extended LRA to larger DNNs. However, the network structure compressed by LRA is fixed; reiterations of decomposing, training/fine-tuning, and cross-validating are still needed to find an optimal structure for the accuracy and speed trade-off. As the number of hyper-parameters in LRA methods increases linearly with the layer depth [10][13], the search space grows linearly or even exponentially. Compared to LRA, our contributions are: (1) SSL can dynamically optimize the compactness of DNNs with only one hyper-parameter and no reiterations; (2) besides the redundancy within the layers, SSL also examines the necessity of deep layers and removes unnecessary ones; (3) DNN filters regularized by SSL admit lower rank approximations, so SSL can work together with LRA for more efficient model compression.

Model structure learning. Group Lasso [14] is an efficient regularization for learning sparse structures. Liu et al. [6] utilized group Lasso to constrain the structure scale of LRA. To adapt DNN structures to different databases, Feng et al. [16] learned the appropriate number of filters in a DNN. Different from prior art, we apply group Lasso to regularize multiple DNN structures (filters, channels, filter shapes, and layer depth). The most closely related parallel work is Group-wise Brain Damage [17], which covers a subset (i.e., learning filter shapes) of our work and further justifies the effectiveness of our techniques.

Figure 2: The proposed Structured Sparsity Learning (SSL) for DNNs. The weights in filters are split into multiple groups. Through group Lasso regularization, a more compact DNN is obtained by removing some groups. 
The figure illustrates the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity that are explored in this work.

3 Structured Sparsity Learning Method for DNNs

We focus mainly on Structured Sparsity Learning (SSL) for convolutional layers to regularize the structure of DNNs. We first propose a generic method to regularize structures of a DNN in Section 3.1, and then specialize the method to the structures of filters, channels, filter shapes and depth in Section 3.2. Variants of the formulations are also discussed from a computational efficiency viewpoint in Section 3.3.

3.1 Proposed structured sparsity learning for generic structures

Suppose the weights of the convolutional layers in a DNN form a sequence of 4-D tensors W^(l) ∈ R^{N_l × C_l × M_l × K_l}, where N_l, C_l, M_l and K_l are the dimensions of the l-th (1 ≤ l ≤ L) weight tensor along the axes of filter, channel, spatial height and spatial width, respectively. L denotes the number of convolutional layers. Then the proposed generic optimization target of a DNN with structured sparsity regularization can be formulated as:

E(W) = E_D(W) + λ · R(W) + λ_g · Σ_{l=1}^{L} R_g(W^(l)).   (1)

Here W represents the collection of all weights in the DNN; E_D(W) is the loss on data; R(·) is non-structured regularization applied to every weight, e.g., the ℓ2-norm; and R_g(·) is the structured sparsity regularization on each layer. Because group Lasso can effectively zero out all weights in some groups [14][15], we adopt it in our SSL. The group Lasso regularization on a set of weights w can be represented as R_g(w) = Σ_{g=1}^{G} ||w^(g)||_g, where w^(g) is a group of partial weights in w and G is the total number of groups. Different groups may overlap. 
Here || · ||_g is the group Lasso norm, i.e., ||w^(g)||_g = sqrt( Σ_{i=1}^{|w^(g)|} (w_i^(g))^2 ), where |w^(g)| is the number of weights in w^(g).

3.2 Structured sparsity learning for structures of filters, channels, filter shapes and depth

In SSL, the learned "structure" is decided by the way of splitting the groups w^(g). We investigate and formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity in Figure 2. For simplicity, the R(·) term of Eq. (1) is omitted in the following formulations.

Penalizing unimportant filters and channels. Suppose W^(l)_{n_l,:,:,:} is the n_l-th filter and W^(l)_{:,c_l,:,:} is the c_l-th channel of all filters in the l-th layer. The optimization target of learning the filter-wise and channel-wise structured sparsity can be defined as

E(W) = E_D(W) + λ_n · Σ_{l=1}^{L} ( Σ_{n_l=1}^{N_l} ||W^(l)_{n_l,:,:,:}||_g ) + λ_c · Σ_{l=1}^{L} ( Σ_{c_l=1}^{C_l} ||W^(l)_{:,c_l,:,:}||_g ).   (2)

As indicated in Eq. (2), our approach tends to remove less important filters and channels. Note that zeroing out a filter in the l-th layer results in a dummy zero output feature map, which in turn makes a corresponding channel in the (l + 1)-th layer useless. 
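The regularizers in Eqs. (1)–(2) can be sketched in a few lines of numpy. This is an illustrative toy (the helper name `group_lasso` and the tensor sizes are our own, not part of the paper's Caffe implementation):

```python
import numpy as np

def group_lasso(W, axes):
    """Sum of l2 norms over the groups formed by slicing W along `axes`.

    For a 4-D conv tensor W of shape (N, C, M, K):
      axes=(1, 2, 3) -> one group per filter  (filter-wise term of Eq. 2)
      axes=(0, 2, 3) -> one group per channel (channel-wise term of Eq. 2)
    """
    return float(np.sqrt((W ** 2).sum(axis=axes)).sum())

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3, 5, 5))       # toy layer: 4 filters, 3 channels, 5x5

R_filter = group_lasso(W, axes=(1, 2, 3))   # sum over n of ||W[n, :, :, :]||
R_channel = group_lasso(W, axes=(0, 2, 3))  # sum over c of ||W[:, c, :, :]||

# A full SSL objective would add these to the data loss, e.g.
# E = E_D + lambda_n * R_filter + lambda_c * R_channel.
print(R_filter, R_channel)
```

Minimizing such a penalty with gradient descent tends to drive whole groups to exactly zero, which is what lets SSL delete entire filters or channels rather than scattered weights.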
Hence, we combine the filter-wise and channel-wise structured sparsity in the learning simultaneously.

Learning arbitrary shapes of filters. As illustrated in Figure 2, W^(l)_{:,c_l,m_l,k_l} denotes the vector of all corresponding weights located at spatial position (m_l, k_l) in the 2D filters across the c_l-th channel. Thus, we define W^(l)_{:,c_l,m_l,k_l} as the shape fiber related to learning arbitrary filter shapes, because a homogeneous non-cubic filter shape can be learned by zeroing out some shape fibers. The optimization target of learning the shapes of filters becomes:

E(W) = E_D(W) + λ_s · Σ_{l=1}^{L} ( Σ_{c_l=1}^{C_l} Σ_{m_l=1}^{M_l} Σ_{k_l=1}^{K_l} ||W^(l)_{:,c_l,m_l,k_l}||_g ).   (3)

Regularizing layer depth. We also explore the depth-wise sparsity to regularize the depth of DNNs in order to improve accuracy and reduce computation cost. The corresponding optimization target is E(W) = E_D(W) + λ_d · Σ_{l=1}^{L} ||W^(l)||_g. Different from the other sparsification techniques discussed, zeroing out all the filters in a layer will cut off message propagation in the DNN, so that the output neurons cannot perform any classification. Inspired by the structure of highway networks [18] and deep residual networks [5], we propose to leverage shortcuts across layers to solve this issue. 
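A minimal numpy sketch of why the shortcut rescues depth-wise sparsity (a toy linear layer stands in for the convolutional block between two shortcut endpoints; names and sizes are illustrative, not from the paper):

```python
import numpy as np

def residual_block(x, W):
    # Stand-in for the block between two shortcut endpoints:
    # output = shortcut + layer(x); ReLU, batch norm, etc. omitted.
    return x + W @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# Depth-wise sparsity has zeroed out every weight of this layer.
W_zero = np.zeros((8, 8))
y = residual_block(x, W_zero)

# The layer contributes nothing, but the feature map still flows through
# the shortcut: the block reduces to the identity and can simply be
# removed from the network.
print(np.allclose(y, x))  # -> True
```

Without the shortcut, `y` would be the all-zero vector and every later layer would receive no signal, which is exactly the cut-off propagation problem described above.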
As illustrated in Figure 2, even when SSL removes an entire unimportant layer, the feature maps will still be forwarded through the shortcut.

3.3 Structured sparsity learning for computationally efficient structures

All the schemes proposed in Section 3.2 can learn a compact DNN for computation cost reduction. Moreover, some variants of the formulations of these schemes can directly learn structures that can be efficiently computed.

2D-filter-wise sparsity for convolution. 3D convolution in DNNs is essentially a composition of 2D convolutions. To perform efficient convolution, we explore a fine-grained variant of filter-wise sparsity, namely 2D-filter-wise sparsity, which spatially enforces group Lasso on each 2D filter W^(l)_{n_l,c_l,:,:}. The saved convolution is proportional to the percentage of removed 2D filters. This fine-grained version of filter-wise sparsity can more efficiently reduce the computation associated with convolution: because the distance of the weights in a smaller group from the origin is shorter, group Lasso more easily drives the group to exactly zero.

Combination of filter-wise and shape-wise sparsity for GEMM. Convolutional computation in DNNs is commonly converted to GEneral Matrix Multiplication (GEMM) by lowering weight tensors and feature tensors to matrices [19]. For example, in Caffe [20], a 3D filter W^(l)_{n_l,:,:,:} is reshaped to a row in the weight matrix, where each column is the collection of weights W^(l)_{:,c_l,m_l,k_l} related to shape-wise sparsity. Combining filter-wise and shape-wise sparsity can therefore directly reduce the dimensions of the weight matrix in GEMM by removing zero rows and columns. 
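The row/column removal can be sketched as follows (a hypothetical numpy example; the lowering itself, e.g. Caffe's im2col, is assumed to have already produced the matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Lowered weight matrix for GEMM: one row per 3-D filter, one column
# per shape fiber.
W = rng.standard_normal((6, 10))
W[[1, 4], :] = 0.0        # filter-wise (row-wise) sparsity: filters zeroed by SSL
W[:, [2, 3, 7]] = 0.0     # shape-wise (column-wise) sparsity: fibers zeroed by SSL

X = rng.standard_normal((10, 5))   # lowered feature matrix

rows = np.any(W != 0.0, axis=1)    # surviving filters
cols = np.any(W != 0.0, axis=0)    # surviving shape fibers

# Compressed GEMM on the surviving 4x7 block: zero columns of W make the
# matching rows of X irrelevant, and zero rows of W would only produce
# all-zero output rows.
Y_small = W[np.ix_(rows, cols)] @ X[cols, :]

Y_full = W @ X
print(np.allclose(Y_full[rows, :], Y_small))  # -> True
```

The smaller multiply touches a 4×7 weight block instead of 6×10, yet reproduces every nonzero output row exactly, which is why this structured form maps directly onto fast dense GEMM kernels.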
In this context, we use row-wise and column-wise sparsity interchangeably with filter-wise and shape-wise sparsity, respectively.

4 Experiments

We evaluate the effectiveness of our SSL using published models on three databases: MNIST, CIFAR-10, and ImageNet. Unless otherwise stated, SSL starts from a network whose weights are initialized by the baseline, and speedups are measured in matrix-matrix multiplication by Caffe on a single-thread Intel Xeon E5-2630 CPU. Hyper-parameters are selected by cross-validation.

4.1 LeNet and multilayer perceptron on MNIST

In the MNIST experiments, we examine the effectiveness of SSL on two types of networks: LeNet [21] implemented by Caffe and a multilayer perceptron (MLP) network. Both networks were trained without data augmentation.

LeNet: When applying SSL to LeNet, we constrain the network with filter-wise and channel-wise sparsity in convolutional layers to penalize unimportant filters and channels. Table 1 summarizes the remaining filters and channels, floating-point operations (FLOP), and practical speedups. 
In the table, LeNet 1 is the baseline and the others are the results after applying SSL with different strengths of structured sparsity regularization.

Table 1: Results after penalizing unimportant filters and channels in LeNet

LeNet #      | Error | Filter # §  | Channel # § | FLOP §      | Speedup §
1 (baseline) | 0.9%  | 20–50       | 1–20        | 100%–100%   | 1.00×–1.00×
2            | 0.8%  | 5–19        | 1–4         | 25%–7.6%    | 1.64×–5.23×
3            | 1.0%  | 3–12        | 1–3         | 15%–3.6%    | 1.99×–7.44×
§ In the order of conv1–conv2

Table 2: Results after learning filter shapes in LeNet

LeNet #      | Error | Filter size § | Channel # | FLOP        | Speedup
1 (baseline) | 0.9%  | 25–500        | 1–20      | 100%–100%   | 1.00×–1.00×
4            | 0.8%  | 21–41         | 1–2       | 8.4%–8.2%   | 2.33×–6.93×
5            | 1.0%  | 7–14          | 1–1       | 1.4%–2.8%   | 5.19×–10.82×
§ The sizes of filters after removing zero shape fibers, in the order of conv1–conv2

The results show that our method achieves similar error (±0.1%) with far fewer filters and channels, saving significant FLOP and computation time.

To demonstrate the impact of SSL on the structures of filters, we present all learned conv1 filters in Figure 3. It can be seen that most filters in LeNet 2 are entirely zeroed out, except for the five most important detectors of stroke patterns, which are sufficient for feature extraction. The accuracy of LeNet 3 (which further removes the weakest and a redundant stroke detector) drops only 0.2% from that of LeNet 2. Compared to the random and blurry filter patterns in LeNet 1, which result from the high freedom of the parameter space, the filters in LeNet 2 & 3 are regularized and converge to smoother and more natural patterns. 
This explains why our proposed SSL obtains the same level of accuracy with far fewer filters. The smoothness of the filters is also observed in the deeper layers.

The effectiveness of the shape-wise sparsity on LeNet is summarized in Table 2. The baseline LeNet 1 has conv1 filters with a regular 5 × 5 square shape (size = 25), while LeNet 5 reduces the dimension to one that can be constrained by a 2 × 4 rectangle (size = 7). The 3D shape of the conv2 filters in the baseline is also regularized to a 2D shape in LeNet 5, within only one channel, indicating that only one filter in conv1 is needed. This fact significantly saves FLOP and computation time.

Figure 3: Learned conv1 filters in LeNet 1 (top), LeNet 2 (middle) and LeNet 3 (bottom)

MLP: Besides convolutional layers, our proposed SSL can be extended to learn the structure (i.e., the number of neurons) of fully-connected layers. We enforce the group Lasso regularization on all the input (or output) connections of each neuron. A neuron whose input connections are all zeroed out can degenerate to a bias neuron in the next layer; similarly, a neuron can degenerate to a removable dummy neuron if all of its output connections are zeroed out. Figure 4(a) summarizes the learned structure and FLOP of different MLP networks. The results show that SSL can not only remove hidden neurons but also discover the sparsity of images. For example, Figure 4(b) depicts the number of connections of each input neuron in MLP 2, where 40.18% of input neurons have zero connections; they concentrate at the boundary of the image. Such a distribution is consistent with our intuition: handwritten digits are usually written in the center, and pixels close to the boundary contain little discriminative classification information.

4.2 ConvNet and ResNet on CIFAR-10

We implemented the ConvNet of [1] and deep residual networks (ResNet) [5] on CIFAR-10. 
When regularizing filters, channels, and filter shapes, the results and observations for both networks are similar to those of the MNIST experiment. Moreover, we simultaneously learn the filter-wise and shape-wise sparsity of ConvNet to reduce the dimensions of the weight matrices in GEMM. We also learn the depth-wise sparsity of ResNet to regularize the depth of the DNN.

Figure 4: (a) Results of learning the number of neurons in MLP. (b) The number of connections of input neurons (i.e., pixels) in MLP 2 after SSL.

Table 3: Learning row-wise and column-wise sparsity of ConvNet on CIFAR-10

ConvNet #    | Error | Row sparsity §     | Column sparsity §  | Speedup §
1 (baseline) | 17.9% | 12.5%–0%–0%        | 0%–0%–0%           | 1.00×–1.00×–1.00×
2            | 17.9% | 50.0%–28.1%–1.6%   | 0%–59.3%–35.1%     | 1.43×–3.05×–1.57×
3            | 16.9% | 31.3%–0%–1.6%      | 0%–42.8%–9.8%      | 1.25×–2.01×–1.18×
§ In the order of conv1–conv2–conv3

ConvNet: We use the network from Alex Krizhevsky et al. [1] as the baseline and implement it using Caffe. All configurations remain the same as the original implementation, except that we add a dropout layer with a ratio of 0.5 to the fully-connected layer to avoid over-fitting. ConvNet is trained without data augmentation. Table 3 summarizes the results of three ConvNet networks. Here, the row/column sparsity of a weight matrix is defined as the percentage of all-zero rows/columns. Figure 5 shows the learned conv1 filters. As shown in Table 3, SSL can reduce the size of the weight matrices in ConvNet 2 by 50%, 70.7% and 36.1% for the three convolutional layers, respectively, and achieve good speedups without accuracy drop. Surprisingly, even without SSL, four conv1 filters of the baseline are actually all-zeros, as shown in Figure 5, demonstrating the great potential of filter sparsity. 
When SSL is applied, half of the conv1 filters in ConvNet 2 can be zeroed out without accuracy drop.

On the other hand, in ConvNet 3, SSL lowers the error by 1.0% (±0.16%) with a model even smaller than the baseline. In this scenario, SSL performs as a structure regularization that dynamically learns a better network structure (including the number of filters and filter shapes) to reduce the error.

ResNet: To investigate the necessary depth of DNNs by SSL, we use a 20-layer deep residual network (ResNet-20) [5] as the baseline. The network has 19 convolutional layers and 1 fully-connected layer. Identity shortcuts are utilized to connect feature maps with the same dimensions, while 1×1 convolutional layers are chosen as shortcuts between feature maps with different dimensions. Batch normalization [22] is adopted after convolution and before activation. We use the same data augmentation and training hyper-parameters as in [5]. The final error of the baseline is 8.82%. In SSL, the depth of ResNet-20 is regularized by depth-wise sparsity. Group Lasso regularization is only enforced on the convolutional layers between each pair of shortcut endpoints, excluding the first convolutional layer and all convolutional shortcuts. After SSL converges, layers with all-zero weights are removed and the network is finally fine-tuned with a base learning rate of 0.01, which is lower than that (i.e., 0.1) of the baseline.

Figure 6 plots the trend of the error vs. the number of layers under different strengths of depth regularization. Compared with the original ResNet in [5], SSL learns a ResNet with 14 layers (SSL-ResNet-14) that reaches a lower error than that of the baseline with 20 layers (ResNet-20); SSL-ResNet-18 and ResNet-32 achieve errors of 7.40% and 7.51%, respectively. This result implies that SSL can work as a depth regularization to improve classification accuracy. 
Note that SSL can efficiently learn shallower DNNs without accuracy loss to reduce computation cost; however, this does not mean the depth of the network is unimportant. The trend in Figure 6 shows that the test error generally declines as more layers are preserved. A slight error rise of SSL-ResNet-20 over SSL-ResNet-18 shows the suboptimal selection of the depth in the group of "32×32".

Figure 5: Learned conv1 filters in ConvNet 1 (top), ConvNet 2 (middle) and ConvNet 3 (bottom)

Figure 6: Error vs. layer number after depth regularization. # is the number of layers, including the last fully-connected layer. ResNet-# is the ResNet in [5]. SSL-ResNet-# is the depth-regularized ResNet by SSL. 32×32 indicates the convolutional layers with an output map size of 32×32, etc.

4.3 AlexNet on ImageNet

To show the generalization of our method to large-scale DNNs, we evaluate SSL using AlexNet with ILSVRC 2012. CaffeNet [20], the replication of AlexNet [1] with minor changes, is used in our experiments. All training images are rescaled to a size of 256×256. A 227×227 image is randomly cropped from each scaled image and mirrored for data augmentation, and only the center crop is used for validation. The final top-1 validation error is 42.63%. In SSL, AlexNet is first trained with structure regularization; when it converges, zero groups are removed to obtain a DNN with the new structure; finally, the network is fine-tuned without SSL to regain the accuracy.

We first study 2D-filter-wise and shape-wise sparsity by exploring the trade-offs between computation complexity and classification accuracy. 
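As a toy illustration of the 2D-filter bookkeeping used in this section (the layer size and the 40% removal rate are made up for the example; SSL itself zeroes the filters via group Lasso rather than at random):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conv layer: N=64 filters, C=32 channels, 3x3 kernels -> 64*32 2-D filters.
W = rng.standard_normal((64, 32, 3, 3))

# Pretend the 2-D-filter-wise regularizer zeroed out 40% of the 2-D filters.
flat = W.reshape(64 * 32, 3, 3)
removed = rng.choice(flat.shape[0], size=int(0.4 * flat.shape[0]), replace=False)
flat[removed] = 0.0

# 2-D-filter sparsity = ratio of all-zero 2-D filters. Since each surviving
# 2-D filter costs 3*3 multiply-accumulates per output pixel, the saved FLOP
# of the 2-D convolutions tracks this ratio exactly.
n_zero = int(np.sum(np.all(flat == 0.0, axis=(1, 2))))
sparsity_2d = n_zero / flat.shape[0]
print(f"2-D-filter sparsity = {sparsity_2d:.2f}")
```

This is the sense in which the saved convolution is proportional to the percentage of removed 2D filters, as stated in Section 3.3.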
Figure 7(a) shows the 2D-filter sparsity (the ratio between the removed 2D filters and the total 2D filters) and the saved FLOPs of 2D convolutions vs. the validation error. In Figure 7(a), deeper layers generally have higher sparsity, as the group size shrinks and the number of 2D filters grows. 2D-filter sparsity regularization can reduce the total FLOPs by 30%–40% without accuracy loss, or reduce the error of AlexNet by ∼1% down to 41.69% by retaining the original number of parameters. Shape-wise sparsity obtains similar results. In Table 4, for example, AlexNet 5 achieves on average a 1.4× layer-wise speedup on both CPU and GPU without accuracy loss after shape regularization; the top-1 error can also be reduced to 41.83% if the parameters are retained. In Figure 7(a), the DNN with the lowest error has very low sparsity, indicating that the number of parameters in a DNN is still important for maintaining learning capacity. In this case, SSL works as a regularization that adds a smoothness restriction to the model in order to avoid overfitting. Figure 7(b) compares the results of dimensionality reduction of weight tensors in the baseline and our SSL-regularized AlexNet. The results show that the smoothness restriction enforces parameter searching in a lower-dimensional space and enables lower-rank approximation of the DNNs. Therefore, SSL can work together with low-rank approximation to achieve even higher model compression.

Figure 7: (a) 2D-filter-wise sparsity and FLOP reduction vs. top-1 error. The vertical dashed line shows the error of the original AlexNet; (b) The reconstruction error of the weight tensor vs. dimensionality. Principal Component Analysis (PCA) is utilized to perform dimensionality reduction. The eigenvectors corresponding to the largest eigenvalues are selected as the basis of the lower-dimensional space. Dashed lines denote the results of the baselines and solid lines denote those of AlexNet 5 in Table 4; (c) Speedups of ℓ1-norm and SSL on various CPUs and GPUs (in the labels of the x-axis, T# is the maximum number of physical threads on the CPU). AlexNet 1 and AlexNet 2 in Table 4 are used as testbenches.

Besides the above analyses, the computation efficiencies of structured sparsity and non-structured sparsity are compared in Caffe using standard off-the-shelf libraries, i.e., Intel Math Kernel Library on CPU, and CUDA cuBLAS and cuSPARSE on GPU. We use SSL to learn an AlexNet with high column-wise and row-wise sparsity as the representative of the structured sparsity method. ℓ1-norm is selected as the representative of the non-structured sparsity method instead of connection pruning [7] because ℓ1-norm obtains a higher sparsity on convolutional layers, as shown by AlexNet 3 and AlexNet 4 in Table 4. Speedups achieved by SSL are measured by GEMM, where all-zero rows (and columns) in each weight matrix are removed and the remaining ones are concatenated in consecutive memory space. Note that compared to GEMM, the overhead of concatenation can be ignored.
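The GEMM measurement described above can be sketched as follows: all-zero rows and columns are dropped, the remainder is compacted into a contiguous dense matrix, and a smaller dense product replaces the full one. This is a minimal NumPy sketch with illustrative shapes, not the Caffe/MKL benchmark itself:

```python
import numpy as np

def shrink_gemm(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Structured sparsity: drop all-zero rows/columns of W and run a
    smaller dense matrix product. Zero columns let us also drop the
    matching input entries; zero rows contribute zeros to the output."""
    row_keep = np.any(W != 0, axis=1)
    col_keep = np.any(W != 0, axis=0)
    y = np.zeros(W.shape[0])
    # Dense product on the compacted (concatenated) sub-matrix.
    y[row_keep] = W[np.ix_(row_keep, col_keep)] @ x[col_keep]
    return y

# Sanity check: zeroing one row and one column must not change the result.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 5))
W[2, :] = 0.0   # structurally removed output (zero row)
W[:, 3] = 0.0   # structurally removed input (zero column)
x = rng.normal(size=5)
assert np.allclose(shrink_gemm(W, x), W @ x)
```

Because the compacted sub-matrix is dense and contiguous, it runs through ordinary GEMM with good data locality, which is why structured sparsity translates into practical speedups where scattered non-structured zeros do not.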
To measure the speedups of ℓ1-norm, sparse weight matrices are stored in the Compressed Sparse Row (CSR) format and computed by sparse–dense matrix multiplication subroutines.

Table 4 compares the obtained sparsity and speedups of ℓ1-norm and SSL on CPU (Intel Xeon) and GPU (GeForce GTX TITAN Black) under approximately the same errors, i.e., with acceptable or no accuracy loss. To make a fair comparison, after ℓ1-norm regularization the DNN is also fine-tuned by disconnecting all zero-weighted connections, so that, e.g., 1.39% accuracy is recovered for AlexNet 1. Our experiments show that the DNNs require a very high non-structured sparsity to achieve a reasonable speedup (the speedups are even negative when the sparsity is low). SSL, however, always achieves positive speedups. With an acceptable accuracy loss, our SSL achieves on average 5.1× and 3.1× layer-wise acceleration on CPU and GPU, respectively. In contrast, ℓ1-norm achieves on average only 3.0× and 0.9× layer-wise acceleration on CPU and GPU, respectively. We note that, at the same accuracy, our average speedup is indeed higher than that of [6], which adopts heavy hardware customization to overcome the negative impact of non-structured sparsity. Figure 7(c) shows the speedups of ℓ1-norm and SSL on various platforms, including both GPUs (Quadro, Tesla and Titan) and CPUs (Intel Xeon E5-2630). SSL achieves on average a ∼3× speedup on GPU, while non-structured sparsity obtains no speedup on GPU platforms. On CPU platforms, both methods achieve good speedups and the benefit grows as the processors become weaker. Nonetheless, SSL always achieves on average a ∼2× speedup over non-structured sparsity.

5 Conclusion

In this work, we propose a Structured Sparsity Learning (SSL) method to regularize filter, channel, filter shape, and depth structures in Deep Neural Networks (DNN).
Our method can enforce the DNN to dynamically learn more compact structures without accuracy loss. The structured compactness of the DNN achieves significant speedups for DNN evaluation on both CPU and GPU with off-the-shelf libraries. Moreover, a variant of SSL can be performed as structure regularization to improve the classification accuracy of state-of-the-art DNNs.

Acknowledgments

This work was supported in part by NSF XPS-1337198 and NSF CCF-1615475. The authors thank Drs. Sheng Li and Jongsoo Park for valuable feedback on this work.

Table 4: Sparsity and speedup of AlexNet on ILSVRC 2012

#  Method       Top-1 err.  Statistics       conv1   conv2   conv3   conv4   conv5
1  ℓ1           44.67%      sparsity         67.6%   92.4%   97.2%   96.6%   94.3%
                            CPU ×            0.80    2.91    4.84    3.83    2.76
                            GPU ×            0.25    0.52    1.38    1.04    1.36
2  SSL          44.66%      column sparsity  0.0%    63.2%   76.9%   84.7%   80.7%
                            row sparsity     9.4%    12.9%   40.6%   46.9%   0.0%
                            CPU ×            1.05    3.37    6.27    9.73    4.93
                            GPU ×            1.00    2.37    4.94    4.03    3.05
3  pruning [7]  42.80%      sparsity         16.0%   62.0%   65.0%   63.0%   63.0%
4  ℓ1           42.51%      sparsity         14.7%   76.2%   85.3%   81.5%   76.3%
                            CPU ×            0.34    0.99    1.30    1.10    0.93
                            GPU ×            0.08    0.17    0.42    0.30    0.32
5  SSL          42.53%      column sparsity  0.00%   20.9%   39.7%   39.7%   24.6%
                            CPU ×            1.00    1.27    1.64    1.68    1.32
                            GPU ×            1.00    1.25    1.63    1.72    1.36

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[2] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[3] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2015.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[6] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[7] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[9] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.

[10] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.

[11] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

[12] Yani Ioannou, Duncan P. Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training CNNs with low-rank filters for efficient image classification.
arXiv preprint arXiv:1511.06744, 2015.

[13] Cheng Tai, Tong Xiao, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

[14] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

[15] Seyoung Kim and Eric P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In Proceedings of the 27th International Conference on Machine Learning, 2010.

[16] Jiashi Feng and Trevor Darrell. Learning the structure of deep convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), 2015.

[17] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[18] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

[19] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.

[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.