{"title": "Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation", "book": "Advances in Neural Information Processing Systems", "page_first": 1269, "page_last": 1277, "abstract": "We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2\u00d7, while keeping the accuracy within 1% of the original model.", "full_text": "Exploiting Linear Structure Within Convolutional\n\nNetworks for Ef\ufb01cient Evaluation\n\nEmily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun and Rob Fergus\n\nDept. of Computer Science, Courant Institute, New York University\n\n{denton, zaremba, bruna, lecun, fergus} @cs.nyu.edu\n\nAbstract\n\nWe present techniques for speeding up the test-time evaluation of large convo-\nlutional networks, designed for object recognition tasks. These models deliver\nimpressive accuracy, but each image evaluation requires millions of \ufb02oating point\noperations, making their deployment on smartphones and Internet-scale clusters\nproblematic. The computation is dominated by the convolution operations in the\nlower layers of the model. We exploit the redundancy present within the con-\nvolutional \ufb01lters to derive approximations that signi\ufb01cantly reduce the required\ncomputation. 
Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2×, while keeping the accuracy within 1% of the original model.\n\n1 Introduction\n\nLarge neural networks have recently demonstrated impressive performance on a range of speech and vision tasks. However, the size of these models can make their deployment at test time problematic. For example, mobile computing platforms are limited in their CPU speed, memory and battery life. At the other end of the spectrum, Internet-scale deployment of these models requires thousands of servers to process hundreds of millions of images per day. The electrical and cooling costs of these servers are significant. Training large neural networks can take weeks, or even months. This hinders research, and consequently there have been extensive efforts devoted to speeding up the training procedure. However, there are relatively few efforts aimed at improving the test-time performance of the models.\nWe consider convolutional neural networks (CNNs) used for computer vision tasks, since they are large and widely used in commercial applications. These networks typically require a huge number of parameters (∼ 10^8 in [1]) to produce state-of-the-art results. While these networks tend to be heavily over-parameterized [2], this redundancy seems necessary in order to overcome a highly non-convex optimization [3]. As a byproduct, the resulting network wastes computing resources. In this paper we show that this redundancy can be exploited with linear compression techniques, resulting in significant speedups for the evaluation of trained large-scale networks, with minimal compromise to performance.\nWe follow a relatively simple strategy: we start by compressing each convolutional layer by finding an appropriate low-rank approximation, and then we fine-tune the upper layers until the prediction performance is restored.
We consider several elementary tensor decompositions based on singular value decompositions, as well as filter clustering methods to take advantage of similarities between learned features.\nOur main contributions are the following: (1) We present a collection of generic methods to exploit the redundancy inherent in deep CNNs. (2) We report experiments on state-of-the-art ImageNet CNNs, showing empirical speedups on convolutional layers by a factor of 2−3× and a reduction of parameters in fully connected layers by a factor of 5−10×.\nNotation: Convolution weights can be described as a 4-dimensional tensor: W ∈ R^{C×X×Y×F}. C is the number of input channels, X and Y are the spatial dimensions of the kernel, and F is the target number of feature maps. It is common for the first convolutional layer to have a stride associated with the kernel, which we denote by ∆. Let I ∈ R^{C×N×M} denote an input signal, where C is the number of input maps, and N and M are the spatial dimensions of the maps. The target value, T = I ∗ W, of a generic convolutional layer, with ∆ = 1, for a particular output feature, f, and spatial location, (x, y), is\n\nT(f, x, y) = \sum_{c=1}^{C} \sum_{x'=1}^{X} \sum_{y'=1}^{Y} I(c, x - x', y - y') W(c, x', y', f)\n\nIf W is a tensor, ||W|| denotes its operator norm, sup_{||x||=1} ||Wx||_F, and ||W||_F denotes its Frobenius norm.\n\n2 Related Work\n\nVanhoucke et al. [4] explored the properties of CPUs to speed up execution. They present many solutions specific to Intel and AMD CPUs, and some of their techniques are general enough to be used for any type of processor. They describe how to align memory, and use SIMD operations (vectorized operations on CPU) to boost the efficiency of matrix multiplication.
Additionally, they propose the linear quantization of the network weights and input. This involves representing weights as 8-bit integers (range [−128, 127]), rather than 32-bit floats. This approximation is similar in spirit to our approach, but differs in that it is applied to each weight element independently. By contrast, our approximation approach models the structure within each filter. Potentially, the two approaches could be used in conjunction.\nThe most expensive operations in CNNs are the convolutions in the first few layers. The complexity of this operation is linear in the area of the receptive field of the filters, which is relatively large for these layers. However, Mathieu et al. [5] have shown that convolution can be efficiently computed in the Fourier domain, where it becomes element-wise multiplication (and there is no cost associated with the size of the receptive field). They report a forward-pass speedup of around 2× for convolution layers in state-of-the-art models. Importantly, the FFT method can be used jointly with most of the techniques presented in this paper.\nThe use of low-rank approximations in our approach is inspired by the work of Denil et al. [2], who demonstrate the redundancies in neural network parameters. They show that the weights within a layer can be accurately predicted from a small (e.g. ∼ 5%) subset of them. This indicates that neural networks are heavily over-parametrized. All the methods presented here focus on exploiting the linear structure of this over-parametrization.\nFinally, a recent preprint [6] also exploits low-rank decompositions of convolutional tensors to speed up the evaluation of CNNs, applied to scene text character recognition. This work was developed simultaneously with ours, and provides further evidence that such techniques can be applied to a variety of architectures and tasks. Our work differs in several ways.
First, we consider a significantly larger model. This makes it more challenging to compute efficient approximations since there are more layers to propagate through and thus a greater opportunity for error to accumulate. Second, we present different compression techniques for the hidden convolutional layers and provide a method of compressing the first convolutional layer. Finally, we present GPU results in addition to CPU results.\n\n3 Convolutional Tensor Compression\n\nIn this section we describe techniques for compressing 4-dimensional convolutional weight tensors and fully connected weight matrices into a representation that permits efficient computation and storage. Section 3.1 describes how to construct a good approximation criterion. Section 3.2 describes techniques for low-rank tensor approximations. Sections 3.3 and 3.4 describe how to apply these techniques to approximate the weights of a convolutional neural network.\n\n3.1 Approximation Metric\n\nOur goal is to find an approximation, \tilde{W}, of a convolutional tensor W that facilitates more efficient computation while maintaining the prediction performance of the network. A natural choice for an approximation criterion is to minimize ||\tilde{W} − W||_F. This criterion yields efficient compression schemes using elementary linear algebra, and also controls the operator norm of each linear convolutional layer. However, this criterion assumes that all directions in the space of weights equally affect prediction performance. We now present two methods of improving this criterion while keeping the same efficient approximation algorithms.\nMahalanobis distance metric: The first distance metric we propose seeks to emphasize coordinates more prone to produce prediction errors over coordinates whose effect is less harmful for the overall system. We can obtain such measurements as follows. Let Θ = {W1, . . .
, WS} denote the set of all parameters of the S-layer network, and let U(I; Θ) denote the output after the softmax layer for input image I. We consider a given input training set (I_1, . . . , I_N) with known labels (y_1, . . . , y_N). For each pair (I_n, y_n), we compute the forward propagation pass U(I_n, Θ), and define as {β_n} the indices of the h largest values of U(I_n, Θ) different from y_n. Then, for a given layer s, we compute\n\nd_{n,l,s} = \nabla_{W_s} (U(I_n, Θ) − δ(i − l)) ,  n ≤ N , l ∈ {β_n} , s ≤ S ,  (1)\n\nwhere δ(i − l) is the Dirac distribution centered at l. In other words, for each input we back-propagate the difference between the current prediction and the h “most dangerous” mistakes.\nThe Mahalanobis distance is defined from the covariance of d: ||W||^2_{maha} = w Σ^{−1} w^T, where w is the vector containing all the coordinates of W, and Σ is the covariance of (d_{n,l,s})_{n,l}. We do not report results using this metric, since it requires inverting a matrix of size equal to the number of parameters, which can be prohibitively expensive in large networks. Instead we use an approximation that considers only the diagonal of the covariance matrix. In particular, we propose the following approximate Mahalanobis distance metric:\n\n||W||_{\widehat{maha}} := \sum_p α_p W(p) , where α_p = ( \sum_{n,l} d_{n,l,s}(p)^2 )^{1/2}  (2)\n\nwhere the sum runs over the tensor coordinates. Since (2) is a reweighted Euclidean metric, we can simply compute W' = α .∗ W, where .∗ denotes element-wise multiplication, then compute the approximation \tilde{W}' on W' using the standard L2 norm, and finally output \tilde{W} = α^{−1} .∗ \tilde{W}'.\nData covariance distance metric: One can view the Frobenius norm of W as ||W||^2_F = E_{x∼N(0,I)} ||Wx||^2_F. Another alternative, similar to the one considered in [6], is to replace the isotropic covariance assumption by the empirical covariance of the input of the layer. If W ∈ R^{C×X×Y×F} is a convolutional layer, and \hat{Σ} ∈ R^{CXY×CXY} is the empirical estimate of the input data covariance, it can be efficiently computed as\n\n||W||_{data} = ||\hat{Σ}^{1/2} W_F||_F ,  (3)\n\nwhere W_F is the matrix obtained by folding the first three dimensions of W. As opposed to [6], this approach adapts to the input distribution without the need to iterate through the data.\n\n3.2 Low-rank Tensor Approximations\n\n3.2.1 Matrix Decomposition\n\nMatrices are 2-tensors which can be linearly compressed using the Singular Value Decomposition. If W ∈ R^{m×k} is a real matrix, the SVD is defined as W = USV^T, where U ∈ R^{m×m}, S ∈ R^{m×k}, V ∈ R^{k×k}. S is a diagonal matrix with the singular values on the diagonal, and U, V are orthogonal matrices.
If the singular values of W decay rapidly, W can be well approximated by keeping only the t largest entries of S, resulting in the approximation \tilde{W} = \tilde{U}\tilde{S}\tilde{V}^T, where \tilde{U} ∈ R^{m×t}, \tilde{S} ∈ R^{t×t}, \tilde{V} ∈ R^{t×k}. Then, for I ∈ R^{n×m}, the approximation error ||I\tilde{W} − IW||_F satisfies ||I\tilde{W} − IW||_F ≤ s_{t+1} ||I||_F, and thus is controlled by the decay along the diagonal of S. Now the computation I\tilde{W} can be done in O(nmt + nt^2 + ntk), which, for sufficiently small t, is significantly smaller than O(nmk).\n\n3.2.2 Higher Order Tensor Approximations\n\nSVD can be used to approximate a tensor W ∈ R^{m×n×k} by first folding all but two dimensions together to convert it into a 2-tensor, and then considering the SVD of the resulting matrix. For example, we can approximate W_m ∈ R^{m×(nk)} as \tilde{W}_m ≈ \tilde{U}\tilde{S}\tilde{V}^T. W can be compressed even further by applying SVD to \tilde{V}. We refer to this approximation as the SVD decomposition and use K1 and K2 to denote the rank used in the first and second application of SVD respectively.\nAlternatively, we can approximate a 3-tensor, W_S ∈ R^{m×n×k}, by a rank-1 3-tensor by finding a decomposition that minimizes\n\n||W − α ⊗ β ⊗ γ||_F ,  (4)\n\nwhere α ∈ R^m, β ∈ R^n, γ ∈ R^k and ⊗ denotes the outer product operation. Problem (4) is solved efficiently by performing alternating least squares on α, β and γ respectively, although more efficient algorithms can also be considered [7].\nThis easily extends to a rank-K approximation using a greedy algorithm: given a tensor W, we compute (α, β, γ) using (4), and we update W^{(k+1)} ← W^{(k)} − α ⊗ β ⊗ γ. Repeating this operation K times results in\n\n\tilde{W}_S = \sum_{k=1}^{K} α_k ⊗ β_k ⊗ γ_k .  (5)\n\nWe refer to this approximation as the outer product decomposition and use K to denote the rank of the approximation.\n\nFigure 1: A visualization of monochromatic and biclustering approximation structures. (a) The monochromatic approximation, used for the first layer. Input color channels are projected onto a set of intermediate color channels. After this transformation, output features need only to look at one intermediate color channel. (b) The biclustering approximation, used for higher convolution layers. Input and output features are clustered into equal sized groups. The weight tensor corresponding to each pair of input and output clusters is then approximated. (c) The weight tensors for each input-output pair in (b) are approximated by a sum of rank-1 tensors using techniques described in 3.2.2.\n\n3.3 Monochromatic Convolution Approximation\n\nLet W ∈ R^{C×X×Y×F} denote the weights of the first convolutional layer of a trained network. We found that the color components of trained CNNs tend to have low dimensional structure. In particular, the weights can be well approximated by projecting the color dimension down to a 1D subspace. The low-dimensional structure of the weights is illustrated in Figure 2.\nThe monochromatic approximation exploits this structure and is computed as follows.
First, for every output feature, f, we consider the matrix W_f ∈ R^{C×(XY)}, where the spatial dimensions of the filter corresponding to the output feature have been combined, and find the SVD, W_f = U_f S_f V_f^T, where U_f ∈ R^{C×C}, S_f ∈ R^{C×XY}, and V_f ∈ R^{XY×XY}. We then take the rank-1 approximation of W_f, \tilde{W}_f = \tilde{U}_f \tilde{S}_f \tilde{V}_f^T, where \tilde{U}_f ∈ R^{C×1}, \tilde{S}_f ∈ R, \tilde{V}_f ∈ R^{1×XY}. We can further exploit the regularity in the weights by sharing the color component basis between different output features. We do this by clustering the F left singular vectors, \tilde{U}_f, of each output feature f into C' clusters, for C' < F. We constrain the clusters to be of equal size as discussed in Section 3.4. Then, for each of the F/C' output features, f, that is assigned to cluster c_f, we can approximate W_f with \tilde{W}_f = U_{c_f} \tilde{S}_f \tilde{V}_f^T, where U_{c_f} ∈ R^{C×1} is the cluster center for cluster c_f and \tilde{S}_f and \tilde{V}_f are as before. This monochromatic approximation is illustrated in Figure 1(a).\n\nTable 1: Number of operations required for various approximation methods.\n\nApproximation technique | Number of operations\nNo approximation | XYCF · NM · ∆^{−2}\nMonochromatic | C'C · NM + XYF · NM · ∆^{−2}\nBiclustering + outer product decomposition | GHK(NM · C/G + XY · NM · ∆^{−2} + (F/H) · NM · ∆^{−2})\nBiclustering + SVD | GH · NM((C/G) K1 + K1 · XY · K2 · ∆^{−2} + K2 · F/H)
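The monochromatic construction of Section 3.3 can be sketched in a few lines. This is a hedged illustration, not the authors' code: the function names are ours, and we use plain (unbalanced) k-means for the color-basis clustering, whereas the paper uses an equal-size-constrained variant (Section 3.4).

```python
import numpy as np

def monochromatic_approx(W, n_colors):
    """Approximate first-layer weights W (shape C, X, Y, F) by projecting each
    filter's color component onto a shared set of n_colors color directions.
    Returns (centers, mono, assign): an (n_colors, C) color basis, per-filter
    (X*Y,) monochromatic spatial filters, and per-filter cluster assignments."""
    C, X, Y, F = W.shape
    S_vals = np.empty(F)
    U_vecs = np.empty((F, C))
    V_vecs = np.empty((F, X * Y))
    for f in range(F):
        # Rank-1 SVD of each filter's color-by-space matrix W_f.
        Wf = W[:, :, :, f].reshape(C, X * Y)
        U, S, Vt = np.linalg.svd(Wf, full_matrices=False)
        U_vecs[f], S_vals[f], V_vecs[f] = U[:, 0], S[0], Vt[0]
    # Cluster the color components U_f (plain k-means for illustration only).
    rng = np.random.default_rng(0)
    centers = U_vecs[rng.choice(F, n_colors, replace=False)]
    for _ in range(20):
        assign = np.argmin(((U_vecs[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_colors):
            if np.any(assign == c):
                centers[c] = U_vecs[assign == c].mean(0)
    mono = S_vals[:, None] * V_vecs  # fold singular value into the spatial part
    return centers, mono, assign

def reconstruct(centers, mono, assign, shape):
    """Rebuild the approximated weight tensor from the shared color basis."""
    C, X, Y, F = shape
    W_approx = np.empty(shape)
    for f in range(F):
        W_approx[:, :, :, f] = np.outer(centers[assign[f]], mono[f]).reshape(C, X, Y)
    return W_approx
```

Evaluating the layer with these factors amounts to a C → C' 1×1 color projection followed by one monochromatic 2-D convolution per output feature, which is where the operation counts in Table 1 come from.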
Table 1 shows the number of operations required for the standard and monochromatic versions.\n\n3.4 Biclustering Approximations\n\nWe exploit the redundancy within the 4-D weight tensors in the higher convolutional layers by clustering the filters, such that each cluster can be accurately approximated by a low-rank factorization. We start by clustering the rows of W_C ∈ R^{C×(XYF)}, which results in clusters C_1, . . . , C_a. Then we cluster the columns of W_F ∈ R^{(CXY)×F}, producing clusters F_1, . . . , F_b. These two operations break the original weight tensor W into ab sub-tensors {W_{C_i,F_j}}, i = 1, . . . , a, j = 1, . . . , b, as shown in Figure 1(b). Each sub-tensor contains similar elements, and thus is easier to fit with a low-rank approximation.\nIn order to exploit the parallelism inherent in CPU and GPU architectures it is useful to constrain clusters to be of equal sizes. We therefore perform the biclustering operations (or clustering for monochromatic filters in Section 3.3) using a modified version of the k-means algorithm which balances the cluster count at each iteration. It is implemented with the Lloyd algorithm, by modifying the Euclidean distance with a subspace projection distance.\nAfter the input and output clusters have been obtained, we find a low-rank approximation of each sub-tensor using either the SVD decomposition or the outer product decomposition as described in Section 3.2.2. We concatenate the X and Y spatial dimensions of the sub-tensors so that the decomposition is applied to the 3-tensor, W_S ∈ R^{C×(XY)×F}. While we could look for a separable approximation along the spatial dimensions as well, we found the resulting gain to be minimal. Using these approximations, the target output can be computed with significantly fewer operations.
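The greedy rank-K outer product decomposition of Section 3.2.2, which is applied to each biclustered sub-tensor, can be sketched as follows. This is an illustrative sketch under our own assumptions (fixed random initialization and a fixed number of alternating-least-squares iterations), not the authors' implementation.

```python
import numpy as np

def rank1_als(W, n_iter=50):
    """Fit one rank-1 term: find alpha, beta, gamma minimizing
    ||W - alpha ⊗ beta ⊗ gamma||_F by alternating least squares (Eq. 4)."""
    m, n, k = W.shape
    rng = np.random.default_rng(0)
    a, b, c = rng.standard_normal(m), rng.standard_normal(n), rng.standard_normal(k)
    for _ in range(n_iter):
        # Each factor has a closed-form least-squares update with the others fixed.
        a = np.einsum('ijk,j,k->i', W, b, c) / ((b @ b) * (c @ c))
        b = np.einsum('ijk,i,k->j', W, a, c) / ((a @ a) * (c @ c))
        c = np.einsum('ijk,i,j->k', W, a, b) / ((a @ a) * (b @ b))
    return a, b, c

def outer_product_decomposition(W, K):
    """Greedy rank-K approximation: repeatedly fit a rank-1 term to the
    residual and subtract it, yielding the sum in Eq. 5."""
    residual = W.astype(float)
    terms = []
    for _ in range(K):
        a, b, c = rank1_als(residual)
        residual = residual - np.einsum('i,j,k->ijk', a, b, c)
        terms.append((a, b, c))
    approx = W - residual
    return terms, approx
```

Each rank-1 update can only decrease the residual norm (the zero factor is always a feasible least-squares solution), so the greedy approximation error is non-increasing in K.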
The number of operations required is a function of the number of input clusters, G, the number of output clusters, H, and the rank of the sub-tensor approximations (K1, K2 for the SVD decomposition; K for the outer product decomposition). The number of operations required for each approximation is described in Table 1.\n\n3.5 Fine-tuning\n\nMany of the approximation techniques presented here can efficiently compress the weights of a CNN with negligible degradation of classification performance provided the approximation is not too harsh. Alternatively, one can use a harsher approximation that gives greater speedup gains but hurts the performance of the network. In this case, the approximated layer and all those below it can be fixed and the upper layers can be fine-tuned until the original performance is restored.\n\n4 Experiments\n\nWe use the 15 layer convolutional architecture of [8], trained on the ImageNet 2012 dataset [9]. The network contains 5 convolutional layers, 3 fully connected layers and a softmax output layer. We evaluated the network on both CPU and GPU platforms. All measurements of prediction performance are with respect to the 50K validation images from the ImageNet12 dataset.\n\nFigure 2: Visualization of the 1st layer filters. (Left) Each component of the 96 7x7 filters is plotted in RGB space. Points are colored based on the output filter they belong to. Hence, there are 96 colors and 72 points of each color. The leftmost plot shows the original filters and the right plot shows the filters after the monochromatic approximation, where each filter has been projected down to a line in colorspace. (Right) Original and approximate versions of a selection of 1st layer filters.\n\nWe present results showing the performance of the approximations described in Section 3 in terms of prediction accuracy, speedup gains and reduction in memory overhead. All of our fine-tuning results were achieved by training with less than 2 passes using the ImageNet12 training dataset. Unless stated otherwise, classification numbers refer to those of fine-tuned models.\n\n4.1 Speedup\n\nThe majority of forward propagation time is spent on the first two convolutional layers (see Supplementary Material for a breakdown of time across all layers). Because of this, we restrict our attention to the first and second convolutional layers in our speedup experiments. However, our approximations could easily be applied to convolutions in upper layers as well.\nWe implemented several CPU and GPU approximation routines in an effort to achieve empirical speedups. Both the baseline and approximation CPU code is implemented in C++ using the Eigen3 library [10] compiled with Intel MKL. We also use Intel's implementation of OpenMP and multithreading. The baseline gives comparable performance to highly optimized MATLAB convolution routines and all of our CPU speedup results are computed relative to this. We used Alex Krizhevsky's CUDA convolution routines 1 as a baseline for GPU comparisons. The approximation versions are written in CUDA. All GPU code was run on a standard nVidia Titan card.\nWe have found that in practice it is often difficult to achieve speedups close to the theoretical gains based on the number of arithmetic operations (see Supplementary Material for a discussion of theoretical gains). Moreover, different computer architectures and CNN architectures afford different optimization strategies, making most implementations highly specific. However, regardless of implementation details, all of the approximations we present reduce both the number of operations and number of weights required to compute the output by at least a factor of two, often more.\n\n4.1.1 First Layer\n\nThe first convolutional layer has 3 input channels, 96 output channels and 7x7 filters.
We approximated the weights in this layer using the monochromatic approximation described in Section 3.3. The monochromatic approximation works well if the color components span a small number of one-dimensional subspaces. Figure 2 illustrates the effect of the monochromatic approximation on the first layer filters.\nThe only parameter in the approximation is C', the number of color channels used for the intermediate representation. As expected, the network performance begins to degrade as C' decreases. The number of floating point operations required to compute the output of the monochromatic convolution is reduced by a factor of 2−3×, with the larger gain resulting for small C'. Figure 3 shows the empirical speedups we achieved on CPU and GPU and the corresponding network performance for various numbers of colors used in the monochromatic approximation. Our CPU and GPU implementations achieve empirical speedups of 2−2.5× relative to the baseline with less than 1% drop in classification performance.\n\n1 https://code.google.com/p/cuda-convnet/\n\nFigure 3: Empirical speedups on (Left) CPU and (Right) GPU for the first layer. C' is the number of colors used in the approximation.\n\nFigure 4: Empirical speedups for the second convolutional layer. (Left) Speedups on CPU using biclustering (G = 2 and H = 2) with the SVD approximation. (Right) Speedups on GPU using biclustering (G = 48 and H = 2) with the outer product decomposition approximation.\n\n4.1.2 Second Layer\n\nThe second convolutional layer has 96 input channels, 256 output channels and 5x5 filters. We approximated the weights using the techniques described in Section 3.4.
We explored various configurations of the approximations by varying the number of input clusters G, the number of output clusters H and the rank of the approximation (denoted by K1 and K2 for the SVD decomposition and K for the outer product decomposition).\nFigure 4 shows our empirical speedups on CPU and GPU and the corresponding network performance for various approximation configurations. For the CPU implementation we used the biclustering with SVD approximation. For the GPU implementation we used the biclustering with outer product decomposition approximation. We achieved promising results and present speedups of 2−2.5× relative to the baseline with less than a 1% drop in performance.\n\n4.2 Combining approximations\n\nThe approximations can also be cascaded to provide greater speedups. The procedure is as follows. Compress the first convolutional layer weights and then fine-tune all the layers above until performance is restored. Next, compress the second convolutional layer weights that result from the fine-tuning. Fine-tune all the layers above until performance is restored and then continue the process.\nWe applied this procedure to the first two convolutional layers. Using the monochromatic approximation with 6 colors for the first layer and the biclustering with outer product decomposition approximation for the second layer (G = 48; H = 2; K = 8), and fine-tuning with a single pass through the training set, we are able to keep accuracy within 1% of the original model. This procedure could be applied to each convolutional layer, in this sequential manner, to achieve overall speedups much greater than any individual layer can provide. A more comprehensive summary of these results can be found in the Supplementary Material.\n\nTable 2: Number of parameters expressed as a function of hyperparameters for various approximation methods and empirical reduction in parameters with corresponding network performance.\n\nApproximation method | Number of parameters | Approximation hyperparameters | Reduction in weights | Increase in error\nStandard convolution | CXYF | | |\nConv layer 1: Monochromatic | CC' + XYF | C' = 6 | 3× | 0.43%\nConv layer 2: Biclustering + outer product decomposition | GHK(C/G + XY + F/H) | G = 48; H = 2; K = 6 | 5.3× | 0.68%\nConv layer 2: Biclustering + SVD | GH((C/G)K1 + K1·XY·K2 + K2·F/H) | G = 2; H = 2; K1 = 19; K2 = 24 | 3.9× | 0.9%\nStandard FC | NM | | |\nFC layer 1: Matrix SVD | NK + KM | K = 250 | 13.4× | 0.8394%\nFC layer 1: Matrix SVD | NK + KM | K = 950 | 3.5× | 0.09%\nFC layer 2: Matrix SVD | NK + KM | K = 350 | 5.8× | 0.19%\nFC layer 2: Matrix SVD | NK + KM | K = 650 | 3.14× | 0.06%\nFC layer 3: Matrix SVD | NK + KM | K = 250 | 8.1× | 0.67%\nFC layer 3: Matrix SVD | NK + KM | K = 850 | 2.4× | 0.02%\n\n4.3 Reduction in memory overhead\n\nIn many commercial applications memory conservation and storage are a central concern. This mainly applies to embedded systems (e.g. smartphones), where available memory is limited, and users are reluctant to download large files.
In these cases, being able to compress the neural network is crucial for the viability of the product.\nIn addition to requiring fewer operations, our approximations require significantly fewer parameters when compared to the original model. Since the majority of parameters come from the fully connected layers, we include these layers in our analysis of memory overhead. We compress the fully connected layers using standard SVD as described in Section 3.2.1, using K to denote the rank of the approximation.\nTable 2 shows the number of parameters for various approximation methods as a function of hyperparameters for the approximation techniques. The table also shows the empirical reduction of parameters and the corresponding network performance for specific instantiations of the approximation parameters.\n\n5 Discussion\n\nIn this paper we have presented techniques that can speed up the bottleneck convolution operations in the first layers of a CNN by a factor of 2−3×, with negligible loss of performance. We also show that our methods reduce the memory footprint of weights in the first two layers by a factor of 2−3× and the fully connected layers by a factor of 5−13×. Since the vast majority of weights reside in the fully connected layers, compressing only these layers translates into significant savings, which would facilitate mobile deployment of convolutional networks. These techniques are orthogonal to other approaches for efficient evaluation, such as quantization or working in the Fourier domain. Hence, they can potentially be used together to obtain further gains.\nAn interesting avenue of research to explore in further work is the ability of these techniques to aid in regularization either during or post training. The low-rank projections effectively decrease the number of learnable parameters, suggesting that they might improve generalization ability.
The regularization potential of the low-rank approximations is further motivated by two observations. The first is that the approximated filters for the first convolutional layer appear to be cleaned-up versions of the original filters. Additionally, we noticed that we sporadically achieve better test error with some of the more conservative approximations.\n\nAcknowledgments\n\nThe authors are grateful for support from ONR #N00014-13-1-0646, NSF #1116923, #1149633 and Microsoft Research.\n\nReferences\n\n[1] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229 (2013)\n[2] Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep learning. arXiv preprint arXiv:1306.0543 (2013)\n[3] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)\n[4] Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on CPUs. In: Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop. (2011)\n[5] Mathieu, M., Henaff, M., LeCun, Y.: Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013)\n[6] Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014)\n[7] Zhang, T., Golub, G.H.: Rank-one approximation to high order tensors. SIAM J. Matrix Anal. Appl. 23(2) (February 2001) 534–550\n[8] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901 (2013)\n[9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database.
In: CVPR09. (2009)\n[10] Guennebaud, G., Jacob, B., et al.: Eigen v3. http://eigen.tuxfamily.org (2010)\n[11] Zeiler, M.D., Taylor, G.W., Fergus, R.: Adaptive deconvolutional networks for mid and high level feature learning. In: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE (2011) 2018–2025\n[12] Le, Q.V., Ngiam, J., Chen, Z., Chia, D., Koh, P.W., Ng, A.Y.: Tiled convolutional neural networks. In: Advances in Neural Information Processing Systems. (2010)\n[13] Le, Q.V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. arXiv preprint arXiv:1112.6209 (2011)\n[14] Lowe, D.G.: Object recognition from local scale-invariant features. In: Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on. Volume 2., IEEE (1999) 1150–1157\n[15] Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25. (2012) 1106–1114", "award": [], "sourceid": 719, "authors": [{"given_name": "Emily", "family_name": "Denton", "institution": "New York University"}, {"given_name": "Wojciech", "family_name": "Zaremba", "institution": "New York University"}, {"given_name": "Joan", "family_name": "Bruna", "institution": "NYU"}, {"given_name": "Yann", "family_name": "LeCun", "institution": "New York U"}, {"given_name": "Rob", "family_name": "Fergus", "institution": "NYU"}]}