{"title": "Constructing Fast Network through Deconstruction of Convolution", "book": "Advances in Neural Information Processing Systems", "page_first": 5951, "page_last": 5961, "abstract": "Convolutional neural networks have achieved great success in various vision tasks; however, they incur heavy resource costs. By using deeper and wider networks, network accuracy can be improved rapidly. However, in an environment with limited resources (e.g., mobile applications), heavy networks may not be usable. This study shows that naive convolution can be deconstructed into a shift operation and pointwise convolution. To cope with various convolutions, we propose a new shift operation called active shift layer (ASL) that formulates the amount of shift as a learnable function with shift parameters. This new layer can be optimized end-to-end through backpropagation and it can provide optimal shift values. Finally, we apply this layer to a light and fast network that surpasses existing state-of-the-art networks.", "full_text": "Constructing Fast Network\n\nthrough Deconstruction of Convolution\n\nYunho Jeon\n\nJunmo Kim\n\nSchool of Electrical Engineering, KAIST\n\nSchool of Electrical Engineering, KAIST\n\njyh2986@kaist.ac.kr\n\njunmo.kim@kaist.ac.kr\n\nAbstract\n\nConvolutional neural networks have achieved great success in various vision tasks;\nhowever, they incur heavy resource costs. By using deeper and wider networks,\nnetwork accuracy can be improved rapidly. However, in an environment with\nlimited resources (e.g., mobile applications), heavy networks may not be usable.\nThis study shows that naive convolution can be deconstructed into a shift operation\nand pointwise convolution. To cope with various convolutions, we propose a\nnew shift operation called active shift layer (ASL) that formulates the amount\nof shift as a learnable function with shift parameters. 
This new layer can be\noptimized end-to-end through backpropagation and it can provide optimal shift\nvalues. Finally, we apply this layer to a light and fast network that surpasses\nexisting state-of-the-art networks. Code is available at https://github.com/\njyh2986/Active-Shift.\n\n1\n\nIntroduction\n\nDeep learning has been applied successfully in various \ufb01elds. For example, convolutional neural\nnetworks (CNNs) have been developed and applied successfully to a wide variety of vision tasks.\nIn this light, the current study examines various network structures[15, 19, 21, 20, 22, 6, 7, 9]. In\nparticular, networks are being made deeper and wider because doing so improves accuracy. This\napproach has been facilitated by hardware developments such as graphics processing units (GPUs).\nHowever, this approach increases the inference and training times and consumes more memory.\nTherefore, a large network might not be implementable in environments with limited resources, such\nas mobile applications. In fact, accuracy may need to be sacri\ufb01ced in such environments. Two\nmain types of approaches have been proposed to avoid this problem. The \ufb01rst approach is network\nreduction via pruning[5, 4, 17], in which the learned network is reduced while maximizing accuracy.\nThis method can be applied to most general network architectures. However, it requires additional\nprocesses after or during training, and therefore, it may further increase the overall amount of time\nrequired for preparing the \ufb01nal networks.\nThe second approach is to use lightweight network architectures[10, 8, 18] or new components[13,\n23, 26] to accommodate limited environments. This approach does not focus only on limited\nresource environments; it can provide better solutions for many applications by reducing resource\nusage while maintaining or even improving accuracy. 
Recently, grouped or depthwise convolution\nhas attracted attention because it reduces the computational complexity greatly while maintaining\naccuracy. Therefore, it has been adopted in many new architectures[2, 24, 8, 18, 26].\nDecomposing a convolution is effective for reducing the number of parameters and computational\ncomplexity. Initially, a large convolution was decomposed using several small convolutions[19, 22].\nBinary pattern networks[13] have also been proposed to reduce network size. This raises the\nquestion of what the atomic unit is for composing a convolution. We show that a convolution can be\ndeconstructed into two components: 1\u00d71 convolution and shift operation.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nRecently, shift operations have been used to replace spatial convolutions[23] and reduce the number of\nparameters and computational complexity. In this approach, shift amounts are assigned heuristically\nby grouping input channels. In this study, we formulate the shift operation as a learnable function\nwith shift parameters and optimize it through training. This generalization affords many benefits.\nFirst, we do not need to assign shift values heuristically; instead, they can be trained from random\ninitializations. Second, we can simulate convolutions with large receptive fields such as a dilated\nconvolution[1, 25]. Finally, we can obtain a significantly improved tradeoff between performance\nand complexity.\nThe contributions of this paper are summarized as follows:\n\n\u2022 We deconstruct a heavy convolution into two atomic operations: 1\u00d71 convolution and\nshift operation. This deconstruction can greatly reduce the parameters and computational\ncomplexity.\n\u2022 We propose an active shift layer (ASL) with learnable shift parameters that allows\noptimization through backpropagation. 
It can replace the existing spatial convolutions while\nreducing the computational complexity and inference time.\n\u2022 The proposed method is used to construct a light and fast network. This network shows\n\nstate-of-the-art results with fewer parameters and low inference time.\n\n2 Related Work\nDecomposition of Convolution VGG[19] decomposes a 5\u00d75 convolution into two 3\u00d73 convolu-\ntions to reduce the number of parameters and simplify network architectures. GoogleNet[22] uses\n1\u00d77 and 7\u00d71 spatial convolutions to simulate a 7\u00d77 convolution. Lebedev et al. [16] decomposed a\nconvolution with a sum of four convolutions with small kernels. Recently, depthwise separable convo-\nlution has been shown to achieve similar accuracy to naive convolution while reducing computational\ncomplexity[2, 8, 18].\nIn addition to decomposing a convolution into other types of convolutions, a new unit has been\nproposed. A \ufb01xed binary pattern[13] has been shown to be an ef\ufb01cient alternative for spatial\nconvolution. A shift operation[23] can approximately simulate spatial convolution without any\nparameters and \ufb02oating point operations (FLOPs). An active convolution unit (ACU)[11] and\ndeformable convolution[3] showed that the input position of convolution can be learned by introducing\ncontinuous displacement parameters.\nMobile Architectures Various network architectures have been proposed for mobile applications with\nlimited resources. SqueezeNet[10] designed a \ufb01re module for reducing the number of parameters and\ncompressed the trained network to a very small size. Shuf\ufb02eNet[26] used grouped 1\u00d71 convolution\nto reduce dense connections while retaining the network width and suggested a shuf\ufb02e layer to\nmix grouped features. MobileNet[18, 8] used depthwise convolution to reduce the computational\ncomplexity.\nNetwork Pruning Network pruning[5, 4, 17] is not closely related to our work. 
However, this\nmethodology has similar aims. It reduces the computational complexity of trained architectures while\nmaintaining the accuracy of the original networks. It can also be applied to our networks for further\noptimization.\n\n3 Method\n\nThe basic convolution has many weight parameters and large computational complexity. If the\ndimension of the weight parameter is D \u00d7 C \u00d7 K, the computational complexity (in FLOPs) is\n\n$$(D \\times C \\times K) \\times (W \\times H), \\tag{1}$$\n\nwhere K is the spatial dimension of the kernel (e.g., K is nine for 3\u00d73 convolution); C and D are\nthe numbers of input and output channels, respectively; and the spatial dimension of the input feature\nis W \u00d7 H.\n\n3.1 Deconstruction of Convolution\n\nThe basic convolution for one spatial position can be formulated as follows:\n\n$$\\tilde{y}_{d,m,n} = \\sum_{c} \\sum_{k} \\tilde{w}_{d,c,k} \\cdot \\tilde{x}_{c,m+i_k,n+j_k} = \\sum_{k} \\sum_{c} \\tilde{w}_{d,c,k} \\cdot \\tilde{x}_{c,m+i_k,n+j_k}, \\tag{2}$$\n\nwhere $\\tilde{x}_{\\cdot,\\cdot,\\cdot}$ is an element of the C \u00d7 W \u00d7 H input tensor, and $\\tilde{y}_{\\cdot,\\cdot,\\cdot}$ is an element of the D \u00d7 W \u00d7 H\noutput tensor. $\\tilde{w}_{\\cdot,\\cdot,\\cdot}$ is an element of the D \u00d7 C \u00d7 K weight tensor $\\tilde{W}$. k is the spatial index of the\nkernel, which points from top-left to bottom-right of a spatial dimension of the kernel. $i_k$ and $j_k$ are\ndisplacement values for the corresponding kernel index k. The above equation can be converted to a\nmatrix multiplication:\n\n$$Y = W^{+} \\times X^{+} = \\begin{bmatrix} \\tilde{W}_{:,:,1} & \\tilde{W}_{:,:,2} & \\cdots & \\tilde{W}_{:,:,K} \\end{bmatrix} \\times \\begin{bmatrix} X^{1}_{:,:} \\\\ X^{2}_{:,:} \\\\ \\vdots \\\\ X^{K}_{:,:} \\end{bmatrix} = \\sum_{k} \\tilde{W}_{:,:,k} \\times X^{k}_{:,:} = \\sum_{k} \\tilde{W}_{:,:,k} \\times S_k(X), \\tag{3}$$\n\nwhere $W^{+}$ and $X^{+}$ are reordered matrices for the weight and input, respectively. 
$\\tilde{W}_{:,:,k}$ represents\nthe D \u00d7 C matrix corresponding to the kernel index k. X is a C \u00d7 (W \u00b7 H) matrix, and $X^{k}_{:,:}$ is a\nC \u00d7 (W \u00b7 H) matrix that represents a spatially shifted version of input matrix X with shift amount\n$(i_k, j_k)$. Then, $W^{+}$ becomes a D \u00d7 (K \u00b7 C) matrix, and $X^{+}$ becomes a (K \u00b7 C) \u00d7 (W \u00b7 H) matrix.\nThe output Y forms a D \u00d7 (W \u00b7 H) matrix.\nEq. (3) shows that the basic convolution is simply the sum of 1\u00d71 convolutions on shifted inputs.\nThe shifted input $X^{k}_{:,:}$ can be formulated using the shift function $S_k$ that maps the original input to\nthe shifted input corresponding to the kernel index k. The conventional convolution uses the usual\nshifted input with integer-valued shift amounts for each kernel index k. As an extreme case, if we\ncan share the shifted inputs regardless of the kernel index, that is, $S_k(X) = S(X)$, this simplifies to\njust one pointwise (i.e., 1\u00d71) convolution and greatly reduces the computation complexity:\n\n$$\\sum_{k} \\tilde{W}_{:,:,k} \\times S_k(X) = \\sum_{k} \\tilde{W}_{:,:,k} \\times S(X) = \\Big(\\sum_{k} \\tilde{W}_{:,:,k}\\Big) \\times S(X) = W \\times S(X) \\tag{4}$$\n\nHowever, as in this extreme case, if only one shift is applied to all input channels, the receptive field\nof convolution is too limited, and the network will provide poor results. To overcome this problem,\nShiftNet[23] introduced grouped shift that applies different shift values by grouping input channels.\nThis shift function can be represented by Eq. (5) and is followed by the single pointwise convolution\n$Y = W \\times S_G(X)$.\n\n$$S_G(X) = \\begin{bmatrix} X^{1}_{1:n,:} \\\\ X^{2}_{n+1:2n,:} \\\\ \\vdots \\\\ X^{G}_{(G-1)n+1:,:} \\end{bmatrix} \\tag{5}$$\n\n$$S_C(X) = \\begin{bmatrix} X^{1}_{1,:} \\\\ X^{2}_{2,:} \\\\ \\vdots \\\\ X^{C}_{C,:} \\end{bmatrix} \\tag{6}$$\n\nwhere n is the number of channels per kernel, and it is the same as $\\lfloor C/K \\rfloor$. 
G is the number of shift\ngroups. If C is a multiple of K, G is the same as K; otherwise, G is K+1. The shift function applies\ndifferent shift values according to the group number and not by the kernel index. The amount of shift\nis assigned heuristically to cover all kernel dimensions, and pointwise convolution is applied before\nand after the shift layer to make it invariant to the permutation of input and output channels.\n\n3.2 Active Shift Layer\n\nApplying the shift by group is a little artificial, and if the kernel size is large, the number of input\nchannels per shift is reduced. To solve this problem, we suggest a depthwise shift layer that applies\ndifferent shift values for each channel (Eq. (6)). This is similar to decomposing a convolution\ninto depthwise separable convolutions. Depthwise separable convolution first makes features with\ndepthwise convolution and mixes them with 1\u00d71 convolutions. Similarly, a depthwise shift layer\nshifts each input channel, and shifted inputs are mixed with 1\u00d71 convolutions.\n\nFigure 1: Comparison of shift operation: (a) Grouped Shift: shifting is applied to each group and the shift amount is\nassigned heuristically[23] and (b) ASL: shifting is applied to each channel using shift parameters and they\nare optimized by training.\n\nBecause the shifted input $S_C(X)$ goes through a single 1\u00d71 convolution, this reduces the\ncomputational complexity by a factor of the kernel size K. More importantly, by removing both the\nspatial convolution and sparse memory access, it is possible to construct a network with only dense\noperations like 1\u00d71 convolutions. This can provide a greater speed improvement than that achieved\nby reducing only the number of FLOPs.\nNext, we consider how to assign the shift value for each channel. 
The exhaustive search over all\npossible combinations of assigning the shift values for each channel is intractable, and assigning\nvalues heuristically is suboptimal. We formulated the shift values as a learnable function with the\nadditional shift parameter $\\theta_s$ that defines the amount of shift of each channel (Eq. (7)). We called the\nnew component the active shift layer (ASL).\n\n$$\\theta_s = \\{(\\alpha_c, \\beta_c) \\mid 1 \\le c \\le C\\} \\tag{7}$$\n\nwhere c is the index of the channel, and the parameters $\\alpha_c$ and $\\beta_c$ define the horizontal and vertical\namount of shift, respectively. If the parameter is an integer, it is not differentiable and cannot be\noptimized. We can relax the integer constraint and allow $\\alpha_c$ and $\\beta_c$ to take real numbers, and the\nvalue for non-integer shift can be calculated through interpolation. We used bilinear interpolation\nfollowing [11]:\n\n$$\\tilde{x}_{c,m+\\alpha_c,n+\\beta_c} = Z^{1}_{c} \\cdot (1 - \\Delta\\alpha_c) \\cdot (1 - \\Delta\\beta_c) + Z^{3}_{c} \\cdot \\Delta\\alpha_c \\cdot (1 - \\Delta\\beta_c) + Z^{2}_{c} \\cdot (1 - \\Delta\\alpha_c) \\cdot \\Delta\\beta_c + Z^{4}_{c} \\cdot \\Delta\\alpha_c \\cdot \\Delta\\beta_c, \\tag{8}$$\n\n$$\\Delta\\alpha_c = \\alpha_c - \\lfloor \\alpha_c \\rfloor, \\quad \\Delta\\beta_c = \\beta_c - \\lfloor \\beta_c \\rfloor, \\tag{9}$$\n\nwhere (m, n) is the spatial position of the feature map, and $Z^{i}_{c}$ are the four nearest integer points for\nbilinear interpolation:\n\n$$Z^{1}_{c} = x_{c,m+\\lfloor \\alpha_c \\rfloor,n+\\lfloor \\beta_c \\rfloor}, \\quad Z^{2}_{c} = x_{c,m+\\lfloor \\alpha_c \\rfloor,n+\\lfloor \\beta_c \\rfloor+1}, \\quad Z^{3}_{c} = x_{c,m+\\lfloor \\alpha_c \\rfloor+1,n+\\lfloor \\beta_c \\rfloor}, \\quad Z^{4}_{c} = x_{c,m+\\lfloor \\alpha_c \\rfloor+1,n+\\lfloor \\beta_c \\rfloor+1}. \\tag{10}$$\n\nBy using interpolations, the shift parameters are differentiable, and therefore, they can be trained\nthrough backpropagation. 
With the shift parameter $\\theta_s$, a conventional convolution can be formulated\nas follows with ASL $S^{\\theta_s}_{C}$:\n\n$$Y = W \\times S^{\\theta_s}_{C}(X) = W \\times \\begin{bmatrix} X^{(\\alpha_1,\\beta_1)}_{1,:} \\\\ X^{(\\alpha_2,\\beta_2)}_{2,:} \\\\ \\vdots \\\\ X^{(\\alpha_C,\\beta_C)}_{C,:} \\end{bmatrix} \\tag{11}$$\n\nTable 1: Comparison of inference time vs. number\nof FLOPs. A smaller number of FLOPs does not\nguarantee fast inference.\n\nLayer               Inference time   FLOPs\n3\u00d73 dw-conv         39 ms            29M\n1\u00d71 conv            11 ms            206M\nBN+Scale/Bias (a)   3 ms             <5M\nReLU                5 ms             <5M\nEltwise Sum         2 ms             <5M\n\n(a) The implementation of BN in Caffe [12] is not\nefficient because it is split into two components. We\nintegrated them for fast inference.\n\nFigure 2: Ratio of time to FLOPs. This represents\nthe inference time per 1M FLOPs. A lower\nvalue means that the unit runs efficiently. Time\nis measured using an Intel i7-5930K CPU with a\nsingle thread and averaged over 100 repetitions.\n\nASL affords many advantages compared to the previous shift layer[23] (Fig. 1). First, shift values do\nnot need to be assigned manually; instead, they are learned through backpropagation during training.\nSecond, ASL does not depend on the original kernel size. In previous studies, if the kernel size\nchanged, the group for the shift also changed. However, ASL is independent of kernel size, and it can\nenlarge its receptive field by increasing the amount of shift. This means that ASL can mimic dilated\nconvolution[1, 25]. Furthermore, there is no need to permute channels to achieve the invariance properties in\nASL; therefore, it is not necessary to use 1\u00d71 convolution before ASL. Although our component has\nparameters unlike the heuristic shift layer[23], these are almost negligible compared to the number of\nconvolution parameters. 
If the number of input channels is C, ASL has only 2 \u00b7 C parameters.\n\n3.3 Trap of FLOPs\n\nThe number of FLOPs is widely used for comparing model complexity, and it is considered proportional to the run\ntime. However, a small number of FLOPs does not guarantee fast execution speed. Memory access\ntime can be a more dominant factor in real implementations. Because I/O devices usually access\nmemory in units of blocks, many densely packed values might be read faster than a few\nwidely scattered values. Therefore, the implementability of an efficient algorithm in terms of both\nFLOPs and memory access time would be more important. Although a 1\u00d71 convolution has many\nFLOPs, this is a dense matrix multiplication that is highly optimized through general matrix multiply\n(GEMM) functions. Although depthwise convolution reduces the number of parameters and FLOPs\ngreatly, this operation needs fragmented memory access that is not easy to optimize.\nTo gain a better understanding, we performed simple experiments to observe the inference time\nof each unit. These experiments were conducted using a 224\u00d7224 image with 64 channels, and the\noutput channel dimension of the convolutions is also 64. Table 1 shows the experimental results.\nAlthough 1\u00d71 convolution has a much larger number of FLOPs than 3\u00d73 depthwise convolution, the\ninference of 1\u00d71 convolution is faster than that of 3\u00d73 depthwise convolution.\nThe ratio of inference time to FLOPs can be a useful measure for analyzing the efficiency of layers\n(Fig. 2). As expected, 1\u00d71 convolution is most efficient, and the other units have similar efficiency.\nHere, it should be noted that although auxiliary layers are fast compared to convolutions, they are not\nthat efficient from an implementation viewpoint. Therefore, to make a fast network, we have to also\nconsider these auxiliary layers. 
These results are derived using the popular deep learning package\nCaffe [12], which does not guarantee an optimized implementation. However, they show that we should\nnot rely on the number of FLOPs alone to compare the network speed.\n\n4 Experiment\n\nTo demonstrate the performance of our proposed method, we conducted several experiments with\nclassification benchmark datasets. For ASL, the shift parameters are randomly initialized with\nuniform distribution between -1 and 1. We used a normalized gradient following ACU[11] with an\ninitial learning rate of 1e-2. Input images are normalized for all experiments.\n\n4.1 Experiment on CIFAR-10/100\n\nWe conducted experiments to verify the basic performance of ASL with the CIFAR-10/100\ndataset [14] that contains 50k training and 10k test 32\u00d732 images. We used conventional\npreprocessing methods [7]: images are padded with four pixels of zeros on each side, flipped horizontally, and cropped\nrandomly. We trained for 64k iterations with an initial learning rate of 0.1, which was multiplied by 0.1 after\n32k and 48k iterations.\nWe compared the results with those of ShiftNet[23], which applied shift operations heuristically. The\nbasic building block follows the order BN-ReLU-1\u00d71 Conv-BN-ReLU-ASL-1\u00d71 Conv. The network size\nis controlled by multiplying the expansion rate (\u03b5) on the first convolution of each residual block.\nTable 3 shows the results; ours are consistently better than the previous results by a large margin. We\nfound that widening the base width is more efficient than increasing the expansion rate; this increases\nthe width of all layers in a network. With the same depth of 20, the network with base width of 46\nachieved better accuracy than that with a base width of 16 and expansion rate of 9. 
The last row of\nthe table shows that our method provided better results with fewer parameters and smaller depth.\nBecause increasing depth caused an increase in inference time, this approach is also better in terms of\ninference speed.\n\nTable 3: Comparison with ShiftNet. Our results are better by a large margin. The last row shows that\nour result is better with a smaller number of parameters and depth.\n\nDepth (1)   Base Width   \u03b5   Param(M)   ShiftNet[23] C10   ShiftNet[23] C100   ASNet(ours) C10   ASNet(ours) C100\n20          16           1   0.035      86.66              55.62               89.14             63.43\n20          16           3   0.1        90.08              62.32               91.62             68.83\n20          16           6   0.19       90.59              68.64               92.54             70.68\n20          16           9   0.28       91.69              69.82               92.93             71.83\n20          46           1   0.28       -                  -                   93.52             73.07\n110         16           6   1.2        93.17              72.56               93.73             73.46\n20          88           1   0.99       -                  -                   94.53             76.73\n\n(1) Depths are counted without shift layer, as noted in a previous study\n\nInterestingly, our proposed layer not only reduces the computational complexity but also could\nimprove the network performance. Table 2 shows a comparison result with the network using\ndepthwise convolution. A focus on optimizing resources might reduce the accuracy; nonetheless,\nour proposed architecture provided better results. This shows the possibility of extending ASL\nto a general network structure. A network with ASL runs faster in terms of inference time, and\ntraining with ASL is also much faster owing to the reduction in sparse memory access and the number\nof BNs.\n\nTable 2: Comparison with networks for depthwise convolution.\nASL makes the network faster and provides better results. B\nand DW denote the BN-ReLU layer and depthwise convolution,\nrespectively. For a fair comparison of BN, we also conducted\nexperiments on a network without BN-ReLU between depthwise\nconvolution and last 1\u00d71 convolution (1B-DW3-1).\n\nBuilding block    C10     Inference Time (CPU (a))   Training Time (GPU (b))\n1B-DW3-B-1        94.16   16 ms                      9h03\n1B-DW3-1          93.97   15 ms                      7h41\n1B-ASL-1(ours)    94.5    10.6 ms                    5h53\n\n(a) Intel i7-5930K\n(b) GTX Titan X (Maxwell)\n\nFig. 3 shows an example of the shift parameters of each layer after optimizations (ASNet with base\nwidth of 88). Large shift parameter values mean that a network can view a large receptive field. This\nis similar to the cases of ACU[11] and dilated convolution[1, 25], both of which enlarge the receptive\nfield without additional weight parameters. An interesting phenomenon is observed: whereas the\nshift values of the other layers are irregular, the shift values in the layers with stride 2 (stage2/shift1,\nstage3/shift1) tend toward the center between the pixels. This seems to compensate for\nreducing the resolution. These features make ASL more powerful than conventional convolutions,\nand it could result in higher accuracy.\n\nFigure 3: Trained shift values of each layer. Shifted values are scattered to various positions. This\nenables the network to cover multiple receptive fields.\n\n4.2 Experiment on ImageNet\n\nTo prove the generality of the proposed method, we conducted experiments with an ImageNet 2012\nclassification task. We did not apply intensive image augmentation; we simply used a randomly\nflipped image with 224\u00d7224 cropped from 256\u00d7256 following Krizhevsky et al. [15]. The initial\nlearning rate was 0.1 with a linearly decaying learning rate, and the weight decay was 1e-4. We\ntrained for 90 epochs with a batch size of 256.\nOur network is similar to a residual network with bottleneck blocks[6, 7], and Table 6 shows\nour network architecture. All building blocks are in a residual path, and we used a pre-activation\nresidual[7]. We used the same basic blocks as in the previous experiment, and we used only one spatial\nconvolution for the first layer. Because increasing the depth also increases the number of auxiliary\nlayers, we expanded the network width to increase the accuracy as in previous studies[8, 18, 23, 26].\nThe network width is controlled by the base width w.\nTable 5 shows the result for ImageNet, and our method obtains much better results with fewer\nparameters compared to other networks. In particular, we surpass ShiftNet by a large margin; this\nindicates that ASL is much more powerful than heuristically assigned shift operations. In terms of\ninference time, we compared our network with MobileNetV2, one of the best-performing networks\nwith fast inference time. MobileNetV2 has fewer FLOPs compared to our network; however, this\ndoes not guarantee a fast inference time, as we noted in section 3.3. Our network runs faster than\nMobileNetV2 because we have smaller depth and no depthwise convolutions that run slow owing to\nmemory access (Fig. 4).\n\n4.3 Ablation Study\n\nAlthough both ShiftNet[23] and ASL use shift operation, ASL achieved much better accuracy. This\nis because of a key characteristic of ASL, namely, that it learns the real-valued shift amounts using\nthe network itself. To clarify the factor of improvement according to the usage of ASL, we conducted\nadditional experiments using ImageNet (AS-ResNet-w32). Table 4 shows the top-1 accuracy with\nthe amount of improvement inside parentheses. Grouped Shift (GS) indicates the same method as\nShiftNet. Sampled Real (SR) indicates cases in which the initialization of the shift values was sampled\nfrom a Gaussian distribution with standard deviation 1 to imitate the final state of shift values trained\nby ASL. Similarly, the values for Sampled Integer (SI) are obtained from a Gaussian distribution, but\neach real-valued sample is rounded to an integer point. 
Training Real (TR) is the same as our proposed\nmethod, and only TR learns the shift values.\nComparing GS with SI suggests that random integer sampling is slightly better than heuristic\nassignment owing to the potential expansion of the receptive field as the amount of shift can be larger\nthan 1. The result of SR shows that a relaxation of the shift values to the real domain, another key\ncharacteristic of ASL, turned out to be even more helpful. In terms of learning shift values, TR\nachieved the largest improvement, which was around two times that achieved by SR. These results\nshow the effectiveness of learning real-valued shift parameters.\n\nTable 4: Ablation study using AS-ResNet-w32 on ImageNet. The improvement by using ASL\noriginated from expanding the domain of a shift parameter from integer to real and learning shifts.\n\nMethod            Shifting Domain   Initialization   Learning Shift   Top-1\nGrouped Shift     Integer           Heuristic        X                59.8 (-)\nSampled Integer   Integer           N(0, 1)          X                60.1 (+0.3)\nSampled Real      Real              N(0, 1)          X                61.9 (+2.1)\nTraining Real     Real              U[\u22121, 1]         O                64.1 (+4.3)\n\nTable 5: Comparison with other networks. Our networks achieved better results with similar number\nof parameters. Compared to MobileNetV2, our network runs faster although it has a larger number of\nFLOPs. 
Table is sorted by descending order of the number of parameters.\n\nNetwork               Top-1   Top-5   Param(M)   FLOPs(M)   Inference Time (a) CPU(ms)   Inference Time (a) GPU(ms)\nMobileNetV1[8]        70.6    -       4.2        569        -                            -\nShiftNet-A[23]        70.1    89.7    4.1        1.4G       74.1                         10.04\nMobileNetV2[18]       71.8    91      3.47       300        54.7                         7.07\nAS-ResNet-w68(ours)   72.2    90.7    3.42       729        47.9                         6.73\nShuffleNet-\u00d71.5[26]   71.3    -       3.4        292        -                            -\nMobileNetV2-\u00d70.75     69.8    89.6    2.61       209        40.4                         6.23\nAS-ResNet-w50(ours)   69.9    89.3    1.96       404        32.1                         6.14\nMobileNetV2-\u00d70.5      65.4    86.4    1.95       97         26.8                         5.73\nMobileNetV1-\u00d70.5      63.7    -       1.3        149        -                            -\nSqueezeNet[10]        57.5    80.3    1.2        -          -                            -\nShiftNet-B            61.2    83.6    1.1        371        31.8                         7.88\nAS-ResNet-w32(ours)   64.1    85.4    0.9        171        18.7                         5.37\nShiftNet-C            58.8    82      0.78       -          -                            -\n\n(a) Measured by Caffe [12] using an Intel i7-5930K CPU with a single thread and GTX Titan X (Maxwell).\nInference time for MobileNet and ShiftNet (including FLOPs) are measured by using their network description.\n\nFigure 4: Comparison with MobileNetV2[18]. Our network achieves better results with the same\ninference time.\n\nTable 6: Network structure for AS-ResNet. Network width is controlled by base width w\n\nInput size   Output size   Operator          Output channel   Repeat   stride\n224^2        112^2         3\u00d73 conv          w                1        2\n112^2        112^2         basic block       w                1        1\n112^2        56^2          basic block       w                3        2\n56^2         28^2          basic block       2w               4        2\n28^2         14^2          basic block       4w               6        2\n14^2         7^2           basic block       8w               3        2\n7^2          1             global avg-pool   -                1        -\n1            1             fc                1000             1        -\n\n5 Conclusion\n\nIn this study, we deconstruct convolution to a shift operation followed by pointwise convolution.\nWe formulate a shift operation as a function having additional parameters. The amount of shift\ncan be learned end-to-end through backpropagation. The ability to learn shift values can help\nmimic various types of convolutions. 
By sharing the shifted input, the number of parameters and\ncomputational complexity can be reduced greatly. We also showed that using ASL could improve the\nnetwork accuracy while reducing the network parameters and inference time. By using the proposed\nlayer, we suggested a fast and light network and achieved better results compared to those of existing\nnetworks. The use of ASL for more general network architectures could be an interesting extension\nof the present study.\n\nReferences\n[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille.\nSemantic image segmentation with deep convolutional nets and fully connected crfs. In\nInternational Conference on Learning Representations (ICLR), 2015.\n\n[2] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference\non Computer Vision and Pattern Recognition (CVPR), pages 1800\u20131807, July 2017.\n\n[3] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional\nnetworks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV),\npages 764\u2013773, Oct 2017. doi: 10.1109/ICCV.2017.89.\n\n[4] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections\nfor efficient neural network. In Advances in Neural Information Processing Systems, pages\n1135\u20131143, 2015.\n\n[5] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural\nnetworks with pruning, trained quantization and huffman coding. In International Conference\non Learning Representations (ICLR), 2016.\n\n[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\n2016.\n\n[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual\nnetworks. In European Conference on Computer Vision (ECCV), pages 630\u2013645. 
Springer,\n2016.\n\n[8] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias\nWeyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Ef\ufb01cient convolutional neural\nnetworks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.\n\n[9] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In The IEEE Conference on\n\nComputer Vision and Pattern Recognition (CVPR), June 2018.\n\n9\n\n\f[10] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and\nKurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb\nmodel size. arXiv preprint arXiv:1602.07360, 2016.\n\n[11] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image\nclassi\ufb01cation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages\n1846\u20131854. IEEE, 2017.\n\n[12] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick,\nSergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature\nembedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages\n675\u2013678. ACM, 2014.\n\n[13] Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional\nneural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\nJuly 2017.\n\n[14] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.\n\n2009.\n\n[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep\nconvolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Wein-\nberger, editors, Advances in Neural Information Processing Systems, pages 1097\u20131105. 
Curran\nAssociates, Inc., 2012.\n\n[16] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky.\nSpeeding-up convolutional neural networks using fine-tuned CP-decomposition. In International\nConference on Learning Representations (ICLR), 2014.\n\n[17] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for\nefficient convnets. In International Conference on Learning Representations (ICLR), 2017.\n\n[18] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.\nMobileNetV2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), June 2018.\n\n[19] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale\nImage Recognition. In International Conference on Learning Representations (ICLR), pages\n1\u201314, 2015.\n\n[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,\nDumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.\nIn The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.\n\n[21] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex A. Alemi. Inception-v4,\nInception-ResNet and the impact of residual connections on learning. In ICLR 2016 Workshop,\n2016.\n\n[22] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.\nRethinking the Inception architecture for computer vision. In The IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), pages 2818\u20132826, 2016.\n\n[23] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir\nGholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to\nspatial convolutions. 
In The IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), June 2018.\n\n[24] Saining Xie, Ross Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. Aggregated residual\ntransformations for deep neural networks. In The IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), pages 5987\u20135995. IEEE, 2017.\n\n[25] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In\nInternational Conference on Learning Representations (ICLR), 2016.\n\n[26] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient\nconvolutional neural network for mobile devices. In The IEEE Conference on Computer Vision\nand Pattern Recognition (CVPR), June 2018.\n", "award": [], "sourceid": 2893, "authors": [{"given_name": "Yunho", "family_name": "Jeon", "institution": "KAIST"}, {"given_name": "Junmo", "family_name": "Kim", "institution": "KAIST"}]}