{"title": "Convolution with even-sized kernels and symmetric padding", "book": "Advances in Neural Information Processing Systems", "page_first": 1194, "page_last": 1205, "abstract": "Compact convolutional neural networks gain efficiency mainly through depthwise convolutions, expanded channels and complex topologies, which contrarily aggravate the training process. Besides, 3x3 kernels dominate the spatial representation in these models, whereas even-sized kernels (2x2, 4x4) are rarely adopted. In this work, we quantify the shift problem occurs in even-sized kernel convolutions by an information erosion hypothesis, and eliminate it by proposing symmetric padding on four sides of the feature maps (C2sp, C4sp). Symmetric padding releases the generalization capabilities of even-sized kernels at little computational cost, making them outperform 3x3 kernels in image classification and generation tasks. Moreover, C2sp obtains comparable accuracy to emerging compact models with much less memory and time consumption during training.\nSymmetric padding coupled with even-sized convolutions can be neatly implemented into existing frameworks, providing effective elements for architecture designs, especially on online and continual learning occasions where training efforts are emphasized.", "full_text": "Convolution with even-sized kernels and symmetric\n\npadding\n\nShuang Wu1, Guanrui Wang1, Pei Tang1, Feng Chen2, Luping Shi1\n\n1Department of Precision Instrument, 2Department of Automation\n\n{lpshi,chenfeng}@mail.tsinghua.edu.cn\n\nCenter for Brain Inspired Computing Research\n\nBeijing Innovation Center for Future Chip\n\nTsinghua University\n\nAbstract\n\nCompact convolutional neural networks gain ef\ufb01ciency mainly through depthwise\nconvolutions, expanded channels and complex topologies, which contrarily aggra-\nvate the training process. 
Besides, 3×3 kernels dominate the spatial representation in these models, whereas even-sized kernels (2×2, 4×4) are rarely adopted. In this work, we quantify the shift problem that occurs in even-sized kernel convolutions by an information erosion hypothesis, and eliminate it by proposing symmetric padding on four sides of the feature maps (C2sp, C4sp). Symmetric padding releases the generalization capabilities of even-sized kernels at little computational cost, making them outperform 3×3 kernels in image classification and generation tasks. Moreover, C2sp obtains comparable accuracy to emerging compact models with much less memory and time consumption during training. Symmetric padding coupled with even-sized convolutions can be neatly implemented into existing frameworks, providing effective elements for architecture designs, especially on online and continual learning occasions where training efforts are emphasized.

1 Introduction

Deep convolutional neural networks (CNNs) have achieved significant successes in numerous computer vision tasks such as image classification [37], semantic segmentation [43], image generation [8], and game playing [28]. Other than domain-specific applications, various architectures have been designed to improve the performance of CNNs [11, 14, 3], wherein the feature extraction and representation capabilities are mostly enhanced by deeper and wider models containing ever-growing numbers of parameters and operations. Thus, the memory overhead and computational complexity greatly impede their deployment in embedded AI systems. This motivates the deep learning community to design compact CNNs with reduced resources, while still retaining satisfactory performance.

Compact CNNs mostly derive generalization capabilities from architecture engineering. Shortcut connection [11] and dense concatenation [14] alleviate the degradation problem as the network deepens.
Feature maps (FMs) are expanded by pointwise convolution (C1) and bottleneck architecture [35, 40]. Multi-branch topology [38], group convolution [42], and channel shuffle operation [47] recover accuracy at the cost of network fragmentation [25]. More recently, there is a trend towards mobile models with <10M parameters and <1G FLOPs [13, 25, 23], wherein the depthwise convolution (DWConv) [5] plays a crucial role as it decouples cross-channel correlations and spatial correlations. Aside from human priors and handcrafted designs, emerging neural architecture search (NAS) methods optimize structures by reinforcement learning [48], evolution algorithm [32], etc.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Despite the progress, the fundamental spatial representation is dominated by 3×3 kernel convolutions (C3) and the exploration of other kernel sizes is stagnating. Even-sized kernels are deemed inferior and rarely adopted as basic building blocks for deep CNN models [37, 38]. Besides, most of the compact models concentrate on the inference efforts (parameters and FLOPs), whereas the training efforts (memory and speed) are neglected or even become more intractable due to complex topologies [23], expanded channels [35], and additional transformations [15, 47, 40]. With the growing demands for online and continual learning applications, the training efforts should be jointly addressed and further emphasized. Furthermore, recent advances in data augmentation [46, 7] have shown much more powerful and universal benefits. A simpler structure combined with enhanced augmentations easily eclipses the progress made by intricate architecture engineering, inspiring us to rethink basic convolution kernels and the mathematical principles behind them.

In this work, we explore the generalization capabilities of even-sized kernels (2×2, 4×4).
Direct implementation of these kernels encounters performance degradation in both classification and generation tasks, especially in deep networks. We quantify this phenomenon by an information erosion hypothesis: even-sized kernels have asymmetric receptive fields (RFs) that produce pixel shifts in the resulting FMs. The location offset accumulates when stacking multiple convolutions, thus severely eroding the spatial information. To address the issue, we propose convolution with even-sized kernels and symmetric padding on each side of the feature maps (C2sp, C4sp).

Symmetric padding not only eliminates the shift problem but also extends the RFs of even-sized kernels. Various classification results demonstrate that C2sp is an effective decomposition of C3 in terms of a 30%-50% saving of parameters and FLOPs. Moreover, compared with compact CNN blocks such as DWConv, inverted-bottleneck [35], and ShiftNet [40], C2sp achieves competitive accuracy with >20% speedup and >35% memory saving during training. In generative adversarial networks (GANs) [8], C2sp and C4sp both obtain stabilized convergence and improved image qualities. Our work stimulates a new perspective full of optional units for architecture engineering, as well as provides basic but effective alternatives that balance both the training and inference efforts.

2 Related work

Our method belongs to compact CNNs that design new architectures and then train them from scratch. In contrast, most network compression methods in the literature attempt to prune weights [9] or quantize operands and operations [41] in terms of implementation on neural accelerators [4, 31]. These methods are orthogonal to our work and can be jointly adopted.

Even-sized kernel Even-sized kernels are mostly applied together with stride 2 to resize images. For example, GAN models in [27] apply 4×4 kernels and stride 2 in the discriminators and generators to avoid the checkerboard artifact [29].
However, 3×3 kernels are preferred when it comes to deep and large-scale GANs [18, 21, 3]. Except for scaling, few works have implemented even-sized kernels as basic building blocks for their CNN models. In [10], 2×2 kernels are tested with relatively shallow (about 10 layers) models, where the FM sizes between two convolution layers are not preserved strictly. In relational reinforcement learning [45], two C2 layers are adopted to achieve reasoning and planning of objects represented by 4 pixels.

Atrous convolution Dilated convolution [43] supports exponential expansions of RFs without loss of resolution or coverage, which is specifically suitable for dense prediction tasks such as semantic segmentation. Deformable convolution [6] augments the spatial sampling locations of kernels by additional 2D offsets and learns the offsets directly from target datasets. Therefore, deformable kernels shift at pixel-level and focus on geometric transformations. ShiftNet [40] sidesteps spatial convolutions entirely by shift kernels that contain no parameter or FLOP. However, it requires large channel expansions to reach satisfactory performance. Active shift [16] formulates the amount of FM shift as a learnable function and optimizes two parameters through end-to-end backpropagation.

3 Symmetric padding

3.1 The shift problem

We start with the spatial correlation in basic convolution kernels. Intuitively, replacing a C3 with two C2s should provide performance gains aside from an 11% reduction of overheads, which is inspired by the factorization of C5 into two C3s [37]. However, experiments in Figure 3 indicate that the classification accuracy of C2 is inferior to C3 and saturates much faster as the network deepens.

Figure 1: Normalized FMs derived from well-trained ResNet-56 models. Three spatial sizes 32×32, 16×16, and 8×8 before down-sampling stages are presented.
First row: Conv2×2 with asymmetric padding (C2). Second row: Conv2×2 with symmetric padding (C2sp). Left: a sample in the CIFAR10 test dataset. Right: average results from all the samples in the test dataset.

Besides, replacing each C3 with C4 also hurts accuracy, even though a 3×3 kernel can be regarded as a subset of a 4×4 kernel, which contains 77% more parameters and FLOPs. To address this issue, the FMs of well-trained ResNet-56 [11] models with C2 and C2sp are visualized in Figure 1. Since every single FM (channel) is very stochastic and hard to interpret, we report the average values of all channels in that convolution layer. FMs of C4 and C3 behave similarly to those of C2 and C2sp, respectively, and are omitted for clarity. It is clearly seen that the post-activation (ReLU) values in C2 are gradually shifting to the left-top corner of the spatial location. These compressed and distorted features are not suitable for the following classification, let alone pixel-level tasks based on it such as detection and semantic segmentation, where all the annotations will have offsets starting from the left-top corner of the image.

We identify this as the shift problem observed in even-sized kernels. A conventional convolution between c_i input FMs F and c_o output FMs, with square kernels w of size k × k, can be given as

F^o(p) = \sum_{i=1}^{c_i} \sum_{\delta \in R} w^i(\delta) \cdot F^i(p + \delta),  (1)

where \delta and p enumerate locations in the RF R and in FMs of size h × w, respectively. When k is an odd number, e.g., 3, we define the central point of R as the origin:

R = \{(-\kappa, -\kappa), (-\kappa, 1 - \kappa), \ldots, (\kappa, \kappa)\}, \quad \kappa = \lceil (k - 1) / 2 \rceil,  (2)

where \kappa denotes the maximum pixel number from four sides to the origin, and \lceil \cdot \rceil is the ceiling function.
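Equations (1)-(2) can be sketched directly in NumPy; the following is a minimal, unoptimized illustration for a single output map (an assumption-laden toy, not the paper's implementation):

```python
import numpy as np

def conv2d_odd(F, w):
    """Sketch of Eq. (1) for one output map: F has shape (c_i, h, w),
    w has shape (c_i, k, k) with k odd. The RF R of Eq. (2) is centered
    on the origin, and zero padding of kappa pixels per side preserves
    the FM size without any shift."""
    c_i, h, wd = F.shape
    k = w.shape[-1]
    kappa = -(-(k - 1) // 2)  # ceil((k - 1) / 2)
    Fp = np.pad(F, ((0, 0), (kappa, kappa), (kappa, kappa)))
    out = np.zeros((h, wd))
    for i in range(c_i):          # sum over input channels i
        for p0 in range(h):
            for p1 in range(wd):  # patch covers offsets delta in [-kappa, kappa]
                out[p0, p1] += np.sum(w[i] * Fp[i, p0:p0 + k, p1:p1 + k])
    return out

F = np.random.randn(4, 8, 8)
w = np.random.randn(4, 3, 3)
print(conv2d_odd(F, w).shape)  # (8, 8): the spatial size is preserved
```

With an identity kernel (a single 1 at the center of a 3×3 kernel), the output reproduces the input map, confirming that the centered R introduces no shift.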
Since R is symmetrical, we have \sum_{\delta \in R} \delta = (0, 0).

When k is an even number, e.g., 2 or 4, implementing convolution between F^i and kernels w^i becomes inevitably asymmetric since there is no central point to align. In most deep learning frameworks, this draws little attention and is obscured by pre-defined offsets. For example, TensorFlow [1] picks the nearest pixel in the left-top direction as the origin, which gives an asymmetric R:

R = \{(1 - \kappa, 1 - \kappa), (1 - \kappa, 2 - \kappa), \ldots, (\kappa, \kappa)\}, \quad \sum_{\delta \in R} \delta = (\kappa, \kappa).  (3)

The shift occurs at all the spatial locations p and is equivalent to padding one more zero on the bottom and right sides of the FMs before convolutions. On the contrary, Caffe [17] pads one more zero on the left and top sides. PyTorch [30] supports only symmetric padding by default; users need to manually define the padding policy if an asymmetric one is desired.

3.2 The information erosion hypothesis

According to the above, even-sized kernels make zero-padding asymmetric by 1 pixel, and on average (between two opposite directions) lead to 0.5-pixel shifts in the resulting FMs. The position offset accumulates when stacking multiple layers of even-sized convolutions, and eventually squeezes and distorts features to a certain corner of the spatial location.
Ideally, in case such asymmetric padding is performed n times in the TensorFlow style with convolutions in between, the resulting pixel-to-pixel correspondence of FMs will be

F_n \left[ p - \left( \frac{n}{2}, \frac{n}{2} \right) \right] \leftarrow F_0(p).  (4)

Since FMs have finite size h × w and are usually down-sampled to force high-level feature representations, the edge effect [26, 2] cannot be ignored, because zero-padding at edges distorts the effective values of the FM, especially in deep networks and small FMs. We hypothesize that the quantity of information Q is equal to the mean L1-norm of the FM; then successive convolutions with zero-padding to preserve the FM size will gradually erode the information:

Q_n = \frac{1}{hw} \sum_{p \in h \times w} |F_n(p)|, \quad Q_n < Q_{n-1}.  (5)

The information erosion happens recursively and is too complex to formulate in closed form, so we directly derive FMs from deep networks that contain various kernel sizes. In Figure 2, 10k images of size 32×32 are fed into untrained ResNet-56 models where identity connections and batch normalizations are removed. Q decreases progressively, and faster in larger kernel sizes and smaller FMs. Besides, asymmetric padding in even-sized kernels (C2, C4) speeds up the erosion dramatically, which is consistent with the well-trained networks in Figure 1. An analogy is that an FM can be seen as a rectangular ice chip melting in water, except that it can only exchange heat on its four edges. The smaller the ice, the faster the melting process happens. Symmetric padding equally distributes thermal gradients so as to slow down the exchange, whereas asymmetric padding produces larger thermal gradients on a certain corner, thus accelerating it.

The hypothesis also provides explanations for some experimental observations in the literature.
(1) The degradation problem in very deep networks [11]: although the vanishing/exploding forward activations and backward gradients have been addressed by intermediate normalization [15], the spatial information is eroded and blurred by the edge effect after multiple convolutions. (2) It is reported [3] that in GANs, doubling the depth of networks hampers training, and increasing the kernel size to 5 or 7 leads to minor improvement or even degradation. These indicate that GANs require augmented spatial information and are more sensitive to progressive erosion.

Figure 2: Left: layerwise quantity of information Q and colormaps derived from the last convolution layers. FMs are down-sampled after the 18th and 36th layers. Right: implementation of convolution with 2×2 kernels and symmetric padding (C2sp).

3.3 Method

Since R is inevitably asymmetric for even kernels in Equation 3, it is difficult to introduce symmetry within a single FM. Instead, we aim at the final output F^o summed over multiple inputs F^i and kernels. For clarity, let R_0 be the shifted RF in Equation 3 that picks the nearest pixel in the left-top direction as the origin; then we explicitly introduce a shifted collection

R^+ = \{R_0, R_1, R_2, R_3\}  (6)

that includes all four directions: left-top, right-top, left-bottom, right-bottom. Let \pi : I \to R^+ be the surjective-only mapping from input channel indexes i \in I = \{1, 2, \ldots, c_i\} to certain shifted RFs. By adjusting the proportion of the four shifted RFs, we can ensure that

\sum_{i=1}^{c_i} \sum_{\delta \in \pi(i)} \delta = (0, 0).  (7)

When mixing four shifted RFs within a single convolution, the RFs of even-sized kernels are partially extended, e.g., 2×2 → 3×3, 4×4 → 5×5.
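Mixing the four shifted RFs amounts to a grouped padding strategy; a minimal NumPy sketch follows (assuming a (C, H, W) layout, a 2×2 kernel, and c_i divisible by 4 — an illustration, not the authors' released implementation):

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain convolution without padding: x is (c_in, H, W), w is (c_out, c_in, k, k)."""
    c_out, _, k, _ = w.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(w[o] * x[:, i:i + k, j:j + k])
    return out

def c2sp(x, w):
    """Conv2x2 with symmetric padding (C2sp): the input channels are split
    into four groups, each group's single extra zero row/column is padded
    toward a different corner, and the convolution then runs without any
    padding, so the output keeps the input spatial size."""
    c_in = x.shape[0]
    assert c_in % 4 == 0, "c_i should be a multiple of 4"
    g = c_in // 4
    # (top, bottom, left, right) zero padding for the four shift directions
    pads = [(1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1)]
    padded = [np.pad(x[n * g:(n + 1) * g], ((0, 0), (t, b), (l, r)))
              for n, (t, b, l, r) in enumerate(pads)]
    return conv2d_valid(np.concatenate(padded, axis=0), w)

x = np.random.randn(8, 16, 16)   # 8 input channels, 16x16 FMs
w = np.random.randn(4, 8, 2, 2)  # 4 output channels, 2x2 kernels
print(c2sp(x, w).shape)          # (4, 16, 16): spatial size preserved
```

On a constant input with identical weights across the groups, the output is mirror-symmetric (opposite corners match), which is exactly the symmetry the four groups are meant to restore.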
If c_i is an integer multiple of 4 (usually satisfied), the symmetry is strictly obeyed within a single convolution layer by distributing RFs in sequence:

\pi(i) = R_{\lfloor 4i / c_i \rfloor}.  (8)

As mentioned above, a shifted RF is equivalent to padding one more zero at a certain corner of the FMs. Thus, the symmetry can be neatly realized by a grouped padding strategy; an example of C2sp is illustrated in Figure 2. In summary, the 2D convolution with even-sized kernels and symmetric padding consists of three steps: (1) dividing the input FMs equally into four groups; (2) padding the FMs according to the direction defined for each group; (3) calculating the convolution without any padding. We have also done ablation studies on other methods dealing with the shift problem; please see Section 5.

4 Experiments

In this section, the efficacy of symmetric padding is validated in CIFAR10/100 [20] and ImageNet [33] classification tasks, as well as CIFAR10, LSUN bedroom [44], and CelebA-HQ [18] generation tasks. First of all, we intuitively demonstrate that the shift problem has been eliminated by symmetric padding. In the symmetric case of Figure 1, FMs return to the central position, exhibiting healthy magnitudes and reasonable geometries. In Figure 2, C2sp and C4sp have much lower attenuation rates than C2 and C4 regarding the information quantity Q. Besides, C2sp has larger Q than C3, suggesting performance improvement in the following evaluations.

Figure 3: Left and middle: parameter-accuracy curves of ResNets and DenseNets that contain multiple depths and various convolution kernels. Right: training and testing curves on DenseNet-112 with C3 and C2sp.

4.1 Exploration of various kernel sizes

To explore the generalization capabilities of various convolution kernels, ResNet series without bottleneck architectures [11] are chosen as the backbones.
We keep all the other components and training hyperparameters the same, and only replace each C3 with a C4, C2, or C2sp. The networks are trained on CIFAR10 with depths 6n + 2, n ∈ {3, 6, ..., 24}. The parameter-accuracy curves are shown in Figure 3. The original even-sized kernels 4×4 and 2×2 perform poorly and encounter faster saturation as the network deepens. Compared with C3, C2sp reaches similar accuracy with only 60%-70% of the parameters, as well as of the FLOPs, which are linearly correlated. We also find that symmetric padding only slightly improves the accuracy of C4sp. Considering the worse performance of C5 and the attenuation curves in Figure 2, the edge effect might dominate the information erosion of 4×4 kernels rather than the shift problem at such network depths.

Since the network architectures or datasets may affect the conclusion, we further evaluate these kernels on CIFAR100 with DenseNet [14] backbones to cross-validate the generality and consistency. The results of depths 6n + 4, n ∈ {3, 6, ..., 18} are shown in Figure 3. At the same depth, C2sp achieves comparable accuracy to C3 as the network gets deeper. The training losses indicate that C2sp has better generalization and less overfitting than C3. Under the criterion of similar accuracy, a C2sp model will save 30%-50% of parameters and FLOPs in the CIFAR evaluations.
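The kernel arithmetic behind these ratios can be checked in a few lines; a sketch counting only the k×k spatial weights per channel pair (the exact 30%-50% figure additionally reflects the depth/width of each model):

```python
def spatial_cost(k):
    # Weights (and proportional FLOPs) per input-output channel pair
    return k * k

# Two stacked C2 layers vs one C3 layer (the factorization of Section 3.1):
print(1 - 2 * spatial_cost(2) / spatial_cost(3))  # ~0.11: the 11% reduction
# One C4 layer vs one C3 layer:
print(spatial_cost(4) / spatial_cost(3) - 1)      # ~0.78: C4 costs 77% more
# One C2sp layer vs one C3 layer at equal depth and width:
print(spatial_cost(2) / spatial_cost(3))          # ~0.44 per layer
```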
Therefore, we recommend using C2sp as a better alternative to C3 in classification tasks.

4.2 Compare with compact CNN blocks

To facilitate fair comparisons of C2sp with compact CNN blocks that contain C1s, DWConvs, or shift kernels, we use ResNets as backbones and adjust the width and depth to maintain the same number of parameters and FLOPs (overheads). If there are n input channels for a basic residual block, then two C2sp layers will consume about 8n^2 overheads; the expansion is marked as 1-1-1 since no channel expands. For ShiftNet blocks [40], we choose expansion rate 3 and 3×3 shift kernels as suggested; the overheads are about 6n^2, so the value of n is slightly increased. For the inverted-bottleneck [35], the suggested expansion rate 6 results in 12n^2 + O(6n) overheads, so the number of blocks is reduced by 1/3. For depthwise-separable convolutions [5], the overheads are about 2n^2 + O(n), so the channels are doubled and formed as 2-2-2 expansions.

Table 1: Comparison of various compact CNN blocks on CIFAR100. Shift, Invert, and Sep denote the ShiftNet block, inverted-bottleneck, and depthwise-separable convolution, respectively. mixup denotes training with mixup augmentation.
Exp denotes the expansion rates of channels in that block. SPS refers to the speed during training: samples per second.

Model  Block   Error (%) standard   Error (%) with mixup   Params (M)   FLOPs (M)   Exp     Memory (MB)   Speed (SPS)
20     Shift   26.87 ± 0.19         24.74 ± 0.04           0.49         70.0        1-3-1   853           1906
20     Invert  26.86 ± 0.11         25.32 ± 0.12           0.42         76.7        1-6-1   1219          2057
20     Sep     26.70 ± 0.20         23.81 ± 0.15           0.51         73.6        2-2-2   732           2709
20     C2sp    26.77 ± 0.19         24.90 ± 0.17           0.50         73.2        1-1-1   487           3328
56     Shift   24.07 ± 0.24         21.31 ± 0.21           1.48         213.4       1-3-1   2195          683
56     Invert  22.36 ± 0.25         21.48 ± 0.25           1.52         240.1       1-6-1   2561          803
56     Sep     23.31 ± 0.09         20.69 ± 0.11           1.47         218.0       2-2-2   1707          1034
56     C2sp    23.19 ± 0.29         20.91 ± 0.22           1.54         224.2       1-1-1   1219          1230
110    Shift   22.94 ± 0.26         20.47 ± 0.18           2.97         428.3       1-3-1   4146          348
110    Invert  21.76 ± 0.17         20.43 ± 0.20           3.17         485.1       1-6-1   4756          426
110    Sep     22.31 ± 0.22         19.42 ± 0.05           2.91         434.4       2-2-2   3170          540
110    C2sp    21.93 ± 0.04         19.52 ± 0.10           3.10         450.7       1-1-1   1951          636

The results are summarized in Table 1. Since most models easily overfit the CIFAR100 training set with standard augmentation, we also train the models with mixup [46] augmentation to make the differences more significant. In addition to error rates, the memory consumption and speed during training are reported. C2sp achieves higher accuracy than ShiftNets, which indicates that sidestepping spatial convolutions entirely by shift operations may not be an efficient solution.
Compared with blocks that contain DWConv, C2sp achieves competitive results in the 56- and 110-layer nets with fewer channels and simpler architectures, which reduces memory consumption (>35%) and speeds up (>20%) the training process.

In Table 2, we compare C2sp with NAS models: NASNet [48], PNASNet [23], and AmoebaNet [32]. We apply Wide-DenseNet [14] and adjust the width and depth (K = 48, L = 50) to have approximately 3.3M parameters. C2sp suffers less than 0.2% accuracy loss compared with state-of-the-art auto-generated models, and achieves better accuracy (+0.21%) when the augmentation is enhanced. Although NAS models leverage fragmented operators [25], e.g., pooling, group convolution, and DWConv, to improve accuracy with similar numbers of parameters, the regular-structured Wide-DenseNet has better memory and computational efficiency in runtime. In our reproduction, the training speeds on TitanXP for NASNet-A and Wide-DenseNet are about 200 and 400 SPS, respectively.

Table 2: Test error rates (%) on CIFAR10 dataset. c/o and mixup denote cutout [7] and mixup [46] data augmentation.

Model                              Error (%)   Params (M)
NASNet-A [48]                      3.41        3.3
PNASNet-5 [23]                     3.41        3.2
AmoebaNet-A [32]                   3.34        3.2
Wide-DenseNet C3                   3.81        3.4
Wide-DenseNet C2sp                 3.54        3.2
NASNet-A + c/o [48]                2.65        3.3
Wide-DenseNet C2sp + c/o + mixup   2.44        3.2

4.3 ImageNet classification

We start with the widely-used ResNet-50 and DenseNet-121 architectures. Since both of them contain bottlenecks and C1s to scale down the number of channels, C3 only consumes about 53% and 32% of the total overheads. Changing C3s to C2sp results in about 25% and 17% reductions of parameters and FLOPs, respectively. The top-1 classification error rates are shown in Table 3: C2sp has a minor loss (0.2%) in ResNet, and a slightly larger degradation (0.5%) in DenseNet.
After all, there are only 0.9M parameters for spatial convolution in DenseNet-121 C2sp.

We further scale the channels of ResNet-50 down to 0.5× as a mobile setting. At this stage, a C2 model (asymmetric), as well as reproductions of MobileNet-v2 [35] and ShuffleNet-v2 [25], are evaluated. Symmetric padding greatly reduces the error rate of ResNet-50 0.5× C2 by 2.5%. On this basis, we propose an optimized ResNet-50 0.5× C2sp model that achieves comparable accuracy to compact CNNs, but has fewer parameters and FLOPs. Although MobileNet-v2 presents the best accuracy, it uses inverted-bottlenecks (the same structure as in Table 1) to expand too many FMs, which significantly increases the memory consumption and slows down the training process (about 400 SPS), while the other models can easily reach 900 SPS.

Table 3: Top-1 error rates on ImageNet. Results are obtained by our reproductions using the same training hyperparameters.

Model                        Error (%)   Params (M)   FLOPs (M)
ResNet-50 C3 [11]            23.8        25.5         4089
ResNet-50 C2sp               24.0        19.3         3062
DenseNet-121 C3 [14]         24.6        8.0          2834
DenseNet-121 C2sp            25.1        6.7          2143
ResNet-50 0.5× C3            26.8        6.9          1127
ResNet-50 0.5× C2            29.8        5.3          870
ResNet-50 0.5× C2sp          27.3        5.3          870
ResNet-50 0.5× C2sp optim    26.8        5.8          573
ShuffleNet v2 2.0× [25]      26.6        7.4          591
MobileNet v2 1.4× [35]       25.8        6.1          582

4.4 Image generation

The efficacy of symmetric padding is further validated in image generation tasks with GANs. In CIFAR10 32×32 image generation, we follow the same architecture described in [27], which has about 6M parameters in the generator and 1.5M parameters in the discriminator. In LSUN bedroom and CelebA-HQ 128×128 image generation, ResNet-19 [21] is adopted with five residual blocks in the generator and six residual blocks in the discriminator, containing about 14M parameters for each of them.
Since the training of a GAN is a zero-sum game between two neural networks, we keep all discriminators the same (C3) to mitigate their influences, and replace each C3 in the generators with a C4, C2, C4sp, or C2sp. Besides, the number of channels is reduced to 0.75× in C4 and C4sp, or expanded 1.5× in C2 and C2sp, to approximate the same number of parameters.

Table 4: Scores for different kernels. Higher inception score and lower FID is better.

Model   C10 (IS)      C10 (FID)      LSUN (FID)     CelebA (FID)
C5      7.64 ± 0.06   26.54 ± 1.38   30.84 ± 1.13   33.90 ± 3.77
C3      7.79 ± 0.06   24.12 ± 0.47   36.04 ± 7.10   43.39 ± 5.78
C4      7.76 ± 0.11   26.45 ± 1.50   39.17 ± 5.52   37.93 ± 7.54
C4sp    7.74 ± 0.08   24.86 ± 0.41   24.61 ± 2.45   30.22 ± 3.02
C2      non-convergence
C2sp    7.77 ± 0.05   23.35 ± 0.26   27.73 ± 6.76   31.25 ± 4.86

The inception scores [34] and FIDs [12] are shown in Table 4 for quantitatively evaluating generated images, and examples from the best FID runs are visualized in Figure 4. Symmetric padding is crucial for the convergence of C2 generators, and remarkably improves the quality of C4 generators. In addition, the standard deviations (±) confirm that symmetric padding stabilizes the training of GANs. On CIFAR10, C2sp performs the best scores, while in LSUN bedroom and CelebA-HQ generation, C4sp is slightly better than C2sp. The diverse results can be explained by the information erosion hypothesis: in CIFAR10 generation, the network depth is relatively deep in terms of image size 32×32, so a smaller kernel has a lower attenuation rate and more channels. Whereas the network depth is relatively shallow in terms of image size 128×128, and the edge effect is negligible; then larger RFs are more important than wider channels in high-resolution image generation.
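The 0.75×/1.5× channel scalings used above follow from keeping the parameter count k²c² roughly constant when the kernel size changes from 3 to k; a quick check (my own restatement of the matching rule, not code from the paper):

```python
def conv_params(k, c):
    # Parameters of a k x k convolution between two c-channel layers
    return k * k * c * c

c = 64  # an arbitrary base width; scaling channels by 3/k offsets the k^2 factor
print(conv_params(4, int(0.75 * c)) / conv_params(3, c))  # 1.0 for C4/C4sp at 0.75x
print(conv_params(2, int(1.5 * c)) / conv_params(3, c))   # 1.0 for C2/C2sp at 1.5x
```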
In summary, symmetric padding eliminates the shift problem and simultaneously expands the RF. The former benefit is universal, while the latter is limited by the edge effect on some occasions.

Figure 4: Examples generated by GANs on CIFAR10 (32×32, C2sp, IS=8.27, FID=19.49), LSUN-bedroom (128×128, C4sp, FID=16.63) and CelebA-HQ (128×128, C4sp, FID=19.83).

4.5 Implementation details

Results reported as mean±std in tables or error bars in figures are obtained from 5 training runs with different random seeds. The default settings for CIFAR classifications are as follows: we train models for 300 epochs with mini-batch size 64, except for the results in Table 2, which run 600 epochs as in [48]. We use a cosine learning rate decay [24] starting from 0.1, except for DenseNet tests, where the piece-wise constant decay performs better. The weight decay factor is 1e-4, except for parameters in depthwise convolutions. The standard augmentation [22] is applied, and α equals 1 in mixup augmentation.

For ImageNet classifications, all the models are trained for 100 epochs with mini-batch size 256. The learning rate is set to 0.1 initially and annealed according to the cosine decay schedule. We follow the data augmentation in [36]. Weight decay is 1e-4 in ResNet-50 and DenseNet-121 models, and decreases to 4e-5 in the other compact models. Some results are worse than reported in the original papers. This is likely due to the inconsistency of mini-batch size, learning rate decay, or total training epochs, e.g., about 420 epochs in [35]. Our example code and models are available at https://github.com/boluoweifenda/CNN.

In generation tasks with GANs, we follow the models and hyperparameters recommended in [21]. The learning rate is 0.2, β1 is 0.5 and β2 is 0.999 for the Adam optimizer [19]. The mini-batch size is 64, and the ratio of discriminator to generator updates is 5:1 (ncritic = 5).
The results in Table 4 and Figure 4 are trained for 200k and 500k discriminator update steps, respectively. We use the non-saturation loss [8] without gradient norm penalty. The spectral normalization [27] is applied in discriminators; no normalization is applied in generators.

5 Discussion

Ablation study We have tested other methods dealing with the shift problem, and divided them into two categories: (1) replacing asymmetric padding with an additional non-convolution layer, e.g., interpolation or pooling; (2) achieving symmetry with multiple convolution layers, e.g., padding 1 pixel at each side before/within two non-padding convolutions. Their implementation is restricted to certain architectures, and the accuracy is no better than symmetric padding. Our main consideration is to propose a basic but elegant building element that achieves symmetry within a single layer, so that most of the existing compact models can be neatly transferred to even-sized kernels. We have also tested C3 with asymmetric padding, and observed accuracy degradation as the asymmetry grows.

Network fragmentation From the evaluations above, C2sp achieves comparable accuracy with less training memory and time. Although fragmented operators distributed in many groups [25] have fewer parameters and FLOPs, the operational intensity [39] decreases as the group number increases. This negatively impacts the efficiency of computation, energy, and bandwidth in hardware that has strong parallel computing capabilities. In the situation where memory access dominates the computation, the reduction of FLOPs does not guarantee faster execution speed [16].
We conclude that it remains controversial to (1) increase network fragmentation through grouping strategies and complex topologies; (2) decompose spatial and channel correlations via DWConvs, shift operations, and C1s.

Naive implementation Most deep learning frameworks and hardware are primarily optimized for C3, which restrains the efficiency of C4sp and C2sp to a large extent. For example, in our high-level Python implementation in TensorFlow, for models with C2sp, C2, and C3, although the ratio of parameters and FLOPs is 4:4:9, the speed (SPS) ratio during training is about 1:1.14:1.2 and the memory consumption ratio is about 1:0.7:0.7. The speed and memory overheads can be further reduced by subsequent optimization of computation libraries and software engineering once even-sized kernels are adopted by the deep learning community.

6 Conclusion

In this work, we explore the generalization capabilities of even-sized kernels (2×2, 4×4) and quantify the shift problem by an information erosion hypothesis. We then introduce symmetric padding to elegantly achieve symmetry within a single convolution layer. In classification, C2sp saves 30%-50% of parameters and FLOPs compared to C3 on CIFAR10/100, and improves accuracy by 2.5% over C2 on ImageNet. Compared to existing compact convolution blocks, C2sp achieves competitive results with fewer channels and simpler architectures, reducing memory consumption by >35% and speeding up training by >20%. In generation tasks, C2sp and C4sp both achieve improved image quality and stabilized convergence. Even-sized kernels with symmetric padding provide promising building units for architecture designs that emphasize training efforts, especially on online and continual learning occasions.

Acknowledgments

We thank the reviewers for their valuable suggestions and insightful comments. This work is partially supported by the Project of NSFC No.
61836004, the Brain-Science Special Program of Beijing under Grant Z181100001518006, the Suzhou-Tsinghua innovation leading program 2016SZ0102 and the National Key R&D Program of China 2018YFE0200200.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[2] Farzin Aghdasi and Rabab K Ward. Reduction of boundary artifacts in image restoration. IEEE Transactions on Image Processing, 5(4):611–618, 1996.

[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.

[4] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2016.

[5] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

[6] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.

[7] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[9] Song Han, Huizi Mao, and William J Dally.
Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.

[10] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5353–5360, 2015.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[14] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.

[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[16] Yunho Jeon and Junmo Kim. Constructing fast network through deconstruction of convolution. In Advances in Neural Information Processing Systems, pages 5951–5961, 2018.

[17] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding.
In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.

[18] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[20] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[21] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The gan landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.

[22] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.

[23] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.

[24] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.

[25] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.

[26] G McGibney, MR Smith, ST Nichols, and A Crawley. Quantitative evaluation of several partial fourier reconstruction algorithms used in mri. Magnetic resonance in medicine, 30(1):51–59, 1993.

[27] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida.
Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[29] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.

[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[31] Jing Pei, Lei Deng, Sen Song, Mingguo Zhao, Youhui Zhang, Shuang Wu, Guanrui Wang, Zhe Zou, Zhenzhi Wu, Wei He, et al. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106, 2019.

[32] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789, 2019.

[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[35] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[36] Nathan Silberman and Sergio Guadarrama. TensorFlow-Slim image classification model library, 2017.

[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[38] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

[39] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

[40] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9127–9135, 2018.

[41] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations, 2018.

[42] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.

[43] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2016.

[44] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop.
arXiv preprint arXiv:1506.03365, 2015.

[45] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, 2019.

[46] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

[47] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.

[48] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.