{"title": "DropBlock: A regularization method for convolutional networks", "book": "Advances in Neural Information Processing Systems", "page_first": 10727, "page_last": 10737, "abstract": "Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated so information can still flow through convolutional networks despite dropout. Thus, a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing the number of dropped units during training leads to better accuracy and makes the model more robust to hyperparameter choices. Extensive experiments show that DropBlock works better than dropout in regularizing convolutional networks.\n On ImageNet classification, ResNet-50 architecture with DropBlock achieves $78.13\\%$ accuracy, an improvement of more than $1.6\\%$ over the baseline. On COCO detection, DropBlock improves Average Precision of RetinaNet from $36.8\\%$ to $38.4\\%$.", "full_text": "DropBlock: A regularization method for convolutional networks\n\nGolnaz Ghiasi\nGoogle Brain\n\nTsung-Yi Lin\nGoogle Brain\n\nQuoc V. Le\nGoogle Brain\n\nAbstract\n\nDeep neural networks often work well when they are over-parameterized and\ntrained with a massive amount of noise and regularization, such as weight decay\nand dropout. 
Although dropout is widely used as a regularization technique for\nfully connected layers, it is often less effective for convolutional layers. This\nlack of success of dropout for convolutional layers is perhaps due to the fact that\nactivation units in convolutional layers are spatially correlated so information\ncan still flow through convolutional networks despite dropout. Thus, a structured\nform of dropout is needed to regularize convolutional networks. In this paper, we\nintroduce DropBlock, a form of structured dropout, where units in a contiguous\nregion of a feature map are dropped together. We found that applying DropBlock\nin skip connections in addition to the convolution layers increases the accuracy.\nAlso, gradually increasing the number of dropped units during training leads to better\naccuracy and makes the model more robust to hyperparameter choices. Extensive experiments\nshow that DropBlock works better than dropout in regularizing convolutional\nnetworks. On ImageNet classification, ResNet-50 architecture with DropBlock\nachieves 78.13% accuracy, an improvement of more than 1.6% over the baseline.\nOn COCO detection, DropBlock improves Average Precision of RetinaNet from\n36.8% to 38.4%.\n\n1 Introduction\n\nDeep neural nets work well when they have a large number of parameters and are trained with a\nmassive amount of regularization and noise, such as weight decay and dropout [1]. Though the first\nmajor success of dropout was associated with convolutional networks [2], recent convolutional\narchitectures rarely use dropout [3, 4, 5, 6, 7, 8, 9, 10]. In most cases, dropout was mainly used at the\nfully connected layers of the convolutional networks [11, 12, 13].\nWe argue that the main drawback of dropout is that it drops out features randomly. While this can be\neffective for fully connected layers, it is less effective for convolutional layers, where features are\ncorrelated spatially. 
When the features are correlated, even with dropout, information about the input\ncan still be sent to the next layer, which causes the networks to overfit. This intuition suggests that a\nmore structured form of dropout is needed to better regularize convolutional networks.\nIn this paper, we introduce DropBlock, a structured form of dropout that is particularly effective at\nregularizing convolutional networks. In DropBlock, features in a block, i.e., a contiguous region of a\nfeature map, are dropped together. As DropBlock discards features in a correlated area, the networks\nmust look elsewhere for evidence to fit the data (see Figure 1).\nIn our experiments, DropBlock outperforms dropout across a range of models and datasets. Adding\nDropBlock to ResNet-50 architecture improves image classification accuracy on ImageNet from\n76.51% to 78.13%. On COCO detection, DropBlock improves AP of RetinaNet from 36.8% to\n38.4%.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: (a) input image to a convolutional neural network. The green regions in (b) and (c) include\nthe activation units which contain semantic information in the input image. Dropping out activations\nat random is not effective in removing semantic information because nearby activations contain\nclosely related information. Instead, dropping continuous regions can remove certain semantic\ninformation (e.g., head or feet) and consequently force the remaining units to learn features for\nclassifying the input image.\n\n2 Related work\n\nSince its introduction, dropout [1] has inspired a number of regularization methods for neural\nnetworks such as DropConnect [14], maxout [15], StochasticDepth [16], DropPath [17],\nScheduledDropPath [8], shake-shake regularization [18], and ShakeDrop regularization [19]. 
The basic\nprinciple behind these methods is to inject noise into neural networks so that they do not overfit the\ntraining data. When it comes to convolutional neural networks, most successful methods require the\nnoise to be structured [16, 17, 8, 18, 19, 20]. For example, in DropPath, an entire layer in the neural\nnetwork is zeroed out during training, not just a particular unit. Although these strategies of dropping out\nlayers may work well for layers with many input or output branches, they cannot be used for layers\nwithout any branches. Our method, DropBlock, is more general in that it can be applied anywhere\nin a convolutional network. Our method is closely related to SpatialDropout [20], where an entire\nchannel is dropped from a feature map. Our experiments show that DropBlock is more effective than\nSpatialDropout.\nThe development of architecture-specific noise injection techniques is not unique to\nconvolutional networks. In fact, similar to convolutional networks, recurrent networks require their\nown noise injection methods. Currently, Variational Dropout [21] and ZoneOut [22] are two of the\nmost commonly used methods to inject noise into recurrent connections.\nOur method is inspired by Cutout [23], a data augmentation method where parts of the input\nexamples are zeroed out. DropBlock generalizes Cutout by applying Cutout at every feature map in\na convolutional network. In our experiments, having a fixed zero-out ratio for DropBlock during\ntraining is not as robust as having an increasing schedule for the ratio during training. In other words,\nit is better to set the DropBlock ratio to be small at the start of training and linearly increase it over\ntime. This scheduling scheme is related to ScheduledDropPath [8].\n\n3 DropBlock\n\nDropBlock is a simple method similar to dropout. 
Its main difference from dropout is that it drops\ncontiguous regions from a feature map of a layer instead of dropping out independent random units.\nPseudocode of DropBlock is shown in Algorithm 1. DropBlock has two main parameters, which\nare block_size and \u03b3. block_size is the size of the block to be dropped, and \u03b3 controls how many\nactivation units to drop.\nWe experimented with either sharing one DropBlock mask across feature channels or giving each feature\nchannel its own DropBlock mask. Algorithm 1 corresponds to the latter, which tends to work better in\nour experiments.\n\nAlgorithm 1 DropBlock\n1: Input: output activations of a layer (A), block_size, \u03b3, mode\n2: if mode == Inference then\n3:   return A\n4: end if\n5: Randomly sample mask M: M_{i,j} \u223c Bernoulli(\u03b3)\n6: For each zero position M_{i,j}, create a spatial square mask with the center being M_{i,j}, the width,\n   height being block_size and set all the values of M in the square to be zero (see Figure 2).\n7: Apply the mask: A = A \u00d7 M\n8: Normalize the features: A = A \u00d7 count(M)/count_ones(M)\n\nFigure 2: Mask sampling in DropBlock. (a) On every feature map, similar to dropout, we first\nsample a mask M. We only sample the mask from the shaded green region, in which each sampled entry can\nbe expanded to a mask fully contained inside the feature map. (b) Every zero entry on M is expanded to\na block_size \u00d7 block_size zero block.\n\nSimilar to dropout, we do not apply DropBlock during inference. This is interpreted as evaluating an\naveraged prediction across the exponentially-sized ensemble of sub-networks. These sub-networks\ninclude a special subset of sub-networks covered by dropout where each network does not see\ncontiguous parts of feature maps.\n\nSetting the value of block_size.\nIn our implementation, we set a constant block_size for all\nfeature maps, regardless of the resolution of the feature map. 
DropBlock resembles dropout [1] when\nblock_size = 1 and resembles SpatialDropout [20] when block_size covers the full feature map.\n\nSetting the value of \u03b3.\nIn practice, we do not explicitly set \u03b3. As stated earlier, \u03b3 controls the\nnumber of features to drop. Suppose that we want to keep every activation unit with the probability\nof keep_prob. In dropout [1], the binary mask will be sampled with the Bernoulli distribution with\nmean 1 \u2212 keep_prob. However, to account for the fact that every zero entry in the mask will be\nexpanded by block_size^2 and the blocks will be fully contained in the feature map, we need to adjust \u03b3\naccordingly when we sample the initial binary mask. In our implementation, \u03b3 can be computed as\n\n$$\\gamma = \\frac{1 - \\mathit{keep\\_prob}}{\\mathit{block\\_size}^2} \\cdot \\frac{\\mathit{feat\\_size}^2}{(\\mathit{feat\\_size} - \\mathit{block\\_size} + 1)^2} \\qquad (1)$$\n\nwhere keep_prob can be interpreted as the probability of keeping a unit in traditional dropout. The\nsize of the valid seed region is (feat_size \u2212 block_size + 1)^2, where feat_size is the size of the feature\nmap. The main nuance of DropBlock is that there will be some overlap among the dropped blocks, so\nthe above equation is only an approximation. In our experiments, we first estimate the keep_prob to\nuse (between 0.75 and 0.95), and then compute \u03b3 according to the above equation.\n\nScheduled DropBlock. We found that DropBlock with a fixed keep_prob during training does not\nwork well. Applying a small value of keep_prob hurts learning at the beginning. Instead, gradually\ndecreasing keep_prob over time from 1 to the target value is more robust and improves results for\nmost values of keep_prob. In our experiments, we use a linear scheme of decreasing the value of\nkeep_prob, which tends to work well across many hyperparameter settings. 
This linear scheme is\nsimilar to ScheduledDropPath [8].\n\n4 Experiments\n\nIn the following sections, we empirically investigate the effectiveness of DropBlock for image\nclassification, object detection, and semantic segmentation. We apply DropBlock to ResNet-50 [4] with\nextensive experiments for image classification. To verify the results can be transferred to a different\narchitecture, we apply DropBlock to a state-of-the-art model architecture, AmoebaNet [10], and\nshow improvements. In addition to image classification, we show DropBlock is helpful in training\nRetinaNet [24] for object detection and semantic segmentation.\n\n4.1 ImageNet Classification\n\nThe ILSVRC 2012 classification dataset [25] contains 1.2 million training images, 50,000 validation\nimages, and 150,000 testing images. Images are labeled with 1,000 categories. We used horizontal\nflip, scale, and aspect ratio augmentation for training images as in [12, 26]. During evaluation, we\napplied a single crop rather than averaging results over multiple crops. Following the common\npractice, we report classification accuracy on the validation set.\n\nImplementation Details We trained models on Tensor Processing Units (TPUs) and used the\nofficial TensorFlow implementations for ResNet-50 (footnote 1) and AmoebaNet (footnote 2). We used the default image\nsize (224 \u00d7 224 for ResNet-50 and 331 \u00d7 331 for AmoebaNet), batch size (1024 for ResNet-50 and\n2048 for AmoebaNet) and hyperparameter settings for all the models. We only increased the number\nof training epochs from 90 to 300 for ResNet-50 architecture. The learning rate was decayed by\na factor of 0.1 at 100, 200 and 265 epochs. AmoebaNet models were trained for 340 epochs\nand an exponential decay scheme was used for scheduling the learning rate. 
Since baselines are usually\noverfitted for the longer training scheme and have lower validation accuracy at the end of training,\nwe report the highest validation accuracy over the full training course for fair comparison.\n\n4.1.1 DropBlock in ResNet-50\n\nResNet-50 [4] is a widely used Convolutional Neural Network (CNN) architecture for image\nrecognition. In the following experiments, we apply different regularization techniques on ResNet-50 and\ncompare the results with DropBlock. The results are summarized in Table 1.\n\nModel | top-1(%) | top-5(%)\nResNet-50 | 76.51 \u00b1 0.07 | 93.20 \u00b1 0.05\nResNet-50 + dropout (kp=0.7) [1] | 76.80 \u00b1 0.04 | 93.41 \u00b1 0.04\nResNet-50 + DropPath (kp=0.9) [17] | 77.10 \u00b1 0.08 | 93.50 \u00b1 0.05\nResNet-50 + SpatialDropout (kp=0.9) [20] | 77.41 \u00b1 0.04 | 93.74 \u00b1 0.02\nResNet-50 + Cutout [23] | 76.52 \u00b1 0.07 | 93.21 \u00b1 0.04\nResNet-50 + AutoAugment [27] | 77.63 | 93.82\nResNet-50 + label smoothing (0.1) [28] | 77.17 \u00b1 0.05 | 93.45 \u00b1 0.03\nResNet-50 + DropBlock (kp=0.9) | 78.13 \u00b1 0.05 | 94.02 \u00b1 0.02\nResNet-50 + DropBlock (kp=0.9) + label smoothing (0.1) | 78.35 \u00b1 0.05 | 94.15 \u00b1 0.03\n\nTable 1: Summary of validation accuracy on ImageNet dataset for ResNet-50 architecture. For\ndropout, DropPath, and SpatialDropout, we trained models with different keep_prob values and\nreported the best result. DropBlock is applied with block_size = 7. We report average over 3 runs.\nThe code of these results is in https://github.com/tensorflow/tpu/tree/master/models/official/resnet.\n\n1 https://github.com/tensorflow/tpu/tree/master/models/official/resnet\n2 https://github.com/tensorflow/tpu/tree/master/models/experimental/amoeba_net\n\nFigure 3: ImageNet validation accuracy against keep_prob with ResNet-50 model. 
All methods drop\nactivation units in group 3 and 4.\n\n[Figure 4 panels. Columns, left to right: DropBlock wo/ scheduling and not on the skip connections; DropBlock wo/ scheduling; DropBlock. Rows: Group 4 (resolution: 7x7); Groups 3 & 4.]\n\nFigure 4: Comparison of ResNet-50 trained on ImageNet when DropBlock is applied to group 4 or\ngroups 3 and 4. From left to right, we show performance of applying DropBlock on convolution\nbranches only and progressively improve it by applying DropBlock on skip connections and adding\nscheduling keep_prob. The best accuracy is achieved by using block_size = 7 (bottom right figure).\n\nWhere to apply DropBlock.\nIn residual networks, a building block consists of a few convolution\nlayers and a separate skip connection that performs identity mapping. Every convolution layer is\nfollowed by batch normalization layer and ReLU activation. The output of a building block is the\nsum of outputs from the convolution branch and skip connection.\nA residual network can be represented by building groups based on the spatial resolution of feature\nactivation. A building group consists of multiple building blocks. We use group 4 to represent the\nlast group in residual network (i.e., all layers in conv5_x) and so on.\nIn the following experiments, we study where to apply DropBlock in residual networks. We\nexperimented with applying DropBlock only after convolution layers or applying DropBlock after\nboth convolution layers and skip connections. To study the performance of DropBlock applying to\ndifferent feature groups, we experimented with applying DropBlock to Group 4 or to both Groups 3\nand 4.\n\nDropBlock vs. dropout. The original ResNet architecture does not apply any dropout in the\nmodel. For the ease of discussion, we define the dropout baseline for ResNet as applying dropout\non convolution branches only. We applied DropBlock to both groups 3 and 4 with block_size = 7\nby default. 
We decreased \u03b3 by a factor of 4 for group 3 in all the experiments. In Figure 3-(a), we show\nthat DropBlock outperforms dropout by 1.3% in top-1 accuracy. The scheduled keep_prob makes\nDropBlock more robust to the change of keep_prob and improves results for most values of\nkeep_prob (3-(b)).\nWith the best keep_prob found in Figure 3, we swept over block_size from 1 to the size covering the full\nfeature map. Figure 4 shows applying larger block_size is generally better than applying block_size\nof 1. The best DropBlock configuration is to apply block_size = 7 to both groups 3 and 4.\nIn all configurations, DropBlock and dropout share a similar trend and DropBlock has a large gain\ncompared to the best dropout result. This is evidence that DropBlock is a more effective\nregularizer than dropout.\n\nDropBlock vs. SpatialDropout. As with the dropout baseline, we define the SpatialDropout [20]\nbaseline as applying it on convolution branches only. SpatialDropout is better than dropout but\ninferior to DropBlock. In Figure 4, we found SpatialDropout can be too harsh when applied to the\nhigh-resolution feature map on group 3. DropBlock achieves the best result by dropping blocks of\nconstant size on both groups 3 and 4.\n\nComparison with DropPath. Following ScheduledDropPath [8], we applied scheduled DropPath\non all connections except the skip connections. 
We trained models with different values for the keep_prob\nparameter. Also, we trained models where we applied DropPath in all groups and, similar to our other\nexperiments, only at group 4 or at groups 3 and 4. We achieved the best validation accuracy of 77.10%\nwhen we only apply it to group 4 with keep_prob = 0.9.\n\nComparison with Cutout. We also compared with Cutout [23], which is a data augmentation\nmethod that randomly drops a fixed-size block from the input images. Although Cutout improves\naccuracy on the CIFAR-10 dataset as suggested by [23], it does not improve the accuracy on the\nImageNet dataset in our experiments.\n\nComparison with other regularization techniques. We compare DropBlock to data augmentation\nand label smoothing, which are two commonly used regularization techniques. In Table 1, DropBlock\nhas better performance compared to strong data augmentation [27] and label smoothing [28]. The\nperformance improves when combining DropBlock and label smoothing and training for 330 epochs,\nshowing the regularization techniques can be complementary when we train for longer.\n\n4.1.2 DropBlock in AmoebaNet\n\nWe also show the effectiveness of DropBlock on a recent AmoebaNet-B architecture, which is a\nstate-of-the-art architecture found using evolutionary architecture search [10]. This model has dropout\nwith keep probability of 0.5 but only on the final softmax layer.\nWe apply DropBlock after all batch normalization layers and also in the skip connections of the last\n50% of the cells. The resolution of the feature maps in these cells is 21x21 or 11x11 for an input image\nwith the size of 331x331. Based on the experiments in the last section, we used keep_prob of 0.9 and\nset block_size = 11, which is the width of the last feature map. 
DropBlock improves top-1 accuracy\nof AmoebaNet-B from 82.25% to 82.52% (Table 2).\n\nModel | top-1(%) | top-5(%)\nAmoebaNet-B (6, 256) | 82.25 | 95.88\nAmoebaNet-B (6, 256) + DropBlock | 82.52 | 96.07\n\nTable 2: Top-1 and top-5 validation accuracy of AmoebaNet-B architecture trained on ImageNet.\n\n4.2 Experimental Analysis\n\n(a) Inference with block_size = 1.\n(b) Inference with block_size = 7.\n\nFigure 5: ResNet-50 model trained with block_size = 7 and keep_prob = 0.9 has higher accuracy\ncompared to the ResNet-50 model trained with block_size = 1 and keep_prob = 0.9: (a) when we\napply DropBlock with block_size = 1 at inference with different keep_prob or (b) when we apply\nDropBlock with block_size = 7 at inference with different keep_prob. The models are trained and\nevaluated with DropBlock at groups 3 and 4.\n\n[Figure 6 panels. Columns (image classes): toyshop, knot, bookshop, bookshop, stone wall, spiral. Rows: input image; original model; block_size: 1; block_size: 7.]\n\nFigure 6: Class activation mapping (CAM) [29] for ResNet-50 model trained without DropBlock and\ntrained with DropBlock with the block_size of 1 or 7. The model trained with DropBlock tends to\nfocus on several spatially distributed regions.\n\nDropBlock demonstrates strong empirical results on improving ImageNet classification accuracy\ncompared to dropout. We hypothesize dropout is insufficient because the contiguous regions in\nconvolution layers are strongly correlated. Randomly dropping a unit still allows information to\nflow through neighboring units. In this section, we conduct an analysis to show DropBlock is more\neffective in dropping semantic information. Subsequently, the model regularized by DropBlock is\nmore robust compared to model regularized by dropout. 
We study the problem by applying DropBlock\nwith block_size of 1 and 7 during inference and observing the differences in performance.\n\nDropBlock drops more semantic information. We first took the model trained without any\nregularization and tested it with DropBlock with block_size = 1 and block_size = 7. The green curves\nin Figure 5 show the validation accuracy reduced quickly with decreasing keep_prob during inference.\nThis suggests DropBlock removes semantic information and makes classification more difficult. The\naccuracy drops more quickly with decreasing keep_prob for block_size = 1 in comparison with\nblock_size = 7, which suggests DropBlock is more effective at removing semantic information than\ndropout.\n\nModel trained with DropBlock is more robust. Next we show that a model trained with a large block\nsize, which removes more semantic information, results in stronger regularization. We demonstrate\nthis by taking the model trained with block_size = 7 and applying block_size = 1 during inference\nand vice versa. In Figure 5, models trained with block_size = 1 and block_size = 7 are both robust\nwith block_size = 1 applied during inference. However, the performance of the model trained with\nblock_size = 1 reduced more quickly with decreasing keep_prob when applying block_size = 7\nduring inference. The results suggest that block_size = 7 is more robust and has the benefit of\nblock_size = 1 but not vice versa.\n\nDropBlock learns spatially distributed representations. 
We hypothesize model trained with\nDropBlock needs to learn spatially distributed representations because DropBlock is effective in\nremoving semantic information in a contiguous region. The model regularized by DropBlock should\nlearn multiple discriminative regions instead of only focusing on one discriminative region. We use\nclass activation maps (CAM) introduced in [29] to visualize conv5_3 class activations of ResNet-50\non ImageNet validation set. Figure 6 shows the class activations of original model and models\ntrained with DropBlock with block_size = 1 and block_size = 7. In general, models trained with\nDropBlock learn spatially distributed representations that induce high class activations on multiple\nregions, whereas model without regularization tends to focus on one or few regions.\n\n4.3 Object Detection in COCO\n\nDropBlock is a generic regularization module for CNNs. In this section, we show DropBlock can\nalso be applied for training object detector in COCO dataset [30]. We use RetinaNet [24] framework\nfor the experiments. Unlike an image classi\ufb01er that predicts single label for an image, RetinaNet runs\nconvolutionally on multiscale Feature Pyramid Networks (FPNs) [31] to localize and classify objects\nin different scales and locations. We followed the model architecture and anchor de\ufb01nition in [24] to\nbuild FPNs and classi\ufb01er/regressor branches.\n\nWhere to apply DropBlock to RetinaNet model. RetinaNet model uses ResNet-FPN as its back-\nbone model. For simplicity, we apply DropBlock to ResNet in ResNet-FPN and use the best\nkeep_prob we found for ImageNet classi\ufb01cation training. DropBlock is different from recent work\n[32] which learns to drop a structured pattern on features of region proposals.\n\nTraining object detector from random initialization. Training object detector from random\ninitialization has been considered as a challenging task. 
Recently, a few papers tried to address the\nissue using novel model architecture [33], large minibatch size [34], and better normalization layer\n[35]. In our experiment, we look at the problem from the model regularization perspective. We\ntried DropBlock with keep_prob = 0.9, which is the same hyperparameter used for training the image\nclassification model, and experimented with different block_size. In Table 3, we show that the model\ntrained from random initialization surpasses the ImageNet pre-trained model. Adding DropBlock gives\nan additional 1.6% AP. The results suggest model regularization is an important ingredient to train an object\ndetector from scratch and DropBlock is an effective regularization approach for object detection.\n\nModel | AP | AP50 | AP75\nRetinaNet, fine-tuning from ImageNet | 36.5 | 55.0 | 39.1\nRetinaNet, no DropBlock | 36.8 | 54.6 | 39.4\nRetinaNet, keep_prob = 0.9, block_size = 1 | 37.9 | 56.1 | 40.6\nRetinaNet, keep_prob = 0.9, block_size = 3 | 38.3 | 56.4 | 41.2\nRetinaNet, keep_prob = 0.9, block_size = 5 | 38.4 | 56.4 | 41.2\nRetinaNet, keep_prob = 0.9, block_size = 7 | 38.2 | 56.0 | 40.9\n\nTable 3: Object detection results trained from random initialization in COCO using RetinaNet and\nResNet-50 FPN backbone model.\n\n3 https://github.com/tensorflow/tpu/tree/master/models/official/retinanet\n\nImplementation details. We use the open-source implementation of RetinaNet (footnote 3) for experiments. The\nmodels were trained on TPU with 64 images in a batch. During training, multiscale jitter was applied\nto resize images between scales [512, 768] and then the images were padded or cropped to max\ndimension 640. Only single scale image with max dimension 640 was used during testing. The batch\nnormalization layers were applied after all convolution layers, including classifier/regressor branches.\nThe model was trained using 150 epochs (280k training steps). 
The initial learning rate of 0.08 was\napplied for the first 120 epochs and decayed by a factor of 0.1 at 120 and 140 epochs. The model with ImageNet\ninitialization was trained for 28 epochs with learning rate decay at 16 and 22 epochs. We used \u03b1 = 0.25\nand \u03b3 = 1.5 for focal loss. We used a weight decay of 0.0001 and a momentum of 0.9. The model\nwas trained on COCO train2017 and evaluated on COCO val2017.\n\n4.4 Semantic Segmentation in PASCAL VOC\n\nWe show DropBlock also improves a semantic segmentation model. We use the PASCAL VOC 2012\ndataset for experiments and follow the common practice to train with augmented 10,582 training\nimages [36] and report mIOU on 1,449 test set images. We adopt the open-source RetinaNet\nimplementation for semantic segmentation. The implementation uses the ResNet-FPN backbone model to extract\nmultiscale features and attaches fully convolution networks on top to predict segmentation. We use\ndefault hyperparameters in open-source code for training.\nFollowing the experiments for object detection, we study the effect of DropBlock for training a model\nfrom random initialization. We trained the model initialized from a pre-trained ImageNet model for 45 epochs\nand the model with random initialization for 500 epochs. We experimented with applying DropBlock to the\nResNet-FPN backbone model and fully convolution networks and found that applying DropBlock to fully\nconvolution networks is more effective. 
Applying DropBlock greatly improves mIOU for training\nmodel from scratch and shrinks performance gap between training from ImageNet pre-trained model\nand randomly initialized model.\n\nModel | mIOU\nfine-tuning from ImageNet | 74.6\nno DropBlock | 47.2\nkeep_prob = 0.2, block_size = 1 | 51.0\nkeep_prob = 0.2, block_size = 4 | 53.2\nkeep_prob = 0.2, block_size = 16 | 53.4\n\nTable 4: Semantic segmentation results trained from random initialization in PASCAL VOC 2012\nusing ResNet-101 FPN backbone model.\n\n5 Discussion\n\nIn this work, we introduce DropBlock to regularize training CNNs. DropBlock is a form of structured\ndropout that drops spatially correlated information. We demonstrate DropBlock is a more effective\nregularizer compared to dropout in ImageNet classification and COCO detection. DropBlock\nconsistently outperforms dropout in an extensive experiment setup. We conduct an analysis to show that\nmodel trained with DropBlock is more robust and has the benefits of model trained with dropout.\nThe class activation mapping suggests the model can learn more spatially distributed representations\nregularized by DropBlock.\nOur experiments show that applying DropBlock in skip connections in addition to the convolution\nlayers increases the accuracy. Also, gradually increasing number of dropped units during training\nleads to better accuracy and more robust to hyperparameter choices.\n\nReferences\n[1] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout:\nA simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research,\n15(1):1929\u20131958, 2014.\n\n[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional\nneural networks. In Advances in Neural Information Processing Systems, 2012.\n\n[3] Sergey Ioffe and Christian Szegedy. 
Batch normalization: Accelerating deep network training by reducing\n\ninternal covariate shift. CoRR, abs/1502.03167, 2015.\n\n9\n\n\f[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\n\nIn CVPR, pages 770\u2013778, 2016.\n\n[5] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-\n\nresnet and the impact of residual connections on learning. In AAAI, 2017.\n\n[6] Saining Xie, Ross Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. Aggregated residual transforma-\n\ntions for deep neural networks. In CVPR, pages 5987\u20135995, 2017.\n\n[7] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In CVPR, pages\n\n6307\u20136315. IEEE, 2017.\n\n[8] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for\n\nscalable image recognition. In CVPR, 2017.\n\n[9] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.\n\n[10] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classi\ufb01er\n\narchitecture search. CoRR, abs/1802.01548, 2018.\n\n[11] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. Advances in Neural Information Processing Systems, 2015.\n\n[12] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.\n\n[13] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the\n\ninception architecture for computer vision. In CVPR, pages 2818\u20132826, 2016.\n\n[14] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks\n\nusing dropconnect. 
In International Conference on Machine Learning, pages 1058\u20131066, 2013.\n\n[15] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout\nnetworks. In International Conference on Machine Learning, 2013.\n\n[16] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic\ndepth. In ECCV, pages 646\u2013661. Springer, 2016.\n\n[17] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks\nwithout residuals. International Conference on Learning Representations, 2017.\n\n[18] Xavier Gastaldi. Shake-shake regularization. CoRR, abs/1705.07485, 2017.\n\n[19] Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Shakedrop regularization. CoRR, abs/1802.02375,\n2018.\n\n[20] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object\nlocalization using convolutional networks. In CVPR, 2015.\n\n[21] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural\nnetworks. In Advances in Neural Information Processing Systems, pages 1019\u20131027, 2016.\n\n[22] David Krueger, Tegan Maharaj, J\u00e1nos Kram\u00e1r, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary\nKe, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, and Chris Pal. Zoneout:\nRegularizing rnns by randomly preserving hidden activations. CoRR, abs/1606.01305, 2016.\n\n[23] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with\ncutout. CoRR, abs/1708.04552, 2017.\n\n[24] Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Doll\u00e1r. Focal loss for dense object\ndetection. In ICCV, 2017.\n\n[25] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical\nimage database. In CVPR, 2009.\n\n[26] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. 
Densely connected convolutional\nnetworks. In CVPR, 2017.\n\n[27] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning\naugmentation policies from data. CoRR, abs/1805.09501, 2018.\n\n[28] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking\nthe inception architecture for computer vision. In CVPR, 2016.\n\n[29] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features\nfor discriminative localization. In CVPR, 2018.\n\n[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r,\nand C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.\n\n[31] Tsung-Yi Lin, Piotr Doll\u00e1r, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature\npyramid networks for object detection. In CVPR, 2017.\n\n[32] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-Fast-RCNN: Hard positive generation via\nadversary for object detection. In CVPR, 2017.\n\n[33] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. DSOD:\nLearning deeply supervised object detectors from scratch. In ICCV, 2017.\n\n[34] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet:\nA large mini-batch object detector. In CVPR, 2018.\n\n[35] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.\n\n[36] Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic\ncontours from inverse detectors. In ICCV, 2011.", "award": [], "sourceid": 6839, "authors": [{"given_name": "Golnaz", "family_name": "Ghiasi", "institution": "Google"}, {"given_name": "Tsung-Yi", "family_name": "Lin", "institution": "Google Brain"}, {"given_name": "Quoc", "family_name": "Le", "institution": "Google"}]}