{"title": "Fixing the train-test resolution discrepancy", "book": "Advances in Neural Information Processing Systems", "page_first": 8252, "page_last": 8262, "abstract": "Data-augmentation is key to the training of neural networks for image classification. This paper first shows that existing augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: in fact, a lower train resolution improves the classification at test time!\n\nWe then propose a simple strategy to optimize the classifier performance, that employs different train and test resolutions. It relies on a computationally cheap fine-tuning of the network at the test resolution. This enables training strong classifiers using small training images, and therefore significantly reduce the training time. For instance, we obtain 77.1% top-1 accuracy on ImageNet with a ResNet-50 trained on 128x128 images, and 79.8% with one trained at 224x224.\n\nA ResNeXt-101 32x48d pre-trained with weak supervision on 940 million 224x224 images and further optimized with our technique for test resolution 320x320 achieves 86.4% top-1 accuracy (top-5: 98.0%). To the best of our  knowledge this is the highest ImageNet single-crop accuracy  to date.", "full_text": "Fixing the train-test resolution discrepancy\n\nHugo Touvron, Andrea Vedaldi, Matthijs Douze, Herv\u00b4e J\u00b4egou\n\nFacebook AI Research\n\nAbstract\n\nData-augmentation is key to the training of neural networks for image classi\ufb01-\ncation. This paper \ufb01rst shows that existing augmentations induce a signi\ufb01cant\ndiscrepancy between the size of the objects seen by the classi\ufb01er at train and test\ntime: in fact, a lower train resolution improves the classi\ufb01cation at test time!\nWe then propose a simple strategy to optimize the classi\ufb01er performance, that\nemploys different train and test resolutions. 
It relies on a computationally cheap fine-tuning of the network at the test resolution. This enables training strong classifiers using small training images, and therefore significantly reduces the training time. For instance, we obtain 77.1% top-1 accuracy on ImageNet with a ResNet-50 trained on 128×128 images, and 79.8% with one trained at 224×224.

A ResNeXt-101 32x48d pre-trained with weak supervision on 940 million 224×224 images and further optimized with our technique for test resolution 320×320 achieves 86.4% top-1 accuracy (top-5: 98.0%). To the best of our knowledge this is the highest ImageNet single-crop accuracy to date.

1 Introduction

Convolutional Neural Networks [18] (CNNs) are used extensively in computer vision tasks such as image classification [17], object detection [27], inpainting [37], style transfer [9] and even image compression [28]. In order to obtain the best possible performance from these models, the training and testing data distributions should match. However, data pre-processing procedures are often different for training and testing. For instance, in image recognition the current best training practice is to extract a rectangle with random coordinates from the image, to artificially increase the amount of training data. This region, which we call the Region of Classification (RoC), is then resized to obtain an image, or crop, of a fixed size (in pixels) that is fed to the CNN. At test time, the RoC is instead set to a square covering the central part of the image, which results in the extraction of a center crop. This reflects the bias of photographers, who tend to center important visual content.
Thus, while the crops extracted at training and test time have the same size, they arise from different RoCs, which skews the distribution of data seen by the CNN.

Over the years, training and testing pre-processing procedures have evolved to improve the performance of CNNs, but so far they have been optimized separately [7]. In this paper, we first show that this separate optimization has led to a significant distribution shift between training and testing regimes, with a detrimental effect on the test-time performance of models. We then show that this problem can be solved by jointly optimizing the choice of resolutions and scales at training and test time, while keeping the same RoC sampling. Our strategy only requires fine-tuning two layers in order to compensate for the shift in statistics caused by changing the crop size. This retains the advantages of existing pre-processing protocols for training and testing, including augmenting the training data, while compensating for the distribution shift.

Our approach is based on a rigorous analysis of the effect of pre-processing on the statistics of natural images, which shows that increasing the size of the crops used at test time compensates for randomly

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Selection of the image regions fed to the network at training time and testing time, with typical data-augmentation. The red region of classification is resampled as a crop that is fed to the neural net. For objects that have a similar size in the input image, like the white horse, the standard augmentations typically make them larger at training time than at test time (second column). To counter this effect, we either reduce the train-time resolution, or increase the test-time resolution (third and fourth column).
The horse then has the same size at train and test time, requiring less scale invariance for the neural net. Our approach only needs a computationally cheap fine-tuning.

sampling the RoCs at training time. This analysis also shows that we need to use lower resolution crops at training than at test time. This significantly impacts the processing time: halving the crop resolution leads to a threefold reduction in the network evaluation speed and significantly reduces the memory consumption for a typical CNN, which is especially important for training on GPUs. For instance, for a target test resolution of 224×224, training at resolution 160×160 provides better results than the standard practice of training at resolution 224×224, while being more efficient. In addition, we can adapt a ResNet-50 trained at resolution 224×224 for the test resolution 320×320 and thus obtain a top-1 accuracy of 79.8% (single-crop) on ImageNet.

Alternatively, we leverage the improved efficiency to train high-accuracy models that operate at much higher resolution at test time while still training quickly. For instance, we achieve a top-1 accuracy of 86.4% (single-crop) on ImageNet with a ResNeXt-101 32x48d pre-trained in weakly-supervised fashion on 940 million public images. Finally, our method makes it possible to save GPU memory, which we exploit in the optimization: employing larger batch sizes leads to a better final performance [13].

2 Related work

Image classification is a core problem in computer vision, and is used as a benchmark task by the community to measure progress on image understanding. Models pre-trained for image classification, usually on the ImageNet database [8], transfer to a variety of other applications [24].
Further-\nmore, advances in image classi\ufb01cation translate to improved results on many other tasks [10, 15].\nRecent research in image classi\ufb01cation has demonstrated improved performance by considering\nlarger networks and higher resolution images [14, 22]. For instance, the state of the art in the Im-\nageNet ILSVRC 2012 benchmark is currently held by the ResNeXt-101 32x48d [22] architecture\nwith 829M parameters using 224\u21e5224 images for training. The state of the art for a model learned\nfrom scratch is currently held by the Ef\ufb01cientNet-b7 [34] with 66M parameters using 600\u21e5600\nimages for training. In this paper, we focus on the ResNet-50 architecture [11] due to its good accu-\nracy/cost tradeoff (25.6M parameters) and its popularity. We also conduct some experiments using\nthe PNASNet-5-Large [21] architecture (86.1M parameters), which exhibits good performance on\nImageNet with a reasonable training time, and with the ResNeXt-101 32x48d [22] weakly super-\nvised, which is best publicly available network on ImageNet.\n\nData augmentation is routinely employed at training time to improve model generalization and\nreduce over\ufb01tting. Typical transformations [3, 5, 32] include: random-size crop, horizontal \ufb02ip and\ncolor jitter. In our paper, we adopt the standard set of augmentations commonly used in image clas-\nsi\ufb01cation. As a reference, we consider the default models in the PyTorch library. The accuracy is\nalso improved by combining multiple data augmentations at test time, although this means that sev-\n\n2\n\n\fFigure 2: Empirical distribution of the areas of\nthe RoCs as a fraction of the image areas ex-\ntracted by data augmentation. The data aug-\nmentation schemes are the standard ones used\nat training and testing time for CNN classi\ufb01ers.\nThe spiky distribution at test time is due to the\nfact that RoCs are center crops and the only re-\nmaining variability is due to the different image\naspect ratios. 
Notice that the distribution is very\ndifferent at training and testing time.\n\neral forward passes are required to classify one image. For example, [11, 17, 32] used ten crops (one\ncentral, and one for each corner of the image and their mirrored versions). Another performance-\nboosting strategy is to classify an image by feeding it at multiple resolutions [11, 30, 32], again\naveraging the predictions. More recently, multi-scale strategies such as the feature pyramid net-\nwork [20] have been proposed to directly integrate multiple resolutions in the network, both at train\nand test time, with signi\ufb01cant gains in category-level detection.\n\nFeature pooling. A recent approach [5] employs p-pooling instead of average pooling to adapt the\nnetwork to test resolutions signi\ufb01cantly higher than the training resolution. The authors show that\nthis improves the network\u2019s performance, in accordance with the conclusions drawn by Boureau et\nal. [6]. Similar pooling techniques have been employed in image retrieval for a few years [26, 35],\nwhere high-resolution images are required to achieve a competitive performance.\n\n3 Region selection and scale statistics\n\nApplying a Convolutional Neural Network (CNN) classi\ufb01er to an image generally requires to pre-\nprocess the image. One of the key steps involves selecting a rectangular region in the input image,\nwhich we call Region of Classi\ufb01cation (RoC). The RoC is then extracted and resized to a square\ncrop of a size compatible with the CNN, e.g., AlexNet requires a 224 \u21e5 224 crop as input.\nWhile this process is simple, in practice it has two subtle but signi\ufb01cant effects on how the image\ndata is presented to the CNN. First, the resizing operation changes the apparent size of the objects\nin the image (section 3.1). This is important because CNNs do not have a predictable response to a\nscale change (as opposed to translations). 
Second, the choice of different crop sizes (for architectures such as ResNet that admit non-fixed inputs) has an effect on the statistics of the network activations, especially after global pooling layers (section 3.2). This section analyses these two effects in detail.

In the discussion, we use the following conventions: the "input image" is the original training or testing image; the RoC is a rectangle in the input image; and the "crop" is the pixels of the RoC, rescaled with bilinear interpolation to a fixed resolution, then fed to the CNN.

3.1 Scale and apparent object size

If a CNN is to acquire a scale-invariant behavior for object recognition, it must learn it from data. However, resizing the input images in pre-processing changes the distribution of object sizes. Since different pre-processing protocols are used at training and testing time¹, the size distribution differs in the two cases. This is quantified next.

3.1.1 Relation between apparent and actual object sizes

We consider the following imaging model: the camera projects the 3D world onto a 2D image, so the apparent size of the objects is inversely proportional to their distance from the camera. For simplicity, we model a 3D object as an upright square of height and width R × R at a distance Z from the camera, and fronto-parallel to it. Hence, its image is an r × r rectangle, whose apparent size r is given by r = f R/Z, where f is the focal length of the camera.

¹At training time, the extraction and resizing of the RoC is used as an opportunity to augment the data by randomly altering the scale of the objects; in this manner the CNN is stimulated to be invariant to a wider range of object scales.
Thus we can express the apparent size as the product r = f · r₁ of the focal length f, which depends on the camera, and of the variable r₁ = R/Z, whose distribution p(r₁) is camera-independent. While the focal length is variable, the field of view angle θ_FOV of most cameras is usually in the [40°, 60°] range. Hence, for an image of size H × W one can write f = k√(HW), where k⁻¹ = 2 tan(θ_FOV/2) ≈ 1 is approximately constant. With this definition for f, the apparent size r is expressed in pixels.

3.1.2 Effect of image pre-processing on the apparent object size

Now, we consider the effect of rescaling images on the apparent size of objects. If an object has an extent of r × r pixels in the input image, and if s is the scaling factor between input image and the crop, then by the time the object is analysed by the CNN, it will have the new size of rs × rs pixels. The scaling factor s is determined by the pre-processing protocol, discussed next.

Train-time scale augmentation. As a prototypical augmentation protocol, we consider RandomResizedCrop in PyTorch, which is very similar to augmentations used by other toolkits such as Caffe and the original AlexNet. RandomResizedCrop takes as input an H × W image, selects a RoC at random, and resizes the latter to output a K_train × K_train crop. The RoC extent is obtained by first sampling a scale parameter σ such that σ² ∼ U([σ₋², σ₊²]) and an aspect ratio α such that ln α ∼ U([ln α₋, ln α₊]). Then, the size of the RoC in the input image is set to H_RoC × W_RoC = σ√(αHW) × σ√(HW/α). The RoC is resized anisotropically with factors (K_train/H_RoC, K_train/W_RoC) to generate the output image. Assuming for simplicity that the input image is square (i.e. H = W) and that α = 1, the scaling factor from input image to output crop is given by:

    s = √(K_train K_train) / √(H_RoC W_RoC) = (1/σ) · K_train / √(HW).   (1)

By scaling the image in this manner, the apparent size of the object becomes

    r_train = s · r = s f · r₁ = (k K_train / σ) · r₁.   (2)

Since k K_train is constant, differently from r, r_train does not depend on the size H × W of the input image. Hence, pre-processing standardizes the apparent size, which otherwise would depend on the input image resolution. This is important as networks do not have built-in scale invariance.

Test-time scale augmentation. As noted above, test-time augmentation usually differs from train-time augmentation. The former usually amounts to: isotropically resizing the image so that the shorter dimension is K_test^image and then extracting a K_test × K_test crop (CenterCrop) from that. Under the assumption that the input image is square (H = W), the scaling factor from input image to crop rewrites as s = K_test^image / √(HW), so that

    r_test = s · r = k K_test^image · r₁.   (3)

This has a similar size standardization effect as the train-time augmentation.

Lack of calibration. Comparing eqs. (2) and (3), we conclude that the same input image containing an object of size r₁ results in two different apparent sizes if training or testing pre-processing is used. These two sizes are related by:

    r_test / r_train = σ · K_test^image / K_train.   (4)

In practice, for standard networks such as AlexNet, K_test^image/K_train ≈ 1.15; however, the scaling factor σ is sampled (with the square law seen above) in a range [σ₋, σ₊] = [0.28, 1]. Hence, at testing time the same object may appear as small as a third of what it appears at training time. For standard values of the pre-processing parameters, the expected value of this ratio w.r.t. σ is

    E[r_test / r_train] = F · K_test^image / K_train ≈ 0.80,   with   F = (2/3) · (σ₊³ − σ₋³) / (σ₊² − σ₋²),   (5)

where F captures all the sampling parameters.

Figure 3: Cumulative density function of the vector components at the output of the spatial average pooling operator, for a standard ResNet-50 trained at resolution 224 and tested at different resolutions. The distribution is measured on the validation images of ImageNet.

3.2 Scale and activation statistics

In addition to affecting the apparent size of objects, pre-processing also affects the activation statistics of the CNN, especially if its architecture allows changing the size of the input crop. We first look at the receptive field size of a CNN activation in the previous layer. This is the number of input spatial locations that affect that response. For the convolutional part of the CNN, comprising linear convolution, subsampling, ReLU, and similar layers, changing the input crop size is almost neutral because the receptive field is unaffected by the input size. However, for classification the network must be terminated by a pooling operator (usually average pooling) in order to produce a fixed-size vector. Changing the size of the input crop strongly affects the activation statistics of this layer.

Activation statistics. We measure the distribution of activation values after the average pooling in a ResNet-50 in fig. 3. As it is applied on a ReLU output, all values are non-negative. At the default crop resolution of K_test = K_train = 224 pixels, the activation map is 7×7 with a depth of 2048. At K_test = 64, the activation map is only 2×2: pooling only 0 values becomes more likely and activations are more sparse (the rate of 0's increases from 0.5% to 29.8%). The values are also more spread out: the fraction of values above 2 increases from 1.2% to 11.9%.
Increasing the resolution reverts the effect: with K_test = 448, the activation map is 14×14, and the output is less sparse and less spread out. This simple statistical observation shows that if the distribution of activations changes at test time, the values are not in the range that the final classifier layers (linear & softmax) were trained for.

3.3 Larger test crops result in better accuracy

Despite the fact that increasing the crop size affects the activation statistics, it is generally beneficial for accuracy, since as discussed before it reduces the train-test object size mismatch. For instance, the accuracy of ResNet-50 on the ImageNet validation set as K_test is changed (see section 5) is:

    K_test     64     128    224    256    288    320    384    448
    accuracy   29.4   65.4   77.0   78.0   78.4   78.3   77.7   76.6

Thus for K_test = 288 the accuracy is 78.4%, which is greater than the 77.0% obtained for the native crop size K_test = K_train = 224 used in training. In fig. 5, we see that this result is general: better accuracy is obtained with higher resolution crops at test time than at train time. In the next section, we explain and leverage this discrepancy by adjusting the network's weights.

4 Method

Based on the analysis of section 3, we propose two improvements to the standard setting. First, we show that the difference in apparent object sizes at training and testing time can be removed by increasing the crop size at test time, which explains the empirical observation of section 3.3. Second, we slightly adjust the network before the global average pooling layer in order to compensate for the change in activation statistics due to the increased size of the input crop.

4.1 Calibrating the object sizes by adjusting the crop size

Equation (5) estimates the change in the apparent object sizes during training and testing.
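As a quick numerical check of eq. (5), it can be evaluated with the standard pre-processing parameters quoted in section 3.1.2 (a minimal sketch; σ₋ = 0.28, σ₊ = 1 and K_test^image/K_train ≈ 1.15 come from the text, which rounds the resulting expectation to 0.80):

```python
# Numerical check of eq. (5): E[r_test / r_train] = F * K_image_test / K_train,
# with F = (2/3) * (sigma_plus**3 - sigma_minus**3) / (sigma_plus**2 - sigma_minus**2).
sigma_minus, sigma_plus = 0.28, 1.0   # sampling range of the scale parameter sigma
k_ratio = 1.15                        # K_image_test / K_train for standard networks

F = (2 / 3) * (sigma_plus**3 - sigma_minus**3) / (sigma_plus**2 - sigma_minus**2)
expected_ratio = F * k_ratio          # ~0.81, quoted as ~0.80 in the text
alpha = 1 / expected_ratio            # test-time upscaling factor that cancels the shift

print(F, expected_ratio, alpha)       # ~0.707, ~0.814, ~1.23
```

This is consistent with the factor α ≈ 1.25 used in section 4.1, which works with the rounded value 1/0.80.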
If the size of the intermediate image K_test^image is increased by a factor α (where α ≈ 1/0.80 = 1.25 in the example) then, at test time, the apparent size of the objects is increased by the same factor. This equalizes the effect of the training pre-processing, which tends to zoom in on the objects. However, increasing K_test^image with K_test fixed means looking at a smaller part of the object. This is not ideal: the object to identify is often well framed by the photographer, so the crop may show only a detail of the object or miss it altogether. Hence, in addition to increasing K_test^image, we also increase the crop size K_test to keep the ratio K_test^image/K_test constant. However, this means that K_test > K_train, which skews the activation statistics (section 3.2). The next section shows how to compensate for this skew.

Figure 4: CDF of the activations at the output of the average pooling layer, for a ResNet-50 tested at different resolutions (K_test = 64, 128, 224, 448). Compare the state before and after fine-tuning the batch-norm.

4.2 Adjusting the statistics before spatial pooling

At this point, we have selected the "correct" test resolution for the crop but we have skewed activation statistics. Hereafter we explore two approaches to compensate for this skew.

Parametric adaptation. We fit the output of the average pooling layer (section 3.2) with a parametric Fréchet distribution at the original K_train and final K_test resolutions. Then, we define an equalization mapping from the new distribution back to the old one via a scalar transformation, and apply it as an activation function after the pooling layer (see Appendix A).
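Appendix A of the paper specifies the exact Fréchet-based mapping; the sketch below illustrates the same equalization idea non-parametrically (our illustration, not the authors' implementation): map each pooled activation through the empirical CDF at the test resolution and back through the inverse CDF at the train resolution. The gamma-distributed activations are synthetic stand-ins.

```python
import numpy as np

# Non-parametric stand-in for the scalar equalization mapping of Sec. 4.2:
# x -> F_train^{-1}(F_test(x)), estimated from empirical quantiles.
rng = np.random.default_rng(0)
act_train = rng.gamma(2.0, 1.0, 10_000)  # pooled activations at K_train (synthetic)
act_test = rng.gamma(2.0, 1.8, 10_000)   # more spread out, as at a larger K_test

qs = np.linspace(0.0, 1.0, 101)
q_test = np.quantile(act_test, qs)
q_train = np.quantile(act_train, qs)

def equalize(x):
    """Scalar mapping applied after average pooling (piecewise-linear CDF match)."""
    return np.interp(x, q_test, q_train)

mapped = equalize(act_test)  # distribution now close to the train-time one
```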
This compensation provides\na measurable but limited improvement on accuracy, probably because the model is too simple and\ndoes not differentiate the distributions of different components going through the pooling operator.\n\nAdaptation via \ufb01ne-tuning.\nIncreasing the crop resolution at test time is effectively a domain\nshift. A simple way to compensate for this shift is to \ufb01ne-tune the model. In our case, we \ufb01ne-\ntune on the same training set, after switching from Ktrain to Ktest. We restrict the \ufb01ne-tuning to the\nvery last layers of the network. We observed in the distribution analysis that the sparsity should be\nadapted. This requires at least to include the batch normalization that precedes the global pooling\ninto the \ufb01ne-tuning. In this way the batch statistics are adapted to the increased resolution. We also\nuse the test-time augmentation scheme during \ufb01ne-tuning to avoid incurring further domain shifts.\nFigure 4 shows the pooling operator\u2019s activation statistics before and after \ufb01ne-tuning. After \ufb01ne-\ntuning the activation statistics closely resemble the train-time statistics. This hints that adaptation is\nsuccessful. Yet, as discussed above, this does not imply an improvement in accuracy.\n\n5 Experiments\n\nBenchmark data. We experiment on the ImageNet-2012 benchmark [29], reporting validation\nperformance as top-1 accuracy. It has been argued that this measure is sensitive to errors in the\nImageNet labels [31]. However, the top-5 metrics, which is more robust, tends to saturate with\nmodern architectures, while the top-1 accuracy is more sensitive to improvements in the model.\nTo assess the signi\ufb01cance of our results, we compute the standard deviation of the top-1 accuracy:\nwe classify the validation images, split the set into 10 folds and measure the accuracy on 9 of them,\nleaving one out in turn. The standard deviation of accuracy over these folds is \u21e0 0.03% for all\nsettings. 
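The fold-based spread estimate described above can be sketched as follows (a toy sketch with synthetic per-image correctness at roughly 77% accuracy, not the actual predictions):

```python
import random
import statistics

# Leave-one-fold-out spread: split per-image correctness into 10 folds and
# measure accuracy on the union of every 9 folds, as described in the text.
random.seed(0)
correct = [random.random() < 0.77 for _ in range(50_000)]  # synthetic stand-in

folds = [correct[i::10] for i in range(10)]
accuracies = []
for left_out in range(10):
    kept = [c for i, fold in enumerate(folds) if i != left_out for c in fold]
    accuracies.append(100.0 * sum(kept) / len(kept))

spread = statistics.pstdev(accuracies)  # a few hundredths of a percent
```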
Therefore, we report one significant digit in the accuracy percentages. We also report results for other datasets involving transfer learning in section 5.3.

Figure 5: Top-1 accuracy of the ResNet-50 according to the test-time resolution. Left: without adaptation; right: after resolution adaptation. The numerical results are reported in Appendix C. A comparison of results without random resized crop is reported in Appendix D.

Architectures. We use standard state-of-the-art neural network architectures with no modifications. We consider in particular ResNet-50 [11]. For larger experiments, we use PNASNet-5-Large [21], learned using neural architecture search as a succession of interconnected cells. It is accurate (82.9% Top-1) with relatively few parameters (86.1M). We also use ResNeXt-101 32x48d [22], pre-trained in weakly-supervised fashion on 940 million public images with 1.5k hashtags matching 1000 ImageNet-1k synsets. It is accurate (85.4% Top-1) with a lot of parameters (829M).

Training protocol. We train ResNet-50 with SGD with a learning rate of 0.1 × B/256, where B is the batch size, as in [13]. The learning rate is divided by 10 every 30 epochs. With a Repeated Augmentation of 3, an epoch processes 5005 × 512/B batches, or ∼90% of the training images, see [5]. In the initial training, we use B = 512, 120 epochs and the default PyTorch data augmentation: horizontal flip, random resized crop (as in section 3) and color jittering. To fine-tune, the initial learning rate is 0.008 with the same decay, B = 512, and 60 epochs. For ResNeXt-101 32x48d we use the pretrained version from the PyTorch hub repository [2], with a ten times smaller learning rate and a batch size two times smaller. For PNASNet-5-Large we use the pretrained version from Cadene's GitHub repository [1].
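The learning-rate schedule of the initial training protocol above (linear scaling with batch size as in [13], divided by 10 every 30 epochs) can be written compactly; the helper name is ours:

```python
# Step schedule used for the initial ResNet-50 training described above:
# base LR 0.1 * B/256, divided by 10 every 30 epochs.
def learning_rate(epoch: int, batch_size: int) -> float:
    base = 0.1 * batch_size / 256
    return base * (0.1 ** (epoch // 30))
```

For example, with B = 512 the schedule starts at 0.2 and drops to 0.02 at epoch 30.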
The difference with the ResNet-50 fine-tuning is that we modify the last three cells, in one epoch and with a learning rate of 0.0008. We ran our training experiments on machines with 8 Tesla V100 GPUs and 80 CPU cores.

Fine-tuning data-augmentation. We experimented with three data-augmentation schemes for fine-tuning. The first one (test DA) resizes the image and then takes the center crop; the second one (test DA2) resizes the image, then applies a random horizontal shift of the center crop, a horizontal flip and color jittering; the last one (train DA) is the train-time data-augmentation described in the previous paragraph. The test DA scheme is the simplest. We use it for a more direct comparison with the literature for all the results reported with ResNet-50 and PNASNet-5-Large, except in Table 2, where we report results with test DA2, which provides slightly better performance. For ResNeXt-101 32x48d all reported results are obtained with test DA2. Section C provides a comparison of the performance of these fine-tuning data augmentations.

The baseline experiment is to increase the resolution without adaptation. Repeated augmentations already improve the default PyTorch ResNet-50 from 76.2% top-1 accuracy to 77.0%. Figure 5(left) shows that increasing the resolution at test time increases the accuracy of all our networks. E.g., the accuracy of a ResNet-50 trained at resolution 224 increases from 77.0 to 78.4 top-1 accuracy, an improvement of 1.4 percentage points. This concurs with prior findings in the literature [12].

5.1 Results

Improvement of our approach on a ResNet-50. Figure 5(right) shows the results obtained after fine-tuning the last batch norm in addition to the classifier. With fine-tuning we get the best results (79%) with the classic ResNet-50 trained at K_train = 224.
Compared to when there is no fine-tuning, the K_test at which the maximal accuracy is obtained increases from K_test = 288 to 384.

Table 1: Application to larger networks: resulting top-1 accuracy.

PNASNet-5-Large, train resolution 331:
    Fine-tuned layers                        | Test 331 | 384  | 395  | 416  | 448  | 480
    Classifier                               | 82.7     | 83.0 | 83.2 | 83.0 | 83.0 | 82.8
    Classifier + batch-norm                  | 82.7     | 83.4 | 83.5 | 83.4 | 83.5 | 83.4
    Classifier + batch-norm + 3 last cells   | 82.7     | 83.3 | 83.4 | 83.5 | 83.6 | 83.7

ResNeXt-101 32x48d, train resolution 224, fine-tuning classifier + batch-norm:
    Test 224: 85.4 | 288: 86.1 | 320: 86.4

Table 2: State of the art on ImageNet with various architectures (single-crop evaluation).

    Model                            | Extra training data | Train | Test | # Parameters | Top-1 (%) | Top-5 (%)
    ResNet-50 PyTorch                |                     | 224   | 224  | 25.6M        | 76.1      | 92.9
    ResNet-50 mix up [40]            |                     | 224   | 224  | 25.6M        | 77.7      | 94.4
    ResNet-50 CutMix [39]            |                     | 224   | 224  | 25.6M        | 78.4      | 94.1
    ResNet-50-D [13]                 |                     | 224   | 224  | 25.6M        | 79.3      | 94.6
    MultiGrain R50-AA-500 [5]        |                     | 224   | 500  | 25.6M        | 79.4      | 94.8
    ResNet-50 Billion-scale [38]     | yes                 | 224   | 224  | 25.6M        | 81.2      | 96.0
    Our ResNet-50                    |                     | 224   | 384  | 25.6M        | 79.1      | 94.6
    Our ResNet-50 CutMix             |                     | 224   | 320  | 25.6M        | 79.8      | 94.9
    Our ResNet-50 Billion-scale@160  | yes                 | 160   | 224  | 25.6M        | 81.9      | 96.1
    Our ResNet-50 Billion-scale@224  | yes                 | 224   | 320  | 25.6M        | 82.5      | 96.6
    PNASNet-5 (N=4, F=216) [21]      |                     | 331   | 331  | 86.1M        | 82.9      | 96.2
    MultiGrain PNASNet @ 500px [5]   |                     | 331   | 500  | 86.1M        | 83.6      | 96.7
    AmoebaNet-B (6,512) [14]         |                     | 480   | 480  | 577M         | 84.3      | 97.0
    EfficientNet-B7 [34]             |                     | 600   | 600  | 66M          | 84.4      | 97.1
    Our PNASNet-5                    |                     | 331   | 480  | 86.1M        | 83.7      | 96.8
    ResNeXt-101 32x8d [22]           | yes                 | 224   | 224  | 88M          | 82.2      | 96.4
    ResNeXt-101 32x16d [22]          | yes                 | 224   | 224  | 193M         | 84.2      | 97.2
    ResNeXt-101 32x32d [22]          | yes                 | 224   | 224  | 466M         | 85.1      | 97.5
    ResNeXt-101 32x48d [22]          | yes                 | 224   | 224  | 829M         | 85.4      | 97.6
    Our ResNeXt-101 32x48d           | yes                 | 224   | 320  | 829M         | 86.4      | 98.0

If instead we reduce the training resolution to K_train = 128, testing at K_test
= 224 yields 77.1% accuracy, which is above the baseline trained at full test resolution without fine-tuning.

Application to larger networks. The same adaptation method can be applied to any convolutional network. In Table 1 we report results on PNASNet-5-Large and the IG-940M-1.5k ResNeXt-101 32x48d [22]. For PNASNet-5-Large, we found it beneficial to fine-tune more than just the batch-normalization and the classifier. Therefore, we also experiment with fine-tuning the three last cells. By increasing the resolution to K_test = 480, the accuracy increases by 1 percentage point. By combining this with an ensemble of 10 crops at test time, we obtain 83.9% accuracy. With the ResNeXt-101 32x48d, increasing the resolution to K_test = 320 increases the accuracy by 1.0 percentage point. We thus reach 86.4% top-1 accuracy.

Speed-accuracy trade-off. We consider the trade-off between training time and accuracy (normalized as if it were run on 1 GPU). The full table with timings is in supplementary Section C. In the initial training stage, the forward pass is 3 to 6 times faster than the backward pass. However, during fine-tuning the ratio is inverted, because the backward pass is applied only to the last layers. In the low-resolution training regime (K_train = 128), the additional fine-tuning required by our method increases the training time from 111.8 h to 124.1 h (+11%). This is to obtain an accuracy of 77.1%, which outperforms the network trained at the native resolution of 224 in 133.9 h. We produce a fine-tuned network with K_test = 384 that obtains a higher accuracy than the network trained natively at that resolution, and the training is 2.3× faster: 151.5 h instead of 348.5 h.

Table 3: Transfer learning tasks with our method and comparison with the state of the art.
We only compare ImageNet-based transfer learning results with a single center crop for the evaluation (if available, otherwise we report the best published result), without any change in architecture compared to the one used on ImageNet. We report the top-1 accuracy (%).

Dataset                  State-of-the-art model       Top-1   Our model            Baseline  +our method
Stanford Cars [16]       EfficientNet-B7 [34]         94.7    SENet-154            94.0      94.4
CUB-200-2011 [36]        MPN-COV [19]                 88.7    SENet-154            88.4      88.7
Oxford 102 Flowers [23]  EfficientNet-B7 [34]         98.8    InceptionResNet-V2   95.0      95.7
Oxford-IIIT Pets [25]    AmoebaNet-B (6,512) [14]     95.9    SENet-154            94.6      94.8
Birdsnap [4]             EfficientNet-B7 [34]         84.3    SENet-154            83.4      84.3

Ablation study. We study the contribution of the different choices to the performance, limited to Ktrain = 128 and Ktrain = 224. By simply fine-tuning the classifier (the fully connected layers of ResNet-50) with test-time augmentation, we reach 78.9% top-1 accuracy with the classic ResNet-50 initially trained at resolution 224. Additionally fine-tuning the batch norm and improving the data augmentation brings this to 79.0%. The larger the difference in resolution between training and testing, the more important batch-norm fine-tuning is to adapt to the data augmentation. The full results are in supplementary Section C.

5.2 Beyond the current state of the art

Table 2 compares our results with competitive methods from the literature. Our ResNet-50 is slightly worse than ResNet-50-D and MultiGrain, but these do not have exactly the same architecture. On the other hand, our ResNet-50 CutMix, which has a classic ResNet-50 architecture, outperforms other ResNet-50 models, including the slightly modified versions. Our fine-tuned PNASNet-5 outperforms the MultiGrain version. To the best of our knowledge, our ResNeXt-101 32x48d surpasses all other models available in the literature.
It achieves 86.4% top-1 accuracy and 98.0% top-5 accuracy, i.e., it is the first model to exceed 86.0% in top-1 accuracy and 98.0% in top-5 accuracy on the ImageNet-2012 benchmark [29]. This exceeds the previous state of the art [22] by 1.0% absolute in top-1 accuracy and 0.4% in top-5 accuracy.

5.3 Transfer learning tasks

We have used our method in transfer learning tasks to validate its effectiveness on datasets other than ImageNet. We evaluated it on the following datasets: Stanford Cars [16], CUB-200-2011 [36], Oxford 102 Flowers [23], Oxford-IIIT Pets [25] and Birdsnap [4]. We used two types of networks for these transfer learning tasks: SENet-154 [3] and InceptionResNet-V2 [33]. For all these experiments, we proceed as follows: (1) we initialize the network with the weights learned on ImageNet (using models from [1]); (2) we train it entirely for several epochs at a certain resolution; (3) we fine-tune the last batch norm and the fully connected layer at a higher resolution. Table 3 reports the baseline performance and shows that our method systematically improves the performance, leading to a new state of the art on several benchmarks. We notice that our method is most effective on datasets of high-resolution images.

6 Conclusion

We have studied extensively the effect of using different train and test scale augmentations on the statistics of natural images and of the network's pooling activations. We have shown that, by adjusting the crop resolution and via a simple and lightweight parameter adaptation, it is possible to increase the accuracy of standard classifiers significantly, all else being equal. We have also shown that researchers waste resources when both training and testing strong networks at resolution 224×224; we introduce a method that can "fix" these networks post hoc and thus improve their performance.
An open-source implementation of our method is available at https://github.com/facebookresearch/FixRes.

References

[1] Pre-trained pytorch models. https://github.com/Cadene/pretrained-models.pytorch. Accessed: 2019-05-23.

[2] Pytorch hub models. https://pytorch.org/hub/facebookresearch_WSL-Images_resnext/. Accessed: 2019-06-26.

[3] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.

[4] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In Conference on Computer Vision and Pattern Recognition, 2014.

[5] Maxim Berman, Hervé Jégou, Andrea Vedaldi, Iasonas Kokkinos, and Matthijs Douze. MultiGrain: a unified image embedding for classes and instances. arXiv preprint arXiv:1902.05509, 2019.

[6] Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In International Conference on Machine Learning, 2010.

[7] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[9] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

[10] Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision, 124(2):237–254, 2017.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, June 2016.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.

[13] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. arXiv preprint arXiv:1812.01187, 2018.

[14] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

[15] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974, 2018.

[16] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[18] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[19] Peihua Li, Jiangtao Xie, Qilong Wang, and Wangmeng Zuo. Is second-order information helpful for large-scale visual recognition? arXiv preprint arXiv:1703.08050, 2017.

[20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[21] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In International Conference on Computer Vision, September 2018.

[22] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In European Conference on Computer Vision, 2018.

[23] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.

[24] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, 2014.

[25] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[26] Filip Radenović, Giorgos Tolias, and Ondrej Chum. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.

[28] Oren Rippel and Lubomir Bourdev. Real-time adaptive image compression. In International Conference on Machine Learning, 2017.

[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.

[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[31] Pierre Stock and Moustapha Cisse. ConvNets and ImageNet beyond accuracy: Understanding mistakes and uncovering biases. In European Conference on Computer Vision, 2018.

[32] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition, 2015.

[33] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.

[34] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

[35] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879, 2015.

[36] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.

[37] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pages 341–349, 2012.

[38] Ismet Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Kumar Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.

[39] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899, 2019.

[40] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.