{"title": "Faster Neural Networks Straight from JPEG", "book": "Advances in Neural Information Processing Systems", "page_first": 3933, "page_last": 3944, "abstract": "The simple, elegant approach of training convolutional neural\n  networks (CNNs) directly from RGB pixels has enjoyed overwhelming\n  empirical success. But can more performance be squeezed out of\n  networks by using different input representations?  In this paper we\n  propose and explore a simple idea: train CNNs directly on the\n  blockwise discrete cosine transform (DCT) coefficients computed and\n  available in the middle of the JPEG codec. Intuitively, when\n  processing JPEG images using CNNs, it seems unnecessary to\n  decompress a blockwise frequency representation to an expanded pixel\n  representation, shuffle it from CPU to GPU, and then process it with\n  a CNN that will learn something similar to a transform back to\n  frequency representation in its first layers. Why not skip both\n  steps and feed the frequency domain into the network directly?  In\n  this paper we modify libjpeg to produce DCT coefficients directly,\n  modify a ResNet-50 network to accommodate the differently sized and\n  strided input, and evaluate performance on ImageNet. We find\n  networks that are both faster and more accurate, as well as networks\n  with about the same accuracy but 1.77x faster than ResNet-50.", "full_text": "Faster Neural Networks Straight from JPEG\n\nLionel Gueguen1\n\nAlex Sergeev1\n\nBen Kadlec1\n\nRosanne Liu2\n\nJason Yosinski2\n\n1Uber\n\n2Uber AI Labs\n\n{lgueguen,asergeev,bkadlec,rosanne,yosinski}@uber.com\n\nAbstract\n\nThe simple, elegant approach of training convolutional neural networks (CNNs)\ndirectly from RGB pixels has enjoyed overwhelming empirical success. But\ncould more performance be squeezed out of networks by using different input\nrepresentations? 
In this paper we propose and explore a simple idea: train CNNs\ndirectly on the blockwise discrete cosine transform (DCT) coef\ufb01cients computed\nand available in the middle of the JPEG codec. Intuitively, when processing JPEG\nimages using CNNs, it seems unnecessary to decompress a blockwise frequency\nrepresentation to an expanded pixel representation, shuf\ufb02e it from CPU to GPU,\nand then process it with a CNN that will learn something similar to a transform\nback to frequency representation in its \ufb01rst layers. Why not skip both steps and\nfeed the frequency domain into the network directly? In this paper, we modify\nlibjpeg to produce DCT coef\ufb01cients directly, modify a ResNet-50 network to\naccommodate the differently sized and strided input, and evaluate performance\non ImageNet. We \ufb01nd networks that are both faster and more accurate, as well as\nnetworks with about the same accuracy but 1.77x faster than ResNet-50.\n\n1\n\nIntroduction\n\nThe amazing progress toward training neural networks, particularly convolutional neural networks\n[14], to attain good performance on a variety of tasks [13, 19, 20, 10] has led to the widespread\nadoption of such models in both academia and industry. When CNNs are trained using image data as\ninput, data is most often provided as an array of red-green-blue (RGB) pixels. Convolutional layers\nproceed to compute features starting from pixels, with early layers often learning Gabor \ufb01lters and\nlater layers learning higher level, more abstract features [13, 27].\nIn this paper, we propose and explore a simple idea for accelerating neural network training and\ninference in the common scenario where networks are applied to images encoded in the JPEG format.\nIn such scenarios, images would typically be decoded from a compressed format to an array of RGB\npixels and then fed into a neural network. 
Here we propose and explore a more direct approach.\nFirst, we modify the libjpeg library to decode JPEG images only partially, resulting in an image\nrepresentation consisting of a triple of tensors containing discrete cosine transform (DCT) coefficients\nin the YCbCr color space. Due to how the JPEG codec works, these tensors are at different spatial\nresolutions. We then design and train a network to operate directly from this representation; as one\nmight suspect, this turns out to work reasonably well.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: (a) The three steps to encode JPEG images: first, the RGB image is converted to the YCbCr\ncolor space and the chroma channels are downsampled, then the channels are projected through the\nDCT and quantized, and finally the quantized coefficients are losslessly compressed. See Sec. 2 for\nfull details. (b) JPEG decoding follows the inverse process. In this paper, we run only the first step of\ndecoding and then feed the DCT coefficients directly into a neural network. This saves time in three\nways: the last steps of normal JPEG decoding are skipped, the data transferred from CPU to GPU\nis smaller by a factor of two, and the image is already in the frequency domain. To the extent early\nlayers in neural networks already learn a transform to the frequency domain, this allows the use of\nneural networks with fewer layers.\n\nRelated Work When training and/or inference speed is critical, much work has focused on\naccelerating network computation by reducing the number of parameters or by using operations more\ncomputationally efficient on a graphics processing unit (GPU) [12, 3, 9]. Several works have\nemployed spatial frequency decomposition and other compressed representations for image processing\nwithout using deep learning [22, 18, 8, 5, 7]. Other works have combined deep learning with\ncompressed representations other than JPEG to promising effect [24, 1]. The most similar works to ours\ncome from [6] and [25]. [6] train on DCT coefficients compressed not via the JPEG encoder but by a\nsimpler truncation approach. [25] train on a similar input representation but do not employ the full\nearly JPEG stack, in particular not including the Cb/Cr downsampling step. Thus our work stands on\nthe shoulders of many previous studies, extending them to the full early JPEG stack, to much deeper\nnetworks, and to training on a much larger dataset and more difficult task. We carefully time the\nrelevant operations and perform ablation studies necessary to understand from where performance\nimprovements arise.\nThe rest of the paper makes the following contributions. We review the JPEG codec in more detail,\ngiving intuition for steps in the process that have features appropriate for neural network training\n(Sec. 2). Because the Y and Cb/Cr DCT blocks have different resolutions, we consider different\narchitectures inspired by ResNet-50 [10] by which the information from these different channels may\nbe combined, each with different speed and performance considerations (Sec. 3 and Sec. 5). It turns\nout that some combinations produce much faster networks at the same performance as baseline RGB\nmodels or better performance at a more modest speed gain (Fig. 5). Having found faster and more\naccurate networks in DCT space, we ask whether one could simply find a nearby ResNet architecture\nthat operates in RGB space that exhibits the same boosts to performance or speed. We find that\nsimple mutations to ResNet-50 do not produce competitive networks (Sec. 4). 
Finally, given the\nsuperior performance of the DCT representation, we do an ablation study to examine whether this is\ndue to the different color space or specific first layer filters. We find that the exact DCT transform\nworks curiously well, even better than trying to learn a transform of the same dimension (Sec. 4.3,\nSec. 5.3)! So that others may reproduce experiments and benefit from speed increases found in this paper,\nwe release our code at https://github.com/uber-research/jpeg2dct.\n\n2 JPEG Compression\n2.1 The JPEG Encoder\nThe JPEG standard (ISO/IEC 10918) was created in 1992 as the result of an effort started as early as\n1986 [11]. Despite it being over 30 years old, the JPEG standard, which supports both 8-bit grayscale\nimages and 24-bit color images, remains the dominant image representation in consumer electronics\nand on the internet. In this paper, we consider only the 24-bit color version, which begins with RGB\npixels encoded with 8 bits per color channel.\n\nAs illustrated in Fig. 1a, JPEG encoding consists of the following three steps. The color space of an\nimage is converted from RGB to YCbCr, consisting of one luma component (Y), representing the\nbrightness, and two chroma components, Cb and Cr, representing the color. The spatial resolution\nof the chroma channels is reduced, usually by a factor of 2 or 3, while the resolution of Y is kept\nthe same. This basic compression takes advantage of the fact that the eye is less sensitive to fine\ncolor details than to fine brightness details. In this paper, we assume a reduction by a factor of 2.\nEach of the three Y, Cb, and Cr channels in the image is split into blocks of 8\u00d78 pixels, and each\nblock undergoes a DCT, which is similar to a Fourier transform in that it produces a spatial frequency\nspectrum. The amplitudes of the frequency components are then quantized. 
Since human vision is\nmuch more sensitive to small variations in color or brightness over large areas than to the strength of\nhigh-frequency brightness variations, the magnitudes of the high-frequency components are stored\nwith a lower accuracy than the low-frequency components. The quality setting of the encoder (for\nexample 50 or 95 on a scale of 0\u2013100 in the Independent JPEG Group\u2019s library) affects the extent to\nwhich the resolution of each frequency component is reduced. If a very low-quality setting is used,\nmany high-frequency components may be discarded as they end up quantized to zero. The size of the\nresulting data for all 8\u00d78 blocks is further reduced using a lossless compression algorithm, a variant\nof Huffman encoding. Decoding or decompression from JPEG entails the corresponding inverse\ntransforms in reverse order of the above steps; inverse transforms are lossless except for the inverse\nof the quantization step. Due to the loss of precision during the quantization of the DCT coefficients,\nthe original image is recovered up to some distortions.\nA standard implementation of the codec is libjpeg [15], first released on 7-Oct-1991.\nThe current version is the release 9b of 17-Jan-2016, and it provides a stable and solid foundation\nof JPEG support for many applications. An accelerated branch, libjpeg-turbo [16], has been\ndeveloped for exploiting Single Instruction Multiple Data (SIMD) parallelism. Other even faster\nversions have been developed that leverage the high parallelism of GPUs [23], where the Huffman\ncodec is run on the CPU, and the pixel transformations, such as the color space transform and DCT,\nare executed on the GPU. Fig. 1 shows the JPEG encoding process and a schematic view of the partial\ndecoding process we employ in this paper. We decode a compressed image up to its DCT coefficients,\nwhich are then fed directly into a CNN. 
Because CNNs often compute Gabor filters on the first\nlayer [13, 29, 28], and Gabor filters are similar to the conversion to frequency space realized by the\nDCT, it may be possible to prune the CNN of its first few layers without detriment; we experimentally\nverify this hunch in later sections. When using DCT coefficients, one has the option to either cast\nquantized values from int directly to float or to put them through the approximate inverse quantization\nprocess employed by the JPEG decoder. We chose to approximately invert quantization as it results\nin a network less sensitive to the quantization tables, which depend on the compression quality.\n\n2.2 Details of the DCT Transform\nBefore delving into network details, it is worth considering a few aspects of the DCT in more detail.\nIn JPEG compression, the DCT transform [17] is applied to non-overlapping blocks of size 8\u00d78. Each\nblock is projected onto a basis of 64 patterns representing various horizontal, vertical, and composite\nfrequencies. The basis is orthogonal, so any block can be fully recovered from the knowledge of its\ncoefficients. The DCT can be thought of as convolution with a specific filter size of 8\u00d78, stride of\n8\u00d78, one input channel, 64 output channels, and specific, non-learned orthonormal filters. The 64\nfilters are illustrated in Fig. 2a. Let us consider a few details. Because the DCT processes each of\nthe three input channels (one for luminance and two for chroma) separately, in terms of convolution\nit should be thought of as three separate applications of convolution to three single-channel input\nimages (equivalently: depthwise convolution), because information from separate input channels stays\nseparate. Because the filter size and stride are both 8, spatial information does not cross to adjacent\nblocks. 
Finally, note that while the standard convolutional layer may learn an orthonormal basis, in\ngeneral it will not. Instead, learned bases may be undercomplete, complete but not orthogonal, or\novercomplete, depending on the number of filters and spatial size.\n\nFigure 2 (panels: a. DCT, b. ResNet-50 RGB, c. DCT-Learn, d. DCT-Ortho): (a) The 64 orthonormal DCT basis vectors used for decomposing single-channel 8\u00d78\npixel blocks in the JPEG standard [26]. (b) The 64 first-layer convolution filters of size 7\u00d77 learned\nby a baseline ResNet-50 network operating on RGB pixels [10]. (c) The 64 convolution filters of size\n8\u00d78 learned starting from random weights by the DCT-Learn network described in Sec. 4.3. (d) The\n64 convolution filters from the DCT-Ortho network, similar to (c) but with an added orthonormal\nregularization.\n\n3 Designing CNN models for DCT input\nIn this section, we describe transforms that facilitate the adoption of DCT coefficients by a\nconventional CNN architecture such as ResNet-50 [10]. Some careful design is required, as DCT coefficients\nfrom the Y channel, DY , generally have a larger size than those from the chroma channels, DCb and\nDCr, as shown in Fig. 1a, where the actual shapes are calculated based on an image input size of\n224 \u00d7 224. It is necessary, then, to have special transforms that take care of the spatial dimension\nmatching, before the resulting activations can be concatenated and fed into a conventional CNN. We\nconsider two abstract transforms (T1, T2) that separately operate on different coefficient channels,\nwith the objective of resulting in matching spatial sizes among three activations aY , aCb and aCr,\nwhere aY = T1(DY ), aCb = T2(DCb), and aCr = T2(DCr). Fig. 
3 illustrates this process.\nIn addition to ensuring that convolutional feature map sizes align, it is important to consider the\nresulting receptive field size and stride (hereafter denoted with R and S) for each unit at the end\nof transforms and throughout the network. Whereas for typical networks taking RGB input, the\nreceptive field and stride of each unit will be the same in terms of each input channel (red, green,\nblue), here the receptive fields considered in the original pixel space may be different for information\nflowing through the Y channel vs the Cb and Cr channels, which is probably not desired. We examine\nthe representation size resulting from the DCT operation, and comparing it with the same set of\nparameters of a ResNet-50 at various blocks (bottom table), we find that the spatial dimensions of DY\nmatch the activation dimensions of Block 3, while the spatial dimensions of DCr and DCb match\nthose from Block 4. This inspired us to skip some of the ResNet blocks in the design of network\narchitecture, but skipping without further modification results in a much less powerful network (fewer\nlayers and fewer parameters), as well as final network layers with much smaller receptive fields.\nThe transforms (T1, T2) are generic and allow us to bring the DCT coefficients to a compatible size. In\ndetermining transforms we considered the following design concepts. The transforms can be (1)\nnon-parametric and/or manually designed, such as up- or down-sampling of the original DCT coefficients,\n(2) learned, and can be simply expressed as convolution layers, or (3) a combination of layers, such as\na ResNet block itself. We explored seven different methods of transforms (T1, T2), from the simplest\nupsampling to deconvolution, combined with different options of subsequent ResNet block stacks.\nWe describe each, with further details in Sec. S1 in the Supplementary Information:\n\n\u2022 UpSampling. 
Both chroma DCT coefficients DCb and DCr are upsampled by duplicating\npixels by a factor of two in height and width to the dimensions of DY . The three are then\nconcatenated channelwise, and go through a batch normalization layer before going into\nResNet ConvBlock 3 (CB3) but with reduced stride 1, then standard CB4 and CB5.\n\n\u2022 UpSampling-RFA. This setup is similar to UpSampling, but here we keep ResNet CB2\n(rather than removing it) and adjust CB2 and CB3 such that they mimic the increase in R and S\nobserved in the original ResNet-50 blocks; we denote this \u201cReceptive Field Aware\u201d or RFA.\nAs illustrated in Fig. 4, without this modification, the jump in R from input to the first block\nis large and the R later in the network is never as large (green line) as in the baseline ResNet.\nBy instead keeping CB2 but decreasing its stride, the transition to large R is more gradual\nand upon reaching CB3 R and S match the baseline ResNet through the rest of the layers.\nThe architecture is depicted in Fig. 3b and in Fig. S1.\n\u2022 Deconvolution-RFA. An alternative to upsampling is a learnable deconvolution layer. In\nthis design, we use two separate deconvolution layers on DCb and DCr to increase the\nspatial size. The rest of the design is the same as UpSampling-RFA.\n\nFigure 3 (panels: a. ResNet-50 RGB, b. UpSampling-RFA, c. Late-Concat): (a) The first layers of the original ResNet-50 architecture [10]. (b) The architecture of\nUpSampling-RFA is illustrated with coefficients DY , DCb and DCr of dimensions 28 \u00d7 28 \u00d7 64\nand 14 \u00d7 14 \u00d7 64, respectively. The short names NT and U stand for the operations No Transform\nand Upsampling, respectively. (c) The architecture Late-Concat is depicted where the luminance\ncoefficients DY go through the ResNet Block 3, while the chroma coefficients go through single\nconvolutions. 
This results in extra total computation along the luma path compared to the chroma\npath and tends to work well.\n\n\u2022 DownSampling. As opposed to upsampling spatially smaller coefficients, another approach\nis to downsample the large one, DY , with a convolution layer. The rest of the design is\nsimilar to UpSampling, but with a few changes made to handle smaller input spatial size.\nAs we will see in Sec. 5, this network operating on smaller total input results in much faster\nprocessing at the expense of higher error.\n\u2022 Late-Concat. In this design, we run DY on its own through two ConvBlocks (CBs) and\nthree IdentityBlocks (IBs) of ResNet-50. In parallel, DCb and DCr are passed through a CB\nbefore being concatenated with the DY path. The joined representation is then fed into the\nstandard ResNet stack just after CB4. The architecture is depicted in Fig. 3c and in Fig. S1.\nThe effect is extra total computation along the luma path compared to the chroma path, and\nthe result is a fast network with good performance.\n\u2022 Late-Concat-RFA. This receptive field aware version of Late-Concat passes DY through\nthree CBs with kernel size and strides tweaked such that the increase in R mimics the R in\nthe original ResNet-50. In parallel, DCb and DCr take the same path as in Late-Concat before\nbeing concatenated to the result of the DY path. The comparison of averaged receptive field\nis illustrated in Fig. 4, where one can see that Late-Concat-RFA has a smoother increase of\nreceptive fields in comparison to Late-Concat. As detailed in Fig. S1, because\nthe spatial size is smaller than in a standard ResNet, we use a larger number of channels in\nthe early blocks.\n\u2022 Late-Concat-RFA-Thinner. This architecture is identical to Late-Concat-RFA but with\nmodified numbers of channels. 
The number of channels is decreased in the first two CBs\nalong the DY path and increased in the third, changing channel counts from {1024, 512,\n512} to {384, 384, 768}. The DCb and DCr components are fed through a CB with 256\nchannels instead of 512. All other parts of the network are identical to Late-Concat-RFA.\nThese changes were made in an attempt to keep the performance of the Late-Concat-RFA\nmodel but obtain some of the speed benefits of the Late-Concat. As will be shown in Fig. 5,\nit strikes an attractive balance.\n\n4 RGB Network Controls\nAs we will observe in Sec. 5 and Fig. 5, many of the networks taking DCT as input perform with lower\nerror and/or higher speed than the baseline ResNet-50 RGB. In this section, we examine whether this\nis just due to making many architecture tweaks, some of which happen to work better than a baseline\nResNet. Here we start with a baseline ResNet and attempt to mutate the architecture slightly to get it\nto perform with lower error and/or higher speed. Inputs are RGB images of size 224 \u00d7 224 \u00d7 3.\n\nFigure 4: The average of receptive field sizes within each ResNet block vs. the corresponding block\nstride. Both axes are in log scale. The measurements are reported for some of the DCT based\narchitectures, and they are compared to the growth of the receptive field observed in ResNet-50. The\nplots underline how the receptive field aware (RFA) versions of basic DCT based architectures allow\na transition similar to the one observed in the baseline network.\n\n4.1 Reducing the Number of Layers\nTo start, we test the simple idea of removing convolution layers in ResNet-50. We remove the Identity\nblocks one at a time, from the bottom up, from Blocks 2 and 3, resulting in 6 experiments as 6 layers\nare removed. 
We never remove the convolution layer between Blocks 2 and 3 to keep the number of\nchannels in each block and representation size unchanged.\nIn this series of experiments, the first identity layer (ID) from Block 2 is removed first. Secondly,\nboth the first and second ID layers are removed. The experiment continues until all 3 ID layers of\nboth Block 2 and 3 are removed. In the last configuration, the network shares similarities with the\nUpSampling architecture, where the RGB signal is transformed with a small number of convolutions\nto a representation size of 28 \u00d7 28 \u00d7 512. The RGB input goes through the following series of layers:\nconvolution, max pooling, one last identity layer from Block 3. We can see the trade-off between the\ninference speed and accuracy in Fig. 5 under the legend \u201cBaseline, Remove ID Blocks\u201d (series of 6\ngray squares). As shown, networks become slightly faster but at a large reduction in accuracy.\n\n4.2 Reducing the Number of Channels\nBecause reducing the number of layers worked poorly, we also investigate thinning the network:\nreducing the number of channels in each layer to speed up inference. The last fully connected layer is\nmodified to adapt to the size of its input layer while maintaining the same number of outputs. We\npropose to reduce the number of channels by taking the original number of channels and dividing it\nby a fixed ratio. We conduct three experiments with ratios {1.1, \u221a2, 2}. The same trade-off between\nspeed or GFLOPS and accuracy is shown in Fig. 5 under the legend \u201cReduced # of Channels\u201d. As\nwith reducing the number of layers, networks become slightly faster but at a large reduction in\naccuracy. 
Perhaps both results might have been suspected, as the authors of ResNet-50 likely already\ntuned the network depth and width well; nevertheless, it is important to verify that the performance\nimprovements observed could not have been obtained through this much simpler approach.\n\n4.3 Learning the DCT Transform\nA final set of four experiments \u2014 shown in Fig. 5 as four \u201cYCbCr pixels, DCT layer\u201d diamonds \u2014\naddress whether we can obtain a similar benefit to the DCT architectures but starting from RGB pixels\nby using convolutional layers designed to replicate, exactly or approximately, the DCT transform.\nRGB images are first converted into YCbCr space, then each channel is fed independently through a\nconvolution layer. To mimic the DCT, the convolution filter size is set to 8\u00d78 with a stride of 8, and\n64 output channels (or in some cases: more) are used. The resulting activations are then concatenated\nbefore being fed into ResNet Block 2. In DCT-Learn, we randomly initialize filters and train them in\nthe standard way. In DCT-Ortho, we regularize the convolution weights toward orthonormality, as\ndescribed in [2], to encourage them not to discard information, inspired by the orthonormality of the\nDCT transform. In DCT-Frozen, we simply use the exact DCT coefficients without training, and in\nDCT-Frozenx2 we modify the stride to be 4 instead of 8 to increase representation size at that layer\nand allow filters to overlap. 
Surprisingly, this network tied the performance (6.98%) of the best other\napproach when averaged over three runs, though without the speedup of the Deconvolution-RFA\napproach. This is interesting because it departs from network design rules of thumb currently in\nvogue: first layer filters are large instead of small, hard-coded instead of learned, run on YCbCr space\ninstead of RGB, and process channels depthwise (separately) instead of together. Future work could\nevaluate to what extent we should adopt these atypical choices as standard practice.\n\nFigure 4 panels (x-axis: Stride; y-axis: Averaged Receptive Field): ResNet-50 vs. UpSampling-RFA vs. UpSampling; ResNet-50 vs. Late-Concat-RFA vs. Late-Concat.\n\nTable 1: The averaged top-1 and top-5 error rates are represented for the baseline ResNet-50\narchitecture and the proposed DCT based ones. Standard deviation is appended to the top error rates\nfor experiments repeated more than three times. The frame per second inference speed measured on\nan NVIDIA Pascal GPU is also reported given that data is packed in batches of size 1024.\n\nARCHITECTURE              TOP-1 ERR       TOP-5 ERR      TOP-1 DIFF  TOP-5 DIFF  FPS\nRESNET-50 RGB             24.22 \u00b1 0.08    7.35 \u00b1 0.004   -           -           208\nRESNET-50 YCBCR           24.36           7.36           +0.14       +0.01       207\nUPSAMPLING                25.07 \u00b1 0.07    7.81 \u00b1 0.12    +0.85       +0.45       396\nUPSAMPLING-RFA            24.06 \u00b1 0.09    7.14 \u00b1 0.07    -0.16       -0.21       266\nDECONVOLUTION-RFA         23.94 \u00b1 0.015   6.98 \u00b1 0.005   -0.27       -0.36       268\nDOWNSAMPLING              27.00           8.98           +2.78       +2.36       451\nLATE-CONCAT               24.93           7.62           +0.71       +0.27       401\nLATE-CONCAT-RFA           24.08           7.09           -0.14       -0.25       267\nLATE-CONCAT-RFA-THINNER   24.61           7.43           +0.39       +0.08       369\n\n5 Results and Discussions\nExperiments described in Section 3 and 4 are conducted with the Keras framework and TensorFlow\nbackend. Training is performed on the ImageNet dataset [4] with the standard ResNet-50 stepwise\ndecreasing learning rates described in [10]. The distributed training framework Horovod [21] is\nemployed to facilitate parallel training over 128 NVIDIA Pascal GPUs. To accommodate the\nparallel training, the learning rates are multiplied by the number of parallel running instances. Each\nexperiment trains for 90 epochs, which correspond to only 2-3 hours in this parallelization setup. A\ntotal of more than 50 experiments are run. 
All experiments are conducted with images which are first\nresized to 224\u00d7224 pixels with a random crop, and the JPEG quality used during encoding is 100, so\nthat as little information as possible is lost. A limitation of using a JPEG representation during training is\nthat to do data augmentation, e.g. via random crops, one must decompress the image, transform it,\nand then re-encode it before accessing the DCT coefficients. Of course, inference after the model\nis trained will not require this process. Inference time measurements are calculated by running the\ninference on 20 batches of size 1024 \u00d7 224 \u00d7 224 \u00d7 3 on the 128 GPUs where the overall time is\ncollected, and the effective number of images per second per GPU is then calculated. All timing is\ncomputed for inference, not training, and is computed as if data were already loaded; thus timing\nimprovements do not include possible additional savings due to reduced JPEG decompression time.\n\n5.1 Error Rate versus Inference Speed\nWe report full results in Table 1, for all seven proposed DCT architectures from Section 3, along with\ntwo baselines: ResNet-50 on RGB inputs, and ResNet-50 on YCbCr inputs. The full results include\nvalidation top-1 and top-5 error rates and inference frames per second (FPS). Both ResNet baselines\nachieve a top-5 error rate of 7.35% at an inference speed of 208 FPS on an NVIDIA Pascal GPU,\nwhile the best DCT network achieves 6.98% at 268 FPS. We analyze the 7 experiment results\nby dividing them into three categories. The first category contains those where DCT coefficients are\ndirectly connected with the ResNet-50 architecture; this includes UpSampling, DownSampling, and\nLate-Concat. Several of these architectures provide significant inference speed-up (three far-right\ndots in Fig. 5), almost 2\u00d7 in the best case.\nThe speedup is due to less computation as a consequence of reduced ResNet blocks. 
A sharp increase\nof error with DownSampling suggests that reducing the spatial resolution of the Y (luma) channel causes a\nloss of information, whereas maintaining its spatial resolution (as in UpSampling and Late-Concat)\nperforms closer to the baseline. In the second category, the two best architectures above are extended\nto increase their R slowly, so as to mimic the R growth of ResNet-50 (see Fig. 4). This category\ncontains UpSampling-RFA and Late-Concat-RFA, and they are shown to achieve better error rates\nthan their non-RFA counterparts while still providing an average speed-up of 1.3\u00d7. With the proper\nRFA adjustments in architecture, these two versions manage to beat the RGB baseline. A third\ncategory attempts to further improve the RFA architectures, by (1) learning the upsampling operation\nwith Deconvolution-RFA, and (2) reducing the number of channels with Late-Concat-RFA-Thinner.\nOn the one hand, Deconvolution-RFA reduces the top-5 error rate of UpSampling-RFA by 0.15%\nwhile maintaining an equivalent inference speed. On the other hand, Late-Concat-RFA-Thinner\nachieves error rates on par with the baseline while providing a speed-up ratio of 1.77\u00d7. A review\nof the GFLOPS for each architecture (cf. Fig. 5) shows that although some\narchitectures require more computation, all architectures achieve higher speeds thanks to halved data transfer between CPU and\nGPU. Speed tests performed for the Late-Concat-RFA architecture that ignore data transfer gains\nshow that about 25% of the measured gain is due to reduced data transfer.\n\n5.2 Ability to Trade-Off\nIn analyzing results from RGB network controls, we observe a continual increase in inference speed\nand GFLOPS coupled with an increase in error rates, as the network size is reduced. None of the\ncontrols can maintain one while improving the other. The curves (gray and light gray in Fig. 
5),\nhowever, show how the two opposing forces interact and help the user\ndetermine the trade-offs when choosing network size. We observe that decreasing the number\nof channels offers the worst trade-off curve, as the error rate increases drastically for only small\nspeed-up gains. Removing the identity blocks offers a better trade-off curve, but this approach still\nallows only limited speed-ups and reaches a cliff where speed-up is bounded.\nConsidering the trade-off curves from DCT architectures (blue and red curves in Fig. 5), however,\nwe notice a clear advantage, especially when the goal is to improve inference\nspeed. We notice a significant gain in speed-up while maintaining an error rate within a 1% range\nof the baseline. We therefore conclude that making use of DCT coefficients in CNNs constitutes\nan appealing strategy to balance error rate against computational speed. We also want to highlight two\nof the proposed DCT architectures that demonstrate compelling error/speed trade-offs. First, the\nDeconvolution-RFA architecture achieves the smallest top-5 error rate overall, while still improving\ninference speed by 30% over the baseline (black square in the figure). Second, the Late-Concat-RFA-Thinner\narchitecture provides an error rate closest to the baseline while allowing 77% faster\ninference. Moreover, the small slopes of the two curves indicate that, at a slight cost in\ncomputation, the RFA tweaks in the design improve accuracy by allowing a slow, smooth increase\nof receptive fields.\n\n5.3 Learning DCT Transform\nAnother interesting curve to examine comes from the experiments of Sec. 4.3, which attempt to learn\nconvolutions behaving like the DCT. It is the darker gray curve in Fig. 5, annotated with legends starting\nwith \u201cYCbCr pixels\u201d. 
The first two experiments, which try to learn the DCT weights from random\ninitialization, with and without orthonormal regularization, achieve slightly higher error rates than\nour RGB and DCT baselines. The third and fourth experiments, which rely on frozen weights initialized\nfrom the DCT filters themselves, achieve lower error rates, on par with the best DCT architecture.\nThese results show that learning the DCT filters with one convolution layer is hard and produces\nsub-performant networks. Moreover, directly leveraging the DCT weights allows better error rates,\nmaking the JPEG DCT coefficients an appealing representation for feeding CNNs.\n\n6 Conclusions\nIn this paper, we proposed and tested the simple idea of training CNNs directly on the blockwise\ndiscrete cosine transform (DCT) coefficients computed as part of the JPEG codec. Results are\ncompelling: at a similar performance to a ResNet-50 baseline, we observed speedups of 1.77x, and\nat performance significantly better than the baseline, we obtained speedups of 1.3x. This simple\nchange of input representation may be an effective method of accelerating processing in a wide range\nof speed-sensitive applications, from processing large data sets in data-centers to processing JPEG\nimages locally on mobile devices.\n\nFigure 5: (top) Inference speed vs top-5 error rates. (bottom) GigaFLOPS vs top-5 error rates.\nSix sets of experiments are grouped. The ResNet-50 baselines on RGB and YCbCr show nearly\nidentical performance, indicating that the YCbCr color space on its own is not sufficient for improved\nperformance. Two sets of controls on the RGB baseline (one with ID blocks removed and\none with a reduced number of channels) show that simply making ResNet-50 shorter or thinner cannot\nproduce speed gains at a level of performance competitive with the DCT networks. 
Finally, two sets of\nDCT experiments are shown: those that merge Y and Cb/Cr channels early in the network (within one\nlayer of each other) or late (after more than a layer of processing of the Y channel). Several of these\nnetworks are both faster and more accurate, and the Late-Concat-RFA-Thinner network is about the\nsame level of accuracy while being 1.77x faster than ResNet-50.\n\n[Figure 5 plots. Top panel: Top-5 error rate vs Image Per Second. Bottom panel: Top-5 error rate vs GFLOPS.\nLegend: Baseline, remove ID blocks; Baseline, reduced # of channels; YCbCr pixels, ResNet-50; RGB pixels, ResNet-50; DCT Early Merge; DCT Late Merge; YCbCr pixels, DCT layer.\nLabeled points: remove 2 to 6 ID Blocks; #channels / 1.1, / 1.4, / 2.0; DownSampling; UpSampling; UpSampling-RFA; Deconvolution-RFA; Late-Concat; Late-Concat-RFA; Late-Concat-RFA-Thinner; DCT-Learn; DCT-Ortho; DCT-Frozen; DCT-Frozenx2; ResNet-50 RGB.]\n\nReferences\n[1] Amir Adler, Michael Elad, and Michael Zibulevsky. Compressed learning: A deep neural\nnetwork approach. arXiv preprint arXiv:1610.09615, 2016.\n\n[2] Andrew Brock, Theodore Lim, JM Ritchie, and Nick Weston. Neural photo editing with\nintrospective adversarial networks. In 5th International Conference on Learning Representations,\nToulon, France, 2017.\n\n[3] Fran\u00e7ois Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR,\nabs/1610.02357, 2016.\n\n[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 
Imagenet: A large-scale\nhierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.\nIEEE Conference on, pages 248\u2013255. IEEE, 2009.\n\n[5] Guocan Feng and Jianmin Jiang. Jpeg compressed image retrieval via statistical features. Pattern\nRecognition, 36(4):977\u2013985, 2003.\n\n[6] Dan Fu and Gabriel Guimaraes. Using compression to speed up image classification in artificial\nneural networks. 2016.\n\n[7] Lionel Gueguen and Mihai Datcu. A similarity metric for retrieval of compressed objects:\nApplication for mining satellite image time series. IEEE Transactions on Knowledge and Data\nEngineering, 20(4):562\u2013575, 2008.\n\n[8] Ziad M Hafed and Martin D Levine. Face recognition using the discrete cosine transform.\nInternational journal of computer vision, 43(3):167\u2013188, 2001.\n\n[9] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural\nnetwork with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. CoRR, abs/1512.03385, 2015.\n\n[11] G. Hudson, A. L\u00e9ger, B. Niss, and I. Sebesty\u00e9n. Jpeg at 25: Still going strong. IEEE MultiMedia,\n24(2):96\u2013103, Apr 2017.\n\n[12] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and\nKurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model\nsize. CoRR, abs/1602.07360, 2016.\n\n[13] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional\nneural networks. In Advances in Neural Information Processing Systems 25, pages\n1106\u20131114, 2012.\n\n[14] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document\nrecognition. Proceedings of the IEEE, 86(11):2278\u20132324, Nov 1998.\n\n[15] libjpeg. Jpeg library. http://libjpeg.sourceforge.net/, 2018. 
[Online; accessed February 6, 2018].\n\n[16] libjpeg-turbo. Jpeg library turbo, 2018. [Online; accessed February 6, 2018].\n\n[17] Shizhong Liu and A. C. Bovik. Efficient dct-domain blind measurement and reduction of blocking\nartifacts. IEEE Transactions on Circuits and Systems for Video Technology, 12(12):1139\u20131149,\nDec 2002.\n\n[18] Mrinal K Mandal, F Idris, and Sethuraman Panchanathan. A critical evaluation of image\nand video indexing techniques in the compressed domain. Image and Vision Computing,\n17(7):513\u2013529, 1999.\n\n[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller.\nPlaying Atari with Deep Reinforcement Learning. ArXiv e-prints, December 2013.\n\n[20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time\nobject detection with region proposal networks. In Advances in neural information processing\nsystems, pages 91\u201399, 2015.\n\n[21] Alex Sergeev and Mike Del Balso. Meet Horovod: Uber\u2019s open source distributed deep learning\nframework for TensorFlow, 2017.\n\n[22] Bo Shen and Ishwar K Sethi. Direct feature extraction from compressed images. In Storage and\nRetrieval for Still Image and Video Databases IV, volume 2670, pages 404\u2013415. International\nSociety for Optics and Photonics, 1996.\n\n[23] Martin Srom. gpujpeg, 2018. [Online; accessed February 6, 2018].\n\n[24] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and\nLuc Van Gool. Towards image understanding from deep compression without decoding. arXiv\npreprint arXiv:1803.06131, 2018.\n\n[25] Matej Ulicny and Rozenn Dahyot. On using cnn with dct based image data. In Proceedings of\nthe 19th Irish Machine Vision and Image Processing conference IMVIP, 2017.\n\n[26] Wikipedia. Jpeg, 2018. [Online; accessed February 6, 2018].\n\n[27] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. 
How transferable are features in deep neural\nnetworks? In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger,\neditors, Advances in Neural Information Processing Systems 27, pages 3320\u20133328. Curran\nAssociates, Inc., December 2014.\n\n[28] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural\nnetworks through deep visualization. In Deep Learning Workshop, International Conference on\nMachine Learning (ICML), 2015.\n\n[29] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional neural networks.\narXiv preprint arXiv:1311.2901, 2013.\n\nSupplementary Information for:\n\nFaster Neural Networks Straight from JPEG\n\nS1 Details of model architectures\n\nFig. S1 shows the baseline ResNet-50 architecture as well as the seven architectures discussed in\nSec. 3 that take DCT input.\n\nFigure S1: The baseline ResNet-50 architecture and the seven related architectures discussed in\nSec. 3. Gray banded highlights are arbitrary and solely for visual clarity. The baseline ResNet-50\ncontains ConvBlocks CB1, CB2, CB3, CB4 with doubling number of channels at each stage increase.\nIn this figure we use ConvBlock subscripts to refer to a block with the same number of channels\nas in ResNet-50, not to indicate the order of the CB within our model. Thus, for example, in the\nDownSampling model, CB4 is followed by CB3, another CB4, and CB5. Because models taking DCT\ninput start with a representation with much lower spatial size but many more input channels, using\nConvBlocks with many channels early in the network is advantageous. 
Best viewed electronically\nwith zoom.\n\n[Figure S1 diagram; panel labels as extracted]\nBaseline: RGB pix (224, 224, 3); C(64, 7, 2); BN, R; M(3, 2); CB2(s=1); IB, IB; CB3; IB, IB, IB; CB4; IB, IB, IB, IB, IB; CB5; IB, IB; GAP; FC(1000); Softmax\nUpSampling (Reference: Baseline): Concat (28, 28, 192); CB3(s=1); Y (28, 28, 64); Cb,Cr (14, 14, 128); U (28, 28, 128); BN\nUpSampling-RFA (Reference: Upsampling): CB4(k=1, s=1); IB(k=2), IB\nDownSampling (Reference: Baseline): Concat (14, 14, 192); Y (28, 28, 64); Cb,Cr (14, 14, 128); C(256, 2, 2) (14, 14, 256); CB3(s=1); CB4(s=1)\nLate-Concat (Reference: Baseline): Concat; Y (28, 28, 64); Cb,Cr (14, 14, 128); BN; BN; CB4(k=1, s=1); CB4(s=1); IB, IB, IB; CB4\nLate-Concat-RFA (Reference: Baseline): Concat; Y (28, 28, 64); Cb,Cr (14, 14, 128); BN; CB3(s=1); IB, IB, IB; BN; CB4(k=1, s=1); CB4(k=1, s=1); IB(k=2), IB; CB4\nLate-Concat-RFA-Thinner: Same as Late-Concat-RFA but with different number of channels; see text.\nDeconvolution-RFA (Reference: Upsampling-RFA): Concat (28, 28, 192); Y (28, 28, 64); Cb,Cr (14, 14, 128); Deconv (28, 28, 128); CB4(k=1, s=1); IB(k=2), IB; BN\n\nLegend\nRGB pix: RGB pixel input\nY: Y-channel DCT input\nCb, Cr: Cb- and Cr-channel DCT input\nC: Convolution(channels, filter size, stride)\nDeconv: Deconvolution with 64 output channels, filter size 2, stride 2. Separate deconvolution layers are applied to Cb and to Cr, resulting in 128 total output channels.\nBN: BatchNormalization\nR: Relu\nM: MaxPooling(pool size, stride)\nU: Upsampling layer (2x)\nConcat: Channelwise concatenation\nCBn: ConvBlock stage n, with number of channels as in original ResNet-50 paper, kernel size = 3 and stride = 2 unless specified otherwise.\nIB: IdentityBlock, with number of channels matched to preceding CB layer (as in ResNet-50)\nGAP: Global average pooling layer\nFC: Fully connected layer (channels)\nSoftmax: Softmax nonlinearity\nAnnotations: Layers up to this point are the same as reference; Layers after this point are the same as reference; This layer or these blocks are same as reference; Shape of representation at layer shown like this: (height, width, channels), for example: (14, 14, 128)", "award": [], "sourceid": 1941, "authors": [{"given_name": "Lionel", "family_name": "Gueguen", "institution": "UBER"}, {"given_name": "Alex", "family_name": "Sergeev", "institution": "Uber Technologies Inc,"}, {"given_name": "Ben", "family_name": "Kadlec", "institution": "Uber"}, {"given_name": "Rosanne", "family_name": "Liu", "institution": "Uber AI Labs"}, {"given_name": "Jason", "family_name": "Yosinski", "institution": "Uber AI Labs; Recursion"}]}