{"title": "Learning to Inpaint for Image Compression", "book": "Advances in Neural Information Processing Systems", "page_first": 1246, "page_last": 1255, "abstract": "We study the design of deep architectures for lossy image compression. We present two architectural recipes in the context of multi-stage progressive encoders and empirically demonstrate their importance on compression performance. Specifically, we show that: 1) predicting the original image data from residuals in a multi-stage progressive architecture facilitates learning and leads to improved performance at approximating the original content and 2) learning to inpaint (from neighboring image pixels) before performing compression reduces the amount of information that must be stored to achieve a high-quality approximation. Incorporating these design choices in a baseline progressive encoder yields an average reduction of over 60% in file size with similar quality compared to the original residual encoder.", "full_text": "Learning to Inpaint for Image Compression\n\nMohammad Haris Baig\u2217\n\nDepartment of Computer Science\n\nDartmouth College\n\nHanover, NH\n\nVladlen Koltun\n\nIntel Labs\n\nSanta Clara, CA\n\nLorenzo Torresani\nDartmouth College\n\nHanover, NH\n\nAbstract\n\nWe study the design of deep architectures for lossy image compression. We present\ntwo architectural recipes in the context of multi-stage progressive encoders and em-\npirically demonstrate their importance on compression performance. Speci\ufb01cally,\nwe show that: (a) predicting the original image data from residuals in a multi-stage\nprogressive architecture facilitates learning and leads to improved performance at\napproximating the original content and (b) learning to inpaint (from neighboring\nimage pixels) before performing compression reduces the amount of information\nthat must be stored to achieve a high-quality approximation. 
Incorporating these design choices in a baseline progressive encoder yields an average reduction of over 60% in file size at similar quality compared to the original residual encoder.\n\n1 Introduction\n\nVisual data constitutes most of the total information created and shared on the Web every day, and it forms the bulk of the demand for storage and network bandwidth [13]. It is customary to compress image data as much as possible as long as there is no perceptible loss in content. In recent years, deep learning has made it possible to design deep models for learning compact representations for image data [2, 16, 18, 19, 20]. Deep learning based approaches, such as the work of Rippel and Bourdev [16], significantly outperform traditional methods of lossy image compression. In this paper, we show how to improve the performance of deep models trained for lossy image compression.\n\nWe focus on the design of models that produce progressive codes. Progressive codes are a sequence of representations that can be transmitted to improve the quality of an existing estimate (from a previously sent code) by adding missing detail. This is in contrast to non-progressive codes, with which the entire data for a certain quality approximation must be transmitted before the image can be viewed. Progressive codes improve the user\u2019s browsing experience by reducing the loading time of pages that are rich in images. Our main contributions in this paper are two-fold.\n\n1. While traditional progressive encoders are optimized to compress residual errors in each stage of their architecture (residual-in, residual-out), we propose a model that is trained to predict at each stage the original image data from the residual of the previous stage (residual-in, image-out). We demonstrate that this leads to an easier optimization resulting in better image compression. 
The resulting architecture reduces the amount of information that must be stored for reproducing images at similar quality by 18% compared to a traditional residual encoder.\n\n2. Existing deep architectures do not exploit the high degree of spatial coherence exhibited by neighboring patches. We show how to design and train a model that can exploit dependencies between adjacent regions by learning to inpaint from the available content. We introduce multi-scale convolutions that sample content at multiple scales to assist with inpainting. We jointly train our proposed inpainting and compression models and show that inpainting reduces the amount of information that must be stored by an additional 42%.\n\n\u2217http://www.cs.dartmouth.edu/~haris/compression\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Approach\n\nWe begin by reviewing the architecture and the learning objective of a progressive multi-stage encoder-decoder with S stages. We adopt the convolutional-deconvolutional residual encoder proposed by Toderici et al. [19] as our reference model. The model extracts a compact binary representation B from an image patch P. This binary representation, used to reconstruct an approximation of the original patch, consists of the sequence of representations extracted by the S stages of the model, B = [B1, B2, ..., BS].\n\nThe first stage of the model extracts a binary code B1 from the input patch P. Each of the subsequent stages learns to extract a representation Bs that models the compression residuals Rs\u22121 from the previous stage. The compression residuals Rs are defined as Rs = Rs\u22121 \u2212 Ms(Rs\u22121|\u0398s), where Ms(Rs\u22121|\u0398s) represents the reconstruction obtained by stage s when modelling the residuals Rs\u22121. The model at each stage is split into an encoder Bs = Es(Rs\u22121|\u0398s^E) and a decoder Ds(Bs|\u0398s^D), such that Ms(Rs\u22121|\u0398s) = Ds(Es(Rs\u22121|\u0398s^E)|\u0398s^D), where \u0398s = {\u0398s^E, \u0398s^D} denotes the parameters of the s-th stage. The residual encoder-decoder is trained on a dataset P, consisting of N image patches, according to the following objective:\n\n\hat{L}(\mathcal{P}; \Theta_{1:S}) = \sum_{s=1}^{S} \sum_{i=1}^{N} \| R^{(i)}_{s-1} - M_s(R^{(i)}_{s-1} | \Theta_s) \|_2^2.    (1)\n\nHere R(i)s denotes the compression residual for the i-th patch P(i) after stage s, with R(i)0 = P(i). Residual encoders are difficult to optimize, as gradients have to traverse long paths from later stages to affect change in the earlier stages. When moving along longer paths, gradients tend to decrease in magnitude by the time they reach the earlier stages. We address this shortcoming of residual encoders by studying a class of architectures we refer to as \u201cResidual-to-Image\u201d (R2I).\n\n2.1 Residual-to-Image (R2I)\n\nTo address the issue of vanishing gradients, we add connections between subsequent stages and restate the loss so as to predict the original data at the end of each stage, thus performing residual-to-image prediction. This leads to the new objective shown below:\n\nL(\mathcal{P}; \Theta_{1:S}) = \sum_{s=1}^{S} \sum_{i=1}^{N} \| P^{(i)} - M_s(R^{(i)}_{s-1} | \Theta_s) \|_2^2.    (2)\n\nStage s of this model takes as input the compression residuals Rs\u22121 computed with respect to the original data, Rs\u22121 = P \u2212 Ms\u22121(Rs\u22122|\u0398s\u22121), where Ms\u22121(Rs\u22122|\u0398s\u22121) now approximates the reconstruction of the original data P at stage s \u2212 1. To allow complete image reconstructions to be produced at each stage while only feeding in residuals, we introduce connections between the layers of adjacent stages. 
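To make the contrast between the two objectives concrete, the following is a minimal illustrative sketch (not the authors' implementation): `stage` is a hypothetical stand-in for one learned encoder-decoder stage, and the cross-stage connections of the actual R2I architecture are omitted.

```python
import numpy as np

def residual_objective(patch, stages):
    """Residual-in, residual-out (Eq. 1): each stage models the
    compression residual left over by the previous stage."""
    loss, residual = 0.0, patch
    for stage in stages:
        recon = stage(residual)              # M_s(R_{s-1})
        loss += np.sum((residual - recon) ** 2)
        residual = residual - recon          # R_s = R_{s-1} - M_s(R_{s-1})
    return loss

def r2i_objective(patch, stages):
    """Residual-in, image-out (Eq. 2): each stage consumes a residual
    with respect to the original data but is penalized for its error in
    reconstructing the original patch."""
    loss, residual = 0.0, patch
    for stage in stages:
        recon = stage(residual)              # approximates P, not R_{s-1}
        loss += np.sum((patch - recon) ** 2)
        residual = patch - recon             # residual w.r.t. original data
    return loss

# Toy stages that each recover half of their input.
stages = [lambda x: 0.5 * x] * 3
patch = np.ones((8, 8))
```

Note that only the supervision target and the residual definition differ between the two functions; the per-stage computation itself is unchanged.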
These connections allow later stages to incorporate information that has been recovered by earlier stages into their estimate of the original image data. Consequently, these connections (between subsequent stages) allow for better optimization of the model.\n\nFigure 1: Multiple approaches for introducing connections between successive stages. These designs for progressive architectures allow for varying degrees of information to be shared. Architectures (b)\u2013(d) do not reconstruct residuals, but the original data at every stage. We call these architectures \u201cresidual-to-image\u201d (R2I).\n\nIn addition to assisting with modeling the original image, these connections play two key roles. Firstly, they create residual blocks [10], which encourage explicit learning of how to reproduce information that could not be generated by the previous stage. Secondly, they reduce the length of the path along which information has to travel from later stages to impact the earlier stages, leading to a better joint optimization.\n\nThis leads us to the question of where such connections should be introduced and how information should be propagated. We consider two types of connections to propagate information between successive stages. 1) Prediction connections are analogous to the identity shortcuts introduced by He et al. [10] for residual learning. They act as parameter-free additive connections: the output of each stage is produced by simply adding together the residual predictions of the current stage and all preceding stages (see Figure 1(b)) before applying a final non-linearity. 2) Parametric connections are referred to as projection shortcuts by He et al. [10]. Here we use them to connect corresponding layers in two consecutive stages of the compression model. 
The features of each layer from the previous stage are convolved with learned filters before being added to the features of the same layer in the current stage. A non-linearity is then applied on top. The prediction connections only yield the benefit of creating residual blocks, albeit very large ones that are thus difficult to optimize. In contrast, parametric connections allow the intermediate representations from previous stages to be passed to subsequent stages. They also create a denser connectivity pattern, with gradients now moving along corresponding layers in adjacent stages. We consider two variants of parametric connections: \u201cfull\u201d connections, which link all the layers in two successive stages (see Figure 1(c)), and \u201cdecoding\u201d connections, which link only corresponding decoding layers (i.e., there are no connections between encoding layers of adjacent stages). We note that the LSTM-based model of Toderici et al. [20] represents a particular instance of an R2I network with full connections. In Section 3 we demonstrate that R2I models with decoding connections outperform those with full connections, and we provide an intuitive explanation for this result.\n\n2.2 Inpainting Network\n\nImage compression architectures learn to encode and decode an image patch-by-patch. Encoding all patches independently assumes that the regions contain truly independent content. This assumption generally does not hold when the patches being encoded are contiguous. We observe that the content of adjacent image patches is not independent. We propose a new module for the compression model designed to exploit the spatial coherence between neighboring patches. We achieve this goal by training a model with the objective of predicting the content of each patch from information available in the neighboring regions.\n\nDeep models for inpainting, such as the one proposed by Pathak et al. 
[14], are trained to predict the values of pixels in the region \u02c6W from a context region \u02c6C (as shown in Figure 2). As there is data present all around the region to be inpainted, this imposes strong constraints on what the inpainted region should look like. We consider the scenario where images are encoded and decoded block-by-block, moving from left to right and from top to bottom (similar to how traditional codecs process images [1, 21]). Now, at decoding time only content above and to the left of each patch will have been reconstructed (see Figure 2(a)). This gives rise to the problem of \u201cpartial-context inpainting\u201d. We propose a model that, given an input region C, attempts to predict the content of the current patch P. We denote by \u02c6P the dataset which contains all the patches from the dataset P and the respective context regions C for each patch. The loss function used to train our inpainting network is:\n\nL_{inp}(\hat{\mathcal{P}}; \Theta_I) = \sum_{i=1}^{N} \| P^{(i)} - M_I(C^{(i)} | \Theta_I) \|_2^2.    (3)\n\nThe output of the inpainting network is denoted by MI(C(i)|\u0398I), where \u0398I refers to the parameters of the inpainting network.\n\nFigure 2: (a) The two kinds of inpainting problems. (b) A multi-scale convolutional layer with 3 dilation factors. The colored boxes represent pixels from which the content is sampled.\n\n2.2.1 Architecture of the Partial-Context Inpainting Network\n\nOur inpainting network has a feed-forward architecture which propagates information from the context region C to the region being inpainted, P. 
To improve our model\u2019s ability to predict content, we use a multi-scale convolutional layer as the basic building block of our inpainting network. We make use of the dilated convolutions described by Yu and Koltun [23] to allow for sampling at various scales. Each multi-scale convolutional layer is composed of k filters for each dilation factor being considered. Varying the dilation factor of the filters gives us the ability to analyze content at various scales. This structure of filters provides two benefits: first, it allows for a substantially denser and more diverse sampling of data from the context, and second, it allows for better propagation of content at different spatial scales. A similarly designed layer was also used by Chen et al. [5] for sampling content at multiple scales for semantic segmentation. Figure 2(b) shows the structure of a multi-scale convolutional layer.\n\nThe multi-scale convolutional layer also gives us the freedom to propagate content at full resolution (no striding or pooling), as only a few multi-scale layers suffice to cover the entire region. This allows us to train a relatively shallow yet highly expressive architecture which can propagate fine-grained information that might otherwise be lost due to sub-sampling. This light-weight and efficient design is needed to allow for joint training with a multi-stage compression model.\n\n2.2.2 Connecting the Inpainting Network with the R2I Compression Model\n\nNext, we describe how to use the prediction of the inpainting network to assist with compression. Whereas the inpainting network learns to predict the data as accurately as possible, we note that this is not sufficient to achieve good performance on compression, where it is also necessary that the \u201cinpainting residuals\u201d be easy to compress. We define the inpainting residuals as R0 = P \u2212 MI(C|\u0398I), where MI(C|\u0398I) denotes the inpainting estimate. 
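As an illustration of the multi-scale convolutional layer described above, here is a minimal single-channel numpy sketch (not the authors' code; it assumes 3\u00d73 kernels with \u201csame\u201d zero padding and stacks the per-dilation feature maps channel-wise):

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """'Same'-padded 2-D correlation with a dilated 3x3 kernel.
    x: (H, W) single-channel input; kernel: (3, 3)."""
    H, W = x.shape
    pad = dilation  # (k - 1) / 2 * dilation for k = 3
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * xp[i * dilation : i * dilation + H,
                                     j * dilation : j * dilation + W]
    return out

def multi_scale_layer(x, kernels, dilations=(1, 2, 4, 8)):
    """One multi-scale convolutional layer: the same-resolution feature
    maps produced at every dilation factor are stacked channel-wise."""
    maps = [dilated_conv2d(x, k, d) for k, d in zip(kernels, dilations)]
    return np.stack(maps)  # (num_dilations, H, W), no striding or pooling

x = np.ones((16, 16))
kernels = [np.ones((3, 3)) / 9.0 for _ in range(4)]
out = multi_scale_layer(x, kernels)  # shape (4, 16, 16), full resolution
```

Because the dilation factors 1, 2, 4, 8 widen the receptive field without striding or pooling, a few such layers suffice to cover the whole context region at full resolution, which is the property the text attributes to this design.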
Since we want to train our model to always predict the data, we add the inpainting estimate to the final prediction of each stage of our compression model. This allows us (a) to produce the original content at each stage and (b) to discover an inpainting that is beneficial for all stages of the model because of joint training. We now train our complete model as\n\nL_C(\hat{\mathcal{P}}; \Theta_I, \Theta_{1:S}) = L_{inp}(\hat{\mathcal{P}}; \Theta_I) + \sum_{i=1}^{N} \sum_{s=1}^{S} \| P^{(i)} - [ M_s(R^{(i)}_{s-1} | \Theta_s) + M_I(C^{(i)} | \Theta_I) ] \|_2^2.    (4)\n\nIn this new objective LC, the first term Linp corresponds to the original inpainting loss, and R(i)0 now denotes the inpainting residual for example i. We note that each stage of this inpainting-based progressive coder directly affects what is learned by the inpainting network. We refer to the model trained with this joint objective as \u201cInpainting for Residual-to-Image Compression\u201d (IR2I).\n\nWhereas we train our model to perform inpainting from the original image content, we use a lossy approximation of the context region C when encoding images with IR2I. This is done because at decoding time our model does not have access to the original image data. We use the approximation from stage 2 of our model for performing inpainting at encoding and decoding time, and transmit the binary codes for the first two stages as a larger first code. This strategy allows us to leverage inpainting while performing progressive image compression.\n\n2.3 Implementation Details\n\nOur models were trained on 6,507 images from the ImageNet dataset [7], as proposed by Ball\u00e9 et al. [2] to train their single-stage encoder-decoder architectures. 
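The mechanics of the joint objective above can be sketched as follows (an illustrative numpy sketch; `inpaint` and `stages` are hypothetical stand-ins for the trained inpainting network and the compression stages, and cross-stage connections are again omitted):

```python
import numpy as np

def joint_loss(patch, context, inpaint, stages):
    """Sketch of Eq. 4: the inpainting estimate M_I(C) is added to every
    stage's prediction, and each stage consumes residuals computed with
    respect to the original patch."""
    estimate = inpaint(context)              # M_I(C)
    loss = np.sum((patch - estimate) ** 2)   # L_inp term
    residual = patch - estimate              # R_0 = P - M_I(C)
    for stage in stages:
        recon = stage(residual) + estimate   # M_s(R_{s-1}) + M_I(C)
        loss += np.sum((patch - recon) ** 2)
        residual = patch - recon
    return loss

# Toy example: a constant inpainting estimate and two halving stages.
patch = np.full((4, 4), 2.0)
inpaint = lambda c: np.ones((4, 4))
stages = [lambda r: 0.5 * r] * 2
total = joint_loss(patch, None, inpaint, stages)
```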
A full description of the R2I models and the inpainting network is provided in the supplementary material. We use the Caffe library [11] to train our models. The residual encoder and R2I models were trained for 60,000 iterations, whereas the joint inpainting network was trained for 110,000 iterations. We used the Adam optimizer [12] for training our models and the MSRA initialization [9] for initializing all stages. We used an initial learning rate of 0.001; the learning rate was dropped after 30K and 45K iterations for the R2I models, and for the IR2I model it was dropped after 30K, 65K, and 90K iterations, by a factor of 10 each time. All of our models were trained to reproduce the content of 32 \u00d7 32 image patches. Each of our models has 8 stages, with each stage contributing 0.125 bits-per-pixel (bpp) to the total representation of a patch. Our models handle binary optimization by employing the biased estimators approach proposed by Raiko et al. [15], as was done by Toderici et al. [19, 20].\n\nOur inpainting network has 8 multi-scale convolutional layers for content propagation and one standard convolutional layer for performing the final prediction. Each multi-scale convolutional layer consists of 24 filters for each of the dilation factors 1, 2, 4, 8. Our inpainting network takes as input a context region C of size 64 \u00d7 64, where the bottom-right 32 \u00d7 32 region is zeroed out and represents the region to be inpainted.\n\n3 Results\n\nWe investigate the improvement brought about by the presented techniques. We are interested in studying the reduction in bit-rate, for varying quality of reconstruction, achieved relative to the residual encoder proposed by Toderici et al. [19]. To evaluate performance, we perform compression with our models on images from the Kodak dataset [8]. The dataset consists of 24 uncompressed color images of size 512 \u00d7 768. 
The quality is measured according to the MS-SSIM [22] metric (higher values indicate better quality). We use the Bjontegaard-Delta metric [4] to compute the average reduction in bit-rate across all quality settings.\n\n3.1 R2I - Design and Performance\n\nThe table in Figure 3(a) shows the percentage reduction in bit-rate achieved by the three variations of the Residual-to-Image models. As can be seen, adding side-connections and training for the more desirable objective (i.e., approximating the original data) at each stage helps each of our models. That said, having connections in the decoder only helps more than using a \u201cfull\u201d connection approach or only sharing the final prediction.\n\nApproach         Rate Savings, SSIM (%)   Rate Savings, MS-SSIM (%)\nR2I Prediction   4.483                    5.177\nR2I Full         10.015                   7.652\nR2I Decoding     20.002                   17.951\n\nFigure 3: (a) Average rate savings for each of the three R2I variants compared to the residual encoder proposed by Toderici et al. [19]. (b) The quality of images produced by each of the three R2I variants across a range of bit-rates.\n\nFigure 4: The R2I training loss from 3 different stages (start, middle, end) viewed as a function of iterations for the \u201cfull\u201d and the \u201cdecoding\u201d connections models. We note that the decoding connections model converges faster, to a lower value, and shows less variance.\n\nThe model which shares only the prediction between stages performs poorly in comparison to the other two designs, as it does not allow features from earlier stages to be altered as efficiently as the full or decoding connections architectures do. The model with decoding connections does better than the architecture with full connections because, with connections at decoding only, the binarization layer in each stage extracts a representation from the relevant information only (the residuals with respect to the data). 
In contrast, when connections are established in both the encoder and the decoder, the binary representation may include information that has already been captured by a previous stage, thereby burdening each stage with identifying the information pertinent to improving the reconstruction and leading to a tougher optimization. Figure 4 shows that the model with full connections struggles to minimize the training error compared to the model with decoding connections. This difference in training error points to the fact that connections in the encoder make it harder for the model to do well at training time. This difficulty of optimization amplifies as the number of stages increases, as can be seen from the gap between the full and decoding architectures (shown in Figure 3(b)), because the residuals become harder to compress.\n\nWe note that R2I models significantly improve the quality of reconstruction at higher bit-rates but do not improve the estimates at lower bit-rates as much (see Figure 3(b)). This tells us that the overall performance can be improved by focusing on approaches that yield a significant improvement at lower bit-rates, such as inpainting, which is analyzed next.\n\nApproach                 Rate Savings, SSIM (%)   Rate Savings, MS-SSIM (%)\nR2I Decoding             20.002                   17.951\nR2I Decoding Sep-Inp     27.379                   27.794\nR2I Decoding Joint-Inp   63.353                   60.446\n\n(a) Impact of inpainting on compression performance. 
All bit-rate savings are reported with respect to the residual encoder by Toderici et al. [19].\n\nFigure 5: (a) Average rate savings with varying forms of inpainting. (b) The quality of images with each of our proposed approaches at varying bit-rates.\n\n3.2 Impact of Inpainting\n\nWe begin by analyzing the performance of the inpainting network and other approaches on partial-context inpainting. We compare the performance of the inpainting network with both traditional approaches and a learning-based baseline. Table 1 shows the average SSIM achieved by each approach for inpainting all non-overlapping patches in the Kodak dataset.\n\nApproach                      SSIM\nPDE-based [3]                 0.4574\nExemplar-based [6]            0.4611\nVanilla network (learned)     0.4545\nInpainting network (learned)  0.5165\n\nTable 1: Average SSIM for partial-context inpainting on the Kodak dataset [8]. The vanilla model is a feed-forward CNN with no multi-scale convolutions.\n\nThe vanilla network corresponds to a 32-layer model (4 times as deep as the inpainting network) that does not use multi-scale convolutions (all filters have a dilation factor of 1), has the same number of parameters, and also operates at full resolution (as our inpainting network does). This points to the fact that the improvement of the inpainting network over the vanilla model is a consequence of using multi-scale convolutions. The inpainting network improves over traditional approaches because our model learns the best strategy for propagating content, as opposed to using hand-engineered principles of content propagation. The low performance of the vanilla network shows that learning by itself is not superior to traditional approaches; multi-scale convolutions play a key role in achieving better performance.\n\nWhile inpainting provides an initial estimate of the content within the region, it by no means generates a perfect reconstruction. 
This leads us to the question of whether this initial estimate is better than not having an estimate at all. The table in Figure 5(a) shows the performance on the compression task with and without inpainting. These results show that the greatest reduction in file size is achieved when the inpainting network is jointly trained with the R2I model. We note (from Figure 5(b)) that inpainting greatly improves the quality of results obtained at both lower and higher bit-rates.\n\nThe baseline where the inpainting network is trained separately from the compression network is presented here to emphasize the role of joint training. Traditional codecs [1] use simple non-learning-based inpainting approaches, and their predefined methods of representing data are unable to compactly encode the inpainting residuals. Learning to inpaint separately improves the performance, as the inpainted estimate is better than not having any estimate. But given that the compression model has not been trained to optimize the compression residuals, the reduction in bit-rate for achieving high quality levels is low. We show that with joint training, we can not only train a model that does better inpainting but also ensure that the inpainting residuals can be represented compactly.\n\n3.3 Comparison with Existing Approaches\n\nTable 2 shows a comparison of the performance of various approaches against JPEG [21] in the 0.125 to 1 bits-per-pixel (bpp) range. We select this range because images from our models towards the end of this range show no perceptible compression artifacts.\n\nThe first part of the table evaluates the performance of learning-based progressive approaches. We note that our proposed model outperforms the multi-stage residual encoder proposed by Toderici et al. 
[19] (trained on the same 6.5K dataset) by 17.9%, and IR2I outperforms the residual encoder by reducing file sizes by 60.4%. The Residual-GRU, while similar in architecture to our \u201cfull\u201d connections model, does not do better even when trained on a dataset that is 1000 times bigger and for 10 times more training time. The results shown here do not make use of entropy coding, as the goal of this work is to study how to improve the performance of deep networks for progressive image compression, and entropy coding makes it harder to understand where the performance improvements are coming from. As various approaches use different entropy coding methods, this further obfuscates the source of the improvements.\n\nThe second part of the table shows the performance of existing codecs. Existing codecs use entropy coding and rate-distortion optimization. We note that even without using either of these powerful post-processing techniques, our final \u201cIR2I\u201d model is competitive with traditional methods for compression, which use both of these techniques. A comparison with recent non-progressive approaches [2, 18], which also use these post-processing techniques for image compression, is provided in the supplementary material.\n\nApproach                     Number of Training Images   Progressive   Rate Savings (%)\nResidual Encoder [19]        6.5K                        Yes           2.56\nResidual-GRU [20]            6M                          Yes           33.26\nR2I (Decoding connections)   6.5K                        Yes           18.53\nIR2I                         6.5K                        Yes           51.25\nJPEG-2000 [17]               N/A                         No            63.01\nWebP [1]                     N/A                         No            64.98\n\nTable 2: Average rate savings compared to JPEG [21]. The savings are computed on the Kodak [8] dataset with rate-distortion profiles measuring MS-SSIM in the 0\u20131 bpp range.\n\nWe observe that a naive implementation of IR2I creates a linear dependence in content (as all regions used as context have to be decoded before being used for inpainting) and thus may be substantially slower. 
In practice, this slowdown would be negligible, as one can use a diagonal scan pattern (similar to traditional codecs) to ensure high parallelism, thereby reducing run times. Furthermore, we perform inpainting using predictions from the first step only. Therefore, the dependence only exists when generating the first progressive code. For all subsequent stages, there is no dependence in content, and our approach is comparable in run time to similar approaches.\n\n4 Conclusion and Future Work\n\nWe study a class of \u201cResidual-to-Image\u201d models and show that within this class, architectures which have decoding connections perform better at approximating image data than designs with other forms of connectivity. We observe that our R2I decoding connections model struggles at low bit-rates, and we show how to exploit the spatial coherence between the content of adjacent patches via inpainting to improve performance at approximating image content at low bit-rates. We design a new model for partial-context inpainting using multi-scale convolutions and show that the best way to leverage inpainting is by jointly training the inpainting network with our R2I Decoding model.\n\nOne interesting extension of this work would be to incorporate entropy coding within our progressive compression framework to train models that produce binary codes which have low entropy and can be represented even more compactly. Another possible direction would be to extend our proposed framework to video data, where the gains from our recipes for improving compression may be even greater.\n\n5 Acknowledgements\n\nThis work was funded in part by Intel Labs and NSF award CNS-120552. We gratefully acknowledge NVIDIA and Facebook for the donation of GPUs used for portions of this work. We would like to thank George Toderici, Nick Johnston, and Johannes Ball\u00e9 for providing us with information needed for accurate assessment. 
We are grateful to members of the Visual Computing Lab at Intel Labs, and\nmembers of the Visual Learning Group at Dartmouth College for their feedback.\n\nReferences\n[1] WebP a new image format for the web. https://developers.google.com/speed/webp/.\n\nAccessed: 2017-04-29.\n\n[2] Johannes Ball\u00e9, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compres-\n\nsion. In ICLR, 2017.\n\n[3] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting.\nIn Proceedings of the 27th annual conference on Computer graphics and interactive techniques,\npages 417\u2013424. ACM Press/Addison-Wesley Publishing Co., 2000.\n\n[4] Gisle Bjontegaard. Improvements of the bd-psnr model. In ITU-T SC16/Q6, 35th VCEG\n\nMeeting, Berlin, Germany, July 2008, 2008.\n\n[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille.\nDeeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and\nfully connected crfs. arXiv preprint arXiv:1606.00915, 2016.\n\n[6] Antonio Criminisi, Patrick P\u00e9rez, and Kentaro Toyama. Region \ufb01lling and object removal by\nexemplar-based image inpainting. IEEE Transactions on image processing, 13(9):1200\u20131212,\n2004.\n\n[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale\nhierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.\nIEEE Conference on, pages 248\u2013255. IEEE, 2009.\n\n[8] Eastman Kodak Company. Kodak lossless true color image suite, 1999. http://r0k.us/\n\ngraphics/kodak/.\n\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into recti\ufb01ers:\nSurpassing human-level performance on imagenet classi\ufb01cation. In Proceedings of the IEEE\ninternational conference on computer vision, pages 1026\u20131034, 2015.\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im-\nage recognition. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 770\u2013778, 2016.\n\n[11] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick,\nSergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature\nembedding. arXiv preprint arXiv:1408.5093, 2014.\n\n[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR,\n\n2014.\n\n9\n\n\f[13] Mary Meeker. Internet Trends Report 2016. KPCB, 2016.\n\n[14] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context\nencoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, pages 2536\u20132544, 2016.\n\n[15] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning\n\nbinary stochastic feedforward neural networks. In ICLR, 2015.\n\n[16] Oren Rippel and Lubomir Bourdev. Real-time adaptive image compression. In International\n\nConference on Machine Learning, 2017.\n\n[17] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The jpeg 2000 still image\n\ncompression standard. IEEE Signal processing magazine, 18(5):36\u201358, 2001.\n\n[18] L. Theis, W. Shi, A. Cunningham, and F. Husz\u00e1r. Lossy image compression with compressive\n\nautoencoders. In ICLR, 2017.\n\n[19] George Toderici, Sean M. O\u2019Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet\nBaluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent\nneural networks. In ICLR, 2016.\n\n[20] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor,\nand Michele Covell. Full resolution image compression with recurrent neural networks. arXiv\npreprint arXiv:1608.05148, 2016.\n\n[21] Gregory K. Wallace. The JPEG still picture compression standard. 
Communications of the ACM, 34(4), 1991.\n\n[22] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pages 1398\u20131402. IEEE, 2003.\n\n[23] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.", "award": [], "sourceid": 833, "authors": [{"given_name": "Mohammad Haris", "family_name": "Baig", "institution": "Dartmouth College"}, {"given_name": "Vladlen", "family_name": "Koltun", "institution": "Intel Labs"}, {"given_name": "Lorenzo", "family_name": "Torresani", "institution": "Dartmouth/Facebook"}]}