{"title": "Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 1141, "page_last": 1151, "abstract": "We present a new approach to learn compressible representations in deep architectures with an end-to-end training strategy. Our method is based on a soft (continuous) relaxation of quantization and entropy, which we anneal to their discrete counterparts throughout training. We showcase this method for two challenging applications: Image compression and neural network compression. While these tasks have typically been approached with different methods, our soft-to-hard quantization approach gives results competitive with the state-of-the-art for both.", "full_text": "Soft-to-Hard Vector Quantization for End-to-End\n\nLearning Compressible Representations\n\nEirikur Agustsson\n\nETH Zurich\n\nFabian Mentzer\n\nETH Zurich\n\nMichael Tschannen\n\nETH Zurich\n\naeirikur@vision.ee.ethz.ch\n\nmentzerf@vision.ee.ethz.ch\n\nmichaelt@nari.ee.ethz.ch\n\nLukas Cavigelli\n\nETH Zurich\n\ncavigelli@iis.ee.ethz.ch\n\nRadu Timofte\n\nETH Zurich & Merantix\n\ntimofter@vision.ee.ethz.ch\n\nLuca Benini\nETH Zurich\n\nbenini@iis.ee.ethz.ch\n\nLuc Van Gool\n\nKU Leuven & ETH Zurich\nvangool@vision.ee.ethz.ch\n\nAbstract\n\nWe present a new approach to learn compressible representations in deep archi-\ntectures with an end-to-end training strategy. Our method is based on a soft\n(continuous) relaxation of quantization and entropy, which we anneal to their\ndiscrete counterparts throughout training. We showcase this method for two chal-\nlenging applications: Image compression and neural network compression. 
While these tasks have typically been approached with different methods, our soft-to-hard quantization approach gives results competitive with the state-of-the-art for both.

1 Introduction
In recent years, deep neural networks (DNNs) have led to many breakthrough results in machine learning and computer vision [20, 28, 10], and are now widely deployed in industry. Modern DNN models often have millions or tens of millions of parameters, leading to highly redundant structures, both in the intermediate feature representations they generate and in the model itself. Although overparametrization of DNN models can have a favorable effect on training, in practice it is often desirable to compress DNN models for inference, e.g., when deploying them on mobile or embedded devices with limited memory. The ability to learn compressible feature representations, on the other hand, has a large potential for the development of (data-adaptive) compression algorithms for various data types such as images, audio, video, and text, for all of which various DNN architectures are now available.
DNN model compression and lossy image compression using DNNs have both independently attracted a lot of attention lately. In order to compress a set of continuous model parameters or features, we need to approximate each parameter or feature by one representative from a set of quantization levels (or vectors, in the multi-dimensional case), each associated with a symbol, and then store the assignments (symbols) of the parameters or features, as well as the quantization levels.
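The quantize-and-store idea above can be made concrete with a minimal sketch (the function name and toy codebook below are ours, not from the paper): each value is assigned the symbol of its nearest quantization level, and only the symbol stream plus the levels themselves need to be stored.

```python
import numpy as np

def quantize(values, levels):
    """Assign each value to its nearest quantization level.

    Returns the symbol indices (what gets entropy-coded and stored)
    and the dequantized reconstruction.
    """
    values = np.asarray(values, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # Distance of every value to every level: shape (n, L).
    dist = np.abs(values[:, None] - levels[None, :])
    symbols = dist.argmin(axis=1)      # assignments (indices into levels)
    return symbols, levels[symbols]    # symbols + reconstruction

levels = np.array([-1.0, 0.0, 1.0])
symbols, recon = quantize([0.9, -1.2, 0.1], levels)
```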
Representing\neach parameter of a DNN model or each feature in a feature representation by the corresponding\nquantization level will come at the cost of a distortion D, i.e., a loss in performance (e.g., in\nclassi\ufb01cation accuracy for a classi\ufb01cation DNN with quantized model parameters, or in reconstruction\nerror in the context of autoencoders with quantized intermediate feature representations). The rate\nR, i.e., the entropy of the symbol stream, determines the cost of encoding the model or features in a\nbitstream.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fTo learn a compressible DNN model or feature representation we need to minimize D + \u03b2R, where\n\u03b2 > 0 controls the rate-distortion trade-off. Including the entropy into the learning cost function can\nbe seen as adding a regularizer that promotes a compressible representation of the network or feature\nrepresentation. However, two major challenges arise when minimizing D + \u03b2R for DNNs: i) coping\nwith the non-differentiability (due to quantization operations) of the cost function D + \u03b2R, and ii)\nobtaining an accurate and differentiable estimate of the entropy (i.e., R). To tackle i), various methods\nhave been proposed. Among the most popular ones are stochastic approximations [39, 19, 7, 32, 5]\nand rounding with a smooth derivative approximation [15, 30]. To address ii) a common approach\nis to assume the symbol stream to be i.i.d. 
and to model the marginal symbol distribution with a parametric model, such as a Gaussian mixture model [30, 34], a piecewise linear model [5], or a Bernoulli distribution [33] (in the case of binary symbols).
In this paper, we propose a unified end-to-end learning framework for learning compressible representations, jointly optimizing the model parameters, the quantization levels, and the entropy of the resulting symbol stream to compress either a subset of feature representations in the network or the model itself (see inset figure). We address both challenges i) and ii) above with methods that are novel in the context of DNN model and feature compression. Our main contributions are:
[Inset figure: for data compression, the vector to be compressed is an intermediate feature representation, $z = x^{(b)}$, between the sub-networks $F_b \circ \cdots \circ F_1$ and $F_K \circ \cdots \circ F_{b+1}$; for DNN model compression, it is the network parameters, $z = [w_1, w_2, \ldots, w_K]$ of the layers $F_1(\cdot\,; w_1), \ldots, F_K(\cdot\,; w_K)$.]
• We provide the first unified view on end-to-end learned compression of feature representations and DNN models. These two problems have been studied largely independently in the literature so far.
• Our method is simple and intuitively appealing, relying on soft assignments of a given scalar or vector to be quantized to quantization levels. A parameter controls the "hardness" of the assignments and allows us to gradually transition from soft to hard assignments during training. In contrast to rounding-based or stochastic quantization schemes, our coding scheme is directly differentiable, and thus trainable end-to-end.
• Our method does not force the network to adapt to specific (given) quantization outputs (e.g., integers) but learns the quantization levels jointly with the weights, enabling application to a wider set of problems.
In particular, we explore vector quantization for the first time in the context of learned compression and demonstrate its benefits over scalar quantization.
• Unlike essentially all previous works, we make no assumption on the marginal distribution of the features or model parameters to be quantized, relying on a histogram of the assignment probabilities rather than the parametric models commonly used in the literature.
• We apply our method to DNN model compression for a 32-layer ResNet model [13] and to full-resolution image compression using a variant of the compressive autoencoder proposed recently in [30]. In both cases, we obtain performance competitive with the state-of-the-art, while making fewer model assumptions and significantly simplifying the training procedure compared to the original works [30, 6].
The remainder of the paper is organized as follows. Section 2 reviews related work, before our soft-to-hard vector quantization method is introduced in Section 3. We then apply it to a compressive autoencoder for image compression and to ResNet for DNN compression in Sections 4 and 5, respectively. Section 6 concludes the paper.

2 Related Work
There has been a surge of interest in DNN models for full-resolution image compression, most notably [32, 33, 4, 5, 30], all of which outperform JPEG [35] and some even JPEG 2000 [29]. The pioneering work [32, 33] showed that progressive image compression can be learned with convolutional recurrent neural networks (RNNs), employing a stochastic quantization method during training. [4, 30] both rely on convolutional autoencoder architectures.
These works are discussed in more detail in Section 4.
In the context of DNN model compression, the line of works [12, 11, 6] adopts a multi-step procedure in which the weights of a pretrained DNN are first pruned and the remaining parameters are quantized using a k-means-like algorithm, the DNN is then retrained, and finally the quantized DNN model is encoded using entropy coding. A notably different approach is taken by [34], where the DNN compression task is tackled using the minimum description length principle, which has a solid information-theoretic foundation.
It is worth noting that many recent works target quantization of the DNN model parameters and possibly the feature representation to speed up DNN evaluation on hardware with low-precision arithmetic, see, e.g., [15, 23, 38, 43]. However, most of these works do not specifically train the DNN such that the quantized parameters are compressible in an information-theoretic sense.
Gradually moving from an easy (convex or differentiable) problem to the actual harder problem during optimization, as done in our soft-to-hard quantization framework, has been studied in various contexts and falls under the umbrella of continuation methods (see [3] for an overview). Formally related but motivated from a probabilistic perspective are deterministic annealing methods for maximum entropy clustering/vector quantization, see, e.g., [24, 42]. Arguably most related to our approach is [41], which also employs continuation for nearest neighbor assignments, but in the context of learning a supervised prototype classifier.
To the best of our knowledge, continuation methods have not been employed before in an end-to-end learning framework for neural network-based image compression or DNN compression.

3 Proposed Soft-to-Hard Vector Quantization
3.1 Problem Formulation
Preliminaries and Notations. We consider the standard model for DNNs, where we have an architecture $F: \mathbb{R}^{d_1} \mapsto \mathbb{R}^{d_{K+1}}$ composed of $K$ layers $F = F_K \circ \cdots \circ F_1$, where layer $F_i$ maps $\mathbb{R}^{d_i} \to \mathbb{R}^{d_{i+1}}$ and has parameters $w_i \in \mathbb{R}^{m_i}$. We refer to $\mathcal{W} = [w_1, \cdots, w_K]$ as the parameters of the network and we denote the intermediate layer outputs of the network as $x^{(0)} := x$ and $x^{(i)} := F_i(x^{(i-1)})$, such that $F(x) = x^{(K)}$ and $x^{(i)}$ is the feature vector produced by layer $F_i$.
The parameters of the network are learned w.r.t. training data $\mathcal{X} = \{x_1, \cdots, x_N\} \subset \mathbb{R}^{d_1}$ and labels $\mathcal{Y} = \{y_1, \cdots, y_N\} \subset \mathbb{R}^{d_{K+1}}$, by minimizing a real-valued loss $\mathcal{L}(\mathcal{X}, \mathcal{Y}; F)$. Typically, the loss can be decomposed as a sum over the training data plus a regularization term,
$$\mathcal{L}(\mathcal{X}, \mathcal{Y}; F) = \frac{1}{N} \sum_{i=1}^{N} \ell(F(x_i), y_i) + \lambda R(\mathcal{W}), \qquad (1)$$
where $\ell(F(x), y)$ is the sample loss, $\lambda > 0$ sets the regularization strength, and $R(\mathcal{W})$ is a regularizer (e.g., $R(\mathcal{W}) = \sum_i \|w_i\|^2$ for $\ell_2$ regularization). In this case, the parameters of the network can be learned using stochastic gradient descent over mini-batches. Assuming that the data $\mathcal{X}, \mathcal{Y}$ on which the network is trained is drawn from some distribution $P_{X,Y}$, the loss (1) can be thought of as an estimator of the expected loss $\mathbb{E}[\ell(F(X), Y) + \lambda R(\mathcal{W})]$.
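As a minimal illustration of the decomposed loss in (1), here with a squared-error sample loss and an $\ell_2$ regularizer (the toy one-parameter "network" and all helper names are ours, not from the paper):

```python
import numpy as np

def regularized_loss(model, params, xs, ys, sample_loss, lam):
    """Eq. (1): mean per-sample loss plus lambda times a regularizer R(W)."""
    data_term = np.mean([sample_loss(model(x, params), y)
                         for x, y in zip(xs, ys)])
    reg = sum(float(np.sum(w ** 2)) for w in params)  # l2: sum_i ||w_i||^2
    return data_term + lam * reg

# Toy 1-parameter "network" F(x) = w * x with squared-error sample loss.
linear = lambda x, params: params[0] * x
sq_err = lambda pred, y: (pred - y) ** 2
loss = regularized_loss(linear, [np.array(2.0)], [1.0, 2.0], [2.0, 4.0],
                        sq_err, lam=0.1)
```

Here the model fits the data exactly, so the loss reduces to the regularization term alone.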
In the context of image classification, $\mathbb{R}^{d_1}$ would correspond to the input image space and $\mathbb{R}^{d_{K+1}}$ to the classification probabilities, and $\ell$ would be the categorical cross entropy.
We say that the deep architecture is an autoencoder when the network maps back into the input space, with the goal of reproducing the input. In this case, $d_1 = d_{K+1}$ and $F(x)$ is trained to approximate $x$, e.g., with a mean squared error loss $\ell(F(x), y) = \|F(x) - y\|^2$. Autoencoders typically condense the dimensionality of the input into some smaller dimensionality inside the network, i.e., the layer with the smallest output dimension, $x^{(b)} \in \mathbb{R}^{d_b}$, has $d_b \ll d_1$, which we refer to as the "bottleneck".
Compressible representations. We say that a weight parameter $w_i$ or a feature $x^{(i)}$ has a compressible representation if it can be serialized to a binary stream using few bits. For DNN compression, we want the entire network parameters $\mathcal{W}$ to be compressible. For image compression via an autoencoder, we just need the features in the bottleneck, $x^{(b)}$, to be compressible.
Suppose we want to compress a feature representation $z \in \mathbb{R}^d$ in our network (e.g., $x^{(b)}$ of an autoencoder) given an input $x$. Assuming that the data $\mathcal{X}, \mathcal{Y}$ is drawn from some distribution $P_{X,Y}$, $z$ will be a sample from a continuous random variable $Z$.
To store $z$ with a finite number of bits, we need to map it to a discrete space. Specifically, we map $z$ to a sequence of $m$ symbols using a (symbol) encoder $E: \mathbb{R}^d \mapsto [L]^m$, where each symbol is an index ranging from 1 to $L$, i.e., $[L] := \{1, \ldots, L\}$. The reconstruction of $z$ is then produced by a (symbol) decoder $D: [L]^m \mapsto \mathbb{R}^d$, which maps the symbols back to $\hat{z} = D(E(z)) \in \mathbb{R}^d$.
Since $z$ is a sample from $Z$, the symbol stream $E(z)$ is drawn from the discrete probability distribution $P_{E(Z)}$. Thus, given the encoder $E$, according to Shannon's source coding theorem [8], the correct metric for compressibility is the entropy of $E(Z)$:
$$H(E(Z)) = -\sum_{e \in [L]^m} P(E(Z) = e) \log(P(E(Z) = e)). \qquad (2)$$
Our generic goal is hence to optimize the rate-distortion trade-off between the expected loss and the entropy of $E(Z)$:
$$\min_{E, D, \mathcal{W}} \; \mathbb{E}_{X,Y}[\ell(\hat{F}(X), Y) + \lambda R(\mathcal{W})] + \beta H(E(Z)), \qquad (3)$$
where $\hat{F}$ is the architecture where $z$ has been replaced with $\hat{z}$, and $\beta > 0$ controls the trade-off between compressibility of $z$ and the distortion it imposes on $\hat{F}$.
However, we cannot optimize (3) directly. First, we do not know the distribution of $X$ and $Y$. Second, the distribution of $Z$ depends in a complex manner on the network parameters $\mathcal{W}$ and the distribution of $X$. Third, the encoder $E$ is a discrete mapping and thus not differentiable. For our first approximation we consider the sample entropy instead of $H(E(Z))$. That is, given the data $\mathcal{X}$ and some fixed network parameters $\mathcal{W}$, we can estimate the probabilities $P(E(Z) = e)$ for $e \in [L]^m$ via a histogram. For this estimate to be accurate, we would however need $|\mathcal{X}| \gg L^m$. If $z$ is the bottleneck of an autoencoder, this would correspond to trying to learn a single histogram for the entire discretized data space. We relax this by assuming the entries of $E(Z)$ are i.i.d. such that we can instead compute the histogram over the $L$ distinct values.
More precisely, we assume that for $e = (e_1, \cdots, e_m) \in [L]^m$ we can approximate $P(E(Z) = e) \approx \prod_{l=1}^{m} p_{e_l}$, where $p_j$ is the histogram estimate
$$p_j := \frac{|\{e_l(z_i) \mid l \in [m],\, i \in [N],\, e_l(z_i) = j\}|}{mN}, \qquad (4)$$
where we denote the entries of $E(z) = (e_1(z), \cdots, e_m(z))$ and $z_i$ is the output feature $z$ for training data point $x_i \in \mathcal{X}$. We then obtain an estimate of the entropy of $Z$ by substituting the approximation (4) into (2),
$$H(E(Z)) \approx -\sum_{e \in [L]^m} \left( \prod_{l=1}^{m} p_{e_l} \right) \log \left( \prod_{l=1}^{m} p_{e_l} \right) = -m \sum_{j=1}^{L} p_j \log p_j = m H(p), \qquad (5)$$
where the first (exact) equality is due to [8], Thm. 2.6.6, and $H(p) := -\sum_{j=1}^{L} p_j \log p_j$ is the sample entropy for the (i.i.d., by assumption) components of $E(Z)$.¹
We can now simplify the ideal objective of (3), by replacing the expected loss with the sample mean over $\ell$ and the entropy using the sample entropy $H(p)$, obtaining
$$\frac{1}{N} \sum_{i=1}^{N} \ell(F(x_i), y_i) + \lambda R(\mathcal{W}) + \beta m H(p). \qquad (6)$$
We note that so far we have assumed that $z$ is a feature output in $F$, i.e., $z = x^{(k)}$ for some $k \in [K]$. However, the above treatment would stay the same if $z$ is the concatenation of multiple feature outputs. One can also obtain a separate sample entropy term for separate feature outputs and add them to the objective in (6).
In case $z$ is composed of one or more parameter vectors, such as in DNN compression where $z = \mathcal{W}$, $z$ and $\hat{z}$ cease to be random variables, since $\mathcal{W}$ is a parameter of the model. That is, as opposed to the case where we have a source $X$ that produces another source $\hat{Z}$ which we want to be compressible, we want the discretization of a single parameter vector $\mathcal{W}$ to be compressible.
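The histogram estimate (4) and the resulting per-symbol sample entropy $H(p)$ from (5) can be sketched as follows (the function name is ours; entropy is computed in nats):

```python
import numpy as np

def sample_entropy(symbols, L):
    """Eqs. (4)-(5): histogram estimate p_j over all mN symbols and the
    per-symbol sample entropy H(p) in nats. The cost estimate for an
    m-symbol stream is then m * H(p)."""
    counts = np.bincount(symbols.ravel(), minlength=L)
    p = counts / counts.sum()               # p_j, eq. (4)
    nz = p > 0                              # 0 log 0 := 0
    return -np.sum(p[nz] * np.log(p[nz]))   # H(p)

# N = 2 data points, m = 4 symbols each, L = 3 possible symbols.
syms = np.array([[0, 0, 1, 2], [0, 1, 1, 0]])
H = sample_entropy(syms, L=3)
```

For the toy stream above, p = (0.5, 0.375, 0.125) and H(p) ≈ 0.974 nats.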
This is analogous to compressing a single document, instead of learning a model that can compress a stream of documents. In this case, (3) is not the appropriate objective, but our simplified objective in (6) remains appropriate. This is because a standard technique in compression is to build a statistical model of the (finite) data, which has a small sample entropy. The only difference is that now the histogram probabilities in (4) are taken over $\mathcal{W}$ instead of the dataset $\mathcal{X}$, i.e., $N = 1$ and $z_i = \mathcal{W}$ in (4), and they count towards storage, as do the encoder $E$ and decoder $D$.
¹ In fact, from [8], Thm. 2.6.6, it follows that if the histogram estimates $p_j$ are exact, (5) is an upper bound for the true $H(E(Z))$ (i.e., without the i.i.d. assumption).
Challenges. Eq. (6) gives us a unified objective that can well describe the trade-off between compressible representations in a deep architecture and the original training objective of the architecture. However, the problem of finding a good encoder $E$, a corresponding decoder $D$, and parameters $\mathcal{W}$ that minimize the objective remains. First, we need to impose a form for the encoder and decoder, and second we need an approach that can optimize (6) w.r.t. the parameters $\mathcal{W}$. Independently of the choice of $E$, (6) is challenging since $E$ is a mapping to a finite set and, therefore, not differentiable. This implies that neither $H(p)$ nor $\hat{F}$ is differentiable w.r.t. the parameters of $z$ and the layers that feed into $z$. For example, if $\hat{F}$ is an autoencoder and $z = x^{(b)}$, the output of the network will not be differentiable w.r.t. $w_1, \cdots, w_b$ and $x^{(0)}, \cdots, x^{(b-1)}$.
These challenges motivate the design decisions of our soft-to-hard annealing approach, described in the next section.

3.2 Our Method
Encoder and decoder form.
For the encoder $E: \mathbb{R}^d \mapsto [L]^m$ we assume that we have $L$ center vectors $\mathcal{C} = \{c_1, \cdots, c_L\} \subset \mathbb{R}^{d/m}$. The encoding of $z \in \mathbb{R}^d$ is then performed by reshaping it into a matrix $Z = [\bar{z}^{(1)}, \cdots, \bar{z}^{(m)}] \in \mathbb{R}^{(d/m) \times m}$, and assigning each column $\bar{z}^{(l)}$ to the index of its nearest neighbor in $\mathcal{C}$. That is, we assume the feature $z \in \mathbb{R}^d$ can be modeled as a sequence of $m$ points in $\mathbb{R}^{d/m}$, which we partition into the Voronoi tessellation over the centers $\mathcal{C}$. The decoder $D: [L]^m \mapsto \mathbb{R}^d$ then simply constructs $\hat{Z} \in \mathbb{R}^{(d/m) \times m}$ from a symbol sequence $(e_1, \cdots, e_m)$ by picking the corresponding centers $\hat{Z} = [c_{e_1}, \cdots, c_{e_m}]$, from which $\hat{z}$ is formed by reshaping $\hat{Z}$ back into $\mathbb{R}^d$. We will interchangeably write $\hat{z} = D(E(z))$ and $\hat{Z} = D(E(Z))$.
The idea is then to relax $E$ and $D$ into continuous mappings via soft assignments instead of the hard nearest neighbor assignment of $E$.
Soft assignments. We define the soft assignment of $\bar{z} \in \mathbb{R}^{d/m}$ to $\mathcal{C}$ as
$$\phi(\bar{z}) := \mathrm{softmax}(-\sigma [\|\bar{z} - c_1\|^2, \ldots, \|\bar{z} - c_L\|^2]) \in \mathbb{R}^L, \qquad (7)$$
where $\mathrm{softmax}(y_1, \cdots, y_L)_j := \frac{e^{y_j}}{e^{y_1} + \cdots + e^{y_L}}$ is the standard softmax operator, such that $\phi(\bar{z})$ has positive entries and $\|\phi(\bar{z})\|_1 = 1$. We denote the $j$-th entry of $\phi(\bar{z})$ with $\phi_j(\bar{z})$ and note that
$$\lim_{\sigma \to \infty} \phi_j(\bar{z}) = \begin{cases} 1 & \text{if } j = \arg\min_{j' \in [L]} \|\bar{z} - c_{j'}\| \\ 0 & \text{otherwise,} \end{cases}$$
such that $\hat{\phi}(\bar{z}) := \lim_{\sigma \to \infty} \phi(\bar{z})$ converges to a one-hot encoding of the nearest center to $\bar{z}$ in $\mathcal{C}$.
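A small sketch of the soft assignment (7) (names are ours): for moderate $\sigma$ the assignment is smooth and differentiable, while for large $\sigma$ it approaches the one-hot hard assignment of the limit above.

```python
import numpy as np

def soft_assignment(z_bar, centers, sigma):
    """Eq. (7): phi(z_bar) = softmax(-sigma * squared distances to centers)."""
    d2 = np.sum((centers - z_bar) ** 2, axis=1)  # ||z_bar - c_j||^2, shape (L,)
    logits = -sigma * d2
    logits -= logits.max()                       # numerical stability
    e = np.exp(logits)
    return e / e.sum()

centers = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
z = np.array([0.9, 0.8])

phi_soft = soft_assignment(z, centers, sigma=1.0)    # smooth, differentiable
phi_hard = soft_assignment(z, centers, sigma=100.0)  # ~one-hot at nearest center
```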
We therefore refer to $\hat{\phi}(\bar{z})$ as the hard assignment of $\bar{z}$ to $\mathcal{C}$ and to the parameter $\sigma > 0$ as the hardness of the soft assignment $\phi(\bar{z})$.
Using soft assignments, we define the soft quantization of $\bar{z}$ as
$$\tilde{Q}(\bar{z}) := \sum_{j=1}^{L} c_j \phi_j(\bar{z}) = C \phi(\bar{z}),$$
where we write the centers as a matrix $C = [c_1, \cdots, c_L] \in \mathbb{R}^{d/m \times L}$. The corresponding hard quantization is taken with $\hat{Q}(\bar{z}) := \lim_{\sigma \to \infty} \tilde{Q}(\bar{z}) = c_{e(\bar{z})}$, where $e(\bar{z})$ is the index of the center in $\mathcal{C}$ nearest to $\bar{z}$. Therefore, we can now write:
$$\hat{Z} = D(E(Z)) = [\hat{Q}(\bar{z}^{(1)}), \cdots, \hat{Q}(\bar{z}^{(m)})] = C[\hat{\phi}(\bar{z}^{(1)}), \cdots, \hat{\phi}(\bar{z}^{(m)})].$$
Now, instead of computing $\hat{Z}$ via hard nearest neighbor assignments, we can approximate it with a smooth relaxation $\tilde{Z} := C[\phi(\bar{z}^{(1)}), \cdots, \phi(\bar{z}^{(m)})]$ by using the soft assignments instead of the hard assignments. Denoting the corresponding vector form by $\tilde{z}$, this gives us a differentiable approximation $\tilde{F}$ of the quantized architecture $\hat{F}$, by replacing $\hat{z}$ in the network with $\tilde{z}$.
Entropy estimation.
Using the soft assignments, we can similarly define a soft histogram, by summing up the partial assignments to each center instead of counting as in (4):
$$q_j := \frac{1}{mN} \sum_{i=1}^{N} \sum_{l=1}^{m} \phi_j(\bar{z}_i^{(l)}).$$
This gives us a valid probability mass function $q = (q_1, \cdots, q_L)$, which is differentiable but converges to $p = (p_1, \cdots, p_L)$ as $\sigma \to \infty$.
We can now define the "soft entropy" as the cross entropy between $p$ and $q$:
$$\tilde{H}(\phi) := H(p, q) = -\sum_{j=1}^{L} p_j \log q_j = H(p) + D_{\mathrm{KL}}(p \,\|\, q),$$
where $D_{\mathrm{KL}}(p \,\|\, q) = \sum_j p_j \log(p_j / q_j)$ denotes the Kullback–Leibler divergence. Since $D_{\mathrm{KL}}(p \,\|\, q) \geq 0$, this establishes $\tilde{H}(\phi)$ as an upper bound for $H(p)$, where equality is obtained when $p = q$.
We have therefore obtained a differentiable "soft entropy" loss (w.r.t. $q$), which is an upper bound on the sample entropy $H(p)$. Hence, we can indirectly minimize $H(p)$ by minimizing $\tilde{H}(\phi)$, treating the histogram probabilities of $p$ as constants for gradient computation. However, we note that while $q_j$ is additive over the training data and the symbol sequence, $\log(q_j)$ is not. This prevents the use of mini-batch gradient descent on $\tilde{H}(\phi)$, which can be an issue for large-scale learning problems. In this case, we can instead re-define the soft entropy $\tilde{H}(\phi)$ as $H(q, p)$. As before, $\tilde{H}(\phi) \to H(p)$ as $\sigma \to \infty$, but $\tilde{H}(\phi)$ ceases to be an upper bound for $H(p)$. The benefit is that now $\tilde{H}(\phi)$ can be decomposed as
$$\tilde{H}(\phi) := H(q, p) = -\sum_{j=1}^{L} q_j \log p_j = -\frac{1}{mN} \sum_{i=1}^{N} \sum_{l=1}^{m} \sum_{j=1}^{L} \phi_j(\bar{z}_i^{(l)}) \log p_j, \qquad (8)$$
such that we get an additive loss over the samples $x_i \in \mathcal{X}$ and the components $l \in [m]$.
Soft-to-hard deterministic annealing.
Our soft assignment scheme gives us differentiable approximations $\tilde{F}$ and $\tilde{H}(\phi)$ of the discretized network $\hat{F}$ and the sample entropy $H(p)$, respectively. However, our objective is to learn network parameters $\mathcal{W}$ that minimize (6) when using the encoder and decoder with hard assignments, such that we obtain a compressible symbol stream $E(z)$ which we can compress using, e.g., arithmetic coding [40].
To this end, we anneal $\sigma$ from some initial value $\sigma_0$ to infinity during training, such that the soft approximation gradually becomes a better approximation of the final hard quantization we will use. Choosing the annealing schedule is crucial, as annealing too slowly may allow the network to invert the soft assignments (resulting in large weights), and annealing too fast leads to vanishing gradients too early, thereby preventing learning. In practice, one can either parametrize $\sigma$ as a function of the iteration, or tie it to an auxiliary target such as the difference between the network losses incurred by soft quantization and hard quantization (see Section 4 for details).
For a simple initialization of $\sigma_0$ and the centers $\mathcal{C}$, we can sample the centers from the set $\mathcal{Z} := \{\bar{z}_i^{(l)} \mid i \in [N],\, l \in [m]\}$ and then cluster $\mathcal{Z}$ by minimizing the cluster energy $\sum_{\bar{z} \in \mathcal{Z}} \|\bar{z} - \tilde{Q}(\bar{z})\|^2$ using SGD.

4 Image Compression
We now show how we can use our framework to realize a simple image compression system. For the architecture, we use a variant of the convolutional autoencoder proposed recently in [30] (see Appendix A.1 for details).
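The codebook initialization described at the end of Section 3.2 (sample centers from the data points, then reduce the cluster energy) can be sketched as follows; for simplicity this sketch uses Lloyd-style (k-means) updates in place of the SGD minimization used in the paper, and all names are ours:

```python
import numpy as np

def init_centers(Z, L, iters=20, seed=0):
    """Initialize the codebook: sample L centers from the data points Z
    (rows) and refine them by reducing the cluster energy
    sum_z ||z - Q(z)||^2 (Lloyd-style updates here, for simplicity)."""
    rng = np.random.default_rng(seed)
    C = Z[rng.choice(len(Z), size=L, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center.
        d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        a = d2.argmin(1)
        # Move each non-empty center to the mean of its assigned points.
        for j in range(L):
            if np.any(a == j):
                C[j] = Z[a == j].mean(0)
    return C

# Two well-separated point pairs; L = 2 recovers one center per pair.
Z_data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
C = init_centers(Z_data, L=2)
```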
We note that while we use the architecture of [30], we train it using our soft-to-hard entropy minimization method, which differs significantly from their approach, see below.
Our goal is to learn a compressible representation of the features in the bottleneck of the autoencoder. Because we do not expect the features from different bottleneck channels to be identically distributed, we model each channel's distribution with a different histogram and entropy loss, adding each entropy term to the total loss using the same $\beta$ parameter. To encode a channel into symbols, we separate the channel matrix into a sequence of $p_w \times p_h$-dimensional patches. These patches (vectorized) form the columns of $Z \in \mathbb{R}^{d/m \times m}$, where $m = d/(p_w p_h)$, such that $Z$ contains $m$ $(p_w p_h)$-dimensional points. Having $p_h$ or $p_w$ greater than one allows symbols to capture local correlations in the bottleneck, which is desirable since we model the symbols as i.i.d. random variables for entropy coding. At test time, the symbol encoder $E$ then determines the symbols in the channel by performing a nearest neighbor assignment over a set of $L$ centers $\mathcal{C} \subset \mathbb{R}^{p_w p_h}$, resulting in $\hat{Z}$, as described above. During training we instead use the soft quantized $\tilde{Z}$, also w.r.t. the centers $\mathcal{C}$.
[Figure 1: Top: MS-SSIM as a function of rate for SHVQ (ours), BPG, JPEG 2000, and JPEG, for each data set. Bottom: A visual example from the Kodak data set along with rate / MS-SSIM / SSIM / PSNR: SHVQ (ours) 0.20 bpp / 0.91 / 0.69 / 23.88 dB; BPG 0.20 bpp / 0.90 / 0.67 / 24.19 dB; JPEG 2000 0.20 bpp / 0.88 / 0.63 / 23.01 dB; JPEG 0.22 bpp / 0.77 / 0.48 / 19.77 dB.]
We trained different models using Adam [17], see Appendix A.2. Our training set is composed similarly to that described in [4].
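The per-channel patch extraction described above (each $p_w \times p_h$ patch becomes one column of $Z$) can be sketched as follows (the helper name is ours, assuming the channel tiles exactly into patches):

```python
import numpy as np

def channel_to_points(channel, pw, ph):
    """Split an (H, W) bottleneck channel into non-overlapping pw x ph
    patches and vectorize each patch, giving the m columns of Z
    (each column is a (pw*ph)-dimensional point to be vector-quantized)."""
    H, W = channel.shape
    assert H % ph == 0 and W % pw == 0, "assume the channel tiles exactly"
    patches = channel.reshape(H // ph, ph, W // pw, pw).swapaxes(1, 2)
    return patches.reshape(-1, ph * pw).T   # shape (pw*ph, m)

chan = np.arange(16.0).reshape(4, 4)
Z = channel_to_points(chan, pw=2, ph=2)     # 4 patches, each of dimension 4
```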
We used a subset of 90,000 images from ImageNET [9], which we downsampled by a factor 0.7 and trained on crops of 128 × 128 pixels, with a batch size of 15. To estimate the probability distribution $p$ for optimizing (8), we maintain a histogram over 5,000 images, which we update every 10 iterations with the images from the current batch. Details about other hyperparameters can be found in Appendix A.2.
The training of our autoencoder network takes place in two stages, where we move from an identity function in the bottleneck to hard quantization. In the first stage, we train the autoencoder without any quantization. Similar to [30], we gradually unfreeze the channels in the bottleneck during training (this gives a slight improvement over learning all channels jointly from the start). This yields an efficient weight initialization and enables us to then initialize $\sigma_0$ and $\mathcal{C}$ as described above. In the second stage, we minimize (6), jointly learning network weights and quantization levels. We anneal $\sigma$ by letting the gap between soft and hard quantization error go to zero as the number of iterations $t$ goes to infinity. Let $e_S = \|\tilde{F}(x) - x\|^2$ be the soft error and $e_H = \|\hat{F}(x) - x\|^2$ be the hard error. With $\mathrm{gap}(t) = e_H - e_S$ we can denote the error between the actual and the desired gap by $e_G(t) = \mathrm{gap}(t) - T/(T + t)\,\mathrm{gap}(0)$, such that the gap is halved after $T$ iterations. We update $\sigma$ according to $\sigma(t+1) = \sigma(t) + K_G\, e_G(t)$, where $\sigma(t)$ denotes $\sigma$ at iteration $t$. Fig. 3 in Appendix A.4 shows the evolution of the gap, soft and hard loss as $\sigma$ grows during training. We observed that both vector quantization and the entropy loss lead to higher compression rates at a given reconstruction MSE compared to scalar quantization and training without entropy loss, respectively (see Appendix A.3 for details).
Evaluation.
To evaluate the image compression performance of our Soft-to-Hard Vector Quantization Autoencoder (SHVQ) method we use four datasets, namely Kodak [2], B100 [31], Urban100 [14], and ImageNET100 (100 randomly selected images from ImageNET [25]), and three standard quality measures, namely peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) [37], and multi-scale SSIM (MS-SSIM); see Appendix A.5 for details. We compare our SHVQ with the standard JPEG, JPEG 2000, and BPG [1], focusing on compression rates < 1 bit per pixel (bpp), i.e., the regime where traditional integral transform-based compression algorithms are most challenged. As shown in Fig. 1, for high compression rates (< 0.4 bpp), our SHVQ outperforms JPEG and JPEG 2000 in terms of MS-SSIM and is competitive with BPG. A similar trend can be observed for SSIM (see Fig. 4 in Appendix A.6 for plots of SSIM and PSNR as a function of bpp). SHVQ performs best on ImageNET100 and is most challenged on Kodak when compared with JPEG 2000. Visually, SHVQ-compressed images have fewer artifacts than those compressed by JPEG 2000 (see Fig. 1, and Figs. 5–12 in Appendix A.7).

Table 1: Accuracies and compression factors for different DNN compression techniques, using a 32-layer ResNet on CIFAR-10. FT. denotes fine-tuning, I.C. denotes index coding, and H.C. and A.C. denote Huffman and arithmetic coding, respectively. The pruning-based results are from [6].

METHOD                                                          | ACC [%] | COMP. RATIO
ORIGINAL MODEL                                                  | 92.6    | 1.00
PRUNING + FT. + INDEX CODING + H. CODING [12]                   | 92.6    | 4.52
PRUNING + FT. + K-MEANS + FT. + I.C. + H.C. [11]                | 92.6    | 18.25
PRUNING + FT. + HESSIAN-WEIGHTED K-MEANS + FT. + I.C. + H.C.    | 92.7    | 20.51
PRUNING + FT. + UNIFORM QUANTIZATION + FT. + I.C. + H.C.        | 92.7    | 22.17
PRUNING + FT. + ITERATIVE ECSQ + FT. + I.C. + H.C.              | 92.7    | 21.01
SOFT-TO-HARD ANNEALING + FT. + H. CODING (OURS)                 | 92.1    | 19.15
SOFT-TO-HARD ANNEALING + FT. + A. CODING (OURS)                 | 92.1    | 20.15

Related methods and discussion. JPEG 2000 [29] uses wavelet-based transformations and adaptive EBCOT coding. BPG [1], based on a subset of the HEVC video compression standard, is the current state-of-the-art for image compression. It uses context-adaptive binary arithmetic coding (CABAC) [21].
The recent works of [30, 5] also showed competitive performance with JPEG 2000. While we use the architecture of [30], there are stark differences between the works, summarized in the inset table:

                   | Theis et al. [30]          | SHVQ (ours)
Quantization       | rounding to integers       | vector quantization
Backpropagation    | grad. of identity mapping  | grad. of soft relaxation
Entropy estimation | Gaussian scale mixtures    | (soft) histogram
Training material  | high quality Flickr images | ImageNET
Operating points   | ensemble                   | single model

The work of [5] builds a deep model using multiple generalized divisive normalization (GDN) layers and their inverses (IGDN), which are specialized layers designed to capture local joint statistics of natural images. Furthermore, they model marginals for entropy estimation using linear splines and also use CABAC [21] coding.
Concurrently with our work, the method of [16] builds on the architecture proposed in [33], and shows that impressive performance in terms of the MS-SSIM metric can be obtained by incorporating MS-SSIM into the optimization objective (instead of just minimizing the MSE). In contrast to the domain-specific techniques adopted by these state-of-the-art methods, our framework for learning compressible representations can realize a competitive image compression system using only a convolutional autoencoder and simple entropy coding.

5 DNN Compression

For DNN compression, we investigate the ResNet [13] architecture for image classification. We adopt the same setting as [6] and consider a 32-layer architecture trained for CIFAR-10 [18]. As in [6], our goal is to learn a compressible representation for all 464,154 trainable parameters of the model. We concatenate the parameters into a vector W ∈ R^464,154 and employ scalar quantization (m = d), such that Z^T = z = W. We started from the pre-trained original model, which obtains a 92.6% accuracy on the test set. We implemented the entropy minimization using L = 75 centers and chose β = 0.1 such that the converged entropy would give a compression factor of ≈ 20, i.e., ≈ 32/20 = 1.6 bits per weight. The training was performed with the same learning parameters the original model was trained with (SGD with momentum 0.9). The annealing schedule was a simple exponential one, σ(t + 1) = 1.001 · σ(t) with σ(0) = 0.4. After 4 epochs of training, when σ(t) had increased by a factor of ≈ 20, we switched to hard assignments and continued fine-tuning at a 10× lower learning rate.
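The soft-to-hard behavior described above can be illustrated with a small numpy sketch: soft assignments are a softmax over negative σ-scaled distances to the centers, and annealing σ drives them toward hard nearest-neighbor quantization. The exact parameterization below (squared distances inside the softmax, function names) is an illustrative assumption, not the paper's precise formulation:

```python
import numpy as np

def soft_quantize(w, centers, sigma):
    """Soft scalar quantization: a softmax over negative sigma-scaled
    squared distances to the centers. As sigma grows, the assignment
    approaches hard nearest-neighbor quantization."""
    d2 = (w[:, None] - centers[None, :]) ** 2    # pairwise squared distances
    logits = -sigma * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)            # soft assignment probabilities
    return p @ centers                           # soft-quantized values

# Exponential annealing as in the text: sigma(t+1) = 1.001 * sigma(t), sigma(0) = 0.4.
sigma = 0.4
for _ in range(4000):
    sigma *= 1.001                               # sigma grows by a factor ~54 here

centers = np.linspace(-1.0, 1.0, 5)              # toy codebook; the text uses L = 75
w = np.array([0.05, -0.48, 0.93])
hard = centers[np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)]
# After annealing, soft and hard quantization nearly agree.
assert np.abs(soft_quantize(w, centers, sigma) - hard).max() < 0.05
```

The residual gap between soft and hard assignments at the end of annealing is what motivates the switch to hard assignments for the final fine-tuning phase.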
(We switch to hard assignments at this point since, as Q̃ converges to hard nearest-neighbor assignments, weights that are equally close to two centers can receive large gradients. One could also employ simple gradient clipping.)

Adhering to the benchmark of [6, 12, 11], we obtain the compression factor by dividing the bit cost of storing the uncompressed weights as floats (464,154 × 32 bits) by the total encoding cost of the compressed weights (i.e., L × 32 bits for the centers plus the size of the compressed index stream).

Our compressible model achieves a comparable test accuracy of 92.1% while compressing the DNN by a factor of 19.15 with Huffman coding and 20.15 with arithmetic coding. Table 1 compares our results with the state-of-the-art approaches reported by [6]. We note that while the top methods from the literature also achieve accuracies above 92% and compression factors above 20×, they employ a considerable number of hand-designed steps, such as pruning, retraining, various types of weight clustering, special encoding of the sparse weight matrices into an index-difference based format, and finally entropy coding. In contrast, we directly minimize the entropy of the weights during training, obtaining a highly compressible representation using standard entropy coding.

In Fig. 13 in Appendix A.8, we show how the sample entropy H(p) decays and how the index histograms develop during training, as the network learns to condense most of the weights to a couple of centers when optimizing (6). In contrast, the methods of [12, 11, 6] manually impose 0 as the most frequent center by pruning ≈ 80% of the network weights.
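The compression-factor accounting described above can be written out explicitly; with the nominal entropy target of 1.6 bits per weight and L = 75 centers, it yields a factor just under 20 (the helper name is ours; the measured Huffman and arithmetic figures deviate slightly because the actual coded rate differs from the nominal entropy target):

```python
def compression_factor(n_weights, n_centers, bits_per_weight):
    """Ratio of uncompressed to compressed size, following the
    accounting of [6, 12, 11] described in the text."""
    uncompressed_bits = n_weights * 32        # weights stored as 32-bit floats
    codebook_bits = n_centers * 32            # the L quantization centers
    index_bits = n_weights * bits_per_weight  # entropy-coded symbol stream
    return uncompressed_bits / (codebook_bits + index_bits)

# 464,154 ResNet weights, L = 75 centers, entropy target 32/20 = 1.6 bits/weight:
factor = compression_factor(464_154, 75, 1.6)  # ≈ 19.9
```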
We note that the recent work of [34] also manages to tackle the problem in a single training procedure, using the minimum description length principle. In contrast to our framework, they take a Bayesian perspective and rely on a parametric assumption on the symbol distribution.

6 Conclusions

In this paper we proposed a unified framework for end-to-end learning of compressed representations for deep architectures. By training with a soft-to-hard annealing scheme, gradually transferring from a soft relaxation of the sample entropy and network discretization process to the actual non-differentiable quantization process, we manage to optimize the rate-distortion trade-off between the original network loss and the entropy. Our framework can elegantly capture diverse compression tasks, obtaining results competitive with the state of the art for both image compression and DNN compression. The simplicity of our approach opens up various directions for future work, since our framework can be easily adapted to other tasks where a compressible representation is desired.

Acknowledgments

This work was supported by the EU's Horizon 2020 programme under grant agreement No 687757 - REPLICATE, by NVIDIA Corporation through the Academic Hardware Grant, by ETH Zurich, and by Armasuisse.

References

[1] BPG Image format. https://bellard.org/bpg/.

[2] Kodak PhotoCD dataset. http://r0k.us/graphics/kodak/.

[3] Eugene L Allgower and Kurt Georg. Numerical continuation methods: an introduction, volume 13. Springer Science & Business Media, 2012.

[4] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimization of nonlinear transform codes for perceptual quality. arXiv preprint arXiv:1607.05006, 2016.

[5] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.

[6] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. 
Towards the limit of network quantization. arXiv preprint arXiv:1612.01543, 2016.

[7] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123-3131, 2015.

[8] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[10] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115-118, 2017.

[11] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[12] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[14] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197-5206, 2015.

[15] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 
Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.

[16] Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Troy Chinen, Sung Jin Hwang, Joel Shor, and George Toderici. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. arXiv preprint arXiv:1703.10114, 2017.

[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[19] Alex Krizhevsky and Geoffrey E Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.

[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[21] Detlev Marpe, Heiko Schwarz, and Thomas Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620-636, 2003.

[22] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. Int'l Conf. Computer Vision, volume 2, pages 416-423, July 2001.

[23] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542. Springer, 2016.

[24] Kenneth Rose, Eitan Gurewitz, and Geoffrey C Fox. Vector quantization by deterministic annealing. 
IEEE Transactions on Information Theory, 38(4):1249-1257, 1992.

[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.

[26] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874-1883, 2016.

[27] Wenzhe Shi, Jose Caballero, Lucas Theis, Ferenc Huszár, Andrew Aitken, Christian Ledig, and Zehan Wang. Is the deconvolution layer the same as a convolutional layer? arXiv preprint arXiv:1609.07009, 2016.

[28] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

[29] David S. Taubman and Michael W. Marcellin. JPEG 2000: Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, Norwell, MA, USA, 2001.

[30] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. In ICLR 2017, 2017.

[31] Radu Timofte, Vincent De Smet, and Luc Van Gool. A+: Adjusted Anchored Neighborhood Regression for Fast Super-Resolution, pages 111-126. Springer International Publishing, Cham, 2015.

[32] George Toderici, Sean M O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent neural networks. 
arXiv preprint arXiv:1511.06085, 2015.

[33] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural networks. arXiv preprint arXiv:1608.05148, 2016.

[34] Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.

[35] Gregory K Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii-xxxiv, 1992.

[36] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems and Computers, 2003, volume 2, pages 1398-1402, Nov 2003.

[37] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, April 2004.

[38] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074-2082, 2016.

[39] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

[40] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30(6):520-540, June 1987.

[41] Paul Wohlhart, Martin Kostinger, Michael Donoser, Peter M. Roth, and Horst Bischof. Optimizing 1-nearest prototype classifiers. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2013.

[42] Eyal Yair, Kenneth Zeger, and Allen Gersho. Competitive learning and soft competition for vector quantizer design. 
IEEE Transactions on Signal Processing, 40(2):294-309, 1992.

[43] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.