{"title": "From Mixtures of Mixtures to Adaptive Transform Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 925, "page_last": 931, "abstract": null, "full_text": "From Mixtures of Mixtures to \nAdaptive Transform Coding \n\nCynthia Archer and Todd K. Leen \n\nDepartment of Computer Science and Engineering \nOregon Graduate Institute of Science & Technology \n20000 N.W. Walker Rd, Beaverton, OR 97006-1000 \n\nE-mail: archer, tleen@cse.ogi.edu \n\nAbstract \n\nWe establish a principled framework for adaptive transform cod(cid:173)\ning. Transform coders are often constructed by concatenating an ad \nhoc choice of transform with suboptimal bit allocation and quan(cid:173)\ntizer design. Instead, we start from a probabilistic latent variable \nmodel in the form of a mixture of constrained Gaussian mixtures. \nFrom this model we derive a transform coding algorithm, which is \na constrained version of the generalized Lloyd algorithm for vector \nquantizer design. A byproduct of our derivation is the introduc(cid:173)\ntion of a new transform basis, which unlike other transforms (PCA, \nDCT, etc.) is explicitly optimized for coding. Image compression \nexperiments show adaptive transform coders designed with our al(cid:173)\ngorithm improve compressed image signal-to-noise ratio up to 3 dB \ncompared to global transform coding and 0.5 to 2 dB compared to \nother adaptive transform coders. \n\n1 \n\nIntroduction \n\nCompression algorithms for image and video signals often use transform coding as a \nlow-complexity alternative to vector quantization (VQ). Transform coders compress \nmulti-dimensional data by transforming the signal vectors to new coordinates and \ncoding the transform coefficients independently of one another with scalar quantiz(cid:173)\ners. \n\nThe coordinate transform may be fixed a priori as in the discrete cosine transform \n(DCT). 
It can also be adapted to the signal statistics using, for example, principal component analysis (PCA), where the goal is to concentrate signal energy in a few signal components. Noting that signals such as images and speech are non-stationary, several researchers have developed non-linear [1, 2] and local linear or adaptive [3, 4] PCA transforms for dimension reduction.1 None of these transforms is designed to minimize compression distortion, nor are they designed in concert with quantizer development.

1 In dimension reduction the original d-dimensional signal is projected onto a subspace or submanifold of lower dimension. The retained coordinates are not quantized.

Several researchers have extended the idea of local linear transforms to transform coding [5, 6, 7]. In these adaptive transform coders, the signal space is partitioned into disjoint regions and a transform and a set of scalar quantizers are designed for each region. In our own previous work [7], we use k-means partitioning to define the regions. Dony and Haykin [5] partition the space to minimize dimension-reduction error. Tipping and Bishop [6] use soft partitioning according to a probabilistic rule that reduces, in the appropriate limit, to partitioning by dimension-reduction error. These systems neither design transforms nor partition the signal space with the goal of minimizing compression distortion.

This ad hoc construction contrasts sharply with the solid grounding of vector quantization. Nowlan [8] develops a probabilistic framework for VQ by demonstrating the correspondence between a VQ and a mixture of spherically symmetric Gaussians. In the limit that the mixture component variance goes to zero, the Expectation-Maximization (EM) procedure for fitting the mixture model to data becomes identical to the Linde-Buzo-Gray (LBG) algorithm [9] for vector quantizer design. 
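Nowlan's correspondence is easy to make concrete: as the component variance σ² goes to zero, the EM responsibilities of a spherical Gaussian mixture harden into nearest-codeword assignments, and the M-step mean update becomes the LBG centroid update. A minimal sketch in NumPy (the function name and interface are ours, for illustration only):

```python
import numpy as np

def lbg(x, codebook, n_iters=50):
    """Generalized Lloyd / LBG iteration: the sigma^2 -> 0 limit of EM
    for a mixture of spherical Gaussians with hard responsibilities."""
    codebook = codebook.copy()
    for _ in range(n_iters):
        # E-step limit: hard-assign each vector to its nearest codeword
        d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # M-step: each codeword moves to the centroid of its region
        for k in range(len(codebook)):
            members = x[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook, labels
```

With σ² held finite one would recover soft-assignment EM; the sketch above is only the zero-variance limit described in the text.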
\n\nThis paper develops a similar grounding for both global and adaptive (local) trans(cid:173)\nform coding. We define a constrained mixture of Gaussians model that provides \na framework for transform coder design. Our new design algorithm is simply a \nconstrained version of the LBG algorithm. It iteratively optimizes the signal space \npartition, the local transforms, the allocation of coding bits, and the scalar quan(cid:173)\ntizer reproduction values until it reaches a local distortion minimum. This approach \nleads to two new results, an orthogonal transform and a method of partitioning the \nsignal space, both designed to minimize coding error. \n\n2 Global Transform Coder Model \n\nIn this section, we develop a constrained mixture of Gaussians model that provides \na probabilistic framework for global transform coding. \n\n2.1 Latent Variable Model \n\nA transform coder converts a signal to new coordinates and then codes the coordi(cid:173)\nnate values independently of one another with scalar quantizers. To replicate this \nstructure, we envision the data as drawn from a d-dimensionallatent data space, S, \nin which the density p( 8) = p( 81,82, ... ,8d) is a product of the marginal densities, \nPJ(8J), J = 1. . . d. \n\na 11,12 \n~ \n\n-\n\nI \n\n- r2\u00b7 \n12 \n\ns\u00b7 \n\nS \n\nr1\u00b7 -\n11-\nI 1 \n\nFigure 1: Structure of latent variable space, S, and mapping to observed space, X . The \nlatent data density consists of a mixture of spherical Gaussians with component means qa \nconstrained to lie at the vertices of a rectangular grid. The latent data is mapped to the \nobserved space by an orthogonal transform, W. \n\n\fWe model the density in the latent space with a constrained mixture of Gaussian \ndensities \n\nK \n\np(s) = L 7ra P(sla) \n\n(1) \n\na=l \n\nwhere 7ra are the mixing coefficients and p(sla) = N(qa, ~a) is Gaussian with \nmean qa and variance ~a. 
The mixture component means, q_a, lie at the vertices of a rectangular grid, as illustrated in Figure 1. The coordinates of q_a are [r_{1i_1}, r_{2i_2}, ..., r_{di_d}]^T, where r_{Ji_J} is the i_J-th grid mark on the s_J axis. There are K_J grid mark values on the s_J axis, so the total number of grid vertices is K = ∏_J K_J. We constrain the mixture component covariances Σ_a to be spherically symmetric with the same variance, σ²I, with I the identity matrix. We do not fit σ² to the data, but treat it as a "knob", which we will turn to zero to reveal a transform coder. These mean and variance constraints yield marginal densities p_J(s_J|i_J) = N(r_{Ji_J}, σ²). We write the density of s conditioned on a as

p(s|a) = p(s_1, ..., s_d | a(i_1, ..., i_d)) = ∏_{J=1}^{d} p_J(s_J|i_J)    (2)

and constrain each π_a to be a product of prior probabilities, π_{a(i_1,...,i_d)} = ∏_J p_{Ji_J}. Incorporating these constraints into (1), and noting that the sum over the mixture components a is equivalent to sums over all grid mark values, the latent density becomes

p(s) = Σ_{i_1=1}^{K_1} Σ_{i_2=1}^{K_2} ... Σ_{i_d=1}^{K_d} ∏_{J=1}^{d} p_{Ji_J} p_J(s_J|i_J) = ∏_{J=1}^{d} Σ_{i_J=1}^{K_J} p_{Ji_J} p_J(s_J|i_J)    (3)

where the second equality comes by regrouping terms.

The latent data is mapped to the observation space by an orthogonal transformation, W (Figure 1). Using p(x|s) = δ(x − Ws − μ) and (1), the density on observed data x conditioned on component a is p(x|a) = N(Wq_a + μ, σ²I). The total density on x is

p(x) = Σ_{a=1}^{K} π_a p(x|a).    (4)

The data log likelihood for N data vectors, {x_n, n = 1...N}, averaged over the posterior probabilities p(a|x_n), is

⟨L⟩ = (1/N) Σ_{n=1}^{N} Σ_{a=1}^{K} p(a|x_n) [ ln π_a − (1/(2σ²)) |x_n − Wq_a − μ|² ] + const.    (5)

2.2 Model Fitting and Transform Coder Design

The model (4) can be fit to data using the EM algorithm. In the limit that the variance of the mixture components goes to zero, the EM procedure for fitting the mixture model to data corresponds to a constrained LBG (CLBG) algorithm for optimal transform coder design. 
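The grid constraint of Section 2.1 can be made concrete: the K = ∏_J K_J component means are the Cartesian product of the per-axis grid marks, and the mixing coefficients factor the same way into per-axis priors. A small illustration (the grid marks and priors below are hypothetical, chosen only to show the construction):

```python
import itertools
import numpy as np

# Hypothetical per-axis grid marks r_{J i_J} and priors p_{J i_J} (d = 2)
marks = [np.array([-1.0, 0.0, 1.0]),   # K_1 = 3 marks on the s_1 axis
         np.array([-0.5, 0.5])]        # K_2 = 2 marks on the s_2 axis
priors = [np.array([0.25, 0.5, 0.25]),
          np.array([0.5, 0.5])]

# Component means q_a lie on the rectangular grid: K = K_1 * K_2 = 6 vertices
q = np.array(list(itertools.product(*marks)))                   # shape (K, d)
# Mixing coefficients factor across axes: pi_a = prod_J p_{J i_J}
pi = np.array([np.prod(p) for p in itertools.product(*priors)])
```

Since each per-axis prior sums to one, the π_a automatically sum to one, so the factorized constraint costs nothing in normalization.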
\nIn the limit (72 -+ 0 the entropy term, In 7ra , becomes insignificant and the compo(cid:173)\nnent posteriors collapse to \n\n(6) \n\n\fEach data vector is assigned to the component whose mean has the smallest Eu(cid:173)\nclidean distance to it. These assignments minimize mean squared error. \n\nIn the limit that (72 -+ 0, maximizing the likelihood (5) is equivalent to minimizing \ncompression distortion \n\nD = L 7ra N L Ix - Wqa _1-\u00a31 2 \n\n1 \n\na \n\na xER\", \n\n(7) \n\nwhere Ra = {x Ip(alx) = I}, Na is the number of x ERa, and 7ra = Na/N. \nTo optimize the transform, we find the orientation of the current quantizer grid \nwhich minimizes (7). The transform, W, is constrained to be orthogonal, that is \nWTW = I. We first define the matrix of outer products Q \nQ = L 7raqa (~ L (x-I-\u00a3f) \n\n(8) \n\n. \n\na \n\na xER\", \n\nMinimizing the distortion (7) with respect to some element of W and using Lagrange \nmultipliers to enforce the orthogonality of W yields the condition \n\nQW = WTQT \n\n(9) \n\nor QW is symmetric. This symmetry condition and the orthogonality condition, \nWTW = I, uniquely determine the coding optimal transform (COT) W. The COT \nreduces to the PCA transform when the data is Gaussian. However, in general the \nCOT differs from PCA. For instance in global transform coding trials on a variety \nof grayscale images, the COT improves signal-to-noise ratio (SNR) relative to PCA \nby 0.2 to 0.35 dB for fixed-rate coding at 1.0 bits per pixel (bpp). For variable-rate \ncoding, SNR improvement due to using the COT is substantial, 0.3 to 1.2 dB for \nentropies of 0.25 to 1.25 bpp. \nWe next minimize (7) with respect to the grid mark values, r JiJ' for J = \n1 .. . d and iJ = 1 . .. K J and the number of grid values K J for each coordinate. \nIt is advantageous to rewrite compression distortion as the sum of distortions \nD = LJ DJ due to quantizing the transform coefficients SJ = WJ x, where WJ \nis the Jfh column vector of W. 
The r_{Ji_J} grid mark values that minimize each D_J are the reproduction values of a scalar Lloyd quantizer [10] designed for the transform coefficients s_J. K_J is the number of reproduction values in the quantizer for transform coordinate J. Allocating the log2(K) coding bits among the transform coordinates so as to minimize distortion [11] determines the optimal K_J's.

3 Local Transform Coder Model

In this section, we develop a mixture of constrained Gaussian mixtures model that provides a probabilistic framework for adaptive transform coding.

3.1 Latent Variable Model

A local or adaptive transform coder identifies regions in data space that require different quantizer grids and orthogonal transforms. A separate transform coder is designed for each of these regions. To replicate this structure, we envision the observed data as drawn from one of M grids in the latent space. The latent variables, s, are modeled with a mixture of Gaussian densities, where the mixture components are constrained to lie at the grid vertices. Each grid has the same number of mixture components, K; however, the number and spacing of grid marks on each axis can differ. This is illustrated schematically (in the hard-clustering limit) in figure 2.

Figure 2: Nonstationary data model: structure of latent variable space, S, and mapping (in the hard clustering limit) to observed space, X. The density in the latent space consists of mixtures of spherically symmetric Gaussians. The mixture component means, q_α^(m), lie at the vertices of the m-th grid. Latent data is mapped to the observation space by W^(m).

The density on s conditioned on a single mixture component, α in grid m,2 is p(s|α, m) = N(q_α^(m), σ²I). 
The latent density is a mixture of constrained Gaussian mixture densities,

p(s) = Σ_{m=1}^{M} π_m Σ_{α=1}^{K} p(α|m) p(s|α, m).    (10)

The latent data is mapped to the observation space by orthonormal transforms W^(m). The density on x conditioned on α in grid m is p(x|α, m) = N(W^(m) q_α^(m) + μ^(m), σ²I). The observed data density is

p(x) = Σ_{m=1}^{M} π_m Σ_{α=1}^{K} p(α|m) p(x|α, m).    (11)

3.2 Optimal Adaptive Transform Coder Design

In the limit that σ² → 0, the EM procedure for fitting this model corresponds to a constrained LBG algorithm for adaptive transform coder design. As before, a single mixture component becomes responsible for x_n:

p(α, m|x) → 1 if |x − W^(m) q_α^(m) − μ^(m)|² ≤ |x − W^(m') q_{α'}^(m') − μ^(m')|² for all α', m'
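The hard assignment rule above can be sketched directly: each vector is compared against every reconstructed component mean W^(m) q_α^(m) + μ^(m) over all grids, and the closest one wins. (The function name and data layout are ours; each grid's vertices are stored as rows of a (K, d) array.)

```python
import numpy as np

def assign(x, Ws, grids, mus):
    """sigma^2 -> 0 assignment: return (alpha, m) for the component mean
    W^(m) q_alpha^(m) + mu^(m) closest to x in Euclidean distance."""
    best, best_d2 = None, np.inf
    for m, (W, q, mu) in enumerate(zip(Ws, grids, mus)):
        recon = q @ W.T + mu      # all vertex means of grid m, in x-space
        d2 = ((x - recon) ** 2).sum(axis=1)
        a = int(d2.argmin())
        if d2[a] < best_d2:
            best, best_d2 = (a, m), d2[a]
    return best
```

In a full design loop this assignment step would alternate with re-fitting each region's transform, grid marks, and bit allocation, exactly as in the global CLBG algorithm of Section 2.2.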