{"title": "Learning Sparse Image Codes using a Wavelet Pyramid Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 887, "page_last": 893, "abstract": null, "full_text": "Learning Sparse Image Codes using a \n\nWavelet Pyramid Architecture \n\nBruno A. Olshausen \n\nDepartment of Psychology and \n\nCenter for Neuroscience, UC Davis \n\n1544 Newton Ct. \nDavis, CA 95616 \n\nbaolshausen@uedavis.edu \n\nPhil Sallee \n\nDepartment of Computer Science \n\nUC Davis \n\nDavis, CA 95616 \n\nsallee@es.uedavis.edu \n\nMichael S. Lewicki \n\nDepartment of Computer Science and \n\nCenter for the Neural Basis of Cognition \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \nlewieki@enbe.emu.edu \n\nAbstract \n\nWe show how a wavelet basis may be adapted to best represent \nnatural images in terms of sparse coefficients. The wavelet basis, \nwhich may be either complete or overcomplete, is specified by a \nsmall number of spatial functions which are repeated across space \nand combined in a recursive fashion so as to be self-similar across \nscale. These functions are adapted to minimize the estimated code \nlength under a model that assumes images are composed of a linear \nsuperposition of sparse, independent components. When adapted \nto natural images, the wavelet bases take on different orientations \nand they evenly tile the orientation domain, in stark contrast to the \nstandard, non-oriented wavelet bases used in image compression. \nWhen the basis set is allowed to be overcomplete, it also yields \nhigher coding efficiency than standard wavelet bases. \n\n1 \n\nIntroduction \n\nThe general problem we address here is that of learning efficient codes for represent(cid:173)\ning natural images. Our previous work in this area has focussed on learning basis \nfunctions that represent images in terms of sparse, independent components [1, 2]. 
This is done within the context of a linear generative model for images, in which an image I(x,y) is described in terms of a linear superposition of basis functions b_i(x,y) with amplitudes a_i, plus noise \nu(x,y):

I(x,y) = \sum_i a_i b_i(x,y) + \nu(x,y)    (1)

A sparse, factorial prior is imposed upon the coefficients a_i, and the basis functions are adapted so as to maximize the average log-probability of images under the model (which is equivalent to minimizing the model's estimate of the code length of the images). When the model is trained on an ensemble of whitened natural images, the basis functions converge to a set of spatially localized, oriented, and bandpass functions that tile the joint space of position and spatial-frequency in a manner similar to a wavelet basis. Similar results have been achieved using other forms of independent components analysis [3, 4].

One of the disadvantages of this approach, from an image coding perspective, is that it may only be applied to small sub-images (e.g., 12 x 12 pixels) extracted from a larger image. Thus, if an image were to be coded using this method, it would need to be blocked, which would likely introduce blocking artifacts as the result of quantization or sparsification of the coefficients. In addition, the model is unable to capture spatial structure in the images that is larger than the image block, and scaling up the algorithm to significantly larger blocks is computationally intractable.

The solution to these problems that we propose here is to assume translation- and scale-invariance among the basis functions, as in a wavelet pyramid architecture. That is, if a basis function is learned at one position and scale, then it is assumed to be repeated at all positions (spaced apart by two positions horizontally and vertically) and scales (in octave increments).
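As a minimal numerical sketch of the generative model in equation (1), assuming random basis functions and a hand-picked number of active coefficients (all sizes here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 12x12 patch represented by 144 basis functions b_i(x, y).
patch_size, n_basis = 12, 144
basis = rng.standard_normal((n_basis, patch_size, patch_size))

# Sparse amplitudes a_i: most are exactly zero, a handful are active.
a = np.zeros(n_basis)
active = rng.choice(n_basis, size=5, replace=False)
a[active] = rng.standard_normal(5)

noise = 0.01 * rng.standard_normal((patch_size, patch_size))  # v(x, y)

# Equation (1): I(x, y) = sum_i a_i b_i(x, y) + v(x, y)
I = np.tensordot(a, basis, axes=1) + noise
```

With a sparse prior, most amplitudes are driven to zero, so the patch is explained by a few basis functions plus small noise.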
Thus, the entire set of basis functions for tiling a large image may be learned by adapting only a handful of parameters, i.e., the wavelet filters and the scaling function that is used to expand them across scale.

We show here that when a wavelet image model is adapted to natural images to yield coefficients that are sparse and as statistically independent as possible, the wavelet functions converge to a set of oriented functions, and the scaling function converges to a circularly symmetric lowpass filter appropriate for generating self-similarity across scale. Moreover, the resulting coefficients achieve higher coding efficiency (higher SNR for a fixed bit-rate) than traditional wavelet bases, which are typically designed "by hand" according to certain mathematical desiderata [5].

2 Wavelet image model

The wavelet image model is specified by a relatively small number of parameters, consisting of a set of wavelet functions \psi_i(x,y), i = 1..M, and a scaling function \phi(x,y). An image is generated by upsampling and convolving the coefficients at a given band i with \psi_i (or with \phi at the lowest-resolution level of the pyramid), followed by successive upsampling and convolution with \phi, depending on their level within the pyramid. The wavelet image model for an L level pyramid is specified mathematically as

I(x,y) = g(x,y,0) + \nu(x,y)    (2)

g(x,y,l) = \begin{cases} a^{L-1}(x,y) & l = L-1 \\ f_l(x,y) & l < L-1 \end{cases}    (3)

f_l(x,y) = [g(x,y,l+1)\uparrow 2] * \phi(x,y) + \sum_{i=1}^{M} [a_i^l(x,y)\uparrow 2] * \psi_i(x,y)    (4)

where the coefficients a are indexed by their position x, y, band i, and level of resolution l within the pyramid (l = 0 is the highest resolution level). The symbol

Figure 1: Wavelet image model. Shown are the coefficients of the first three levels of a pyramid (l = 0, 1, 2), with each level split into a number of different bands (i = 1...M).
The highest level (l = 3) is not shown and contains only one lowpass band.

\uparrow 2 denotes upsampling by two and is defined as

f(x,y)\uparrow 2 = \begin{cases} f(x/2, y/2) & x \text{ even and } y \text{ even} \\ 0 & \text{otherwise} \end{cases}    (5)

The wavelet pyramid model is schematically illustrated in figure 1. Traditional wavelet bases typically utilize three bands (M = 3), in which case the representation is critically sampled (same number of coefficients as image pixels). Here, we shall also examine the cases of M = 4 and 6, in which the representation is overcomplete (more coefficients than image pixels).

Because the image model is linear, it may be expressed compactly in vector/matrix notation as

I = G a + \nu    (6)

where the vector a is the entire list of coefficient values at all positions, bands, and levels of the pyramid, and the columns of G are the basis functions corresponding to each coefficient, which are parameterized by \psi and \phi. The probability of generating an image I given a specific state of the coefficients a and assuming Gaussian i.i.d. noise \nu is then

P(I|a,\theta) = \frac{1}{Z_{\lambda_N}} e^{-\frac{\lambda_N}{2} |I - Ga|^2}    (7)

where \theta denotes the parameters of the model and includes the wavelet pyramid functions \psi_i and \phi, as well as the noise variance \sigma_N^2 = 1/\lambda_N.

The prior probability distribution over the coefficients is assumed to be factorial and sparse:

P(a) = \prod_i P(a_i)    (8)

P(a_i) = \frac{1}{Z_S} e^{-S(a_i)}    (9)

where S is a non-convex function that shapes P(a_i) to have the requisite "sparse" form, i.e., peaked at zero with heavy tails, or positive kurtosis. We choose here S(x) = \beta \log(1 + (x/\sigma)^2), which corresponds to a Cauchy-like prior over the coefficients (an exact Cauchy distribution would be obtained for \beta = 1).[1]

[1] A more optimal choice for the prior would be to use a mixture-of-Gaussians distribution, which better captures the sharp peak at zero characteristic of a sparse representation.
But properly maximizing the posterior with such a prior presents formidable challenges [6].

3 Inferring the coefficients

The coefficients for a particular image are determined by finding the maximum of the posterior distribution (MAP estimate)

\hat{a} = \arg\max_a P(a|I,\theta) = \arg\max_a P(I|a,\theta) P(a|\theta)    (10)

\hat{a} = \arg\min_a \left[ \frac{\lambda_N}{2} |I - Ga|^2 + \sum_i S(a_i) \right]    (11)

A local minimum may be found via gradient descent, yielding the differential equation

\dot{a} \propto \lambda_N G^T e - S'(a)    (12)

e = I - Ga    (13)

The computations involving G^T e and G a in equations 12 and 13 may be performed quickly and efficiently using fast algorithms for building pyramids and reconstructing from pyramids [7].

4 Learning

Our goal in adapting the wavelet model to natural images is to find the functions \psi_i and \phi that minimize the description length \mathcal{L} of images under the model

\mathcal{L} = -\langle \log P(I|\theta) \rangle    (14)

P(I|\theta) = \int P(I|a,\theta) \, P(a|\theta) \, da    (15)

A learning rule for the basis functions may be derived by gradient descent on \mathcal{L}:

\frac{\partial \mathcal{L}}{\partial \theta_i} = -\lambda_N \left\langle \left\langle e^T \frac{\partial G}{\partial \theta_i} a \right\rangle_{P(a|I,\theta)} \right\rangle    (16)

Instead of sampling from the full posterior distribution, however, we utilize a simpler approximation in which a single sample is taken at the posterior maximum, and so we have

\Delta \theta_i \propto \hat{e}^T \frac{\partial G}{\partial \theta_i} \hat{a}    (17)

where \hat{e} = I - G\hat{a}. The price we pay for this approximation, though, is that the basis functions will grow without bound, since the greater their norm, |G_k|, the smaller each a_k will become, thus decreasing the sparseness penalty in (11). This trivial solution is avoided by adaptively rescaling the basis functions after each learning step so that a target variance on the coefficients is met, as described in an earlier paper [1].
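A minimal sketch of the MAP inference in equations (11)-(13), using a small random G in place of the wavelet pyramid operator (in the actual model these matrix products are carried out with fast pyramid routines); the problem sizes, step size, and parameter values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative overcomplete problem: 64-pixel image, 128 coefficients.
n_pix, n_coef = 64, 128
G = rng.standard_normal((n_pix, n_coef)) / np.sqrt(n_pix)  # stand-in for the pyramid
I = rng.standard_normal(n_pix)

lam_N, beta, sigma = 100.0, 1.0, 0.1  # noise precision and Cauchy-prior shape

def S_prime(a):
    # derivative of S(a) = beta * log(1 + (a/sigma)^2)
    return beta * 2.0 * a / (sigma**2 + a**2)

# Discretized gradient descent on (11); each step follows equation (12)
# with the residual e of equation (13).
a = np.zeros(n_coef)
step = 1e-3
for _ in range(2000):
    e = I - G @ a                                # equation (13)
    a += step * (lam_N * G.T @ e - S_prime(a))   # equation (12)

energy = 0.5 * lam_N * np.sum((I - G @ a)**2) + np.sum(beta * np.log1p((a / sigma)**2))
```

Because S is non-convex, this finds only a local minimum of (11), as the paper notes.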
The update rules for \psi_i and \phi are then derived from (17), and may be expressed in terms of the following recursive formulas:

\Delta \psi_i(m,n) = F_\psi(e(x,y), m, n, 0)    (18)

F_\psi(f, m, n, l) = \sum_{x,y} f(2x+m, 2y+n) \, a_i^l(x,y) + F_\psi([f * \phi]\downarrow 2, m, n, l+1)

\Delta \phi(m, n) = F
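To make the recursion these update rules differentiate through concrete, here is a sketch of the synthesis pass of equations (2)-(5). The kernel sizes, band count, and band layout are illustrative assumptions, and plain zero-padded filtering stands in for the paper's fast pyramid routines:

```python
import numpy as np

def upsample2(f):
    # Equation (5): place f(x, y) at even coordinates, zeros elsewhere.
    up = np.zeros((2 * f.shape[0], 2 * f.shape[1]))
    up[::2, ::2] = f
    return up

def filter_same(f, k):
    # Zero-padded 'same'-size 2-D convolution for an odd-sized kernel k.
    k = k[::-1, ::-1]  # flip so this is convolution, not correlation
    kh, kw = k.shape
    pad = np.pad(f, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(f)
    for dy in range(kh):
        for dx in range(kw):
            out += k[dy, dx] * pad[dy:dy + f.shape[0], dx:dx + f.shape[1]]
    return out

def synthesize(coeffs, psi, phi):
    # Equations (2)-(4): coeffs[l] is a dict {band i: a_i^l} for l < L-1;
    # coeffs[L-1] is the single lowpass band a^{L-1}(x, y).
    L = len(coeffs)
    g = coeffs[L - 1]                                    # g(x, y, L-1)
    for l in range(L - 2, -1, -1):                       # descend to full resolution
        f = filter_same(upsample2(g), phi)               # [g(., l+1) up 2] * phi
        for i, a in coeffs[l].items():
            f += filter_same(upsample2(a), psi[i])       # + [a_i^l up 2] * psi_i
        g = f
    return g                                             # I(x, y) = g(x, y, 0) + v

# Illustrative 3-level, 2-band pyramid generating a 16x16 image.
rng = np.random.default_rng(2)
phi = np.ones((5, 5)) / 25.0
psi = {1: rng.standard_normal((5, 5)), 2: rng.standard_normal((5, 5))}
coeffs = [
    {1: rng.standard_normal((8, 8)), 2: rng.standard_normal((8, 8))},  # l = 0
    {1: rng.standard_normal((4, 4)), 2: rng.standard_normal((4, 4))},  # l = 1
    rng.standard_normal((4, 4)),                                       # l = 2, lowpass
]
image = synthesize(coeffs, psi, phi)
```

Each level's coefficients are upsampled once when introduced and then carried through every remaining upsample-and-convolve step with phi, which is what makes the basis self-similar across scale.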