{"title": "Learning Sparse Codes with a Mixture-of-Gaussians Prior", "book": "Advances in Neural Information Processing Systems", "page_first": 841, "page_last": 847, "abstract": null, "full_text": "Learning sparse codes with a \nmixture-of-Gaussians prior \n\nBruno A. Olshausen \n\nDepartment of Psychology and \n\nCenter for Neuroscience, UC Davis \n\n1544 Newton Ct. \nDavis, CA 95616 \n\nK. Jarrod Millman \n\nCenter for Neuroscience, UC Davis \n\n1544 Newton Ct. \nDavis, CA 95616 \n\nkjmillman@ucdavis. edu \n\nbaolshausen@ucdavis.edu \n\nAbstract \n\nWe describe a method for learning an overcomplete set of basis \nfunctions for the purpose of modeling sparse structure in images. \nThe sparsity of the basis function coefficients is modeled with a \nmixture-of-Gaussians distribution. One Gaussian captures non(cid:173)\nactive coefficients with a small-variance distribution centered at \nzero, while one or more other Gaussians capture active coefficients \nwith a large-variance distribution. We show that when the prior is \nin such a form, there exist efficient methods for learning the basis \nfunctions as well as the parameters of the prior. The performance \nof the algorithm is demonstrated on a number of test cases and \nalso on natural images. The basis functions learned on natural \nimages are similar to those obtained with other methods, but the \nsparse form of the coefficient distribution is much better described. \nAlso, since the parameters of the prior are adapted to the data, no \nassumption about sparse structure in the images need be made a \npriori, rather it is learned from the data. \n\n1 \n\nIntroduction \n\nThe general problem we address here is that of learning a set of basis functions \nfor representing natural images efficiently. 
Previous work using a variety of optimization schemes has established that the basis functions which best code natural images in terms of sparse, independent components resemble a Gabor-like wavelet basis, in which the basis functions are spatially localized, oriented, and bandpass in spatial frequency [1, 2, 3, 4]. In order to tile the joint space of position, orientation, and spatial frequency in a manner that yields useful image representations, it has also been advocated that the basis set be overcomplete [5], meaning that the number of basis functions exceeds the dimensionality of the images being coded. A major challenge in learning overcomplete bases, though, comes from the fact that the posterior distribution over the coefficients must be sampled during learning. When the posterior is sharply peaked, as it is when a sparse prior is imposed, conventional sampling methods become especially cumbersome. \n\n842    B. A. Olshausen and K. J. Millman \n\nOne approach to dealing with the problems associated with overcomplete codes and sparse priors is suggested by the form of the resulting posterior distribution over the coefficients averaged over many images. Figure 1 shows the posterior distribution of one of the coefficients in a 4× overcomplete representation. The sparse prior that was imposed in learning was a Cauchy distribution and is overlaid (dashed line). The coefficients do not fit this imposed prior very well; instead they appear to occupy one of two states: an inactive state in which the coefficient is set nearly to zero, and an active state in which the coefficient takes on some significant non-zero value along a continuum. This suggests that the appropriate choice of prior is one that is capable of capturing these two discrete states.
\n\nFigure 1: Posterior distribution of a coefficient (histogram over many images; horizontal axis is coefficient value) with the imposed Cauchy prior overlaid (dashed line). \n\nOur approach to modeling this form of sparse structure uses a mixture-of-Gaussians prior over the coefficients. A set of binary or ternary state variables determines whether each coefficient is in the active or inactive state, and the coefficient is then Gaussian distributed with a mean and variance that depend on the state variable. An important advantage of this approach, with regard to the sampling problems mentioned above, is that the use of Gaussian distributions allows an analytical solution for integrating over the posterior distribution for a given setting of the state variables. The only sampling that then needs to be done is over the binary or ternary state variables, and we show here that this problem is a tractable one. This approach differs from that taken previously by Attias [6] in that we do not use variational methods to approximate the posterior, but rather rely on sampling to adequately characterize the posterior distribution over the coefficients. \n\n2 Mixture-of-Gaussians model \n\nAn image, I(x, y), is modeled as a linear superposition of basis functions, φi(x, y), with coefficients ai, plus Gaussian noise ν(x, y): \n\nI(x, y) = Σi ai φi(x, y) + ν(x, y)    (1) \n\nIn what follows this will be expressed in vector-matrix notation as I = Φa + ν. The prior probability distribution over the coefficients is factorial, with the distribution over each coefficient ai modeled as a mixture-of-Gaussians distribution with either two or three Gaussians (fig. 2). A set of binary or ternary state variables si then determines which Gaussian is used to describe the coefficients.
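As a concrete illustration of the generative model in eq. (1), the following sketch draws a patch from a two-Gaussian (binary-state) version of the model. The patch size, degree of overcompleteness, precisions, and activation probability below are illustrative assumptions, not values fitted in the paper, and the basis is random rather than learned.

```python
import numpy as np

# Sketch of the generative model I = Phi a + nu with a binary-state
# mixture-of-Gaussians prior on the coefficients. All numbers here
# (patch size, precisions, activation probability) are illustrative.
rng = np.random.default_rng(0)

n_pix, n_basis = 64, 128            # 8x8 patch, 2x overcomplete
p_active = 0.2                      # P(s_i = 1)
lam_a = np.array([1000.0, 10.0])    # inverse variances lambda_a(s) for s = 0, 1
lam_N = 100.0                       # noise precision lambda_N

Phi = rng.standard_normal((n_pix, n_basis))   # basis functions as columns

def sample_image():
    s = (rng.random(n_basis) < p_active).astype(int)      # state variables
    a = rng.standard_normal(n_basis) / np.sqrt(lam_a[s])  # a_i | s_i ~ N(0, 1/lambda_a(s_i))
    nu = rng.standard_normal(n_pix) / np.sqrt(lam_N)      # Gaussian pixel noise
    return Phi @ a + nu, a, s

I, a, s = sample_image()
```

Inactive coefficients (si = 0) come from the narrow Gaussian centered at zero and active ones (si = 1) from the broad Gaussian, reproducing the two discrete states visible in figure 1.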
\n\nThe total prior over both sets of variables, a and s, is of the form \n\nP(a, s) = P(a|s) P(s) = [Πi P(ai|si)] P(s)    (2) \n\nwhere P(si) determines the probability of being in the active or inactive states, and P(ai|si) is a Gaussian distribution whose mean and variance are determined by the current state si. \n\nFigure 2: Mixture-of-Gaussians prior. Left: two Gaussians (binary state variables); right: three Gaussians (ternary state variables). Each panel plots P(ai) against ai for the different states si. \n\nThe total image probability is then given by \n\nP(I|θ) = Σs P(s|θ) ∫ P(I|a, θ) P(a|s, θ) da    (3) \n\nwhere \n\nP(I|a, θ) = (1/ZλN) exp(−(λN/2) |I − Φa|²)    (4) \n\nP(a|s, θ) = (1/ZΛa(s)) exp(−½ (a − μ(s))ᵀ Λa(s) (a − μ(s)))    (5) \n\nP(s|θ) = (1/ZΛs) exp(−½ sᵀ Λs s)    (6) \n\nand the parameters θ include λN, Φ, Λa(s), μ(s), and Λs. Λa(s) is a diagonal inverse covariance matrix with elements Λa(s)ii = λai(si). (The notations Λa(s) and μ(s) are used here to explicitly reflect the dependence of the means and variances of the ai on si.) Λs is also diagonal (for now) with elements Λs,ii = λsi. The model is illustrated graphically in figure 3. \n\nFigure 3: Image model. The binary or ternary state variables si select which Gaussian governs each coefficient ai, and the coefficients generate the image I through the basis Φ. \n\n3 Learning \n\nThe objective function for learning the parameters of the model is the average log-likelihood: \n\nL = ⟨log P(I|θ)⟩    (7) \n\nMaximizing this objective will minimize the lower bound on coding length. Learning is accomplished via gradient ascent on the objective, L. The learning rules for the parameters Λs, Λa(s), μ(s), and Φ are given by: \n\n∂L/∂λsi ∝ ½ [⟨si²⟩P(s|θ) − ⟨si²⟩P(s|I,θ)]    (8) \n\n∂L/∂λai(u) = ½ [⟨δ(si − u)⟩P(s|I,θ) / λai(u) − ⟨δ(si − u) (Kii(u) − 2âi(u)μi(u) + μi²(u))⟩P(s|I,θ)]    (9) \n\n∂L/∂μi(u) = λai(u) ⟨δ(si − u) (âi(u) − μi(u))⟩P(s|I,θ)    (10) \n\n∂L/∂Φ = λN [I ⟨â(s)ᵀ⟩P(s|I,θ) − Φ ⟨K(s)⟩P(s|I,θ)]    (11) \n\nwhere u takes on values 0, 1 (binary) or −1, 0, 1 (ternary), and K(s) = H⁻¹(s) + â(s) â(s)ᵀ. (â and H are defined in eqs. 15 and 16 in the next section.) Note that in these expressions we have dropped the outer brackets averaging over images simply to reduce clutter. \n\nThus, for each image we must sample from the posterior P(s|I, θ) in order to collect the appropriate statistics needed for learning. These statistics must be accumulated over many different images, and then the parameters are updated according to the rules above. Note that this approach differs from that of Attias [6] in that we do not attempt to sum over all states, s, or to use a variational approximation to the posterior. Instead, we are effectively summing only over those states that are most probable according to the posterior. We conjecture that this scheme will work in practice because the posterior has significant probability only for a small fraction of states s, and so it can be well characterized by a relatively small number of samples. Next we present an efficient method for Gibbs sampling from the posterior. \n\n4 Sampling and inference \n\nIn order to sample from the posterior P(s|I, θ), we first cast it in Boltzmann form: \n\nP(s|I, θ) ∝ exp(−E(s))    (12) \n\nwhere \n\nE(s) = −log P(s, I|θ) = −log [ P(s|θ) ∫ P(I|a, θ) P(a|s, θ) da ] = ½ sᵀ Λs s + log ZΛa(s) + Ea|s(â, s) + ½ log det H(s) + const.    (13) \n\nand \n\nEa|s(a, s) = (λN/2) |I − Φa|² + ½ (a − μ(s))ᵀ Λa(s) (a − μ(s))    (14) \n\nâ = argmina Ea|s(a, s)    (15) \n\nH(s) = ∇∇a Ea|s(a, s) = λN Φᵀ Φ + Λa(s)    (16) \n\nGibbs sampling on P(s|I, θ) can be performed by flipping state variables si according to \n\nP(si ← s^α) = 1 / (1 + exp(ΔE(si ← s^α)))    (binary)    (17) \n\nP(si ← s^α) = exp(−ΔE(si ← s^α)) / (1 + exp(−ΔE(si ← s^α)) + exp(−ΔE(si ← s^β)))    (ternary)    (18) \n\nwhere s^α denotes the alternative to the current state si in the binary case, and s^α and s^β denote the two alternative states in the ternary case. ΔE(si ← s^α) is the change in E(s) due to changing si to s^α, \n\nΔE(si ← s^α) = E(s)|si=s^α − E(s)    (19) \n\nwhich can be evaluated in terms of Δsi = s^α − si, Δλai = λai(s^α) − λai(si), J = H⁻¹, and vi = λai(si) μi(si). Note that all computations for considering a change of state are local and involve only terms with index i; thus, deciding whether or not to change state can be computed quickly. However, if a change of state is accepted, then we must update J. Using the Sherman-Morrison formula, this can be kept to an O(N²) computation: \n\nJ ← J − [Δλak / (1 + Δλak Jkk)] J·k Jk·    (20) \n\nwhere J·k and Jk· denote the k-th column and row of J. As long as accepted state changes are rare (which we have found to be the case for sparse distributions), Gibbs sampling may be performed quickly and efficiently. In addition, H and J are generally very sparse matrices, so as the system is scaled up, the number of elements of â affected by a flip of si will be relatively few. \n\nIn order to code images under this model, a single state of the coefficients must be chosen for a given image. We use for this purpose the MAP estimator: \n\nâ = argmaxa P(a|I, s, θ)    (21) \n\nŝ = argmaxs P(s|I, θ)    (22) \n\nMaximizing the posterior distribution over s is accomplished by assigning a temperature, \n\nP(s|I, θ) ∝ exp(−E(s)/T)    (23) \n\nand gradually lowering it until there are no more state changes. \n\n5 Results \n\n5.1 Test cases \n\nWe first trained the algorithm on a number of test cases containing known forms of both sparse and non-sparse (bimodal) structure, using both critically sampled (complete) and 2× overcomplete basis sets. The training sets consisted of 6×6-pixel image patches created by a sparse superposition of basis functions (36 or 72) with P(|si| = 1) = 0.2, λai(0) = 1000, and λai(1) = 10. The results of these test cases confirm that the algorithm is capable of correctly extracting both sparse and non-sparse structure from data; they are not shown here for lack of space. \n\n5.2 Natural images \n\nWe trained the algorithm on 8×8 image patches extracted from pre-whitened natural images. In all cases, the basis functions were initialized to random functions (white noise) and the prior was initialized to be Gaussian (both Gaussians of roughly equal variance). Shown in figure 4a, b are the results for a set of 128 basis functions (2× overcomplete) in the two-Gaussian case. In the three-Gaussian case, the prior was initialized to be platykurtic (all three Gaussians of equal variance but offset at three different positions); thus, in this case the sparse form of the prior emerged completely from the data. The resulting priors for two of the coefficients are shown in figure 4c, with the posterior distribution averaged over many images overlaid. For some of the coefficients the posterior distribution matches the mixture-of-Gaussians prior well, but for others the tails appear more Laplacian in form. Also, it appears that the extra complexity offered by having three Gaussians is not utilized: both outer Gaussians move to the center position and take on about the same mean. When a non-sparse, bimodal prior is imposed, the basis function solution does not become localized, oriented, and bandpass as it does with sparse priors.
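Because the posterior over a is Gaussian once s is fixed, the quantities â and H of eqs. 15 and 16 can be computed in closed form: setting the gradient of Ea|s to zero gives â = H⁻¹(λN Φᵀ I + Λa(s) μ(s)). The following sketch checks this numerically; the basis, image, and parameter values are arbitrary illustrative assumptions (random basis, zero means), not trained quantities.

```python
import numpy as np

# For fixed s, the posterior over a is Gaussian with mode a_hat (eq. 15)
# and inverse covariance H (eq. 16). Solving grad E_{a|s} = 0 gives
# a_hat = H^{-1} (lam_N Phi^T I + Lambda_a mu). Values are illustrative.
rng = np.random.default_rng(2)

n_pix, n_basis = 16, 24
Phi = rng.standard_normal((n_pix, n_basis))
lam_N = 50.0
s = (rng.random(n_basis) < 0.2).astype(int)
lam_a = np.where(s == 1, 10.0, 1000.0)        # diagonal of Lambda_a(s)
mu = np.zeros(n_basis)                        # mu(s) = 0 for simplicity
I = rng.standard_normal(n_pix)                # an arbitrary stand-in image

def E_a(a):
    # the quadratic energy E_{a|s}(a, s) whose minimizer is a_hat
    r = I - Phi @ a
    return 0.5 * lam_N * r @ r + 0.5 * (a - mu) @ (lam_a * (a - mu))

H = lam_N * Phi.T @ Phi + np.diag(lam_a)                    # eq. 16
a_hat = np.linalg.solve(H, lam_N * Phi.T @ I + lam_a * mu)  # eq. 15
```

Since Ea|s is a strictly convex quadratic, any perturbation of a_hat increases the energy, which is what makes the analytic integration over a in eq. (3) possible.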
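When a flip of sk is accepted during Gibbs sampling, only the k-th diagonal entry of Λa(s) changes, so the Sherman-Morrison update of eq. (20) refreshes J = H⁻¹ in O(N²) rather than the O(N³) of a full inversion. A minimal numerical check of that update, using arbitrary test values rather than trained parameters:

```python
import numpy as np

# Numerical check of the Sherman-Morrison update (eq. 20): flipping one
# state s_k changes only the k-th diagonal entry of Lambda_a(s), so
# J = H^{-1} can be updated with a rank-one correction. The basis and
# precisions below are arbitrary test values.
rng = np.random.default_rng(1)

n_pix, n_basis = 24, 32
Phi = rng.standard_normal((n_pix, n_basis))
lam_N = 50.0
lam_a = rng.uniform(5.0, 20.0, n_basis)   # current diagonal of Lambda_a(s)

H = lam_N * Phi.T @ Phi + np.diag(lam_a)  # eq. 16
J = np.linalg.inv(H)

k, d_lam = 7, 990.0   # flip s_k: change in lambda_a_k, e.g. ~10 -> ~1000

# eq. 20: J <- J - [d_lam / (1 + d_lam * J_kk)] * (column k of J)(row k of J)
J_fast = J - (d_lam / (1.0 + d_lam * J[k, k])) * np.outer(J[:, k], J[k, :])

# direct recomputation for comparison
H2 = H.copy()
H2[k, k] += d_lam
J_direct = np.linalg.inv(H2)
```

The rank-one update agrees with direct recomputation of H⁻¹ at a fraction of the cost, which is why accepted flips remain cheap during sampling.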
\n\n5.3 Coding efficiency \n\nWe evaluated the coding efficiency by quantizing the coefficients to different levels and calculating the total coefficient entropy as a function of the distortion introduced by quantization. This was done for basis sets containing 48, 64, 96, and 128 basis functions (fig. 4d). At high SNRs the overcomplete basis sets yield better coding efficiency, despite the fact that there are more coefficients to code. However, the point at which this occurs appears to be well beyond the point where errors are no longer perceptually noticeable (around 14 dB). \n\n6 Conclusions \n\nWe have shown here that both the prior and basis functions of our image model can be adapted to natural images. Without sparseness being imposed, the model both seeks distributions that are sparse and learns the appropriate basis functions for this distribution. Our conjecture that a small number of samples allows the posterior to be sufficiently characterized appears to hold: in all cases here, averages were collected over 40 Gibbs sweeps, with 10 sweeps for initialization. The algorithm proved capable of extracting the structure in challenging datasets in high-dimensional spaces. \n\nThe overcomplete image codes have the lowest coding cost at high SNR levels, but at levels that appear higher than is practically useful. On the other hand, the sum of marginal entropies likely underestimates the true entropy of the coefficients considerably, as there are certainly statistical dependencies among the coefficients. So it may still be the case that the overcomplete bases will show a win at lower SNRs when these dependencies are included in the model (through the coupling term Λs). \n\nFigure 4: a, An overcomplete set of 128 basis functions and, b, the priors (vertical axis is log-probability) learned from natural images. c, Two of the priors learned from a three-Gaussian mixture using 64 basis functions, with the posterior distribution averaged over many coefficients overlaid. d, Rate-distortion curves comparing the coding efficiency of the different learned basis sets. \n\nAcknowledgments \n\nThis work was supported by NIH grant R29-MH057921. \n\nReferences \n\n[1] Olshausen BA, Field DJ (1997) \"Sparse coding with an overcomplete basis set: A strategy employed by V1?\" Vision Research, 37: 3311-3325. \n[2] Bell AJ, Sejnowski TJ (1997) \"The independent components of natural images are edge filters,\" Vision Research, 37: 3327-3338. \n[3] van Hateren JH, van der Schaaf A (1997) \"Independent component filters of natural images compared with simple cells in primary visual cortex,\" Proc. Royal Soc. Lond. B, 265: 359-366. \n[4] Lewicki MS, Olshausen BA (1999) \"A probabilistic framework for the adaptation and comparison of image codes,\" JOSA A, 16(7): 1587-1601. \n[5] Simoncelli EP, Freeman WT, Adelson EH, Heeger DJ (1992) \"Shiftable multiscale transforms,\" IEEE Transactions on Information Theory, 38(2): 587-607. \n[6] Attias H (1999) \"Independent factor analysis,\" Neural Computation, 11: 803-852. \n", "award": [], "sourceid": 1668, "authors": [{"given_name": "Bruno", "family_name": "Olshausen", "institution": null}, {"given_name": "K.", "family_name": "Millman", "institution": null}]}