{"title": "Group Sparse Coding with a Laplacian Scale Mixture Prior", "book": "Advances in Neural Information Processing Systems", "page_first": 676, "page_last": 684, "abstract": "We propose a class of sparse coding models that utilizes a Laplacian Scale Mixture (LSM) prior to model dependencies among coefficients. Each coefficient is modeled as a Laplacian distribution with a variable scale parameter, with a Gamma distribution prior over the scale parameter. We show that, due to the conjugacy of the Gamma prior, it is possible to derive efficient inference procedures for both the coefficients and the scale parameter. When the scale parameters of a group of coefficients are combined into a single variable, it is possible to describe the dependencies that occur due to common amplitude fluctuations among coefficients, which have been shown to constitute a large fraction of the redundancy in natural images. We show that, as a consequence of this group sparse coding, the resulting inference of the coefficients follows a divisive normalization rule, and that this may be efficiently implemented a network architecture similar to that which has been proposed to occur in primary visual cortex. We also demonstrate improvements in image coding and compressive sensing recovery using the LSM model.", "full_text": "Group Sparse Coding with a Laplacian Scale Mixture\n\nPrior\n\nPierre J. Garrigues\n\nIQ Engines, Inc.\n\nBerkeley, CA 94704\n\npierre.garrigues@gmail.com\n\nBruno A. Olshausen\n\nHelen Wills Neuroscience Institute\n\nSchool of Optometry\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nbaolshausen@berkeley.edu\n\nAbstract\n\nWe propose a class of sparse coding models that utilizes a Laplacian Scale Mixture\n(LSM) prior to model dependencies among coef\ufb01cients. Each coef\ufb01cient is mod-\neled as a Laplacian distribution with a variable scale parameter, with a Gamma\ndistribution prior over the scale parameter. 
We show that, due to the conjugacy\nof the Gamma prior, it is possible to derive ef\ufb01cient inference procedures for both\nthe coef\ufb01cients and the scale parameter. When the scale parameters of a group of\ncoef\ufb01cients are combined into a single variable, it is possible to describe the de-\npendencies that occur due to common amplitude \ufb02uctuations among coef\ufb01cients,\nwhich have been shown to constitute a large fraction of the redundancy in natu-\nral images [1]. We show that, as a consequence of this group sparse coding, the\nresulting inference of the coef\ufb01cients follows a divisive normalization rule, and\nthat this may be ef\ufb01ciently implemented in a network architecture similar to that\nwhich has been proposed to occur in primary visual cortex. We also demonstrate\nimprovements in image coding and compressive sensing recovery using the LSM\nmodel.\n\n1\n\nIntroduction\n\nThe concept of sparsity is widely used in the signal processing, machine learning and statistics\ncommunities for model \ufb01tting and solving inverse problems. It is also important in neuroscience as\nit is thought to underlie the neural representations used by the brain. The operation to compute the\nsparse representation of a signal x \u2208 Rn with respect to a dictionary of basis functions \u03a6 \u2208 Rn\u00d7m\ncan be implemented via an (cid:96)1-penalized least-square problem commonly referred to as Basis Pursuit\nDenoising (BPDN) [2] or Lasso [3]\n\nmin\n\ns\n\n1\n2\n\n(cid:107)x \u2212 \u03a6s(cid:107)2\n\n2 + \u00b5(cid:107)s(cid:107)1,\n\n(1)\n\nwhere \u00b5 is a regularization parameter that controls the tradeoff between the quality of the reconstruc-\ntion and the sparsity. This approach has been applied to problems such as image coding, compressive\nsensing [4], or classi\ufb01cation [5]. 
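As a concrete illustration of the BPDN problem (1), it can be minimized by a plain iterative soft-thresholding (ISTA) loop: alternate a gradient step on the quadratic term with the soft-threshold proximal operator of the \u21131 penalty. This is a minimal sketch for illustration only, not one of the specialized solvers cited above [6, 7, 8, 9], and the function names and step-size choice are our own.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bpdn_ista(x, Phi, mu, n_iter=1000):
    # Minimize 0.5*||x - Phi s||_2^2 + mu*||s||_1 (problem (1) in the text)
    # by iterative soft-thresholding with step size 1/L.
    L = np.linalg.norm(Phi, 2) ** 2   # Lipschitz constant of the gradient
    s = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        s = soft_threshold(s - Phi.T @ (Phi @ s - x) / L, mu / L)
    return s
```

Because the proximal step sets small entries exactly to zero, the iterates are genuinely sparse, which is the property of the \u21131 penalty exploited throughout the paper.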
The (cid:96)1 penalty leads to solutions where typically a large number\nof coef\ufb01cients are exactly zero, which is a desirable property to achieve model selection or data\ncompression, or for obtaining interpretable results. The cost function of BPDN is convex, and many\nef\ufb01cient algorithms have been recently developed to solve this problem [6, 7, 8, 9].\nMinimizing the cost function of BPDN corresponds to MAP inference in a probabilistic model\n2 e\u2212\u03bb|si|. Hence, the\nwhere the coef\ufb01cients are independent and have Laplacian priors p(si) = \u03bb\nsignal model assumed by BPDN is linear, generative, and the basis function coef\ufb01cients are inde-\npendent. In the context of analysis-based models of natural images (for a review on analysis-based\n\n1\n\n\fand synthesis-based or generative models see [10]), it has been shown that the linear responses of\nnatural images to Gabor-like \ufb01lters have kurtotic histograms, and that there can be strong dependen-\ncies among these responses in the form of common amplitude \ufb02uctuations [11, 12, 13, 14]. It has\nalso been observed in the context of generative image models that the inferred sparse coef\ufb01cients\nexhibit pronounced statistical dependencies [15, 16], and therefore the independence assumption is\nviolated. It has been proposed in block-(cid:96)1 methods to account for dependencies among the coef\ufb01-\ncients by dividing them into subspaces such that dependencies within the subspaces are allowed, but\nnot across the subspaces [17] . This approach can produce blocking artifacts and has recently been\ngeneralized to overlapping subspaces in [18]. 
Another approach is to only allow certain con\ufb01gura-\ntions of active coef\ufb01cients [19].\nWe propose in this paper a new class of prior on the basis function coef\ufb01cients that makes it possible\nto model their statistical dependencies in a probabilistic generative model, whose inferred represen-\ntations are more sparse than those obtained with the factorial Laplacian prior, and for which we have\nef\ufb01cient inference algorithms. Our approach consists of introducing for each coef\ufb01cient a hyperprior\non the inverse scale parameter \u03bbi of the Laplacian distribution. The coef\ufb01cient prior is thus a mixture\nof Laplacian distributions which we denote \u201cLaplacian Scale Mixture\u201d (LSM), which is an analogy\nto the Gaussian scale mixture (GSM) [12]. Higher-order dependencies of feedforward responses of\nwavelet coef\ufb01cients [12] or basis functions learned using independent component analysis [14] have\nbeen captured using GSMs, and we extend this approach to a generative sparse coding model using\nLSMs.\nWe de\ufb01ne the Laplacian scale mixture in Section 2, and we describe the inference algorithms in the\nresulting sparse coding models with an LSM prior on the coef\ufb01cients in Section 3. We present an\nexample of a factorial LSM model in Section 4, and of a non-factorial LSM model in Section 5 that\nis particularly well suited to signals having the \u201cgroup sparsity\u201d property. We show that the non-\nfactorial LSM results in a divisive normalization rule for inferring the coef\ufb01cients. 
When the groups\nare organized topographically and the basis is trained on natural images, the resulting model resem-\nbles the neighborhood divisive normalization that has been hypothesized to occur in visual cortex.\nWe also demonstrate that the proposed LSM inference algorithm provides superior performance in\nimage coding and compressive sensing recovery.\n\n2 The Laplacian Scale Mixture distribution\nA random variable si is a Laplacian scale mixture if it can be written si = \u03bb\u22121\ni ui, where ui has\n2 e\u2212|ui|, and the multiplier variable \u03bbi is a\na Laplacian distribution with scale 1, i.e. p(ui) = 1\npositive random variable with probability p(\u03bbi). We also suppose that \u03bbi and ui are independent.\nConditioned on the parameter \u03bbi, the coef\ufb01cient si has a Laplacian distribution with inverse scale \u03bbi,\ni.e. p(si|\u03bbi) = \u03bbi\n2 e\u2212\u03bbi|si|. The distribution over si is therefore a continuous mixture of Laplacian\ndistributions with different inverse scales, and it can be computed by integrating out \u03bbi\n\n(cid:90) \u221e\n\np(si) =\n\n(cid:90) \u221e\n\np(si| \u03bbi)p(\u03bbi)d\u03bbi =\n\n0\n\n0\n\ne\u2212\u03bbi|si|p(\u03bbi)d\u03bbi.\n\n\u03bbi\n2\n\nNote that for most choices of p(\u03bbi) we do not have an analytical expression for p(si). We denote\nsuch a distribution a Laplacian Scale Mixture (LSM). It is a special case of the Gaussian Scale\nMixture (GSM) [12] as the Laplacian distribution can be written as a GSM.\n\n3\n\nInference in a sparse coding model with LSM prior\n\nWe propose the linear generative model\n\nx = \u03a6s + \u03bd =\n\nm(cid:88)\n\nsi\u03d5i + \u03bd,\n\n(2)\n\nwhere x \u2208 Rn, \u03a6 = [\u03d51, . . . , \u03d5m] \u2208 Rn\u00d7m is an overcomplete transform or basis set, and the\ncolumns \u03d5i are its basis functions. \u03bd \u223c N (0, \u03c32In) is small Gaussian noise. The coef\ufb01cients are\nendowed with LSM distributions. 
They can be used to reconstruct x and are called the synthesis\ncoef\ufb01cients.\n\ni=1\n\n2\n\n\fGiven a signal x, we wish to infer its sparse representation s in the dictionary \u03a6. We consider in this\nsection the computation of the maximum a posteriori (MAP) estimate of the coef\ufb01cients s given the\ninput signal x. Using Bayes\u2019 rule we have p(s | x) \u221d p(x | s)p(s), and therefore the MAP estimate\n\u02c6s is given by\n\n{\u2212 log p(s | x)} = arg min\n\n{\u2212 log p(x | s) \u2212 log p(s)}.\n\n(3)\n\n\u02c6s = arg min\n\ns\n\ns\n\nIn general it is dif\ufb01cult to compute the MAP estimate with an LSM prior on s since we do not\nnecessarily have an analytical expression for the log-likelihood log p(s). However, we can compute\nthe complete log-likelihood log p(s, \u03bb) analytically\n\nlog p(s, \u03bb) = log p(s | \u03bb) + log p(\u03bb) = \u2212\u03bbi|si| + log\n\n+ log p(\u03bb).\n\n\u03bbi\n2\n\nHence, if we also observed the latent variable \u03bb, we would have an objective function that can be\nmaximized with respect to s. The standard approach in machine learning when confronted with\nsuch a problem is the Expectation-Maximization (EM) algorithm, and we derive in this Section an\nEM algorithm for the MAP estimation of the coef\ufb01cients. We use Jensen\u2019s inequality and obtain the\nfollowing upper bound on the posterior likelihood\n\u2212 log p(s | x) \u2264 \u2212 log p(x | s) \u2212\n\nd\u03bb := L(q, s),\n\nq(\u03bb) log\n\n(cid:90)\n\n(4)\n\np(s, \u03bb)\nq(\u03bb)\n\n\u03bb\n\nwhich is true for any probability distribution q(\u03bb). Performing coordinate descent in the auxiliary\nfunction L(q, s) leads to the following updates that are usually called the E step and the M step.\n\n(5)\n\n(6)\n\n(7)\n\nE Step\n\nM Step\n\nq(t+1) = arg min\n\nq\n\ns(t+1) = arg min\n\nL(q, s(t))\nL(q(t+1), s)\n\nLet < . >q denote the expectation with respect to q(\u03bb). 
The M Step (6) simpli\ufb01es to\n\ns(t+1) = arg min\n\ns\n\n1\n\n2\u03c32(cid:107)x \u2212 \u03a6s(cid:107)2\n\n2 +\n\n(cid:104)\u03bbi(cid:105)q(t+1) |si|,\n\ns\n\nm(cid:88)\n\ni=1\n\nwhich is a least-square problem regularized by a weighted sum of the absolute values of the coef\ufb01-\ncients. It is a quadratic program very similar to BPDN, and we can therefore use ef\ufb01cient algorithms\ndeveloped for BPDN that take advantage of the sparsity of the solution. This presents a signi\ufb01cant\ncomputational advantage over the GSM prior where the inferred coef\ufb01cients are not exactly sparse.\nWe have equality in the Jensen inequality if q(\u03bb) = p(\u03bb | s). The inequality (4) is therefore tight for\nthis particular choice of q, which implies that the E step reduces to q(t+1)(\u03bb) = p(\u03bb | s(t)). Note\nthat in the M step we only need to compute the expectation of \u03bbi with respect to the maximizing\ndistribution in the E step. Hence we only need to compute the suf\ufb01cient statistics\n\n(cid:104)\u03bbi(cid:105)p(\u03bb|s(t)) =\n\n\u03bbi p(\u03bb | s(t))d\u03bb.\n\n(8)\nNote that the posterior of the multiplier given the coef\ufb01cient p(\u03bb | s) might be hard to compute. We\nwill see in Section 4.1 that it is tractable if the prior on \u03bb is factorial and each \u03bbi has a Gamma dis-\ntribution, as the Laplacian distribution and the Gamma distribution are conjugate. We can apply the\nef\ufb01cient algorithms developed for BPDN to solve (7). Furthermore, warm-start capable algorithms\nare particularly interesting in this context as we can initialize the algorithm with s(t), and we do not\nexpect the solution to change much after a few iterations of EM.\n\n\u03bb\n\n(cid:90)\n\n4 Sparse coding with a factorial LSM prior\n\nWe propose in this Section a sparse coding model where the distribution of the multipliers is facto-\nrial, and each multiplier has a Gamma distribution, i.e. 
p(\u03bbi) = (\u03b2\u03b1/\u0393(\u03b1))\u03bb\u03b1\u22121\ne\u2212\u03b2\u03bbi, where \u03b1 is\nthe shape parameter and \u03b2 is the inverse scale parameter. With this particular choice of a prior on\nthe multiplier, we can compute the probability distribution of si analytically:\n\ni\n\nThis distribution has heavier tails than the Laplacian distribution. The graphical model correspond-\ning to this generative model is shown in Figure 1.\n\np(si) =\n\n\u03b1\u03b2\u03b1\n\n2(\u03b2 + |si|)\u03b1+1 .\n\n3\n\n\f4.1 Conjugacy\n\nThe Gamma distribution and Laplacian distribution are conjugate, i.e. the posterior probability of\n\u03bbi given si is also a Gamma distribution when the prior over \u03bbi is a Gamma distribution and the\nconditional probability of si given \u03bbi is a Laplace distribution with inverse scale \u03bbi. Hence, the\nposterior of \u03bbi given si is a Gamma distribution with parameters \u03b1 + 1 and \u03b2 + |si|.\nThe conjugacy is a key property that we can use in our EM algorithm proposed in Section 3. We saw\nthat the solution of the E step is given by q(t+1)(\u03bb) = p(\u03bb | s(t)). In the factorial model we have\ni ). The solution of the E step is therefore a product of Gamma distributions\n\np(\u03bb | s) =(cid:81)\n\ni p(\u03bbi | s(t)\n\nwith parameters \u03b1 + 1 and \u03b2 + |s(t)\n\ni\n\n|, and the suf\ufb01cient statistics (8) are given by\n(cid:104)\u03bbi(cid:105)p(\u03bbi|s(t)\n\ni ) =\n\n\u03b1 + 1\n\u03b2 + |s(t)\n\n| .\n\ni\n\n(9)\n\ni\n\ni\n\nA coef\ufb01cient that has a small value after t iterations but is not exactly zero will have in the next\niteration a large reweighting factor \u03bb(t+1)\n, which increases the chance that it will be set to zero\nin the next iteration, resulting in a sparser representation. 
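To make the E/M alternation concrete, the following sketch implements MAP inference under the factorial LSM: the E step uses the conjugate sufficient statistic (9), and the M step (7) is a weighted-\u21131 least-squares problem, solved here with a simple soft-thresholding loop. The inner solver and all default hyperparameter values are illustrative assumptions, not the implementation used in the paper's experiments.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the weighted l1 penalty (t may be a vector).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lsm_em_map(x, Phi, sigma2=0.01, alpha=1.0, beta=0.1, n_em=5, n_inner=500):
    """EM for MAP estimation of s under the factorial LSM prior.
    M step: min_s 1/(2*sigma2)*||x - Phi s||^2 + sum_i lam_i |s_i|  (eq. 7)
    E step: <lambda_i> = (alpha + 1)/(beta + |s_i|)                 (eq. 9)
    """
    L = np.linalg.norm(Phi, 2) ** 2
    s = np.zeros(Phi.shape[1])
    lam = np.full(Phi.shape[1], (alpha + 1.0) / beta)  # E step evaluated at s = 0
    for _ in range(n_em):
        # M step: weighted-l1 problem, warm-started at the previous s.
        for _ in range(n_inner):
            s = soft_threshold(s - Phi.T @ (Phi @ s - x) / L,
                               lam * sigma2 / L)
        # E step: posterior of lambda_i is Gamma(alpha + 1, beta + |s_i|),
        # whose mean is the reweighting factor for the next M step.
        lam = (alpha + 1.0) / (beta + np.abs(s))
    return s, lam
```

Note how the loop reproduces the behavior described above: a coefficient that stays near zero receives a large `lam` and is driven exactly to zero, while a salient coefficient receives a small `lam` and is barely penalized.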
On the other hand, a coef\ufb01cient having\na large value after t iterations corresponds to a feature that is very salient in the signal x.\nIt is\ntherefore bene\ufb01cial to reduce its corresponding inverse scale \u03bb(t+1)\nsuch that it is not penalized and\ncan account for as much information as possible.\nWe saw that with the Gamma prior we can compute the distribution of si analytically, and therefore\nwe can compute the gradient of log p(s | x) with respect to s. Hence another inference algorithm\nis to descend the cost function in (3) directly using a method such as conjugate gradient, or the\nmethod proposed in [20] where the authors also exploit the conjugacy of the Laplacian and Gamma\npriors. We argue here that the EM algorithm is in fact more ef\ufb01cient. The solution of (7) indeed has\ntypically few elements that are non-zero, and the computational complexity scales with the number\nof non-zero coef\ufb01cients [6, 7]. On the other hand, a gradient-based method will have a harder time\nidentifying the support of the solution, and therefore the required computations will involve all the\ncoef\ufb01cients, which is computationally expensive.\nThe update formula (9) is coincidentally equivalent to the reweighted L1 minimization scheme pro-\nposed by Cand`es et al. [21]. They solve the following sequence of problems\n\n|si| subject to (cid:107)x \u2212 \u03a6s(cid:107)2 \u2264 \u03b4\n\n\u03bb(t)\ni\n\n(10)\n\nm(cid:88)\n\ns(t+1) = arg min\n\ns\n\ni=1\n\ni\n\n= 1/(\u03b2 + |s(t)\n\n|) (which is identical to our rule when \u03b1 = 0). The authors show\nwith update \u03bb(t+1)\nthat the solutions achieved by their algorithm are more sparse than the solution where \u03bbi = 1 for\nall i. Whereas they derive this rule from mathematical intuitions regarding the L1 ball, we show\nthat this update rule follows from from Bayesian inference assuming a Gamma prior over \u03bb. 
It\nwas also shown that evidence maximization in a sparse coding model with an automatic relevance\ndetermination prior can also be solved via a sequence of reweighted (cid:96)1 optimization problems [22].\n\ni\n\n4.2 Application to image coding\n\nIt has been shown that the convex relaxation consisting of replacing the (cid:96)0 norm with the (cid:96)1 norm is\nable to identify the sparsest solution under some conditions on the dictionary of basis functions [23].\nHowever, these conditions are typically not veri\ufb01ed for the dictionaries learned from the statistics\nof natural images [24]. For instance, it was observed in [16] that it is possible to infer sparser\nrepresentations with a prior over the coef\ufb01cients that is a mixture of a delta function at zero and a\nGaussian distribution than with the Laplacian prior. We show that our proposed inference algorithm\nalso leads to representations that are more sparse, as the LSM prior with Gamma hyperprior has\nheavier tails than the Laplacian distribution. We selected 1000 16 \u00d7 16 image patches at random,\nand computed their sparse representations in a dictionary with 256 basis functions using both the\nconventional Laplacian prior and our LSM prior. The dictionary is learned from the statistics of\nnatural images [24] using a Laplacian prior over the coef\ufb01cients. To ensure that the reconstruction\nerror is the same in both cases, we solve the constrained version of the problem as in [21], where we\nrequire that the signal to noise ratio of the reconstruction is equal to 10. We choose \u03b2 = 0.01 and 5\n\n4\n\n\fEM iterations. We can see in Figure 2 that the representations using the LSM prior are indeed more\nsparse by approximately a factor of 2. 
Note that the computational complexity to compute these\nsparse representations is much lower than that of [16].\n\nFigure 1: Graphical model representation of our\nproposed generative model where the multipli-\ners distribution is factorial.\n\nFigure 2: Sparsity comparison. On the x-axis\n(resp. y-axis) is the (cid:96)0 norm of the represen-\ntation inferred with the Laplacian prior (resp.\nLSM prior).\n\n5 Sparse coding with a non-factorial model\n\nIt has been shown that many natural signals such as sound or images have a particular type of\nhigher-order, sparse structure in which active coef\ufb01cients occur in groups corresponding to basis\nfunctions having similar properties (position, orientation, or frequency tuning) [25, 1]. We focus in\nthis Section on a class of signals that has a particular type of higher-order structure where the active\ncoef\ufb01cients occur in groups. We show here that the LSM prior can be used to capture this group\nstructure in natural images, and we propose an ef\ufb01cient inference algorithm for this case.\n\n5.1 Group sparsity\n\nor neighborhoods indexed by Nk, i.e. {1, . . . , m} = (cid:83)\nindices of the nonzero coef\ufb01cients are given by(cid:83)\n\nWe consider a dictionary \u03a6 such that the basis functions can be divided in a set of disjoint groups\nk\u2208\u039b Nk, and Ni \u2229 Nj = \u2205 if i (cid:54)= j. A\nsignal having the group sparsity property is such that the sparse coef\ufb01cients occur in groups, i.e. the\n\nk\u2208\u0393 Nk, where \u0393 is a subset of \u039b.\n\nThe group sparsity structure can be captured with the LSM prior by having all the coef\ufb01cients in a\ngroup share the same inverse scale parameter, i.e. for all i \u2208 Nk, \u03bbi = \u03bb(k). The corresponding\ngraphical model is shown in Figure 3. This addresses the case where dependencies are allowed\nwithin groups, but not across groups as in the block-(cid:96)1 method [17]. 
Note that for some types of\ndictionaries it is more natural to consider overlapping groups to avoid blocking artifacts. We propose\nin Section 5.2 inference algorithms for both overlapping and non-overlapping cases.\n\nFigure 3: The two groups N(k) = {i \u2212 2, i \u2212\n1, i} and N(l) = {i + 1, i + 2, i + 3} are non-\noverlapping.\n\nFigure 4: The basis function coef\ufb01cients in the\nneighborhood de\ufb01ned by N (i) = {i\u22121, i, i+1}\nshare the same multiplier \u03bbi.\n\n5.2\n\nInference\n\nIn the EM algorithm we proposed in Section 3, the suf\ufb01cient statistics that are computed in the E\nstep are (cid:104)\u03bbi(cid:105)p(\u03bbi|s(t)) for all i. We suppose as in Section 4.1 that the prior on \u03bb(k) is Gamma with\n\n5\n\ns1s2smsjx1xnxi\u03bb1\u03bb2\u03bbm\u03bbj\u03c6ij020406080100120140Laplacian prior020406080100120140LSM priorSparsity of the representationsi-1\u03bb(k)si-2sisi+1si+2\u03bb(l)si+3si-1\u03bbi-1si-2si\u03bbisi+1si+2\u03bbi+2si+3\u03bbi+1\f(cid:104)\u03bbi(cid:105)p(\u03bbi|s(t)) =(cid:10)\u03bb(k)\n\nparameters \u03b1 and \u03b2. Using the structure of the dependencies in the probabilistic model shown in\nFigure 3, we have\n\n(11)\nwhere the index i is in the group Nk, and sNk = (sj)j\u2208Nk is the vector containing all the coef\ufb01cients\nin the group. Using the conjugacy of the Laplacian and Gamma distributions, the distribution of \u03bb(k)\ngiven all the coef\ufb01cients in the neighborhood is a Gamma distribution with parameters \u03b1 +|Nk| and\n|sj|, where |Nk| denotes the size of the neighborhood. Hence (11) can be rewritten as\n\n\u03b2 +(cid:80)\n\np(\u03bb(k)|s(t)Nk\n\n(cid:11)\n\n)\n\nj\u2208Nk\n\nfollows\n\n\u03bb(t+1)\n(k) =\n\n\u03b2 +(cid:80)\n\n\u03b1 + |Nk|\nj\u2208Nk\n\nj | .\n|s(t)\n\n/(\u03b2 +(cid:80)\n\nThe resulting update rule is a form of divisive normalization. 
We saw in Section 2 that we can write\nsk = \u03bb\u22121\n(k)uk, where uk is a Laplacian random variable with scale 1, and thus after convergence we\nhave u(\u221e)\nk = (\u03b1 + |Nk|)s(\u221e)\n|). Such rescaling operations are also thought to\nj\nplay an important role in the visual system. [25]\nNow let us consider the case where coef\ufb01cient neighborhoods are allowed to overlap. Let N (i)\ndenote the indices of the neighborhood that is centered around si (see Figure 4 for an example). We\npropose to estimate the scale parameter \u03bbi by only considering the coef\ufb01cients in N (i), and suppose\nthat they all share the same multiplier \u03bbi. In this case the EM update is given by\n\n|s(\u221e)\n\nj\u2208Nk\n\nk\n\n(12)\n\n(13)\n\n\u03bb(t+1)\ni\n\n=\n\n\u03b2 +(cid:80)\n\n\u03b1 + |N (i)|\n\nj | .\nj\u2208N (i) |s(t)\n\nNote that we have not derived this rule from a proper probabilistic model. A coef\ufb01cient is indeed a\nmember of many neighborhoods as shown in Figure 4, and the structure of the dependencies implies\np(\u03bbi | s) (cid:54)= p(\u03bbi | sN (i)). However, we show experimentally that estimating the multiplier using\n(13) gives good performance. A similar approximation is used in the GSM analysis-based model\n[26]. Note that the noise shaping algorithm, which bears similarities with the iterative thresholding\nalgorithm developed for BPDN [7], is modi\ufb01ed in [27] using an update that is essentially inversely\nproportional to ours. The authors show improved coding ef\ufb01ciency in the context of natural images.\n\n5.3 Compressive sensing recovery\nIn compressed sensing, we observe a number n of random projections of a signal s0 \u2208 Rm, and\nit is in principle impossible to recover s0 if n < m. However, if s0 has p non-zero coef\ufb01cients, it\nhas been shown in [28] that it is suf\ufb01cient to use n \u221d p log m such measurements. We denote by\nW \u2208 Rn\u00d7m the measurement matrix and let y = W s0 be the observations. 
A standard method to\nobtain the reconstruction is to use the solution of the Basis Pursuit (BP) problem\n\n\u02c6s = arg min\n\ns\n\n(cid:107)s(cid:107)1\n\nsubject to W s = y.\n\n(14)\n\nNote that the solution of BP is the solution of BPDN as \u00b5 converges to zero in (1), or \u03b4 = 0 in (10).\nIf the signal has structure beyond sparsity, one can in principle recover the signal with even fewer\nmeasurements using an algorithm that exploits this structure [19, 29]. We therefore compare the\nperformance of BP with the performance of our proposed LSM inference algorithms\n\n|si|\n\n\u03bb(t)\ni\n\nsubject to W s = y.\n\n(15)\n\nm(cid:88)\n\ns(t+1) = arg min\n\ns\n\ni=1\n\nWe denote by RWBP the algorithm with the factorial update (9), and RW3BP (resp. RW5BP) the\nalgorithm with our proposed divisive normalization update (13) with group size 3 (resp. 5). We\nconsider 50-dimensional signals that are sparse in the canonical basis and where the neighborhood\nsize is 3. To sample such a signal s \u2208 R50, we draw a number d of \u201ccentroids\u201d i, and we sample three\nvalues for si\u22121, si and si+1 using a normal distribution of variance 1. The groups are thus allowed\nto overlap. A compressive sensing recovery problem is parameterized by (m, n, d). To explore the\nproblem space we display the results using phase plots as in [30], which plots performance as a\nfunction of different parameter settings. We \ufb01x m = 50 and parameterize the phase plots using\nthe indeterminacy of the system indexed by \u03b4 = n/m, and the approximate sparsity of the system\n\n6\n\n\findexed by \u03c1 = 3d/m. We vary \u03b4 and \u03c1 in the range [.1, .9] using a 30 by 30 grid. 
For a given\nvalue (\u03b4, \u03c1) on the grid, we sample 10 sparse signals using the corresponding (m, n, d) parameters.\nThe underlying sparse signal is recovered using the three algorithms and we average the recovery\nerror (cid:107)\u02c6s \u2212 s0(cid:107)2/(cid:107)s0(cid:107)2 for each of them. We show in Figure 5 that RW3BP clearly outperforms\nRWBP. There is a slight improvement by going from BP to RWBP (see supplementary material),\nbut this improvement is rather small as compared with going from RWBP to RW3BP and RW5BP.\nThis illustrates the importance of using the higher-order structure of the signals in the inference\nalgorithm. The performance of RW3BP and RW5BP is comparable (see supplementary material),\nwhich shows that our algorithm is not very sensitive to the choice of the neighborhood size.\n\nFigure 5: Compressive sensing recovery results using synthetic data. Shown are the phase plots for\na sequence of BP problems with the factorial update (RWBP), and a sequence of BP problems with\nthe divisive normalization update with neighborhood size 3 (RW3BP). On the x-axis is the sparsity\nof the system indexed by \u03c1 = 3d/m, and on the y-axis is the indeterminacy of the system indexed\nby \u03b4 = n/m. At each point (\u03c1, \u03b4) in the phase plot we display the average recovery error.\n\n5.4 Application to natural images\n\nIt has been shown that adapting a dictionary of basis functions to the statistics of natural images so\nas to maximize sparsity in the coef\ufb01cients results in a set of dictionary elements whose spatial prop-\nerties match those of V1 (primary visual cortex) receptive \ufb01elds [24]. However, the basis functions\nare learned under a probabilistic model where the probability density over the basis functions coef-\n\ufb01cients is factorial, whereas the sparse coef\ufb01cients exhibit statistical dependencies [15, 16]. 
Hence,\na generative model with factorial LSM is not rich enough to capture the complex statistics of natural\nimages. We propose here to model these dependencies using a non-factorial LSM model. We \ufb01x\na topography where the basis functions coef\ufb01cients are arranged on a 2D grid, and with overlap-\nping neighborhoods of \ufb01xed size 3 \u00d7 3. The corresponding inference algorithm uses the divisive\nnormalization update (13).\n\nWe learn the optimal dictionary of basis functions \u03a6 using the learning rule \u2206\u03a6 = \u03b7(cid:10)(x \u2212 \u03a6\u02c6s)\u02c6sT(cid:11)\n\nas in [24], where \u03b7 is the learning rate, \u02c6s are the basis functions coef\ufb01cients inferred under the model\n(13), and the average is taken over a batch of size 100. We \ufb01x n = m = 256, and sample 16 \u00d7 16\nimage patches from a set of whitened images, using a total of 100000 batches. The learned basis\nfunctions are shown in Figure 6. We see here that the neighborhoods of size 3 \u00d7 3 group basis\nfunctions at a similar position, scale and orientation. The topography is similar to how neurons are\narranged in the visual cortex, and is reminiscent of the results obtained in topographic ICA [13] and\ntopographic mixture of experts models [31]. An important difference is that our model is based on a\ngenerative sparse coding model in which both inference and learning can be implemented via local\nnetwork interactions [7]. Because of the topographic organization, we also obtain a neighborhood-\nbased divisive normalization rule.\nDoes the proposed non-factorial model represent image structure more ef\ufb01ciently than those with\nfactorial priors? To answer this question we measured the models\u2019 ability to recover sparse struc-\nture in the compressed sensing setting. 
We note that the basis functions are learned such that they\nrepresent the sparse structure in images, as opposed to representing the images exactly (there is a\nnoise term in the generative model (2)). Hence, we design our experiment such that we measure\nthe recovery of this sparse structure. Using the basis functions shown in Figure 6, we \ufb01rst infer the\n\n7\n\n0.10.20.30.40.50.60.70.80.9\u03c10.10.20.30.40.50.60.70.80.9\u03b4RWBP0.00.10.20.30.40.50.60.70.80.91.00.10.20.30.40.50.60.70.80.9\u03c10.10.20.30.40.50.60.70.80.9\u03b4RW3BP0.00.10.20.30.40.50.60.70.80.91.0\fsparse coef\ufb01cients s0 of an image patch x such that (cid:107)x \u2212 \u03a6s0(cid:107)2 < \u03b4 using the inference algorithm\ncorresponding to the model. We \ufb01x \u03b4 such that the SNR is 10, and thus the three sparse approxi-\nmations for the three models contain the same amount of signal power. We then compute random\nprojections y = \u02dcW \u03a6s0 where \u02dcW is the random measurements matrix. We attempt to recover the\nsparse coef\ufb01cients as in Section 5.3 by substituting W := \u03a6 \u02dcW , and y := \u03a6s0. We compare the\nrecovery performance (cid:107)\u03a6\u02c6s\u2212 \u03a6s0(cid:107)2/(cid:107)\u03a6s0(cid:107)0 for 100 16\u00d7 16 image patches selected at random, and\nwe use 110 random projections. We can see in Figure 7 that the model with non-factorial LSM prior\noutperforms the other models as it is able to capture the group sparsity structure in natural images.\n\nFigure 7: Compressive sensing recovery. On the\nx-axis is the recovery performance for the fac-\ntorial LSM model (RWBP), and on the y-axis\nthe recovery performance for the non-factorial\nLSM model with 3 \u00d7 3 overlapping groups\n(RW3\u00d73BP). 
RW3\u00d73BP outperforms RWBP.\nSee supplementary material for the comparison\nbetween RW3\u00d73BP and BP as well as between\nRWBP and BP.\n\nFigure 6: Basis functions learned in a non-\nfactorial LSM model with overlapping groups of\nsize 3 \u00d7 3\n6 Conclusion\n\nWe introduced a new class of probability densities that can be used as a prior for the coef\ufb01cients in a\ngenerative sparse coding model of images. By exploiting the conjugacy of the Gamma and Laplacian\nprior, we were able to derive an ef\ufb01cient inference algorithm that consists of solving a sequence of\nreweighted (cid:96)1 least-square problems, thus leveraging the multitude of algorithms already developed\nfor BPDN. Our framework also makes it possible to capture higher-order dependencies through\ngroup sparsity. When applied to natural images, the learned basis functions of the model may be\ntopographically organized according to the speci\ufb01ed group structure. We also showed that exploiting\nthe group sparsity results in performance gains for compressive sensing recovery on natural images.\nAn open question is the learning of group structure, which is a topic of ongoing work.\nWe wish to acknowledge support from NSF grant IIS-0705939.\nReferences\n[1] S. Lyu and E. P. Simoncelli. Statistical modeling of images with \ufb01elds of gaussian scale mixtures. In\n\nAdvances in Neural Computation Systems (NIPS), Vancouver, Canada, 2006.\n\n[2] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on\n\nScienti\ufb01c Computing, 20(1):33\u201361, 1999.\n\n[3] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.\n\nSeries B, 58(1):267\u2013288, 1996.\n\n[4] Y. Tsaig and D.L. Donoho. Extensions of compressed sensing. Signal Processing, 86(3):549\u2013571, 2006.\n[5] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unla-\n\nbeled data. 
In Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.

[6] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

[7] C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20(10):2526–2563, October 2008.

[8] J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[9] M. Figueiredo, R. Nowak, and S. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007.

[10] M. Elad, P. Milanfar, and R. Rubinstein. Analysis versus synthesis in signal priors. Inverse Problems, 23(3):947–968, June 2007.

[11] C. Zetzsche, G. Krieger, and B. Wegmann. The atoms of vision: Cartesian or polar? Journal of the Optical Society of America A, 16(7):1554–1565, 1999.

[12] M.J. Wainwright, E.P. Simoncelli, and A.S. Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1), July 2001.

[13] A. Hyvärinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527–1558, 2001.

[14] Y. Karklin and M.S. Lewicki. A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17(2):397–423, February 2005.

[15] P. Hoyer and A. Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42:1593–1605, 2002.

[16] P.J. Garrigues and B.A. Olshausen.
Learning horizontal connections in a sparse coding model of natural images. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2007.

[17] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, February 2006.

[18] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning (ICML), 2009.

[19] R.G. Baraniuk, V. Cevher, M.F. Duarte, and C. Hegde. Model-based compressive sensing. Preprint, August 2008.

[20] I. Ramirez, F. Lecumberry, and G. Sapiro. Universal priors for sparse modeling. CAMSAP, December 2009.

[21] E.J. Candès, M.B. Wakin, and S.P. Boyd. Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Appl., to appear, 2008.

[22] D. Wipf and S. Nagarajan. A new view of automatic relevance determination. In Advances in Neural Information Processing Systems 20, 2008.

[23] J.A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.

[24] B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, June 1996.

[25] M.J. Wainwright, O. Schwartz, and E.P. Simoncelli. Natural image statistics and divisive normalization: Modeling nonlinearity and adaptation in cortical neurons. In R. Rao, B.A. Olshausen, and M.S. Lewicki, editors, Statistical Theories of the Brain. MIT Press, 2001.

[26] J. Portilla, V. Strela, M.J. Wainwright, and E.P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, 2003.

[27] R.M. Figueras and E.P. Simoncelli.
Statistically driven sparse image representation. In Proceedings of the 14th IEEE International Conference on Image Processing, volume 6, pages 29–32, September 2007.

[28] E. Candès. Compressive sampling. Proceedings of the International Congress of Mathematicians, 2006.

[29] V. Cevher, M.F. Duarte, C. Hegde, and R.G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems (NIPS), Vancouver, B.C., Canada, 2008.

[30] D. Donoho and Y. Tsaig. Fast solution of ℓ1-norm minimization problems when the solution may be sparse. Preprint, 2006.

[31] S. Osindero, M. Welling, and G.E. Hinton. Topographic product models applied to natural scene statistics. Neural Computation, 18(2):381–414, 2006.