{"title": "Sparse Code Shrinkage: Denoising by Nonlinear Maximum Likelihood Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 473, "page_last": 479, "abstract": null, "full_text": "Sparse Code Shrinkage: Denoising by Nonlinear Maximum Likelihood Estimation

Aapo Hyvärinen, Patrik Hoyer and Erkki Oja
Helsinki University of Technology
Laboratory of Computer and Information Science
P.O. Box 5400, FIN-02015 HUT, Finland
aapo.hyvarinen@hut.fi, patrik.hoyer@hut.fi, erkki.oja@hut.fi
http://www.cis.hut.fi/projects/ica/

Abstract

Sparse coding is a method for finding a representation of data in which each of the components of the representation is only rarely significantly active. Such a representation is closely related to redundancy reduction and independent component analysis, and has some neurophysiological plausibility. In this paper, we show how sparse coding can be used for denoising. Using maximum likelihood estimation of nongaussian variables corrupted by gaussian noise, we show how to apply a shrinkage nonlinearity on the components of sparse coding so as to reduce noise. Furthermore, we show how to choose the optimal sparse coding basis for denoising. Our method is closely related to the method of wavelet shrinkage, but has the important benefit over wavelet methods that both the features and the shrinkage parameters are estimated directly from the data.

1 Introduction

A fundamental problem in neural network research is to find a suitable representation for the data. One of the simplest methods is to use linear transformations of the observed data. Denote by x = (x_1, x_2, ..., x_n)^T the observed n-dimensional random vector that is the input data (e.g., an image window), and by s = (s_1, s_2, ..., s_n)^T the vector of the linearly transformed component variables.
Denoting further the n × n transformation matrix by W, the linear representation is given by

s = Wx.    (1)

We assume here that the number of transformed components equals the number of observed variables, but this need not be the case in general.

An important representation method is given by (linear) sparse coding [1, 10], in which the representation of the form (1) has the property that only a small number of the components s_i of the representation are significantly non-zero at the same time. Equivalently, this means that a given component has a 'sparse' distribution. A random variable s_i is called sparse when s_i has a distribution with a peak at zero and heavy tails, as is the case, for example, with the double exponential (or Laplace) distribution [6]; for all practical purposes, sparsity is equivalent to supergaussianity or leptokurtosis [8]. Sparse coding is an adaptive method, meaning that the matrix W is estimated for a given class of data so that the components s_i are as sparse as possible; such an estimation procedure is closely related to independent component analysis [2].

Sparse coding of sensory data has been shown to have advantages from both physiological and information processing viewpoints [1]. However, thorough analyses of the utility of such a coding scheme have been few. In this paper, we introduce and analyze a statistical method based on sparse coding. Given a signal corrupted by additive gaussian noise, we attempt to reduce the noise by soft thresholding ('shrinkage') of the sparse components. Intuitively, because only a few of the components are significantly active in the sparse code of a given data point, one may assume that the activities of components with small absolute values are purely noise and set them to zero, retaining just a few components with large activities.
This method is closely connected to the wavelet shrinkage method [3]. In fact, sparse coding may be viewed as a principled way for determining a wavelet-like basis and the corresponding shrinkage nonlinearities, based on data alone.

2 Maximum likelihood estimation of sparse components

The starting point of a rigorous derivation of our denoising method is the fact that the distributions of the sparse components are nongaussian. Therefore, we shall begin by developing a general theory that shows how to remove gaussian noise from nongaussian variables, making minimal assumptions on the data.

Denote by s the original nongaussian random variable (corresponding here to a noise-free version of one of the sparse components s_i), and by v gaussian noise of zero mean and variance σ². Assume that we only observe the random variable y:

y = s + v    (2)

and we want to estimate the original s. Denoting by p the probability density of s, and by f = -log p its negative log-density, the maximum likelihood (ML) method gives the following estimator for s:

ŝ = argmin_u [ (y - u)²/(2σ²) + f(u) ].    (3)

Assuming f to be strictly convex and differentiable, this can be solved [6] to yield ŝ = g(y), where the function g can be obtained from the relation

g⁻¹(u) = u + σ² f'(u),    (4)

which follows from setting the derivative of (3) with respect to u to zero. This nonlinear estimator forms the basis of our method.

Figure 1: Shrinkage nonlinearities and associated probability densities. Left: Plots of the different shrinkage functions. Solid line: shrinkage corresponding to the Laplace density. Dashed line: typical shrinkage function obtained from (6). Dash-dotted line: typical shrinkage function obtained from (8). For comparison, the line x = y is given by the dotted line. All the densities were normalized to unit variance, and the noise variance was fixed to 0.3.
Right: Plots of corresponding model densities of the sparse components. Solid line: Laplace density. Dashed line: a typical moderately supergaussian density given by (5). Dash-dotted line: a typical strongly supergaussian density given by (7). For comparison, the gaussian density is given by the dotted line.

3 Parameterizations of sparse densities

To use the estimator defined by (3) in practice, the densities of the s_i need to be modelled with a parameterization that is rich enough. We have developed two parameterizations that seem to describe very well most of the densities encountered in image denoising. Moreover, the parameters are easy to estimate, and the inversion in (4) can be performed analytically. Both models use two parameters and are thus able to model different degrees of supergaussianity, in addition to different scales, i.e. variances. The densities are here assumed to be symmetric and of zero mean.

The first model is suitable for supergaussian densities that are not sparser than the Laplace distribution [6], and is given by the family of densities

p(s) = C exp(-as²/2 - b|s|),    (5)

where a, b > 0 are parameters to be estimated, and C is an irrelevant scaling constant. The classical Laplace density is obtained when a = 0, and gaussian densities correspond to b = 0. A simple method for estimating a and b was given in [6]. For this density, the nonlinearity g takes the form

g(u) = sign(u) max(0, |u| - bσ²) / (1 + σ²a),    (6)

where σ² is the noise variance. The effect of the shrinkage function in (6) is to reduce the absolute value of its argument by a certain amount, which depends on the parameters, and then rescale. Small arguments are thus set to zero. Examples of the obtained shrinkage functions are given in Fig. 1.
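As a concrete illustration, the shrinkage rule (6) is simple to implement; the sketch below is our own minimal example, not code from the paper, and the parameter values a = 1, b = 1, σ² = 0.3 are arbitrary choices for demonstration:

```python
import numpy as np

def shrink_moderate(u, a, b, sigma2):
    """Shrinkage nonlinearity (6) for the density p(s) = C exp(-a*s^2/2 - b*|s|):
    subtract b*sigma^2 from the magnitude, clip at zero, then rescale."""
    u = np.asarray(u, dtype=float)
    return np.sign(u) * np.maximum(0.0, np.abs(u) - b * sigma2) / (1.0 + sigma2 * a)

# Small arguments are set to exactly zero; large ones are shrunk toward zero.
y = np.array([-2.0, -0.1, 0.05, 1.5])
print(shrink_moderate(y, a=1.0, b=1.0, sigma2=0.3))
```

Note that the threshold bσ² grows with the noise level, so noisier observations are thresholded more aggressively, as one would expect from (3).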
\n\nThe second model describes densities that are sparser than the Laplace density: \n\n(a: + 2) [a: (a: + 1)/2](a/Hl) \np(s) = 2d [Va: (a: + 1)/2 + I sid 1](a+3)\u00b7 \n\n1 \n\n(7) \n\n\f476 \n\nA. Hyvarinen, P Hoyer and E. Dja \n\nWhen a -+ 00, the Laplace density is obtained as the limit. A simple consistent \nmethod for estimating the parameters d, a > 0 in (7) can be obtained from the \nrelations d = JE{S2} and a = (2 - k + Jk(k + 4))/(2k - 1) with k = d2Ps(O)2, \nsee [6]. The resulting shrinkage function can be obtained as [6] \n\ng(u) = sign(u)max(O, \n\nlui - ad \n\n2 \n\n1 \n\n,..,---,-----,--::-------:,....,----\n+ \"2 J (l u l + ad)2 - 4a2(a + 3)) \n\n(8) \n\nwhere a = Ja(a + 1)/2, and g(u) is set to zero in case the square root in (8) is \nimaginary. This is a shrinkage function that has a certain hard-thresholding flavor, \nas depicted in Fig. 1. \n\nExamples of the shapes of the densities given by (5) and (7) are given in Fig. 1, \ntogether with a Laplace density and a gaussian density. For illustration purposes, \nthe densities in the plot are normalized to unit variance, but these parameterizations \nallow the variance to be choosen freely. \n\nChoosing whether model (5) or (7) should be used can be based on moments of \nthe distributions; see [6]. Methods for estimating the noise variance a 2 are given in \n[3,6]. \n\n4 Sparse code shrinkage \n\nThe above results imply the following sparse code shrinkage method for denoising. \nAssume that we observe a noisy version x = x + v of the data x, where v is gaussian \nwhite noise vector. To denoise x, we transform the data to a sparse code, apply the \nabove ML estimation procedure component-wise, and then transform back to the \noriginal variables. Here, we constrain the transformation to be orthogonal; this is \nmotivated in Section 5. To summarize: \n\n1. 
First, using a noise-free training set of x, use some sparse coding method for determining the orthogonal matrix W so that the components s_i in s = Wx have as sparse distributions as possible. Estimate a density model p_i(s_i) for each sparse component, using the models in (5) and (7).

2. Compute for each noisy observation x̃(t) of x the corresponding noisy sparse components y(t) = Wx̃(t). Apply the shrinkage nonlinearity g_i(·) as defined in (6), or in (8), on each component y_i(t), for every observation index t. Denote the obtained components by ŝ_i(t) = g_i(y_i(t)).

3. Invert the relation (1) to obtain estimates of the noise-free x, given by x̂(t) = W^T ŝ(t).

To estimate the sparsifying transform W, we assume that we have access to a noise-free realization of the underlying random vector. This assumption is not unrealistic in many applications: for example, in image denoising it simply means that we can observe noise-free images that are somewhat similar to the noisy image to be treated, i.e., they belong to the same environment or context. This assumption can, however, be relaxed in many cases, see [7]. The problem of finding an optimal sparse code in step 1 is treated in the next section.

In fact, it turns out that the shrinkage operation given above is quite similar to the one used in the wavelet shrinkage method derived earlier by Donoho et al [3] from a very different approach. Their estimator consisted of applying the shrinkage operator in (6), with different values for the parameters, on the coefficients of the wavelet transform. There are two main differences between the two methods. The first is the choice of the transformation. We choose the transformation using the statistical properties of the data at hand, whereas Donoho et al use a predetermined wavelet transform.
The second important difference is that we estimate the shrinkage nonlinearities by the ML principle, again adapting to the data at hand, whereas Donoho et al use fixed thresholding operators derived by the minimax principle.

5 Choosing the optimal sparse code

Different measures of sparseness (or nongaussianity) have been proposed in the literature [1, 4, 8, 10]. In this section, we show which measures are optimal for our method. We shall here restrict ourselves to the class of linear, orthogonal transformations. This restriction is justified by the fact that orthogonal transformations leave the gaussian noise structure intact, which makes the problem much more easily tractable. This restriction can be relaxed, however, see [7].

A simple, yet very attractive principle for choosing the basis for sparse coding is to consider the data to be generated by a noisy independent component analysis (ICA) model [10, 6, 9]:

x = As + v,    (9)

where the s_i are now the independent components, and v is multivariate gaussian noise. We could then estimate A using ordinary maximum likelihood estimation of the ICA model. Under the restriction that A is constrained to be orthogonal, estimation of the noise-free components s_i then amounts to the above method of shrinking the values of A^T x, see [6]. In this ML sense, the optimal transformation matrix is thus given by W = A^T. In particular, using this principle means that ordinary ICA algorithms can be used to estimate the sparse coding basis. This is very fortunate since the computationally efficient methods for ICA estimation enable the basis estimation even in spaces of rather high dimensions [8, 5].

An alternative principle for determining the optimal sparsifying transformation is to minimize the mean-square error (MSE).
In [6], a theorem is given that shows that the optimal basis in the minimum MSE sense is obtained by maximizing Σ_{i=1}^{n} I_F(w_i^T x), where I_F(s) = E{[p'(s)/p(s)]²} is the Fisher information of the density of s, and the w_i^T are the rows of W. The Fisher information of a density [4] can be considered as a measure of its nongaussianity. It is well known [4] that in the set of probability densities of unit variance, Fisher information is minimized by the gaussian density, and the minimum equals 1. Thus the theorem shows that the more nongaussian (sparse) s is, the better we can reduce noise. Note, however, that Fisher information is not scale-invariant.

The former (ML) method of determining the basis matrix usually gives sparser components than the latter method based on minimizing MSE. In the case of image denoising, however, these two methods give essentially equivalent bases if a perceptually weighted MSE is used [6]. Thus we luckily avoid the classical dilemma of choosing between these two optimality criteria.

6 Experiments

Image data seems to fulfill the assumptions inherent in sparse code shrinkage: it is possible to find linear representations whose components have sparse distributions, using wavelet-like filters [10]. Thus we performed a set of experiments to explore the utility of sparse code shrinkage in image denoising. The experiments are reported in more detail in [7].

Data. The data consisted of real-life images, mainly natural scenes. The images were randomly divided into two sets. The first set was used in estimating the matrix W that gives the sparse coding transformation, as well as in estimating the shrinkage nonlinearities. The second set was used as a test set. It was artificially corrupted by Gaussian noise, and sparse code shrinkage was used to reduce the noise. The images were used in the method in the form of subwindows of 8 × 8 pixels.
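The three-step procedure of Section 4 can be sketched in code as follows. This is our own illustration on synthetic data, not the experimental code: the orthogonal basis is a random stand-in for one estimated by ICA, and the shrinkage uses (6) with the unit-variance Laplace parameters a = 0, b = √2 as an assumed density model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for step 1: a random matrix projected onto the orthogonal group via
# SVD (in practice W would come from an ICA/sparse-coding algorithm run on a
# noise-free training set).
n = 16
M = rng.standard_normal((n, n))
U, _, Vt = np.linalg.svd(M)
W = U @ Vt  # nearest orthogonal matrix to M

def shrink(u, a, b, sigma2):
    # Shrinkage nonlinearity (6) for the density p(s) = C exp(-a*s^2/2 - b*|s|).
    return np.sign(u) * np.maximum(0.0, np.abs(u) - b * sigma2) / (1.0 + sigma2 * a)

def denoise(x_noisy, W, a, b, sigma2):
    y = W @ x_noisy                   # step 2: noisy sparse components y = W x̃
    s_hat = shrink(y, a, b, sigma2)   # component-wise ML shrinkage
    return W.T @ s_hat                # step 3: back to data coordinates, x̂ = Wᵀ ŝ

# Synthetic sparse (Laplacian) components mixed into data space, plus noise.
s = rng.laplace(scale=1.0, size=n)
x = W.T @ s
sigma2 = 0.3
x_noisy = x + rng.normal(scale=np.sqrt(sigma2), size=n)
x_hat = denoise(x_noisy, W, a=0.0, b=np.sqrt(2.0), sigma2=sigma2)

print("noisy MSE:", np.mean((x_noisy - x) ** 2))
print("denoised MSE:", np.mean((x_hat - x) ** 2))
```

Because W is orthogonal, the gaussian noise remains white in the sparse-code coordinates, which is exactly what makes the component-wise shrinkage of Section 2 applicable.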
\n\nMethods. The sparse coding matrix W was determined by first estimating the \nICA model for the image windows (with DC component removed) using the FastICA \nalgorithm [8, 5], and projecting the obtained estimate on the space of orthogonal \nmatrices. The training images were also used to estimate the parametric density \nmodels of the sparse components. In the first series of experiments, the local vari(cid:173)\nance was equalized as a preprocessing step [7]. This implied that the density in \n(5) was a more suitable model for the densities of the sparse components; thus the \nshrinkage function in (6) was used. In the second series, no such equalization was \nmade, and the density model (7) and the shrinkage function (8) were used [7]. \n\nResults. Fig. 2 shows, on the left, a test image which was artificially corrupted \nwith Gaussian noise with standard deviation 0.5 (the standard deviations of the \noriginal images were normalized to 1). The result of applying our denoising method \n(without local variance equalization) on that image is shown on the right. Visual \ncomparison of the images in Fig. 2 shows that our sparse code shrinkage method \ncancels noise quite effectively. One sees that contours and other sharp details are \nconserved quite well, while the overall reduction of noise is quite strong, which in \nis contrast to methods based on low-pass filtering. This result is in line with those \nobtained by wavelet shrinkage [3]. More experimental results are given in [7]. \n\n7 Conclusion \nSparse coding and ICA can be applied for image feature extraction, resulting in a \nwavelet-like basis for image windows [10]. As a practical application of such a basis, \nwe introduced the method of sparse code shrinkage. It is based on the fact that in \nsparse coding the energy of the signal is concentrated on only a few components, \nwhich are different for each observed vector. 
By shrinking the absolute values of the sparse components towards zero, noise can be reduced. The method is also closely connected to modeling image data with noisy independent component analysis [9]. We showed how to find the optimal sparse coding basis for denoising, and we developed families of probability densities that allow the shrinkage nonlinearities to adapt accurately to the data at hand. Experiments on image data showed that the performance of the method is very appealing. The method reduces noise without blurring edges or other sharp features as much as linear low-pass or median filtering. This is made possible by the strongly non-linear nature of the shrinkage operator that takes advantage of the inherent statistical structure of natural images.", "award": [], "sourceid": 1612, "authors": [{"given_name": "Aapo", "family_name": "Hyv\u00e4rinen", "institution": null}, {"given_name": "Patrik", "family_name": "Hoyer", "institution": null}, {"given_name": "Erkki", "family_name": "Oja", "institution": null}]}