{"title": "Learning Nonlinear Overcomplete Representations for Efficient Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 556, "page_last": 562, "abstract": "", "full_text": "Learning nonlinear overcomplete \nrepresentations for efficient coding \n\nMichael S. Lewicki \nlewicki~salk.edu \n\nTerrence J. Sejnowski \n\nterry~salk.edu \n\nHoward Hughes Medical Institute \nComputational Neurobiology Lab \n\nThe Salk Institute \n\n10010 N. Torrey Pines Rd. \n\nLa Jolla, CA 92037 \n\nAbstract \n\nWe derive a learning algorithm for inferring an overcomplete basis \nby viewing it as probabilistic model of the observed data. Over(cid:173)\ncomplete bases allow for better approximation of the underlying \nstatistical density. Using a Laplacian prior on the basis coefficients \nremoves redundancy and leads to representations that are sparse \nand are a nonlinear function of the data. This can be viewed as \na generalization of the technique of independent component anal(cid:173)\nysis and provides a method for blind source separation of fewer \nmixtures than sources. We demonstrate the utility of overcom(cid:173)\nplete representations on natural speech and show that compared \nto the traditional Fourier basis the inferred representations poten(cid:173)\ntially have much greater coding efficiency. \n\nA traditional way to represent real-values signals is with Fourier or wavelet bases. \nA disadvantage of these bases, however, is that they are not specialized for any \nparticular dataset. Principal component analysis (PCA) provides one means for \nfinding an basis that is adapted for a dataset, but the basis vectors are restricted \nto be orthogonal. An extension of PCA called independent component analysis \n(Jutten and Herault, 1991; Comon et al., 1991; Bell and Sejnowski, 1995) allows \nthe learning of non-orthogonal bases. 
All of these bases are complete in the sense that they span the input space, but they are limited in terms of how well they can approximate the dataset's statistical density. \n\nRepresentations that are overcomplete, i.e., with more basis vectors than input variables, can provide a better representation, because the basis vectors can be specialized for a larger variety of features present in the entire ensemble of data. A criticism of overcomplete representations is that they are redundant, i.e., a given data point may have many possible representations, but this redundancy is removed by the prior probability of the basis coefficients, which specifies the probability of the alternative representations. \n\nMost of the overcomplete bases used in the literature are fixed in the sense that they are not adapted to the structure in the data. Recently, Olshausen and Field (1996) presented an algorithm that allows an overcomplete basis to be learned. This algorithm relied on an approximation to the desired probabilistic objective that had several drawbacks, including a tendency to break down at low noise levels and when learning bases with higher degrees of overcompleteness. In this paper, we present an improved approximation to the desired probabilistic objective and show that this leads to a simple and robust algorithm for learning optimal overcomplete bases. \n\n1 Inferring the representation \n\nThe data, x_{1:L}, are modeled with an overcomplete linear basis plus additive noise: \n\nx = As + ε (1) \n\nwhere A is an L x M matrix whose columns are the basis vectors, with M ≥ L. We assume Gaussian additive noise, so that log P(x|A, s) ∝ -λ(x - As)²/2, where λ = 1/σ² defines the precision of the noise.
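As an illustration of the generative model in (1), the following minimal sketch draws samples from x = As + ε with Laplacian-distributed coefficients. This is not code from the paper; the dimensions, unit-norm normalization, and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

L, M = 2, 4                      # input dimension and number of basis vectors (M >= L)
A = rng.standard_normal((L, M))
A /= np.linalg.norm(A, axis=0)   # unit-norm basis vectors as columns

def sample_data(n, noise_precision=100.0):
    """Draw n samples from x = A s + eps, with a Laplacian prior on s
    and Gaussian noise of precision lambda = 1 / sigma**2."""
    s = rng.laplace(scale=1.0, size=(M, n))                       # sparse-ish coefficients
    eps = rng.standard_normal((L, n)) / np.sqrt(noise_precision)  # additive noise
    return A @ s + eps

X = sample_data(200)             # 200 two-dimensional observations
```

Because M > L, the mapping from s to x is many-to-one; the prior P(s) is what makes the inverse mapping well defined.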
\nThe redundancy in the overcomplete representation is removed by defining a density for the basis coefficients, P(s), which specifies the probability of the alternative representations. The most probable representation, ŝ, is found by maximizing the posterior distribution \n\nŝ = arg max_s P(s|A, x) = arg max_s P(s) P(x|A, s) (2) \n\nP(s) influences how the data are fit in the presence of noise and determines the uniqueness of the representation. In this model, the data are a linear function of s, but s is not, in general, a linear function of the data. If the basis is complete (A is invertible) then, assuming broad priors and low noise, the most probable internal state can be computed simply by inverting A. In the case of an overcomplete basis, however, A cannot be inverted. Figure 1 shows how different priors induce different representations. \n\nUnlike the Gaussian prior, the optimal representation under the Laplacian prior cannot be obtained by a simple linear operation. One approach to optimizing s is to use the gradient of the log posterior in an optimization algorithm. An alternative method for finding the most probable internal state is to view the problem as the linear program: min 1ᵀs such that As = x (with s ≥ 0). This can be generalized to handle both positive and negative s and solved efficiently and exactly with interior point linear programming methods (Chen et al., 1996). \n\nFigure 1: Different priors induce different representations. (a) The 2D data distribution has three main axes, which form an overcomplete representation. The graphs marked \"L2\" and \"L1\" show the optimal scaled basis vectors for the data point x under the Gaussian and Laplacian prior, respectively.
Assuming zero noise, a Gaussian prior for P(s) is equivalent to finding the exact-fitting s with minimum L2 norm, which is given by the pseudoinverse ŝ = A⁺x. A Laplacian prior (P(s_m) ∝ exp[-θ|s_m|]) yields the exact fit with minimum L1 norm, a nonlinear operation that essentially selects a subset of the basis vectors to represent the data (Chen et al., 1996). The resulting representation is sparse. (b) A 64-sample segment of speech was fit to a 2x overcomplete Fourier representation (128 basis vectors). The plot shows the rank order distribution of the coefficients of ŝ under a Gaussian prior (dashed) and a Laplacian prior (solid). Far more significantly positive coefficients are required under the Gaussian prior than under the Laplacian prior. \n\n2 Learning \n\nThe learning objective is to adapt A to maximize the probability of the data, which is computed by marginalizing over the internal states \n\nP(x|A) = ∫ ds P(s) P(x|A, s) (3) \n\nIn general, this integral cannot be evaluated analytically, but it can be approximated with a Gaussian integral around ŝ, yielding \n\nlog P(x|A) ≈ const. + log P(ŝ) - (λ/2)(x - Aŝ)² - (1/2) log det H (4) \n\nwhere H is the Hessian of the log posterior at ŝ, given by λAᵀA - ∇∇ log P(ŝ). To avoid a singularity under the Laplacian prior, we use the approximation (log P(s_m))′ ≈ -θ tanh(βs_m), which gives the Hessian full rank and positive determinant. For large β this approximates the true Laplacian prior. A learning rule can be obtained by differentiating log P(x|A) with respect to A. \n\nIn the following discussion, we will present the derivations of the three terms in (4) and the simplifying assumptions that lead to the following simple form of the learning rule \n\nΔA ∝ -A(zŝᵀ + I) (5) \n\n2.1 Deriving ∇ log P(ŝ) \n\nThis term specifies how to change A so as to make the representation ŝ more probable.
If we assume a Laplacian prior, this component changes A to make the representation more sparse. \n\nWe assume P(s) = ∏_m P(s_m). In order to obtain ∂ŝ_m/∂a_ij, we need to describe ŝ as a function of A. If the basis is complete (and we assume low noise), then we may simply invert A to obtain ŝ = A⁻¹x. When A is overcomplete, however, there is no simple expression, but we may still make an approximation. \n\nUnder sparse priors, the most probable solution, ŝ, will have at most L non-zero elements. In effect, this selects a complete basis from A. Let Ã denote this reduced basis under ŝ. We then have s̃ = Ã⁻¹(x - ε), where s̃ is equal to ŝ with its M - L zero-valued elements removed; Ã is obtained by removing the columns of A corresponding to the M - L zero-valued elements of ŝ. This allows us to use results obtained for the case when A is invertible. Following MacKay (1996) we obtain \n\n∂s̃_m/∂ã_ij = -[Ã⁻¹]_mi s̃_j (6) \n\nRewriting in matrix notation, we have \n\n∂ log P(s̃)/∂Ã = -Ã⁻ᵀ z̃ s̃ᵀ (7) \n\nwhere z̃_m = (log P(s̃_m))′. We can use this to obtain an expression in terms of the original variables. We simply invert the mapping ŝ → s̃ to obtain z ← z̃ and Wᵀ ← Ã⁻ᵀ, with z_m = 0 and the m-th column of Wᵀ set to zero if ŝ_m = 0. We then have \n\n∂ log P(ŝ)/∂A = -Wᵀ z ŝᵀ (8) \n\n2.2 Deriving ∇(x - Aŝ)² \n\nThe second term specifies how to change A so as to minimize the data misfit. Letting e_k = [x - Aŝ]_k and using the results and notation from above, we have: \n\n∂/∂a_ij [-(λ/2) Σ_k e_k²] = λ e_i ŝ_j + λ Σ_k e_k Σ_l a_kl ∂ŝ_l/∂a_ij (9) \n= λ e_i ŝ_j - λ Σ_k e_k Σ_l a_kl W_li ŝ_j (10) \n= λ e_i ŝ_j - λ e_i ŝ_j = 0 (11) \n\nThus no gradient component arises from the error term. \n\n2.3 Deriving ∇ log det H \n\nThe third term in the learning rule specifies how to change the weights so as to adjust the width of the posterior distribution over s and thus increase the overall probability of the data. An element of H is defined by H_mn = c_mn + b_mn,
where c_mn = λ Σ_k a_km a_kn and b_mn = [-∇∇ log P(ŝ)]_mn. This gives \n\n∂ log det H/∂a_ij = Σ_mn H⁻¹_nm [∂c_mn/∂a_ij + ∂b_mn/∂a_ij] (12) \n\nFirst considering ∂c_mn/∂a_ij, we can obtain \n\nΣ_mn H⁻¹_nm ∂c_mn/∂a_ij = Σ_{m≠j} H⁻¹_mj λ a_im + Σ_{m≠j} H⁻¹_jm λ a_im + H⁻¹_jj 2λ a_ij (13) \n\nUsing the fact that H⁻¹_mj = H⁻¹_jm due to the symmetry of the Hessian, we have \n\nΣ_mn H⁻¹_nm ∂c_mn/∂a_ij = 2λ [A H⁻¹]_ij (14) \n\nNext we derive ∂b_mn/∂a_ij. We have that ∇∇ log P(ŝ) is diagonal, because we assume P(s) = ∏_m P(s_m). Letting 2y_m = H⁻¹_mm ∂b_mm/∂ŝ_m and using the result under the reduced representation (6), we can obtain \n\nΣ_mn H⁻¹_nm ∂b_mn/∂a_ij = -2 [Wᵀ y ŝᵀ]_ij (15) \n\n2.4 Stabilizing and simplifying the learning rule \n\nPutting the terms together yields a problematic expression due to the matrix inverses. This can be alleviated by multiplying the gradient by an appropriate positive definite matrix, which rescales the gradient components but preserves a direction valid for optimization. Noting that AᵀWᵀ = I, we have \n\nAAᵀ ∂ log P(x|A)/∂A = -A z ŝᵀ - λ AAᵀA H⁻¹ + A y ŝᵀ (16) \n\nIf λ is large (low noise), then the Hessian is dominated by λAᵀA and we have \n\nAAᵀ ∂ log P(x|A)/∂A ≈ -A z ŝᵀ - A + A y ŝᵀ (17) \n\nThe vector y hides a computation involving the inverse Hessian. If the basis vectors in A are randomly distributed, then as the dimensionality of A increases the basis vectors become approximately orthogonal and consequently the Hessian becomes approximately diagonal. It can be shown that if log P(s) and its derivatives are smooth, y_m vanishes for large λ. Combining the remaining terms yields equation (5). Note that this rule contains no matrix inverses and the vector z involves only the derivative of the log prior. \n\nIn the case where A is square, this form of the rule is similar to the natural gradient independent component analysis (ICA) learning rule (Amari et al., 1996).
The difference in the more general case, where A is rectangular, is that ŝ must maximize the posterior distribution P(s|x, A), which cannot be done simply with a filter matrix as in standard ICA algorithms. \n\n3 Examples \n\nMore sources than inputs. In these 2D examples, the bases were initialized to random, normalized vectors. The coefficients were solved for using BPMPD, a publicly available interior point linear programming package (Meszaros, 1997), which gives the most probable solution under the Laplacian prior assuming zero noise. The algorithm was run for 30 iterations using equation (5) with a stepsize of 0.001 and a batchsize of 200. Convergence was rapid, typically requiring fewer than 20 iterations. In all cases, the direction of the learned vectors matched those of the true generating distribution; the magnitude was estimated less precisely, possibly due to the approximation of log P(x|A). This can be viewed as a source separation problem, but true separation will be limited due to the projection of the sources down to a smaller subspace, which necessarily loses information. \n\nFigure 2: Examples illustrating the fitting of 2D distributions with overcomplete bases. The first example is equivalent to 3 sources mixed into 2 channels; the second to 4 sources mixed into 2 channels. The data in both examples were generated from the true basis A using x = As, with the elements of s distributed according to an exponential distribution with unit mean. Identical results were obtained by drawing s from a Laplacian prior (positive and negative coefficients). The overcomplete bases allow the model to capture the true underlying statistical structure in the 2D data space. \n\nOvercomplete representations of speech.
Speech data were obtained from the TIMIT database, using a single speaker speaking ten different example sentences, with no preprocessing. The basis was initialized to an overcomplete Fourier basis. A conjugate gradient routine was used to obtain the most probable basis coefficients. The stepsize was gradually reduced over 10000 iterations. Figure 3 shows that the learned basis is quite different from the Fourier representation. The power spectrum of the learned basis vectors can be multimodal and/or broadband. The learned basis achieves greater coding efficiency: 2.19 ± 0.59 bits per sample, compared to 3.86 ± 0.28 bits per sample for a 2x overcomplete Fourier basis. \n\nFigure 3: An example of fitting a 2x overcomplete representation to segments from natural speech. Each segment consisted of 64 samples, sampled at a frequency of 8000 Hz (8 ms). The plot shows a random sample of 30 of the 128 basis vectors (each scaled to full range). The right graph shows the corresponding power spectral densities (0 to 4000 Hz). \n\n4 Summary \n\nLearning overcomplete representations allows a basis to better approximate the underlying statistical density of the data, and consequently the learned representations have better encoding and denoising properties than generic bases. Unlike the case for complete representations and the standard ICA algorithm, the transformation from the data to the internal representation is nonlinear. The probabilistic formulation of the basis inference problem offers the advantages that assumptions about the prior distribution on the basis coefficients are made explicit and that different models can be compared objectively using log P(x|A).
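The zero-noise MAP inference step used throughout, minimizing the L1 norm subject to As = x, can be sketched as a linear program in the basis-pursuit style of Chen et al. (1996). This sketch uses SciPy's general-purpose linprog (HiGHS) rather than the BPMPD solver the paper used, and the basis and data point are made-up illustrations; splitting s = u - v with u, v ≥ 0 is the standard trick for handling both positive and negative coefficients.

```python
import numpy as np
from scipy.optimize import linprog

def map_coefficients(A, x):
    """Most probable s under a Laplacian prior, assuming zero noise:
    minimize ||s||_1 subject to A s = x (basis pursuit).
    Split s = u - v with u, v >= 0 so the objective 1^T u + 1^T v is linear."""
    L, M = A.shape
    c = np.ones(2 * M)            # objective: sum(u) + sum(v) = ||s||_1
    A_eq = np.hstack([A, -A])     # equality constraint: A u - A v = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None), method="highs")
    u, v = res.x[:M], res.x[M:]
    return u - v

# three unit basis vectors in 2D (a 1.5x overcomplete basis, chosen for illustration)
A = np.array([[1.0, 0.0, np.cos(2 * np.pi / 3)],
              [0.0, 1.0, np.sin(2 * np.pi / 3)]])
x = np.array([0.3, 0.8])
s_hat = map_coefficients(A, x)
# the L1 solution fits x exactly and uses at most L of the M basis vectors
```

At the optimum the fit is exact (A s_hat = x), and the vertex solution of the linear program has at most L non-zero coefficients, which is the sparse subset-selection behavior described for the Laplacian prior in Figure 1.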
\n\nReferences \n\nAmari, S., Cichocki, A., and Yang, H. H. (1996). A new learning algorithm for blind \nsignal separation. In Advances in Neural and Information Processing Systems, \nvolume 8, pages 757-763, San Mateo. Morgan Kaufmann. \n\nBell, A. J. and Sejnowski, T. J. (1995). An information maximization approach to \nblind separation and blind deconvolution. Neural Computation, 7(6}:1129-1159. \n\nChen, S., Donoho, D. L., and Saunders, M. A. (1996). Atomic decomposition by \n\nbasis pursuit. Technical report, Dept. Stat., Stanford Univ., Stanford, CA. \n\nComon, P., Jutten, C., and Herault, J. (1991). Blind separation of sources .2. \n\nproblems statement. Signal Processing, 24(1}:11-20. \n\nJutten, C. and Herault, J. (1991). Blind separation of sources .1. an adaptive \n\nalgorithm based on neuromimetic architecture. Signal Processing, 24(1):1-10. \n\nMacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for inde(cid:173)\n\npendent component analysis. University of Cambridge, Cavendish Laboratory. \nAvailable at ftp: / /wol. ra. phy. cam. ac. uk/pub/mackay / ica. ps. gz. \n\nMeszaros, C. (1997). BPMPD: An interior point linear programming solver. Code \n\navailable at ftp: / /ftp.netlib. org/ opt/bpmpd. tar. gz. \n\nOlshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive-field \n\nproperties by learning a sparse code for natural images. Nature, 381:607-609. \n\n\f", "award": [], "sourceid": 1424, "authors": [{"given_name": "Michael", "family_name": "Lewicki", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}