{"title": "Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units", "book": "Advances in Neural Information Processing Systems", "page_first": 377, "page_last": 384, "abstract": null, "full_text": "Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units

Eizaburo Doi
Center for the Neural Basis of Cognition
Carnegie Mellon University
Pittsburgh, PA 15213
edoi@cnbc.cmu.edu

Michael S. Lewicki
Center for the Neural Basis of Cognition
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
lewicki@cnbc.cmu.edu

Abstract

It has been suggested that the primary goal of the sensory system is to represent input in such a way as to reduce its high degree of redundancy. Given a noisy neural representation, however, solely reducing redundancy is not desirable, since redundancy is the only means of reducing the effects of noise. Here we propose a model that optimally balances redundancy reduction and redundant representation. Like previous models, our model accounts for the localized and oriented structure of simple cells, but it also predicts a different organization for the population. With noisy, limited-capacity units, the optimal representation becomes an overcomplete, multi-scale representation, which, compared to previous models, is in closer agreement with physiological data. These results offer a new perspective on the expansion of the number of neurons from retina to V1 and provide a theoretical model for incorporating useful redundancy into efficient neural representations.

1 Introduction

Efficient coding theory posits that one of the primary goals of sensory coding is to eliminate redundancy from raw sensory signals, ideally representing the input by a set of statistically independent features [1].
Models for learning efficient codes, such as sparse coding [2] or ICA [3], predict the localized, oriented, and band-pass characteristics of simple cells. In this framework, units are assumed to be non-redundant, and so the number of units should be identical to the dimensionality of the data.

Redundancy, however, can be beneficial if it is used to compensate for inherent noise in the system [4]. The models above assume that the system noise is low and negligible, so that redundancy in the representation is not necessary. This is equivalent to assuming that the representational capacity of individual units is unlimited. Real neurons, however, have limited capacity [5], and this should place constraints on how a neural population can best encode a sensory signal. In fact, there are important characteristics of simple cells, such as the multi-scale representation, that cannot be explained by efficient coding theory.

The aim of this study is to evaluate how the optimal representation changes when the system is constrained by limited capacity units. We propose a model that best balances redundancy reduction and redundant representation given the limited capacity units. In contrast to the efficient coding models, it is possible to have a larger number of units than the intrinsic dimensionality of the data. This further allows us to introduce redundancy in the population, enabling precise reconstruction from the imprecise representation of a single unit.

2 Model

Encoding

We assume that the encoding is a linear transform of the input x, followed by additive channel noise n ~ N(0, σn² I),

    r = Wx + n        (1)
      = u + n,        (2)

where the rows of W are referred to as the analysis vectors, r is the representation, and u is the signal component of the representation.
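As an illustrative sketch (not the authors' implementation), the encoding of eqs. 1-2 can be written in NumPy; the dimensions, the scale of W, and the noise level below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 63, 63                            # input dimensionality, number of units (complete code)
W = 0.1 * rng.standard_normal((M, D))    # analysis vectors as rows (random stand-in)
sigma_n = 0.4                            # channel noise standard deviation (hypothetical)

def encode(x, W, sigma_n, rng):
    """Linear encoding followed by additive channel noise (eqs. 1-2)."""
    u = W @ x                                   # signal component u = Wx
    n = rng.normal(0.0, sigma_n, size=u.shape)  # channel noise n ~ N(0, sigma_n^2 I)
    return u + n                                # noisy representation r

x = rng.standard_normal(D)               # stand-in for an image patch
r = encode(x, W, sigma_n, rng)
```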
We will refer to u as the coefficients because they are the clean coefficients associated with the synthesis vectors in the decoding process, as described below.

We define the channel noise level as follows,

    (channel noise level) = σn²/σt² × 100 [%],        (3)

where σt² is a constant target value of the coefficient variance. It is the inverse of the signal-to-noise ratio of the representation, and therefore we can control the information capacity of a single unit by varying the channel noise variance. Note that in the previous models [2, 3, 6] there is no channel noise; therefore r = u, and the signal-to-noise ratio of the representation is infinite.

Decoding

The decoding process is assumed to be a linear transform of the representation,

    x̂ = Ar,        (4)

where the columns of A are referred to as the synthesis vectors1, and x̂ is the reconstruction of the input. The reconstruction error e is then expressed as

    e = x - x̂        (5)
      = (I - AW) x - An.        (6)

Note that no assumption about the reconstruction error is made, because eq. 4 is not a probabilistic data generative model, in contrast to the previous approaches [2, 6].

Representation desiderata

We assume a two-fold goal for the representation. The first is to preserve input information given noisy, limited information capacity units. The second is to make the representation

1In the noiseless and complete case, they are equivalent to the basis functions [2, 3]. In our setting, however, they are in general no longer basis functions. To make this clear, we refer to A and W as the synthesis and analysis vectors.

Figure 1: Optimal codes for toy problems. Data (shown with small dots) is generated with two i.i.d. Laplacians mixed via non-orthogonal basis functions (shown by gray bars).
The optimal synthesis vectors (top row) and analysis vectors (bottom row) are shown as black bars. Plots of synthesis vectors are scaled for visibility. (a-c) show the complete code with 0, 20, and 80% channel noise levels. (d) shows the case of 80% channel noise using an 8x overcomplete code. The reconstruction error is (a) 0.0%, (b) 13.6%, (c) 32.2%, (d) 6.8%.

as sparse as possible, which yields an efficient code. The cost function to be minimized is therefore defined as follows,

    C(A, W) = (reconstruction error) - λ1 (sparseness) + λ2 (fixed variance)        (7)

            = ⟨‖e‖²⟩ - λ1 Σ_{i=1}^{M} ⟨ln p(ui)⟩ + λ2 Σ_{i=1}^{M} [ln(⟨ui²⟩/σt²)]²,        (8)

where ⟨·⟩ represents an ensemble average over the samples, and M is the number of units. The sparseness is measured by the log-likelihood of a sparse prior p, as in the previous models [2, 3, 6]. The third, fixed variance term penalizes the case in which the coefficient variance of the i-th unit ⟨ui²⟩ deviates from its target value σt². It serves to fix the signal-to-noise ratio of the representation, yielding a fixed information capacity. Without this term, the coefficient variance could become trivially large so that the signal-to-noise ratio is high, yielding a smaller reconstruction error; or the variance could become small to satisfy only the sparseness constraint, which is not desirable either.

Note that in order to introduce redundancy in the representation, we do not assume statistical independence of the coefficients. The second term in eq. 8 measures the sparseness of each coefficient individually, but it does not impose their statistical independence. We illustrate this with toy problems in Figure 1. If there is no channel noise, the optimal complete (1x) code is identical to the ICA solution (a), since it gives the sparsest, most non-Gaussian solution with minimal error.
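To make the cost concrete, here is a minimal batch estimate of eq. 8 in NumPy. This is a sketch, not the paper's code: it assumes a Laplacian sparse prior, ln p(u) = -|u| + const, and the values of lam1, lam2, and sigma_t are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def cost(A, W, X, sigma_n=0.0, sigma_t=1.0, lam1=0.1, lam2=0.1):
    """Batch estimate of the cost in eq. 8; columns of X are input samples.

    Assumes a Laplacian sparse prior, ln p(u) = -|u| + const; sigma_t,
    lam1, and lam2 are illustrative values, not the paper's settings."""
    U = W @ X                                        # clean coefficients u = Wx
    # <||e||^2>: signal part plus the channel-noise contribution
    # sigma_n^2 Tr(A A^T) that follows from eq. 6 in expectation
    # (note np.sum(A**2) == Tr(A A^T)).
    err = np.mean(np.sum((X - A @ U) ** 2, axis=0)) + sigma_n**2 * np.sum(A**2)
    sparse = np.mean(np.sum(-np.abs(U), axis=0))     # sum_i <ln p(u_i)>
    uvar = np.mean(U ** 2, axis=1)                   # <u_i^2> per unit
    var_pen = np.sum(np.log(uvar / sigma_t**2) ** 2) # fixed-variance term
    return err - lam1 * sparse + lam2 * var_pen
```

With A = W = I on unit-variance data, the signal error vanishes and the cost is dominated by the noise and sparseness terms, which illustrates why the fixed-variance term is needed to pin down the scale of u.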
As the channel noise increases (b and c), sparseness is compromised in order to minimize the reconstruction error by choosing a correlated, redundant representation. In the extreme case where the channel noise is high enough, the two units become almost completely redundant (c). It should be noted that in such a case the two vectors represent the direction of the first principal component of the data.

In addition to de-emphasizing sparseness, there is another way to introduce redundancy in the representation. Since the goal of the representation is not the separation of independent sources, we can set an arbitrarily large number of units in the representation. When the information capacity of a single unit is limited, the capacity of a population can be made large by increasing the number of units. As shown in Figure 1c-d, the reconstruction error decreases as we increase the degree of overcompleteness. Note that the optimal overcomplete code is not simply a duplication of the complete code.

Learning rule

The optimal code can be learned by gradient descent of the cost function (eq. 8) with respect to A and W,

    ΔA ∝ (I - AW) ⟨xxᵀ⟩ Wᵀ - σn² A,        (9)

    ΔW ∝ Aᵀ (I - AW) ⟨xxᵀ⟩ + λ1 ⟨ (∂ ln p(u)/∂u) xᵀ ⟩ - λ2 diag( ln[⟨ui²⟩/σt²] / ⟨ui²⟩ ) W ⟨xxᵀ⟩.        (10)

In the limit of zero channel noise in the square case (e.g., Figure 1a), the solution is at equilibrium when W = A⁻¹ (see eq. 9), where the learning rule becomes similar to standard ICA (except for the 3rd term in eq. 10). In all other cases, there is no reason to believe that W = A⁻¹, if it exists, minimizes the cost function. This is why we need to optimize A and W individually.

3 Optimal representations for natural images

We examined optimal codes for natural image patches using the proposed model. The training data consists of 8x8 pixel image patches sampled from a data set of 62 natural images [7]. The data is not preprocessed except for the subtraction of DC components [8].
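The gradient updates of eqs. 9-10 can be sketched as a single batch step in NumPy. This is a hedged illustration, not the authors' code: it assumes a Laplacian prior, so ∂ln p(u)/∂u = -sign(u), and eta, lam1, lam2, and sigma_t are illustrative hyperparameters.

```python
import numpy as np

def update(A, W, X, sigma_n, sigma_t=1.0, lam1=0.1, lam2=0.1, eta=1e-3):
    """One batch gradient step following eqs. 9-10.

    Assumes a Laplacian prior, d ln p(u)/du = -sign(u); eta, lam1, lam2,
    and sigma_t are illustrative hyperparameters, not the paper's values."""
    N = X.shape[1]
    U = W @ X                               # clean coefficients, one column per sample
    Cxx = X @ X.T / N                       # batch estimate of <x x^T>
    R = np.eye(X.shape[0]) - A @ W          # residual operator (I - AW)
    dA = R @ Cxx @ W.T - sigma_n**2 * A                             # eq. 9
    uvar = np.mean(U**2, axis=1)                                    # <u_i^2> per unit
    dW = (A.T @ R @ Cxx                                             # eq. 10, term by term
          + lam1 * (-np.sign(U)) @ X.T / N
          - lam2 * np.diag(np.log(uvar / sigma_t**2) / uvar) @ W @ Cxx)
    return A + eta * dA, W + eta * dW
```

In the square, noiseless case (sigma_n = 0) with W = A⁻¹, the first terms of both updates vanish, matching the equilibrium argument in the text.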
Accordingly, the intrinsic dimensionality of the data is 63, and an N-times overcomplete code consists of N × 63 units. The training set is sequentially updated during learning, and the order is randomized to prevent any local structure in the sequence. A typical number of image patches in a training run is 5 × 10⁶.

Here we first describe how the presence of channel noise changes the optimal code in the complete case. Next, we examine the optimal code at different degrees of overcompleteness given a high channel noise level.

3.1 Optimal code at different channel noise levels

We varied the channel noise level over 10, 20, 40, and 80%. As shown in Figure 2, the learned synthesis and analysis vectors look somewhat similar to ICA (only 10 and 80% are shown for clarity). The comparison to the receptive fields of simple cells should be made with the analysis vectors [9, 10, 7]. They show localized and oriented structures and are well fitted by the Gabor function, indicating the similarity to simple cells in V1. An additional characteristic beyond the Gabor-like structure is that the spatial-frequency tuning of the analysis vectors shifts towards lower spatial frequencies as the channel noise increases (Figure 2d).

The learned code is expected to be robust to the channel noise. The reconstruction error with respect to the data variance turned out to be 6.5, 10.1, 15.7, and 23.8% for the 10, 20, 40, and 80% channel noise levels, respectively. The noise reduction is significant considering the fact that any whitened representation, including ICA, generates a reconstruction error of exactly the same amount as the channel noise level2. For the learned ICA code shown in Figure 2a, the reconstruction error was 82.7% when 80% channel noise was applied.

2Since the mean squared error is expressed as ⟨‖e‖²⟩ = σn² Tr(AAᵀ) = σn² Tr(⟨xxᵀ⟩) = σn² × (data variance), where W is the whitening filter and A (= W⁻¹) its corresponding basis functions. We used eq.
(6) and ⟨xxᵀ⟩ = AW ⟨xxᵀ⟩ Wᵀ Aᵀ = AAᵀ.

Figure 2: Optimal complete code at different channel noise levels. (a-c) Optimized synthesis and analysis vectors: (a) ICA; (b) proposed model at 10% channel noise level; (c) proposed model at 80% channel noise level. Here 40 vectors out of 63 are shown. (d) Distribution of the spatial-frequency tuning of the analysis vectors in conditions (a)-(c).

The robustness to channel noise can be explained by the shift of the representation towards lower spatial frequencies. We analyzed the reconstruction error by projecting it onto the principal axes of the data. Figure 3a shows the error spectrum of the code for 80% channel noise, along with the data spectrum (the percentage of the data variance along the principal axes). Note that the data variance of natural images is mostly explained by the first principal components, which correspond to lower spatial frequencies. In the proposed model, the ratio of the error to the data variance is relatively small around the first principal components. This can be seen much more clearly in Figure 3b, where the reconstruction percentage at each principal component is replotted. The reconstruction is more precise for the more significant principal components (i.e., smaller index), and it drops to zero for the minor components. For comparison, we analyzed the error for the ICA code, where the synthesis and analysis vectors are optimized without channel noise and the robustness is then tested at the 80% channel noise level. As shown in Figure 3, ICA reconstructs every component equally, irrespective of their very different data variances3, and therefore the percentage of reconstruction is flat.
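The footnoted claim for whitened codes can be checked numerically. The sketch below, using correlated Gaussian data as a stand-in for image patches, verifies that channel noise passed through a whitening code incurs a reconstruction error of σn² × (data variance), i.e. exactly the channel noise level:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 8, 20000
B = rng.standard_normal((D, D))
X = B @ rng.standard_normal((D, N))        # correlated Gaussian stand-in for patches

C = X @ X.T / N                            # sample data covariance <x x^T>
evals, E = np.linalg.eigh(C)
W = np.diag(evals ** -0.5) @ E.T           # whitening filter
A = np.linalg.inv(W)                       # corresponding basis functions (A = W^-1)

# 80% channel noise level relative to the (unit) coefficient variance of W @ X
sigma_n2 = 0.8 * np.mean(np.mean((W @ X) ** 2, axis=1))
noise = np.sqrt(sigma_n2) * rng.standard_normal((D, N))
err = np.mean(np.sum((A @ noise) ** 2, axis=0))   # <||A n||^2>, the noise term of eq. 6

# footnote 2: error = sigma_n^2 Tr(A A^T) = sigma_n^2 Tr(<x x^T>) = sigma_n^2 x (data variance)
assert np.isclose(err, sigma_n2 * np.trace(A @ A.T), rtol=0.05)
assert np.isclose(np.trace(A @ A.T), np.trace(C), rtol=1e-6)
```

The same construction underlies the flat ICA reconstruction spectrum: projecting the noise error onto the data's eigenvectors distributes it in proportion to the data spectrum itself.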
The proposed model achieves robustness to channel noise by primarily representing the principal components.

Note that such a biased reconstruction depends on the channel noise level. In Figure 3b we also show the reconstruction spectrum with 10% channel noise using the code for the 10% channel noise level. Compared to the 80% case, the model comes to reconstruct the data at relatively minor components as well. This means that the model can represent finer information if the information capacity of a single unit is large enough. Such a shift of the representation is also demonstrated with the toy problems in Figure 1a-c.

3.2 Optimal code at different degrees of overcompleteness

Now we examine how the optimal representation changes with the number of available units. We fixed the channel noise level at 80% and varied the degree of overcompleteness over 1x, 2x, 4x, and 8x. The learned vectors for 8x are shown in Figure 4a, and those for

3Since the error spectrum for a whitened representation is expressed as ⟨(Eᵀe)²⟩ = σn² Diag(Eᵀ ⟨xxᵀ⟩ E) = σn² Diag(D) = σn² × (data spectrum), where EDEᵀ = ⟨xxᵀ⟩ is the eigenvalue decomposition of the data covariance matrix.

Figure 3: Error analysis. (a) Power spectrum of the data (`DAT') and of the reconstruction error with 80% channel noise. `80%' is the error of the 1x code for the 80% channel noise level. `ICA' is the error of the ICA code. (b) Percentage of reconstruction at each principal component. In addition to the conditions in (a), we also show the following (see text). `10%': 1x code for the 10% channel noise level. The error is measured with 10% channel noise. `8x': 8x code for the 80% channel noise level.
The error is measured with 80% channel noise.

1x are in Figure 2c. Compared to the 1x case, where the synthesis and analysis vectors look uniform in shape, the 8x code shows more diversity. To be precise, as illustrated in Figure 4b, the spatial-frequency tuning of the analysis vectors becomes more broadly distributed and covers a larger region as the degree of overcompleteness increases. Physiological data at the central fovea show that the spatial-frequency tuning of V1 simple cells spans three [11] or two [12] octaves. Models for efficient coding, especially ICA, which provides the most efficient code, do not reproduce such a multi-scale representation; instead, the resulting analysis vectors tune only to the highest spatial frequency (Figure 2a; [3, 9, 10, 7]). It is important that the proposed model generates a broader tuning distribution in the presence of channel noise and with a high degree of overcompleteness.

An important property of the proposed model is that the reconstruction error decreases as the degree of overcompleteness increases. The resulting error is 23.8, 15.5, 9.7, and 6.2% for the 1x, 2x, 4x, and 8x codes. The noise analysis shows that the model comes to represent minor components as the degree of overcompleteness increases (Figure 3b). There is an interesting similarity between the error spectra of the 8x code for 80% channel noise and the 1x code for 10% channel noise.
This suggests that a population of units can represent the same amount and the same kind of information using an N-times larger number of units if the information capacity of a single unit is decreased by an N-times larger channel noise level.

4 Discussion

A multi-scale representation is known to provide an approximately efficient representation, although it is not optimal, as there are known statistical dependencies between scales [13]. We conjecture that these residual dependencies may be one reason why previous efficient coding models could not yield a broad multi-scale representation. In contrast, the proposed model can introduce useful redundancies in the representation, which is consistent with the emergence of a multi-scale representation. Although it can generate a broader distribution of the spatial-frequency tuning, in these experiments it covers only about one octave, not two or three octaves as in the physiological data [11, 12]. This issue remains to be explained.

Figure 4: Optimal overcomplete code. (a) Optimized 8x overcomplete code for the 80% channel noise level. Here only 176 out of 504 functions are shown. The functions are sorted according to the spatial-frequency tuning of the analysis vectors. (b) Distribution of the spatial-frequency tuning of the analysis vectors at different degrees of overcompleteness.

Another important characteristic of simple cells is the fact that more cells are tuned to the lower spatial frequencies [11, 12]. One explanation is that the high data-variance components should be highly oversampled so that the reconstruction error is minimized given the limited precision of a single unit [12].
As we described earlier, such a biased representation of the high-variance components is observed in our model (Figure 3b). However, the distribution of the spatial-frequency tuning of the analysis vectors does not follow this trend; instead, it is bell-shaped (Figure 4b). This apparent inconsistency might be resolved by considering the synthesis vectors, because the reconstruction error is determined by both the synthesis and analysis vectors.

A related work is Atick & Redlich's model for retinal ganglion cells [14]. It also utilizes redundancy in the representation, but to compensate for sensory noise rather than channel noise; therefore, the two models explain different phenomena. Another related work is Olshausen & Field's sparse coding model for simple cells [2], but this again looks at the effects of sensory noise (note that if the sensory noise is negligible this algorithm does not learn a sparse representation, while the proposed model is appropriate for this condition; of course, such a condition might be unrealistic). Given a photopic environment, where the sensory noise can reasonably be regarded as small [14], it is important to examine how the constraint of noisy, limited information capacity units changes the representation. It has been reported that the information capacity is significantly decreased from photoreceptors to spiking neurons [15], which supports our approach. In spite of its significance, to our knowledge the influence of channel noise on the representation had not previously been explored.

5 Conclusion

We propose a model that both utilizes redundancy in the representation in order to compensate for the limited precision of a single unit and reduces unnecessary redundancy in order to yield an efficient code.
The noisy, overcomplete code for natural images generates a distributed spatial-frequency tuning in addition to the Gabor-like analysis vectors, showing closer agreement with the physiological data than the previous efficient coding models.

The information capacity of a representation may be constrained either by the intrinsic noise in a single unit or by the number of units. In either case, the proposed model can adapt its parameters to primarily represent the high-variance, coarse information, yielding a representation that is robust to channel noise. As the limitation is relaxed, by decreasing the channel noise level or by increasing the number of units, the model comes to represent low-variance, fine information.

References

[1] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217-234. MIT Press, MA, 1961.

[2] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311-3325, 1997.

[3] A. J. Bell and T. J. Sejnowski. The independent components of natural scenes are edge filters. Vision Research, 37:3327-3338, 1997.

[4] H. B. Barlow. Redundancy reduction revisited. Network: Comput. Neural Syst., 12:241-253, 2001.

[5] A. Borst and F. E. Theunissen. Information theory and neural coding. Nature Neuroscience, 2:947-957, 1999.

[6] M. S. Lewicki and B. A. Olshausen. Probabilistic framework for the adaptation and comparison of image codes. J. Opt. Soc. Am. A, 16:1587-1601, 1999.

[7] E. Doi, T. Inui, T.-W. Lee, T. Wachtler, and T. J. Sejnowski. Spatiochromatic receptive field properties derived from information-theoretic analyses of cone mosaic responses to natural scenes. Neural Computation, 15:397-417, 2003.

[8] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, NY, 2001.

[9] J. H. van Hateren and A.
van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265:359-366, 1998.

[10] D. L. Ringach. Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88:455-463, 2002.

[11] R. L. De Valois, D. G. Albrecht, and L. G. Thorell. Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22:545-559, 1982.

[12] C. H. Anderson and G. C. DeAngelis. Population codes and signal to noise ratios in primary visual cortex. In Society for Neuroscience Abstract, page 822.3, 2004.

[13] E. P. Simoncelli. Modeling the joint statistics of images in the wavelet domain. In Proc. SPIE 44th Annual Meeting, pages 188-195, Denver, Colorado, 1999.

[14] J. J. Atick and A. N. Redlich. What does the retina know about natural scenes? Neural Computation, 4:196-210, 1992.

[15] S. B. Laughlin and R. R. de Ruyter van Steveninck. The rate of information transfer at graded-potential synapses. Nature, 379:642-645, 1996.
", "award": [], "sourceid": 2581, "authors": [{"given_name": "Eizaburo", "family_name": "Doi", "institution": null}, {"given_name": "Michael", "family_name": "Lewicki", "institution": null}]}