{"title": "Edges are the 'Independent Components' of Natural Scenes.", "book": "Advances in Neural Information Processing Systems", "page_first": 831, "page_last": 837, "abstract": "", "full_text": "Edges are the 'Independent Components' of Natural Scenes. \n\nAnthony J. Bell and Terrence J. Sejnowski \n\nComputational Neurobiology Laboratory \nThe Salk Institute \n10010 N. Torrey Pines Road \nLa Jolla, California 92037 \ntony@salk.edu, terry@salk.edu \n\nAbstract \n\nField (1994) has suggested that neurons with line and edge selectivities found in primary visual cortex of cats and monkeys form a sparse, distributed representation of natural scenes, and Barlow (1989) has reasoned that such responses should emerge from an unsupervised learning algorithm that attempts to find a factorial code of independent visual features. We show here that non-linear 'infomax', when applied to an ensemble of natural scenes, produces sets of visual filters that are localised and oriented. Some of these filters are Gabor-like and resemble those produced by the sparseness-maximisation network of Olshausen & Field (1996). In addition, the outputs of these filters are as independent as possible, since the infomax network is able to perform Independent Components Analysis (ICA). We compare the resulting ICA filters, and their associated basis functions, with other decorrelating filters produced by Principal Components Analysis (PCA) and zero-phase whitening filters (ZCA). The ICA filters have more sparsely distributed (kurtotic) outputs on natural scenes. They also resemble the receptive fields of simple cells in visual cortex, which suggests that these neurons form an information-theoretic co-ordinate system for images. \n\n1 Introduction. 
\n\nBoth the classic experiments of Hubel & Wiesel [8] on neurons in visual cortex, and several decades of theorising about feature detection in vision, have left open the question most succinctly phrased by Barlow: \"Why do we have edge detectors?\" That is: are there any coding principles which would predict the formation of localised, oriented receptive fields? Barlow's answer is that edges are suspicious coincidences in an image. Formalised information-theoretically, this means that our visual cortical feature detectors might be the end result of a redundancy reduction process [4, 2], in which the activation of each feature detector is supposed to be as statistically independent from the others as possible. Such a 'factorial code' potentially involves dependencies of all orders, but most studies [9, 10, 2] (and many others) have used only the second-order statistics required for decorrelating the outputs of a set of feature detectors. Yet there are multiple decorrelating solutions, including the 'global' unphysiological Fourier filters produced by PCA, so further constraints are required. \n\nField [7] has argued for the importance of sparse, or 'Minimum Entropy', coding [4], in which each feature detector is activated as rarely as possible. Olshausen & Field demonstrated [12] that such a sparseness criterion could be used to self-organise localised, oriented receptive fields. \n\nHere we present similar results using a direct information-theoretic criterion which maximises the joint entropy of a non-linearly transformed output feature vector [5]. Under certain conditions, this process will perform Independent Component Analysis (ICA), which is equivalent to Barlow's redundancy reduction problem. Since our ICA algorithm, applied to natural scenes, does produce local edge filters, Barlow's reasoning is vindicated. 
Our ICA filters are more sparsely distributed than those of other decorrelating filters, thus supporting some of the arguments of Field (1994) and helping to explain the results of Olshausen's network from an information-theoretic point of view. \n\n2 Blind separation of natural images. \n\nA perceptual system is exposed to a series of small image patches, drawn from an ensemble of larger images. In the linear image synthesis model [12], each image patch, represented by the vector x, has been formed by the linear combination of N basis functions. The basis functions form the columns of a fixed matrix, A. The weighting of this linear combination (which varies with each image) is given by a vector, s. Each component of this vector has its own associated basis function, and represents an underlying 'cause' of the image. Thus: x = As. The goal of a perceptual system, in this simplified framework, is to linearly transform the images, x, with a matrix of filters, W, so that the resulting vector, u = Wx, recovers the underlying causes, s, possibly in a different order, and rescaled. For the sake of argument, we will define the ordering and scaling of the causes so that W = A^{-1}. But what should determine their form? If we choose decorrelation, so that \langle uu^T \rangle = I, then the solution for W must satisfy: \n\nW^T W = \langle xx^T \rangle^{-1}   (1) \n\nThere are several ways to constrain the solution to this: \n\n(1) Principal Components Analysis, W_P (PCA), is the orthogonal (global) solution [W W^T = I]. The PCA solution to Eq.(1) is W_P = D^{-1/2} E^T, where D is the diagonal matrix of eigenvalues, and E is the matrix whose columns are the eigenvectors. The filters (rows of W_P) are orthogonal, are thus the same as the PCA basis functions, and are typically global Fourier filters, ordered according to the amplitude spectrum of the image. Example PCA filters are shown in Fig.1a. 
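In code, the decorrelation condition of Eq.(1) and its PCA solution can be illustrated as follows (a minimal numpy sketch of our own, not taken from the paper's implementation; a correlated gaussian ensemble stands in for the natural-image patches, and all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 144, 20000                      # 12x12 patch dimension, sample count

# Stand-in ensemble: correlated gaussian data in place of image patches.
M = rng.standard_normal((N, N))
X = M @ rng.standard_normal((N, T))    # columns are patch vectors x
X -= X.mean(axis=1, keepdims=True)

C = X @ X.T / T                        # covariance <x x^T>
D, E = np.linalg.eigh(C)               # eigenvalues D, eigenvectors E

# PCA decorrelating solution: W_P = D^(-1/2) E^T, which satisfies Eq.(1)
W_P = np.diag(D ** -0.5) @ E.T

U = W_P @ X                            # outputs u = W x
cov_u = U @ U.T / T                    # decorrelated: <u u^T> = I
```

The rows of W_P here are mutually orthogonal but ordered by eigenvalue (amplitude), which is what makes the PCA filters global rather than local.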
\n(2) Zero-phase Components Analysis, W_Z (ZCA), is the symmetrical (local) solution [W W^T = W^2]. The ZCA solution to Eq.(1) is W_Z = \langle xx^T \rangle^{-1/2} (the matrix square root). ZCA is the polar opposite of PCA. It produces local (centre-surround type) whitening filters, which are ordered according to the phase spectrum of the image. That is, each filter whitens a given pixel in the image, preserving the spatial arrangement of the image and flattening its frequency (amplitude) spectrum. W_Z is related to the transforms described by Atick & Redlich [3]. Example ZCA filters and basis functions are shown in Fig.1b. \n\n(3) Independent Components Analysis, W_I (ICA), is the factorised (semi-local) solution [f_u(u) = \prod_i f_{u_i}(u_i)]. Please see [5] for full references. The 'infomax' solution we describe here is related to the approaches in [5, 1, 6]. \n\nAs we will show in Section 5, ICA on natural images produces decorrelating filters which are sensitive to both phase (locality) and spatial frequency information, just as in transforms involving oriented Gabor functions or wavelets. Example ICA filters are shown in Fig.1d and their corresponding basis functions are shown in Fig.1e. \n\n3 An ICA algorithm. \n\nIt is important to recognise two differences between finding an ICA solution, W_I, and other decorrelation methods: (1) there may be no ICA solution, and (2) a given ICA algorithm may not find the solution even if it exists, since there are approximations involved. In these senses, ICA is different from PCA and ZCA, and cannot be calculated analytically, for example, from second-order statistics (the covariance matrix), except in the gaussian case. \n\nThe approach which we developed in [5] (see there for further references to ICA) was to maximise, by stochastic gradient ascent, the joint entropy, H[g(u)], of the linear transform squashed by a sigmoidal function, g. 
When the non-linear function is the same (up to scaling and shifting) as the cumulative density functions (c.d.f.s) of the underlying independent components, it can be shown (Nadal & Parga [11]) that such a non-linear 'infomax' procedure also minimises the mutual information between the u_i, exactly what is required for ICA. In most cases, however, we must pick a non-linearity, g, without any detailed knowledge of the probability density functions (p.d.f.s) of the underlying independent components. In cases where the p.d.f.s are super-gaussian (meaning they are peakier and longer-tailed than a gaussian, having kurtosis greater than 0), we have repeatedly observed, using the logistic or tanh nonlinearities, that maximisation of H[g(u)] still leads to ICA solutions, when they exist, as with our experiments on speech signal separation [5]. Although the infomax algorithm is described here as an ICA algorithm, a fuller understanding needs to be developed of exactly under what conditions it may fail to converge to an ICA solution. \n\nThe basic infomax algorithm changes weights according to the entropy gradient. Defining y_i = g(u_i) to be the sigmoidally transformed output variables, the stochastic gradient learning rule is: \n\n\Delta W \propto \frac{\partial H(y)}{\partial W} = [W^T]^{-1} + \hat{y} x^T   (2) \n\nIn this, H(y) = E[\ln |J|] + H(x), where E denotes expected value and |J| is the absolute value of the determinant of the Jacobian matrix, J = \det [\partial y_i / \partial x_j]_{ij}; y = [g(u_1) \ldots g(u_N)]^T; and \hat{y} = [\hat{y}_1 \ldots \hat{y}_N]^T, the elements of which depend on the nonlinearity according to \hat{y}_i = \frac{\partial}{\partial y_i} \left( \frac{\partial y_i}{\partial u_i} \right). \n\nAmari, Cichocki & Yang [1] have proposed a modification of this rule which utilises the natural gradient rather than the absolute gradient of H(y). The natural gradient exists for objective functions which are functions of matrices, as in this case, and is the same as the relative gradient concept developed by Cardoso & Laheld [6]. It amounts to multiplying the absolute gradient by W^T W, giving, in our case, the following altered version of Eq.(2): \n\n\Delta W \propto [I + \hat{y} u^T] W   (3) \n\nThis rule has the twin advantages over Eq.(2) of avoiding the matrix inverse, and of converging several orders of magnitude more quickly for data, x, that is not prewhitened. The speedup is explained by the fact that convergence is no longer dependent on the conditioning of the underlying basis function matrix, A. Writing Eq.(3) for one weight gives \Delta w_{ij} \propto w_{ij} + \hat{y}_i \sum_k u_k w_{kj}. This rule is 'almost local', requiring only a backwards pass. \n\nFigure 1: Selected decorrelating filters and their basis functions extracted from the natural scene data. Each type of decorrelating filter yielded 144 12x12 filters, of which we display only a subset here (rows 1, 5, 7, 11, 15, 22, 37, 60, 63, 89, 109 and 144). Each column contains filters or basis functions of a particular type, and each of the rows has a number relating to which row of the filter or basis function matrix is displayed. (a) PCA (W_P): the 1st, 5th, 7th etc. Principal Components, showing increasing spatial frequency. There is no need to show basis functions and filters separately here since, for PCA, they are the same thing. (b) ZCA (W_Z): the first 6 entries in this column show the one-pixel-wide centre-surround filter which whitens while preserving the phase spectrum. All are identical, but shifted. The lower 6 entries (37, 60, ...) show the basis functions instead, which are the columns of the inverse of the W_Z matrix. (c) W: the weights learnt by the ICA network trained on W_Z-whitened data, showing (in descending order) the DC filter, localised oriented filters, and localised checkerboard filters. 
\n(d) W_I: the corresponding ICA filters, calculated according to W_I = W W_Z, looking like whitened versions of the W filters. (e) A: the corresponding basis functions, or columns of W_I^{-1}. These are the patterns which optimally stimulate their corresponding ICA filters, while not stimulating any other ICA filter, so that W_I A = I. \n\n4 Methods. \n\nWe took four natural scenes involving trees, leaves etc., and converted them to greyscale values between 0 and 255. A training set, {x}, was then generated of 17,595 12x12 samples from the images. This was 'sphered' by subtracting the mean and multiplying by twice the local symmetrical (zero-phase) whitening filter: {x} \leftarrow 2W_Z({x} - \langle x \rangle). This removes both first- and second-order statistics from the data, and makes the covariance matrix of x equal to 4I. This is an appropriately scaled starting point for further training since infomax (Eq.(3)) on raw data, with the logistic function, y_i = (1 + e^{-u_i})^{-1}, produces a u-vector which approximately satisfies \langle uu^T \rangle = 4I. Therefore, by prewhitening x in this way, we can ensure that the subsequent transformation, u = Wx, to be learnt should approximate an orthonormal matrix (rotation without scaling), roughly satisfying the relation W^T W = I. The matrix, W, is then initialised to the identity matrix, and trained using the logistic function version of Eq.(3), in which \hat{y}_i = 1 - 2y_i. Thirty sweeps through the data were performed, at the end of each of which the order of the data vectors was permuted to avoid cyclical behaviour in the learning. In each sweep, the weights were updated in batches of 50 presentations. 
The learning rate (the proportionality constant in Eq.(3)) followed 21 sweeps at 0.001, and 3 sweeps at each of 0.0005, 0.0002 and 0.0001, taking 2 hours running MATLAB on a Sparc-20 machine, though a reasonable result for 12x12 filters can be achieved in 30 minutes. To verify that the result was not affected by the starting condition of W = I, the training was repeated with several randomly initialised weight matrices, and also on data that was not prewhitened. The results were qualitatively similar, though convergence was much slower. \n\nThe full ICA transform from the raw image was calculated as the product of the sphering (ZCA) matrix and the learnt matrix: W_I = W W_Z. The basis function matrix, A, was calculated as W_I^{-1}. A PCA matrix, W_P, was also calculated. The original (unsphered) data was then transformed by all three decorrelating transforms, and for each the kurtosis of each of the 144 filters was calculated. Then the mean kurtosis for each filter type (ICA, PCA, ZCA) was calculated, averaging over all filters and input data. \n\n5 Results. \n\nThe filters and basis functions resulting from training on natural scenes are displayed in Fig.1 and Fig.2. Fig.1 displays example filters and basis functions of each type. The PCA filters, Fig.1a, are spatially global and ordered in frequency. The ZCA filters and basis functions are spatially local and ordered in phase. The ICA filters, whether trained on the ZCA-whitened images, Fig.1c, or the original images, Fig.1d, are semi-local filters, most with a specific orientation preference. The basis functions, Fig.1e, calculated from the Fig.1d ICA filters, are not local, and naturally have the spatial frequency characteristics of the original images. Basis functions calculated from Fig.1c (as with PCA filters) are the same as the corresponding filters, since the matrix W (as with W_P) is orthogonal. 
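The sphering and learning procedure of Section 4 can be sketched in code as follows (our own minimal numpy illustration of Eq.(3) with the logistic nonlinearity, not the original MATLAB implementation; toy laplacian sources and a random mixing matrix stand in for the natural-scene patch ensemble, and the learning-rate schedule is simplified to a single rate):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 20000

# Toy super-gaussian (laplacian) sources and a random mixing matrix A,
# standing in for the 12x12 natural-scene patch ensemble.
S = rng.laplace(size=(N, T))
A = rng.standard_normal((N, N))
X = A @ S                                  # x = A s
X -= X.mean(axis=1, keepdims=True)

# Sphering as in Section 4: multiply by twice the ZCA whitening filter,
# so that the covariance of the sphered data is 4I.
C = X @ X.T / T
Dv, E = np.linalg.eigh(C)
W_Z = E @ np.diag(Dv ** -0.5) @ E.T        # zero-phase filter <x x^T>^(-1/2)
Xw = 2.0 * (W_Z @ X)

# Natural-gradient infomax, Eq.(3): dW prop. to (I + y_hat u^T) W,
# with the logistic nonlinearity, for which y_hat = 1 - 2y.
W = np.eye(N)
lrate, batch = 0.001, 50
for sweep in range(30):
    perm = rng.permutation(T)              # permute order between sweeps
    for start in range(0, T, batch):
        u = W @ Xw[:, perm[start:start + batch]]
        y = 1.0 / (1.0 + np.exp(-u))
        W += lrate * ((batch * np.eye(N)) + (1.0 - 2.0 * y) @ u.T) @ W

W_I = W @ (2.0 * W_Z)                      # full ICA transform from raw data
recovered = W_I @ X
# Each recovered component should match one source up to sign and scale.
match = np.abs(np.corrcoef(np.vstack([recovered, S]))[:N, N:])
```

On such toy data the learnt transform recovers the sources up to permutation, sign and rescaling; trained instead on image patches, the same update yields the filters shown in Fig.2.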
\n\nIn order to show the full variety of ICA filters, Fig.2 shows, at lower resolution, all 144 filters in the matrix W, in reverse order of the vector-lengths of the filters, so that the filters corresponding to higher-variance independent components appear at the top. The general result is that ICA filters are localised and mostly oriented. Unlike the basis functions displayed in Olshausen & Field (1996), they do not cover a broad range of spatial frequencies. However, the appropriate comparison to make is between the ICA basis functions and the basis functions in Olshausen & Field's Figure 4. The ICA basis functions in our Fig.1e are oriented, but not localised, and therefore it is difficult to observe any multiscale properties. However, when we ran the ICA algorithm on Olshausen's images, which were preprocessed with a whitening/low-pass filter, our algorithm yielded basis functions which were localised multiscale Gabor patches, qualitatively similar to those in Olshausen's Figure 4. Part of the difference in our results is therefore attributable to different preprocessing techniques. \n\nThe distributions (image histograms) produced by PCA, ZCA and ICA are generally double-exponential (e^{-|u_i|}), or 'sparse', meaning peaky with a long tail when compared to a gaussian, as predicted by Field [7]. The log histograms are seen to be roughly linear across 5 orders of magnitude. The histogram for the ICA filters, however, departs slightly from linearity, being more peaked and having a longer tail than the ZCA and PCA histograms. This spreading of the tail signals the greater sparseness of the outputs of the ICA filters, and this is reflected in a calculated kurtosis measure of 10.04 for ICA, compared to 3.74 for PCA, and 4.5 for ZCA. 
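The kurtosis measure quoted above is the standard excess-kurtosis statistic, which is zero for a gaussian and positive for sparse distributions. A short sketch of our own, using synthetic gaussian and double-exponential samples rather than the paper's filter outputs:

```python
import numpy as np

def kurtosis(u):
    # Excess kurtosis: E[u^4] / E[u^2]^2 - 3, zero for a gaussian and
    # positive for peaky, long-tailed ('sparse') distributions.
    u = u - u.mean()
    return (u ** 4).mean() / ((u ** 2).mean() ** 2) - 3.0

rng = np.random.default_rng(2)
k_gauss = kurtosis(rng.standard_normal(200000))   # close to 0
k_sparse = kurtosis(rng.laplace(size=200000))     # close to 3
```

A double-exponential has excess kurtosis 3, so filter outputs with kurtosis near 10, as reported for ICA, are considerably sparser still.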
\n\nIn conclusion, these simulations show that the filters found by the ICA algorithm of Eq.(3) with a logistic non-linearity are localised, oriented, and produce output distributions of very high sparseness. It is notable that this is achieved through an information-theoretic learning rule which (1) has no noise model, (2) is sensitive to higher-order statistics (spatial coincidences), (3) is non-Hebbian (it is closer to anti-Hebbian), and (4) is simple enough to be almost locally implementable. Many other levels of higher-order invariance (translation, rotation, scaling, lighting) exist in natural scenes. It will be interesting to see if information-theoretic techniques can be extended to address these invariances. \n\nAcknowledgements \n\nThis work emerged through many extremely useful discussions with Bruno Olshausen and David Field. We are very grateful to them, and also to Paul Viola and Barak Pearlmutter. The work was supported by the Howard Hughes Medical Institute. \n\nReferences \n\n[1] Amari S., Cichocki A. & Yang H.H. 1996. A new learning algorithm for blind signal separation, Advances in Neural Information Processing Systems 8, MIT Press. \n\n[2] Atick J.J. 1992. Could information theory provide an ecological theory of sensory processing? Network 3, 213-251. \n\n[3] Atick J.J. & Redlich A.N. 1993. Convergent algorithm for sensory receptive field development, Neural Computation 5, 45-60. \n\n[4] Barlow H.B. 1989. Unsupervised learning, Neural Computation 1, 295-311. \n\n[5] Bell A.J. & Sejnowski T.J. 1995. An information maximization approach to blind separation and blind deconvolution, Neural Computation 7, 1129-1159. \n\n[6] Cardoso J.-F. & Laheld B. 1996. Equivariant adaptive source separation, IEEE Trans. on Signal Processing, Dec. 1996. \n\n[7] Field D.J. 1994. What is the goal of sensory coding? Neural Computation 6, 559-601. \n\n[8] Hubel D.H. & Wiesel T.N. 1968. 
Receptive fields and functional architecture of monkey striate cortex, J. Physiol. 195, 215-244. \n\n[9] Linsker R. 1988. Self-organization in a perceptual network, Computer 21, 105-117. \n\n[10] Miller K.D. 1988. Correlation-based models of neural development, in Neuroscience and Connectionist Theory, M. Gluck & D. Rumelhart, eds., 267-353, L. Erlbaum, NJ. \n\n[11] Nadal J.-P. & Parga N. 1994. Non-linear neurons in the low noise limit: a factorial code maximises information transfer, Network 5, 565-581. \n\n[12] Olshausen B.A. & Field D.J. 1996. Natural image statistics and efficient coding, Network: Computation in Neural Systems 7, 2. \n\nFigure 2: The matrix of 144 filters obtained by training on ZCA-whitened natural images. Each filter is a row of the matrix W, and they are ordered left-to-right, top-to-bottom in reverse order of the length of the filter vectors. In a rough characterisation, and more-or-less in order of appearance, the filters consist of one DC filter (top left), 106 oriented filters (of which 35 were diagonal, 37 were vertical and 34 horizontal), and 37 localised checkerboard patterns. The diagonal filters are longer than the vertical and horizontal ones due to the bias induced by having square, rather than circular, receptive fields. ", "award": [], "sourceid": 1321, "authors": [{"given_name": "Anthony", "family_name": "Bell", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}