{"title": "Deriving Receptive Fields Using an Optimal Encoding Criterion", "book": "Advances in Neural Information Processing Systems", "page_first": 953, "page_last": 960, "abstract": null, "full_text": "Deriving Receptive Fields Using An \n\nOptimal Encoding Criterion \n\nRalph Linsker \n\nIBM T. J. Watson Research Center \n\nP. O. Box 218, Yorktown Heights, NY 10598 \n\nAbstract \n\nAn information-theoretic optimization principle ('infomax') has \npreviously been used for unsupervised learning of statistical reg(cid:173)\nularities in an input ensemble. The principle states that the input(cid:173)\noutput mapping implemented by a processing stage should be cho(cid:173)\nsen so as to maximize the average mutual information between \ninput and output patterns, subject to constraints and in the pres(cid:173)\nence of processing noise. In the present work I show how infomax, \nwhen applied to a class of nonlinear input-output mappings, can \nunder certain conditions generate optimal filters that have addi(cid:173)\ntional useful properties: (1) Output activity (for each input pat(cid:173)\ntern) tends to be concentrated among a relatively small number \n(2) The filters are sensitive to higher-order statistical \nof nodes. \nstructure (beyond pairwise correlations). If the input features are \nlocalized, the filters' receptive fields tend to be localized as well. \n(3) Multiresolution sets of filters with subsampling at low spatial \nfrequencies - related to pyramid coding and wavelet representations \n- emerge as favored solutions for certain types of input ensembles. \n\n1 \n\nINTRODUCTION \n\nIn unsupervised network learning, the development of the connection weights is \ninfluenced by statistical properties of the ensemble of input vectors, rather than by \nthe degree of mismatch between the network's output and some 'desired' output. 
An implicit goal of such learning is that the network should transform the input so that salient features present in the input are represented at the output in a more useful form. This is often done by reducing the input dimensionality in a way that preserves the high-variance components of the input (e.g., principal component analysis, Kohonen feature maps).

The principle of maximum information preservation ('infomax') is an unsupervised learning strategy that states (Linsker 1988): From a set of allowed input-output mappings (e.g., parametrized by the connection weights), choose a mapping that maximizes the (ensemble-averaged) Shannon information that the output vector conveys about the input vector, in the presence of noise. Such a mapping maximizes the ensemble-averaged mutual information (MI) between input and output.

This paper (a) summarizes earlier results on infomax solutions for linear networks, (b) identifies some limitations of these solutions (ways in which very different filter sets are equally optimal from the infomax standpoint), and (c) shows how, by adding a small nonlinearity to the network, one can remove these limitations and at the same time improve the utility of the output representations. We show that infomax, acting on the modified network, tends to favor sparsely coded representations and (depending on the input ensemble) sets of filters that span multiple resolution scales (related to wavelets and 'pyramid coding').

2 INFOMAX IN LINEAR NETWORKS

For definiteness and brevity, we consider a linear network having a particular type of noise model and input statistical properties. For a more detailed discussion of related models see (Linsker 1989).
Since the computation of the MI (which involves the output entropy) is in general intractable for continuous-valued output vectors, previous work (and the present paper) makes use of a surrogate MI, which we will call the 'as-if-Gaussian' MI. This quantity is, by definition, computed as though the output vectors comprised a multivariate Gaussian distribution having the same mean and covariance as the actual distribution of output vectors. Although expedient, this substitution has lacked a principled justification. The Appendix shows that, under certain conditions, using this 'surrogate MI' (and not the full MI) is indeed appropriate and justified.

Denote the input vector by S = {S_i} (S_i is the activity at input node i), the output vector by Z = {Z_n}, the matrix of connection weights by C = {C_ni}, noise at the input nodes by N = {N_i}, and noise at the output nodes by ν = {ν_n}. Then our processing model is, in matrix form, Z = C(S + N) + ν. Assume that N and ν are Gaussian random variables, ⟨S⟩ = ⟨N⟩ = ⟨ν⟩ = 0, ⟨SN^T⟩ = ⟨Sν^T⟩ = ⟨Nν^T⟩ = 0, and, for the covariance matrices, ⟨SS^T⟩ = Q, ⟨NN^T⟩ = ηI, ⟨νν^T⟩ = βI'. (Angle brackets denote an ensemble average, superscript T denotes transpose, and I and I' denote unit matrices on the input and output spaces, respectively.) In general, MI = H_Z − ⟨H_{Z|S}⟩, where H_Z is the output entropy and H_{Z|S} is the entropy of the output for given S. Replacing MI by the 'as-if-Gaussian' MI means replacing H_Z by the expression for the entropy of a multivariate Gaussian distribution, which is (apart from an irrelevant constant term) H'_Z = (1/2) ln det Q', where Q' = ⟨ZZ^T⟩ = CQC^T + ηCC^T + βI' is the output covariance. Note that, when S is fixed, Z = CS + (CN + ν) is a Gaussian distribution centered on CS, so that we have ⟨H_{Z|S}⟩ = (1/2) ln det Q'', where Q'' = ⟨(CN + ν)(CN + ν)^T⟩ = ηCC^T + βI'.
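The two covariances Q' and Q'' can be formed directly. The following minimal numerical sketch (all variable names and parameter values are my own illustrative choices, not from the paper) builds them for a random linear network and checks that ln det Q' ≥ ln det Q'', which holds because Q' − Q'' = CQC^T is positive semidefinite:

```python
import numpy as np

# Sketch of the covariances in the processing model Z = C(S + N) + v, with
# <SS^T> = Q, <NN^T> = eta*I, <vv^T> = beta*I'. Names/values are illustrative.

rng = np.random.default_rng(0)
n_in, n_out, eta, beta = 8, 4, 0.1, 0.05

A = rng.standard_normal((n_in, n_in))
Q = A @ A.T / n_in                        # a valid (positive semidefinite) input covariance
C = rng.standard_normal((n_out, n_in))    # arbitrary connection weights

Qp = C @ Q @ C.T + eta * (C @ C.T) + beta * np.eye(n_out)   # Q'  = <ZZ^T>
Qpp = eta * (C @ C.T) + beta * np.eye(n_out)                # Q'' = noise-only covariance

# Q' - Q'' = C Q C^T is PSD, so ln det Q' >= ln det Q''.
print(np.linalg.slogdet(Qp)[1] - np.linalg.slogdet(Qpp)[1])
```

Since both matrices are strictly positive definite (β > 0), `slogdet` is well behaved, and the printed difference is exactly (twice) the surrogate MI of the next equation.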
Therefore the 'as-if-Gaussian' MI is

MI' = (1/2)[ln det Q' − ln det Q''].   (1)

The variance of the output at node n (prior to adding noise ν_n) is V_n = ⟨([C(S + N)]_n)^2⟩ = (CQC^T + ηCC^T)_nn. We will constrain the dynamic range of each output node (limiting the number of output values that can be discriminated from one another in the presence of output noise) by requiring that V_n = 1 for each n. Subject to this constraint, we are to find a matrix C that maximizes MI'. For a local Hebbian algorithm that accomplishes this maximization, see (Linsker 1992). Here, in order to proceed analytically, we consider a special case of interest.

Suppose that the input statistics are shift-invariant, so that the covariance ⟨S_i S_j⟩ is a function of (j − i). We then use a shift-invariant filter Ansatz, C_ni = C(i − n). Infomax then determines the optimal filter gain as a function of spatial frequency; i.e., the magnitude of the Fourier components c(k) of C(i − n). The derivation is summarized below.

Denote by q(k), q'(k), and q''(k) the Fourier transforms of Q(j − i), Q'(m − n), and Q''(m − n) respectively. Since Q' = CQC^T + ηCC^T + βI', therefore q'(k) = [q(k) + η] |c(k)|^2 + β. Similarly, q''(k) = η |c(k)|^2 + β. We obtain MI' = (1/2) Σ_k [ln q'(k) − ln q''(k)]. Each node's output variance V_n is equal to V = (1/K) Σ_k [q(k) + η] |c(k)|^2, where K is the number of terms in the sum over k. To maximize MI' subject to the constraint on V we use the Lagrange multiplier method; that is, we maximize MI'' = MI' + μ(V − 1) with respect to each |c(k)|^2. This yields an equation for each k that is quadratic in |c(k)|^2. The unique solution is

(η/β) |c(k)|^2 = −1 + (q(k) / (2[q(k) + η])) {1 + [1 − 2ηK/(μβ q(k))]^(1/2)}   (2)

if the RHS is positive, and zero otherwise.
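Equation 2 can be evaluated numerically. The sketch below (my own reading of the garbled original formula; function names and all parameter values are illustrative assumptions) implements the gain (η/β)|c(k)|^2 = −1 + (q(k)/(2[q(k)+η])){1 + [1 − 2ηK/(μβq(k))]^(1/2)}, clipped at zero, and bisects on μ < 0 until the variance constraint V = 1 is met:

```python
import numpy as np

# Numerical sketch of the infomax gain of Eq. 2 (my reconstruction of the
# garbled original); names and parameter values are illustrative, not the paper's.

def gain_squared(q, eta, beta, mu):
    """|c(k)|^2 from Eq. 2; zero wherever the RHS is negative."""
    K = len(q)
    rhs = -1.0 + (q / (2.0 * (q + eta))) * (
        1.0 + np.sqrt(1.0 - 2.0 * eta * K / (mu * beta * q)))
    return (beta / eta) * np.clip(rhs, 0.0, None)

def variance(q, eta, x2):
    """V = (1/K) sum_k [q(k)+eta] |c(k)|^2."""
    return np.mean((q + eta) * x2)

def solve_mu(q, eta, beta):
    """Bisect on mu < 0 so that the variance constraint V = 1 is satisfied."""
    lo, hi = -1e6, -1e-9            # V(lo) ~ 0, V(hi) >> 1; V increases toward mu = 0-
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if variance(q, eta, gain_squared(q, eta, beta, mid)) > 1.0:
            hi = mid                 # gain too large: make mu more negative
        else:
            lo = mid
    return 0.5 * (lo + hi)

k = np.arange(1, 65)
q = 1.0 / (1.0 + (k / 16.0) ** 2)    # a smooth, positive input power spectrum
mu = solve_mu(q, eta=0.1, beta=0.05)
x2 = gain_squared(q, 0.1, 0.05, mu)
print(mu, variance(q, 0.1, x2))
```

The clipping implements the "zero otherwise" clause; for very negative μ every mode is shut off (V = 0), while μ → 0− gives arbitrarily large gain, so a root with V = 1 always lies in the bracket.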
The Lagrange multiplier μ (< 0) is chosen so that the {|c(k)|} satisfy V = 1.

Starting from a differently-stated goal (that of reducing redundancy subject to a limit on information loss), which turns out to be closely related to infomax, Atick & Redlich (1990a) found an expression for the optimal filter gain that is the same as that of Eq. 2 except for the choice of constraint.

Filter properties found using this approach are related to those found in early stages of biological sensory processing. Smoothing and bandpass (contrast-enhancing) filters emerge as infomax solutions (Linsker 1989, Atick & Redlich 1990a) in certain cases, and good agreement with retinal contrast sensitivity measurements has been found (Atick & Redlich 1990b).

Nonetheless, the value of the infomax solution Eq. 2 is limited in two important ways. First, the phases of the {c(k)} are left undetermined. Any choice of phases is equally good at maximizing MI' in a linear network. Thus the real-space response function C(i − n), which determines the receptive field properties of the output nodes, is nonunique (and indeed may be highly nonlocalized in space).

Second, it is useful to extend the solution Ansatz to allow a number of different filter types a = 1, ..., A at each output site, while continuing to require that each type satisfy the shift-invariance condition C_ni(a) = C(i − n; a). For example, one may want to model a topographic 'retinocortical' mapping in which each patch of cortex (each 'site') contains multiple filter types, yet each patch carries out the same set of processing functions on its input. For this Ansatz, one again obtains Eq. 2 (derivation omitted here), but with |c(k)|^2 on the LHS replaced by Σ_a p(a) |c(k; a)|^2, where c(k; a) is the F.T. of C(i − n; a), and p(a) is the fraction of the total number of filters (at each site) that are of type a.
The partitioning of the overall (sum-squared) gain among the multiple filter types is thus left undetermined.

The higher-order statistical structure of the input (beyond covariance) is not being exploited by infomax in the above analysis, because (1) the network is linear and (2) only pairwise correlations among the output activities enter into MI'. We shall show that if we make the network even mildly nonlinear, MI' is no longer independent of the choice of phases or of the partitioning of gain among multiple filter types.

3 NETWORK WITH WEAK NONLINEARITY

We consider the weakly nonlinear input-output relation Z_n = U_n + εU_n^3 + Σ_i C_ni N_i + ν_n, where U_n = Σ_i C_ni S_i, for small ε. This differs from the linear network analyzed above by the term in U_n^3. (For simplicity, terms nonlinear in the noise are not included.) The cubic term increases the signal-to-noise ratio selectively when U_n is large in absolute value. We maximize MI' as defined in Eq. 1.

Heuristically, the new term will cause infomax to favor solutions in which some output nodes have large (absolute) activity values, over solutions in which all output nodes have moderate activities. The output layer can thus encode information about the input vector (e.g., signal the presence of a feature) via the high activity of a small number of nodes, rather than via the particular activity values of many nodes. This has several (interrelated) potential advantages. (1) The concentration of activity among fewer nodes is a type of sparse coding. (2) The resulting output representation may be more resistant to noise. (3) The presence of a feature can be signaled to a later processing stage using fewer connections. (4) Since the particular nodes that have high activity depend upon the input vector, this type of mapping transforms a set of continuous-valued inputs at each site into a partially place-coded representation.
A model of this sort may thus be useful for understanding better the formation of place-coded representations in biological systems.

3.1 MATHEMATICAL DETAILS

This section may be skipped without loss of continuity. In matrix form, U = CS, W_n = U_n^3 for each n, and Z = U + εW + CN + ν. Keeping terms through first order in ε, the output covariance is Q' = ⟨ZZ^T⟩ = CQC^T + ηCC^T + βI' + εF, where F = ⟨WU^T⟩ + ⟨UW^T⟩. [As an aside, F_nm = ⟨U_n U_m (U_n^2 + U_m^2)⟩ resembles the covariance ⟨U_n U_m⟩, except that presentations having large U_n^2 + U_m^2 are given greater weight in the ensemble average.] For shift-invariant input statistics and one filter type C_ni = C(i − n), taking the Fourier transform yields q'(k) = [q(k) + η] |c(k)|^2 + β + εf(k), where f(k) is the F.T. of F(m − n) = F_nm. So ln det Q' = Σ_k ln q'(k) = Σ_k ln{[q(k) + η] |c(k)|^2 + β} + ε Σ_k g(k), where g(k) ≡ f(k)/{[q(k) + η] |c(k)|^2 + β}. Using a Lagrange multiplier as before, the quantity to be maximized is MI'' = MI''(ε = 0) + (ε/2) Σ_k g(k).

Figure 1: Breaking of phase degeneracy. See text for discussion.

Now suppose there are multiple filter types a = 1, ..., A at each output site. For each k define d(k) to be the A × A matrix whose elements are: d(k)_ab = [q(k) + η] c(k; a) c*(k; b) + [β/p(a)] δ_ab, where δ_ab is the Kronecker delta. Also define f(k) to be the A × A matrix each of whose elements f(k)_ab is the F.T. of F(m − n; a, b), where F(m − n; a, b) = ⟨U_n(a) W_m(b)⟩ + ⟨W_n(a) U_m(b)⟩. Then the O(ε) part of MI'' is: (ε/2) Σ_k Tr{[d(k)]^{-1} f(k)}. Note that [d(k)]^{-1} is the inverse of the matrix d(k), and that 'Tr' denotes the trace. [Outline of derivation: In the basis defined by the Fourier harmonics, Q' is block diagonal (one A × A block for each k). So ln det Q' = Σ_k ln det q'(k), where each q'(k) is an A × A matrix of the form q'_0(k) + εq'_1(k).
Expanding ln det q'(k) through O(ε) yields the stated result.]

The infomax calculation to lowest order in ε [i.e., O(ε^0)] is the same as for the linear network. Here, for simplicity, we determine the sum-squared gain, Σ_a p(a) |c(k; a)|^2, as in the linear case; then seek to maximize the new term, of O(ε), subject to this constraint on the value of the sum-squared gain. How the nonlinear term breaks phase and gain-apportionment degeneracies is of interest here; a small O(ε) correction to the sum-squared gain is not.

4 ILLUSTRATIVE RESULTS

Two examples will show how adding the nonlinear perturbative term to the network's output breaks a degeneracy among different filter solutions. In each case the input space is a one-dimensional 'retina' with wraparound.

4.1 BREAKING THE PHASE DEGENERACY

In this example (see Figure 1) there is one filter type at each output site. We consider two types of input ensembles: (1) Each input vector (Fig. 1a shows one example) is drawn from a multivariate Gaussian distribution (so there is no higher-order statistical structure beyond pairwise correlations). The input covariance matrix Q(j − i) is a Gaussian function of the distance between the sites. (2) Each input vector is a random sum of Gaussian 'bumps': S_i = Σ_j a_j [s(i − j) − s_0], where s(i − j) is a Gaussian (shown in Fig. 1b for j = 20; there are 64 nodes in all); s_0 is the mean value of s(i − j); and each a_j is independently and randomly chosen (with constant probability) to be 1 or 0. This ensemble does have higher-order structure, with each input presentation being characterized by the presence of localized features (the bumps) at particular locations.

The infomax solution for |c(k)|^2 is plotted versus spatial frequency k in Fig. 1c for a particular choice of noise parameters (η, β).
As stated earlier, MI' for a linear network is indifferent to the phases of the Fourier components {c(k)}. A particular random choice of phases produces the real-space filter C(i − n) shown in Fig. 1d, which spans the entire 'retina.' Setting all phases to zero produces the localized filter shown in Fig. 1f. If the Gaussian 'bump' of Fig. 1b is presented as input to a network of filters each of which is a shifted version of Fig. 1d, the linear response of the network (i.e., the convolution of the 'bump' with the filter) is shown in Fig. 1e. Replacing the filter of Fig. 1d by that of Fig. 1f, but keeping the input the same, produces the output response shown in Fig. 1g.

The cubic nonlinearity causes MI' to be larger for the filter of Fig. 1f than for that of Fig. 1d. Heuristically, if we focus on the diagonal elements of the output covariance Q', the nonlinear term is 2ε⟨U_n^4⟩. Maximizing MI' favors increasing this term (subject to a constraint on output variance), hence favors filter solutions for which the U_n distribution is non-Gaussian with a preponderance of large values. Projection pursuit methods also use a measure of the non-Gaussianity of the output distribution to construct filters that extract 'interesting' features from high-dimensional data (cf. Intrator 1992).

4.2 BREAKING THE PARTITIONING DEGENERACY FOR MULTIPLE FILTER TYPES

In this example (see Fig. 2), the input ensemble comprises a set of self-similar patterns (each is a sine-Gabor 'ripple' as in Fig. 2a) that are related by translation and dilation (scale change over a factor of 80). Figure 2b shows the input power spectrum vs. k; the scaling region goes as 1/k. Figure 2c shows the infomax solution for the gain |c(k; a)| vs. k when there is just one filter type.
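Before turning to the partitioning results, the phase-degeneracy point of Sec. 4.1 can be illustrated with a small simulation (my sketch, not the paper's code). Two circular filters share the same gain |c(k)| but have random versus zero phases (loosely analogous to Figs. 1d and 1f); applied to a bump-sum ensemble, they produce identical second-order output statistics, while their fourth moments are free to differ. The gain profile, bump width, and ensemble size are arbitrary assumptions:

```python
import numpy as np

# Two filters with identical |c(k)| but different phases: identical output
# power spectra (so identical MI' in the linear network), but <U^4> may differ.

rng = np.random.default_rng(4)
n = 64
k = np.arange(n // 2 + 1)
mag = np.exp(-((k - 8.0) ** 2) / 32.0)        # an illustrative bandpass gain

phase = rng.uniform(-np.pi, np.pi, size=k.size)
phase[0] = phase[-1] = 0.0                     # keep the filters real-valued
f_rand = np.fft.irfft(mag * np.exp(1j * phase), n)   # random phases: delocalized
f_zero = np.fft.irfft(mag, n)                        # zero phase: localized

# Input ensemble: random sums of Gaussian 'bumps' on a ring (cf. Fig. 1b)
x = np.minimum(np.arange(n), n - np.arange(n))
bump = np.exp(-x ** 2 / 8.0)
S = np.zeros((2000, n))
for t in range(2000):
    for j in rng.integers(0, n, size=4):
        S[t] += np.roll(bump, j)
S -= S.mean(axis=1, keepdims=True)

def respond(filt):        # circular convolution of each input with the filter
    return np.fft.irfft(np.fft.rfft(S, axis=1) * np.fft.rfft(filt), n, axis=1)

U_rand, U_zero = respond(f_rand), respond(f_zero)
kurt = lambda u: np.mean(u ** 4) / np.mean(u ** 2) ** 2
print(U_rand.var(), U_zero.var())   # equal: second-order statistics match
print(kurt(U_rand), kurt(U_zero))   # generally differ for bump-like inputs
```

The variances agree to floating-point precision because the per-frequency output power depends only on |c(k)|; only a criterion sensitive to ⟨U^4⟩, such as the cubic term, can distinguish the two filters.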
When the input SNR is large (as in the scaling region) the infomax filters 'whiten' the output; note the flat portion of the output power spectrum (Fig. 2d). [We modify the infomax solution by extending the power-law form of |c(k)| to low k (dotted line in Figs. 2c,d). This avoids artifacts resulting from the rapid increase in |c(k)|, which is in turn caused by our having omitted low-k patterns from the input ensemble for reasons of numerical efficiency.] The dotted envelope curve in Figure 2e shows the sum-squared gain Σ_a p(a) |c(k)|^2 when multiple filter types a are allowed. The quantity plotted is just the square of that shown in Fig. 2c, but on a linear rather than log-log plot (note values greater than 5 are cut off to save space).

The network nonlinearity has the following effect. We first allow two filter types to share the overall gain. Optimizing MI' over various partitionings, we find that infomax favors a crossover between filter types at k ≈ 400. Allowing three, then four, filter types produces additional crossovers at lower k. For an Ansatz in which each filter's share of the sum-squared gain is tapered linearly near its cutoff frequencies, the best solution found for each p(a) |c(k)|^2 is shown in Fig. 2e (semilog plot vs. k). Figure 2f plots the corresponding |c(k; a)| vs. k on a linear scale. Note that the three lower-k filters appear roughly self-similar. (The peak in the highest-k filter is an artifact due to the cutoff of the input ensemble at high k.) The four real-space filters C(i − n; a) are plotted vs. (i − n) in Fig. 2g [phases chosen to make C(i − n; a) antisymmetric].

The resulting filters span multiple resolution scales. The density p(a) is less for the lower-frequency filters (spatial subsampling). When more filter types are allowed, the increase in MI' becomes progressively less.
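The tapered-partition Ansatz can be sketched as follows (an illustration of the construction only; the crossover locations and taper width here are arbitrary choices, not the infomax optimum found in the paper):

```python
import numpy as np

# Partition a sum-squared gain envelope among several filter types, with each
# type's share tapered linearly near its cutoffs so the shares sum exactly to
# the envelope (cf. Fig. 2e). All numbers below are illustrative assumptions.

def partition(envelope, k, cutoffs, taper):
    """Return one gain share per band; adjacent bands cross over linearly."""
    edges = [k[0] - 1] + list(cutoffs) + [k[-1] + 1]
    shares = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        rise = np.clip((k - (lo - taper / 2)) / taper, 0, 1)
        fall = np.clip(((hi + taper / 2) - k) / taper, 0, 1)
        shares.append(rise * fall)
    total = np.sum(shares, axis=0)          # > 0 everywhere for these tapers
    return [envelope * w / total for w in shares]

k = np.arange(1, 513)
envelope = 1.0 / k                           # a 1/k-like sum-squared gain
shares = partition(envelope, k, cutoffs=[32, 128, 400], taper=16.0)
print(len(shares))
```

Normalizing by the summed weights guarantees that Σ_a p(a)|c(k; a)|^2 reproduces the envelope at every k, which is the constraint under which the O(ε) term is then optimized.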
Although in our model the filters are present with density p at each output site, a similar MI' is obtained if one spaces adjacent filters of type a by a distance