{"title": "Redundancy and Dimensionality Reduction in Sparse-Distributed Representations of Natural Objects in Terms of Their Local Features", "book": "Advances in Neural Information Processing Systems", "page_first": 901, "page_last": 907, "abstract": null, "full_text": "Redundancy and Dimensionality Reduction in \nSparse-Distributed Representations of Natural \n\nObjects in Terms of Their Local Features \n\nPenio S. Penev* \n\nLaboratory of Computational Neuroscience \n\nThe Rockefeller University \n\n1230 York Avenue, New York, NY 10021 \n\npenev@rockefeller.edu http://venezia.rockefeller.edu/ \n\nAbstract \n\nLow-dimensional representations are key to solving problems in high(cid:173)\nlevel vision, such as face compression and recognition. Factorial coding \nstrategies for reducing the redundancy present in natural images on the \nbasis of their second-order statistics have been successful in account(cid:173)\ning for both psychophysical and neurophysiological properties of early \nvision. Class-specific representations are presumably formed later, at \nthe higher-level stages of cortical processing. Here we show that when \nretinotopic factorial codes are derived for ensembles of natural objects, \nsuch as human faces, not only redundancy, but also dimensionality is re(cid:173)\nduced. We also show that objects are built from parts in a non-Gaussian \nfashion which allows these local-feature codes to have dimensionalities \nthat are substantially lower than the respective Nyquist sampling rates. \n\n1 Introduction \n\nSensory systems must take advantage of the statistical structure of their inputs in order to \nprocess them efficiently, both to suppress noise and to generate compact representations of \nseemingly complex data. Redundancy reduction has been proposed as a design principle \nfor such systems (Barlow, 1961); in the context of Information Theory (Shannon, 1948), \nit leads to factorial codes (Barlow et aI., 1989; Linsker, 1988). 
When only the second-order statistics are available for a given sensory ensemble, the maximum entropy initial assumption (Jaynes, 1982) leads to a multi-dimensional Gaussian model of the probability density; then, the Karhunen-Loeve Transform (KLT) provides a family of equally efficient factorial codes. In the context of the ensemble of natural images, with a specific model for the noise, these codes have been able to account quantitatively for the contrast sensitivity of human subjects in all signal-to-noise regimes (Atick and Redlich, 1992). Moreover, when the receptive fields are constrained to have retinotopic organization, their circularly symmetric, center-surround opponent structure is recovered (Atick and Redlich, 1992).\n\nAlthough redundancy can be reduced in the ensemble of natural images, because its spectrum obeys a power law (Ruderman and Bialek, 1994), there is no natural cutoff, and the dimensionality of the \"retinal\" code is the same as that of the input. This situation is not typical. When KLT representations are derived for ensembles of natural objects, such as human faces (Sirovich and Kirby, 1987), the factorial codes in the resulting families are naturally low-dimensional (Penev, 1998; Penev and Sirovich, 2000). Moreover, when a retinotopic organization is imposed, in a procedure called Local Feature Analysis (LFA), the resulting feed-forward receptive fields are a dense set of detectors for the local features from which the objects are built (Penev and Atick, 1996). LFA has also been used to derive local features for the natural-object ensembles of 3D surfaces of human heads (Penev and Atick, 1996) and 2D images of pedestrians (Poggio and Girosi, 1998).\n\n* Present address: NEC Research Institute, 4 Independence Way, Princeton, NJ 08550\n\n
\nParts-based representations of object classes, including faces, have been recently derived by Non-negative Matrix Factorization (NMF) (Lee and Seung, 1999), \"biologically\" motivated by the hypothesis that neural systems are incapable of representing negative values. As has already been pointed out (Mel, 1999), this hypothesis is incompatible with a wealth of reliably documented neural phenomena, such as center-surround receptive field organization, excitation and inhibition, and ON/OFF visual-pathway processing, among others.\n\nHere we demonstrate that when parts-based representations of natural objects are derived by redundancy reduction constrained by retinotopy (Penev and Atick, 1996), the resulting sparse-distributed, local-feature representations not only are factorial, but also are of dimensionalities substantially lower than the respective Nyquist sampling rates.\n\n2 Compact Global Factorial Codes of Natural Objects\n\nA properly registered and normalized object will be represented by the receptor readout values φ(x), where {x} is a grid that contains V receptors. An ensemble of T objects will be denoted by {φ^t(x)}_{t ∈ T}.1 Briefly (see, e.g., Sirovich and Kirby, 1987, for details), when T > V, its Karhunen-Loeve Transform (KLT) representation is given by\n\nφ^t(x) = Σ_{r=1}^{V} a_r^t σ_r ψ_r(x)    (1)\n\nwhere {σ_r^2} (arranged in non-increasing order) is the eigenspectrum of the spatial and temporal correlation matrices, and {ψ_r(x)} and {a_r^t} are their respective orthonormal eigenvectors. The KLT representation of an arbitrary, possibly out-of-sample, object φ(x) is given by the joint activation\n\na_r = σ_r^{-1} Σ_x ψ_r(x) φ(x)    (2)\n\nof the set of global analysis filters {σ_r^{-1} ψ_r(x)}, which are indexed with r, and whose outputs, {a_r}, are decorrelated. 
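The construction in (1)-(2) amounts to an eigendecomposition of the ensemble correlation matrix followed by whitening. The following is a minimal numerical sketch on a synthetic Gaussian ensemble (the sizes V and T and the data are illustrative stand-ins, not the facial ensemble of the paper); it checks that the whitened outputs are decorrelated with unit variance and that (1) reconstructs the input:

```python
import numpy as np

# Sketch of the KLT construction (1)-(2) on synthetic data.
rng = np.random.default_rng(0)
V, T = 64, 1000                      # receptors, ensemble size (T > V)
phi = rng.standard_normal((T, V)) @ rng.standard_normal((V, V))

# Spatial correlation matrix and its eigendecomposition.
C = phi.T @ phi / T                  # V x V, symmetric positive definite
evals, psi = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]      # non-increasing eigenspectrum
sigma = np.sqrt(evals[order])        # sigma_r^2 are the eigenvalues
psi = psi[:, order]                  # orthonormal eigenvectors psi_r(x)

# KLT coefficients (2): outputs of the whitened analysis filters.
a = (phi @ psi) / sigma              # a_r = sigma_r^{-1} sum_x psi_r(x) phi(x)

# The outputs are decorrelated with unit variance, and (1) is exact.
cov = a.T @ a / T
assert np.allclose(cov, np.eye(V), atol=1e-6)
assert np.allclose((a * sigma) @ psi.T, phi, atol=1e-8)
```

The decorrelation holds by construction: the covariance of the projections is the diagonalized correlation matrix, and dividing by σ_r rescales each direction to unit variance.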
In the context of the ensemble of natural images, the \"whitening\" by the factor σ_r^{-1} has been found to account for the contrast sensitivity of human subjects (Atick and Redlich, 1992). When the output dimensionality is set to N < V, the reconstruction (optimal in the amount of preserved signal power) and the respective error utilize the global synthesis filters {σ_r ψ_r(x)}, and are given by\n\nφ_N^{rec} = Σ_{r=1}^{N} a_r σ_r ψ_r and φ_N^{err} = φ - φ_N^{rec}.    (3)\n\n1 For the illustrations in this study, T = 11254 frontal-pose facial images were registered and normalized to a grid with V = 64 x 60 = 3840 pixels as previously described (Penev and Sirovich, 2000).\n\n2 This is certainly true for in-sample objects, since {a_r^t} are orthonormal (1). For out-of-sample objects, there is always the issue whether the size of the training sample, T, is sufficient to ensure proper generalization. The current ensemble has been found to generalize well in the regime for r that is explored here (Penev and Sirovich, 2000).\n\n[Figure 1 panels, for N = 20, 60, 100, 150, 220, 350, 500, 700, 1000]\n\nFigure 1: Successive reconstructions, errors, and local entropy densities. For the indicated global dimensionalities, N, the reconstructions φ_N^{rec} (3) of an out-of-sample example are shown in the top row, and the respective residual errors, φ_N^{err}, in the middle row (the first two errors are amplified 5x and the rest, 20x). The respective entropy densities O_N (5) are shown in the bottom row, low-pass filtered with F_{r,N} = σ_r^2 / (σ_r^2 + σ_N^2) (cf. Fig. 3), and scaled adaptively at each N to fill the available dynamic range.\n\nWith the standard multidimensional Gaussian model for the probability density P[φ] (Moghaddam and Pentland, 1997; Penev, 1998), the information content of the reconstruction (3), equal to the optimal-code length (Shannon, 1948; Barlow, 1961), is\n\n-log P[φ_N^{rec}] ∝ Σ_{r=1}^{N} |a_r|^2.    (4)\n\nBecause of the normalization by σ_r^{-1} in (2), all KLT coefficients have unit variance; the model (4) is spherically symmetric, and all filters contribute equally to the entropy of the code. What criterion, then, could guide dimensionality reduction?\n\nFollowing (Atick and Redlich, 1992), when noise is taken into account, N ≈ 400 has been found as an estimate of the global dimensionality for the ensemble of frontal-pose faces (Penev and Sirovich, 2000). This conclusion is reinforced by the perceptual quality of the successive reconstructions and errors, shown in Fig. 1: the face-specific information crosses over from the error to the reconstruction at N ≈ 400, but not much earlier.\n\n3 Representation of Objects in Terms of Local Features\n\nIt was shown in Section 2 that when redundancy reduction on the basis of the second-order statistics is applied to ensembles of natural objects, the resulting factorial code is compact (low-dimensional), in contrast with the \"retinal\" code, which preserves the dimensionality of the input (Atick and Redlich, 1992). Also, the filters in the beginning of the hierarchy (Fig. 2) correspond to intuitively understandable sources of variability. Nevertheless, this compact code has some problems. The learned receptive fields, shown in Fig. 2, are global, in contrast with the local, retinotopic organization of sensory processing, found throughout most of the visual system. Moreover, although the eigenmodes in the regime r ∈ [100, 400] are clearly necessary to preserve the object-specific information (Fig. 1), their respective global filters (Fig. 2) are ripply, non-intuitive, and resemble the hierarchy of sine/cosine modes of the translationally invariant ensemble of natural images. 
\nFigure 2: The basis-vector hierarchy of the global factorial code. Shown are the first 14 eigenvectors, and the ones with indices: 21, 41; and 60, 94, 155, 250, 500, 1000, 2000, 3840 (bottom row).\n\nFigure 3: Local feature detectors and residual correlations of their outputs. centers: The typical face (ψ_1, Fig. 2) is marked with the central positions of five of the feature detectors. a-e: For those choices of x_m, the local filters K(x_m, y) (6) are shown in the top row, and the residual correlations of their respective outputs with the outputs of all the rest, P(x_m, y) (9), in the bottom. In principle, the cutoff at r = N, which effectively implements a low-pass filter, should not be as sharp as in (6); it has been shown that the human contrast sensitivity is described well by a smooth cutoff of the type F_r = σ_r^2 / (σ_r^2 + n^2), where n^2 is a measure of the effective noise power (Atick and Redlich, 1992). For this figure, K(x, y) = Σ_r ψ_r(x) [σ_r / (σ_r^2 + n^2)] ψ_r(y), with N = 400 and n = σ_400.\n\nIn order to cope with these problems in the context of object ensembles, analogously to the local factorial retinal code (Atick and Redlich, 1992), Local Feature Analysis (LFA) has been developed (Penev and Atick, 1996; Penev, 1998). LFA uses a set of local analysis filters, K(x, y), whose outputs are topographically indexed with the grid variable x (cf. eq. 2)\n\nO(x) = (1/V) Σ_y K(x, y) φ(y)    (5)\n\nand are as decorrelated as possible. For a given dimensionality, or width of the band, of the compact code, N, maximal decorrelation can be achieved with K(x, y) = K_N^{(-1)}(x, y) from the following topographic family of kernels\n\nK_N^{(n)}(x, y) ≡ Σ_{r=1}^{N} ψ_r(x) σ_r^n ψ_r(y).    (6)\n\nFor the ensemble of natural scenes, which is translationally and rotationally invariant, the local filters (6) are center-surround receptive fields (Atick and Redlich, 1992). 
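The kernel family (6), and the property that the outputs of the analysis kernel are as decorrelated as possible (their correlation matrix is the projector onto the subband, eq. 9 below), can be checked numerically. A minimal sketch on synthetic data follows; all sizes are illustrative, and the 1/V normalization of (5) is omitted, which does not affect the correlation structure:

```python
import numpy as np

# Sketch of the LFA kernel family (6) on a synthetic ensemble.
rng = np.random.default_rng(1)
V, T, N = 64, 1000, 16
phi = rng.standard_normal((T, V)) @ rng.standard_normal((V, V))

C = phi.T @ phi / T
evals, psi = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
sigma, psi = np.sqrt(evals[order]), psi[:, order]

def kernel(n, width):
    """Topographic kernel K_width^{(n)}(x, y) = sum_{r<=width} psi_r(x) sigma_r^n psi_r(y)."""
    return (psi[:, :width] * sigma[:width] ** n) @ psi[:, :width].T

K = kernel(-1, N)      # local analysis filters: row x is the detector at x
O = phi @ K            # outputs O(x), topographically indexed by x

# The output correlations equal the projector onto the subband,
# P_N = K_N^{(0)}: maximal decorrelation for a band of width N.
P = O.T @ O / T
assert np.allclose(P, kernel(0, N), atol=1e-6)
```

The identity follows from K C K = Ψ_N Σ^{-1} Σ^2 Σ^{-1} Ψ_N^T = Ψ_N Ψ_N^T, i.e., whitening within the band leaves exactly the projector as the residual correlation.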
For object ensembles, the process of construction (categorization) breaks a number of symmetries and shifts the higher-order statistics into second order, where they are conveniently exposed to robust estimation and, subsequently, to redundancy reduction. The resulting local receptive fields, some of which are shown in the top row of Fig. 3, turn out to be feature detectors that are optimally tuned to the structures that appear at their respective centers. Although the local factorial code does not exhibit the problems discussed earlier, it has representational properties that are equivalent to those of the global factorial code. The reconstruction and error are identical, but now utilize the local synthesis filters K_N^{(1)} (6)\n\nφ_N^{rec}(x) = Σ_{r=1}^{N} a_r σ_r ψ_r(x) = (1/V) Σ_y K_N^{(1)}(x, y) O(y)    (7)\n\nand the information (4) is expressed in terms of O(x), which therefore provides the local information density\n\n-log P[φ_N^{rec}] ∝ Σ_{r=1}^{N} |a_r|^2 = (1/V) Σ_x |O_N(x)|^2.    (8)\n\n4 Greedy Sparsification of the Smooth Local Information Density\n\nIn the case of natural images, N = V, and the outputs of the local filters are completely decorrelated (Atick and Redlich, 1992). For natural objects, the code is low-dimensional (N < V), and residual correlations, some shown in the bottom row of Fig. 3, are unavoidable; they are generally given by the projector to the subband\n\nP_N(x, y) ≡ (1/T) Σ_t O^t(x) O^t(y) ≡ K_N^{(0)}(x, y)    (9)\n\nand are as close to δ(x, y) as possible (Penev and Atick, 1996). The smoothness of the local information density is controlled by the width of the band, as shown in Fig. 1. Since O(x) is band-limited, it can generally be reconstructed exactly from a subsampling over a limited set of grid points M ≡ {x_m}, from the |M| variables {O_m ≡ O(x_m)}_{x_m ∈ M}, as long as this density is critically sampled (|M| = N). 
When |M| < N, the maximum-likelihood interpolation in the context of the probability model (8) is given by\n\nO^{rec}(x) = Σ_{m=1}^{|M|} O_m a^m(x) with a^m(x) = Σ_{n=1}^{|M|} (Q^{-1})_{mn} P_n(x)    (10)\n\nwhere P_m(x) ≡ P(x_m, x), and Q ≡ P|_M is the restriction of P on the set of reference points, with Q_{nm} = P_n(x_m) (Penev, 1998). When O(x) is critically sampled (|M| = N) on a regular grid, V → ∞, and the eigenmodes (1) are sines and cosines, then (10) is the familiar Nyquist interpolation formula. In order to improve numerical stability, irregular subsampling has been proposed (Penev and Atick, 1996), by a data-driven greedy algorithm that successively enlarges the support of the subsampling at the n-th step, M^{(n)}, by optimizing for the residual entropy error, ||O^{err}(x)||^2 = ||O(x) - O^{rec}(x)||^2.\n\nThe LFA code is sparse. In a recurrent neural-network implementation (Penev, 1998), the dense output O(x) of the feed-forward receptive fields, K(x, y), has been interpreted as sub-threshold activation, which is predictively suppressed through lateral inhibition with weights P_m(x), by the set of active units, at {x_m}.3\n\n5 Dimensionality Reduction Beyond the Nyquist Sampling Rate\n\nThe efficient allocation of resources by the greedy sparsification is evident in Fig. 4A-B; the most prominent features are picked up first (Fig. 4A), and only a handful of active units are used to describe each individual local feature (Fig. 4B). Moreover, when the dimensionality of the representation is constrained, evidently from Fig. 4C-F, the sparse local code has a much better perceptual quality of the reconstruction than the compact global one. 
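The interpolation (10) and the greedy growth of the support M can be sketched as follows. This is a simplified variant on synthetic data: the support is grown by picking the grid point with the largest pointwise residual, a cheaper proxy for optimizing the full residual entropy error, and all sizes are illustrative. At critical sampling, |M| = N, the band-limited O(x) is recovered exactly:

```python
import numpy as np

# Sketch of the ML interpolation (10) with greedy support selection.
rng = np.random.default_rng(2)
V, T, N = 64, 1000, 16
phi = rng.standard_normal((T, V)) @ rng.standard_normal((V, V))

C = phi.T @ phi / T
evals, psi = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
sigma, psi = np.sqrt(evals[order]), psi[:, order]

P = psi[:, :N] @ psi[:, :N].T                          # projector P_N (9)
O = phi[0] @ (psi[:, :N] / sigma[:N]) @ psi[:, :N].T   # band-limited output

def interpolate(O, M):
    """Eq. (10): reconstruct O(x) from its samples on the support M."""
    Q = P[np.ix_(M, M)]                  # restriction of P to the support
    return O[M] @ np.linalg.solve(Q, P[M, :])

# Greedy support growth: add the point with the largest pointwise residual.
M, errs = [], []
O_rec = np.zeros(V)
for _ in range(N):
    M.append(int(np.argmax((O - O_rec) ** 2)))
    O_rec = interpolate(O, M)
    errs.append(float(np.sum((O - O_rec) ** 2)))

# At critical sampling (|M| = N) the reconstruction is exact,
# and the residual error has decreased along the way.
assert np.allclose(O_rec, O, atol=1e-6)
assert errs[0] >= errs[-1]
```

Because (10) is an orthogonal projection in coefficient space onto the rows of Ψ_N restricted to M, the residual error is non-increasing as the support grows, and it vanishes once |M| reaches the band width N.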
\n3 This type of sparseness is not to be confused with \"high kurtosis of the output distribution;\" in LFA, the non-active units are completely shut down, rather than \"only weakly activated.\"\n\nFigure 4: Efficiency of the sparse allocation of resources. (A): the locations of the first 25 active units, M^{(25)}, of the sparsification with N = 220, n = σ_400 (see Fig. 3), of the example in Fig. 1 and in (C), are overlayed on φ(x) and numbered sequentially. (B): the locations of the active units in M^{(64)} are overlayed on O(x). For φ(x) in (C) (cf. Fig. 1), reconstructions with a fixed dimensionality, 64, of its deviation from the typical face (ψ_1 in Fig. 2), are shown in the top row of (D, E, F), and the respective errors, in the bottom row. (D): reconstruction from the sparsification {O(x_m)}_{x_m ∈ M} (10) with M = M^{(64)} from (B). (E): reconstruction from the first 64 global coefficients (3), N = 64. (F): reconstruction from a subsampling of φ(x) on a regular 8 x 8 grid (64 samples). The errors in (D) and (E) are magnified 5x; in (F), 1x.\n\n[Figure 5 plots: (A) entropy vs. number of reconstruction terms; (B) residual information fraction vs. ratio of sparse and global dimensionalities.]\n\nFigure 5: The relationship between the dimensionalities of the global and the local factorial codes. The entropy of the KLT reconstruction (8) for the out-of-sample example (cf. Fig. 1) is plotted in (A) with a solid line as a function of the global dimensionality, N. The entropies of the LFA reconstructions (10) are shown with dashed lines parametrically of the number of active units |M| for N ∈ {600, 450, 300, 220, 110, 64, 32}, from top to bottom, respectively. 
The ratios of the residual, ||O^{err}||^2, and the total, ||O||^2 (8), information are plotted in (B) with dashed lines parametrically of |M|/N, for the same values of N; a true exponential dependence is plotted with a solid line.\n\nThis is an interesting observation. Although the global code is optimal in the amount of captured energy, the greedy sparsification optimizes the amount of captured information, which has been shown to be the biologically relevant measure, at least in the retinal case (Atick and Redlich, 1992). In order to quantify the relationship between the local dimensionality of the representation and the amount of information it captures, rate-distortion curves are shown in Fig. 5. As expected (4), each degree of freedom in the global code contributes approximately equally to the information content. On the other hand, the first few local terms in (10) pull off a sizeable fraction of the total information, with only a modest increase thereafter (Fig. 5A). In all regimes for N, the residual information decreases approximately exponentially with increasing dimensionality ratio |M|/N (Fig. 5B); 90% of the information is contained in a representation with local dimensionality 25%-30% of the respective global one; 99%, with 45%-50%. This exponential decrease has been shown to be incompatible with the expectation based on the Gaussian (4), or any other spherical, assumption (Penev, 1999). Hence, the LFA representation, by learning the building blocks of natural objects (the local features) reduces not only redundancy, but also dimensionality. Because LFA captures aspects of the sparse, non-Gaussian structure of natural-object ensembles, it preserves practically all of the information, while allocating resources substantially below the Nyquist sampling rate. 
\n6 Discussion\n\nHere we have shown that, for ensembles of natural objects with low-dimensional global factorial representations, sparsification of the local information density allows undersampling, which results in a substantial additional dimensionality reduction. Although more general ensembles, such as those of natural scenes and natural sound, have full-dimensional global representations, the sensory processing of both visual and auditory signals happens in a multi-scale, bandpass fashion. Preliminary results (Penev and Iordanov, 1999) suggest that sparsification within the subbands is possible beyond the respective Nyquist rate; hence, when the sparse dimensionalities of the subbands are added together, the result is aggregate dimensionality reduction, already at the initial stages of sensory processing.\n\nAcknowledgments\n\nThe major part of this research was made possible by the William O. Baker Fellowship, so generously extended to, and gratefully accepted by, the author. He is also indebted to M. J. Feigenbaum for his hospitality and support; to MJF and A. J. Libchaber, for the encouraging and enlightening discussions, scientific and otherwise; to R. M. Shapley, for asking the questions that led to Fig. 5; to J. J. Atick, B. W. Knight, A. G. Dimitrov, L. Sirovich, J. D. Victor, E. Kaplan, L. G. Iordanov, E. P. Simoncelli, G. N. Reeke, J. E. Cohen, B. Klejn, A. Oppenheim, and A. P. Blicher for fruitful discussions.\n\nReferences\n\nAtick, J. J. and A. N. Redlich (1992). What does the retina know about natural scenes? Neural Comput. 4(2), 196-210.\n\nBarlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication, pp. 217-234. Cambridge, MA: M.I.T. Press.\n\nBarlow, H. B., T. P. Kaushal, and G. J. Mitchison (1989). Finding minimum entropy codes. Neural Computation 1(3), 412-423.\n\nJaynes, E. T. (1982). 
On the rationale of maximum-entropy methods. Proc. IEEE 70, 939-952.\n\nLee, D. D. and H. S. Seung (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788-791.\n\nLinsker, R. (1988). Self-organization in a perceptual network. Computer 21, 105-117.\n\nMel, B. W. (1999). Think positive to find parts. Nature 401(6755), 759-760.\n\nMoghaddam, B. and A. Pentland (1997). Probabilistic visual learning for object representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7), 696-710.\n\nPenev, P. S. (1998). Local Feature Analysis: A Statistical Theory for Information Representation and Transmission. Ph.D. thesis, The Rockefeller University, New York, NY. Available at http://venezia.rockefeller.edu/penev/thesis/.\n\nPenev, P. S. (1999). Dimensionality reduction by sparsification in a local-features representation of human faces. Technical report, The Rockefeller University. ftp://venezia.rockefeller.edu/pubs/PenevPS-NIPS99-reduce.ps.\n\nPenev, P. S. and J. J. Atick (1996). Local Feature Analysis: A general statistical theory for object representation. Network: Comput. Neural Syst. 7(3), 477-500.\n\nPenev, P. S. and L. G. Iordanov (1999). Local Feature Analysis: A flexible statistical framework for dimensionality reduction by sparsification of naturalistic sound. Technical report, The Rockefeller University. ftp://venezia.rockefeller.edu/pubs/PenevPS-ICASSP2000-sparse.ps.\n\nPenev, P. S. and L. Sirovich (2000). The global dimensionality of face space. In Proc. 4th Int'l Conf. Automatic Face and Gesture Recognition, Grenoble, France, pp. 264-270. IEEE CS.\n\nPoggio, T. and F. Girosi (1998). A sparse representation for function approximation. Neural Comput. 10(6), 1445-1454.\n\nRuderman, D. L. and W. Bialek (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Lett. 73(6), 814-817.\n\nShannon, C. E. (1948). 
A mathematical theory of communication. Bell System Tech. J. 27, 379-423, 623-656.\n\nSirovich, L. and M. Kirby (1987). Low-dimensional procedure for the characterization of human faces. J. Opt. Soc. Am. A 4, 519-524.\n", "award": [], "sourceid": 1792, "authors": [{"given_name": "Penio", "family_name": "Penev", "institution": null}]}