{"title": "Temporal Coherence, Natural Image Sequences, and the Visual Cortex", "book": "Advances in Neural Information Processing Systems", "page_first": 157, "page_last": 164, "abstract": null, "full_text": "Temporal Coherence, Natural Image Sequences,\n\nand the Visual Cortex\n\nJarmo Hurri and Aapo Hyv\u00e4rinen\nNeural Networks Research Centre\nHelsinki University of Technology\nP.O.Box 9800, 02015 HUT, Finland\n{jarmo.hurri,aapo.hyvarinen}@hut.\ufb01\n\nAbstract\n\nWe show that two important properties of the primary visual cortex\nemerge when the principle of temporal coherence is applied to natural\nimage sequences. The properties are simple-cell-like receptive \ufb01elds and\ncomplex-cell-like pooling of simple cell outputs, which emerge when\nwe apply two different approaches to temporal coherence. In the \ufb01rst\napproach we extract receptive \ufb01elds whose outputs are as temporally co-\nherent as possible. This approach yields simple-cell-like receptive \ufb01elds\n(oriented, localized, multiscale). Thus, temporal coherence is an alterna-\ntive to sparse coding in modeling the emergence of simple cell receptive\n\ufb01elds. The second approach is based on a two-layer statistical generative\nmodel of natural image sequences. In addition to modeling the temporal\ncoherence of individual simple cells, this model includes inter-cell tem-\nporal dependencies. Estimation of this model from natural data yields\nboth simple-cell-like receptive \ufb01elds, and complex-cell-like pooling of\nsimple cell outputs. 
In this completely unsupervised learning, both layers of the generative model are estimated simultaneously from scratch. This is a significant improvement on earlier statistical models of early vision, in which only one layer has been learned, and the other has been fixed a priori.

1 Introduction

The functional role of simple and complex cells has puzzled scientists since their response properties were first mapped by Hubel and Wiesel in the 1950s (see, e.g., [1]). The current view of the functionality of sensory neural networks emphasizes learning and the relationship between the structure of the cells and the statistical properties of the information they process (see, e.g., [2]). In 1996 a major advance was achieved when Olshausen and Field showed that simple-cell-like receptive fields emerge when sparse coding is applied to natural image data [3]. Similar results were obtained with independent component analysis shortly thereafter [4]. In the case of image data, independent component analysis is closely related to sparse coding [5].

In this paper we show that a principle called temporal coherence [6, 7, 8, 9] leads to the emergence of major properties of the primary visual cortex from natural image sequences. Temporal coherence is based on the idea that when processing temporal input, the representation should change as little as possible over time. Several authors have demonstrated the usefulness of this principle using simulated data (see, e.g., [6, 7]).

We apply the principle of temporal coherence to natural input, at the level of early vision, in two different ways. 
In the first approach we show that when the input consists of natural image sequences, the maximization of the temporal response strength correlation of cell outputs leads to receptive fields which are similar to simple cell receptive fields. These results show that temporal coherence is an alternative to sparse coding, in that both result in the emergence of simple-cell-like receptive fields from natural input data. Whereas earlier research has focused on establishing a link between temporal coherence and complex cells, our results demonstrate that such a connection exists even at the simple cell level. We will also show how this approach can be interpreted as estimation of a linear latent variable model in which the latent signals have varying variances.

In the second approach we use the principle of temporal coherence to formulate a two-layer generative model of natural image sequences. In addition to single-cell temporal coherence, this model also captures inter-cell temporal dependencies. We show that when this model is estimated from natural image sequence data, the results include both simple-cell-like receptive fields and complex-cell-like pooling of simple cell outputs. Whereas in earlier research learning two-layer statistical models of early vision has required fixing one of the layers beforehand, in our model both layers are learned simultaneously.

2 Simple-cell-like receptive fields are temporally coherent features

Our first approach to modeling temporal coherence in natural image sequences can be interpreted either as maximization of the temporal coherence of cell outputs, or as estimation of a latent variable model in which the underlying variables have a certain kind of time structure. 
This situation is analogous to sparse coding, because measures of sparseness can also be used to estimate linear generative models with non-Gaussian independent sources [5]. We first describe our measure of temporal coherence, and then provide the link to latent variable models.

In this paper we restrict ourselves to linear spatial models of simple cells. Linear simple cell models are commonly used in studies concerning the connections between visual input statistics and simple cell receptive fields [3, 4]. (Non-negative and spatiotemporal extensions of this basic framework are discussed in [10].) The linear spatial model uses a set of spatial filters (vectors) w_1, ..., w_K to relate input to output. Let the signal vector x(t) denote the input of the system at time t. Image patches can be vectorized by scanning images column-wise into vectors; for windows of size N × N this yields vectors of dimension N². The output of the kth filter at time t, denoted by the signal y_k(t), is given by y_k(t) = w_k^T x(t). Let the matrix W = [w_1 ⋯ w_K]^T contain all the filters as rows. Then the input-output relationship can be expressed in vector form by

y(t) = Wx(t),    (1)

where the signal vector y(t) = [y_1(t) ⋯ y_K(t)]^T.

Temporal response strength correlation, the objective function, is defined by

f(W) = Σ_{k=1}^{K} E_t{ g(y_k(t)) g(y_k(t − Δt)) },    (2)

where the nonlinearity g is strictly convex, even (rectifying), and differentiable. The symbol Δt denotes a delay in time. 
The nonlinearity g measures the strength (amplitude) of the response of the filter, and emphasizes large responses over small ones (see [10] for additional discussion).

[Figure 1: Illustration of nonstationarity of variance. (A) A temporally uncorrelated signal y(t) with nonstationary variance. (B) Plot of y²(t).]

Examples of choices for this nonlinearity are g_1(α) = α², which measures the energy of the response, and g_2(α) = ln cosh α, which is a robustified version of g_1. A set of filters which has a large temporal response strength correlation is such that the same filters often respond strongly at consecutive time points, outputting large (either positive or negative) values. This means that the same filters will respond strongly over short periods of time, thereby expressing the temporal coherence of a population code. A detailed discussion of the difference between temporal response strength correlation and sparseness, including several control experiments, can be found in [10].

To keep the outputs of the filters bounded we enforce a unit variance constraint on each of the output signals y_k(t). Additional constraints are needed to keep the filters from converging to the same solution; we force their outputs to be uncorrelated. A gradient projection method can be used to maximize (2) under these constraints. The initial value of W is selected randomly. See [10] for details.

The interpretation of the maximization of objective function (2) as estimation of a generative model is based on the concept of sources with nonstationary variances [11, 12]. 
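As an illustration, objective (2) can be computed directly from a filter matrix and a patch sequence. The following is a minimal NumPy sketch with hypothetical filters and toy input (not the authors' implementation or data); g is a numerically stable form of the ln cosh nonlinearity mentioned in the text:

```python
import numpy as np

def g(a):
    # Numerically stable ln cosh: an even, rectifying measure of
    # response strength that emphasizes large responses.
    return np.abs(a) + np.log1p(np.exp(-2.0 * np.abs(a))) - np.log(2.0)

def temporal_response_strength_correlation(W, X, dt=1):
    """Objective (2): sum over filters k of E_t{ g(y_k(t)) g(y_k(t - dt)) },
    where y(t) = W x(t).  W is (K, D); X holds the vectorized patches x(t)
    as columns, shape (D, T); dt is the delay in frames."""
    Y = W @ X  # filter outputs y(t), shape (K, T)
    return float(np.sum(np.mean(g(Y[:, dt:]) * g(Y[:, :-dt]), axis=1)))

# Toy check: temporally smooth input should score higher than the same
# input with its time order shuffled, for the same (random) filters.
rng = np.random.default_rng(0)
D, T, K = 16, 2000, 4
X_smooth = np.cumsum(rng.standard_normal((D, T)), axis=1)  # correlated in time
W = rng.standard_normal((K, D))
f_smooth = temporal_response_strength_correlation(W, X_smooth)
f_shuffled = temporal_response_strength_correlation(W, X_smooth[:, rng.permutation(T)])
```

In the actual model, W would additionally be optimized by gradient projection under the unit variance and uncorrelatedness constraints described in the text.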
The linear generative model for x(t), the counterpart of equation (1), is similar to the one in [13, 3]:

x(t) = Ay(t).    (3)

Here A = [a_1 ⋯ a_K] denotes a matrix which relates the image patch x(t) to the activities of the simple cells, so that each column a_k, k = 1, ..., K, gives the feature that is coded by the corresponding simple cell. The dimension of x(t) is typically larger than the dimension of y(t), so that (1) is generally not invertible but an underdetermined set of linear equations. A one-to-one correspondence between W and A can be established by computing the pseudoinverse solution A = W^T(WW^T)^{−1}.

The nonstationarity of the variances of the sources y(t) means that their variances change over time, and that the variance of a signal is correlated at nearby time points. An example of a signal with nonstationary variance is shown in Figure 1. It can be shown [12] that optimization of a cumulant-based criterion, similar to equation (2), can separate independent sources with nonstationary variances. Thus, the maximization of the objective function can also be interpreted as estimation of generative models in which the activity levels of the sources vary over time and are temporally correlated. As was noted above, this situation is analogous to the application of measures of sparseness to estimate linear generative models with non-Gaussian sources.

The algorithm was applied to natural image sequence data, which was sampled from a subset of the image sequences used in [14]. The number of samples was 200,000, Δt was 40 ms, and the sampled image patches were of size 16×16 pixels. Preprocessing consisted of temporal decorrelation, subtraction of the local mean, and normalization [10], and dimensionality reduction from 256 to 160 using principal component analysis [5] (this degree of reduction

Figure 2: Basis vectors estimated using the principle of temporal coherence. 
The vectors were estimated from natural image sequences by optimizing the temporal response strength correlation (2) under unit energy and uncorrelatedness constraints (here the nonlinearity is g(α) = ln cosh α). The basis vectors have been ordered according to E_t{ g(y_k(t)) g(y_k(t − Δt)) }, that is, according to their "contribution" to the final objective value (vectors with the largest values at top left).

retains 95% of the signal energy).

Figure 2 shows the basis vectors (columns of matrix A) which emerge when temporal response strength correlation is maximized for this data. The basis vectors are oriented, localized, and have multiple scales. These are the main features of simple cell receptive fields [1]. A quantitative analysis, showing that the resulting receptive fields are similar to those obtained using sparse coding, can be found in [10], where the details of the experiments are also described.

3 Inter-cell temporal dependencies yield simple cell output pooling

3.1 Model

Temporal response strength correlation, equation (2), measures the temporal coherence of individual simple cells. In terms of the generative model described above, this means that the nonstationary variances of the different y_k(t)'s have no interdependencies. In this section we add another layer to the generative model presented above to extend the theory to simple cell interactions, and to the level of complex cells.

As in the generative model described at the end of the previous section, the output layer of the model (see Figure 3) is linear, and maps signed cell responses to image features. But in contrast to the previous section, or to models used in independent component analysis [5] or basic sparse coding [3], we do not assume that the components of y(t) are independent. Instead, we model the dependencies between these components with a multivariate autoregressive model in the first layer of our model. 
Let abs(y(t)) = [|y_1(t)| ⋯ |y_K(t)|]^T, let v(t) denote a driving noise signal, and let M denote a K × K matrix. Our model is a multidimensional first-order autoregressive process, defined by

abs(y(t)) = M abs(y(t − Δt)) + v(t).    (4)

As in independent component analysis, we also need to fix the scale of the latent variables, by defining E_t{y_k²(t)} = 1 for k = 1, ..., K.

[Figure 3: The two layers of the generative model. Let abs(y(t)) = [|y_1(t)| ⋯ |y_K(t)|]^T denote the amplitudes of the simple cell responses. In the first layer, the driving noise signal v(t) generates the amplitudes of the simple cell responses via the autoregressive model abs(y(t)) = M abs(y(t − Δt)) + v(t). The signs of the responses are generated randomly between the first and second layer to yield the signed responses y(t). In the second layer, natural video x(t) is generated linearly from the simple cell responses by x(t) = Ay(t). In addition to the relations shown here, the generation of v(t) is affected by M abs(y(t − Δt)) to ensure the non-negativity of abs(y(t)). See text for details.]

There are dependencies between the driving noise v(t) and the output strengths abs(y(t)), caused by the non-negativity of abs(y(t)). To take these dependencies into account, we use the following formalism. Let u(t) denote a random vector with components which are statistically independent of each other. We define v(t) = max(−M abs(y(t − Δt)), u(t)), where, for vectors a and b, max(a, b) = [max(a_1, b_1) ⋯ max(a_n, b_n)]^T. We assume that u(t) and abs(y(t)) are uncorrelated.

To make the generative model complete, a mechanism for generating the signs of the cell responses y(t) must be included. 
We specify that the signs are generated randomly, with equal probability for plus or minus, after the strengths of the responses have been generated. Note that one consequence of this is that the different y_k(t)'s are uncorrelated. In the estimation of the model this uncorrelatedness property is used as a constraint. When this is combined with the unit variance (scale) constraints described above, the resulting set of constraints is the same as in the approach described in Section 2.

In equation (4), a large positive matrix element M(i, j), or M(j, i), indicates that there is strong temporal coherence between the output strengths of cells i and j. Thinking in terms of grouping temporally coherent cells together, matrix M can be thought of as containing similarities (reciprocals of distances) between different cells. We will use this property in the experimental section to derive a topography of simple cell receptive fields from M.

3.2 Estimation of the model

To estimate the model defined above we need to estimate both M and W (the pseudoinverse of A). We first show how to estimate M, given W. We then describe an objective function which can be used to estimate W, given M. Each iteration of the estimation algorithm consists of two steps. During the first step M is updated while W is kept constant; during the second step these roles are reversed.

First, regarding the estimation of M, consider a situation in which W is kept constant. 
It can be shown that M can be estimated by an approximative method of moments, and that the estimate is given by

M ≈ β E_t{ (abs(y(t)) − E_t{abs(y(t))}) (abs(y(t − Δt)) − E_t{abs(y(t))})^T }
    × E_t{ (abs(y(t)) − E_t{abs(y(t))}) (abs(y(t)) − E_t{abs(y(t))})^T }^{−1},    (5)

where β > 1. Since this multiplier has a constant linear effect in the objective function given below, its value does not change the optima, so we can set β = 1 in the optimization. (Details are given in [15].) The resulting estimator is the same as the optimal least mean squares linear predictor in the case of unconstrained v(t).

The estimation of W is more complicated. A rigorous derivation of an objective function based on well-known estimation principles is very difficult, because the statistics involved are non-Gaussian, and the processes have difficult interdependencies. Therefore, instead of deriving an objective function from first principles, we derived an objective function heuristically, and verified through simulations that the objective function is capable of estimating the two-layer model. The objective function is a weighted sum of the covariances of the filter output strengths at times t − Δt and t, defined by

f(W, M) = Σ_{i=1}^{K} Σ_{j=1}^{K} M(i, j) cov{ |y_i(t)|, |y_j(t − Δt)| }.    (6)

In the actual estimation algorithm, W is updated by employing a gradient projection approach to the optimization of (6) under the constraints. The initial value of W is selected randomly.

The fact that the algorithm described above is able to estimate the two-layer model has been verified through extensive simulations (details can be found in [15]).

3.3 Experiments

The estimation algorithm was run on the same data set as in the previous experiment (see Section 2). 
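To make the first layer and its estimation concrete, the following NumPy sketch (hypothetical coupling matrix and noise, not the authors' implementation) simulates the autoregressive amplitude process (4), including the random signs and the non-negativity-preserving noise v(t) = max(−M abs(y(t − Δt)), u(t)), and then applies the moment-based estimate (5) with β = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 2, 50000
M_true = np.array([[0.8, 0.0],        # hypothetical coupling matrix:
                   [0.0, 0.4]])       # cell 0 more temporally coherent than cell 1

# First layer (4): abs(y(t)) = M abs(y(t-1)) + v(t), with the driving noise
# v(t) = max(-M abs(y(t-1)), u(t)), which keeps the amplitudes non-negative.
amp = np.zeros((K, T))
for t in range(1, T):
    drive = M_true @ amp[:, t - 1]
    u = rng.standard_normal(K)                  # independent noise u(t)
    amp[:, t] = drive + np.maximum(-drive, u)   # equals max(drive + u, 0)

# Signs are drawn with equal probability, so the different y_k(t) are uncorrelated.
y = rng.choice([-1.0, 1.0], size=(K, T)) * amp

# Moment-based estimate (5) with beta = 1: lagged covariance of abs(y(t))
# times the inverse of its instantaneous covariance.
S = np.abs(y)
Sc = S - S.mean(axis=1, keepdims=True)
C_lag = Sc[:, 1:] @ Sc[:, :-1].T / (T - 1)
C_0 = Sc @ Sc.T / T
M_est = C_lag @ np.linalg.inv(C_0)
```

Because v(t) is constrained to keep abs(y(t)) non-negative, (5) is only approximate (hence the β > 1 factor in the text), but the structure of M survives: the strongly self-coupled first cell gets a clearly larger diagonal entry than the second, and the off-diagonal entries stay near zero.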
The extracted matrices A and M can be visualized simultaneously by using the interpretation of M as a similarity matrix (see Section 3.1). Figure 4 illustrates the basis vectors – that is, the columns of A – laid out at spatial coordinates derived from M in a way explained below. The resulting basis vectors are again oriented, localized and multiscale, as in the previous experiment.

The two-dimensional coordinates of the basis vectors were determined from M using multidimensional scaling (see the figure caption for details). The temporal coherence between the outputs of two cells i and j is reflected in the distance between the corresponding receptive fields: the larger the elements M(i, j) and M(j, i) are, the closer the receptive fields are to each other. We can see that a local topography emerges in the results: those basis vectors which are close to each other seem to be mostly coding for similarly oriented features at nearby spatial positions. This kind of grouping is characteristic of the pooling of simple cell outputs at the complex cell level [1].¹

Thus, the estimation of our two-layer model from natural image sequences yields both simple-cell-like receptive fields, and a grouping similar to the pooling of simple cell outputs. Linear receptive fields emerge in the second layer (matrix A), and cell output grouping emerges in the first layer (matrix M). Both of these layers are estimated simultaneously. This is a significant improvement on earlier statistical models of early vision, because no a priori fixing of either of these layers is needed.

4 Conclusions

We have shown in this paper that when the principle of temporal coherence is applied to natural image sequences, both simple-cell-like receptive fields and complex-cell-like pooling of simple cell outputs emerge. 
[Figure 4: Results of estimating the two-layer generative model from natural image sequences. Basis vectors (columns of A) are plotted at spatial coordinates given by applying multidimensional scaling to M. Matrix M was first converted to a non-negative similarity matrix Ms by subtracting min_{i,j} M(i, j) from each of its elements, and by setting each of the diagonal elements to the value 1. Multidimensional scaling was then applied to Ms by interpreting the entries Ms(i, j) and Ms(j, i) as similarity measures between cells i and j. Some of the resulting coordinates were very close to each other, so tight cell clusters were magnified for purposes of visual display. Details are given in [15].]

¹ Some global topography also emerges: those basis vectors which code for horizontal features are on the left in the figure, while those that code for vertical features are on the right.

These results were obtained with two different approaches to temporal coherence. The first used temporally coherent simple cell outputs, and the second was based on a temporal two-layer generative model of natural image sequences. Simple-cell-like receptive fields emerge in both cases, and the output pooling emerges as a local topographic property in the case of the two-layer generative model.

These results are important for two reasons. First, to our knowledge this is the first time that localized and oriented receptive fields with different scales have been shown to emerge from natural data using the principle of temporal coherence. In some models of invariant visual representations [8, 16] simple cell receptive fields are obtained as by-products, but learning is strongly modulated by complex cells, and the receptive fields seem to lack the important properties of spatial localization and multiresolution. 
Second, in earlier research\non statistical models of early vision, learning two-layer models has required a priori \ufb01xing\nof one of the layers. This is not needed in our two-layer model, because both layers emerge\nsimultaneously in a completely unsupervised manner from the natural input data.\n\nReferences\n[1] Stephen E. Palmer. Vision Science \u2013 Photons to Phenomenology. The MIT Press, 1999.\n[2] Eero P. Simoncelli and Bruno A. Olshausen. Natural image statistics and neural representation.\n\nAnnual Review of Neuroscience, 24:1193\u20131216, 2001.\n\n[3] Bruno A. Olshausen and David Field. Emergence of simple-cell receptive \ufb01eld properties by\n\nlearning a sparse code for natural images. Nature, 381(6583):607\u2013609, 1996.\n\n[4] Anthony Bell and Terrence J. Sejnowski. The independent components of natural scenes are\n\nedge \ufb01lters. Vision Research, 37(23):3327\u20133338, 1997.\n\n[5] Aapo Hyv\u00e4rinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis. John Wiley\n\n& Sons, 2001.\n\n[6] Peter F\u00f6ldi\u00e1k. Learning invariance from transformation sequences. Neural Computation,\n\n3(2):194\u2013200, 1991.\n\n[7] James Stone. Learning visual parameters using spatiotemporal smoothness constraints. Neural\n\nComputation, 8(7):1463\u20131492, 1996.\n\n[8] Christoph Kayser, Wolfgang Einh\u00e4user, Olaf D\u00fcmmer, Peter K\u00f6nig, and Konrad K\u00f6rding. Ex-\ntracting slow subspaces from natural videos leads to complex cells. In Georg Dorffner, Horst\nBischof, and Kurt Hornik, editors, Arti\ufb01cial Neural Networks \u2013 ICANN 2001, volume 2130 of\nLecture notes in computer science, pages 1075\u20131080. Springer, 2001.\n\n[9] Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of\n\ninvariances. Neural Computation, 14(4):715\u2013770, 2002.\n\n[10] Jarmo Hurri and Aapo Hyv\u00e4rinen. Simple-cell-like receptive \ufb01elds maximize temporal coher-\n\nence in natural video. 
Neural Computation, 2003. In press.\n\n[11] Kiyotoshi Matsuoka, Masahiro Ohya, and Mitsuru Kawamoto. A neural net for blind separation\n\nof nonstationary signals. Neural Networks, 8(3):411\u2013419, 1995.\n\n[12] Aapo Hyv\u00e4rinen. Blind source separation by nonstationarity of variance: A cumulant-based\n\napproach. IEEE Transactions on Neural Networks, 12(6):1471\u20131474, 2001.\n\n[13] Aapo Hyv\u00e4rinen and Patrik O. Hoyer. A two-layer sparse coding model learns simple and com-\nplex cell receptive \ufb01elds and topography from natural images. Vision Research, 41(18):2413\u2013\n2423, 2001.\n\n[14] J. Hans van Hateren and Dan L. Ruderman. Independent component analysis of natural im-\nage sequences yields spatio-temporal \ufb01lters similar to simple cells in primary visual cortex.\nProceedings of the Royal Society of London B, 265(1412):2315\u20132320, 1998.\n\n[15] Jarmo Hurri and Aapo Hyv\u00e4rinen. A two-layer dynamic generative model of natural image\n\nsequences. Submitted.\n\n[16] Teuvo Kohonen, Samuel Kaski, and Harri Lappalainen. Self-organized formation of various\ninvariant-feature \ufb01lters in the adaptive-subspace SOM. Neural Computation, 9(6):1321\u20131344,\n1997.\n\n\f", "award": [], "sourceid": 2184, "authors": [{"given_name": "Jarmo", "family_name": "Hurri", "institution": null}, {"given_name": "Aapo", "family_name": "Hyv\u00e4rinen", "institution": null}]}