{"title": "Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images", "book": "Advances in Neural Information Processing Systems", "page_first": 838, "page_last": 844, "abstract": null, "full_text": "Emergence of movement sensitive neurons' properties by learning a sparse code for natural moving images\n\nRafal Bogacz\nDept. of Computer Science\nUniversity of Bristol\nBristol BS8 1UB, U.K.\nR.Bogacz@bristol.ac.uk\n\nMalcolm W. Brown\nDept. of Anatomy\nUniversity of Bristol\nBristol BS8 1TD, U.K.\nM.W.Brown@bristol.ac.uk\n\nChristophe Giraud-Carrier\nDept. of Computer Science\nUniversity of Bristol\nBristol BS8 1UB, U.K.\ncgc@cs.bris.ac.uk\n\nAbstract\n\nOlshausen & Field demonstrated that a learning algorithm that attempts to generate a sparse code for natural scenes develops a complete family of localised, oriented, bandpass receptive fields, similar to those of 'simple cells' in V1. This paper describes an algorithm which finds a sparse code for sequences of images that preserves information about the input. When trained on natural video sequences, this algorithm develops bases representing movement in particular directions with particular speeds, similar to the receptive fields of the movement-sensitive cells observed in cortical visual areas. Furthermore, in contrast to previous approaches to learning direction selectivity, the timing of neuronal activity encodes the phase of the movement, so the precise timing of spikes is crucially important to the information encoding.\n\n1 Introduction\n\nIt was suggested by Barlow [3] that the goal of early sensory processing is to reduce redundancy in sensory information, so that the activity of sensory neurons encodes independent features. Neural modelling can give some insight into how these neural nets may learn and operate. 
Atick & Redlich [1] showed that training a neural network on patches of natural images, aiming to remove pair-wise correlation between neuronal responses, results in neurons having centre-surround receptive fields resembling those of retinal ganglion neurons. Olshausen & Field [11,12] demonstrated that a learning algorithm that attempts to generate a sparse code for natural scenes, while preserving information about the visual input, develops a complete family of localised, oriented, bandpass receptive fields, similar to those of simple cells in V1. The activities of the neurons implementing this coding signal the presence of edges, which are basic components of natural images. Olshausen & Field chose their algorithm to create a sparse representation because such a representation possesses a higher degree of statistical independence among its outputs [11]. Similar receptive fields were also obtained by training a neural net so as to make the responses of neurons as independent as possible [4]. Other authors [14,16,5] have shown that direction selectivity of the simple cells may also emerge from unsupervised learning. However, there is no agreed account of how the receptive fields of neurons that encode movement are created.\n\nThis paper describes an algorithm which finds a sparse code for sequences of images that preserves the critical information about the input. This algorithm, trained on natural video images, develops bases representing movements in particular directions at particular speeds, similar to the receptive fields of the movement-sensitive cells observed in early visual areas [9,2]. The activities of the neurons implementing this encoding signal the presence of edges moving with certain speeds in certain directions, with each neuron having its preferred speed and direction. 
Furthermore, in contrast to all the previous approaches, the timing of neural activity encodes the movement's phase, so the precise timing of spikes is crucially important for information coding.\n\nThe proposed algorithm is an extension of the one proposed by Olshausen & Field. Hence it is a high-level algorithm, which cannot be directly implemented in a biologically plausible neural network. However, a plausible neural network performing a similar task can be developed. The proposed algorithm is described in Section 2. Sections 3 and 4 present the methods and the results of simulations. Finally, Section 5 discusses how the algorithm differs from previous approaches, and the implications of the presented results.\n\n2 Description of the algorithm\n\nSince the proposed algorithm is an extension of the one described by Olshausen & Field [11,12], this section starts with a brief introduction to the main ideas of their algorithm. They assume that an image x can be represented in terms of a linear superposition of basis functions A_i. For clarity of notation, let us represent both images and bases as vectors created by concatenating rows of pixels, as shown in Figure 1, with each number in the vector describing the brightness of the corresponding pixel. Let the basis functions A_i form the columns of a matrix A, and let the weightings of the above-mentioned linear superposition (which change from one image to the next) be given by a vector s:\n\nx = As    (1)\n\nThe image x may be encoded, for example, using the inverse transformation where it exists. Hence, the image code s is determined by the choice of basis functions A_i. Olshausen & Field [11,12] try to find bases that result in a code s that preserves information about the original image x and that is sparse. 
Therefore, they minimise the following cost function with respect to A, where \lambda denotes a constant determining the importance of sparseness [11]:\n\nE = -[preserved information in s about x] - \lambda [sparseness of s]    (2)\n\nThe algorithm proposed in this paper is similar, but it takes into consideration the temporal order of images. Let us divide time into intervals (to be able to treat it as discrete) and denote the image observed at time t and the code generated for it by x^t and s^t, respectively. The Olshausen & Field algorithm assumes that an image x is a linear superposition (mixture) of s. By contrast, our algorithm assumes that images are convolved mixtures of s, i.e., s^t depends not only on x^t but also on x^{t-1}, x^{t-2}, ..., x^{t-(T-1)} (i.e., s^t depends on the T most recent images). Therefore, each basis function may also be represented as a sequence of vectors A_i^0, A_i^1, ..., A_i^{T-1} (corresponding to a sequence of images). These vectors form the columns of the mixing matrices A^0, A^1, ..., A^{T-1}. Each coefficient s_i^t describes how strongly the basis function A_i is present in the last T images. This relationship is illustrated in Figure 2 and is expressed by Equation 3.\n\nFigure 1: Representing images as vectors.\n\nFigure 2: Encoding of an image sequence. In the example, there are two basis functions, each described by T = 3 vectors. The first basis encodes movement to the right, the second encodes movement down. A sequence x of 6 images is shown on the top and the corresponding code s below. A \"spike\" over a coefficient s_i^t denotes that s_i^t = 1; the absence of a \"spike\" denotes s_i^t = 0. 
\n\nx^t = \sum_{\tau=0}^{T-1} A^\tau s^{t+\tau}    (3)\n\nIn the proposed algorithm, the basis functions A are also found by optimising the cost function of Equation 2. The detailed method of this minimisation is described below; this paragraph gives an overview. In each optimisation step, a sequence x of P image patches is selected from a random position in the video sequence (P \geq 2T). Each optimisation step consists of two operations. Firstly, the sequence of coefficient vectors s which minimises the cost function E for the images x is found. Secondly, the basis matrices A are modified in the direction opposite to the gradient of E over A, thus minimising the cost function. These two operations are repeated for different sequences of image patches.\n\nIn Equation 2, the term \"preserved information in s about x\" expresses how well x may be reconstructed on the basis of s. In particular, it is defined as the negative of the square of the reconstruction error. The reconstruction error is the difference between the original image sequence x and the sequence of images r reconstructed from s. The sequence r may be reconstructed from s in the following way:\n\nr^t = \sum_{\tau=0}^{T-1} A^\tau s^{t+\tau}    (4)\n\nThe precise definition of the cost function is then given by:\n\nE = \sum_{t=T}^{P-T+1} \sum_j (x_j^t - r_j^t)^2 + \lambda \sum_{t=T}^{P} \sum_i C(s_i^t / \sigma)    (5)\n\nIn Equation 5, C is a nonlinear function, and \sigma is a scaling constant. Images at the start and end of the sequence (e.g., x^1, x^P) may share some bases with images not in the sequence (e.g., x^0, x^{-1}, x^{P+1}). To avoid this problem, only the middle images are reconstructed, and only for them is the reconstruction error computed in the cost function. In particular, only images from T to P-T+1 are reconstructed; since the assumed length of the bases is T, those images contain only the bases whose other parts are also contained in the sequence. 
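As a concrete illustration of Equations 4 and 5, the reconstruction and the cost may be sketched in a few lines of NumPy. This is a minimal sketch under our own assumptions: the function names, the array shapes (time-major), and the skipping of coefficients that fall beyond the end of the sequence are illustrative choices, not taken from the authors' Matlab/C++ implementation.

```python
import numpy as np

def reconstruct(A, s, T):
    """Equation 4: r^t = sum_{tau=0}^{T-1} A^tau s^{t+tau}.

    A : (T, n_pixels, n_bases) array -- the mixing matrices A^0 .. A^{T-1}
    s : (P, n_bases) array           -- coefficient vectors s^1 .. s^P
    Returns r of shape (P, n_pixels).  Only the middle entries
    t = T .. P-T+1 (1-based) are used by the cost function below.
    """
    P, _ = s.shape
    r = np.zeros((P, A.shape[1]))
    for t in range(P):
        for tau in range(T):
            if t + tau < P:          # edge images lack some coefficients
                r[t] += A[tau] @ s[t + tau]
    return r

def cost(x, A, s, T, lam, sigma):
    """Equation 5: squared reconstruction error over the middle images
    plus the sparseness penalty C(u) = log(1 + u^2) over t = T .. P."""
    P = s.shape[0]
    r = reconstruct(A, s, T)
    mid = slice(T - 1, P - T + 1)    # 0-based slice for 1-based t = T .. P-T+1
    recon_err = np.sum((x[mid] - r[mid]) ** 2)
    sparseness = np.sum(np.log(1.0 + (s[T - 1:] / sigma) ** 2))
    return recon_err + lam * sparseness
```

Minimising this cost over s (e.g. by conjugate gradients, as in Section 3) corresponds to the first operation of each optimisation step; differentiating it with respect to A gives the second.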
Since only images from T to P-T+1 are reconstructed, it is clear from Equation 4 that only the coefficients s^T to s^P need to be found. These considerations explain the limits of the outer summations in both terms of Equation 5.\n\nFor each image sequence, in the first operation, the coefficients s^T, s^{T+1}, ..., s^P minimising E are found using an optimisation method. Minus the gradient of E over s is given by:\n\n-\frac{\partial E}{\partial s_i^t} = 2 \sum_{\tau=0}^{T-1} \sum_j (x_j^{t-\tau} - r_j^{t-\tau}) A_{j,i}^\tau - \frac{\lambda}{\sigma} C'\!\left(\frac{s_i^t}{\sigma}\right)    (6)\n\nIn the second operation, the bases A are modified so as to minimise E:\n\n\Delta A_{j,i}^\tau = \eta \sum_{t=T}^{P-T+1} (x_j^t - r_j^t) s_i^{t+\tau}    (7)\n\nIn Equation 7, \eta denotes the learning rate. The vector length of each basis function A_i is adapted over time so as to maintain equal variance on each coefficient s_i, in exactly the same way as described in [12].\n\n3 Methods of simulations\n\nThe proposed algorithm was implemented in Matlab, except for finding the s minimising E, which was implemented in C++ using the conjugate gradient method for the sake of speed. In the implementation, the original code of Olshausen & Field was used and modified (downloaded from http://redwood.ucdavis.edu/bruno/sparsenet.html).\n\nMany parameters of the proposed algorithm were taken from [11]. In particular, C(x) = ln(1 + x^2), \sigma is the standard deviation of the pixels' values in the images, \lambda is set such that \lambda/\sigma = 0.14, and \eta = 1. \Delta A is averaged over 100 image sequences, and hence the bases A are updated with the average of \Delta A every 100 optimisation steps. The length of an image sequence P is set such that P = 3T.\n\nThe proposed algorithm was tested on two types of video sequences: 'toy' problems and natural video sequences. Each of the toy sequences consisted of 10 frames of 100x100 pixels. In each sequence, there were 20 moving lines. Each line was either horizontal or vertical and 1 pixel thick. 
Each line was either black or white, which corresponded to positive or negative values of the elements of the x vectors (the grey background corresponded to zero). Each horizontal line moved up or down, and each vertical line left or right, with a speed of one pixel per frame.\n\nThen the algorithm was tested on five natural video sequences showing moving people or animals. In each optimisation step, a sequence of image patches was selected from a randomly chosen video. The video sequences were preprocessed. First, to remove the static aspect of the images, the previous frame was subtracted from each frame, i.e., each image encoded the difference between two successive frames of the video. This simple operation reduces redundancy in the data, since corresponding pixels in successive frames tend to have similar colours. An analogous operation may be performed by the retina, since the ganglion cells typically respond to changes in light intensity [10].\n\nThen, to remove the pair-wise correlation between pixels of the same frame, Zero-phase Component Analysis (ZCA) [4] was applied to each of the patches from the selected sequence, i.e., x^t := W x^t, where W = \langle x x^T \rangle^{-1/2}, i.e., W is equal to the inverse square root of the covariance matrix of x. The filters in W have centre-surround receptive fields resembling those of retinal ganglion neurons [4].\n", "award": [], "sourceid": 1937, "authors": [{"given_name": "Rafal", "family_name": "Bogacz", "institution": null}, {"given_name": "Malcolm", "family_name": "Brown", "institution": null}, {"given_name": "Christophe", "family_name": "Giraud-Carrier", "institution": null}]}
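The ZCA whitening step described in Section 3 can be sketched as follows. This is a hedged sketch: the eigendecomposition route, the mean subtraction, and the regularising `eps` are implementation choices of ours, not details taken from the paper.

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Zero-phase Component Analysis: x := W x with W = C^{-1/2},
    the inverse square root of the pixel covariance matrix C.

    X : (n_patches, n_pixels) array; rows are image patches.
    Returns the whitened patches and the whitening matrix W.
    """
    Xc = X - X.mean(axis=0)                  # zero-mean each pixel
    C = Xc.T @ Xc / Xc.shape[0]              # pixel covariance matrix
    d, E = np.linalg.eigh(C)                 # C = E diag(d) E^T
    W = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T   # C^{-1/2}
    return Xc @ W.T, W
```

After this transform the pixel covariance is (approximately) the identity, and because W is symmetric the filters are zero-phase, giving them the centre-surround character noted above.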