{"title": "Hierarchical Non-linear Factor Analysis and Topographic Maps", "book": "Advances in Neural Information Processing Systems", "page_first": 486, "page_last": 492, "abstract": "", "full_text": "Hierarchical Non-linear Factor Analysis \n\nand Topographic Maps \n\nZoubin Ghahramani and Geoffrey E. Hinton \nDept. of Computer Science, University of Toronto \n\nToronto, Ontario, M5S 3H5, Canada \n\nhttp://www.cs.toronto.edu/neuron/ \n\n{zoubin,hinton}Ocs.toronto.edu \n\nAbstract \n\nWe first describe a hierarchical, generative model that can be \nviewed as a non-linear generalisation of factor analysis and can \nbe implemented in a neural network. The model performs per(cid:173)\nceptual inference in a probabilistically consistent manner by using \ntop-down, bottom-up and lateral connections. These connections \ncan be learned using simple rules that require only locally avail(cid:173)\nable information. We then show how to incorporate lateral con(cid:173)\nnections into the generative model. The model extracts a sparse, \ndistributed, hierarchical representation of depth from simplified \nrandom-dot stereograms and the localised disparity detectors in \nthe first hidden layer form a topographic map. When presented \nwith image patches from natural scenes, the model develops topo(cid:173)\ngraphically organised local feature detectors. \n\n1 \n\nIntroduction \n\nFactor analysis is a probabilistic model for real-valued data which assumes that \nthe data is a linear combination of real-valued uncorrelated Gaussian sources (the \nfactors). After the linear combination, each component of the data vector is also \nassumed to be corrupted by additional Gaussian noise. A major advantage of this \ngenerative model is that, given a data vector, the probability distribution in the \nspace of factors is a multivariate Gaussian whose mean is a linear function of the \ndata. 
It is therefore tractable to compute the posterior distribution exactly and to use it when learning the parameters of the model (the linear combination matrix and noise variances). A major disadvantage is that factor analysis is a linear model that is insensitive to higher order statistical structure of the observed data vectors. \n\nOne way to make factor analysis non-linear is to use a mixture of factor analyser modules, each of which captures a different linear regime in the data [3]. We can view the factors of all of the modules as a large set of basis functions for describing the data and the process of selecting one module then corresponds to selecting an appropriate subset of the basis functions. Since the number of subsets under consideration is only linear in the number of modules, it is still tractable to compute the full posterior distribution when given a data point. Unfortunately, this mixture model is often inadequate. Consider, for example, a typical image that contains multiple objects. To represent the pose and deformation of each object we want a componential representation of the object's parameters which could be obtained from an appropriate factor analyser. But to represent the multiple objects we need several of these componential representations at once, so the pure mixture idea is not tenable. A more powerful non-linear generalisation of factor analysis is to have a large set of factors and to allow any subset of the factors to be selected. This can be achieved by using a generative model in which there is a high probability of generating factor activations of exactly zero. \n\n2 Rectified Gaussian Belief Nets \n\nThe Rectified Gaussian Belief Net (RGBN) uses multiple layers of units with states that are either positive real values or zero [5]. 
Its main disadvantage is that computing the posterior distribution over the factors given a data vector involves Gibbs sampling. In general, Gibbs sampling can be very time consuming, but in practice 10 to 20 samples per unit have proved adequate and there are theoretical reasons for believing that learning can work well even when the Gibbs sampling fails to reach equilibrium [10]. \n\nWe first describe the RGBN without considering neural plausibility. Then we show how lateral interactions within a layer can be used to perform probabilistic inference correctly using locally available information. This makes the RGBN far more plausible as a neural model than a sigmoid belief net [9, 8] because it means that Gibbs sampling can be performed without requiring units in one layer to see the total top-down input to units in the layer below. \n\nThe generative model for RGBNs consists of multiple layers of units each of which has a real-valued unrectified state, y_j, and a rectified state, [y_j]^+, which is zero if y_j is negative and equal to y_j otherwise. This rectification is the only non-linearity in the network.^1 The value of y_j is Gaussian distributed with a standard deviation \sigma_j and a mean, \hat{y}_j, that is determined by the generative bias, g_{0j}, and the combined effects of the rectified states of units, k, in the layer above: \n\n\hat{y}_j = g_{0j} + \sum_k g_{kj} [y_k]^+ \n\n(1) \n\nThe rectified state [y_j]^+ therefore has a Gaussian distribution above zero, but all of the mass of the Gaussian that falls below zero is concentrated in an infinitely dense spike at zero as shown in Fig. 1a. This infinite density creates problems if we attempt to use Gibbs sampling over the rectified states, so, following a suggestion by Radford Neal, we perform Gibbs sampling on the unrectified states. \n\nConsider a unit, j, in some intermediate layer of a multilayer RGBN. Suppose that we fix the unrectified states of all the other units in the net. 
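Ancestral sampling from this generative model is straightforward: each layer's rectified states set the means of the layer below via Eq. 1. A minimal numpy sketch (the layer sizes, weights and noise levels are illustrative assumptions, not values from the paper):

```python
import numpy as np

def rgbn_sample(weights, biases, sigmas, rng):
    # Ancestral sample from a Rectified Gaussian Belief Net, top layer first.
    # weights[l] has shape (n_above, n_below) and holds g_kj; biases and
    # sigmas give g_0j and sigma_j for every layer.
    y = biases[0] + sigmas[0] * rng.standard_normal(biases[0].shape)
    states = [y]
    for W, b, s in zip(weights, biases[1:], sigmas[1:]):
        y_hat = b + np.maximum(y, 0.0) @ W            # Eq. 1: rectified states above
        y = y_hat + s * rng.standard_normal(b.shape)  # local Gaussian noise
        states.append(y)
    return states  # unrectified states; rectify with np.maximum(., 0) as needed

# Hypothetical 3-layer net: 1 top unit, 4 hidden units, 8 visible units.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((1, 4)), rng.standard_normal((4, 8))]
biases = [np.zeros(1), np.zeros(4), np.zeros(8)]
sigmas = [np.ones(1), np.ones(4), 0.1 * np.ones(8)]
states = rgbn_sample(weights, biases, sigmas, rng)
```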
To perform Gibbs sampling, we need to stochastically select a value for y_j according to its distribution given the unrectified states of all the other units. If we think in terms of energy functions, which are equal to negative log probabilities (up to a constant), the rectified states of the units in the layer above contribute a quadratic energy term by determining \hat{y}_j. The unrectified states of units, i, in the layer below contribute a constant if [y_j]^+ is 0, and if [y_j]^+ is positive they each contribute a quadratic term because of the effect of [y_j]^+ on y_i: \n\nE(y_j) = \frac{(y_j - \hat{y}_j)^2}{2\sigma_j^2} + \sum_i \frac{(y_i - \sum_h g_{hi}[y_h]^+)^2}{2\sigma_i^2} \n\n(2) \n\nwhere h is an index over all the units in the same layer as j including j itself. Terms that do not depend on y_j have been omitted from Eq. 2. For values of y_j below zero there is a quadratic energy function which leads to a Gaussian distribution. The same is true for values of y_j above zero, but it is a different quadratic (Fig. 1c). The Gaussian distributions corresponding to the two quadratics must agree at y_j = 0 (Fig. 1b). Because this distribution is piecewise Gaussian it is possible to perform Gibbs sampling exactly. \n\n^1 The key arguments presented in this paper hold for general nonlinear belief networks as long as the noise is Gaussian; they are not specific to the rectification nonlinearity. \n\nFigure 1: a) Probability density in which all the mass of a Gaussian below zero has been replaced by an infinitely dense spike at zero. b) Schematic of the density of a unit's unrectified state. c) Bottom-up and top-down energy functions corresponding to b. \n\nGiven samples from the posterior, the generative weights of a RGBN can be learned by using the online delta rule to maximise the log probability of the data.^2 
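Because the conditional density of y_j is Gaussian on each side of zero, with the two pieces agreeing at zero, the exact Gibbs update can be done by computing the mass of each piece, picking a side, and drawing a truncated Gaussian. A sketch of this step (the parameter values at the bottom are illustrative assumptions):

```python
import numpy as np
from math import erf, sqrt, log

def log_phi(z):
    # log of the standard normal CDF (adequate for moderate z)
    return log(max(0.5 * (1.0 + erf(z / sqrt(2.0))), 1e-300))

def gibbs_step(y_hat, sigma, g, x, sig_below, rng):
    # Exact Gibbs draw of one unrectified state y_j given everything else.
    # y_hat, sigma : top-down mean and std of y_j (from Eq. 1)
    # g[i]         : generative weight from j to child i
    # x[i]         : child i's state minus the top-down input from j's siblings
    # sig_below[i] : child i's local noise std
    # Negative side: children see [y]+ = 0, so only the top-down quadratic acts.
    log_m_neg = log(sigma) + log_phi(-y_hat / sigma)
    # Positive side: complete the square over top-down + bottom-up quadratics.
    # (Constants common to both pieces cancel in the mass ratio.)
    a = 1.0 / sigma**2 + np.sum(g**2 / sig_below**2)
    b = y_hat / sigma**2 + np.sum(g * x / sig_below**2)
    mu_p, sd_p = b / a, 1.0 / sqrt(a)
    log_m_pos = (b**2 / (2 * a) - y_hat**2 / (2 * sigma**2)
                 + log(sd_p) + log_phi(mu_p / sd_p))
    if rng.random() < 1.0 / (1.0 + np.exp(log_m_pos - log_m_neg)):
        while True:                    # truncated Gaussian on (-inf, 0)
            y = y_hat + sigma * rng.standard_normal()
            if y < 0:
                return y
    while True:                        # truncated Gaussian on [0, inf)
        y = mu_p + sd_p * rng.standard_normal()
        if y >= 0:
            return y

rng = np.random.default_rng(1)
g, x = np.array([1.0, -0.5]), np.array([0.8, -0.3])
samples = [gibbs_step(0.2, 1.0, g, x, np.ones(2), rng) for _ in range(2000)]
```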
\n\n\Delta g_{ji} = \epsilon \, [y_j]^+ (y_i - \hat{y}_i) \n\n(3) \n\nThe variance of the local Gaussian noise of each unit, \sigma_j^2, can also be learned by an online rule, \Delta \sigma_j^2 = \epsilon [(y_j - \hat{y}_j)^2 - \sigma_j^2]. Alternatively, \sigma_j^2 can be fixed at 1 for all hidden units and the effective local noise level can be controlled by scaling the generative weights. \n\n3 The Role of Lateral Connections in Perceptual Inference \n\nIn RGBNs and other layered belief networks, fixing the value of a unit in one layer causes correlations between the parents of that unit in the layer above. One of the main reasons why purely bottom-up approaches to perceptual inference have proven inadequate for learning in layered belief networks is that they fail to take into account this phenomenon, which is known as \"explaining away.\" \n\nLee and Seung (1997) introduced a clever way of using lateral connections to handle explaining away effects during perceptual inference [7]. Consider the network shown in Fig. 2. One contribution, E_{below}, to the energy of the state of the network is the squared difference between the unrectified states of the units in one layer, y_j, and the top-down expectations generated by the states of units in the layer above. Assuming the local noise models for the lower layer units all have unit variance, and \n\n^2 If Gibbs sampling has not been run long enough to reach equilibrium, the delta rule follows the gradient of the penalized log probability of the data [10]. The penalty term is the Kullback-Leibler divergence between the equilibrium distribution and the distribution produced by Gibbs sampling. Other things being equal, the delta rule therefore adjusts the parameters that determine the equilibrium distribution to reduce this penalty, thus favouring models for which Gibbs sampling works quickly. 
\n\nignoring biases and constant terms that are unaffected by the states of the units, \n\nE_{below} = \frac{1}{2} \sum_j (y_j - \hat{y}_j)^2 = \frac{1}{2} \sum_j \big(y_j - \sum_k [y_k]^+ g_{kj}\big)^2. \n\n(4) \n\nRearranging this expression and setting r_{jk} = g_{kj} and m_{kl} = -\sum_j g_{kj} g_{lj} we get \n\nE_{below} = \frac{1}{2} \sum_j y_j^2 - \sum_k [y_k]^+ \sum_j y_j r_{jk} - \frac{1}{2} \sum_k [y_k]^+ \sum_l [y_l]^+ m_{kl}. \n\n(5) \n\nThis energy function can be exactly implemented in a network with recognition weights, r_{jk}, and symmetric lateral interactions, m_{kl}. The lateral and recognition connections allow a unit, k, to compute how E_{below} for the layer below depends on its own state and therefore they allow it to follow the gradient of E or to perform Gibbs sampling in E. \n\nFigure 2: A small segment of a network, showing the generative weights (dashed) and the recognition and lateral weights (solid) which implement perceptual inference and correctly handle explaining away effects. \n\nSeung's trick can be used in an RGBN and it eliminates the most neurally implausible aspect of this model, which is that a unit in one layer appears to need to send both its state y and the top-down prediction of its state \hat{y} to units in the layer above. Using the lateral connections, the units in the layer above can, in effect, compute all they need to know about the top-down predictions. In computer simulations, we can simply set each lateral connection m_{kl} to the dot product -\sum_j g_{kj} g_{lj}. It is also possible to learn these lateral connections in a more biologically plausible way by driving units in the layer below with unit-variance independent Gaussian noise and using a simple anti-Hebbian learning rule. Similarly, a purely local learning rule can learn recognition weights equal to the generative weights. 
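The rearrangement in Eq. 5 is an exact identity, which can be checked numerically (the layer sizes and random weights below are arbitrary illustrations):

```python
import numpy as np

# Check that Eq. 4 and Eq. 5 agree when r_jk = g_kj and m_kl = -sum_j g_kj*g_lj.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 7))      # generative weights g_kj (above x below)
y_above = rng.standard_normal(5)     # unrectified states in the layer above
y_below = rng.standard_normal(7)     # unrectified states in the layer below

R = G.T                              # recognition weights r_jk = g_kj
M = -G @ G.T                         # symmetric lateral weights m_kl
yk = np.maximum(y_above, 0.0)        # rectified states [y_k]+

e_eq4 = 0.5 * np.sum((y_below - yk @ G) ** 2)
e_eq5 = 0.5 * np.sum(y_below ** 2) - yk @ (y_below @ R) - 0.5 * yk @ M @ yk
```

Note that Eq. 5 involves only bottom-up input through R and lateral input through M, which is the point of the construction: no unit needs to see the top-down predictions for the layer below.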
If units at one layer are driven by unit-variance, independent Gaussian noise, and these in turn drive units in the layer below using the generative weights, then Hebbian learning between the two layers will learn the correct recognition weights [5]. \n\n4 Lateral Connections in the Generative Model \n\nWhen the generative model contains only top-down connections, lateral connections make it possible to do perceptual inference using locally available information. But it is also possible, and often desirable, to have lateral connections in the generative model. Such connections can cause nearby units in a layer to have a priori correlated activities, which in turn can lead to the formation of redundant codes and, as we will see, topographic maps. \n\nSymmetric lateral interactions between the unrectified states of units within a layer have the effect of adding a quadratic term to the energy function \n\nE_{MRF} = \frac{1}{2} \sum_k \sum_l M_{kl} y_k y_l, \n\n(6) \n\nwhich corresponds to a Gaussian Markov Random Field (MRF). During sampling, this term is simply added to the top-down energy contribution. Learning is more difficult. The difficulty stems from the need to know the derivatives of the partition function of the MRF for each data vector. This partition function depends on the top-down inputs to a layer so it varies from one data vector to the next, even if the lateral connections themselves are non-adaptive. Fortunately, since both the MRF and the top-down prediction define Gaussians over the states of the units in a layer, these derivatives can be easily calculated. Assuming unit variances, \n\n\Delta g_{ji} = \epsilon \left( [y_j]^+ (y_i - \hat{y}_i) + [y_j]^+ \sum_k [M(I + M)^{-1}]_{ik} \hat{y}_k \right) \n\n(7) \n\nwhere M is the MRF matrix for the layer including units i and k, and I is the identity matrix. The first term is the delta rule (Eq. 
3); the second term is the derivative of the partition function which unfortunately involves a matrix inversion. Since the partition function for a multivariate Gaussian is analytical it is also possible to learn the lateral connections in the MRF. \n\nLateral interactions between the rectified states of units add the quadratic term \frac{1}{2} \sum_k \sum_l M_{kl} [y_k]^+ [y_l]^+. The partition function is no longer analytical, so computing the gradient of the likelihood involves a two-phase Boltzmann-like procedure: \n\n\Delta g_{ji} = \epsilon \left( \langle [y_j]^+ y_i \rangle^+ - \langle [y_j]^+ y_i \rangle^- \right) \n\n(8) \n\nwhere \langle \cdot \rangle^+ averages with respect to the posterior distribution of y_i and y_j, and \langle \cdot \rangle^- averages with respect to the posterior distribution of y_j and the prior of y_i given units in the same layer as j. This learning rule suffers from all the problems of the Boltzmann machine, namely it is slow and requires two phases. However, there is an approximation which results in the familiar one-phase delta rule that can be described in three equivalent ways: (1) it treats the lateral connections in the generative model as if they were additional lateral connections in the recognition model; (2) instead of lateral connections in the generative model it assumes some fictitious children with clamped values which affect inference but whose likelihood is not maximised during learning; (3) it maximises a penalized likelihood of the model without the lateral connections in the generative model. \n\n5 Discovering Depth in Simplified Stereograms \n\nConsider the following generative process for stereo pairs. Random dots of uniformly distributed intensities are scattered sparsely on a one-dimensional surface, and the image is blurred with a Gaussian filter. This surface is then randomly placed at one of two different depths, giving rise to two possible left-to-right disparities between the images seen by each eye. Separate Gaussian noise is then added to the image seen by each eye. 
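The stereo-pair generative process just described can be sketched as follows (the dot count, blur width and noise level are illustrative assumptions; the 2 x 32 image shape, two global disparities and periodic boundary are kept from the text):

```python
import numpy as np

def make_stereo_pair(width=32, n_dots=4, blur_sigma=1.0, noise_std=0.1, rng=None):
    # One simplified stereo pair as a (2, width) array, plus the global
    # disparity that generated it.  Periodic boundary conditions throughout.
    rng = rng or np.random.default_rng()
    surface = np.zeros(width)
    dots = rng.choice(width, n_dots, replace=False)
    surface[dots] = rng.uniform(-1.0, 1.0, n_dots)   # sparse random-intensity dots
    # Circular Gaussian blur via the FFT.
    k = np.exp(-0.5 * ((np.arange(width) - width // 2) / blur_sigma) ** 2)
    k = np.roll(k / k.sum(), -(width // 2))
    blurred = np.real(np.fft.ifft(np.fft.fft(surface) * np.fft.fft(k)))
    d = rng.choice([-1, 1])                          # one of two possible depths
    pair = np.stack([blurred, np.roll(blurred, d)])  # left eye, right eye
    return pair + noise_std * rng.standard_normal(pair.shape), d
```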
Some images generated in this manner are shown in Fig. 3a. \n\nFigure 3: a) Sample data from the stereo disparity problem. The left and right columns of each 2 x 32 image are the inputs to the left and right eye, respectively. Periodic boundary conditions were used. The value of a pixel is represented by the size of the square, with white being positive and black being negative. Notice that pixel noise makes it difficult to infer the disparity, i.e. the vertical shift between the left and right columns, in some images. b) Sample images generated by the model after learning. \n\nWe trained a three-layer RGBN consisting of 64 visible units, 64 units in the first hidden layer and 1 unit in the second hidden layer on the 32-pixel wide stereo disparity problem. Each of the hidden units in the first hidden layer was connected to the entire array of visible units, i.e. it had inputs from both eyes. The hidden units in this layer were also laterally connected in an MRF over the unrectified units. Nearby units excited each other and more distant units inhibited each other, with the net pattern of excitation/inhibition being a difference of two Gaussians. This MRF was initialised with large weights which decayed exponentially to zero over the course of training. The network was trained for 30 passes through a data set of 2000 images. For each image we used 16 iterations of Gibbs sampling to approximate the posterior distribution over hidden states. Each iteration consisted of sampling every hidden unit once in a random order. The states after the fourth iteration of Gibbs sampling were used for learning, with a learning rate of 0.05 and a weight decay parameter of 0.001. 
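The difference-of-Gaussians lateral connectivity used to initialise the MRF can be sketched as follows (the widths, amplitudes and decay factor are illustrative assumptions, not the settings used in the experiments):

```python
import numpy as np

def dog_mrf(n=64, sigma_e=2.0, sigma_i=6.0, amp_e=1.0, amp_i=0.5):
    # Difference-of-Gaussians lateral weights on a ring of n units: nearby
    # units excite each other, more distant units inhibit each other.
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    d = np.minimum(d, n - d)                 # circular (periodic) distance
    M = (amp_e * np.exp(-0.5 * (d / sigma_e) ** 2)
         - amp_i * np.exp(-0.5 * (d / sigma_i) ** 2))
    np.fill_diagonal(M, 0.0)                 # no self-connection
    return M

M = dog_mrf()
decay = 0.9        # the MRF decays exponentially towards zero during training
M = decay * M
```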
Since the top level of the generative process makes a discrete decision between left and right global disparity we used a trivial extension of the RGBN in which the top level unit saturates both at 0 and 1. \n\nFigure 4: Generative weights of a three-layered RGBN after being trained on the stereo disparity problem. a) Weights from the top layer hidden unit to the 64 middle-layer hidden units. b) Biases of the middle-layer hidden units, and c) weights from the hidden units to the 2 x 32 visible array. \n\nThirty-two of the hidden units learned to become local left-disparity detectors, while the other 32 became local right-disparity detectors (Fig. 4c). The unit in the second hidden layer learned positive weights to the left-disparity detectors in the layer below, and negative weights to the right detectors (Fig. 4a). In fact, the activity of this top unit discriminated the true global disparity of the input images with 99% accuracy. A random sample of images generated by the model after learning is shown in Fig. 3b. In addition to forming a hierarchical distributed representation of disparity, units in the hidden layer self-organised into a topographic map. The MRF caused high correlations between nearby units early in learning, which in turn resulted in nearby units learning similar weight vectors. The emergence of topography depended on the strength of the MRF and on the speed with which it decayed. Results were relatively insensitive to other parametric changes. \n\nWe also presented image patches taken from natural images [1] to a network with units in the first hidden layer arranged in a laterally connected 2D grid. The network developed local feature detectors, with nearby units responding to similar features (Fig. 5). 
Not all units were used, but the unused units all clustered into one area. \n\n6 Discussion \n\nClassical models of topography formation such as Kohonen's self-organising map [6] and the elastic net [2, 4] can be thought of as variations on mixture models where additional constraints have been placed to encourage neighboring hidden units to have similar generative weights. The problem with a mixture model is that it cannot handle images in which there are several things going on at once. In contrast, we have shown that topography can arise in much richer hierarchical and componential generative models by inducing correlations between neighboring units. \n\nFigure 5: Generative weights of an RGBN trained on 12 x 12 natural image patches: weights from each of the 100 hidden units which were arranged in a 10 x 10 sheet with toroidal boundary conditions. \n\nThere is a sense in which topography is a necessary consequence of the lateral connection trick used for perceptual inference. It is infeasible to interconnect all pairs of units in a cortical area. If we assume that direct lateral interactions (or interactions mediated by interneurons) are primarily local, then widely separated units will not have the apparatus required for explaining away. Consequently the computation of the posterior distribution will be incorrect unless the generative weight vectors of widely separated units are orthogonal. If the generative weights are constrained to be positive, the only way two vectors can be orthogonal is for each to have zeros wherever the other has non-zeros. Since the redundancies that the hidden units are trying to model are typically spatially localised, it follows that widely separated units must attend to different parts of the image and units can only attend to overlapping patches if they are laterally interconnected. 
The lateral connections in the generative model assist in the formation of the topography required for correct perceptual inference. \n\nAcknowledgements. We thank P. Dayan, B. Frey, G. Goodhill, D. MacKay, R. Neal and M. Revow. The research was funded by NSERC and ITRC. GEH is the Nesbitt-Burns fellow of CIAR. \n\nReferences \n\n[1] A. Bell & T. J. Sejnowski. The 'Independent components' of natural scenes are edge filters. Vision Research, In Press. \n[2] R. Durbin & D. Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326(16):689-691, 1987. \n[3] Z. Ghahramani & G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Univ. Toronto Technical Report CRG-TR-96-1, 1996. \n[4] G. J. Goodhill & D. J. Willshaw. Application of the elastic net algorithm to the formation of ocular dominance stripes. Network: Computation in Neural Systems, 1:41-59, 1990. \n[5] G. E. Hinton & Z. Ghahramani. Generative models for discovering sparse distributed representations. Philos. Trans. Roy. Soc. B, 352:1177-1190, 1997. \n[6] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982. \n[7] D. D. Lee & H. S. Seung. Unsupervised learning by convex and conic coding. In M. Mozer, M. Jordan, & T. Petsche, eds., NIPS 9. MIT Press, Cambridge, MA, 1997. \n[8] M. S. Lewicki & T. J. Sejnowski. Bayesian unsupervised learning of higher order structure. In NIPS 9. MIT Press, Cambridge, MA, 1997. \n[9] R. M. Neal. Connectionist learning of belief networks. Artif. Intell., 56:71-113, 1992. \n[10] R. M. Neal & G. E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. Unpublished Manuscript, 1993. \n", "award": [], "sourceid": 1472, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}