{"title": "Competitive Anti-Hebbian Learning of Invariants", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1024, "abstract": null, "full_text": "Competitive Anti-Hebbian Learning of Invariants \n\nNicol N. Schraudolph \nComputer Science & Engr. Dept. \nUniversity of California, San Diego \nLa Jolla, CA 92093-0114 \nnici@cs.ucsd.edu \n\nTerrence J. Sejnowski \nComputational Neurobiology Laboratory \nThe Salk Institute for Biological Studies \nLa Jolla, CA 92186-5800 \ntsejnowski@ucsd.edu \n\nAbstract \n\nAlthough the detection of invariant structure in a given set of input patterns is vital to many recognition tasks, connectionist learning rules tend to focus on directions of high variance (principal components). The prediction paradigm is often used to reconcile this dichotomy; here we suggest a more direct approach to invariant learning based on an anti-Hebbian learning rule. An unsupervised two-layer network implementing this method in a competitive setting learns to extract coherent depth information from random-dot stereograms. \n\n1 INTRODUCTION: LEARNING INVARIANT STRUCTURE \n\nMany connectionist learning algorithms share with principal component analysis (Jolliffe, 1986) the strategy of extracting the directions of highest variance from the input. A single Hebbian neuron, for instance, will come to encode the input's first principal component (Oja and Karhunen, 1985); various forms of lateral interaction can be used to force a layer of such nodes to differentiate and span the principal component subspace - cf. (Sanger, 1989; Kung, 1990; Leen, 1991), and others. The same type of representation also develops in the hidden layer of backpropagation autoassociator networks (Baldi and Hornik, 1989). 
\n\nHowever, the directions of highest variance need not always be those that yield the most information, or - as the case may be - the information we are interested in (Intrator, 1991). In fact, it is sometimes desirable to extract the invariant structure of a stimulus instead, learning to encode those aspects that vary the least. The problem, then, is how to achieve this within a connectionist framework that is so closely tied to the maximization of variance. \n\n1017 \n\n\f1018 \n\nSchraudolph and Sejnowski \n\nIn (Földiák, 1991), spatial invariance is turned into a temporal feature by presenting transformation sequences within invariance classes as a stimulus. A built-in temporal smoothness constraint enables Hebbian neurons to learn these transformations, and hence the invariance classes. Although this is an efficient and neurobiologically attractive strategy, it is limited by its strong assumptions about the nature of the stimulus. \n\nA more general approach is to make information about invariant structure available in the error signal of a supervised network. The most popular way of doing this is to require the network to predict the next patch of some structured input from the preceding context, as in (Elman, 1990); the same prediction technique can be used across space as well as time. It is also possible to explicitly derive an error signal from the mutual information between two patches of structured input (Becker and Hinton, 1992), a technique which has been applied to viewpoint-invariant object recognition (Zemel and Hinton, 1991). \n\n2 METHODS \n\n2.1 ANTI-HEBBIAN FEEDFORWARD LEARNING \n\nIn most formulations of the covariance learning rule it is quietly assumed that the learning rate be positive. By reversing the sign of this constant in a recurrent autoassociator, Kohonen constructed a \"novelty filter\" that learned to be insensitive to familiar features in its input (Kohonen, 1989). 
More recently, such anti-Hebbian synapses have been used for lateral decorrelation of feature detectors (Barlow and Földiák, 1989; Leen, 1991) as well as - in differential form - removal of temporal variations from the input (Mitchison, 1991). \n\nWe suggest that in certain cases the use of anti-Hebbian feedforward connections to learn invariant structure may eliminate the need to bring in the heavy machinery of supervised learning algorithms required by the prediction paradigm, with its associated lack of neurobiological plausibility. Specifically, this holds for linear problems, where the stimuli lie near a hyperplane in the input space: the weight vector of an anti-Hebbian neuron will move into a direction normal to that hyperplane, thus characterizing the invariant structure. \n\nOf course a set of Hebbian feature detectors whose weight vectors span the hyperplane would characterize the associated class of stimuli just as well. The anti-Hebbian learning algorithm, however, provides a more efficient representation when the dimensionality of the hyperplane is more than half that of the input space, since fewer normal vectors than spanning vectors are required for unique characterization in this case. Since they remove rather than extract the variance within a stimulus class, anti-Hebbian neurons also present a very different output representation to subsequent layers. \n\nUnfortunately it is not sufficient to simply negate the learning rate of a layer of Hebbian feature detectors in order to turn them into working anti-Hebbian invariance detectors: although such a change of sign does superficially achieve the intended effect, many of the subtleties that make Hebb's rule work in practice do not survive the transformation. In what follows we address some of the problems thus introduced. 
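The hyperplane-normal claim above can be illustrated with a minimal single-neuron sketch (our own toy construction, not the paper's simulation: a Hebbian update with negated learning rate plus explicit renormalization, whose necessity is discussed in the next section). Stimuli are drawn near a 2-D plane in 3-D input space, and the weight vector converges to that plane's normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stimuli lie near the plane spanned by (1, -1, 0) and (0, 0, 1);
# the plane's unit normal is n = (1, 1, 0)/sqrt(2).
def stimulus():
    a, b = rng.standard_normal(2)
    return (a * np.array([1.0, -1.0, 0.0])
            + b * np.array([0.0, 0.0, 1.0])
            + 0.01 * rng.standard_normal(3))   # small off-plane noise

w = rng.standard_normal(3)
w /= np.linalg.norm(w)
eta = 0.05                        # learning rate (arbitrary choice)
for _ in range(2000):
    x = stimulus()
    y = w @ x
    w -= eta * y * x              # Hebb rule with negated learning rate
    w /= np.linalg.norm(w)        # explicit weight renormalization

n = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
alignment = abs(w @ n)            # approaches 1 as w aligns with the normal
```

Minimizing the output variance on the unit sphere drives w toward the direction of least variance, i.e. the constraint normal; a Hebbian neuron on the same data would instead converge to the in-plane direction of highest variance.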
\n\nLike the Hebb rule, anti-Hebbian learning requires weight normalization, in this case to prevent weight vectors from collapsing to zero. Oja's active decay rule (Oja, 1982) is a popular local approximation to explicit weight normalization: \n\nΔw = η (x y - w y^2),  where y = w^T x   (1) \n\nHere the first term in parentheses represents the standard Hebb rule, while the second is the active decay. Unfortunately, Oja's rule can not be used for weight growth in anti-Hebbian neurons since it is unstable for negative learning rates (η < 0), as is evident from the observation that the growth/decay term is proportional to w. In our experiments, explicit L2-normalization of weight vectors was therefore used instead. \n\nHebbian feature detectors attain maximal activation for the class of stimuli they represent. Since the weight vectors of anti-Hebbian invariance detectors are normal to the invariance class they represent, membership in that class is signalled by a zero activation. In other words, linear anti-Hebbian nodes signal violations of the constraints they encode rather than compliance. While such an output representation can be highly desirable for some applications[1], it is unsuitable for others, such as the classification of mixtures of invariants described below. \n\nWe therefore use a symmetric activation function that responds maximally for a zero net input, and decays towards zero for large net inputs. More specifically, we use Gaussian activation functions, since these allow us to interpret the nodes' outputs as class membership probabilities. Soft competition between nodes in a layer can then be implemented simply by normalizing these probabilities (i.e. dividing each output by the sum of outputs in a layer), then using them to scale weight changes (Nowlan, 1990). 
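The activation scheme of the last paragraph can be made concrete in a few lines (an illustrative sketch; the Gaussian width and the exact form of the probability-scaled update are our assumptions, as the text does not spell them out):

```python
import numpy as np

def soft_competition_step(W, x, eta=0.1, sigma=1.0):
    # Gaussian outputs: maximal for zero net input, decaying for large nets.
    nets = W @ x
    outs = np.exp(-nets**2 / (2 * sigma**2))
    # Soft competition: normalize outputs into class-membership probabilities.
    probs = outs / outs.sum()
    # Probabilities scale the anti-Hebbian (negated learning rate) change ...
    W = W - eta * probs[:, None] * np.outer(nets, x)
    # ... and weights are explicitly L2-normalized, since Oja decay is
    # unstable for negative learning rates.
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    return W, probs

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 6))
W /= np.linalg.norm(W, axis=1, keepdims=True)
W, probs = soft_competition_step(W, rng.standard_normal(6))
```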
\n\n2.2 AN ANTI-HEBBIAN OBJECTIVE FUNCTION \n\nThe magnitude of weight change in a Hebbian neuron is proportional to the cosine of the angle between input and weight vectors. This means that nodes that best represent the current input learn faster than those which are further away, thus encouraging differentiation among weight vectors. Since anti-Hebbian weight vectors are normal to the hyperplanes they represent, those that best encode a given stimulus will experience the least change in weights. As a result, weight vectors will tend to clump together unless weight changes are rescaled to counteract this deficiency. In our experiments, this is done by the soft competition mechanism; here we present a more general framework towards this end. \n\nA simple Hebbian neuron maximizes the variance of its output y through stochastic approximation by performing gradient ascent in (1/2) y^2 (Oja and Karhunen, 1985): \n\nΔw_i ∝ ∂/∂w_i (1/2) y^2 = y ∂y/∂w_i = x_i y   (2) \n\nAs seen above, it is not sufficient for an anti-Hebbian neuron to simply perform gradient descent in the same function. Instead, an objective function whose derivative has inverse magnitude to the above at every point is needed, as given by \n\nΔw_i ∝ -∂/∂w_i (1/2) log(y^2) = -(1/y) ∂y/∂w_i = -x_i / y   (3) \n\n[1] Consider the subsumption architecture of a hierarchical network in which higher layers only receive information that is not accounted for by earlier layers. \n\nFigure 1: Possible objective functions for anti-Hebbian learning (see text). 
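The 'inverse magnitude' relation between (2) and (3) can be checked componentwise: |x_i y| * |x_i / y| = x_i^2, independent of y. A quick numerical spot-check (our own, on random data):

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal(4)
x = rng.standard_normal(4)
y = w @ x

hebb_grad = x * y     # ascent direction for (1/2) y^2, as in eq. (2)
anti_grad = -x / y    # descent direction for (1/2) log(y^2), as in eq. (3)

# The componentwise product of the two gradient magnitudes is x_i^2,
# independent of y: where the Hebbian step would be large, the
# anti-Hebbian step is small, and vice versa.
prod = np.abs(hebb_grad) * np.abs(anti_grad)
```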
\n\nUnfortunately, the pole at y = 0 presents a severe problem for simple gradient descent methods: the near-infinite derivatives in its vicinity lead to catastrophically large step sizes. More sophisticated optimization methods deal with this problem by explicitly controlling the step size; for plain gradient descent we suggest reshaping the objective function at the pole such that its partials never exceed the input in magnitude: \n\nΔw_i ∝ -∂/∂w_i ε log(y^2 + ε^2) = -2ε x_i y / (y^2 + ε^2),   (4) \n\nwhere ε > 0 is a free parameter determining at which point the logarithmic slope is abandoned in favor of a quadratic function which forms an optimal trapping region for simple gradient descent (Figure 1). \n\n3 RESULTS ON RANDOM-DOT STEREOGRAMS \n\nIn random-dot stereograms, stimuli of a given stereo disparity lie on a hyperplane whose dimensionality is half that of the input space plus the disparity in pixels. This is easily appreciated by considering that given, say, the left half-image and the disparity, one can predict the right half-image except for the pixels shifted in at the edge. Thus stereo disparities that are small compared to the receptive field width can be learned equally well by Hebbian and anti-Hebbian algorithms; when the disparity approaches receptive field width, however, anti-Hebbian neurons have a distinct advantage. \n\n3.1 SINGLE LAYER NETWORK: LOCAL DISPARITY TUNING \n\nOur training set consisted of stereo images of 5,000 frontoparallel strips at uniformly random depth covered densely with Gaussian features of random location, width, polarity and power. The images were discretized by integrating over pixel bins in order to allow for sub-pixel disparity acuity. Figure 2 shows that a single cluster of five anti-Hebbian nodes with soft competition develops near-perfect tuning curves for local stereo disparity after 10 sweeps through this training set. 
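Returning to the reshaped objective (4): the claim that its partials never exceed the input in magnitude follows from 2ε|y| <= y^2 + ε^2 (AM-GM), which confines the scalar prefactor to [-1, 1]. A numerical spot-check of our own, with an arbitrary ε:

```python
import numpy as np

def reshaped_update(w, x, eps=0.1):
    # Anti-Hebbian step from the reshaped objective eps * log(y^2 + eps^2).
    # The prefactor 2*eps*y / (y^2 + eps^2) lies in [-1, 1], unlike the
    # bare log objective whose derivative x_i / y blows up at y = 0.
    y = w @ x
    return -(2 * eps * y / (y**2 + eps**2)) * x

rng = np.random.default_rng(1)
bounded = True
for _ in range(1000):
    w = rng.standard_normal(4)
    x = rng.standard_normal(4)
    g = reshaped_update(w, x)
    # Every partial is bounded by the corresponding input component.
    bounded &= bool(np.all(np.abs(g) <= np.abs(x) + 1e-12))
```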
This disparity tuning is achieved by learning to have corresponding weights (at the given disparity) be of equal magnitude but opposite sign, so that any stimulus pattern at that disparity yields a zero net input and thus maximal response. \n\nFigure 2: Sliding window average response of first-layer nodes after presentation of 50,000 stereograms as a function of stimulus disparity: strong disparity tuning is evident. \n\nFigure 3: Architecture of the network (see text). [Recoverable labels: left and right input half-images; two first-layer clusters with random connectivity (5/7 per half-image); Gaussian nonlinearities and soft competition within clusters; a fully connected second-layer cluster; anti-Hebbian learning rule throughout.] \n\nNote, however, that this type of detector suffers from false positives: input patterns that happen to yield near-zero net input even though they have a different stereo disparity. Although the individual response of a tuned node to an input pattern of the wrong disparity is therefore highly idiosyncratic, the sliding window average of each response with its 250 closest neighbors (with respect to disparity) shown in Figure 2 is far more well-behaved. 
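The equal-magnitude, opposite-sign solution described above can be checked on a toy tuned unit (our own construction: an integer disparity with wrap-around shift stands in for the paper's sub-pixel disparities and Gaussian features):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 2     # half-image width and an illustrative integer disparity

# Tuned unit over the concatenated (left, right) input: each left weight is
# paired with a weight of equal magnitude but opposite sign on the right
# pixel shifted by d.
v = rng.standard_normal(n)
w = np.concatenate([v, -np.roll(v, d)])

def stereogram(disparity):
    # Random-dot pair whose right half is the left half shifted (with wrap).
    left = rng.standard_normal(n)
    return np.concatenate([left, np.roll(left, disparity)])

net_tuned = w @ stereogram(d)   # zero net input: maximal Gaussian response
net_wrong = w @ stereogram(0)   # generically nonzero: reduced response
```

The unit's zero net input at disparity d holds for any dot pattern, which is exactly the invariance property; patterns at other disparities generally produce nonzero net input, apart from the false positives discussed in the text.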
\nThis indicates that the average activity over a number of patterns (in a \"moving stereogram\" paradigm) - or, alternatively, over a population of nodes tuned to the same disparity - allows discrimination of disparities with sub-pixel accuracy. \n\n3.2 TWO-LAYER NETWORK: COHERENT DISPARITY TUNING \n\nIn order to investigate the potential for hierarchical application of this architecture, it was extended to two layers as shown in Figure 3. The two first-layer clusters with non-overlapping receptive fields extract local stereo disparity as before; their output is monitored by a second-layer cluster. Note that there is no backpropagation of derivatives: all three clusters use the same unsupervised learning algorithm. \n\nThis network was trained on coherent input, i.e. stimuli for which the stereo disparity was identical across the receptive field boundary of first-layer clusters. As shown in Figure 4, the second layer learns to preserve the first layer's disparity tuning for coherent patterns, albeit in somewhat degraded form. Each node in the second layer learns to pick out exactly the two corresponding nodes in the first-layer clusters, again by giving them weights of equal magnitude but opposite sign. \n\nHowever, the second layer represents more than just a noisy copy of the first layer: it meaningfully integrates coherence information from the two receptive fields. This can be demonstrated by testing the trained network on non-coherent stimuli which exhibit a depth discontinuity between the receptive fields of first-layer clusters. The overall response of the second layer is tuned to the coherent stimuli it was trained on (Figure 5). \n\n4 DISCUSSION \n\nAlthough a negation of the learning rate introduces various problems to the Hebb rule, feedforward anti-Hebbian networks can pick up invariant structure from the input. 
We have demonstrated this in a competitive classification setting; other applications of this framework are possible. We find the subsumption aspect of anti-Hebbian learning particularly intriguing: the real world is so rich in redundant data that a learning rule which can adaptively ignore much of it must surely be an advantage. From this point of view, the promising first experiments we have reported here use quite impoverished inputs; one of our goals is therefore to extend this work towards real-world stimuli. \n\nAcknowledgements \n\nWe would like to thank Geoffrey Hinton, Sue Becker, Tony Bell and Steve Nowlan for the stimulating and helpful discussions we had. Special thanks to Sue Becker for permission to use her random-dot stereogram generator early in our investigation. This work was supported by a fellowship stipend from the McDonnell-Pew Center for Cognitive Neuroscience at San Diego to the first author, who also received a NIPS travel grant enabling him to attend the conference. \n\nFigure 4: Sliding window average response of second-layer nodes after presentation of 250,000 coherent stereograms as a function of stimulus disparity: disparity tuning is preserved in degraded form. \n\nFigure 5: Sliding window average of total second-layer response to non-coherent input as a function of stimulus discontinuity: second layer is tuned to coherent patterns. 
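The second-layer solution reported in Section 3.2 admits a similar toy illustration (our own sketch; cluster size five matches the text, while the one-hot cluster outputs and Gaussian width are simplifications): a unit pairing corresponding nodes of the two first-layer clusters with opposite-sign weights receives zero net input when the clusters agree, and is damped by a depth discontinuity:

```python
import numpy as np

k = 5                            # nodes per first-layer cluster
w2 = np.zeros(2 * k)
w2[1], w2[k + 1] = 1.0, -1.0     # pair node 1 of each cluster, opposite signs

def response(p_left, p_right, sigma=0.5):
    # Gaussian response of the second-layer unit to the clusters' outputs.
    net = w2 @ np.concatenate([p_left, p_right])
    return np.exp(-net**2 / (2 * sigma**2))

one_hot = np.eye(k)
r_coherent = response(one_hot[1], one_hot[1])   # clusters agree: net = 0
r_discont = response(one_hot[1], one_hot[3])    # discontinuity: net = 1
```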
\n\nReferences \n\nBaldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53-58. \n\nBarlow, H. B. and Földiák, P. (1989). Adaptation and decorrelation in the cortex. In Durbin, R. M., Miall, C., and Mitchison, G. J., editors, The Computing Neuron, chapter 4, pages 54-72. Addison-Wesley, Wokingham. \n\nBecker, S. and Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, to appear. \n\nElman, J. (1990). Finding structure in time. Cognitive Science, 14:179-211. \n\nFöldiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3:194-200. \n\nIntrator, N. (1991). Exploratory feature extraction in speech signals. In (Lippmann et al., 1991), pages 241-247. \n\nJolliffe, I. (1986). Principal Component Analysis. Springer-Verlag, New York. \n\nKohonen, T. (1989). Self-Organization and Associative Memory. Springer-Verlag, Berlin, 3rd edition. \n\nKung, S. Y. (1990). Neural networks for extracting constrained principal components. Submitted to IEEE Trans. Neural Networks. \n\nLeen, T. K. (1991). Dynamics of learning in linear feature-discovery networks. Network, 2:85-105. \n\nLippmann, R. P., Moody, J. E., and Touretzky, D. S., editors (1991). Advances in Neural Information Processing Systems, volume 3, Denver 1990. Morgan Kaufmann, San Mateo. \n\nMitchison, G. (1991). Removing time variation with the anti-Hebbian differential synapse. Neural Computation, 3:312-320. \n\nNowlan, S. J. (1990). Maximum likelihood competitive learning. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems, volume 2, pages 574-582, Denver 1989. Morgan Kaufmann, San Mateo. \n\nOja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15:267-273. 
\n\nOja, E. and Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106:69-84. \n\nSanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459-473. \n\nZemel, R. S. and Hinton, G. E. (1991). Discovering viewpoint-invariant relationships that characterize objects. In (Lippmann et al., 1991), pages 299-305. \n", "award": [], "sourceid": 472, "authors": [{"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}