{"title": "Non-Linear Statistical Analysis and Self-Organizing Hebbian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 407, "page_last": 414, "abstract": null, "full_text": "Non-linear Statistical Analysis and \nSelf-Organizing Hebbian Networks \n\nJonathan L. Shapiro and Adam Priigel-Bennett \n\nDepartment of Computer Science \n\nThe University, Manchester \n\nManchester, UK \n\nM139PL \n\nAbstract \n\nNeurons learning under an unsupervised Hebbian learning rule can \nperform a nonlinear generalization of principal component analysis. \nThis relationship between nonlinear PCA and nonlinear neurons is \nreviewed. The stable fixed points of the neuron learning dynamics \ncorrespond to the maxima of the statist,ic optimized under non(cid:173)\nlinear PCA. However, in order to predict. what the neuron learns, \nknowledge of the basins of attractions of the neuron dynamics is \nrequired. Here the correspondence between nonlinear PCA and \nneural networks breaks down. This is shown for a simple model. \nMethods of statistical mechanics can be used to find the optima \nof the objective function of non-linear PCA. This determines what \nthe neurons can learn. In order to find how the solutions are parti(cid:173)\ntioned amoung the neurons, however, one must solve the dynamics. \n\n1 \n\nINTRODUCTION \n\nLinear neurons learning under an unsupervised Hebbian rule can learn to perform a \nlinear statistical analysis ofthe input data. This was first shown by Oja (1982), who \nproposed a learning rule which finds the first principal component of the variance \nmatrix of the input data. Based on this model, Oja (1989), Sanger (1989), and \nmany others have devised numerous neural networks which find many components \nof this matrix. These networks perform principal component analysis (PCA), a \nwell-known method of statistical analysis. 
\n\n407 \n\n\f408 \n\nShapiro and Priigel-Bennett \n\nSince PCA is a form of linear analysis, and the neurons used in the PCA networks \nare linear -\nthe output of these neurons is equal to the weighted sum of inputs; \nthere is no squashing function of sigmoid - it is obvious to ask whether non-linear \nHebbian neurons compute some form of non-linear PCA? Is this a useful way to \nunderstand the performance of the networks? Do these networks learn to extract \nfeatures of the input data which are different from those learned by linear neurons? \nCurrently in the literature, the phrase \"non-linear PCA\" is used to describe what \nis learned by any non-linear generalization of Oja neurons or other PCA networks \n(see for example, Oja, 1993 and Taylor, 1993). \n\nIn this paper, we discuss the relationship between a particular form of non-linear \nHebbian neurons (Priigel-Bennett and Shapiro, 1992) and a particular generaliza(cid:173)\ntion of non-linear PCA (Softky and Kammen 1991). It is clear that non-linear neu(cid:173)\nrons can perform very differently from linear ones. This has been shown through \nanalysis (Priigel-Bennett and Shapiro, 1993) and in application (Karhuenen and \nJoutsensalo, 1992). It can also be very useful way of understanding what the neu(cid:173)\nrons learn. This is because non-linear PCA is equivalent to maximizing some objec(cid:173)\ntive function. The features that this extracts from a data set can be studied using \ntechniques of statistical mechanics. However, non-linear PCA is ambiguous because \nthere are multiple solutions. What the neuron can learn is given by non-linear PCA. \nThe likelihood of learning the different solutions is governed by the dyanamics cho(cid:173)\nsen to implement non-linear PCA, and may differ in different implementations of \nthe dynamics. \n\n2 NON-LINEAR HEBBIAN NEURONS \n\nNeurons with non-linear activation functions can learn to perform very different \ntasks from those learned by linear neurons. 
Nonlinear Hebbian neurons have been \nanalyzed for general non-linearities by Oja (1991), and were applied to sinusoidal \nsignal detection by Karhunen and Joutsensalo (1992). \n\nPreviously, we analysed a simple non-linear generalization of Oja's rule (Prügel-Bennett \nand Shapiro, 1993). We showed how the shape of the neuron activation \nfunction can control what a neuron learns. Whereas linear neurons learn a \nstatistical mixture of all of the input patterns, non-linear neurons can learn to become \ntuned to individual patterns, or to small clusters of closely correlated patterns. \n\nIn this model, each neuron has weights; W_i is the weight from the ith input. The neuron \nresponds to the usual sum of inputs times weights through an activation function \nA(y). This is assumed to be a simple power-law above a threshold and zero below it, i.e. \n\nA(y) = (y - \\phi)^b for y > \\phi, and A(y) = 0 otherwise. \n\n(1) \n\nHere \\phi is the threshold, b controls the power of the power-law, x_i^p is the ith component \nof the pth pattern, and V^p = \\sum_i x_i^p W_i. Curves of these functions are shown \nin figure 1a; if b = 1 the neurons are threshold-linear. For b > 1 the curves can be \nthought of as low activation approximations to a sigmoid, which is shown in figure \n1b. \n\nFigure 1: a) The form of the neuron activation function. Controlled by two parameters, \nb and \\phi, this activation function approximates a sigmoid, which is shown in b). \n\nThe generalization of Oja's learning rule is that the change in the weights \\delta W_i \nis given by \n\n\\delta W_i = \\sum_p A(V^p) [x_i^p - V^p W_i]. \n\n(2) \n\nIf b < 1, the neuron learns to average a set of patterns. If b = 1, the neuron finds \nthe principal component of the pattern set. 
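A minimal numerical sketch of this learning rule (our own illustration; the pattern set, the choice b = 3, the zero threshold, and the learning rate are assumptions, not values from the paper) shows a neuron with b > 1 becoming tuned to a single pattern:

```python
import numpy as np

rng = np.random.default_rng(1)

def A(y, b=3.0, phi=0.0):
    # Eq. (1): power-law above the threshold phi, zero below it.
    return np.where(y > phi, np.clip(y - phi, 0.0, None) ** b, 0.0)

# Four weakly correlated random patterns (unit length, N = 200).
P, N = 4, 200
x = rng.normal(size=(P, N))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# Start from a roughly even mixture of the patterns.
w = x.sum(axis=0) + 0.1 * rng.normal(size=N)
w /= np.linalg.norm(w)

eta = 0.05
for _ in range(3000):
    V = x @ w                          # V^p = sum_i x_i^p W_i
    a = A(V)                           # A(V^p)
    w += eta * (a @ x - (a @ V) * w)   # batch form of eq. (2)

overlaps = x @ w
print(np.round(overlaps, 2))
```

The power b and the threshold are the two knobs of figure 1a; raising b sharpens the competition between patterns, so the weight vector ends up aligned with one pattern rather than with a mixture.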
When b > 1, the neuron learns to \ndistinguish one of the patterns in the presence of the others, if those others are not \ntoo correlated with the pattern. There is a critical correlation which is determined \nby b; the neuron learns individual patterns which are less correlated than the \ncritical value, but learns something like the center of the cluster if the patterns \nare more correlated. The threshold controls the size of the subset of patterns which \nthe neuron can respond to. \n\nFor these neurons, the relationship between non-linear PCA and the activation function \nwas not previously discussed. That is done in the next section. \n\n3 NON-LINEAR PCA \n\nA non-linear generalization of PCA was proposed by Softky and Kammen (1991). \nIn this section, the relationship between non-linear PCA and unsupervised Hebbian \nlearning is reviewed. \n\n3.1 WHAT IS NON-LINEAR PCA \n\nThe principal component of a set of data is the direction which maximises the \nvariance, i.e. to find the principal component of the data set, find the vector W of \nunit length which maximises \n\n\\langle (\\sum_i W_i x_i)^2 \\rangle. \n\n(3) \n\nHere, x_i denotes the ith component of an input pattern and \\langle \\cdot \\rangle denotes \nthe average over the patterns. Softky and Kammen suggested that an appropriate \ngeneralization is to find the vector W which maximizes the d-dimensional correlation, \n\n\\langle (\\sum_i W_i x_i)^d \\rangle. \n\n(4) \n\nThey argued this would give interesting results if higher order correlations are important, \nor if the shape of the data cloud is not second order. This can be generalized \nfurther, of course, maximizing the average of any non-linear function of the input, \nU(y), \n\n\\langle U(\\sum_i W_i x_i) \\rangle. \n\n(5) \n\nThe equations for the principal components are easily found using Lagrange multipliers. \nThe extremal points are given by \n\n\\langle U'(\\sum_k W_k x_k) x_i \\rangle = \\lambda W_i. \n\n(6) 
\n\nk \n\nThese points will be (local) maxima if the Hessian 1lij, \n\n1lij =< U\"(I: WkXk)XiXj > -ADij, \n\nk \n\nHere, A is a Lagrange multiplier chosen to make Iwl 2 = 1. \n\n3.2 NEURONS WHICH LEARN PCA \n\n(6) \n\n(7) \n\nA neuron learning via unsupervised Hebbian learning rule can perform this opti(cid:173)\nmization. This is done by associating Wi with the weight from the ith input to \nthe neuron, and the data average < . > as the sum over input patterns xf. The \nnonlinear function which is optimized is determined by the integral of the activation \nfunction of the neuron \n\nA(y) = U'(y). \n\nIn their paper, Softky and Kammen propose a learning rule which does not perform \nthis optimization in general. The correct learning rule is a generalization of Oja's \nrule (equation (2) above), in this notation, \n\n(8) \n\n\fNon-Linear Statistical Analysis and Self-Organizing Hebbian Networks \n\n411 \n\nThis fixed points of this dynamical equation will be solutions to the extremal equa(cid:173)\ntion of nonlinear peA, equation (6), when the a.'3sociations \n\nand \n\nare made. \n\nA = (A(V)V) , \n\nA(y) = U'(y) \n\nHere (.) is interpreted as sum over patterns; this is batch learning. The rule can also \nbe used incrementally, but then the dynamics are stochastic and the optimization \nmight be performed only on average, and then maybe only for small enough learning \nrates. These fixed points will be stable when the Hessian llij is negative definite at \nthe fixed point. This is now, \n\nwhich is the same as the previous, equation (7),in directions perpendicular to the \nfixed point, but contains additional terms in direction of the fixed point which \nnormalize it. \n\nThe neurons described in section 2 would perform precisely what Softky and Kam(cid:173)\nmen proposed if the activation function was pure power-law and not thresholded; \nas it is they maximize a more complicated objective function. 
\n\nSince there is a one to one correspondence between the stable fixed points of the \ndynamics and the local maxima of the non-linear correlation measure, one says that \nthese non-linear neurons compute non-linear peA. \n\n3.3 THEORETICAL STUDIES OF NONLINEAR PCA \n\nIn order to understand what these neurons learn, we have studied the networks \nlearning on model data drawn from statistical distributions. For very dense clusters \np ~ 00, N fixed, the stable fixed point equations are algebraic. In a few simple \ncases they can be solved. For example, if the data is Gaussian or if the data cloud is \na quadratic cloud (a function of a quadratic form), the neuron learns the principal \ncomponent, like the linear neuron. Likewise, if the patterns are not random, the \nfixed point equations can be solved in some cases. \n\nFor large number of patterns in high dimensions fluctuations in the data are im(cid:173)\nportant (N and P goes to 00 together in some way). In this case, methods of \nstatistical mechanics can be used to average over the data. The objective function \nof the non-linear peA acts as (minus) the energy in statistical mechanics. The free \nenergy is formally, \n\nF =< IOg(D. J Of, 6 (t wl- I) exp (3U(V) > . \n\n(10) \n\nIn the limit that f3 is large, this calculation finds the local maxima of U. In this \nform of analysis, the fact that the neuron optimizes an objective function is very \nimportant. This technique was used to produce the results outlined in section 2. \n\n\f412 \n\nShapiro and Priigel-Bennett \n\n3.4 WHAT NON-LINEAR peA FAILS TO REVEAL \n\nIn the linear peA, there is one unique solution, or if there are many solutions \nit is because the solutions are degenerate. However, for the non-linear situation, \nthere are many stable fixed points of the dynamics and many local maxima of the \nnon-linear correlation measure. \n\nThis has two effects. 
First, it means that you cannot predict what the neuron will \nlearn simply by studying fixed point equations. This tells you what the neuron \nmight learn, but the probability that a given solution will be learned can only be ascertained if \nthe dynamics are understood. This also breaks the relationship between non-linear \nPCA and the neurons, because, in principle, there could be other dynamics which \nhave the same fixed point structure, but do not have the same basins of attraction. \nSimple fixed point analysis would be incapable of predicting what these neurons \nwould learn. \n\n4 PARTITIONING \n\nAn important question which the fixed-point analysis, or the corresponding statistical \nmechanics, cannot address is: what is the likelihood of learning the different solutions? \nThis is the essential ambiguity of non-linear PCA - there are many solutions, \nand the size of the basin of attraction of each is determined by the dynamics, not \nby the local maxima of the nonlinear correlation measure. \n\nAs an example, we consider the partitioning of the neurons described in section 2. \nThese neurons act much like neurons in competitive networks; they become tuned to \nindividual patterns or highly correlated clusters. Given that the density of patterns \nin the input set is \\rho(i), what is the probability P(i) that a neuron will become \ntuned to pattern i? It is often said that the desired result should be P(i) \\propto \\rho(i), \nalthough for Kohonen 1-d feature maps it has been shown that P(i) \\propto \\rho(i)^{2/3} (see, \nfor example, Hertz, Krogh, and Palmer, 1991). \n\nWe have found that the partitioning cannot be calculated by finding the optima \nof the objective function. For example, in the case of weakly correlated patterns, \nthe global maximum is the most likely pattern, whereas all of the patterns are local \nmaxima. To determine the partitioning, the basin of attraction of each pattern \nmust be computed. 
This could be different for different dynamics with the same \nfixed point structure. \n\nIn order to determine the partitioning, the dynamics must be understood. The \ndetails will be described elsewhere (Prügel-Bennett and Shapiro, 1994). For the \ncase of weakly correlated patterns, a neuron will learn a pattern p for which \n\n\\rho(x^p) (V_0^p)^{b-1} > \\rho(x^q) (V_0^q)^{b-1} for all q \\neq p. \n\nHere V_0^p is the initial overlap (before learning) of the neuron's weights with the pth \npattern. This defines the basin of attraction for each pattern. \n\nIn the large P limit and for random patterns, \n\nP(i) \\propto \\rho(i)^\\alpha, \n\n(11) \n\nwhere \\alpha \\approx 2 \\log(P) / (b - 1), P is the number of patterns, and b is the parameter \nthat controls the non-linearity of the neuron's response. If b is chosen so that \\alpha = 1, \nthen the probability of a neuron learning a pattern will be proportional to the \nfrequency with which the pattern is presented. \n\n5 CONCLUSIONS \n\nThe relationship between a non-linear generalization of Oja's rule and a non-linear \ngeneralization of PCA was reviewed. Non-linear PCA is equivalent to maximizing an \nobjective function which is a statistical measure of the data set. The objective function \noptimized is determined by the form of the activation function of the neuron. \nViewing the neuron in this way is useful because, rather than solving the dynamics, \none can use methods of statistical mechanics or other methods to find the maxima \nof the objective function. Since this function has many local maxima, however, \nthese techniques cannot determine how the solutions are partitioned among the \nneurons. To determine this, the dynamics must be solved. \n\nAcknowledgements \n\nThis work was supported by SERC grant GRG20912. \n\nReferences \n\nJ. Hertz, A. Krogh, and R.G. Palmer. (1991) Introduction to the Theory of Neural \nComputation. Addison-Wesley. \n\nJ. Karhunen and J. Joutsensalo. 
(1992) Nonlinear Hebbian algorithms for sinusoidal \nfrequency estimation, in Artificial Neural Networks, 2, I. Aleksander and J. Taylor, \neditors, North-Holland. \n\nErkki Oja. (1982) A simplified neuron model as a principal component analyzer. \nJ. Math. Biol., 15:267-273. \n\nErkki Oja. (1989) Neural networks, principal components, and subspaces. Int. J. \nof Neural Systems, 1(1):61-68. \n\nE. Oja, H. Ogawa, and J. Wangviwattana. (1992) Principal component analysis \nby homogeneous neural networks: Part II: analysis and extension of the learning \nalgorithms. IEICE Trans. on Information and Systems, E75-D(3):376-382. \n\nE. Oja. (1993) Nonlinear PCA: algorithms and applications, in Proceedings of the \nWorld Congress on Neural Networks, Portland, OR, 1993. \n\nA. Prügel-Bennett and Jonathan L. Shapiro. (1993) Statistical mechanics of unsupervised \nHebbian learning. J. Phys. A, 26:2343. \n\nA. Prügel-Bennett and Jonathan L. Shapiro. (1994) The partitioning problem for \nunsupervised learning for non-linear neurons. J. Phys. A, to appear. \n\nT. D. Sanger. (1989) Optimal unsupervised learning in a single-layer linear \nfeedforward neural network. Neural Networks, 2:459-473. \n\nJonathan L. Shapiro and A. Prügel-Bennett. (1992) Unsupervised Hebbian learning \nand the shape of the neuron activation function, in Artificial Neural Networks, 2, \nI. Aleksander and J. Taylor, editors, North-Holland. \n\nW. Softky and D. Kammen. (1991) Correlations in high dimensional or asymmetric \ndata sets: Hebbian neuronal processing. Neural Networks, 4:337-347. \n\nJ. Taylor. (1993) Forms of memory, in Proceedings of the World Congress on Neural \nNetworks, Portland, OR, 1993. \n", "award": [], "sourceid": 862, "authors": [{"given_name": "Jonathan", "family_name": "Shapiro", "institution": null}, {"given_name": "Adam", "family_name": "Pr\u00fcgel-Bennett", "institution": null}]}