{"title": "Hebb Learning of Features based on their Information Content", "book": "Advances in Neural Information Processing Systems", "page_first": 246, "page_last": 252, "abstract": null, "full_text": "Hebb Learning of Features \n\nbased on their Information Content \n\nFerdinand Peper \nCommunications Research Laboratory \n588-2, Iwaoka, Iwaoka-cho \nNishi-ku, Kobe 651-24, Japan \npeper@crl.go.jp \n\nHideki Noda \nKyushu Institute of Technology \nDept. Electr., Electro., and Comp. Eng. \n1-1 Sensui-cho, Tobata-ku \nKita-Kyushu 804, Japan \nnoda@kawa.comp.kyutech.ac.jp \n\nAbstract \n\nThis paper investigates the stationary points of a Hebb learning rule with a sigmoid nonlinearity in it. We show mathematically that when the input has a low information content, as measured by the input's variance, this learning rule suppresses learning, that is, forces the weight vector to converge to the zero vector. When the information content exceeds a certain value, the rule automatically begins to learn a feature in the input. Our analysis suggests that under certain conditions it is the first principal component that is learned. The weight vector length remains bounded, provided the variance of the input is finite. Simulations confirm the theoretical results derived. \n\n1 Introduction \n\nHebb learning, one of the main mechanisms of synaptic strengthening, is induced by cooccurrent activity of pre- and post-synaptic neurons. It is used in artificial neural networks like perceptrons, associative memories, and unsupervised learning neural networks. Unsupervised Hebb learning typically employs rules of the form \n\nμ ẇ(t) = x(t) y(t) - d(x(t), y(t), w(t)),     (1) \n\nwhere w is the vector of a neuron's synaptic weights, x is a stochastic input vector, y is the output expressed as a function of x^T w, and the vector function d is a forgetting term forcing the weights to decay when there is little input. The integration constant μ determines the learning speed and will be assumed 1 for convenience. \n\nThe dynamics of rule (1) determines which features are learned and, with it, the rule's stationary points and the boundedness of the weight vector. In some cases, weight vectors shrink to zero or grow without bound; either is biologically implausible. Suppression and unbounded growth of weights are related to the characteristics of the input x and to the choice of d. Understanding this relation is important to enable a system that employs Hebb learning to learn the right features and avoid implausible weight vectors. \n\nUnbounded or zero length of weight vectors is avoided in [5] by keeping the total synaptic strength Σ_i w_i constant. Other studies, like [7], conserve the sum-squared synaptic strength. Another way to keep the weight vector length bounded is to limit the range of each of the individual weights [4]. The effect of these constraints on the learning dynamics of a linear Hebb rule is studied in [6]. \n\nThis paper constrains the weight vector length by a nonlinearity in a Hebb rule. It uses a rule of the form (1) with y = S(x^T w - h) and d(x, y, w) = c w, the function S being a smooth sigmoid, h being a constant, and c being a positive constant (see [1] for a similar rule). We prove that the weight vector w assumes a bounded nonzero solution if the largest eigenvalue λ_1 of the input covariance matrix satisfies λ_1 > c/S'(-h). Furthermore, if λ_1 ≤ c/S'(-h) the weight vector converges to the vector 0. Since λ_1 equals the variance of the input's first principal component, that is, λ_1 is a measure of the amount of information in the input, learning is enabled by a high information content and suppressed by a low information content. 
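The suppression/learning dichotomy stated above can be illustrated with a small numerical sketch (not part of the paper; the sigmoid S(z) = tanh(az), the 2-D input covariance, the learning rate, and the iteration count are illustrative choices of ours — note that S'(-h) = a at the base z = -h = 0):

```python
import numpy as np

rng = np.random.default_rng(0)

def hebb_step(w, x, a, h=0.0, c=1.0, lr=0.01):
    """One Euler step of rule (1) with y = S(x^T w - h) and d = c*w,
    using the illustrative sigmoid S(z) = tanh(a*z) (steepness a at its base)."""
    y = np.tanh(a * (x @ w - h))        # output y = S(x^T w - h)
    return w + lr * (x * y - c * w)     # Euler step of dw/dt = x*y - c*w (mu = 1)

# 2-D zero-mean Gaussian input with eigenvalues lambda_1 = 4.0, lambda_2 = 0.5
X = rng.multivariate_normal(np.zeros(2), np.diag([4.0, 0.5]), size=20000)

final_norms = {}
for a in (0.2, 0.3):                    # thresholds c/S'(-h) = 5.0 and 10/3
    w = rng.uniform(-1.0, 1.0, size=2)  # random initial weights
    for x in X:
        w = hebb_step(w, x, a)
    final_norms[a] = np.linalg.norm(w)

print(final_norms)
```

With a = 0.2 the threshold c/S'(-h) = 5 exceeds λ_1 = 4 and the weight vector decays towards 0; with a = 0.3 the threshold is 10/3 < λ_1 and the weight vector length settles at a bounded nonzero value.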
\n\nThe next section describes the Hebb neuron and its input in more detail. After characterizing the stationary points of the Hebb learning rule in section 3, we analyze their stability in section 4. Simulations in section 5 confirm that convergence towards a nonzero bounded solution occurs only when the information content of the input is sufficiently high. We finish this paper with a discussion. \n\n2 The Hebb Neuron and its Input \n\nAssume that the n-dimensional input vectors x presented to the neuron are generated by a stationary white stochastic process with mean 0. The process's covariance matrix Σ = E[xx^T] has eigenvalues λ_1, ..., λ_n (in order of decreasing size) and corresponding eigenvectors u_1, ..., u_n. Furthermore, E[||x||^2] is finite. This implies that the eigenvalues are finite, because E[||x||^2] = E[tr[xx^T]] = tr[E[xx^T]] = Σ_{i=1}^n λ_i. It is assumed that the probability density function of x is continuous. Given an input x and a synaptic weight vector w, the neuron produces an output y = S(x^T w - h), where S : R → R is a function that satisfies the conditions: \n\nC1. S is smooth, i.e., S is continuous and differentiable and S' is continuous. \nC2. S is sublinear, i.e., lim_{z→∞} S(z)/z = lim_{z→-∞} S(z)/z = 0. \nC3. S is monotonically nondecreasing. \nC4. S' has one maximum, which is at the point z = -h. \n\nTypically, these conditions are satisfied by smooth sigmoidal functions. This includes sigmoids with infinite saturation values, like S(z) = sign(z)|z|^{1/2} (see [9]). The point at which a sigmoid achieves maximal steepness (condition C4) is called its base. Though the step function is discontinuous at its base, thus violating condition C1, the results in this paper apply to the step function too, because it is the limit of a sequence of continuous sigmoids, and the input density function is continuous and thus Lebesgue-integrable. 
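The finiteness argument above rests on the identity E[||x||^2] = tr[Σ] = Σ_{i=1}^n λ_i, which is easy to check numerically (a sketch, not from the paper; the eigenvalues are the ones used later in section 5, while the random orthonormal eigenvector basis and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Prescribed eigenvalues (the values used in the paper's simulations)
lams = np.array([4.00, 2.25, 1.00, 0.09, 0.04, 0.01])
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # random orthonormal eigenvectors
Sigma = Q @ np.diag(lams) @ Q.T                   # Sigma = Q diag(lams) Q^T

# Zero-mean i.i.d. samples x with E[x x^T] = Sigma
X = rng.multivariate_normal(np.zeros(6), Sigma, size=200000)

mean_sq_norm = np.mean(np.sum(X**2, axis=1))      # Monte Carlo estimate of E[||x||^2]
print(mean_sq_norm, lams.sum())                   # both close to tr(Sigma) = 7.39
```

The Monte Carlo mean of ||x||^2 matches the eigenvalue sum up to sampling error.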
The learning rule of the neuron is given by \n\nẇ = x y - c w,     (2) \n\nc being a positive constant. Use of a linear S(z) = a z in this rule gives unstable dynamics: if a > c/λ_1, then the length of the weight vector w grows out of bound, though ultimately w becomes collinear with u_1. It is proven in the next section that a sublinear S prevents unbounded growth of w. \n\n3 Stationary Points of the Learning Rule \n\nTo get insight into what stationary points the weight vector w ultimately converges to, we average the stochastic equation (2) over the input patterns and obtain \n\n⟨ẇ⟩ = E[x S(x^T ⟨w⟩ - h)] - c ⟨w⟩,     (3) \n\nwhere ⟨w⟩ is the averaged weight vector and the expectation is taken over x, as with all expectations in this paper. Since the solutions of (2) correspond with the solutions of (3) under conditions described in [2], the averaged ⟨w⟩ will be referred to as w. Learning in accordance with (2) can then be interpreted [1] as a gradient descent process on an averaged energy function J associated with (3): \n\nJ(w) = -E[T(x^T w - h)] + (c/2) ||w||^2,  with T(z) = ∫_0^z S(v) dv. \n\nTo characterize the solutions of (3) we use the following lemma. \n\nLemma 1. Given a unit-length vector u, the function f_u : R → R is defined by \n\nf_u(z) = (1/c) E[u^T x S(z x^T u - h)], \n\nand the constant λ_u by λ_u = E[u^T x x^T u]. The fixed points of f_u are as follows. \n\n1. If λ_u S'(-h) ≤ c then f_u has one fixed point, i.e., z = 0. \n2. If λ_u S'(-h) > c then f_u has three fixed points, i.e., z = 0, z = α_u^+, and z = α_u^-, where α_u^+ (α_u^-) is a positive (negative) value depending on u. \n\nProof: (Sketch; for a detailed proof see [11].) Function f_u is a smooth sigmoid, since conditions C1 to C4 carry over from S to f_u. The steepness of f_u in its base at z = 0 depends on vector u. If λ_u S'(-h) ≤ c, function f_u intersects the line h(z) = z only at the origin, giving z = 0 as the only fixed point. 
If λ_u S'(-h) > c, the steepness of f_u is so large as to yield two more intersections: z = α_u^+ and z = α_u^-. □ \n\nThus characterizing the fixed points of f_u, the lemma allows us to find the fixed points of a vector function g : R^n → R^n that is closely related to (3). Defining \n\ng(w) = (1/c) E[x S(x^T w - h)], \n\nwe find that a fixed point z = α_u of f_u corresponds to the fixed point w = α_u u of g. Then, since (3) can be written as ẇ = c g(w) - c w, its stationary points are the fixed points of g; that is, w = 0 is a stationary point, and for each u for which λ_u S'(-h) > c there exists one bounded stationary point associated with α_u^+ and one associated with α_u^-. Consequently, if λ_1 ≤ c/S'(-h) then the only fixed point of g is w = 0, because λ_1 ≥ λ_u for all u. \n\nWhat is the implication of this result? The relation λ_1 ≤ c/S'(-h) indicates a low information content of the input, because λ_1, equaling the variance of the input's first principal component, is a measure of the input's information content. A low information content thus results in a zero w, suppressing learning. Section 4 shows that a high information content results in a nonzero w. The turnover point between what is considered high and low information is adjusted by changing the steepness of the sigmoid in its base or by changing the constant c in the forgetting term. \n\nTo show the boundedness of w, we consider an arbitrary point P: w = β u sufficiently far away from the origin O (but at finite distance) and calculate the component of ẇ along the line OP as well as the components orthogonal to OP. Vector u has unit length, and β may be assumed positive since its sign can be absorbed by u. Then, the component along OP is given by the projection of ẇ on u: \n\nu^T ẇ |_{w = β u} = c [f_u(β) - β]. \n\nThis is negative for all β exceeding the fixed points of f_u because of the sigmoidal shape of f_u. 
So, for any point P in R^n lying far enough from O, the vector component of ẇ in P along the line OP is directed towards O and not away from it. This component decreases as we move away from O, because the value of [β - f_u(β)] increases as β increases (f_u is sublinear). Orthogonal to this is a component given by the projection of ẇ on a unit-length vector v that is orthogonal to u: \n\nv^T ẇ |_{w = β u} = c v^T g(β u). \n\nThis component increases as we move away from O; however, it changes at a slower pace than the component along OP, witness the quotient of both components: \n\nlim_{β→∞} (v^T ẇ)/(u^T ẇ) |_{w = β u} = lim_{β→∞} c v^T g(β u) / (-c [β - f_u(β)]) = lim_{β→∞} (v^T g(β u)/β) / (f_u(β)/β - 1) = 0. \n\nVector ẇ thus becomes increasingly dominated by the component along OP as β increases. So, the origin acts as an attractor if we are sufficiently far away from it, implying that w remains bounded during learning. \n\n4 Stability of the Stationary Points \n\nTo investigate the stability of the stationary points, we use the Hessian of the averaged energy function J described in the last section. The Hessian at point w equals H(w) = c I - E[x x^T S'(x^T w - h)]. A stationary point w = w̄ is stable iff H(w̄) is a positive definite matrix. The latter is satisfied if for every unit-length vector v, \n\nv^T E[x x^T S'(x^T w̄ - h)] v < c,     (4) \n\nthat is, if all eigenvalues of the matrix E[x x^T S'(x^T w̄ - h)] are less than c. First consider the stationary point w̄ = 0. The eigenvalues of E[x x^T S'(-h)] in decreasing order are λ_1 S'(-h), ..., λ_n S'(-h). The Hessian H(0) is thus positive definite iff λ_1 S'(-h) < c. In this case w̄ = 0 is stable. It is also stable in the case λ_1 = c/S'(-h), because then (4) holds for all v ≠ u_1, preventing growth of w in directions other than u_1. Moreover, w will not grow in the direction of u_1, because |f_{u_1}(β)| < |β| for all β ≠ 0. Combined with the results of the last section this implies: \n\nCorollary 1. 
If λ_1 ≤ c/S'(-h) then the averaged learning equation (3) will have as its only stationary point w = 0, and this point is stable. If λ_1 > c/S'(-h) the stationary point w = 0 is not stable, and there will be other stationary points. \n\nWe now investigate the other stationary points. Let w̄ = α_u u be such a point, u being a unit-length vector and α_u a nonzero constant. To check whether the Hessian H(α_u u) is positive definite, we apply the relation E[XY] = E[X] E[Y] + Cov[X, Y] to the expression E[u^T x x^T u S'(α_u x^T u - h)] and obtain after rewriting: \n\nE[u^T x x^T u S'(α_u x^T u - h)] = λ_u E[S'(α_u x^T u - h)] + Cov[u^T x x^T u, S'(α_u x^T u - h)]. \n\nThe sigmoidal shape of the function f_u implies that f_u is less steep than the line h(z) = z at the intersection at z = α_u, that is, f_u'(α_u) < 1. It then follows that E[u^T x x^T u S'(α_u x^T u - h)] = c f_u'(α_u) < c, giving: \n\nE[S'(α_u x^T u - h)] < (1/λ_u) {c - Cov[u^T x x^T u, S'(α_u x^T u - h)]}. \n\nThen, for a unit-length vector v with λ_v = E[v^T x x^T v], \n\nv^T E[x x^T S'(α_u x^T u - h)] v = λ_v E[S'(α_u x^T u - h)] + Cov[v^T x x^T v, S'(α_u x^T u - h)] < (λ_v/λ_u) c - (λ_v/λ_u) Cov[u^T x x^T u, S'(α_u x^T u - h)] + Cov[v^T x x^T v, S'(α_u x^T u - h)]. \n\nThe probability distribution of x being unspecified, it is hard to evaluate this upper bound. For certain distributions the upper bound is minimized when λ_u is maximized, that is, when u = u_1 and λ_u = λ_1, implying that the Hebb neuron is a nonlinear principal component analyzer. Distributions that are symmetric with respect to the eigenvectors of Σ are probably examples of such distributions, as suggested by [11, 12]. For other distributions vector w may assume a solution not collinear with u_1 or may periodically traverse (part of) the nonzero fixed-point set of g. \n\n5 Simulations \n\nWe carry out simulations to test whether learning behaves in accordance with Corollary 1. The following difference equation is used as the learning rule: \n\nΔw(t) = γ(t) [x(t) tanh(a x^T(t) w(t)) - c w(t)],     (5) \n\nwhere γ is the learning rate and a a constant. The use of a difference Δ 
in (5) rather than the differential in (2) is computationally easier, and gives identical results if γ decreases over training time in accordance with conditions described in [3]. We use γ(t) = 1/(0.01 t + 20). It satisfies these conditions and gives fast convergence without disrupting stability [10]. Its precise choice is not very critical here, though. \n\nThe neuron is trained on multivariate normally distributed random input samples of dimension 6 with mean 0 and a covariance matrix Σ that has the eigenvalues 4.00, 2.25, 1.00, 0.09, 0.04, and 0.01. The degree to which the weight vector and Σ's first eigenvector u_1 are collinear is measured by the match coefficient [10], defined by m = cos^2 ∠(u_1, w). In every experiment the neuron is trained for 10000 iterations by (5) with the value of parameter a set to 0.20, 0.25, and 0.30, respectively. This corresponds to the situations in which λ_1 < c/S'(-h), λ_1 = c/S'(-h), and λ_1 > c/S'(-h), respectively, since c = 1 and the steepness of the sigmoid S(z) = tanh(a z) in its base z = -h = 0 is S'(0) = a. We perform each experiment 2000 times, which allows us to obtain the match coefficients beyond iteration 100 within ±0.02 with a confidence coefficient of 95% (and a smaller confidence coefficient on the first 100 iterations). The random initialization of the weight vector, whose initial elements are uniformly distributed in the interval (-1, 1), is different in each experiment. \n\n[Plots of m and ||w|| versus iterations for a = 0.20, 0.25, and 0.30.] \n\nFigure 1: Match coefficients averaged over 2000 experiments for parameter values a = 0.20, 0.25, and 0.30. 
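The experiment described above can be reproduced qualitatively with the following sketch (a single run per value of a rather than the paper's 2000-run average; Σ is taken diagonal so that u_1 is the first standard basis vector, and the seed is an arbitrary choice; S(z) = tanh(az), c = 1, h = 0, and γ(t) = 1/(0.01t + 20) are as in the text):

```python
import numpy as np

rng = np.random.default_rng(2)
lams = np.array([4.00, 2.25, 1.00, 0.09, 0.04, 0.01])
Sigma = np.diag(lams)          # eigenvectors = standard basis, so u_1 = e_1
u1 = np.eye(6)[0]
c, h = 1.0, 0.0

def train(a, iters=10000):
    """Difference equation (5): w(t+1) = w(t) + gamma(t) [x S(x^T w - h) - c w],
    with S(z) = tanh(a z) and gamma(t) = 1/(0.01 t + 20)."""
    w = rng.uniform(-1.0, 1.0, size=6)                       # random initial weights
    X = rng.multivariate_normal(np.zeros(6), Sigma, size=iters)
    for t, x in enumerate(X):
        gamma = 1.0 / (0.01 * t + 20.0)
        w += gamma * (x * np.tanh(a * (x @ w - h)) - c * w)
    m = (u1 @ w) ** 2 / (w @ w)   # match coefficient m = cos^2(angle(u_1, w))
    return m, np.linalg.norm(w)

results = {a: train(a) for a in (0.20, 0.25, 0.30)}
for a, (m, length) in results.items():
    print(f"a={a:.2f}  match={m:.3f}  ||w||={length:.3f}")
```

A single run already shows the predicted dichotomy: for a = 0.20 the weight vector length collapses towards 0, while for a = 0.30 it settles at a bounded nonzero value with w nearly collinear with u_1.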
\n\nFigure 2: Lengths of the weight vector averaged over 2000 experiments. The curve types are similar to those in Fig. 1. \n\nFig. 1 shows that for all tested values of parameter a the weight vector gradually becomes collinear with u_1 over 10000 iterations. The length of the weight vector converges to 0 when a = 0.20 or a = 0.25 (see Fig. 2). In the case a = 0.30, corresponding to λ_1 > c/S'(-h), the length converges to a nonzero bounded value. In conclusion, convergence is as predicted by Corollary 1: the weight vector converges to 0 if the information content in the input is too low for climbing the slope of the sigmoid in its base, and otherwise the weight vector becomes nonzero. \n\n6 Discussion \n\nLearning by the Hebb rule discussed in this paper is enabled if the input's information content as measured by the variance is sufficiently high, and only then. The results, though valid for a single neuron, have implications for systems consisting of multiple neurons connected by inhibitory connections. A neuron in such a system would have as output y = S(x^T w - h - v^T y'), where the inhibitory signal v^T y' would consist of the vector y' of output signals of the other neurons, weighted by the vector v (see also [1]). Function f_u in Lemma 1 would, when extended to contain the signal v^T y', still pass through the origin because of the zero-meanness of the input, but would have a reduced steepness at the origin caused by the shift in S's argument away from the base. The reduced steepness would make an intersection of f_u with the line h(z) = z in a point other than the origin less likely. Consequently, an inhibitory signal would bias the neuron towards suppressing its weights. In a system of neurons this would reduce the emergence of neurons with correlated outputs, because of the mutual presence of their outputs in each other's inhibitory signals. 
The neurons, then, would extract different features, while suppressing information-poor features. \n\nIn conclusion, the Hebb learning rule in this paper combines well with inhibitory connections, and can potentially be used to build a system of nonredundant feature extractors, each of which is optimized to extract only information-rich features. Moreover, the suppression of weights with a low information content suggests a straightforward way [8] to adaptively control the number of neurons, thus minimizing the necessary neural resources. \n\nAcknowledgments \n\nWe thank Dr. Mahdad N. Shirazi at Communications Research Laboratory (CRL) for the helpful discussions, Prof. Dr. S.-I. Amari for his encouragement, and Dr. Hidefumi Sawai at CRL for providing financial support to present this paper at NIPS'96 from the Council for the Promotion of Advanced Information and Communications Technology. This work was financed by the Japan Ministry of Posts and Telecommunications as part of their Frontier Research Project in Telecommunications. \n\nReferences \n\n[1] S.-I. Amari, \"Mathematical Foundations of Neurocomputing,\" Proceedings of the IEEE, vol. 78, no. 9, pp. 1443-1463, 1990. \n[2] S. Geman, \"Some Averaging and Stability Results for Random Differential Equations,\" SIAM J. Appl. Math., vol. 36, no. 1, pp. 86-105, 1979. \n[3] H.J. Kushner and D.S. Clark, \"Stochastic Approximation Methods for Constrained and Unconstrained Systems,\" Applied Mathematical Sciences, vol. 26, New York: Springer-Verlag, 1978. \n[4] R. Linsker, \"Self-Organization in a Perceptual Network,\" Computer, vol. 21, pp. 105-117, 1988. \n[5] C. von der Malsburg, \"Self-Organization of Orientation Sensitive Cells in the Striate Cortex,\" Kybernetik, vol. 14, pp. 85-100, 1973. \n[6] K.D. Miller and D.J.C. 
MacKay, \"The Role of Constraints in Hebbian Learning,\" Neural Computation, vol. 6, pp. 100-126, 1994. \n[7] E. Oja, \"A simplified neuron model as a principal component analyzer,\" Journal of Mathematical Biology, vol. 15, pp. 267-273, 1982. \n[8] F. Peper and H. Noda, \"A Mechanism for the Development of Feature Detecting Neurons,\" Proc. Second New Zealand Int. Two-Stream Conf. on Artificial Neural Networks and Expert Systems, ANNES'95, Dunedin, New Zealand, pp. 59-62, 20-23 Nov. 1995. \n[9] F. Peper and H. Noda, \"A Class of Simple Nonlinear 1-unit PCA Neural Networks,\" 1995 IEEE Int. Conf. on Neural Networks, ICNN'95, Perth, Australia, pp. 285-289, 27 Nov.-1 Dec. 1995. \n[10] F. Peper and H. Noda, \"A Symmetric Linear Neural Network that Learns Principal Components and their Variances,\" IEEE Trans. on Neural Networks, vol. 7, pp. 1042-1047, 1996. \n[11] F. Peper and H. Noda, \"Stationary Points of a Hebb Learning Rule for a Nonlinear Neural Network,\" Proc. 1996 Int. Symp. Nonlinear Theory and Appl. (NOLTA'96), Kochi, Japan, pp. 241-244, 7-9 Oct. 1996. \n[12] F. Peper and M.N. Shirazi, \"On the Eigenstructure of Nonlinearized Covariance Matrices,\" Proc. 1996 Int. Symp. Nonlinear Theory and Appl. (NOLTA'96), Kochi, Japan, pp. 491-493, 7-9 Oct. 1996. \n", "award": [], "sourceid": 1195, "authors": [{"given_name": "Ferdinand", "family_name": "Peper", "institution": null}, {"given_name": "Hideki", "family_name": "Noda", "institution": null}]}