{"title": "Plasticity-Mediated Competitive Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 475, "page_last": 480, "abstract": null, "full_text": "Plasticity-Mediated Competitive Learning \n\nNicol N. Schraudolph \nnici@salk.edu \n\nTerrence J. Sejnowski \nterry@salk.edu \n\nComputational Neurobiology Laboratory \nThe Salk Institute for Biological Studies \nSan Diego, CA 92186-5800 \n\nand \n\nComputer Science & Engineering Department \nUniversity of California, San Diego \nLa Jolla, CA 92093-0114 \n\nAbstract \n\nDifferentiation between the nodes of a competitive learning network is conventionally achieved through competition on the basis of neural activity. Simple inhibitory mechanisms are limited to sparse representations, while decorrelation and factorization schemes that support distributed representations are computationally unattractive. By letting neural plasticity mediate the competitive interaction instead, we obtain diffuse, nonadaptive alternatives for fully distributed representations. We use this technique to simplify and improve our binary information gain optimization algorithm for feature extraction (Schraudolph and Sejnowski, 1993); the same approach could be used to improve other learning algorithms. \n\n1 INTRODUCTION \n\nUnsupervised neural networks frequently employ sets of nodes or subnetworks with identical architecture and objective function. Some form of competitive interaction is then needed for these nodes to differentiate and efficiently complement each other in their task. 
\n\n[Plot of f(y) and 4f'(y) against y from -4.00 to 4.00; vertical axis from 0.00 to 1.00.] \n\nFigure 1: Activity f and plasticity f' of a logistic node as a function of its net input y. Vertical lines indicate those values of y whose pre-images in input space are depicted in Figure 2. \n\nInhibition is the simplest competitive mechanism: the most active nodes suppress the ability of their peers to learn, either directly or by depressing their activity. Since inhibition can be implemented by diffuse, nonadaptive mechanisms, it is an attractive solution from both neurobiological and computational points of view. However, inhibition can only form either localized (unary) or sparse distributed representations, in which each output has only one state with significant information content. \n\nFor fully distributed representations, schemes to decorrelate (Barlow and Foldiak, 1989; Leen, 1991) and even factorize (Schmidhuber, 1992; Bell and Sejnowski, 1995) node activities do exist. Unfortunately these require specific, weighted lateral connections whose adaptation is computationally expensive and may interfere with feedforward learning. While they certainly have their place as competitive learning algorithms, the capability of biological neurons to implement them seems questionable. 
\n\nIn this paper, we suggest an alternative approach: we extend the advantages of simple inhibition to distributed representations by decoupling the competition from the activation vector. In particular, we use neural plasticity, the derivative of a logistic activation function, as a medium for competition. \n\nPlasticity is low for both high and low activation values but high for intermediate ones (Figure 1); distributed patterns of activity may therefore have localized plasticity. 
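The shape described above can be checked numerically. The following is a minimal Python sketch, not code from the paper; the helper names activity and plasticity are illustrative assumptions. It evaluates a logistic node at a few net inputs and shows that plasticity peaks at y = 0 and falls off symmetrically toward saturation:

```python
import math

def activity(y):
    # Logistic activation f(y) = 1 / (1 + exp(-y)).
    return 1.0 / (1.0 + math.exp(-y))

def plasticity(y):
    # Derivative of the logistic: f'(y) = f(y) * (1 - f(y)).
    z = activity(y)
    return z * (1.0 - z)

# Hypothetical sample points, chosen only to span both saturated regions.
for y in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f'y={y:+.1f}  f={activity(y):.3f}  plasticity={plasticity(y):.3f}')
```

Two nodes with net inputs of +4 and -4 thus have very different activities but the same, low plasticity.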
If competition is controlled by plasticity, then simple competitive mechanisms will constrain us to localized plasticity but allow representations with distributed activity. \n\nThe next section reintroduces the binary information gain optimization (BINGO) algorithm for a single node; we then discuss how plasticity-mediated competition improves upon the decorrelation mechanism used in our original extension to multiple nodes. Finally, we establish a close relationship between the plasticity and the entropy of a logistic node that provides an intuitive interpretation of plasticity-mediated competitive learning in this context. 
\n\n2 BINARY INFORMATION GAIN OPTIMIZATION \n\nIn (Schraudolph and Sejnowski, 1993), we proposed an unsupervised learning rule that uses logistic nodes to seek out binary features in its input. The output \n\n$z = f(y), \\quad \\text{where} \\quad f(y) = \\frac{1}{1 + e^{-y}} \\quad \\text{and} \\quad y = \\vec{w} \\cdot \\vec{x}$ (1) \n\nof each node is interpreted stochastically as the probability that a given feature is present. We then search for informative directions in weight space by maximizing the information gained about an unknown binary feature through observation of z. This binary information gain is given by \n\n$\\Delta H(z) = H(\\hat{z}) - H(z)$, (2) \n\nwhere H(z) is the entropy of a binary random variable with probability z, and $\\hat{z}$ is a prediction of z based on prior knowledge. Gradient ascent in this objective results in the learning rule \n\n$\\Delta \\vec{w} \\propto f'(y) \\cdot (y - \\hat{y}) \\cdot \\vec{x}$, (3) 
\n\nwhere $\\hat{y}$ is a prediction of y. In the simplest case, $\\hat{y}$ is an empirical average $\\langle y \\rangle$ of past activity, computed either over batches of input data or by means of an exponential trace; this amounts to a nonlinear version of the covariance rule (Sejnowski, 1977). \n\nUsing just the average as prediction introduces a strong preference for splitting the data into two equal-sized clusters. 
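The single-node rule of Equations 1 to 3 can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name bingo_step, the learning rate eta, the trace decay and the sample inputs are all hypothetical, and the prediction is the simplest one mentioned above, an exponential trace of past net input.

```python
import math

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

def bingo_step(w, x, y_avg, eta=0.1, decay=0.9):
    # Net input y = w . x (Equation 1).
    y = sum(wi * xi for wi, xi in zip(w, x))
    z = logistic(y)
    # Plasticity f'(y) = z * (1 - z).
    plast = z * (1.0 - z)
    # Learning rule (Equation 3): dw proportional to f'(y) * (y - y_hat) * x,
    # with the running average y_avg used as the prediction y_hat.
    new_w = [wi + eta * plast * (y - y_avg) * xi for wi, xi in zip(w, x)]
    # Update the exponential trace of past net input after using it.
    new_avg = decay * y_avg + (1.0 - decay) * y
    return new_w, new_avg

# Hypothetical starting weights and inputs, for illustration only.
w, y_avg = [0.5, -0.2], 0.0
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    w, y_avg = bingo_step(w, x, y_avg)
print(w, y_avg)
```

Inputs whose net input lies above the running average pull the weight vector toward them, and those below push it away, so the node tends to split the data around that average.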
While such a bias is appropriate in the initial phase of learning, it fails to take the nonlinear nature of f into account. In order to discount data in the saturated regions of the logistic function appropriately, we weight the average by the node's plasticity f'(y): \n\n$\\hat{y} = \\frac{\\langle y \\cdot f'(y) \\rangle}{\\langle f'(y) \\rangle + c}$, (4) \n\nwhere c is a very small positive constant introduced to ensure numerical stability for large values of y. Now the bias for splitting the data evenly is gradually relaxed as the network's weights grow and data begins to fall into saturated regions of f. 
\n\n3 PLASTICITY-MEDIATED COMPETITION \n\nFor multiple nodes the original BINGO algorithm used a decorrelating predictor as the competitive mechanism: \n\n$\\hat{\\vec{y}} = \\vec{y} + (Q_{\\vec{y}} - 2I)(\\vec{y} - \\langle \\vec{y} \\rangle)$, (5) \n\nwhere $Q_{\\vec{y}}$ is the autocorrelation matrix of $\\vec{y}$, and I the identity matrix. Note that $Q_{\\vec{y}}$ is computationally expensive to maintain; in connectionist implementations it is often approximated by lateral anti-Hebbian connections whose adaptation must occur on a faster time scale than that of the feedforward weights (Equation 3) for reasons of stability (Leen, 1991). In practice this means that learning is slowed significantly. 
\n\nFigure 2: The \"three cigars\" problem. Each plot shows the pre-image of zero net input, superimposed on a scatter plot of the data set, in input space. The two flanking lines delineate the \"plastic region\" where the logistic is not saturated, providing an indication of weight vector size. Left, two-node BINGO network using decorrelation (Equations 3 & 5) fails to separate the three data clusters. Right, same network using plasticity-mediated competition (Equations 4 & 6) succeeds. 
\n\nIn addition, decorrelation can be inappropriate when nonlinear objectives are optimized; in our case, two prominent binary features may well be correlated. Consider the \"three cigars\" problem illustrated in Figure 2: the decorrelating predictor (left) forces the two nodes into a near-orthogonal arrangement, interfering with their ability to detect the parallel gaps separating the data clusters. \n\nFor our purposes, decorrelation is thus too strong a constraint on the discriminants: all we require is that the discovered features be distinct. We achieve this by reverting to the simple predictor of Equation 4 while adding a global, plasticity-mediated excitation (footnote 1) factor to the weight update: \n\n$\\Delta \\vec{w}_i \\propto f'(y_i) \\cdot (y_i - \\hat{y}_i) \\cdot \\vec{x} \\cdot \\sum_j f'(y_j)$ (6) 
\n\nAs Figure 2 (right) illustrates, this arrangement solves the \"three cigars\" problem. In the high-dimensional environment of hand-written digit recognition, this algorithm discovers a set of distributed binary features that preserve most of the information needed to classify the digits, even though the network was never given any class labels (Figure 3). \n\nFootnote 1: The interaction is excitatory rather than inhibitory since a node's plasticity is inversely correlated with the magnitude of its net input. \n\nFigure 3: Weights found by a four-node network running the improved BINGO algorithm (Equations 4 & 6) on a set of 1200 handwritten digits due to Guyon et al. (1989). Although the network is unsupervised, its four-bit output conveys most of the information necessary to classify the digits. 
\n\n4 PLASTICITY AND BINARY ENTROPY \n\nIt is possible to establish a relationship between the plasticity f' of a logistic node and its entropy that provides an intuitive account of plasticity-mediated competition as applied to BINGO. Consider the binary entropy \n\n$H(z) = -z \\log z - (1 - z) \\log(1 - z)$ (7) \n\nA well-known quadratic approximation is \n\n$\\tilde{H}(z) = 8 e^{-1} z (1 - z) \\approx H(z)$ (8) \n\nNow observe that the plasticity of a logistic node \n\n$f'(y) = \\frac{\\partial}{\\partial y} \\frac{1}{1 + e^{-y}} = \\cdots = z (1 - z)$ (9) 
\n\nis in fact proportional to $\\tilde{H}(z)$; that is, a logistic node's plasticity is in effect a convenient quadratic approximation to its binary output entropy. The overall entropy in a layer of such nodes equals the sum of individual entropies less their redundancy: \n\n$H(\\vec{z}) = \\sum_j H(z_j) - R(\\vec{z})$ (10) \n\nThe plasticity-mediated excitation factor in Equation 6 \n\n$\\sum_j f'(y_j) \\propto \\sum_j \\tilde{H}(z_j)$ (11) \n\nis thus proportional to an approximate upper bound on the entropy of the layer, which in turn indicates how much more information remains to be gained by learning from a particular input. In the context of BINGO, plasticity-mediated competition thus scales weight changes according to a measure of the network's ignorance: the less it is able to identify a given input in terms of its set of binary features, the more it tries to learn to do so. 
\n\n5 CONCLUSION \n\nBy using the derivative of a logistic activation function as a medium for competitive interaction, we were able to obtain differentiated, fully distributed representations without resorting to computationally expensive decorrelation schemes. We have demonstrated this plasticity-mediated competition approach on the BINGO feature extraction algorithm, which is significantly improved by it. A close relationship between the plasticity of a logistic node and its binary output entropy provides an intuitive interpretation of this unusual form of competition. \n\nOur general approach of using a nonmonotonic function of activity, rather than activity itself, to control competitive interactions may prove valuable in other learning schemes, in particular those that seek distributed rather than local representations. 
\n\nAcknowledgements \n\nWe thank Rich Zemel and Paul Viola for stimulating discussions, and the McDonnell-Pew Center for Cognitive Neuroscience in San Diego for financial support. \n\nReferences \n\nBarlow, H. B. and Foldiak, P. (1989). Adaptation and decorrelation in the cortex. In Durbin, R. M., Miall, C., and Mitchison, G. J., editors, The Computing Neuron, chapter 4, pages 54-72. Addison-Wesley, Wokingham. \n\nBell, A. J. and Sejnowski, T. J. (1995). A non-linear information maximisation algorithm that performs blind separation. In Advances in Neural Information Processing Systems, volume 7, Denver 1994. \n\nGuyon, I., Poujaud, I., Personnaz, L., Dreyfus, G., Denker, J., and Le Cun, Y. (1989). Comparing different neural network architectures for classifying handwritten digits. In Proceedings of the International Joint Conference on Neural Networks, volume II, pages 127-132. IEEE. 
\n\nLeen, T. K. (1991). Dynamics of learning in linear feature-discovery networks. Network, 2:85-105. \n\nSchmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879. \n\nSchraudolph, N. N. and Sejnowski, T. J. (1993). Unsupervised discrimination of clustered data via optimization of binary information gain. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems, volume 5, pages 499-506, Denver 1992. Morgan Kaufmann, San Mateo. \n\nSejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4:303-321. \n", "award": [], "sourceid": 1003, "authors": [{"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}