{"title": "Sigma-Pi Learning: On Radial Basis Functions and Cortical Associative Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 474, "page_last": 481, "abstract": null, "full_text": "474  Mel and Koch \n\nSigma-Pi Learning:\nOn Radial Basis Functions and Cortical Associative Learning\n\nBartlett W. Mel\nChristof Koch\nComputation and Neural Systems Program\nCaltech, 216-76\nPasadena, CA 91125\n\nABSTRACT\n\nThe goal in this work has been to identify the neuronal elements of the cortical column that are most likely to support the learning of nonlinear associative maps. We show that a particular style of network learning algorithm based on locally-tuned receptive fields maps naturally onto cortical hardware, and gives coherence to a variety of features of cortical anatomy, physiology, and biophysics whose relations to learning remain poorly understood.\n\n1  INTRODUCTION\n\nSynaptic modification is widely believed to be the brain's primary mechanism for long-term information storage. The enormous practical and theoretical importance of biological synaptic plasticity has stimulated interest among both experimental neuroscientists and neural network modelers, and has provided strong incentive for the development of computational models that can both explain and predict.\n\nWe present here a model for the synaptic basis of associative learning in cerebral cortex. The main hypothesis of this work is that the principal output neurons of a cortical association area learn functions of their inputs as locally-generalizing lookup tables. As abstractions, locally-generalizing learning methods have a long history in statistics and approximation theory (see Atkeson, 1989; Barron & Barron, 1988). Radial Basis Function (RBF) methods are essentially similar (see Broomhead & Lowe, 1988) and have recently been discussed by Poggio and Girosi (1989) in relation to regularization theory. As is standard for network learning problems, locally-generalizing methods involve the learning of a map f(x): x → y from example (x, y) pairs. Rather than operate directly on the input space, however, input vectors are first \"decoded\" by a population of \"receptive field\" units with centers ξ_i that each represent a local, often radially-symmetric, region in the input space. Thus, an output unit computes its activation level y = Σ_i w_i g(x − ξ_i), where g defines a \"radial basis function\", commonly a Gaussian, and w_i is its weight (Fig. 1). The learning problem can then be characterized as one of finding weights w that minimize the mean squared error over the N-element training set. Learning schemes of this type lend themselves directly to very simple Hebb-type rules for synaptic modification since the initially nonlinear learning problem is transformed into a linear one in the unknown parameters w (see Broomhead & Lowe, 1988).\n\nFigure 1: A Neural Lookup Table. A nonlinear function of several variables may be decomposed as a weighted sum over a set of localized \"receptive field\" units.\n\nLocally-generalizing learning algorithms as neurobiological models date at least to Albus (1971) and Marr (1969, 1970); they have also been explored more recently by a number of workers with a more purely computational bent (Broomhead & Lowe, 1988; Lapedes & Farber, 1988; Mel, 1988, 1989; Miller, 1988; Moody, 1989; Poggio & Girosi, 1989). 
\n\n2  SIGMA-PI LEARNING\n\nUnlike the classic thresholded linear unit that is the mainstay of many current connectionist models, the output of a sigma-pi unit is computed as a sum of contributions from a set of independent multiplicative clusters of input weights (adapted from Rumelhart & McClelland, 1986): y = σ(Σ_j w_j c_j), where c_j = Π_i v_i x_i is the product of weighted inputs to cluster j, w_j is the weight on cluster j as a whole, and σ is an optional thresholding nonlinearity applied to the sum of total cluster activity. During learning, the output may also be clamped by an unconditioned teacher input, i.e. such that y = t_i(x). Units of this general type were first proposed by Feldman & Ballard (1982), and have been used occasionally by other connectionist modelers, most commonly to allow certain inputs to gate others or to allow the activation of one unit to control the strength of interconnection between two other units (Rumelhart & McClelland, 1986). The use of sigma-pi units as function lookup tables was suggested by Feldman & Ballard (1982), who cited a possible relevance to local dendritic interactions among synaptic inputs (see also Durbin & Rumelhart, 1989).\n\nIn the present work, the specific nonlinear interaction among inputs to a sigma-pi cluster is not of primary theoretical importance. The crucial property of a cluster is that its output should be AND-like, i.e. selective for the simultaneous activity of all of its k input lines¹.\n\n2.1  NETWORK ARCHITECTURE\n\nWe assume an underlying d-dimensional input space X ⊆ R^d over which functions are to be learned. Vectors in X are represented by a population X of N units whose state is denoted by x ∈ R^N. 
Within X, each of the d dimensions of X is individually value-coded, i.e. consists of a set of units with gaussian receptive fields distributed in overlapping fashion along the range of allowable parameter values, for example, the angle of a joint, or the orientation of a visual stimulus at a specific retinal location. (A more biologically realistic case would allow for individual units in X to have multi-dimensional gaussian receptive fields, for example a 4-d visual receptive field encoding retinal x and y, edge orientation, and binocular disparity.) We assume a map t(x): x → y is to be learned, where the components of y ∈ R^M are represented by an output population Y of M units. According to the familiar single-layer feedforward network learning paradigm, X projects to Y via an \"associational\" pathway with modifiable synapses. We consider the task of a single output unit y_i (hereafter denoted by y), whose job is to estimate the underlying teacher function t_i(x): x → y from examples. Output unit y is assumed to have access to the entire input vector x, and a single unconditioned teacher input t_i. We further assume that all possible clusters c_j of size 1 through k = k_max pre-exist in y's dendritic field, with cluster weights w_j initially set to 0, and input weights v_i within each cluster set equal to 1.\n\n¹ A local threshold function can act as an AND in place of a multiplication, and for purposes of biological modeling, is a more likely dendritic mechanism than pure multiplication. In continuing work, we are exploring the more detailed interactions between Hebb-type learning rules and various post-synaptic nonlinearities, specifically the NMDA channel, that could underlie a multiplicative relation among nearby inputs.\n\n
Following from our assumption that each of the input lines x_i represents a 1-dimensional gaussian receptive field in X, a multiplicative cluster of k such inputs can yield a k-dimensional receptive field in X that may then be weighted. In this way, a sigma-pi unit can directly implement an RBF decomposition over X. Additionally, since a sigma-pi unit is essentially a massively parallel lookup table with clusters as stored table entries, it is significant that the sigma-pi function is inherently modular, such that groups of sigma-pi units that receive the same teacher signal can, by simply adding their outputs, act as a single much larger virtual sigma-pi unit with correspondingly increased table capacity². A neural architecture that allows system storage capacity to be multiplied by a factor of k by growing k neurons in the place of one is one that should be strongly preferred by biological evolution.\n\n2.2  THE LEARNING RULE\n\nThe cluster weights w_j are modified during training according to the following self-normalizing Hebb rule:\n\ndw_j/dt = α c_{jp} t_p − β w_j,\n\nwhere α and β are small positive constants, and c_{jp} and t_p are, respectively, the jth cluster response and teacher signal in state p. The steady state of this learning rule occurs when w_j = (α/β)⟨c_j t⟩, which tries to maximize the correlation³ of cluster output and teacher signal over the training set, while minimizing total synaptic weight for all clusters. The input weights v_i are unmodified during learning, representing the degree of cluster membership for each input line.\n\nWe briefly note that because this Hebb-type learning rule is truly local, i.e. 
depends only upon activity levels available directly at a synapse to be modified, it may be applied transparently to a group of neurons driven by the same global teacher input (see above discussion of sigma-pi modularity). Error-correcting rules that modify synapses based on a difference between desired vs. actual neural output do not share this property.\n\n² This assumes the global thresholding nonlinearity σ is weak, i.e. has an extended linear range.\n³ Strictly speaking, the average product.\n\n3  TOWARD A BIOLOGICAL MODEL\n\nIn the remainder of this paper we examine the hypothesis that sigma-pi units underlie associative learning in cerebral cortex. To do so, we identify the six essential elements of the sigma-pi learning scheme and discuss the evidence for each: i) a population of output neurons, ii) a focal teacher input, iii) a diffuse association input, iv) Hebb-type synaptic plasticity, v) local dendritic multiplication (or thresholding), and vi) a cluster reservoir.\n\nFigure 2: Elements of the cortical column in a generic association cortex. [Figure labels: association fibers; layers IV and V, VI; association inputs; specific afferent.]\n\nFollowing Eccles (1985), we concern ourselves here with the cytoarchitecture of \"generic\" association cortex, rather than with the more specialized (and more often studied) primary sensory and motor areas. We propose the cortical circuit of fig. 2 to contain all of the basic elements necessary for associative learning, closely paralleling the accounts of Marr (1970) and Eccles (1985) at this level of description. We limit our focus to the cortically-projecting \"output\" pyramids of layers II and III, which are posited to be sigma-pi units. 
These cells are a likely locus of associative learning as they are well situated to receive both teacher and associational input pathways. With reference to the modularity property of sigma-pi learning (sec. 2.1), we interpret the aggregates of layer II/III pyramidal cells whose apical dendrites rise toward the cortical surface in tight clumps (on the order of 100 cells, Peters, 1989), as a single virtual sigma-pi unit.\n\n3.1  THE TEACHER INPUT\n\nWe tentatively define the \"teacher\" input to an association area to be those inputs that terminate primarily in layer IV onto spiny stellate cells or small pyramidal cells. Lund et al. (1985) point out that spiny stellate cells are most numerous in primary sensory areas, but that the morphologically similar class of small pyramidal cells in layer IV seems to mimic the spiny stellates in their local, vertically oriented excitatory axonal distributions. The layer IV spiny stellates are known to project primarily up (but also down) a narrow vertical cylinder in which they sit, probably making powerful \"cartridge\" synapses onto overlying pyramidal cells. These excitatory interneurons are presumably capable of strongly depolarizing entire output cells (Szentagothai, 1977), thus providing the needed unit-wide teacher signals to the output neurons. We therefore assume this teacher pathway plays a role analogous to the presumed role of cerebellar climbing fibers (Albus, 1971; Marr, 1969). The inputs to layer IV can be of thalamic and/or cortical origin.\n\n3.2  THE ASSOCIATIONAL INPUT\n\nA second major form of extrinsic excitatory input with access to layer II/III pyramidal cells is the massive system of horizontal fibers in layer I. 
The primary source of these fibers is currently believed to be long range excitatory association fibers from both other cortical and subcortical areas (Jones, 1981). In accordance with Marr (1970) and Eccles (1985), we interpret this system of horizontal fibers, which virtually permeates the dendritic fields of the layer II/III pyramidal cells, as the primary conditioned input pathway at which cortical associative learning takes place. There is evidence that an individual layer I fiber can make excitatory synapses on apical dendrites of pyramidal cells across an area of cortex 5-6 mm in diameter (Szentagothai, 1977).\n\n3.3  HEBB RULES, MULTIPLICATION, AND CLUSTERING\n\nThe process of cluster formation in sigma-pi learning is driven by a local Hebb-type rule. Long term Hebb-type synaptic modification has been demonstrated in several cortical areas, dependent only upon local post-synaptic depolarization (Kelso et al., 1986), and thought to be mediated by the voltage-dependent NMDA channel (see Brown et al., 1988). In addition to the standard tendency for LTP with pre- and post-synaptic correlation, sigma-pi learning implicitly specifies cooperation among pre-synaptic units, in the sense that the largest increase in cluster weight w_j occurs when all inputs x_i to a cluster are simultaneously and strongly active. This type of cooperation among pre-synaptic inputs should follow directly from the assumption that local post-synaptic depolarization is the key ingredient in the induction of LTP. In other words, like-activated synaptic inputs must inevitably contribute to each other's enhancement during learning to the extent they are clustered on a post-synaptic dendrite. 
This type of cooperativity in learning gives key importance to dendritic space in neural learning, and has not until very recently been modelled at a biophysical level (T. Brown, pers. comm.; J. Moody, pers. comm.).\n\nIn addition to its possible role in enhancing like-activated synaptic clusters, however, the NMDA channel may be hypothesized to simultaneously underlie the \"multiplicative\" interaction among neighboring inputs needed for ensuring cluster-selectivity in sigma-pi learning. Thus, if sufficiently endowed with NMDA channels, cortical pyramidal cells could respond highly selectively to associative input \"vectors\" whose active afferents are spatially clumped, rather than scattered uniformly, across the dendritic arbor. The possibility that dendritic computations could include local multiplicative nonlinearities is widely accepted (e.g. Shepherd et al., 1985; Koch et al., 1983).\n\n3.4  A VIRTUAL CLUSTER RESERVOIR\n\nThe abstract definition of sigma-pi learning specifies that all possible clusters c_j of size 1 ≤ k ≤ k_max pre-exist on the \"dendrites\" of each virtual sigma-pi unit (which we have previously proposed to consist of a vertically aggregated clump of 100 pyramidal cells that receive the same teacher input from layer IV). During learning, the weight on each cluster is governed by a simple Hebb rule. Since the number of possible clusters of size k overwhelms total available dendritic space for even small k⁴, it must be possible to create a cluster when it is needed. We propose that the complex 3-d mesh of axonal and dendritic arborizations in layer I is ideal for maximizing the probability that arbitrary (small) subsets of association axons cross near to each other in space at some point in their collective arborizations. 
Thus, we propose that the tangle of axons within a dendrite's receptive field gives rise to an enormous set of almost-clusters, poised to \"latch\" onto a post-synaptic dendrite when called for by a Hebb-type learning rule. This geometry of pre- and post-synaptic interface is to be strongly contrasted with the architecture of cerebellum, where the afferent \"parallel\" fibers have no possibility of clustering on post-synaptic dendrites.\n\nKnown biophysical mechanisms for the sprouting and guidance of growth cones during development, in some cases driven by neural activity, seem well suited to the task of cluster formation over small distances in the adult brain.\n\n4  CONCLUSIONS\n\nThe locally-generalizing, table-based sigma-pi learning scheme is a parsimonious mechanism that can account for the learning of nonlinear associative maps in cerebral cortex. Only a single layer of excitatory synapses is modified, under the control of a Hebb-type learning rule. Numerous open questions remain, however, for example the degree to which clusters of active synapses scattered across a pyramidal dendritic tree can function independently, providing the necessary AND-like selectivity.\n\nAcknowledgements\n\nThanks are due to Ojvind Bernander, Rodney Douglas, Richard Durbin, Kamil Grajski, David Mackay, and John Moody for numerous helpful discussions. We acknowledge support from the Office of Naval Research, the James S. McDonnell Foundation, and the Del Webb Foundation.\n\nReferences\n\nAlbus, J.S. A theory of cerebellar function. Math. Biosci., 1971, 10, 25-61.\n\nAtkeson, C.G. Using associative content-addressable memories to control robots, MIT A.I. Memo 1124, September 1989.\n\nBarron, A.R. & Barron, R.L. Statistical learning networks: a unifying view. 
Presented at the 1988 Symposium on the Interface: Statistics and Computing Science, Reston, Virginia.\n\nBliss, T.V.P. & Lømo, T. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 1973, 232, 331-356.\n\n⁴ For example, assume a 3-d learning problem and clusters of size k = 3; with 100 afferents per input dimension, there are 100^3 = 10^6 possible clusters. If we assume 5,000 available association synapses per pyramidal cell, there is dendritic space for at most 166,000 clusters of size 3 (presumably counting the 100 cells of a virtual sigma-pi unit: 5,000 × 100 / 3 ≈ 166,000).\n\nBroomhead, D.S. & Lowe, D. Multivariable functional interpolation and adaptive networks. Complex Systems, 1988, 2, 321-355.\n\nBrown, T.H., Chapman, P.F., Kairiss, E.W., & Keenan, C.L. Long-term synaptic potentiation. Science, 1988, 242, 724-728.\n\nDurbin, R. & Rumelhart, D.E. Product units: a computationally powerful and biologically plausible extension to backpropagation networks. Complex Systems, 1989, 1, 133.\n\nEccles, J.C. The cerebral neocortex: a theory of its operation. In Cerebral Cortex, vol. 2, A. Peters & E.G. Jones, (Eds.), Plenum: New York, 1985.\n\nFeldman, J.A. & Ballard, D.H. Connectionist models and their properties. Cognitive Science, 1982, 6, 205-254.\n\nGiles, C.L. & Maxwell, T. Learning, invariance, and generalization in high-order neural networks. Applied Optics, 1987, 26(23), 4972-4978.\n\nHebb, D.O. The organization of behavior. New York: Wiley, 1949.\n\nJones, E.G. Anatomy of cerebral cortex: columnar input-output relations. In The organization of cerebral cortex, F.O. Schmitt, F.G. Worden, G. Adelman, & S.G. Dennis, (Eds.), MIT Press: Cambridge, MA, 1981.\n\nKelso, S.R., Ganong, A.H., & Brown, T.H. 
Hebbian synapses in hippocampus. PNAS USA, 1986, 83, 5326-5330.\n\nKoch, C., Poggio, T., & Torre, V. Nonlinear interactions in a dendritic tree: localization, timing, and role in information processing. PNAS, 1983, 80, 2799-2802.\n\nLapedes, A. & Farber, R. How neural nets work. In Neural Information Processing Systems, D.Z. Anderson, (Ed.), American Institute of Physics: New York, 1988.\n\nLund, J.S. Spiny stellate neurons. In Cerebral Cortex, vol. 1, A. Peters & E.G. Jones, (Eds.), Plenum: New York, 1985.\n\nMarr, D. A theory for cerebral neocortex. Proc. Roy. Soc. Lond. B, 1970, 176, 161-234.\n\nMarr, D. A theory of cerebellar cortex. J. Physiol., 1969, 202, 437-470.\n\nMel, B.W. MURPHY: A robot that learns by doing. In Neural Information Processing Systems, D.Z. Anderson, (Ed.), American Institute of Physics: New York, 1988.\n\nMel, B.W. MURPHY: A neurally inspired connectionist approach to learning and performance in vision-based robot motion planning. Ph.D. thesis, University of Illinois, 1989.\n\nMiller, W.T., Hewes, R.P., Glanz, F.H., & Kraft, L.G. Real time dynamic control of an industrial manipulator using a neural network based learning controller. Technical Report, Dept. of Electrical and Computer Engineering, University of New Hampshire, 1988.\n\nMoody, J. & Darken, C. Learning with localized receptive fields. In Proc. 1988 Connectionist Models Summer School, Morgan-Kaufmann, 1988.\n\nPeters, A. Plenary address, 1989 Soc. Neurosc. Meeting, Phoenix, AZ.\n\nPoggio, T. & Girosi, F. Learning, networks and approximation theory. Science, In press.\n\nRumelhart, D.E., Hinton, G.E., & McClelland, J.L. A general framework for parallel distributed processing. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, D.E. Rumelhart, J.L. 
McClelland, (Eds.), Cambridge, MA: Bradford, 1986.\n\nShepherd, G.M., Brayton, R.K., Miller, J.P., Segev, I., Rinzel, J., & Rall, W. Signal enhancement in distal cortical dendrites by means of interactions between active dendritic spines. PNAS, 1985, 82, 2192-2195.\n\nSzentagothai, J. The neuron network of the cerebral cortex: a functional interpretation. Proc. R. Soc. Lond. B, 1977, 201, 219-248.\n", "award": [], "sourceid": 282, "authors": [{"given_name": "Bartlett", "family_name": "Mel", "institution": null}, {"given_name": "Christof", "family_name": "Koch", "institution": null}]}