{"title": "How Receptive Field Parameters Affect Neural Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 757, "page_last": 763, "abstract": null, "full_text": "How Receptive Field Parameters Affect Neural Learning\n\nBartlett W. Mel\nCNS Program\nCaltech, 216-76\nPasadena, CA 91125\n\nStephen M. Omohundro\nICSI\n1947 Center St., Suite 600\nBerkeley, CA 94704\n\nAbstract\n\nWe identify the three principal factors affecting the performance of learning by networks with localized units: unit noise, sample density, and the structure of the target function. We then analyze the effect of unit receptive field parameters on these factors and use this analysis to propose a new learning algorithm which dynamically alters receptive field properties during learning.\n\n1 LEARNING WITH LOCALIZED RECEPTIVE FIELDS\n\nLocally-tuned representations are common in both biological and artificial neural networks. Several workers have analyzed the effect of receptive field size, shape, and overlap on representation accuracy: (Baldi, 1988), (Ballard, 1987), and (Hinton, 1986). This paper investigates the additional interactions introduced by the task of function learning. Previous studies which have considered learning have for the most part restricted attention to the use of the input probability distribution to determine receptive field layout (Kohonen, 1984) and (Moody and Darken, 1989). We will see that the structure of the function being learned may also be advantageously taken into account.\n\nFunction learning using radial basis functions (RBFs) is currently a popular technique (Broomhead and Lowe, 1988) and serves as an adequate framework for our discussion. 
Because we are interested in constraints on biological systems, we must explicitly consider the effects of unit noise. The goal is to choose the layout of receptive fields so as to minimize average performance error. Let y = f(x) be the function the network is attempting to learn from example (x, y) pairs. The network consists of N units whose locally-tuned receptive fields are distributed across the input space. The activity of the ith unit is the sum of a radial basis function $\\phi_i(x)$ and a mean-zero noise process $\\eta_i(x)$. A typical form for $\\phi_i$ is an n-dimensional Gaussian parametrized by its center $x_i$ and width $\\sigma_i$,\n\n$$\\phi_i(x) = e^{-\\frac{\\|x_i - x\\|^2}{2\\sigma_i^2}}. \\tag{1}$$\n\nThe function f(x) is approximated as a weighted sum of the output of N of these units:\n\n$$F(x) = \\sum_{i=1}^{N} w_i \\left[\\phi_i(x) + \\eta_i(x)\\right]. \\tag{2}$$\n\nThe weights $w_i$ are trained using the LMS (least mean square) rule, which attempts to minimize the mean squared distance between f and F over the set of training patterns p for the current layout of receptive fields. In the next section we address the additional considerations that arise when the receptive field centers and sizes are allowed to vary in addition to the weights.\n\n2 TWO KINDS OF ERROR\n\nTo understand the effect of receptive field properties on performance we must distinguish two basic sources of error. The first we call estimation error and is due to the intrinsic unit noise. The other we call approximation error and arises from the inability of the unit activity functions to represent the target function.\n\n2.1 ESTIMATION ERROR\n\nThe estimation error can be characterized by the variance of $F(x) \\mid x$. 
Because of the intrinsic unit noise, repeated stimulation of a network with the same input vector $x_0$ will generate a distribution of outputs $F(x_0)$. If the variance of this distribution is large, it can be a significant contribution to the MSE (fig. 1). Consideration of noisy units is most relevant to biological networks and analog hardware implementations of artificial units. Averaging is a powerful statistical technique for reducing the variance of a distribution. In the current context, averaging corresponds to receptive field overlap. In general, the more overlap the better the noise reduction in F(x) (though see section 2.2). The overlap of units at $x_0$ can be increased by either increasing the density of receptive field centers there, or broadening the receptive fields of units in the neighborhood.\n\n[Figure 1 panels: joint density over x and y, labeled a,c,e and b,c,f. Legend: (a) low input density, (b) high input density, (c) low estimation error, (d) high estimation error, (e) low approximation error, (f) high approximation error.]\n\nFigure 1: A. Estimation error arises from the variance of $F(x) \\mid x$. B. Approximation error is the deviation of the mean from the desired response, $(f(x) - \\langle F(x) \\rangle)^2$.\n\nFrom equation 2, F(x) may be rewritten\n\n$$F(x) = \\sum_{i=1}^{N} \\phi_i(x) w_i + e(x), \\tag{3}$$\n\nwhere the summation term is the noise-free LMS approximation to f(x), and the second term\n\n$$e(x) = \\sum_{i=1}^{N} \\eta_i(x) w_i, \\tag{4}$$\n\nis the estimation error. Since e(x) has mean zero for all x, its variance is\n\n$$\\mathrm{Var}[e(x)] = E[e^2(x)] = E\\left[\\left(\\sum_{i=1}^{N} \\eta_i(x) w_i\\right)^2\\right]. \\tag{5}$$\n\nIf each unit has the same noise profile, and the noise is uncorrelated across units so that the cross terms vanish, this reduces to\n\n$$\\mathrm{Var}[e] = \\mathrm{Var}[\\eta] \\sum_{i=1}^{N} w_i^2. \\tag{6}$$\n\nThe dependence of estimation error e on the size of weights explains why increasing the density of receptive fields in the input space reduces noise in the output of the learning network. Though the number of units, and hence weights, that contribute to the output is increased in this manipulation, the estimation error is proportional to the sum of squared weights (6). The benefit achieved by making weights smaller outruns the cost of increasing their number. For example, each receptive field with weight $w_i$ may be replaced by two copies of itself with weight $w_i/2$ and leave the noise-free F(x) unchanged. The new sum of squared weights, $\\sum_{i=1}^{N} 2\\left(\\frac{w_i}{2}\\right)^2$, and hence the estimation error, is nevertheless reduced by a factor of two.\n\nA second strategy that may lead to a reduction in the size of weights involves broadening receptive fields (see section 2.2 for conditions). In general, broadening receptive fields increases the unweighted output of the network $\\sum_{i=1}^{N} \\phi_i(x)$, implying that the weights $w_i$ must be correspondingly reduced in order that $\\|F(x)\\|$ remain approximately constant.\n\nThese observations suggest that the effects of noise are best mitigated by allocating receptive field resources in regions of the input space where units are heavily weighted. It is interesting to note that under the assumption of additive noise, the functional form $\\phi$ of the receptive fields themselves has no direct effect on the estimation error in F(x). The response profiles may, however, indirectly affect estimation error via the weight vector, since LMS weights on receptive fields of different functional forms will generally be different. 
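The factor-of-two argument of section 2.1 can be checked numerically. The following is a minimal Python sketch and not part of the original paper: the function names, the example weights, and the Gaussian noise model are illustrative assumptions, and the noise is taken to be independent across units with a common variance, as equation (6) requires.

```python
import random


def estimation_variance(weights, noise_var):
    # Equation (6): Var[e] = Var[eta] * sum of squared weights,
    # assuming identical, independent noise on every unit.
    return noise_var * sum(w * w for w in weights)


def split_units(weights):
    # Replace each unit by two copies of itself, each with half the weight.
    return [w / 2.0 for w in weights for _ in range(2)]


def simulated_variance(weights, noise_std, trials=20000, seed=0):
    # Monte Carlo estimate of Var[e], drawing zero-mean Gaussian unit noise
    # (the Gaussian form is an assumption made for this sketch).
    rng = random.Random(seed)
    samples = [
        sum(w * rng.gauss(0.0, noise_std) for w in weights)
        for _ in range(trials)
    ]
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials


weights = [1.0, -2.0, 0.5]  # hypothetical trained weights
before = estimation_variance(weights, noise_var=0.04)
after = estimation_variance(split_units(weights), noise_var=0.04)
print(before, after)  # the second value is half the first
```

Doubling the number of units doubles the number of terms in the sum, but each squared weight shrinks by a factor of four, which is why the net variance halves. The simulated_variance helper gives an empirical cross-check of the closed form under the same assumptions.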
\n\n2.2 APPROXIMATION ERROR\n\nThe second fundamental type of error, which we call approximation error, persists even for noise-free input units, and is due to error in the \"fit\" of the approximating function F to the target function f (fig. 1). Two aspects of approximation error are distinguished in the following sections.\n\n2.2.1 MISMATCH OF FUNCTIONAL FORM\n\nFirst, there may be mismatch between the specific functional form of the basis functions and that of the target function. For example, errors naturally arise when linear RBFs are used to approximate nonlinear target functions, since curves cannot be perfectly fit with straight lines. However, these errors may be made vanishingly small by increasing the density of receptive fields. For example, if linear receptive fields are trained to best fit a curved region of f(x) with second derivative c, then the mean squared error, $\\int_{-d/2}^{d/2} \\left(\\frac{c}{2}x^2 - a\\right)^2 dx$, has a value $O(c^2 d^5)$. This type of error falls off as the 5th power of d, where d is the spacing of the receptive fields. In a similar result, Baldi and Heiligenberg (1988) show that approximations to both linear and quadratic functions improve exponentially fast with increasing density of Gaussian receptive fields.\n\n2.2.2 MISMATCH OF SPATIAL SCALE\n\nA more general source of error in fitting target functions occurs when receptive fields are either too broad or too widely spaced relative to the fine spatial structure of f. Both of these factors can act to locally limit the high frequency content of the approximation F, which may give rise to severe approximation errors.\n\nThe Nyquist (and Shannon) result on signal sampling says that the highest frequency which may be recovered from a sampled signal is half the sampling frequency. 
If the receptive field density is not high enough then this kind of result shows that high frequency fine structure in the function being approximated will be lost.\n\nWhen the unit receptive fields are excessively wide, they can also wash out the high frequency fine structure of the function. One can think of F as a \"blurred\" version of the weight vector, which in turn is a sampled version of f. The blurring is greater for wide receptive fields. The density and width should be chosen to match their frequency transfer characteristics and best approximate the function. For one-dimensional Gaussian receptive fields of width $\\sigma$, we choose the receptive field spacing d to be\n\n$$d = \\frac{\\pi}{2}\\sigma. \\tag{7}$$\n\nA density that satisfies this type of condition will be referred to in the next section as a \"frequency-matched\" density.\n\n3 A RECEPTIVE FIELD DESIGN STRATEGY\n\nIn this section we describe an adaptive learning strategy based on the results above. Figure 2 shows the results of an experimental implementation of this procedure. It is possible to empirically measure the magnitude of the two sources of error analyzed above. Since we wish to minimize the expected performance error for the network as a whole, we weight our measurements of each type of error at each x by the input probability p(x). Errors in high density regions count more. Small magnitude errors may be important in high probability regions while even large errors may be neglected in low probability regions. The learning algorithm adjusts the layout of receptive fields to respond to each form of error in turn. The steps involved follow.\n\n1. 
Uniformly distribute broad receptive fields at frequency-matched density throughout regions of the input space that contain data. (In our 1-d example, data, and hence receptive fields, are present across the entire domain.)\n\n2. Train the network weights to an LMS solution with fixed receptive fields. Using the trained network, accrue approximation errors across the input space.\n\n3. Where the approximation error exceeds a threshold $\\tau$ anywhere within a unit's receptive field, split the receptive field into two subfields that are as small as possible while still locally maintaining frequency-matched density. (This depends on receptive field profile.) Repeat steps 2 and 3 until the approximation error is under threshold across the entire input space. We now have a layout where receptive field width and density are locally matched to the spatial frequency content of the target function, and approximation error is small and uniform across the input space. Note that since errors accrue according to p(x), we have preferentially allocated resources (through splitting) in regions with both high error and high input probability.\n\n4. Using the current network, measure and accrue estimation errors across the input space.\n\n5. Where the estimation error exceeds $\\tau$ anywhere within a unit's receptive field, replace the receptive field by two of the same size, adding a small random perturbation to each center. Repeat from step 4 until estimation error is below threshold across the entire input space. We now have a layout where receptive field density is highest where the effects of noise were most severe, such that estimation error is now small and uniform across the input space. 
Once again, we have preferentially allocated resources in regions with both high error and high input probability.\n\nFigure 2 illustrates this process for a noisy, one-dimensional learning problem. Each frame shows the estimation error, the approximation error, the target function and network output, and the unit response functions. In the top frame 24 units with broad, noisy receptive fields have been LMS-trained to fit the target function. Estimation error is visible across the entire domain, though it is concentrated in the small region just to the right of center where the input probability is peaked. Approximation error is concentrated in the central region which contains high spatial frequencies, with minor secondary peaks in other regions, including the region of high input probability.\n\nFigure 2: Results of an adaptive strategy for choosing receptive field size and density. See text for details.\n\nIn the second frame, the receptive field width was uniformly decreased and density was uniformly increased to the point where MSE fell below $\\tau$; 384 units were required. In the third frame, the adaptive strategy presented above was used to allocate units and choose widths. Fewer than half as many units (173) were needed in this example to achieve the same MSE as in the second frame. In higher dimensions, and with sparser data, this kind of recursive splitting and doubling strategy should be even more important. 
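Steps 1 and 5 of the strategy in section 3 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names, the (center, width) tuple representation, the probe points at the center and one width to either side, and the caller-supplied error_at function (standing in for the measured, p(x)-weighted estimation error) are all assumptions. The spacing assumes that the frequency-matched condition of equation (7) reads d = (pi / 2) * sigma.

```python
import math
import random


def frequency_matched_layout(lo, hi, sigma):
    # Step 1: centers on [lo, hi] with spacing d = (pi / 2) * sigma.
    d = 0.5 * math.pi * sigma
    count = int((hi - lo) / d) + 1
    return [(lo + k * d, sigma) for k in range(count)]


def double_noisy_fields(fields, error_at, threshold, jitter, rng):
    # Step 5: where the error exceeds the threshold inside a field, replace
    # that field by two of the same width with slightly perturbed centers.
    result = []
    for center, sigma in fields:
        probes = (center - sigma, center, center + sigma)
        if max(error_at(x) for x in probes) > threshold:
            for _ in range(2):
                result.append((center + rng.uniform(-jitter, jitter), sigma))
        else:
            result.append((center, sigma))
    return result


# Hypothetical error profile: noise dominates only near the middle of [0, 1].
def toy_error(x):
    return 1.0 if 0.5 <= x <= 0.7 else 0.0


fields = frequency_matched_layout(0.0, 1.0, sigma=0.2)
denser = double_noisy_fields(fields, toy_error, threshold=0.5,
                             jitter=0.01, rng=random.Random(0))
print(len(fields), len(denser))
```

In the full procedure the estimation error would be measured again after each doubling pass (step 4), and the pass repeated until the error is below threshold everywhere. Only fields that overlap the high-error region are duplicated, so unit density rises where noise matters and stays unchanged elsewhere.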
\n\n4 CONCLUSIONS\n\nIn this paper we have shown how receptive field size, shape, density, and noise characteristics interact with the frequency content of target functions and input probability density to contribute to both estimation and approximation errors during supervised function learning. Based on these interrelationships, a simple, adaptive, error-driven strategy for laying out receptive fields was demonstrated that makes efficient use of unit resources in the attempt to minimize mean squared performance error.\n\nAn improved understanding of the role of receptive field structure in learning may in the future help in the interpretation of patterns of coarse-coding seen in many biological sensory and motor systems.\n\nReferences\n\nBaldi, P. & Heiligenberg, W. How sensory maps could enhance resolution through ordered arrangements of broadly tuned receptors. Biol. Cybern., 1988, 59, 313-318.\nBallard, D.H. Interpolation coding: a representation for numbers in neural models. Biol. Cybern., 1987, 57, 389-402.\nBroomhead, D.S. & Lowe, D. Multivariable functional interpolation and adaptive networks. Complex Systems, 1988, 2, 321-355.\nHinton, G.E. Distributed representations. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, D.E. Rumelhart, J.L. McClelland (Eds.), Bradford, Cambridge, 1986.\nKohonen, T. Self organization and associative memory. Springer-Verlag: Berlin, 1984.\nMacKay, D. Hyperacuity and coarse-coding. In preparation.\nMoody, J. & Darken, C. Fast learning in networks of locally-tuned processing units. Neural Computation, 1989, 1, 281-294.
", "award": [], "sourceid": 296, "authors": [{"given_name": "Bartlett", "family_name": "Mel", "institution": null}, {"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}