{"title": "Learning Theory and Experiments with Competitive Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 846, "page_last": 852, "abstract": null, "full_text": "Learning Theory and Experiments with Competitive Networks \n\nGriff L. Bilbro \nNorth Carolina State University \nBox 7914 \nRaleigh, NC 27695-7914 \n\nDavid E. Van den Bout \nNorth Carolina State University \nBox 7914 \nRaleigh, NC 27695-7914 \n\nAbstract \n\nWe apply the theory of Tishby, Levin, and Solla (TLS) to two problems. \nFirst we analyze an elementary problem for which we find the predictions \nconsistent with conventional statistical results. Second we numerically \nexamine the more realistic problem of training a competitive net to learn \na probability density from samples. We find TLS useful for predicting \naverage training behavior. \n\n1 TLS APPLIED TO LEARNING DENSITIES \n\nRecently a theory of learning has been constructed which describes the learning \nof a relation from examples (Tishby, Levin, and Solla, 1989), (Schwartz, Samalan, \nSolla, and Denker, 1990). The original derivation relies on a statistical mechanics \ntreatment of the probability of independent events in a system with a specified \naverage value of an additive error function. \n\nThe resulting theory is not restricted to learning relations and it is not essentially \nstatistical mechanical. The TLS theory can be derived from the principle of maximum \nentropy, a general inference tool which produces probabilities characterized \nby certain values of the averages of specified functions (Jaynes, 1979). A TLS theory \ncan be constructed whenever the specified function is additive and associated with \nindependent examples. In this paper we treat the problem of learning a probability \ndensity from samples. 
\nConsider the model as some function $p(z|w)$ of fixed form and adjustable parameters \n$w$ which are to be chosen to approximate $\\bar{p}(z)$ where the overline denotes the true \ndensity. All we know about $\\bar{p}$ are the elements of a training set $T$ which are drawn \nfrom it. Define an error $\\epsilon(z|w)$. By the principle of maximum entropy \n\n$$p(z|w) = \\frac{1}{z(\\beta)} e^{-\\beta \\epsilon(z|w)}$$ (1) \n\ncan be interpreted as the unique density which contains no other information except \na specified value of the average error \n\n$$\\langle \\epsilon \\rangle = \\int dz \\, p(z|w) \\epsilon(z|w).$$ (2) \n\nIn Equation 1, $z$ is a normalization that is assumed to be independent of the value \nof $w$; the parameter $\\beta$ is called the sensitivity and is adjusted so that the average \nerror is equal to some $\\epsilon_T$, the specified target error on the training set. We will use \nthe convention that an integral operates on the entire expression that follows it. \nThe usual Bayes rule produces a density in $w$ from $p(z|w)$ and from a prior density \n$p^{(0)}(w)$ which reflects at best a genuine prior probability or at least a restriction \nto the acceptable portion of the search space. Posterior to training on $m$ certain \nexamples, \n\n(3) \n\nwhere $Z_m$ is a normalization that depends on the particular set of examples as well \nas their number. In order to remove the effect of any particular set of examples, we \ncan average this posterior density over all possible $m$ examples \n\n(4) \n\nThis average posterior density models the expected density of nets or $w$ after training. 
This distribution in $w$ implies the following expected posterior density for a \nnew example $z_{m+1}$ \n\n(5) \n\nTLS compare this probability in $z_{m+1}$ with the true target probability to obtain \nthe Average Prediction Probability or APP after training \n\n(6) \n\nthe average over both the training set $z^{(m)}$ and an independent test example $z_{m+1}$. \n\nThe averages of Equations 4 and 6 are inconvenient to evaluate exactly because of \nthe $Z_m$ term in Equation 3. TLS propose an \"annealed approximation\" to APP in \nwhich the average of the ratio of Equation 4 is replaced by the ratio of the averages. \nEquation 6 becomes \n\n$$p^{(m)} = \\frac{\\int dw \\, p^{(0)}(w) g^{m+1}(w)}{\\int dw \\, p^{(0)}(w) g^{m}(w)}$$ (7) \n\nwhere \n\n$$g(w) = \\int dz \\, \\bar{p}(z) p(z|w).$$ (8) \n\nEquation 7 is well suited for theoretical analysis and is also convenient for numerical \npredictions. To apply Equation 7 numerically, we will produce Monte Carlo \nestimates for the moments of $g$ that involve sampling $p^{(0)}(w)$. If the dimension of $w$ \nis larger than 50, it is preferable to histogram $g$ rather than evaluate the moments \ndirectly. \n\n1.1 ANALYSIS OF AN ELEMENTARY EXAMPLE \n\nIn this section we theoretically analyze a learning problem with the TLS theory. \nWe will study the adjustment of the mean of a Gaussian density to represent a \nfinite number of samples. The utility of this elementary example is that it admits \nan analytic solution for the APP of the previous section. All the relevant integrals \ncan be computed with the identity \n\n$$\\int_{-\\infty}^{\\infty} dz \\, \\exp\\left(-a_1 (z - b_1)^2 - a_2 (z - b_2)^2\\right) = \\sqrt{\\frac{\\pi}{a_1 + a_2}} \\exp\\left(-\\frac{a_1 a_2}{a_1 + a_2} (b_1 - b_2)^2\\right).$$ (9) \n\nWe take the true density to be a Gaussian of mean $\\bar{w}$ and variance $1/2\\alpha$ \n\n$$\\bar{p}(z) = \\sqrt{\\frac{\\alpha}{\\pi}} e^{-\\alpha (z - \\bar{w})^2}.$$ (10) \n\nWe model the prior density as a Gaussian with mean $w_0$ and variance $1/2\\gamma$ \n\n$$p^{(0)}(w) = \\sqrt{\\frac{\\gamma}{\\pi}} e^{-\\gamma (w - w_0)^2}.$$ (11) 
\n\nWe choose the simplest error function \n\n$$\\epsilon(z|w) = (z - w)^2,$$ (12) \n\nthe squared error between a sample $z$ and the Gaussian \"model\" defined by its \nmean $w$, which is to become our estimate of $\\bar{w}$. In Equation 1, this error function \nleads to \n\n$$p(z|w) = \\sqrt{\\frac{\\beta}{\\pi}} e^{-\\beta (z - w)^2}$$ (13) \n\nwith $z(\\beta) = \\sqrt{\\pi/\\beta}$, which is independent of $w$ as assumed. We determine $\\beta$ by solving \nfor the error on the training set to get $\\beta = 1/(2\\epsilon_T)$. \nThe generalization, Equation 8, can now be evaluated with Equation 9 \n\n$$g(w) = \\sqrt{\\frac{\\kappa}{\\pi}} e^{-\\kappa (w - \\bar{w})^2},$$ (14) \n\nwhere \n\n$$\\kappa = \\frac{\\alpha \\beta}{\\alpha + \\beta}$$ (15) \n\nis less than either $\\alpha$ or $\\beta$. The denominator of Equation 7 becomes \n\n$$\\left(\\frac{\\kappa}{\\pi}\\right)^{m/2} \\sqrt{\\frac{\\gamma}{m\\kappa + \\gamma}} \\exp\\left(-\\frac{m \\kappa \\gamma}{m\\kappa + \\gamma} (\\bar{w} - w_0)^2\\right)$$ (16) \n\nwith a similar expression for the numerator. \nThe case of many examples or little prior knowledge is interesting. Consider Equations \n7 and 16 in the limit $m\\kappa \\gg \\gamma$ \n\n$$p^{(m)} = \\sqrt{\\frac{\\kappa}{\\pi}} \\sqrt{\\frac{m}{m+1}},$$ (17) \n\nwhich climbs to an asymptotic value of $\\sqrt{\\kappa/\\pi}$ for $m \\to \\infty$. In order to compare this \nwith intuition, consider that the sample mean of $\\{z_1, z_2, \\ldots, z_m\\}$ approaches $\\bar{w}$ to \nwithin a variance of $1/2m\\alpha$, so that \n\n$$\\langle p^{(m)}(w) \\rangle_z \\approx \\sqrt{\\frac{m\\alpha}{\\pi}} e^{-m\\alpha (w - \\bar{w})^2}$$ (18) \n\nwhich makes Equation 6 agree with Equation 17 for large enough $\\beta$. In this sense, \nthe statistical mechanical theory of learning differs from conventional Bayesian estimation \nonly in its choice of an unconventional performance criterion APP. \n\n2 GENERAL NUMERICAL PROCEDURE \n\nIn this section we apply the theory to the more realistic problem of learning a \ncontinuous probability density from a finite sample set. We can estimate the moments \nof Equation 7 by the following Monte Carlo procedure. 
Given a training set \n$T = \\{z_t\\}$ drawn from the unknown density $\\bar{p}$ on domain $X$ with finite volume $V$, \nan error function $\\epsilon(z|w)$, a training error $\\epsilon_T$, and a prior density $p^{(0)}(w)$ of vectors \nsuch that each $w$ specifies a candidate function, \n1. Construct two sample sets: a prior set of $P$ functions $\\mathcal{P} = \\{w_p\\}$ drawn from \n$p^{(0)}(w)$ and a set of $U$ input vectors $\\mathcal{U} = \\{z_u\\}$ drawn uniformly from $X$. For \neach $p$ in the prior set, tabulate the error $\\epsilon_{up} = \\epsilon(z_u|w_p)$ for every point in $\\mathcal{U}$ \nand the error $\\epsilon_{tp} = \\epsilon(z_t|w_p)$ for every point in $T$. \n\n2. Determine the sensitivity $\\beta$ by solving the equation $\\langle \\epsilon \\rangle = \\epsilon_T$ where \n\n$$\\langle \\epsilon \\rangle = \\frac{\\sum_u e^{-\\beta \\epsilon_{up}} \\epsilon_{up}}{\\sum_u e^{-\\beta \\epsilon_{up}}}.$$ (19) \n\n3. Estimate the average generalization of a given $w_p$ from Equation 8 \n\n(20) \n\n4. The performance after $m$ examples is the ratio of Equation 7. By construction \n$\\mathcal{P}$ is drawn from $p^{(0)}$ so that \n\n(21) \n\nFigure 1: Predicted APP versus number of training samples for a 20-neuron competitive \nnetwork trained to various target errors where the neuron weights were \ninitialized from (a) a uniform density, (b) an antisymmetrically skewed density. 
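As a hedged illustration of how the ratio in Equation 7 can be estimated by sampling the prior, the sketch below treats the Gaussian example of Section 1.1 rather than a competitive net. It assumes the closed-form g(w) of Equation 14 in place of the tabulated errors of steps 1 to 3, and it assumes that the Monte Carlo form of step 4 is a ratio of sums of powers of g over the prior samples. The function names, the parameter name gamma for the prior sharpness, the sample count, and the seed are illustrative choices and are not taken from the paper.

```python
import math
import random


def g(w, w_true, kappa):
    # Closed-form generalization of Equation 14 for the Gaussian example.
    return math.sqrt(kappa / math.pi) * math.exp(-kappa * (w - w_true) ** 2)


def app_monte_carlo(m, w_true, w0, alpha, beta, gamma, n_prior=20000, seed=0):
    # Assumed Monte Carlo stand-in for Equation 7: draw w from the prior
    # and take the ratio of the sample moments of g of order m + 1 and m.
    rng = random.Random(seed)
    kappa = alpha * beta / (alpha + beta)      # Equation 15
    sigma_prior = math.sqrt(1.0 / (2.0 * gamma))  # prior variance is 1/(2*gamma)
    num = 0.0
    den = 0.0
    for _ in range(n_prior):
        gw = g(rng.gauss(w0, sigma_prior), w_true, kappa)
        num += gw ** (m + 1)
        den += gw ** m
    return num / den


def app_exact(m, w_true, w0, alpha, beta, gamma):
    # Ratio of the closed-form moment of Equation 16 at m + 1 and at m.
    # Plain powers are used, so very large m would underflow.
    kappa = alpha * beta / (alpha + beta)

    def moment(n):
        scale = (kappa / math.pi) ** (n / 2.0)
        scale *= math.sqrt(gamma / (n * kappa + gamma))
        shift = n * kappa * gamma / (n * kappa + gamma)
        return scale * math.exp(-shift * (w_true - w0) ** 2)

    return moment(m + 1) / moment(m)


if __name__ == '__main__':
    args = (5, 0.3, 0.0, 1.0, 1.0, 0.5)
    print('Monte Carlo APP:', app_monte_carlo(*args))
    print('Closed-form APP:', app_exact(*args))
```

Under these assumptions the two functions should agree to within sampling error, and when m times kappa is much larger than gamma the closed-form ratio approaches the value given by Equation 17. For a competitive net, g would instead have to be estimated from the tabulated errors, which this sketch does not attempt.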
\n\n3 COMPETITIVE LEARNING NETS \n\nWe consider competitive learning nets (CLNs) because they are familiar and useful \nto us (Van den Bout and Miller, 1990), because there exist two widely known \ntraining strategies for CLNs (the neurons can learn either independently or under a \nglobal interaction called conscience (DeSieno, 1988)), and because CLNs can be applied \nto one-dimensional problems without being too trivial. Competitive learning \nnets with conscience qualitatively change their behavior when they are trained on \nfinite sample sets containing fewer examples than neurons; except for that regime \nwe found the theory satisfactory. All experiments in this section were conducted \nupon the following one-dimensional training density \n\n15(z) = { ~!;z  o <z< I, \n\notherwise. \n\nFigure 1 shows the Average Prediction Probability (APP) for $k = 20$ versus $m$, \nfor several values of target error $\\epsilon_T$ and for two prior densities; first consider \npredictions from the uniform prior. For $\\epsilon_T = 0.01$, APP practically attains its \nasymptote of 1.5 by $m = 40$ examples. Assuming the APP to be dominated in \nthe limit by the largest $g$, we expect a CLN trained to an error of 0.01 on a set of \n40 examples to perform 1.5 times better than an untrained net on unseen samples \ndrawn from the same probability density. This leads to a predicted probable error \nof about \n\n$$\\epsilon_{\\mathrm{prob}} = \\frac{1}{2 k p^{(m)}}.$$ (22) \n\nFor $k = 20$, $\\epsilon_{\\mathrm{prob}} = 0.017$ for $\\epsilon_T = 0.01$ and $\\epsilon_{\\mathrm{prob}} = 0.021$ for $\\epsilon_T = 0.02$. \nWe performed 5,000 training trials of a 20-neuron CLN on randomly selected sets of \n40 samples from the training density. Each network was trained to a target error in \nthe range [0.005, 0.03] on its 40 samples, and the average error on the total density \nwas then calculated for the trained network. Figure 2 is a plot of 500 of these \ntrials along with the predicted errors for various target errors. The probable error \nis qualitatively correct and the scatter of actual experiments increases in width by \nabout the ratio of APPs for $m = 20$ and $m = 40$. For the case of $m = 20$ examples, \nthe same net can only be expected to exhibit probable errors of 0.019 and 0.023 for \ncorresponding training target errors, which is compared graphically in Figure 2 with \nthe experimentally determined errors for $m = 20$. \n\nFigure 2: Experimentally determined and predicted values of total error across \nthe training density after competitive learning was performed using a 20-neuron \nnetwork trained to various target errors (a) with 40 samples, (b) with 20 samples. \n\nThe APP curves saturate at a value of $m$ that is insensitive to the prior density \nfrom which the nets are drawn. The vertical scale does depend somewhat on the \nprior, however. Consider Figure 1, which also shows the APP curves for the same \n$k = 20$ net with the prior density antisymmetrically skewed away from the true \ndensity by the following function: \n\n(0)  { l   0 ~ W  < 1, \n(w)  =  OV1-W  otherwise. 
\np \n\nFor $m > 20$ the shapes of the curves are almost unchanged, even though the vertical \nscale is different: saturation occurs at about the same value of $m$. Even when \nthe prior greatly overrepresents poor nets, their effect on the prediction rapidly \ndiminishes with training set size. This is important because in actual training, the \neffect of the initial configuration is also quickly lost. For $m < 20$ the predictions \nare not valid in any case, since our simple error function does not reflect the actual \nprobability even approximately for $m < k$ in these nets. It is only for $m < 20$ that \nsignificant differences between the two families of curves occur. We have \nalso been able to draw the same conclusions from less structured prior densities \ngenerated by assigning positive normalized random numbers to intervals of the \ndomain. Moreover, we generally find that TLS predicts that about twice as many \nsamples as neurons are needed to train competitive nets of other sizes. \n\n4 CONCLUSION \n\nTLS can be applied to learning densities as well as relations. We considered the \neffects of varying the number of examples, the target training error, and the choice \nof prior density. In these experiments on learning a density, as well as others dealing \nwith learning a binary output (Bilbro and Snyder, 1990), a ternary output (Chow, \nBilbro, and Yee, 1990), and a continuous output (Bilbro and Klenin, 1990), we \nfind that if saturation occurs at some $m$ substantially less than the total number of available \nsamples, say $m < |T|/2$, then that $m$ is a good predictor of sufficient training set size. \nMoreover there is evidence from a reformulation of the learning theory based on the \ngrand canonical ensemble that supports this statistical approach (Klenin, 1990). \n\nReferences \n\nG. 
L. Bilbro and M. Klenin. (1990) Thermodynamic Models of Learning: Applications. \nUnpublished. \n\nG. L. Bilbro and W. E. Snyder. (1990) Learning theory, linear separability, and \nnoisy data. CCSP-TR-90/7, Center for Communications and Signal Processing, \nBox 7914, Raleigh, NC 27695-7914. \n\nM. Y. Chow, G. L. Bilbro and S. O. Yee. (1990) Application of Learning Theory to \nSingle-Phase Induction Motor Incipient Fault Detection Artificial Neural Networks. \nSubmitted to International Journal of Neural Systems. \n\nD. DeSieno. (1988) Adding a conscience to competitive learning. In IEEE International \nConference on Neural Networks, pages 1:117-1:124. \n\nE. T. Jaynes. (1979) Where Do We Stand on Maximum Entropy? In R. D. Levine \nand M. Tribus (Eds.), Maximum Entropy Formalism, M. I. T. Press, Cambridge, \npages 17-118. \n\nM. Klenin. (1990) Learning Models and Thermostatistics: A Description of Overtraining \nand Generalization Capacities. NETR-90/3, Center for Communications \nand Signal Processing, Neural Engineering Group, Box 7914, Raleigh, NC 27695-7914. \n\nD. B. Schwartz, V. K. Samalan, S. A. Solla & J. S. Denker. (1990) Exhaustive \nLearning. Neural Computation. \n\nN. Tishby, E. Levin, and S. A. Solla. (1989) Consistent inference of probabilities in \nlayered networks: Predictions and generalization. IJCNN, IEEE, New York, pages \nII:403-410. \n\nD. E. Van den Bout and T. K. Miller III. (1990) TInMANN: The integer markovian \nartificial neural network. Accepted for publication in the Journal of Parallel and \nDistributed Computing. \n\n", "award": [], "sourceid": 357, "authors": [{"given_name": "Griff", "family_name": "Bilbro", "institution": null}, {"given_name": "David", "family_name": "van den Bout", "institution": null}]}