{"title": "Adaptive Soft Weight Tying using Gaussian Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 993, "page_last": 1000, "abstract": null, "full_text": "Adaptive Soft Weight Tying using Gaussian Mixtures\n\nSteven J. Nowlan\n\nComputational Neuroscience Laboratory\n\nThe Salk Institute, P.O. Box 5800\n\nSan Diego, CA 92186-5800\n\nGeoffrey E. Hinton\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nToronto, Canada M5S 1A4\n\nAbstract\n\nOne way of simplifying neural networks so they generalize better is to add an extra term to the error function that will penalize complexity. We propose a new penalty term in which the distribution of weight values is modelled as a mixture of multiple gaussians. Under this model, a set of weights is simple if the weights can be clustered into subsets so that the weights in each cluster have similar values. We allow the parameters of the mixture model to adapt at the same time as the network learns. Simulations demonstrate that this complexity term is more effective than previous complexity terms.\n\n1 Introduction\n\nA major problem in training artificial neural networks is to ensure that they will generalize well to cases that they have not been trained on. Some recent theoretical results (Baum and Haussler, 1989) have suggested that in order to guarantee good generalization the amount of information required to directly specify the output vectors of all the training cases must be considerably larger than the number of independent weights 
in the network. In many practical problems there is only a small amount of labelled data available for training and this creates problems for any approach that uses a large, homogeneous network with many independent weights. As a result, there has been much recent interest in techniques that can train large networks with relatively small amounts of labelled data and still provide good generalization performance.\n\nIn order to improve generalization, the number of free parameters in the network must be reduced. One of the oldest and simplest approaches to removing excess degrees of freedom from a network is to add an extra term to the error function that penalizes complexity:\n\n$$\\text{cost} = \\text{data-misfit} + \\lambda \\, \\text{complexity} \\qquad (1)$$\n\nDuring learning, the network is trying to find a locally optimal trade-off between the data-misfit (the usual error term) and the complexity of the net. The relative importance of these two terms can be estimated by finding the value of $\\lambda$ that optimizes generalization to a validation set. Probably the simplest approximation to complexity is the sum of the squares of the weights, $\\sum_i w_i^2$. Differentiating this complexity measure leads to simple weight decay (Plaut, Nowlan and Hinton, 1986) in which each weight decays towards zero at a rate that is proportional to its magnitude. This decay is countered by the gradient of the error term, so weights which are not critical to network performance, and hence always have small error gradients, decay away leaving only the weights necessary to solve the problem.\n\nThe use of a $\\sum_i w_i^2$ penalty term can also be interpreted from a Bayesian perspective. 
[1] The \"complexity\" of a set of weights, $\\lambda \\sum_i w_i^2$, may be described as its negative log probability density under a radially symmetric gaussian prior distribution on the weights. The distribution is centered at the origin and has variance $1/\\lambda$. For multilayer networks, it is hard to find a good theoretical justification for this prior, but Hinton (1987) justifies it empirically by showing that it greatly improves generalization on a very difficult task. More recently, MacKay (1991) has shown that even better generalization can be achieved by using different values of $\\lambda$ for the weights in different layers.\n\n2 A more complex measure of network complexity\n\nIf we wish to eliminate small weights without forcing large weights away from the values they need to model the data, we can use a prior which is a mixture of a narrow (n) and a broad (b) gaussian, both centered at zero.\n\n$$p(w) = \\pi_n \\frac{1}{\\sqrt{2\\pi} \\sigma_n} e^{-w^2 / 2\\sigma_n^2} + \\pi_b \\frac{1}{\\sqrt{2\\pi} \\sigma_b} e^{-w^2 / 2\\sigma_b^2} \\qquad (2)$$\n\nwhere $\\pi_n$ and $\\pi_b$ are the mixing proportions of the two gaussians and are therefore constrained to sum to 1.\n\nAssuming that the weight values were generated from a gaussian mixture, the conditional probability that a particular weight, $w_i$, was generated by a particular gaussian, $j$, is called the responsibility of that gaussian for the weight and is:\n\n$$r_j(w_i) = \\frac{\\pi_j p_j(w_i)}{\\sum_k \\pi_k p_k(w_i)} \\qquad (3)$$\n\nwhere $p_j(w_i)$ is the probability density of $w_i$ under gaussian $j$.\n\nWhen the mixing proportions of the two gaussians are comparable, the narrow gaussian gets most of the responsibility for a small weight. Adopting the Bayesian perspective, the cost of a weight under the narrow gaussian is proportional to $w^2 / 2\\sigma_n^2$. 
As long as $\\sigma_n$ is quite small there will be strong pressure to reduce the magnitude of small weights even further. Conversely, the broad gaussian takes most of the responsibility for large weight values, so there is much less pressure to reduce them. In the limiting case when the broad gaussian becomes a uniform distribution, there is almost no pressure to reduce very large weights because they are almost certainly generated by the uniform distribution. A complexity term very similar to this limiting case is used in the \"weight elimination\" technique of Weigend, Huberman and Rumelhart (1990) to improve generalization for a time series prediction task. [2]\n\n[1] R. Szeliski, personal communication, 1985.\n\n3 Adaptive Gaussian Mixtures and Soft Weight-Sharing\n\nA mixture of a narrow, zero-mean gaussian with a broad gaussian or a uniform allows us to favor networks with many near-zero weights, and this improves generalization on many tasks. But practical experience with hand-coded weight constraints has also shown that great improvements can be achieved by constraining particular subsets of the weights to share the same value (Lang, Waibel and Hinton, 1990; Le Cun, 1989). Mixtures of zero-mean gaussians and uniforms cannot implement this type of symmetry constraint. If, however, we use multiple gaussians and allow their means and variances to adapt as the network learns, we can implement a \"soft\" version of weight-sharing in which the learning algorithm decides for itself which weights should be tied together. (We may also allow the mixing proportions to adapt so that we are not assuming all sets of tied weights are the same size.) 
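
As a minimal sketch of the ideas above (not the authors' implementation), the responsibilities of equation 3 and the negative log density of a set of weights under a gaussian mixture can be computed with NumPy. The function names, the example weights, and the mixture parameters are illustrative assumptions; the example uses the narrow and broad zero-mean gaussians of equation 2, but the same functions accept any number of components with arbitrary means.

```python
import numpy as np

def gaussian_density(w, mean, std):
    # Density of each weight under each gaussian; shape (n_weights, n_components).
    w = np.asarray(w, dtype=float)[:, None]
    return np.exp(-0.5 * ((w - mean) / std) ** 2) / (np.sqrt(2.0 * np.pi) * std)

def responsibilities(w, mix, mean, std):
    # Equation 3: r_j(w_i) = pi_j p_j(w_i) / sum_k pi_k p_k(w_i).
    joint = mix * gaussian_density(w, mean, std)
    return joint / joint.sum(axis=1, keepdims=True)

def complexity(w, mix, mean, std):
    # Negative log density of the weights under the mixture prior.
    joint = mix * gaussian_density(w, mean, std)
    return -np.log(joint.sum(axis=1)).sum()

# Hypothetical example values: a narrow and a broad zero-mean gaussian.
mix = np.array([0.5, 0.5])    # mixing proportions, sum to 1
mean = np.array([0.0, 0.0])   # both components centered at zero
std = np.array([0.1, 2.0])    # narrow and broad standard deviations
weights = np.array([0.02, -0.05, 1.5, -2.3])

r = responsibilities(weights, mix, mean, std)
penalty = complexity(weights, mix, mean, std)
```

In this example the first column of `r` (the narrow gaussian) dominates for the two small weights and the second column (the broad gaussian) dominates for the two large ones, which matches the behaviour described in section 2.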
\n\nThe basic idea is that a gaussian which takes responsibility for a subset of the weights will squeeze those weights together since it can then have a lower variance and assign a higher probability density to each weight. If the gaussians all start with high variance, the initial division of weights into subsets will be very soft. As the variances shrink and the network learns, the decisions about how to group the weights into subsets are influenced by the task the network is learning to perform.\n\nTo make these intuitive ideas a bit more concrete, we may define a cost function of the general form given in (1):\n\n$$C = \\frac{K}{\\sigma_y^2} \\sum_c \\frac{1}{2} (y_c - d_c)^2 - \\sum_i \\log \\Big[ \\sum_j \\pi_j p_j(w_i) \\Big] \\qquad (4)$$\n\nwhere $y_c$ and $d_c$ are the actual and desired outputs for training case $c$, $\\sigma_y^2$ is the variance of the squared error and each $p_j(w_i)$ is a gaussian density with mean $\\mu_j$ and standard deviation $\\sigma_j$. We optimize this function by adjusting the $w_i$ and the mixture parameters $\\pi_j$, $\\mu_j$, and $\\sigma_j$, and $\\sigma_y$. [3]\n\nThe partial derivative of $C$ with respect to each weight is the sum of the usual squared error derivative and a term due to the complexity cost for the weight:\n\n$$\\frac{\\partial C}{\\partial w_i} = \\frac{K}{\\sigma_y^2} \\sum_c (y_c - d_c) \\frac{\\partial y_c}{\\partial w_i} - \\sum_j r_j(w_i) \\frac{\\mu_j - w_i}{\\sigma_j^2} \\qquad (5)$$\n\nThe derivative of the complexity cost term is simply a weighted sum of the differences between the weight value and the center of each of the gaussians. The weighting factors $r_j(w_i)$ are the responsibility measures defined in equation 3, and if over time a single gaussian claims most of the responsibility for a particular weight, the effect of the complexity cost term is simply to pull the weight towards the center of the responsible gaussian. The strength of this force is inversely proportional to the variance of the gaussian.\n\nIn the simulations described below, all of the parameters ($w_i$, $\\mu_j$, $\\sigma_j$, $\\pi_j$) are updated simultaneously using a conjugate gradient descent procedure. To prevent variances shrinking too fast or going negative we optimize $\\log \\sigma_j$ rather than $\\sigma_j$. To ensure that the mixing proportions sum to 1 and are positive, we optimize $x_j$ where $\\pi_j = \\exp(x_j) / \\sum_k \\exp(x_k)$. For further details see (Nowlan and Hinton, 1992).\n\n[2] See (Nowlan, 1991) for a precise description of the relationship between mixture models and the model used by (Weigend, Huberman and Rumelhart, 1990).\n\n[3] $1/\\sigma_y^2$ may be thought of as playing the same role as $\\lambda$ in equation 1 in determining a trade-off between the misfit and complexity costs. $K$ is a normalizing factor based on a gaussian error model.\n\nMethod                  Train % Correct   Test % Correct\nVanilla Back Prop.      100.0 \u00b1 0.0       67.3 \u00b1 5.7\nCross Valid.            98.8 \u00b1 1.1        83.5 \u00b1 5.1\nWeight Elimination      100.0 \u00b1 0.0       89.8 \u00b1 3.0\nSoft-share - 5 Comp.    100.0 \u00b1 0.0       95.6 \u00b1 2.7\nSoft-share - 10 Comp.   100.0 \u00b1 0.0       97.1 \u00b1 2.1\n\nTable 1: Summary of generalization performance of 5 different training techniques on the shift detection problem.\n\n4 Simulation Results\n\nWe compared the generalization performance of soft weight-tying to other techniques on two different problems. The first problem, a 20 input, one output shift detection network, was chosen because it was a binary problem for which solutions which generalize well exhibit a lot of repeated weight structure. 
The generalization performance of networks trained using the cost criterion given in equation 4 was compared to networks trained in three other ways: no cost term to penalize complexity; no explicit complexity cost term, but use of a validation set to terminate learning; and weight elimination (Weigend, Huberman and Rumelhart, 1990) [4]. The simulation results are summarized in Table 1.\n\nThe network had 20 input units, 10 hidden units, and a single output unit and contained 101 weights. The first 10 input units in this network were given a random binary pattern, and the second group of 10 input units were given the same pattern circularly shifted by 1 bit left or right. The desired output of the network was +1 for a left shift and -1 for a right shift. A data set of 2400 patterns was created by randomly generating a 10 bit string, and choosing with equal probability to shift the string left or right. The data set was divided into 100 training cases, 1000 validation cases, and 1300 test cases. The training set was deliberately chosen to be very small (< 5% of possible patterns) to explore the region in which complexity penalties should have the largest impact. Ten simulations were performed with each method, starting from ten different initial weight sets (i.e. each method used the same ten initial weight configurations).\n\n[4] With a fixed value of $\\lambda$ chosen by cross-validation.\n\nFigure 1: Final mixture probability density for a typical solution to the shift detection problem. Five of the components in the mixture can be seen as distinct bumps in the probability density. Of the remaining five components, two have been eliminated by having their mixing proportions go to zero and the other three are very broad and form the baseline offset of the density function.\n\nThe final weight distributions discovered by the soft weight-tying technique are shown in Figure 1. There is no significant component with mean 0. The classical assumption that the network contains a large number of inessential weights which can be eliminated to improve generalization is not appropriate for this problem and network architecture. This may explain why the weight elimination model used by Weigend et al (Weigend, Huberman and Rumelhart, 1990) performs relatively poorly in this situation.\n\nThe second task chosen to evaluate the effectiveness of our complexity penalty was the prediction of the yearly sunspot average from the averages of previous years. This task has been well studied as a time-series prediction benchmark in the statistics literature (Priestley, 1991b; Priestley, 1991a) and has also been investigated by (Weigend, Huberman and Rumelhart, 1990) using a complexity penalty similar to the one discussed in section 2.\n\nThe network architecture used was identical to the one used in the study by Weigend et al: The network had 12 input 
units which represented the yearly average from the preceding 12 years, 8 hidden units, and a single linear output unit which represented the prediction for the average number of sunspots in the current year. Yearly sunspot data from 1700 to 1920 was used to train the network to perform this one-step prediction task, and the evaluation of the network was based on data from 1921 to 1955. [5] The evaluation of prediction performance used the average relative variance (arv) measure discussed in (Weigend, Huberman and Rumelhart, 1990).\n\nMethod                 Test arv\nTAR                    0.097\nRBF                    0.092\nWRH                    0.086\nSoft-share - 3 Comp.   0.077 \u00b1 0.0029\nSoft-share - 8 Comp.   0.072 \u00b1 0.0022\n\nTable 2: Summary of average relative variance of 5 different models on the one-step sunspot prediction problem.\n\nSimulations were performed using the same conjugate gradient method used for the first problem. Complexity measures based on gaussian mixtures with 3 and 8 components were used and ten simulations were performed with each (using the same training data but different initial weight configurations). The results of these simulations are summarized in Table 2 along with the best result obtained by Weigend et al (Weigend, Huberman and Rumelhart, 1990) (WRH), the bilinear auto-regression model of Tong and Lim (Tong and Lim, 1980) (TAR) [6], and the multi-layer RBF network of He and Lapedes (He and Lapedes, 1991) (RBF). All figures represent the arv on the test set. For the mixture complexity models, this is the average over the ten simulations, plus or minus one standard deviation. 
\n\nSince the results for the models other than the mixture complexity trained networks are based on a single simulation it is difficult to assign statistical significance to the differences shown in Table 2. We may note, however, that the difference between the 3 and 8 component mixture complexity models is significant (p > 0.95) and the differences between the 8 component model and the other models are much larger.\n\nFigure 2 shows an 8 component mixture model of the final weight distribution. It is quite unlike the distribution in Figure 1 and is actually quite close to a mixture of two zero-mean gaussians, one broad and one narrow. This may explain why weight elimination works quite well for this task.\n\nWeigend et al point out that for time series prediction tasks such as the sunspot task a much more interesting measure of performance is the ability of the model to predict more than one time step into the future. One way to approach the multi-step prediction problem is to use iterated single-step prediction. In this method, the predicted output is fed back as input for the next prediction and all other input units have their values shifted back one unit. Thus the input typically consists of a combination of actual and predicted values. When predicting more than one step into the future, the prediction error depends both on how many steps into the future one is predicting (I) and on what point in the time series the prediction began. 
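
The feedback scheme just described can be sketched as follows. This is a minimal illustration rather than the authors' code: `iterated_prediction`, the `predict` callable, and the toy predictor are hypothetical names, and the input window is assumed to be ordered from the oldest to the most recent value.

```python
import numpy as np

def iterated_prediction(predict, history, steps):
    # predict maps a window of past values (oldest first) to the next value.
    # history holds the actual values that seed the first prediction.
    window = np.asarray(history, dtype=float).copy()
    outputs = []
    for _ in range(steps):
        next_value = float(predict(window))
        outputs.append(next_value)
        # Shift every input back by one position and feed the prediction back in.
        window = np.concatenate([window[1:], [next_value]])
    return np.array(outputs)

# Hypothetical stand-in for a trained network: predict the mean of the window.
def toy_predict(window):
    return window.mean()

forecast = iterated_prediction(toy_predict, [1.0, 2.0, 3.0], steps=3)
```

After the first step the window holds a mix of actual and predicted values, as the text notes; once the number of steps reaches the window length, the window contains predictions only.
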
An appropriate error measure for iterated prediction is the average relative I-times iterated prediction variance (Weigend, Huberman and Rumelhart, 1990) which averages predictions I steps into the future over all possible starting points. Using this measure, the performance of various models is shown in Figure 3.\n\n[5] The authors thank Andreas Weigend for providing his version of this data.\n\n[6] This was the model favored by Priestley (Priestley, 1991a) in a recent evaluation of classical statistical approaches to this task.\n\nFigure 2: Typical final mixture probability density for the sunspot prediction problem with a model containing 8 mixture components.\n\nFigure 3: Average relative I-times iterated prediction variance versus number of prediction iterations for the sunspot time series from 1921 to 1955. Closed circles represent the TAR model, open circles the WRH model, closed squares the 3 component complexity model, and open squares the 8 component complexity model. Ten different sets of initial weights were used for the 3 and 8 component complexity models and one standard deviation error bars are shown.\n\n5 Summary\n\nThe simulations we have described provide evidence that the use of a more flexible model for the distribution of weights in a network can lead to better generalization performance than weight decay, weight elimination, or techniques that control the learning time. 
The flexibility of our model is clearly demonstrated in the very different final weight distributions discovered for the two different problems investigated in this paper. The ability to automatically adapt to individual problems suggests that the method should have broad applicability.\n\nAcknowledgements\n\nThis research was funded by the Ontario ITRC, the Canadian NSERC and the Howard Hughes Medical Institute. Hinton is the Noranda fellow of the Canadian Institute for Advanced Research.\n\nReferences\n\nBaum, E. B. and Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1:151-160.\n\nHe, X. and Lapedes, A. (1991). Nonlinear modelling and prediction by successive approximation using Radial Basis Functions. Technical Report LA-UR-91-1375, Los Alamos National Laboratory.\n\nHinton, G. E. (1987). Learning translation invariant recognition in a massively parallel network. In Proc. Conf. Parallel Architectures and Languages Europe, Eindhoven.\n\nLang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43.\n\nLe Cun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto.\n\nMacKay, D. J. C. (1991). Bayesian Modelling and Neural Networks. PhD thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA.\n\nNowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.\n\nNowlan, S. J. and Hinton, G. E. 
(1992). Simplifying neural networks by soft weight-sharing. Neural Computation. In press.\n\nPlaut, D. C., Nowlan, S. J., and Hinton, G. E. (1986). Experiments on learning by back-propagation. Technical Report CMU-CS-86-126, Carnegie-Mellon University, Pittsburgh PA 15213.\n\nPriestley, M. B. (1991a). Non-linear and Non-stationary Time Series Analysis. Academic Press.\n\nPriestley, M. B. (1991b). Spectral Analysis and Time Series. Academic Press.\n\nTong, H. and Lim, K. S. (1980). Threshold autoregression, limit cycles, and cyclical data. Journal Royal Statistical Society B, 42.\n\nWeigend, A. S., Huberman, B. A., and Rumelhart, D. E. (1990). Predicting the future: A connectionist approach. International Journal of Neural Systems, 1.\n", "award": [], "sourceid": 498, "authors": [{"given_name": "Steven", "family_name": "Nowlan", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}