{"title": "Convergence of a Neural Network Classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 839, "page_last": 845, "abstract": null, "full_text": "Convergence of a Neural Network Classifier\n\nJohn S. Baras\nSystems Research Center\nUniversity of Maryland\nCollege Park, Maryland 20705\n\nAnthony LaVigna\nSystems Research Center\nUniversity of Maryland\nCollege Park, Maryland 20705\n\nAbstract\n\nIn this paper, we prove that the vectors in the LVQ learning algorithm\nconverge. We do this by showing that the learning algorithm performs\nstochastic approximation. Convergence is then obtained by identifying\nthe appropriate conditions on the learning rate and on the underlying\nstatistics of the classification problem. We also present a modification to\nthe learning algorithm which we argue results in convergence of the LVQ\nerror to the Bayesian optimal error as the appropriate parameters become\nlarge.\n\n1 Introduction\n\nLearning Vector Quantization (LVQ) originated in the neural network community\nand was introduced by Kohonen (Kohonen [1986]). There have been extensive\nsimulation studies reported in the literature demonstrating the effectiveness of LVQ\nas a classifier, and it has generated considerable interest because the training times\nassociated with LVQ are significantly less than those associated with backpropagation\nnetworks.\n\nIn this paper we analyse the convergence properties of LVQ. Using a theorem from\nthe stochastic approximation literature, we prove that the update algorithm\nconverges under suitable conditions. We also present a modification to the\nalgorithm which provides for more stable learning. Finally, we discuss the decision error\nassociated with this \"modified\" LVQ algorithm.
\n\n2 A Review of Learning Vector Quantization\n\nLet {(x_i, d_{x_i})}_{i=1}^{N} be the training data or past observation set. This means that x_i\nis observed when pattern d_{x_i} is in effect. We assume that the x_i's are statistically\nindependent (this assumption can be relaxed). Let θ_i be a Voronoi vector and let\nΘ = {θ_1, ..., θ_k} be the set of Voronoi vectors. We assume that there are many more\nobservations than Voronoi vectors (Duda & Hart [1973]). Once the Voronoi vectors\nare initialized, training proceeds by taking a sample (x_j, d_{x_j}) from the training set,\nfinding the closest Voronoi vector and adjusting its value according to equations (1)\nand (2). After several passes through the data, the Voronoi vectors converge and\ntraining is complete.\n\nSuppose θ_c is the closest vector. Adjust θ_c as follows:\n\nθ_c(n + 1) = θ_c(n) + α_n (x_j - θ_c(n))    (1)\n\nif d_{θ_c} = d_{x_j}, and\n\nθ_c(n + 1) = θ_c(n) - α_n (x_j - θ_c(n))    (2)\n\nif d_{θ_c} ≠ d_{x_j}. The other Voronoi vectors are not modified.\nThis update has the effect that if x_j and θ_c have the same decision then θ_c is moved\ncloser to x_j; if they have different decisions then θ_c is moved away from x_j.\nThe constants {α_n} are positive and decreasing, e.g., α_n = 1/n. We are concerned\nwith the convergence properties of Θ(n) and with the resulting detection error.\n\nFor ease of notation, we assume that there are only two pattern classes. The\nequations for the case of more than two pattern classes are given in (LaVigna\n[1989]).\n\n3 Convergence of the Learning Algorithm\n\nThe LVQ algorithm has the general form\n\nθ_i(n + 1) = θ_i(n) + α_n γ(d_{x_n}, d_{θ_i(n)}, x_n, Θ_n) (x_n - θ_i(n))    (3)\n\nwhere x_n is the currently chosen past observation.\n
The function γ determines\nwhether there is an update and what its sign should be and is given by\n\nγ(d_{x_n}, d_{θ_i}, x_n, Θ) =\n   1  if d_{x_n} = d_{θ_i} and x_n ∈ V_{θ_i},\n  -1  if d_{x_n} ≠ d_{θ_i} and x_n ∈ V_{θ_i},\n   0  otherwise.    (4)\n\nHere V_{θ_i} represents the set of points closest to θ_i and is given by\n\nV_{θ_i} = {x ∈ ℝ^d : ‖θ_i - x‖ < ‖θ_j - x‖, j ≠ i},  i = 1, ..., k.    (5)\n\nThe update in (3) is a stochastic approximation algorithm (Benveniste, Metivier &\nPriouret [1987]). It has the form\n\nΘ_{n+1} = Θ_n + α_n H(Θ_n, z_n)    (6)\n\nwhere Θ is the vector with components θ_i; H(Θ, z) is the vector with components\ndefined in the obvious manner from (3) and z_n = (x_n, d_{x_n}) is the random pair\nconsisting of the observation and the associated true pattern number. If the\nappropriate conditions are satisfied by α_n, H, and z_n, then Θ_n approaches the solution\nof\n\nd/dt Θ̄(t) = h(Θ̄(t))    (7)\n\nfor the appropriate choice of h(Θ).\nFor the two pattern case, we let p_1(x) represent the density for pattern 1 and π_1\nrepresent its prior. Likewise for p_0(x) and π_0. It can be shown (Kohonen [1986])\nthat\n\n(8)\n\nwhere\n\n(9)\n\nIf the following hypotheses hold then using techniques from (Benveniste, Metivier &\nPriouret [1987]) or (Kushner & Clark [1978]) we can prove the convergence theorem\nbelow:\n[H.1] {α_n} is a nonincreasing sequence of positive reals such that Σ_n α_n = ∞,\nΣ_n α_n² < ∞.\n[H.2] Given d_{x_n}, the x_n are independent and distributed according to p_{d_{x_n}}(x).\n[H.3] The pattern densities, p_i(x), are continuous.\nTheorem 1 Assume that [H.1]-[H.3] hold. Let Θ* be a locally asymptotically stable\nequilibrium point of (7) with domain of attraction D*. Let Q be a compact subset\nof D*. If Θ_n ∈ Q for infinitely many n then\n\nlim Θ_n = Θ* a.s.
\nn→∞\n\n(10)\n\nProof: see (LaVigna [1989]).\n\nHence if the initial locations and decisions of the Voronoi vectors are close to a\nlocally asymptotically stable equilibrium of (7) and if they do not move too much then\nthe vectors converge.\n\nGiven the form of (8) one might try to use Lyapunov theory to prove convergence\nwith\n\nL(Θ) = Σ_{i=1}^{K} ∫_{V_{θ_i}} ‖x - θ_i‖² q_i(x) dx    (11)\n\nas a candidate Lyapunov function. This function will not work, as is demonstrated\nby the following calculation in the one dimensional case. Suppose that K = 2 and\nθ_1 < θ_2; then\n\n∂L(Θ)/∂θ_1\n\n(12)\n\nFigure 1: A possible distribution of observations and two Voronoi vectors.\n\nLikewise\n\n(18)\n\nTherefore\n\n∇L(Θ) Θ̇ = -h_1(Θ)² - h_2(Θ)² + ‖(θ_1 - θ_2)/2‖² q_1((θ_1 + θ_2)/2) (h_1(Θ) - h_2(Θ))    (19)\n\nIn order for this to be a Lyapunov function (19) would have to be strictly nonpositive,\nwhich is not the case. The problem with this candidate occurs because the integrand\nq_i(x) is not strictly positive as is the case for ordinary vector quantization and\nadaptive K-means.\n\n4 Modified LVQ Algorithm\n\nThe convergence results above require that the initial conditions are close to the\nstable points of (7) in order for the algorithm to converge. In this section we\npresent a modification to the LVQ algorithm which increases the number of stable\nequilibria for equation (7) and hence increases the chances of convergence. First\nwe present a simple example which emphasizes a defect of LVQ and suggests an\nappropriate modification to the algorithm.\nLet ○ represent an observation from pattern 2 and let △ represent an observation\nfrom pattern 1. We assume that the observations are scalar.\n
Figure 1 shows a\npossible distribution of observations. Suppose there are two Voronoi vectors θ_1 and\nθ_2 with decisions 1 and 2, respectively, initialized as shown in Figure 1. At each\nupdate of the LVQ algorithm, a point is picked at random from the observation set\nand the closest Voronoi vector is modified. We see that during this update, it is\npossible for θ_2(n) to be pushed towards ∞ and θ_1(n) to be pushed towards -∞,\nhence the Voronoi vectors may not converge.\n\nRecall that during the update procedure in (3), the Voronoi cells are changed by\nchanging the location of one Voronoi vector. After an update, the majority vote of\nthe observations in each new Voronoi cell may not agree with the decision previously\nassigned to that cell. This discrepancy can cause the divergence of the algorithm.\nIn order to prevent this from occurring, the decisions associated with the Voronoi\nvectors should be updated to agree with the majority vote of the observations that\nfall within their Voronoi cells. Let\n\ng_i(Θ; N) =\n  1  if (1/N) Σ_{j=1}^{N} I{y_j ∈ V_{θ_i}} I{d_{y_j} = 1} > (1/N) Σ_{j=1}^{N} I{y_j ∈ V_{θ_i}} I{d_{y_j} = 2},\n  2  otherwise.    (20)\n\nThen g_i represents the decision of the majority vote of the observations falling in\nV_{θ_i}. With this modification, the learning for θ_i becomes\n\nθ_i(n + 1) = θ_i(n) + α_n γ(d_{x_n}, g_i(Θ_n; N), x_n, Θ_n) ∇_{θ_i(n)} (θ_i(n) - x_n).    (21)\n\nThis equation has the same form as (3), with the function H(Θ, z) now defined from (21)\nrather than from (3).\n\nThis divergence happens because the decisions of the Voronoi vectors do not agree\nwith the majority vote of the observations closest to each vector. As a result, the\nVoronoi vectors are pushed away from the origin.\n
This phenomenon occurs even\nthough the observation data is bounded. The point here is that, if the decision\nassociated with a Voronoi vector does not agree with the majority vote of the\nobservations closest to that vector then it is possible for the vector to diverge. A\nsimple solution to this problem is to correct the decisions of all the Voronoi vectors\nafter every adjustment so that their decisions correspond to the majority vote. In\npractice this correction would only be done during the beginning iterations of the\nlearning algorithm since that is when α_n is large and the Voronoi vectors are moving\naround significantly. With this modification it is possible to show convergence to the\nBayes optimal classifier (LaVigna [1989]) as the number of Voronoi vectors becomes\nlarge.\n\n5 Decision Error\n\nIn this section we discuss the error associated with the modified LVQ algorithm.\nHere two results are discussed. The first is the simple comparison between LVQ\nand the nearest neighbor algorithm. The second result is that if the number of Voronoi\nvectors is allowed to go to infinity at an appropriate rate as the number of\nobservations goes to infinity, then it is possible to construct a convergent estimator of\nthe Bayes risk. That is, the error associated with LVQ can be made to approach\nthe optimal error. As before, we concentrate on the binary pattern case for ease of\nnotation.\n\n5.1 Nearest Neighbor\n\nIf a Voronoi vector is assigned to each observation then the LVQ algorithm reduces\nto the nearest neighbor algorithm. For that algorithm, it was shown (Cover & Hart\n[1967]) that its asymptotic probability of error is less than twice that of the\nBayes optimal classifier. More specifically, let r* be the Bayes optimal risk and let r be
More  specifically,  let  r*  be  the  Bayes  optimal  risk  and  let  l'  be \n\n\f844 \n\nBaras and LaVigna \n\nthe nearest neighbor  risk.  It was  shown  that \n\nr*::; r::; 2r*(1- r*) < 2r*. \n\n(22) \n\nHence  in  the case of no iteration, the  Bayes' risk associated  with  LVQ  is  given  from \nthe nearest  neighbor  algorithm. \n\n5.2  Other Choices for  Number of Voronoi Vectors \n\nWe  saw  above that if the number of Voronoi  vectors equals  the  number of observa(cid:173)\ntions then LVQ  coincides with the nearest neighbor algorithm.  Let kN  represent the \nnumber  of Voronoi  vectors  for  an  observation  sample size  of N.  We  are  interested \nin  determining  the  probability  of error for  LVQ  when  kN  satisfies  (1)  limkN  = 00 \nand  (2)  lim(kN / N) = O.  In  this case,  there are  more observations  than vectors  and \nhence  the  Voronoi  vectors  represent  averages  of the  observations.  It is  possible  to \nshow  that  with  kN  satisfying  (1)-(2)  the  decision  error  associated  with  modified \nLVQ  can  be  made  to  approach  the  Bayesian  optimal  decision  error  as  N  becomes \nlarge  (LaVigna [1989]). \n\n6  Conclusions \n\nWe  have shown  convergence of the Voronoi  vectors  in  the LVQ  algorithm.  We  have \nalso  presented  the  majority  vote  modification  of the  LVQ  algorithm.  This  modifi(cid:173)\ncation  prevents  divergence  of the  Voronoi  vectors  and  results  in  convergence  for  a \nlarger set of initial  conditions.  In  addition,  with  this  modification  it  is  possible  to \nshow  that  as  the  appropriate  parameters  go  to  infinity  the  decision  regions  asso(cid:173)\nciated  with  the modified  LVQ  algorithm approach  the  Bayesian  optimal  (La Vigna \n[1989]). 
\n\n7 Acknowledgements\n\nThis work was supported by the National Science Foundation through grant\nCDR-8803012, Texas Instruments through a TI/SRC Fellowship and the Office of Naval\nResearch through an ONR Fellowship.\n\n8 References\n\nA. Benveniste, M. Metivier & P. Priouret [1987], Algorithmes Adaptatifs et\nApproximations Stochastiques, Masson, Paris.\n\nT. M. Cover & P. E. Hart [1967], \"Nearest Neighbor Pattern Classification,\" IEEE\nTransactions on Information Theory IT-13, 21-27.\n\nR. O. Duda & P. E. Hart [1973], Pattern Classification and Scene Analysis, John\nWiley & Sons, New York, NY.\n\nT. Kohonen [1986], \"Learning Vector Quantization for Pattern Recognition,\"\nTechnical Report TKK-F-A601, Helsinki University of Technology.\n\nH. J. Kushner & D. S. Clark [1978], Stochastic Approximation Methods for\nConstrained and Unconstrained Systems, Springer-Verlag, New\nYork-Heidelberg-Berlin.\n\nA. LaVigna [1989], \"Nonparametric Classification using Learning Vector\nQuantization,\" Ph.D. Dissertation, Department of Electrical Engineering, University\nof Maryland.\n\n", "award": [], "sourceid": 407, "authors": [{"given_name": "John", "family_name": "Baras", "institution": null}, {"given_name": "Anthony", "family_name": "LaVigna", "institution": null}]}