{"title": "Self-Organizing Rules for Robust Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 467, "page_last": 474, "abstract": null, "full_text": "Self-Organizing  Rules  for  Robust \n\nPrincipal  Component  Analysis \n\nLei Xu l ,2\"'and  Alan Yuille l \n\n1.  Division of Applied Sciences,  Harvard  University,  Cambridge, MA  02138 \n\n2.  Dept.  of Mathematics, Peking  University,  Beijing,  P.R.China \n\nAbstract \n\nIn  the  presence  of outliers,  the  existing  self-organizing  rules  for \nPrincipal  Component  Analysis  (PCA)  perform  poorly.  Using  sta(cid:173)\ntistical physics techniques including the Gibbs distribution, binary \ndecision  fields  and  effective  energies,  we  propose  self-organizing \nPCA  rules  which  are  capable  of  resisting  outliers  while  fulfilling \nvarious PCA-related tasks such as obtaining the first  principal com(cid:173)\nponent vector,  the first  k  principal component vectors,  and directly \nfinding  the  subspace  spanned  by  the  first  k  vector  principal  com(cid:173)\nponent  vectors  without solving for  each  vector  individually.  Com(cid:173)\nparative  experiments  have  shown  that  the  proposed  robust  rules \nimprove  the  performances  of the  existing  PCA  algorithms signifi(cid:173)\ncantly when outliers are  present. \n\n1 \n\nINTRODUCTION \n\nPrincipal Component Analysis (PCA) is an essential technique for data compression \nand feature  extraction,  and has been  widely  used  in statistical data analysis,  com(cid:173)\nmunication theory, pattern recognition and image processing.  In the neural network \nliterature, a  lot of studies have  been  made on learning rules for  implementing PCA \nor on networks closely related to PCA (see Xu & Yuille, 1993 for  a detailed reference \nlist  which  contains more than 30  papers related  to these  issues).  The existing rules \ncan fulfil  various PCA-type  tasks for  a  number of application purposes. \n\n\"'Present  address:  Dept.  of  Brain  and  Cognitive  Sciences,  E10-243,  Massachusetts \n\nInstitute of Technology,  Cambridge,  MA  02139. \n\n467 \n\n\f468 \n\nXu and  Yuille \n\nHowever,  almost  all  the  previously  mentioned  peA  algorithms  are  based  on  the \nassumption  that  the  data has  not  been  spoiled  by outliers  (except  Xu,  Oja&Suen \n1992,  where  outliers  can  be  resisted  to some extent.).  In  practice,  real  data often \ncontains some outliers  and  usually they are not easy  to separate from the data set. \nAs shown by the experiments described in this paper, these outliers will significantly \nworsen  the  performances of the existing  peA learning algorithms.  Currently,  little \nattention has  been  paid  to  this problem in the  neural  network  literature,  although \nthe problem is  very  important for  real  applications. \n\nRecently,  there have been some success  in applying t:te statistical physics approach \nto  a  variety  of computer vision  problems  (Yuille,  1990;  Yuille,  Yang&Geiger  1990; \nYuille,  Geiger&Bulthoff,  1991).  In  particular,  it  has  also  been  shown  that  some \ntechniques  developed  in  robust  statistics  (e.g.,  redescending  M-estimators,  least(cid:173)\ntrimmed squares  estimators)  appear naturally within  the  Bayesian formulation  by \nthe  use  of the  statistical  physics  approach.  In  this  paper  we  adapt  this  approach \nto tackle  the  problem of robust  PCA.  
Robust rules are proposed for various PCA-related tasks such as obtaining the first principal component vector, the first k principal component vectors, and principal subspaces. Comparative experiments have been made, and the results show that our robust rules improve the performance of the existing PCA algorithms significantly when outliers are present.

2 PCA LEARNING AND ENERGY MINIMIZATION

There exist a number of self-organizing rules for finding the first principal component. Three of them are listed as follows (Oja, 1982, 1985; Xu, 1991, 1993):

$$ m(t+1) = m(t) + \alpha_a(t)\,\big(x y - m(t)\, y^2\big), \qquad (1) $$
$$ m(t+1) = m(t) + \alpha_a(t)\,\Big(x y - \frac{m(t)}{m(t)^T m(t)}\, y^2\Big), \qquad (2) $$
$$ m(t+1) = m(t) + \alpha_a(t)\,\big[\, y\,(x - \hat{u}) + (y - y')\, x \,\big], \qquad (3) $$

where $y = m(t)^T x$, $\hat{u} = y\, m(t)$, $y' = m(t)^T \hat{u}$, and $\alpha_a(t) \ge 0$ is the learning rate, which decreases to zero as $t \to \infty$ while satisfying certain conditions, e.g., $\sum_t \alpha_a(t) = \infty$ and $\sum_t \alpha_a(t)^q < \infty$ for some $q > 1$.

Each of the three rules converges to the principal component vector $\phi$ almost surely under mild conditions, which are studied in detail by Oja (1982, 1985) and Xu (1991, 1993). Regarding $m$ as the weight vector of a linear neuron with output $y = m^T x$, all three rules can be considered as modifications of the well-known Hebbian rule $m(t+1) = m(t) + \alpha_a(t)\, x y$, obtained by introducing additional terms that prevent $\|m(t)\|$ from going to infinity as $t \to \infty$.

The performance of these rules deteriorates considerably when the data contains outliers. Although some outlier-resisting versions of eq.(1) and eq.(2) have recently been proposed (Xu, Oja & Suen, 1992), they work well only for data which is not severely spoiled by outliers. In this paper, we adopt a totally different approach: we generalize eq.(1), eq.(2) and eq.(3) into more robust versions by using the statistical physics approach.

To do so, we first need to connect these rules to energy functions. It follows from Xu (1991, 1993) and Xu & Yuille (1993) that the rules eq.(2) and eq.(3) are on-line gradient descent rules for minimizing $J_1(m)$ and $J_2(m)$ respectively:

$$ J_1(m) = \frac{1}{N}\sum_{i=1}^{N}\Big( x_i^T x_i - \frac{m^T x_i x_i^T m}{m^T m} \Big), \qquad (4) $$
$$ J_2(m) = \frac{1}{N}\sum_{i=1}^{N} \| x_i - \hat{u}_i \|^2. \qquad (5) $$

Note that $J_1(m) \ge 0$, since $x^T x - \frac{m^T x x^T m}{m^T m} = \|x\|^2 \sin^2\theta_{xm} \ge 0$, where $\theta_{xm}$ is the angle between $x$ and $m$.

It has also been proved that the rule given by eq.(1) satisfies (Xu, 1991, 1993): (a) $h_1^T h_2 \ge 0$ and $E(h_1)^T E(h_2) \ge 0$, with $h_1 = x y - m y^2$ and $h_2 = x y - \frac{m}{m^T m}\, y^2$; (b) $E(h_1)^T E(h_3) > 0$, with $h_3 = y(x - \hat{u}) + (y - y') x$; (c) both $J_1$ and $J_2$ have only one local (also global) minimum $\mathrm{tr}(\Sigma) - \phi^T \Sigma \phi$, and all the other critical points (i.e., the points satisfying $\partial J_i(m)/\partial m = 0$, $i = 1, 2$) are saddle points. Here $\Sigma = E\{x x^T\}$, and $\phi$ is the eigenvector of $\Sigma$ corresponding to the largest eigenvalue. That is, the rule eq.(1) is a downhill algorithm for minimizing $J_1$ in both the on-line sense and the average sense, and for minimizing $J_2$ in the average sense.
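To make the rules concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of the three on-line updates (1)-(3) applied to a data matrix whose rows are the samples. The function name, initialization, learning-rate schedule and number of sweeps are illustrative choices rather than anything prescribed above.

import numpy as np

def first_pc_online(X, rule=1, sweeps=50, a0=0.05, seed=0):
    """Estimate the first principal component of the rows of X with rule (1), (2) or (3)."""
    rng = np.random.default_rng(seed)
    m = rng.normal(size=X.shape[1])
    m /= np.linalg.norm(m)
    t = 0
    for _ in range(sweeps):
        for x in X[rng.permutation(len(X))]:
            t += 1
            alpha = a0 / (1.0 + 1e-3 * t)           # decreasing learning rate
            y = m @ x                               # neuron output y = m^T x
            if rule == 1:                           # eq.(1), Oja's rule
                m = m + alpha * (x * y - m * y**2)
            elif rule == 2:                         # eq.(2), normalized variant
                m = m + alpha * (x * y - (m / (m @ m)) * y**2)
            else:                                   # eq.(3), reconstruction form
                u_hat = y * m                       # u_hat = y m
                y_prime = m @ u_hat                 # y' = m^T u_hat
                m = m + alpha * (y * (x - u_hat) + (y - y_prime) * x)
    return m / np.linalg.norm(m)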
3 GENERALIZED ENERGY AND ROBUST PCA

We further regard $J_1(m)$ and $J_2(m)$ as special cases of the following general energy:

$$ J(m) = \frac{1}{N}\sum_{i=1}^{N} z(x_i, m), \qquad z(x_i, m) \ge 0, \qquad (6) $$

where $z(x_i, m)$ is the portion of energy contributed by the sample $x_i$; the choices corresponding to $J_1$ and $J_2$ are

$$ z_1(x_i, m) = x_i^T x_i - \frac{m^T x_i x_i^T m}{m^T m}, \qquad z_2(x_i, m) = \| x_i - \hat{u}_i \|^2. \qquad (7) $$

Following Yuille (1990), we now generalize the energy eq.(6) into

$$ E(V, m) = \sum_{i=1}^{N} V_i\, z(x_i, m) + E_{prior}(V), \qquad (8) $$

where $V = \{V_i,\ i = 1, \ldots, N\}$ is a binary field, with each $V_i$ being a random variable taking the value 0 or 1. $V_i$ acts as a decision indicator for deciding whether $x_i$ is an outlier or a sample. When $V_i = 1$, the portion of energy contributed by the sample $x_i$ is taken into consideration; otherwise, it is equivalent to discarding $x_i$ as an outlier. $E_{prior}(V)$ is the a priori portion of energy contributed by the a priori distribution of $\{V_i\}$. A natural choice is

$$ E_{prior}(V) = \eta \sum_{i=1}^{N} (1 - V_i). \qquad (9) $$

This choice of prior has a natural interpretation: for fixed $m$ it is energetically favourable to set $V_i = 1$ (i.e., not to regard $x_i$ as an outlier) if $z(x_i, m) < \eta$ (i.e., the portion of energy contributed by $x_i$ is smaller than a prespecified threshold), and to set it to 0 otherwise.

Based on $E(V, m)$, we define a Gibbs distribution (Parisi, 1988):

$$ P[V, m] = \frac{1}{Z}\, e^{-\beta E(V, m)}, \qquad (10) $$

where $Z$ is the partition function which ensures that $P[V, m]$ is normalized to 1 over $V$ and $m$. Then we compute

$$ P_{margin}(m) = \frac{1}{Z}\sum_{V} e^{-\beta \sum_i \{ V_i z(x_i, m) + \eta (1 - V_i) \}} = \frac{1}{Z}\prod_i \sum_{V_i \in \{0, 1\}} e^{-\beta \{ V_i z(x_i, m) + \eta (1 - V_i) \}} = \frac{1}{Z_m}\, e^{-\beta E_{eff}(m)}, \qquad (11) $$

$$ E_{eff}(m) = -\frac{1}{\beta}\sum_i \log\big\{ 1 + e^{-\beta \{ z(x_i, m) - \eta \}} \big\}. \qquad (12) $$

$E_{eff}$ is called the effective energy. Each term in the sum for $E_{eff}$ is approximately $z(x_i, m)$ for small values of $z$ but becomes constant as $z(x_i, m) \to \infty$. In this way outliers, which are more likely to yield large values of $z(x_i, m)$, are treated differently from samples, and thus the estimate $m$ obtained by minimizing $E_{eff}(m)$ will be robust and able to resist outliers.

$E_{eff}(m)$ is usually not a convex function and may have many local minima. The statistical physics framework suggests using deterministic annealing to minimize $E_{eff}(m)$: by the gradient descent rule eq.(13), minimize $E_{eff}(m)$ for small $\beta$ and then track the minimum as $\beta$ increases to infinity (the zero-temperature limit):

$$ m(t+1) = m(t) - \alpha_b(t) \sum_i \frac{1}{1 + e^{\beta ( z(x_i, m(t)) - \eta )}}\, \frac{\partial z(x_i, m(t))}{\partial m(t)}. \qquad (13) $$

More specifically, with $z$ chosen to correspond to the energies $J_1$ and $J_2$ respectively, we have the following batch-way learning rules for robust PCA:

$$ m(t+1) = m(t) + \alpha_b(t) \sum_i \frac{1}{1 + e^{\beta ( z(x_i, m(t)) - \eta )}}\, \Big( x_i y_i - \frac{m(t)}{m(t)^T m(t)}\, y_i^2 \Big), \qquad (14) $$

$$ m(t+1) = m(t) + \alpha_b(t) \sum_i \frac{1}{1 + e^{\beta ( z(x_i, m(t)) - \eta )}}\, \big[\, y_i (x_i - \hat{u}_i) + (y_i - y_i') x_i \,\big]. \qquad (15) $$

For data that arrives incrementally or in an on-line way, we correspondingly have the following adaptive or stochastic approximation versions:

$$ m(t+1) = m(t) + \alpha_a(t)\, \frac{1}{1 + e^{\beta ( z(x_i, m(t)) - \eta )}}\, \Big( x_i y_i - \frac{m(t)}{m(t)^T m(t)}\, y_i^2 \Big), \qquad (16) $$

$$ m(t+1) = m(t) + \alpha_a(t)\, \frac{1}{1 + e^{\beta ( z(x_i, m(t)) - \eta )}}\, \big[\, y_i (x_i - \hat{u}_i) + (y_i - y_i') x_i \,\big]. \qquad (17) $$

It can be observed that the difference between eq.(2) and eq.(16), or between eq.(3) and eq.(17), is that the learning rate $\alpha_a(t)$ has been modified by a multiplicative factor

$$ \alpha_m(t) = \frac{1}{1 + e^{\beta ( z(x_i, m(t)) - \eta )}}, \qquad (18) $$

which adaptively modifies the learning rate to suit the current input $x_i$. This modifying factor has a similar function to the one used in Xu, Oja & Suen (1992) for robust line fitting, but the modifying factor eq.(18) is more sophisticated and performs better.
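The modifying factor of eq.(18) is straightforward to add to an on-line rule. Below is a hedged NumPy sketch (ours, not the paper's code) of the robust rule eq.(16): the plain update of eq.(2) is scaled by the gate $1/(1 + e^{\beta(z - \eta)})$ with $z = z_1(x_i, m)$ from eq.(7). For simplicity the sketch keeps $\beta$ fixed rather than annealing it, and the values of beta, eta and the learning-rate schedule are illustrative; the paper defers the selection of these parameters to Xu & Yuille (1993).

import numpy as np

def robust_first_pc(X, beta=5.0, eta=1.0, sweeps=50, a0=0.05, seed=0):
    """Robust on-line estimate of the first principal component, in the spirit of eq.(16)."""
    rng = np.random.default_rng(seed)
    m = rng.normal(size=X.shape[1])
    m /= np.linalg.norm(m)
    t = 0
    for _ in range(sweeps):
        for x in X[rng.permutation(len(X))]:
            t += 1
            alpha = a0 / (1.0 + 1e-3 * t)
            y = m @ x
            z = x @ x - y**2 / (m @ m)               # z_1(x, m) from eq.(7)
            # soft outlier gate of eq.(18); the argument is clipped to avoid overflow
            gate = 1.0 / (1.0 + np.exp(min(beta * (z - eta), 50.0)))
            m = m + alpha * gate * (x * y - (m / (m @ m)) * y**2)
    return m / np.linalg.norm(m)

In practice the threshold eta has to be set relative to the typical per-sample energy of the inliers; with a fixed beta the gate simply down-weights any sample whose energy exceeds eta.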
Based on the connection between the rule eq.(1) and $J_1$ or $J_2$ given in Sec. 2, we can also formally use the modifying factor $\alpha_m(t)$ to turn the rule eq.(1) into the following robust version:

$$ m(t+1) = m(t) + \alpha_a(t)\, \frac{1}{1 + e^{\beta ( z(x_i, m(t)) - \eta )}}\, \big( x_i y_i - m(t)\, y_i^2 \big). \qquad (19) $$

4 ROBUST RULES FOR k PRINCIPAL COMPONENTS

In a similar way to SGA (Oja, 1992) and GHA (Sanger, 1989), we can generalize the robust rules eq.(19), eq.(16) and eq.(17) into the following general form of robust rules for finding the first $k$ principal components:

$$ m_j(t+1) = m_j(t) + \alpha_a(t)\, \frac{1}{1 + e^{\beta ( z(x_i(j), m_j(t)) - \eta )}}\, \Delta m_j(x_i(j), m_j(t)), \qquad (20) $$

$$ x_i(j) = x_i - \sum_{r=1}^{j-1} y_i(r)\, m_r(t), \qquad y_i(j) = m_j(t)^T x_i(j), \qquad x_i(1) = x_i, \qquad (21) $$

where $\Delta m_j(x_i(j), m_j(t))$ and $z(x_i(j), m_j(t))$ have four possibilities (Xu & Yuille, 1993). As an example, one of them is given here:

$$ \Delta m_j(x_i(j), m_j(t)) = x_i(j)\, y_i(j) - m_j(t)\, y_i(j)^2, \qquad z(x_i(j), m_j(t)) = x_i(j)^T x_i(j) - \frac{y_i(j)^2}{m_j(t)^T m_j(t)}. $$

In this case, eq.(20) can be regarded as a generalization of GHA (Sanger, 1989).

We can also develop an alternative set of rules for a type of net with asymmetric lateral weights, as used in Rubner & Schulten (1990). These rules can also obtain the first $k$ principal components robustly in the presence of outliers (Xu & Yuille, 1993).
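The deflation scheme of eqs.(20)-(21), with the GHA-like choice of $\Delta m_j$ and $z$ given above, can be sketched as follows. This is our own NumPy illustration, not the paper's implementation: the rows of M hold the vectors $m_j$, and beta, eta and the learning-rate schedule are again illustrative rather than prescribed.

import numpy as np

def robust_k_pcs(X, k, beta=5.0, eta=1.0, sweeps=50, a0=0.05, seed=0):
    """Robust extraction of the first k principal components in the spirit of eqs.(20)-(21)."""
    rng = np.random.default_rng(seed)
    M = rng.normal(size=(k, X.shape[1]))
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    t = 0
    for _ in range(sweeps):
        for x in X[rng.permutation(len(X))]:
            t += 1
            alpha = a0 / (1.0 + 1e-3 * t)
            x_j = x.copy()                            # x_i(1) = x_i
            for j in range(k):
                m = M[j]
                y = m @ x_j                           # y_i(j) = m_j^T x_i(j)
                z = x_j @ x_j - y**2 / (m @ m)        # the example z of Sec. 4
                gate = 1.0 / (1.0 + np.exp(min(beta * (z - eta), 50.0)))
                M[j] = m + alpha * gate * (x_j * y - m * y**2)   # eq.(20)
                x_j = x_j - y * M[j]                  # deflate the input for the next component, eq.(21)
    return M / np.linalg.norm(M, axis=1, keepdims=True)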
5 ROBUST RULES FOR PRINCIPAL SUBSPACE

Let $M = [m_1, \ldots, m_k]$, $\Phi = [\phi_1, \ldots, \phi_k]$, $y = [y_1, \ldots, y_k]^T$ and $y = M^T x$. It follows from Oja (1989) and Xu (1991) that the rules eq.(1) and eq.(3) can be generalized into eq.(22) and eq.(23) respectively:

$$ M(t+1) = M(t) + \alpha_a(t)\, \big( x\, y^T - M(t)\, y\, y^T \big), \qquad (22) $$

$$ M(t+1) = M(t) + \alpha_a(t)\, \big[\, (x - \hat{u})\, y^T + x\, (y - y')^T \,\big], \qquad \hat{u} = M y, \quad y' = M^T \hat{u}. \qquad (23) $$

In the case without outliers, under both rules the weight matrix $M(t)$ will converge to a matrix $M^{\infty}$ whose column vectors $m_j$, $j = 1, \ldots, k$, span the $k$-dimensional principal subspace (Oja, 1989; Xu, 1991, 1993), although the vectors are, in general, not equal to the $k$ principal component vectors $\phi_j$, $j = 1, \ldots, k$.

Similar to the previously used procedure, we have the following results:

(1) We can show that eq.(23) is an on-line or stochastic approximation rule which minimizes the energy $J_3$ by gradient descent (Xu, 1991, 1993):

$$ J_3(M) = \frac{1}{N}\sum_{i=1}^{N} \| x_i - \hat{u}_i \|^2, \qquad \hat{u} = M y, \quad y' = M^T \hat{u}, \qquad (24) $$

and that in the average sense the subspace rule eq.(22) is also an on-line downhill rule for minimizing the energy function $J_3$.

(2) We can also generalize the non-robust rules eq.(22) and eq.(23) into robust versions by using the statistical physics approach again:

$$ M(t+1) = M(t) + \alpha_a(t)\, \frac{1}{1 + e^{\beta ( \| x_i - \hat{u}_i \|^2 - \eta )}}\, \big[\, (x_i - \hat{u}_i)\, y_i^T + x_i\, (y_i - y_i')^T \,\big], \qquad (25) $$

$$ M(t+1) = M(t) + \alpha_a(t)\, \frac{1}{1 + e^{\beta ( \| x_i - \hat{u}_i \|^2 - \eta )}}\, \big[\, x_i\, y_i^T - M(t)\, y_i\, y_i^T \,\big]. \qquad (26) $$

6 EXAMPLES OF EXPERIMENTAL RESULTS

Let $x$ come from a population of 400 samples with zero mean. These samples lie on an elliptic ring centred at the origin of $R^3$, with its largest elliptic axis along the direction $(-1, 1, 0)$ and the plane of its other two axes intersecting the $x$-$y$ plane at an acute angle of 30°. Among the 400 samples, 10 points (only 2.5%) are randomly chosen and replaced by outliers. The resulting data set is shown in Fig.1.

Before the outliers were introduced, either the conventional sample-variance-matrix based approach (i.e., solving $S \phi = \lambda \phi$ with $S = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T$) or the unrobust rules eqs.(1)-(3) can find the correct first principal component vector of this data set.

On the data set contaminated by outliers, shown in Fig.1, the result of the sample-variance-matrix based approach has an angular error from $\phi_p$ of 71.04°, a result that is definitely unacceptable. The results of using the proposed robust rules eq.(19), eq.(16) and eq.(17) are shown in Fig.2(a), in comparison with those of their unrobust counterparts, the rules eq.(1), eq.(2) and eq.(3). We observe that all the unrobust rules reach solutions with errors of more than 21° from the correct direction of $\phi_p$. By contrast, the robust rules still maintain very good accuracy: the error is about 0.36°. Fig.2(b) gives the results of solving for the first two principal component vectors. Again, the unrobust rule produces large errors of around 23°, while the robust rules have an error of about 1.7°. Fig.3 shows the results of solving for the 2-dimensional principal subspace; the significant improvement obtained by using the robust rules is easy to see.

Figure 1: The projections of the data on the x-y, y-z and z-x planes, with 10 outliers.
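The synthetic test of this section can be reproduced in rough outline with the sketches above. The following fragment (ours) builds an elliptic ring in $R^3$ whose major axis points along $(-1, 1, 0)$, replaces a few samples by gross outliers, and compares the angular error of the plain rule eq.(1) with that of the robust rule eq.(16), reusing the first_pc_online and robust_first_pc functions sketched earlier. The ring geometry, the outlier distribution and the parameter values are our own guesses, since the paper does not fully specify them.

import numpy as np

def make_ring_data(n=400, n_out=10, seed=1):
    """Elliptic ring in R^3 with major axis along (-1, 1, 0), plus n_out gross outliers (assumed setup)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    ring = np.stack([3.0 * np.cos(theta), 1.0 * np.sin(theta), np.zeros(n)], axis=1)
    c30, s30 = np.cos(np.pi / 6), np.sin(np.pi / 6)
    Rx = np.array([[1, 0, 0], [0, c30, -s30], [0, s30, c30]])       # tilt the ring plane by 30 degrees
    c, s = np.cos(3.0 * np.pi / 4), np.sin(3.0 * np.pi / 4)
    Rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])               # send the major axis to (-1, 1, 0)
    X = ring @ Rx.T @ Rz.T
    idx = rng.choice(n, size=n_out, replace=False)
    X[idx] = rng.normal(scale=10.0, size=(n_out, 3))                # gross outliers
    return X - X.mean(axis=0)

def angle_deg(a, b):
    """Acute angle between two directions, in degrees."""
    c = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

if __name__ == "__main__":
    X = make_ring_data()
    phi_p = np.array([-1.0, 1.0, 0.0]) / np.sqrt(2.0)               # nominal first principal direction
    print("plain rule eq.(1):  ", angle_deg(first_pc_online(X, rule=1), phi_p))
    print("robust rule eq.(16):", angle_deg(robust_first_pc(X, beta=5.0, eta=2.0), phi_p))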
Acknowledgements

We would like to thank DARPA and the Air Force for support with contracts AFOSR-89-0506 and F4969092-J-0466.

We would also like to mention that some further issues about the proposed robust rules are studied in Xu & Yuille (1993), including the selection of the parameters $\alpha$, $\beta$ and $\eta$, the extension of the rules to robust Minor Component Analysis (MCA), and the relations of the rules to the two main types of existing robust PCA algorithms in the statistics literature, as well as to Maximum Likelihood (ML) estimation of finite mixture distributions.

References

E. Oja, J. Math. Biol. 16, 1982, 267-273.
E. Oja & J. Karhunen, J. Math. Anal. Appl. 106, 1985, 69-84.
E. Oja, Int. J. Neural Systems 1, 1989, 61-68.
E. Oja, Neural Networks 5, 1992, 927-935.
G. Parisi, Statistical Field Theory, Addison-Wesley, Reading, Mass., 1988.
J. Rubner & K. Schulten, Biological Cybernetics 62, 1990, 193-199.
T.D. Sanger, Neural Networks 2, 1989, 459-473.
L. Xu, Proc. of IJCNN'91-Singapore, Nov., 1991, 2368-2373.
L. Xu, Least mean square error reconstruction for self-organizing neural-nets, Neural Networks 6, 1993, in press.
L. Xu, E. Oja & C.Y. Suen, Neural Networks 5, 1992, 441-457.
L. Xu & A.L. Yuille, Robust principal component analysis by self-organizing rules based on statistical physics approach, IEEE Trans. Neural Networks, 1993, in press.
A.L. Yuille, Neural Computation 2, 1990, 1-24.
A.L. Yuille, D. Geiger and H.H. Bulthoff, Networks 2, 1991, 423-442.

Figure 2: The learning curves obtained in the comparative experiments for principal component vectors. (a) For the first principal component vector: RA1, RA2, RA3 denote the robust rules eq.(19), eq.(16) and eq.(17) respectively, and UA1, UA2, UA3 denote the rules eq.(1), eq.(2) and eq.(3) respectively. The horizontal axis denotes the learning steps, and the vertical axis is $\theta_{m(t)\,\phi_{p1}}$, with $\theta_{x,y}$ denoting the acute angle between $x$ and $y$. (b) For the first two principal component vectors, by the robust rule eq.(20) and its unrobust counterpart GHA: UAk1, UAk2 denote the learning curves of the angles $\theta_{m_1(t)\,\phi_{p1}}$ and $\theta_{m_2(t)\,\phi_{p2}}$ respectively, obtained by GHA, and RAk1, RAk2 denote the learning curves of the angles obtained by using the robust rule eq.(20). In both (a) and (b), $\phi_{pj}$, $j = 1, 2$, is the correct 1st and 2nd principal component vector respectively.

Figure 3: The learning curves obtained in the comparative experiments for solving the 2-dimensional principal subspace. Each learning curve shows the change of the residual $e_r(t) = \sum_{j=1}^{k} \| m_j(t) - \sum_{r=1}^{k} ( m_j(t)^T \phi_{pr} ) \phi_{pr} \|^2$ with the learning steps. The smaller the residual, the closer the estimated principal subspace is to the correct one. SUB1, SUB2 denote the unrobust rules eq.(22) and eq.(23) respectively, and RSUB1, RSUB2 denote the robust rules eq.(26) and eq.(25) respectively.

", "award": [], "sourceid": 686, "authors": [{"given_name": "Lei", "family_name": "Xu", "institution": null}, {"given_name": "Alan", "family_name": "Yuille", "institution": null}]}