{"title": "Two Iterative Algorithms for Computing the Singular Value Decomposition from Input/Output Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 144, "page_last": 151, "abstract": null, "full_text": "Two Iterative Algorithms for Computing the Singular Value Decomposition from Input/Output Samples\n\nTerence D. Sanger\nJet Propulsion Laboratory\nMS 303-310\n4800 Oak Grove Drive\nPasadena, CA 91109\n\nAbstract\n\nThe Singular Value Decomposition (SVD) is an important tool for linear algebra and can be used to invert or approximate matrices. Although many authors use \"SVD\" synonymously with \"Eigenvector Decomposition\" or \"Principal Components Transform\", it is important to realize that these other methods apply only to symmetric matrices, while the SVD can be applied to arbitrary nonsquare matrices. This property is important for applications to signal transmission and control.\n\nI propose two new algorithms for iterative computation of the SVD given only sample inputs and outputs from a matrix. Although there currently exist many algorithms for Eigenvector Decomposition (Sanger 1989, for example), these are the first true sample-based SVD algorithms.\n\n1 INTRODUCTION\n\nThe Singular Value Decomposition (SVD) is a method for writing an arbitrary nonsquare matrix as the product of two orthogonal matrices and a diagonal matrix. This technique is an important component of methods for approximating near-singular matrices and computing pseudo-inverses. Several efficient techniques exist for finding the SVD of a known matrix (Golub and Van Loan 1983, for example). 
\n\nFigure 1: Representation of the plant matrix P as a linear system mapping inputs u into outputs y. $L^T S R$ is the singular value decomposition of P.\n\nHowever, for certain signal processing or control tasks, we might wish to find the SVD of an unknown matrix for which only input-output samples are available. For example, if we want to model a linear transmission channel with unknown properties, it would be useful to be able to approximate the SVD based on samples of the inputs and outputs of the channel. If the channel is time-varying, an iterative algorithm for approximating the SVD might be able to track slow variations.\n\n2 THE SINGULAR VALUE DECOMPOSITION\n\nThe SVD of a nonsymmetric matrix P is given by $P = L^T S R$, where L and R are matrices with orthogonal rows containing the left and right \"singular vectors\", and S is a diagonal matrix of \"singular values\". The inverse of P can be computed by inverting S, and approximations to P can be formed by setting the values of the smallest elements of S to zero.\n\nFor a memoryless linear system with inputs u and outputs $y = Pu$, we can write $y = L^T S R u$, which shows that R gives the \"input\" transformation from inputs to internal \"modes\", S gives the gain of the modes, and $L^T$ gives the \"output\" transformation which determines the effect of each mode on the output. Figure 1 shows a representation of this arrangement.\n\nThe goal of the two algorithms presented below is to train two linear neural networks N and G to find the SVD of P.  
In particular, the networks attempt to invert P by finding orthogonal matrices N and G such that $NG \\approx P^{-1}$, or $PNG = I$. A particular advantage of using the iterative algorithms described below is that it is possible to extract only the singular vectors associated with the largest singular values. Figure 2 depicts this situation, in which the matrix S is shown smaller to indicate a small number of significant singular values.\n\nFigure 2: Loop structure of the singular value decomposition for control. The plant is $P = L^T S R$, where R determines the mapping from control variables to system modes, and $L^T$ determines the outputs produced by each mode. The optimal sensory network is $G = L$, and the optimal motor network is $N = R^T S^{-1}$. R and L are shown as trapezoids to indicate that the number of nonzero elements of S (the \"modes\") may be less than the number of sensory variables y or motor variables u.\n\nThere is a close relationship with algorithms that find the eigenvalues of a symmetric matrix, since any such algorithm can be applied to $P P^T = L^T S^2 L$ and $P^T P = R^T S^2 R$ in order to find the left and right singular vectors. But in a behaving animal or operating robotic system it is generally not possible to compute the product with $P^T$, since the plant is an unknown component of the system. In the following, I will present two new iterative algorithms for finding the singular value decomposition of a matrix P given only samples of the inputs u and outputs y. 
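The relationships stated in this section can be checked numerically on a known matrix. The following is a minimal sketch using NumPy's batch SVD; it is not one of the sample-based algorithms proposed in this paper. The random 4x3 plant, the seed, and all variable names are assumptions made for illustration. It forms L, S, and R in the notation used here, builds G = L and N = R^T S^-1, and compares N G against the pseudo-inverse of P.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 3))      # hypothetical nonsquare plant matrix

# NumPy returns P = U @ diag(s) @ Vt. In the notation of this paper,
# L = U.T and R = Vt, so that P = L^T S R with orthonormal rows in L and R.
U, s, Vt = np.linalg.svd(P, full_matrices=False)
L, S, R = U.T, np.diag(s), Vt

reconstruction = L.T @ S @ R         # should reproduce P

# Sensory network G = L and motor network N = R^T S^-1.
G = L
N = R.T @ np.linalg.inv(S)
B = G @ P @ N                        # should be the identity (diagonal)
plant_inverse = N @ G                # should match the pseudo-inverse of P

# Eigenvalues of P^T P are the squared singular values.
eigvals = np.sort(np.linalg.eigvalsh(P.T @ P))[::-1]

# Approximation of P keeping only the largest singular value.
S1 = np.diag([s[0], 0.0, 0.0])
P1 = L.T @ S1 @ R

print(np.allclose(reconstruction, P))
print(np.allclose(plant_inverse, np.linalg.pinv(P)))
print(np.allclose(eigvals, s ** 2))
```

The symmetric eigenvalue check uses P^T P directly, which is exactly the product that is unavailable when the plant is unknown; the sketch only illustrates the target that the iterative algorithms aim for.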
\n\n3 THE DOUBLE GENERALIZED HEBBIAN ALGORITHM\n\nThe first algorithm is the Double Generalized Hebbian Algorithm (DGHA), and it is described by the two coupled difference equations\n\n$$\\Delta G = \\gamma (z y^T - \\mathrm{LT}[z z^T] G) \\qquad (1)$$\n$$\\Delta N^T = \\gamma (z u^T - \\mathrm{LT}[z z^T] N^T) \\qquad (2)$$\n\nwhere $\\mathrm{LT}[\\cdot]$ is an operator that sets the above-diagonal elements of its matrix argument to zero, $y = Pu$, $z = Gy$, and $\\gamma$ is a learning rate constant.\n\nEquation 1 is the Generalized Hebbian Algorithm (Sanger 1989), which finds the eigenvectors of the autocorrelation matrix of its inputs y. For random uncorrelated inputs u, the autocorrelation of y is $E[y y^T] = L^T S^2 L$, so equation 1 will cause G to converge to the matrix of left singular vectors L. Equation 2 is related to the Widrow-Hoff (1960) LMS rule for approximating $u^T$ from z, but it enforces orthogonality of the columns of N. It appears similar in form to equation 1, except that the intermediate variables z are computed from y rather than u. A graphical representation of the algorithm is given in figure 3. Equations 1 and 2 together cause N to converge to $R^T S^{-1}$, so that the combination $NG = R^T S^{-1} L$ is an approximation to the plant inverse.\n\nFigure 3: Graphic representation of the Double Generalized Hebbian Algorithm. G learns according to the usual GHA rule, while N learns using an orthogonalized form of the Widrow-Hoff LMS Rule.\n\nTheorem 1: (Sanger 1993) If $y = Pu$, $z = Gy$, and $E[u u^T] = I$, then equations 1 and 2 converge to the left and right singular vectors of P.\n\nProof:\nAfter convergence of equation 1, $E[z z^T]$ will be diagonal, so that $E[\\mathrm{LT}[z z^T]] = E[z z^T]$. Consider the Widrow-Hoff LMS rule for approximating $u^T$ from z:\n\n$$\\Delta N^T = \\gamma (z u^T - z z^T N^T).$$ 
\n\n(3)\n\nAfter convergence of G, this will be equivalent to equation 2, and will converge to the same attractor. The stable points of equation 3 occur when $E[u z^T - N z z^T] = 0$, for which $N = R^T S^{-1}$. \u2022\n\nThe convergence behavior of the Double Generalized Hebbian Algorithm is shown in figure 4. Results are measured by computing $B = GPN$ and determining whether B is diagonal using a score\n\n$$\\epsilon = \\frac{\\sum_{i \\neq j} b_{ij}^2}{\\sum_{i} b_{ii}^2}$$\n\nThe reduction in $\\epsilon$ is shown as a function of the number of (u, y) examples given to the network during training, and the curves in the figure represent the average over 100 training runs with different randomly-selected plant matrices P.\n\nFigure 4: Convergence of the Double Generalized Hebbian Algorithm averaged over 100 random choices of 3x3 or 10x10 matrices P.\n\nNote that the Double Generalized Hebbian Algorithm may perform poorly in the presence of noise or uncontrollable modes. The sensory mapping G depends only on the outputs y, and not directly on the plant inputs u. So if the outputs include noise or autonomously varying uncontrollable modes, then the mapping G will respond to these modes. This is not a problem if most of the variance in the output is due to the inputs u, since in that case the most significant output components will reflect the input variance transmitted through P.\n\n4 THE ORTHOGONAL ASYMMETRIC ENCODER\n\nThe second algorithm is the Orthogonal Asymmetric Encoder (OAE), which is described by the equations\n\n(4)\n\n(5)\n\nwhere $z = N^T u$.\n\nThis algorithm uses a variant of the Backpropagation learning algorithm (Rumelhart et al.  
1986). It is named for the \"Encoder\" problem in which a three-layer network is trained to approximate the identity mapping but is forced to use a narrow bottleneck layer. I define the \"Asymmetric Encoder Problem\" as the case in which a mapping other than the identity is to be learned while the data is passed through a bottleneck. The \"Orthogonal Asymmetric Encoder\" (OAE) is the special case in which the hidden units are forced to be uncorrelated over the data set. Figure 5 gives a graphical depiction of the algorithm.\n\nTheorem 2: (Sanger 1993) Equations 4 and 5 converge to the left and right singular vectors of P.\n\nProof:\nSuppose z has dimension m. If $P = L^T S R$ where the elements of S are distinct, and $E[u u^T] = I$, then a well-known property of the singular value decomposition (Golub and Van Loan 1983, for example) shows that\n\n$$E[\\| Pu - G^T N^T u \\|] \\qquad (6)$$\n\nis minimized when $G^T = L_m^T U$, $N^T = V R_m$, and U and V are any $m \\times m$ matrices for which $UV = S_m$, the $m \\times m$ block of S containing the m largest singular values. ($L_m^T$ and $R_m$ signify the matrices of only the first m columns of $L^T$ or rows of R.) If we want $E[z z^T]$ to be diagonal, then U and V must be diagonal. OAE accomplishes this by training the first hidden unit as if $m = 1$, the second as if $m = 2$, and so on.\n\nFor the case $m = 1$, the error in equation 6 is minimized when G is the first left singular vector of P and N is the first right singular vector. Since this is a linear approximation problem, there is a single global minimum to the error surface of equation 6, and gradient descent using the backpropagation algorithm will converge to this solution. 
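The m = 1 case can be checked numerically. The sketch below is an illustration under stated assumptions, not the OAE update rule itself. It takes the error of equation 6 as a squared norm (the exponent is an assumption), so that with E[u u^T] = I the expected error equals the squared Frobenius norm of P - g n^T. The scale s_1 is placed on n, which is one admissible way to split the product U V. The random 3x3 plant, the seed, and the helper name expected_error are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.standard_normal((3, 3))      # hypothetical plant matrix
U, s, Vt = np.linalg.svd(P)

def expected_error(g, n):
    # With E[u u^T] = I, E||P u - g n^T u||^2 is the squared Frobenius
    # norm of the residual matrix P - g n^T.
    return np.linalg.norm(P - np.outer(g, n)) ** 2

# Candidate from the SVD: g is the first left singular vector and
# n is the first right singular vector scaled by the largest singular value.
best = expected_error(U[:, 0], s[0] * Vt[0])

# The residual should equal the energy in the remaining singular values.
residual_energy = np.sum(s[1:] ** 2)

# Random rank-one candidates should never achieve a lower error.
trials = [
    expected_error(rng.standard_normal(3), rng.standard_normal(3))
    for _ in range(1000)
]

print(best, residual_energy, min(trials))
```

The residual left after removing the first component is what the second hidden unit is trained on in the argument that follows.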
\n\nFigure 5: The Orthogonal Asymmetric Encoder algorithm computes a forward approximation to the plant P through a bottleneck layer of hidden units.\n\nAfter convergence, the remaining error is $E[\\| (P - G^T N^T) u \\|]$. If we decompose the plant matrix as\n\n$$P = \\sum_{i=1}^{n} l_i s_i r_i^T$$\n\nwhere $l_i$ and $r_i$ are the rows of L and R, and $s_i$ are the diagonal elements of S, then the remaining error is\n\n$$P_2 = \\sum_{i=2}^{n} l_i s_i r_i^T$$\n\nwhich is equivalent to the original plant matrix with the first singular value set to zero. If we train the second hidden unit using $P_2$ instead of P, then minimization of $E[\\| P_2 u - G^T N^T u \\|]$ will yield the second left and right singular vectors. Proceeding in this way we can obtain the first m singular vectors.\n\nCombining the update rules for all the singular vectors so that they learn in parallel leads to the governing equations of the OAE algorithm, which can be written in matrix form as equations 4 and 5. \u2022\n\nBannour and Azimi-Sadjadi (1993) proposed a similar technique for the symmetric encoder problem in which each eigenvector is learned to convergence and then subtracted from the data before learning the succeeding one. The orthogonal asymmetric encoder is different because all the components learn simultaneously. After convergence, we must multiply the learned N by $S^{-2}$ in order to compute the plant inverse. Figure 6 shows the performance of the algorithm averaged over 100 random choices of matrix P.\n\nConsider the case in which there may be noise in the measured outputs y. Since the Orthogonal Asymmetric Encoder algorithm learns to approximate the forward plant transformation from u to y, it will only be able to predict the components of y which are related to the inputs u.  
In other words, the best approximation to y based on u is $\\hat{y} \\approx Pu$, and this ignores the noise term. Figure 7 shows the results of additive noise with an SNR of 1.0.\n\nFigure 6: Convergence of the Orthogonal Asymmetric Encoder averaged over 100 random choices of 3x3 or 10x10 matrices P.\n\nAcknowledgements\n\nThis report describes research done within the laboratory of Dr. Emilio Bizzi in the department of Brain and Cognitive Sciences at MIT. The author was supported during this work by a National Defense Science and Engineering Graduate Fellowship, and by NIH grants 5R37 AR26710 and 5R01NS09343 to Dr. Bizzi.\n\nReferences\n\nBannour S., Azimi-Sadjadi M. R., 1993, Principal component extraction using recursive least squares learning, submitted to IEEE Transactions on Neural Networks.\nGolub G. H., Van Loan C. F., 1983, Matrix Computations, North Oxford Academic P., Oxford.\nRumelhart D. E., Hinton G. E., Williams R. J., 1986, Learning internal representations by error propagation, In Parallel Distributed Processing, chapter 8, pages 318-362, MIT Press, Cambridge, MA.\nSanger T. D., 1989, Optimal unsupervised learning in a single-layer linear feedforward neural network, Neural Networks, 2:459-473.\nSanger T. D., 1993, Theoretical Elements of Hierarchical Control in Vertebrate Motor Systems, PhD thesis, MIT.\nWidrow B., Hoff M. E., 1960, Adaptive switching circuits, In IRE WESCON Conv. Record, Part 4, pages 96-104. 
\n\nFigure 7: Convergence of the Orthogonal Asymmetric Encoder with 50% additive noise on the outputs, averaged over 100 random choices of 3x3 or 10x10 matrices P.\n", "award": [], "sourceid": 869, "authors": [{"given_name": "Terence", "family_name": "Sanger", "institution": null}]}