{"title": "Two Iterative Algorithms for Computing the Singular Value Decomposition from Input/Output Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 144, "page_last": 151, "abstract": null, "full_text": "Two Iterative Algorithms for Computing \nthe Singular Value Decomposition from \nInput/Output Samples \n\nTerence D. Sanger \n\nJet Propulsion Laboratory \n\nMS 303-310 \n\n4800 Oak Grove Drive \nPasadena, CA 91109 \n\nAbstract \n\nThe Singular Value Decomposition (SVD) is an important tool for \nlinear algebra and can be used to invert or approximate matrices. \nAlthough many authors use \"SVD\" synonymously with \"Eigenvector \nDecomposition\" or \"Principal Components Transform\", it \nis important to realize that these other methods apply only to \nsymmetric matrices, while the SVD can be applied to arbitrary \nnonsquare matrices. This property is important for applications to \nsignal transmission and control. \nI propose two new algorithms for iterative computation of the SVD \ngiven only sample inputs and outputs from a matrix. Although \nthere currently exist many algorithms for Eigenvector Decomposition \n(Sanger 1989, for example), these are the first true sample-based \nSVD algorithms. \n\n1 INTRODUCTION \n\nThe Singular Value Decomposition (SVD) is a method for writing an arbitrary \nnonsquare matrix as the product of two orthogonal matrices and a diagonal matrix. \nThis technique is an important component of methods for approximating near-singular \nmatrices and computing pseudo-inverses. Several efficient techniques exist \nfor finding the SVD of a known matrix (Golub and Van Loan 1983, for example). 
\nHowever, for certain signal processing or control tasks, we might wish to find the \nSVD of an unknown matrix for which only input-output samples are available. \nFor example, if we want to model a linear transmission channel with unknown \nproperties, it would be useful to be able to approximate the SVD based on samples \nof the inputs and outputs of the channel. If the channel is time-varying, an iterative \nalgorithm for approximating the SVD might be able to track slow variations. \n\nFigure 1: Representation of the plant matrix P as a linear system mapping inputs \nu into outputs y. L^T S R is the singular value decomposition of P. \n\n2 THE SINGULAR VALUE DECOMPOSITION \n\nThe SVD of a nonsymmetric matrix P is given by P = L^T S R, where L and R are \nmatrices with orthogonal rows containing the left and right \"singular vectors\", and \nS is a diagonal matrix of \"singular values\". The inverse of P can be computed by \ninverting S, and approximations to P can be formed by setting the values of the \nsmallest elements of S to zero. \nFor a memoryless linear system with inputs u and outputs y = Pu, we can write \ny = L^T S R u, which shows that R gives the \"input\" transformation from inputs \nto internal \"modes\", S gives the gain of the modes, and L^T gives the \"output\" \ntransformation which determines the effect of each mode on the output. Figure 1 \nshows a representation of this arrangement. \n\nThe goal of the two algorithms presented below is to train two linear neural networks \nN and G to find the SVD of P. In particular, the networks attempt to invert P \nby finding orthogonal matrices N and G such that NG \u2248 P^{-1}, or PNG = I. 
A \nparticular advantage of using the iterative algorithms described below is that it is \npossible to extract only the singular vectors associated with the largest singular \nvalues. Figure 2 depicts this situation, in which the matrix S is shown smaller to \nindicate a small number of significant singular values. \n\nThere is a close relationship with algorithms that find the eigenvalues of a symmetric \nmatrix, since any such algorithm can be applied to P P^T = L^T S^2 L and P^T P = \nR^T S^2 R in order to find the left and right singular vectors. But in a behaving animal \nor operating robotic system it is generally not possible to compute the product with \nP^T, since the plant is an unknown component of the system. In the following, I will \npresent two new iterative algorithms for finding the singular value decomposition \nof a matrix P given only samples of the inputs u and outputs y. \n\nFigure 2: Loop structure of the singular value decomposition for control. The \nplant is P = L^T S R, where R determines the mapping from control variables to \nsystem modes, and L^T determines the outputs produced by each mode. The optimal \nsensory network is G = L, and the optimal motor network is N = R^T S^{-1}. R and L \nare shown as trapezoids to indicate that the number of nonzero elements of S (the \n\"modes\") may be less than the number of sensory variables y or motor variables u. \n\n3 THE DOUBLE GENERALIZED HEBBIAN ALGORITHM \n\nThe first algorithm is the Double Generalized Hebbian Algorithm (DGHA), and it \nis described by the two coupled difference equations \n\n$$\\Delta G = \\gamma\\,(z y^T - \\mathrm{LT}[z z^T]\\, G) \\qquad (1)$$ \n$$\\Delta N^T = \\gamma\\,(z u^T - \\mathrm{LT}[z z^T]\\, N^T) \\qquad (2)$$ \n\nwhere LT[ ] is an operator that sets the above-diagonal elements of its matrix \nargument to zero, y = Pu, z = Gy, and \u03b3 is a learning rate constant. 
\nEquation 1 is the Generalized Hebbian Algorithm (Sanger 1989), which finds the \neigenvectors of the autocorrelation matrix of its inputs y. For random uncorrelated \ninputs u, the autocorrelation of y is E[y y^T] = L^T S^2 L, so equation 1 will cause \nG to converge to the matrix of left singular vectors L. Equation 2 is related to \nthe Widrow-Hoff (1960) LMS rule for approximating u^T from z, but it enforces \northogonality of the columns of N. It appears similar in form to equation 1, except \nthat the intermediate variables z are computed from y rather than u. A graphical \nrepresentation of the algorithm is given in figure 3. Equations 1 and 2 together \ncause N to converge to R^T S^{-1}, so that the combination NG = R^T S^{-1} L is an \napproximation to the plant inverse. \nTheorem 1: (Sanger 1993) If y = Pu, z = Gy, and E[u u^T] = I, then equations 1 \nand 2 converge to the left and right singular vectors of P. \n\nFigure 3: Graphic representation of the Double Generalized Hebbian Algorithm. \nG learns according to the usual GHA rule, while N learns using an orthogonalized \nform of the Widrow-Hoff LMS Rule. \n\nProof: \nAfter convergence of equation 1, E[z z^T] will be diagonal, so that E[LT[z z^T]] = \nE[z z^T]. Consider the Widrow-Hoff LMS rule for approximating u^T from z: \n\n$$\\Delta N^T = \\gamma\\,(z u^T - z z^T N^T). \\qquad (3)$$ \n\nAfter convergence of G, this will be equivalent to equation 2, and will converge to \nthe same attractor. The stable points of equation 3 occur when E[u z^T - N z z^T] = 0, for \nwhich N = R^T S^{-1}. \n\u2022 \nThe convergence behavior of the Double Generalized Hebbian Algorithm is shown in \nfigure 4. Results are measured by computing B = GPN and determining whether \nB is diagonal using a score \n\n$$\\epsilon = \\frac{\\sum_{i \\neq j} b_{ij}^2}{\\sum_i b_{ii}^2}.$$ \n\nThe reduction in \u03b5 is shown as a function of the number of (u, y) examples given to \nthe network during training, and the curves in the figure represent the average over \n100 training runs with different randomly-selected plant matrices P. \n\nFigure 4: Convergence of the Double Generalized Hebbian Algorithm averaged over \n100 random choices of 3x3 or 10x10 matrices P. \n\nNote that the Double Generalized Hebbian Algorithm may perform poorly in the \npresence of noise or uncontrollable modes. The sensory mapping G depends only on \nthe outputs y, and not directly on the plant inputs u. So if the outputs include noise \nor autonomously varying uncontrollable modes, then the mapping G will respond \nto these modes. This is not a problem if most of the variance in the output is due \nto the inputs u, since in that case the most significant output components will reflect \nthe input variance transmitted through P. \n\n4 THE ORTHOGONAL ASYMMETRIC ENCODER \n\nThe second algorithm is the Orthogonal Asymmetric Encoder (OAE), which is \ndescribed by the equations \n\n(4) \n\n(5) \n\nwhere z = N^T u. \nThis algorithm uses a variant of the Backpropagation learning algorithm (Rumelhart \net al. 1986). It is named for the \"Encoder\" problem in which a three-layer network is \ntrained to approximate the identity mapping but is forced to use a narrow bottleneck \nlayer. I define the \"Asymmetric Encoder Problem\" as the case in which a mapping \nother than the identity is to be learned while the data is passed through a bottleneck. \nThe \"Orthogonal Asymmetric Encoder\" (OAE) is the special case in which the \nhidden units are forced to be uncorrelated over the data set. Figure 5 gives a \ngraphical depiction of the algorithm. 
\nTheorem 2: (Sanger 1993) Equations 4 and 5 converge to the left and right singular \nvectors of P. \n\nProof: \nSuppose z has dimension m. If P = L^T S R where the elements of S are distinct, \nand E[u u^T] = I, then a well-known property of the singular value decomposition \n(Golub and Van Loan 1983, for example) shows that \n\n$$E[\\,\\|Pu - G^T N^T u\\|\\,] \\qquad (6)$$ \n\nis minimized when G^T = L_m^T U, N^T = V R_m, and U and V are any m x m matrices \nfor which UV = S_m. (L_m^T and R_m signify the matrices of only the first m \ncolumns of L^T or rows of R.) If we want E[z z^T] to be diagonal, then U and V must \nbe diagonal. OAE accomplishes this by training the first hidden unit as if m = 1, \nthe second as if m = 2, and so on. \nFor the case m = 1, the error 6 is minimized when G is the first left singular vector \nof P and N is the first right singular vector. Since this is a linear approximation \nproblem, there is a single global minimum to the error surface 6, and gradient \ndescent using the backpropagation algorithm will converge to this solution. \n\nFigure 5: The Orthogonal Asymmetric Encoder algorithm computes a forward \napproximation to the plant P through a bottleneck layer of hidden units. \n\nAfter convergence, the remaining error is E[||(P - G^T N^T)u||]. If we decompose the \nplant matrix as \n\n$$P = \\sum_{i=1}^{n} l_i s_i r_i^T$$ \n\nwhere l_i and r_i are the rows of L and R, and s_i are the diagonal elements of S, then \nthe remaining error is \n\n$$P_2 = \\sum_{i=2}^{n} l_i s_i r_i^T$$ \n\nwhich is equivalent to the original plant matrix with the first singular value set to \nzero. If we train the second hidden unit using P_2 instead of P, then minimization of \nE[||P_2 u - G^T N^T u||] will yield the second left and right singular vectors. Proceeding \nin this way we can obtain the first m singular vectors. 
\n\nCombining the update rules for all the singular vectors so that they learn in parallel \nleads to the governing equations of the OAE algorithm, which can be written in \nmatrix form as equations 4 and 5. \n\u2022 \nBannour and Azimi-Sadjadi (1993) proposed a similar technique for the symmetric \nencoder problem in which each eigenvector is learned to convergence and then \nsubtracted from the data before learning the succeeding one. The orthogonal asymmetric \nencoder is different because all the components learn simultaneously. After \nconvergence, we must multiply the learned N by S^{-2} in order to compute the plant \ninverse. Figure 6 shows the performance of the algorithm averaged over 100 random \nchoices of matrix P. \n\nConsider the case in which there may be noise in the measured outputs y. Since \nthe Orthogonal Asymmetric Encoder algorithm learns to approximate the forward \nplant transformation from u to y, it will only be able to predict the components of \ny which are related to the inputs u. In other words, the best approximation to y \nbased on u is \u0177 \u2248 Pu, and this ignores the noise term. Figure 7 shows the results \nof additive noise with an SNR of 1.0. \n\nFigure 6: Convergence of the Orthogonal Asymmetric Encoder averaged over 100 \nrandom choices of 3x3 or 10x10 matrices P. \n\nAcknowledgements \n\nThis report describes research done within the laboratory of Dr. Emilio Bizzi in the \ndepartment of Brain and Cognitive Sciences at MIT. The author was supported during \nthis work by a National Defense Science and Engineering Graduate Fellowship, \nand by NIH grants 5R37 AR26710 and 5R01 NS09343 to Dr. Bizzi. 
\n\nReferences \n\nBannour S., Azimi-Sadjadi M. R., 1993, Principal component extraction using recursive \nleast squares learning, submitted to IEEE Transactions on Neural Networks. \nGolub G. H., Van Loan C. F., 1983, Matrix Computations, North Oxford Academic \nP., Oxford. \nRumelhart D. E., Hinton G. E., Williams R. J., 1986, Learning internal representations \nby error propagation, In Parallel Distributed Processing, chapter 8, pages \n318-362, MIT Press, Cambridge, MA. \nSanger T. D., 1989, Optimal unsupervised learning in a single-layer linear feedforward \nneural network, Neural Networks, 2:459-473. \nSanger T. D., 1993, Theoretical Elements of Hierarchical Control in Vertebrate \nMotor Systems, PhD thesis, MIT. \nWidrow B., Hoff M. E., 1960, Adaptive switching circuits, In IRE WESCON Conv. \nRecord, Part 4, pages 96-104. \n\nFigure 7: Convergence of the Orthogonal Asymmetric Encoder with 50% additive \nnoise on the outputs, averaged over 100 random choices of 3x3 or 10x10 matrices \nP. \n", "award": [], "sourceid": 869, "authors": [{"given_name": "Terence", "family_name": "Sanger", "institution": null}]}