{"title": "Multiplicative Updating Rule for Blind Separation Derived from the Method of Scoring", "book": "Advances in Neural Information Processing Systems", "page_first": 696, "page_last": 702, "abstract": null, "full_text": "Multiplicative Updating Rule for Blind Separation Derived from the Method of Scoring\n\nHoward Hua Yang\n\nDepartment of Computer Science\nOregon Graduate Institute\nPO Box 91000, Portland, OR 97291, USA\nhyang@cse.ogi.edu\n\nAbstract\n\nFor blind source separation, when the Fisher information matrix is used as the Riemannian metric tensor for the parameter space, the steepest descent algorithm to maximize the likelihood function in this Riemannian parameter space becomes the serial updating rule with the equivariant property. This algorithm can be further simplified by using the asymptotic form of the Fisher information matrix around the equilibrium.\n\n1 Introduction\n\nThe relative gradient was introduced by (Cardoso and Laheld, 1996) to design multiplicative updating algorithms with the equivariant property for blind separation problems. The idea is to calculate differentials by using a relative increment instead of an absolute increment in the parameter space. This idea has been extended to compute the relative Hessian by (Pham, 1996).\n\nFor a matrix function $f = f(W)$, the relative gradient is defined by\n\n$$\\nabla f = \\frac{\\partial f}{\\partial W} W^T. \\tag{1}$$\n\nFrom the differential of $f(W)$ based on the relative gradient, the following learning rule is given by (Cardoso and Laheld, 1996) to maximize the function $f$:\n\n$$\\frac{dW}{dt} = \\eta \\nabla f \\, W = \\eta \\frac{\\partial f}{\\partial W} W^T W. \\tag{2}$$\n\nAlso motivated by designing blind separation algorithms with the equivariant property, the natural gradient defined by\n\n$$\\tilde{\\nabla} f = \\frac{\\partial f}{\\partial W} W^T W \\tag{3}$$\n\nwas introduced in (Amari et al., 1996), which yields the same learning rule (2). The geometrical meaning of the natural gradient is given by (Amari, 1996). 
More details about the natural gradient can be found in (Yang and Amari, 1997) and (Amari, 1997).\n\nThe framework of natural gradient learning was proposed by (Amari, 1997). In this framework, the ordinary gradient descent learning algorithm in the Euclidean space is not optimal in minimizing a function defined in a Riemannian space. The ordinary gradient should be replaced by the natural gradient, which is defined by applying the inverse of the metric tensor of the Riemannian space to the ordinary gradient. Let $w$ denote a parameter vector. It is proved by (Amari, 1997) that if $C(w)$ is a loss function defined on a Riemannian space $\\{w\\}$ with a metric tensor $G$, the negative natural gradient of $C(w)$, namely $-G^{-1} \\frac{\\partial C}{\\partial w}$, is the steepest descent direction to decrease this function in the Riemannian space. Therefore, the steepest descent algorithm in this Riemannian space has the following form:\n\n$$\\frac{dw}{dt} = -\\eta G^{-1} \\frac{\\partial C}{\\partial w}.$$\n\nIf the Fisher information matrix is used as the metric tensor for the Riemannian space and $C(w)$ is replaced by the negative log-likelihood function, the above learning rule becomes the method of scoring (Kay, 1993), which is the focus of this paper.\n\nBoth the relative gradient $\\nabla$ and the natural gradient $\\tilde{\\nabla}$ were proposed in order to design multiplicative updating algorithms with the equivariant property. The former is due to a multiplicative increment in calculating the differential, while the latter is due to an increment based on a nonholonomic basis (Amari, 1997). Neither $\\nabla$ nor $\\tilde{\\nabla}$ depends on the data model. The Fisher information matrix is a special and important choice for the Riemannian metric tensor for statistical estimation problems. It depends on the data model. Applying the inverse of the Fisher information matrix to the ordinary gradient, we have another gradient operator. It is called a natural gradient induced by the Fisher information matrix. 
\n\nIn this paper, we show how to derive a multiplicative updating algorithm from the method of scoring. This approach is different from those based on the relative gradient and the natural gradient defined by (3).\n\n2 Fisher Information Matrix For Blind Separation\n\nConsider a linear mixing system:\n\n$$x = As$$\n\nwhere $A \\in \\mathbb{R}^{n \\times n}$, $x = (x_1, \\cdots, x_n)^T$ and $s = (s_1, \\cdots, s_n)^T$. Assume that the sources are independent with a factorized joint pdf:\n\n$$r(s) = \\prod_{i=1}^{n} r_i(s_i).$$\n\nThe likelihood function is\n\n$$p(x; A) = \\frac{r(A^{-1}x)}{|A|}$$\n\nwhere $|A| = |\\det(A)|$. Let $W = A^{-1}$ and $y = Wx$ (a demixing system), then we have the log-likelihood function\n\n$$L(W) = \\sum_{i=1}^{n} \\log r_i(y_i) + \\log |W|.$$\n\nIt is easy to obtain\n\n$$\\frac{\\partial L}{\\partial w_{ij}} = \\frac{r_i'(y_i)}{r_i(y_i)} x_j + W^{-T}_{ij} \\tag{4}$$\n\nwhere $W^{-T}_{ij}$ is the $(i,j)$ entry in $W^{-T} = (W^{-1})^T$. Writing (4) in a matrix form, we have\n\n$$\\frac{\\partial L}{\\partial W} = W^{-T} - \\Phi(y)x^T = (I - \\Phi(y)y^T)W^{-T} = F(y)W^{-T} \\tag{5}$$\n\nwhere $\\Phi(y) = (\\phi_1(y_1), \\cdots, \\phi_n(y_n))^T$, $\\phi_i(y_i) = -\\frac{r_i'(y_i)}{r_i(y_i)}$ and $F(y) = I - \\Phi(y)y^T$.\n\nThe maximum likelihood algorithm based on the ordinary gradient $\\frac{\\partial L}{\\partial W}$ is\n\n$$\\frac{dW}{dt} = \\eta (I - \\Phi(y)y^T)W^{-T} = \\eta F(y)W^{-T}$$\n\nwhich has high computational complexity due to the matrix inverse $W^{-1}$. The maximum likelihood algorithm based on the natural gradient of matrix functions is\n\n$$\\frac{dW}{dt} = \\eta \\tilde{\\nabla} L = \\eta (I - \\Phi(y)y^T)W. \\tag{6}$$\n\nThe same algorithm is obtained from $\\frac{dW}{dt} = \\eta \\nabla L \\, W$ by using the relative gradient. An apparent reason for using this algorithm is to avoid the matrix inverse $W^{-1}$. Another good reason for using it is that the matrix $W$ driven by (6) never becomes singular if the initial matrix $W$ is not singular. This is proved by (Yang and Amari, 1997). In fact, this property holds for any learning rule of the following type:\n\n$$\\frac{dW}{dt} = H(y)W. \\tag{7}$$\n\nLet $\\langle U, V \\rangle = \\mathrm{Tr}(U^T V)$ denote the inner product of $U$ and $V \\in \\mathbb{R}^{n \\times n}$. 
When $W(t)$ is driven by the equation (7), we have\n\n$$\\frac{d|W|}{dt} = \\left\\langle \\frac{\\partial |W|}{\\partial W}, \\frac{dW}{dt} \\right\\rangle = \\left\\langle |W|(W^{-1})^T, \\frac{dW}{dt} \\right\\rangle = \\mathrm{Tr}(|W| W^{-1} H(y) W) = \\mathrm{Tr}(H(y))|W|.$$\n\nTherefore,\n\n$$|W(t)| = |W(0)| \\exp\\left\\{ \\int_0^t \\mathrm{Tr}(H(y(\\tau))) \\, d\\tau \\right\\} \\tag{8}$$\n\nwhich is nonzero, so $W(t)$ is non-singular, when the initial matrix $W(0)$ is non-singular.\n\nThe matrix function $F(y)$ is also called an estimating function. At the equilibrium of the system (6), it satisfies the zero condition $E[F(y)] = 0$, i.e.,\n\n$$E[\\phi_i(y_i) y_j] = \\delta_{ij} \\tag{9}$$\n\nwhere $\\delta_{ij} = 1$ if $i = j$ and 0 otherwise.\n\nTo calculate the Fisher information matrix, we need a vector form of the equation (5). Let $\\mathrm{Vec}(\\cdot)$ denote an operator on a matrix which stacks the columns of the matrix from left to right to form a column vector. This operator has the following property:\n\n$$\\mathrm{Vec}(ABC) = (C^T \\otimes A)\\mathrm{Vec}(B) \\tag{10}$$\n\nwhere $\\otimes$ denotes the Kronecker product. Applying this property, we first rewrite (5) as\n\n$$\\frac{\\partial L}{\\partial \\mathrm{Vec}(W)} = \\mathrm{Vec}\\left(\\frac{\\partial L}{\\partial W}\\right) = (W^{-1} \\otimes I)\\mathrm{Vec}(F(y)), \\tag{11}$$\n\nand then obtain the Fisher information matrix\n\n$$G = E\\left[ \\frac{\\partial L}{\\partial \\mathrm{Vec}(W)} \\left( \\frac{\\partial L}{\\partial \\mathrm{Vec}(W)} \\right)^T \\right] = (W^{-1} \\otimes I) E[\\mathrm{Vec}(F(y))\\mathrm{Vec}^T(F(y))] (W^{-T} \\otimes I). \\tag{12}$$\n\nThe inverse of $G$ is\n\n$$G^{-1} = (W^T \\otimes I) D^{-1} (W \\otimes I) \\tag{13}$$\n\nwhere $D = E[\\mathrm{Vec}(F(y))\\mathrm{Vec}^T(F(y))]$.\n\n3 Natural Gradient Induced By Fisher Information Matrix\n\nDefine a Riemannian space\n\n$$\\mathcal{V} = \\{\\mathrm{Vec}(W); W \\in Gl(n)\\}$$\n\nin which the Fisher information matrix $G$ is used as its metric. Here, $Gl(n)$ is the space of all the $n \\times n$ invertible matrices.\n\nLet $C(W)$ be a matrix function to be minimized. It is shown by (Amari, 1997) that the steepest descent direction in the Riemannian space $\\mathcal{V}$ is $-G^{-1} \\frac{\\partial C}{\\partial \\mathrm{Vec}(W)}$.\n\nLet us define the natural gradient in $\\mathcal{V}$ by\n\n$$\\bar{\\nabla} C(W) = (W^T \\otimes I) D^{-1} (W \\otimes I) \\frac{\\partial C}{\\partial \\mathrm{Vec}(W)} \\tag{14}$$\n\nwhich is called the natural gradient induced by the Fisher information matrix. 
The time complexity of computing the natural gradient in the space $\\mathcal{V}$ is high since inverting the $n^2 \\times n^2$ matrix $D$ is needed.\n\nUsing the natural gradient in $\\mathcal{V}$ to maximize the likelihood function $L(W)$, or equivalently the method of scoring, from (11) and (14) we have the following learning rule\n\n$$\\mathrm{Vec}\\left(\\frac{dW}{dt}\\right) = \\eta (W^T \\otimes I) D^{-1} \\mathrm{Vec}(F(y)). \\tag{15}$$\n\nWe shall prove that the above learning rule has the equivariant property.\n\nDenote by $\\mathrm{Vec}^{-1}$ the inverse of the operator $\\mathrm{Vec}$. Let matrices $B$ and $A$ be of $n^2 \\times n^2$ and $n \\times n$, respectively. Denote by $B(i, \\cdot)$ the $i$-th row of $B$ and $B_i = \\mathrm{Vec}^{-1}(B(i, \\cdot))$, $i = 1, \\cdots, n^2$. Define an operator $B*$ as a mapping from $\\mathbb{R}^{n \\times n}$ to $\\mathbb{R}^{n \\times n}$:\n\n$$B * A = \\begin{bmatrix} \\langle B_1, A \\rangle & \\cdots & \\langle B_{n^2-n+1}, A \\rangle \\\\ \\vdots & \\ddots & \\vdots \\\\ \\langle B_n, A \\rangle & \\cdots & \\langle B_{n^2}, A \\rangle \\end{bmatrix}$$\n\nwhere $\\langle \\cdot, \\cdot \\rangle$ is the inner product in $\\mathbb{R}^{n \\times n}$. With the operation $*$, we have\n\n$$B\\,\\mathrm{Vec}(A) = \\begin{bmatrix} \\langle B_1, A \\rangle \\\\ \\vdots \\\\ \\langle B_{n^2}, A \\rangle \\end{bmatrix} = \\mathrm{Vec}\\left(\\mathrm{Vec}^{-1}\\left(\\begin{bmatrix} \\langle B_1, A \\rangle \\\\ \\vdots \\\\ \\langle B_{n^2}, A \\rangle \\end{bmatrix}\\right)\\right) = \\mathrm{Vec}(B * A),$$\n\ni.e.,\n\n$$B\\,\\mathrm{Vec}(A) = \\mathrm{Vec}(B * A).$$\n\nApplying the above relation, we first rewrite the equation (15) as\n\n$$\\mathrm{Vec}\\left(\\frac{dW}{dt}\\right) = \\eta (W^T \\otimes I) \\mathrm{Vec}(D^{-1} * F(y)),$$\n\nthen applying (10) to the above equation we obtain\n\n$$\\frac{dW}{dt} = \\eta (D^{-1} * F(y)) W. \\tag{16}$$\n\nTheorem 1 For the blind separation problem, the maximum likelihood algorithm based on the natural gradient induced by the Fisher information matrix or the method of scoring has the form (16), which is a multiplicative updating rule with the equivariant property.\n\nTo implement the algorithm (16), we estimate $D$ by sample average. Let $f_{ij}(y)$ be the $(i,j)$ entry in $F(y)$. A general form for the entries in $D$ is\n\n$$d_{ij,kl} = E[f_{ij}(y) f_{kl}(y)]$$\n\nwhich depends on the source pdfs $r_i(s_i)$. When the source pdfs are unknown, in practice we choose $r_i(s_i)$ as our prior assumptions about the source pdfs. To simplify the algorithm (16), we replace $D$ by its asymptotic form at the solution points $a = (c_1 s_{\\sigma(1)}, \\cdots, c_n s_{\\sigma(n)})^T$ where $(\\sigma(1), \\cdots, \\sigma(n))$ is a permutation of $(1, \\cdots, n)$.\n\nRegarding the structure of the asymptotic $D$, we have the following theorem:\n\nTheorem 2 Assume that the pdfs of the sources $s_i$ are even functions. Then at the solution point $a = (c_1 s_{\\sigma(1)}, \\cdots, c_n s_{\\sigma(n)})^T$, $D$ is a diagonal matrix and its $n^2$ diagonal entries have two forms, namely,\n\n$$E[f_{ij}(a) f_{ij}(a)] = \\mu_i \\lambda_j, \\quad \\text{for } i \\neq j \\text{ and}$$\n\n$$E[(f_{ii}(a))^2] = \\nu_i$$\n\nwhere $\\mu_i = E[\\phi_i^2(a_i)]$, $\\lambda_i = E[a_i^2]$ and $\\nu_i = E[\\phi_i^2(a_i) a_i^2] - 1$. More concisely, we have\n\n$$D = \\mathrm{diag}(\\mathrm{Vec}(H)) \\tag{17}$$\n\nwhere\n\n$$H = (\\mu_i \\lambda_j)_{n \\times n} - \\mathrm{diag}(\\mu_1 \\lambda_1, \\cdots, \\mu_n \\lambda_n) + \\mathrm{diag}(\\nu_1, \\cdots, \\nu_n).$$\n\nThe proof of Theorem 2 is given in Appendix 1.\n\nLet $H = (h_{ij})_{n \\times n}$. Since all $\\mu_i$, $\\lambda_i$, and $\\nu_i$ are positive, so are all $h_{ij}$. We define\n\n$$\\frac{1}{H} = \\left(\\frac{1}{h_{ij}}\\right)_{n \\times n}.$$\n\nThen from (17), we have\n\n$$D^{-1} = \\mathrm{diag}\\left(\\mathrm{Vec}\\left(\\frac{1}{H}\\right)\\right).$$\n\nThe results in Theorem 2 enable us to simplify the algorithm (16) to obtain a low complexity learning rule. Since $D^{-1}$ is a diagonal matrix, for any $n \\times n$ matrix $A$ we have\n\n$$D^{-1} \\mathrm{Vec}(A) = \\mathrm{Vec}\\left(\\frac{1}{H} \\odot A\\right) \\tag{18}$$\n\nwhere $\\odot$ denotes the componentwise multiplication of two matrices of the same dimension. Applying (18) to the learning rule (15), we obtain the following learning rule\n\n$$\\mathrm{Vec}\\left(\\frac{dW}{dt}\\right) = \\eta (W^T \\otimes I) \\mathrm{Vec}\\left(\\frac{1}{H} \\odot F(y)\\right).$$\n\nAgain, applying (10) to the above equation we have the following learning rule\n\n$$\\frac{dW}{dt} = \\eta \\left(\\frac{1}{H} \\odot F(y)\\right) W. \\tag{19}$$\n\nLike the learning rule (16), the algorithm (19) is also multiplicative; but unlike (16), there is no need to invert the $n^2 \\times n^2$ matrix in (19). The computation of $\\frac{1}{H}$ is straightforward by computing the reciprocals of the entries in $H$.\n\n$(\\mu_i, \\lambda_i, \\nu_i)$ are $3n$ unknowns in $G$. Let us impose the following constraint\n\n$$\\nu_i = \\mu_i \\lambda_i. \\tag{20}$$\n\nUnder this constraint, the number of unknowns in $G$ is $2n$, and $D$ can be written as 
\n\n$$D = D_\\lambda \\otimes D_\\mu \\tag{21}$$\n\nwhere $D_\\lambda = \\mathrm{diag}(\\lambda_1, \\cdots, \\lambda_n)$ and $D_\\mu = \\mathrm{diag}(\\mu_1, \\cdots, \\mu_n)$.\n\nFrom (14), using (21) we have the natural gradient descent rule in the Riemannian space $\\mathcal{V}$\n\n$$\\frac{d\\mathrm{Vec}(W)}{dt} = -\\eta (W^T D_\\lambda^{-1} W \\otimes D_\\mu^{-1}) \\frac{\\partial C}{\\partial \\mathrm{Vec}(W)}. \\tag{22}$$\n\nApplying the property (10), we rewrite the above equation in a matrix form\n\n$$\\frac{dW}{dt} = -\\eta D_\\mu^{-1} \\frac{\\partial C}{\\partial W} W^T D_\\lambda^{-1} W. \\tag{23}$$\n\nSince $\\mu_i$ and $\\lambda_i$ are unknown, $D_\\mu$ and $D_\\lambda$ are replaced by the identity matrix in practice. Therefore, the algorithm (2) is an approximation of the algorithm (23).\n\nTaking $C = -L(W)$ as the negative log-likelihood function and applying the expression (5), we have the following maximum likelihood algorithm based on the natural gradient in $\\mathcal{V}$:\n\n$$\\frac{dW}{dt} = \\eta D_\\mu^{-1} (I - \\Phi(y) y^T) D_\\lambda^{-1} W. \\tag{24}$$\n\nAgain, replacing $D_\\mu$ and $D_\\lambda$ by the identity matrix we obtain the maximum likelihood algorithm (6) based on the relative gradient or natural gradient of matrix functions.\n\nIn the context of the blind separation, the source pdfs are unknown. The prior assumption $r_i(s_i)$ used to define the functions $\\phi_i$ ... $\\phi_i(y_i) y_j$. At the equilibrium $a = (c_1 s_{\\sigma(1)}, \\cdots, c_n s_{\\sigma(n)})^T$, we have $E[\\phi_i(a_i) a_j] = 0$ for $i \\neq j$ and $E[\\phi_i(a_i) a_i] = 1$. So $E[f_{ij}(a)] = 0$. Since the source pdfs are even functions, we have $E[a_i] = 0$ and $E[\\phi_i(a_i)] = 0$. Applying these equalities, it is not difficult to verify that\n\n$$E[f_{ij}(a) f_{kl}(a)] = 0, \\quad \\text{for } (i,j) \\neq (k,l). \\tag{25}$$\n\nSo, $D$ is a diagonal matrix and\n\n$$E[f_{ii}(a) f_{ii}(a)] = E[(1 - \\phi_i(a_i) a_i)^2] = E[\\phi_i^2(a_i) a_i^2] - 1,$$\n\n$$E[f_{ij}(a) f_{ij}(a)] = E[\\phi_i^2(a_i) a_j^2] = \\mu_i \\lambda_j$$\n\nfor $i \\neq j$.\n\nQ.E.D.\n\nReferences\n\n[1] S. Amari. Natural gradient works efficiently in learning. Accepted by Neural Computation, 1997.\n\n[2] S. Amari. Neural learning in structured parameter spaces - natural Riemannian gradient. In Advances in Neural Information Processing Systems, 9, ed. M. C. Mozer, M. I. 
Jordan and T. Petsche, The MIT Press: Cambridge, MA., pages 127-133, 1997.\n\n[3] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems, 8, eds. David S. Touretzky, Michael C. Mozer and Michael E. Hasselmo, MIT Press: Cambridge, MA., pages 757-763, 1996.\n\n[4] J.-F. Cardoso. Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters, April 1997.\n\n[5] J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12):3017-3030, December 1996.\n\n[6] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. PTR Prentice Hall, Englewood Cliffs, 1993.\n\n[7] D. T. Pham. Blind separation of instantaneous mixture of sources via an ica. IEEE Trans. on Signal Processing, 44(11):2768-2779, November 1996.\n\n[8] H. H. Yang and S. Amari. Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(7):1457-1482, 1997.\n", "award": [], "sourceid": 1342, "authors": [{"given_name": "Howard", "family_name": "Yang", "institution": null}]}*