{"title": "Algorithms for Independent Components Analysis and Higher Order Statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 491, "page_last": 497, "abstract": null, "full_text": "Algorithms for Independent Components \n\nAnalysis and Higher Order Statistics \n\nDaniel D. Lee \n\nBell Laboratories \n\nLucent Technologies \nMurray Hill, NJ 07974 \n\nUri Rokni and Haim Sompolinsky \n\nRacah Institute of Physics and \nCenter for Neural Computation \n\nHebrew University \n\nJerusalem, 91904, Israel \n\nAbstract \n\nA  latent  variable  generative  model  with  finite  noise  is  used  to  de(cid:173)\nscribe  several  different  algorithms  for  Independent Components  Anal(cid:173)\nysis  (lCA).  In  particular,  the  Fixed  Point  ICA  algorithm  is  shown  to \nbe equivalent to  the Expectation-Maximization algorithm for maximum \nlikelihood  under certain  constraints,  allowing  the conditions for  global \nconvergence to  be  elucidated.  The algorithms can  also  be explained by \ntheir generic  behavior  near  a  singular point where the  size  of the  opti(cid:173)\nmal generative bases vanishes.  An expansion of the likelihood about this \nsingular point indicates the role of higher order correlations in  determin(cid:173)\ning the features discovered by  ICA. The application and convergence of \nthese algorithms are demonstrated on a simple illustrative example. \n\nIntroduction \n\nIndependent Components Analysis (lCA) has generated much recent theoretical and prac(cid:173)\ntical  interest because of its successes on  a number of different signal processing problems. \nICA attempts to  decompose the observed data into components that are  as  statistically  in(cid:173)\ndependent from each other as possible, and  can be viewed  as a nonlinear generalization of \nPrincipal Components Analysis (PCA). Some applications of ICA include blind separation \nof audio  signals,  beamforming of radio  sources,  and  discovery  of features  in  biomedical \ntraces [I] . \n\nThere  have  also  been  a  number  of approaches to  deriving  algorithms  for  ICA  [2,  3,  4]. \nFundamentally, they all consider the problem of recovering independent source signals {s} \nfrom observations {x} such that: \n\nM \n\nXi  = L WijS j ,  i  = l..N \n\nj = 1 \n\n(I) \n\nHere, Wij  is  a N  x  M  mixing matrix where the number of sources M  is  not greater than \nthe dimensionality N  of the observations.  Thus, the columns of W  represent the different \nindependent features present in  the observed data. \n\nBell and  Sejnowski formulated their Infomax algorithm for ICA as maximizing the mutual \ninformation between  the data  and  a nonlinearly  transformed  version  of the  data  [5].  The \n\n\f492 \n\nD. D.  Lee.  U.  Rokni and H.  Sompolinsky \n\ncovariant version  of this  algorithm uses  the  natural  gradient of the  mutual  information to \niteratively  update  the  estimate  for  the  demixing  matrix  W- 1  in  terms  of the  estimated \ncomponentss =  W - 1x  [6]: \n\n.6.W-1 ex:  [1  - (g(s)sT)]  W- 1, \n\n(2) \n\nThe  nonlinearity  g( s)  differentiates  the  features  learned  by  the  lnfomax  ICA  algorithm \nfrom  those  found  by  conventional  PCA.  Fortunately,  the  exact  form  of the  nonlinearity \nused  in  Eq.  2  is  not crucial  for  the  success  of the  algorithm,  as  long  as  it  preserves  the \nsub-Gaussian or super-Gaussian nature of the sources [7] . 
Another approach to ICA due to Hyvarinen and Oja was derived from maximizing objective functions motivated by projection pursuit [8]. Their Fixed Point ICA algorithm attempts to self-consistently solve for the extremum of a nonlinear objective function. The simplest formulation considers a single source M = 1, so that the mixing matrix is a single vector w, constrained to be unit length |w| = 1. Assuming the data is first preprocessed and whitened, the Fixed Point ICA algorithm iteratively updates the estimate of w as follows:

w \leftarrow \langle x\, g(w^T x) \rangle - \lambda_G w, \qquad w \leftarrow \frac{w}{|w|},   (3)

where g(w^T x) is a nonlinear function and \lambda_G is a constant given by the integral over the Gaussian:

\lambda_G = \int \frac{dv}{\sqrt{2\pi}}\, e^{-v^2/2}\, g'(v).   (4)

The Fixed Point algorithm can be extended to an arbitrary number M \leq N of sources by using Eq. 3 in a serial deflation scheme. Alternatively, the M columns of the mixing matrix W can be updated simultaneously by orthogonalizing the N x M matrix:

W \leftarrow \langle x\, g(W^T x) \rangle - \lambda_G W.   (5)

Under the assumption that the observed data match the underlying ICA model, x = W s, it has been shown that the Fixed Point algorithm converges locally to the correct solution with at least quadratic convergence. However, the global convergence of the generic Fixed Point ICA algorithm is uncertain. This is in contrast to the gradient-based Infomax algorithm, whose convergence is guaranteed as long as a sufficiently small step size is chosen.
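A small numpy sketch of the single-unit fixed-point iteration of Eqs. 3 and 4 follows, again as an added illustration rather than the authors' code. It assumes whitened data X of shape (N, T), uses g = tanh as an example nonlinearity, and estimates the Gaussian integral \lambda_G by simple Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(1)

    def gaussian_average_gprime(g_prime, n=200000):
        # lambda_G: the average of g'(v) over the standard Gaussian (Eq. 4),
        # estimated here by Monte Carlo rather than quadrature.
        v = rng.standard_normal(n)
        return g_prime(v).mean()

    def fixed_point_step(w, X, lam_G):
        # One Fixed Point ICA update for a single unit-length vector w (Eq. 3),
        # assuming X has already been whitened so that <x x^T> = I.
        g = np.tanh(w @ X)                        # g(w^T x) for every sample
        w_new = (X * g).mean(axis=1) - lam_G * w  # <x g(w^T x)> - lambda_G w
        return w_new / np.linalg.norm(w_new)      # renormalize to unit length

    # Example usage on whitened data X of shape (N, T):
    # lam_G = gaussian_average_gprime(lambda v: 1.0 - np.tanh(v) ** 2)
    # w = rng.standard_normal(X.shape[0]); w /= np.linalg.norm(w)
    # for _ in range(30):
    #     w = fixed_point_step(w, X, lam_G)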
In this paper, we first review the latent variable generative model framework for Independent Components Analysis. We then consider the generative model in the presence of finite noise, and show how the Fixed Point ICA algorithm can be related to an Expectation-Maximization algorithm for maximum likelihood. This allows us to elucidate the conditions under which the Fixed Point algorithm is guaranteed to globally converge. Assuming that the data are indeed generated from independent components, we derive the optimal parameters for convergence. We also investigate how the optimal size of the ICA mixing matrix varies as a function of the added noise, and demonstrate the presence of a singular point. By expanding the likelihood about this singular point, the behavior of the ICA algorithms can be related to the higher order statistics present in the data. Finally, we illustrate the application and convergence of these ICA algorithms on some artificial data.

Generative model

A convenient method for interpreting the different ICA algorithms is in terms of the hidden, or latent, variable generative model shown in Fig. 1 [9, 10].

Figure 1: Generative model for ICA algorithms. s are the M hidden variables, \eta are additive Gaussian noise terms, and x = W s + \eta are the N visible variables.

The hidden variables {s_j} correspond to the different independent components and are assumed to have the factorized non-Gaussian prior probability distribution:

P(s) = \prod_{j=1}^{M} e^{-F(s_j)}.   (6)

Once the hidden variables are instantiated, the visible variables {x_i} are generated via a linear mapping through the generative weights W:

P(x|s) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \Big( x_i - \sum_j W_{ij} s_j \Big)^2 \right],   (7)

where \sigma^2 is the variance of the Gaussian noise added to the visible variables.

The probability of the data given this model is then calculated by integrating over all possible values of the hidden variables:

P(x) = \int ds\, P(s) P(x|s) = (2\pi\sigma^2)^{-N/2} \int ds\, \exp\left[ -F(s) - \frac{1}{2\sigma^2} (x - W s)^2 \right].   (8)

In the limit that the added noise vanishes, \sigma^2 \to 0, it has previously been shown that maximizing the likelihood of Eq. 8 is equivalent to the Infomax algorithm in Eq. 2 [11]. In the following analysis, we will consider the situation when the variance of the noise is nonzero, \sigma^2 \neq 0.

Expectation-Maximization

We assume that the data has initially been preprocessed and spherized: \langle x_i x_j \rangle = \delta_{ij}. Unfortunately, for finite noise \sigma^2 and an arbitrary prior F(s_j), deriving a learning rule for W in closed form is analytically intractable. However, it becomes possible to derive a simple Expectation-Maximization (EM) learning rule under the constraint:

W = \Sigma W_0, \qquad W_0^T W_0 = I,   (9)

which implies that W is orthogonal, and \Sigma is the length of the individual columns of W. Indeed, for data that obeys the ICA model, x = W s, it can be shown that the optimal W must satisfy this orthogonality condition. By assuming the constraint in Eq. 9 for arbitrary data, the posterior distribution P(s|x) becomes conveniently factorized:

P(s|x) \propto \prod_{j=1}^{M} \exp\left[ -F(s_j) + \frac{1}{\sigma^2} \Big( (W^T x)_j s_j - \frac{1}{2} \Sigma^2 s_j^2 \Big) \right].   (10)

For the E-step, this factorized form allows the expectation function \int ds\, P(s|x)\, s = g(W^T x) to be analytically evaluated. This expectation is then used in the M-step to find the new estimate W':

\langle x\, g(W^T x)^T \rangle - W' \Lambda_s = 0,   (11)

where \Lambda_s is a symmetric matrix of Lagrange multipliers that constrain the new W' to be orthogonal. Eq. 11 is easily solved by taking the reduced singular value decomposition of the rectangular matrix:

\langle x\, g(W^T x)^T \rangle = U D V^T,   (12)

where U^T U = V V^T = I and D is a diagonal M x M matrix. Then the solution for the EM estimate of the mixing matrix is given by:

W' = U V^T,   (13)
\Lambda_s = V D V^T.   (14)

As a specific example, consider the following prior for binary hidden variables: P(s) = \frac{1}{2}[\delta(s-1) + \delta(s+1)]. In this case, the expectation \int ds\, P(s|x)\, s = \tanh(W^T x / \sigma^2), and so the EM update rule is given by orthogonalizing the matrix:

W \leftarrow \langle x \tanh(W^T x / \sigma^2) \rangle.   (15)
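The EM step of Eqs. 11-15 can be sketched in a few lines of numpy; this is an added illustration rather than code from the paper. Only the orthonormal directions W_0 are updated here, via the reduced SVD of Eqs. 12-13, and the column length \Sigma is left out.

    import numpy as np

    def em_step_binary_prior(W, X, sigma2):
        # One EM step under the orthogonality constraint of Eq. 9, for the
        # binary prior P(s) = [delta(s - 1) + delta(s + 1)] / 2 (Eq. 15).
        # X is the spherized data, shape (N, T); W has orthonormal columns, shape (N, M).
        S_exp = np.tanh((W.T @ X) / sigma2)   # E-step: posterior mean of s given x
        M = (X @ S_exp.T) / X.shape[1]        # <x g(W^T x)^T>, an N x M matrix
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ Vt                         # orthogonalized update, Eqs. 12-13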
Fixed Point ICA

Besides the presence of the linear term \lambda_G W in Eq. 5, the EM update rule looks very much like that of the Fixed Point ICA algorithm. It turns out that without this linear term, the convergence of the naive EM algorithm is much slower than that of Eq. 5. Here we show that it is possible to interpret the role of this linear term in the Fixed Point ICA algorithm within the framework of this generative model.

Suppose that the distribution of the observed data P_D(x) is actually a mixture between an isotropic distribution P_0(x) and a non-isotropic distribution P_1(x):

P_D(x) = a P_0(x) + (1 - a) P_1(x).   (16)

Because the isotropic part does not break rotational symmetry, it does not affect the choice of the directions of the learned basis W. Thus, it is more efficient to apply the learning algorithm to only the non-isotropic portion of the distribution, P_1(x) \propto P_D(x) - a P_0(x), rather than to the whole observed distribution P_D(x). Applying EM to P_1(x) results in a correction term arising from the subtracted isotropic distribution. With this correction, the EM update becomes:

W \leftarrow \langle x\, g(W^T x) \rangle - a \lambda_G W,   (17)

which is equivalent to the Fixed Point ICA algorithm when a = 1.

Unfortunately, it is not clear how to compute an appropriate value for a to use in fitting data. Taking a very small value, a << 1, will result in a learning rule that is very similar to the naive EM update rule. This implies that the algorithm will be guaranteed to monotonically converge, albeit very slowly, to a local maximum of the likelihood. On the other hand, choosing a large value, a >> 1, will result in a subtracted probability density P_1(x) that is negative everywhere. In this case, the algorithm will converge slowly to a local minimum of the likelihood. For the Fixed Point algorithm, which operates in the intermediate regime a \approx 1, the algorithm is likely to converge most rapidly. However, it is also in this situation that the subtracted density P_1(x) could have both positive and negative regions, and the algorithm is no longer guaranteed to converge.

Figure 2: Size of the optimal generative bases as a function of the added noise \sigma^2, showing the singular point behavior around \sigma_c^2 \approx 1.

Optimal value of a

In order to determine the optimal value of a, we make the assumption that the observed data obeys the ICA model, x = A s. Note that the statistics of the sources in the data need not match the assumed prior distribution of the sources in the generative model Eq. 6. With this assumption, which is not related to the mixture assumption in Eq. 16, it is easy to show that W = A is a fixed point of the algorithm. By analyzing the behavior of the algorithm in the vicinity of this fixed point, a simple expression emerges for the change in deviations from this fixed point, \delta W, after a single iteration of Eq. 17:

\delta W_{ij} \leftarrow \frac{\langle g'(s) \rangle - a \lambda_G}{\langle s\, g(s) \rangle - a \lambda_G}\, \delta W_{ij} + O(\delta W^3),   (18)

where the averaging here is over the true source distribution, assumed for simplicity to be identical for all sources. Thus, the algorithm converges most rapidly if one chooses:

a_{opt} = \frac{\langle g'(s) \rangle}{\lambda_G},   (19)

so that the local convergence is cubic. From Eq. 18 one can show that the condition for the stability of the fixed point is given by a < a_c, where:

a_c = \frac{\langle s\, g(s) \rangle + \langle g'(s) \rangle}{2 \lambda_G}.   (20)

Thus, for a = 0, the stability criterion in Eq. 18 is equivalent to \langle s\, g(s) \rangle > \langle g'(s) \rangle. For the cubic nonlinearity g(s) = s^3, this implies that the algorithm will find the true independent features only if the source distribution has positive kurtosis.
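As a quick numerical check of Eqs. 19 and 20, a_opt and a_c can be estimated by Monte Carlo for a chosen nonlinearity and source distribution. The sketch below is an added illustration, not part of the paper; the tanh nonlinearity and the unit-variance uniform sources are assumed choices.

    import numpy as np

    rng = np.random.default_rng(2)

    def a_opt_and_a_c(g, g_prime, sample_sources, n=200000):
        # Monte Carlo estimates of Eqs. 19 and 20:
        #   a_opt = <g'(s)> / lambda_G
        #   a_c   = (<s g(s)> + <g'(s)>) / (2 lambda_G)
        # where lambda_G is the Gaussian average of g' (Eq. 4) and the other
        # averages are over the true (unit-variance) source distribution.
        s = sample_sources(n)
        v = rng.standard_normal(n)
        lam_G = g_prime(v).mean()
        return (g_prime(s).mean() / lam_G,
                (s * g(s) + g_prime(s)).mean() / (2.0 * lam_G))

    # Example: tanh nonlinearity with uniform sources on [-sqrt(3), sqrt(3)].
    uniform = lambda n: rng.uniform(-np.sqrt(3), np.sqrt(3), n)
    a_opt, a_c = a_opt_and_a_c(np.tanh, lambda v: 1.0 - np.tanh(v) ** 2, uniform)
    # For this choice a_opt comes out near 0.9, consistent with the value
    # quoted below for the two-dimensional artificial data example.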
Singular point expansion

Let us now consider how the optimal size \Sigma of the weights W varies as a function of the noise parameter \sigma^2. For very small \sigma^2 << 1, the weights W are approximately described by the Infomax algorithm of Eq. 2, and the lengths of the columns should be unity in order to match the covariance of the data. For large \sigma^2 >> 1, however, the optimal size of the weights should be very small because the covariance of the noise is already larger than that of the data. In fact, for Factor Analysis, which is a special case of the generative model with F(s) = \frac{1}{2} s^2 in Eq. 6, it can be shown that the weights are exactly zero, W = 0, for \sigma^2 > 1.

Thus, the size of the optimal generative weights W varies with \sigma^2 as shown qualitatively in Fig. 2. Above a certain critical noise value \sigma_c^2 \approx 1, the weights are exactly equal to zero, W = 0. Only below this critical value do the weights become nonzero. We expand the likelihood of the generative model in the vicinity of this singular point. This expansion is well-behaved because the size of the generative weights W acts as a small perturbative parameter in this expansion. The log likelihood of the model around this singular value is then given by:

L = -\frac{1}{4} \mathrm{Tr}\left[ W W^T - (1 - \sigma^2) I \right]^2 + \frac{1}{4!} \sum_{ijklm} \mathrm{kurt}(s_m)\, \langle x_i x_j x_k x_l \rangle_c\, W_{im} W_{jm} W_{km} W_{lm} + O\left( (1 - \sigma^2)^3 \right),   (21)

where kurt(s_m) represents the kurtosis of the prior distribution over the hidden variables. Note that this expansion is valid for any symmetric prior, and differs from other expansions that assume small deviations from a Gaussian prior [12, 13]. Eq. 21 shows the importance of the fourth-order cumulant of the observed data in breaking the rotational degeneracy of the weights W. The generic behavior of ICA is manifest in optimizing the cumulant term in Eq. 21, and again depends crucially on the sign of the kurtosis that is used for the prior.

Example with artificial data

As an illustration of the convergence of the algorithm in Eq. 17, we consider the simple two-dimensional uniform distribution:

P(x_1, x_2) = 1/12 for -\sqrt{3} \leq x_1, x_2 \leq \sqrt{3}, and 0 otherwise.   (22)

With g(s) = \tanh(s) as the nonlinearity, Fig. 3 shows how the overall likelihood converges for different values of the parameter a as the algorithm is iterated. For a \lesssim 1.0, the algorithm converges to a maximum of the likelihood, with the fastest convergence at a_{opt} = 0.9. However, for a > 1.2, the algorithm converges to a minimum of the likelihood. At an intermediate value, a = 1.1, the likelihood does not converge at all, fluctuating wildly between the maximum and minimum likelihood solutions. The maximum likelihood solution shows the basis vectors in W aligned with the sides of the square distribution, whereas the minimum likelihood solution has the basis aligned with the diagonals. These solutions can also be understood as maximizing and minimizing the kurtosis terms in Eq. 21.

Figure 3: Convergence of the modified EM algorithm as a function of a. With g(s) = \tanh(s) as the nonlinearity, the likelihood \langle \ln \cosh(W^T x) \rangle is plotted as a function of the iteration number. The optimal basis W are plotted on the two-dimensional data distribution when the likelihood is maximized (top) and minimized (bottom).
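The following short script is an added illustration of this example, not the authors' code: it draws samples from Eq. 22 and iterates the modified EM update of Eq. 17 with symmetric orthogonalization, reporting the quantity \langle \ln \cosh(W^T x) \rangle tracked in Fig. 3. The sample size, random seed, and iteration count are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)

    # Data drawn from Eq. 22: uniform on the square [-sqrt(3), sqrt(3)]^2,
    # which is already spherized (zero mean, unit covariance).
    X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20000))

    # lambda_G of Eq. 4 for g = tanh, estimated by Monte Carlo.
    lam_G = (1.0 - np.tanh(rng.standard_normal(200000)) ** 2).mean()

    def modified_em_step(W, a):
        # Eq. 17 followed by symmetric orthogonalization (Eqs. 12-13).
        M = (X @ np.tanh(W.T @ X).T) / X.shape[1] - a * lam_G * W
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ Vt

    for a in (0.9, 1.1, 1.5):
        W = np.linalg.qr(rng.normal(size=(2, 2)))[0]   # random orthogonal start
        for t in range(15):
            W = modified_em_step(W, a)
        likelihood = np.log(np.cosh(W.T @ X)).mean()   # proxy for <ln cosh(W^T x)>
        print(a, likelihood)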
Discussion

The utility of the latent variable generative model is demonstrated on deriving algorithms for ICA. By constraining the generative weights to be orthogonal, an EM algorithm is analytically obtained. By interpreting the data to be fitted as a mixture of isotropic and non-isotropic parts, a simple correction to the EM algorithm is derived. Under certain conditions, this modified algorithm is equivalent to the Fixed Point ICA algorithm, and converges much more rapidly than the naive EM algorithm. The optimal parameter for convergence is derived assuming the data is consistent with the ICA generative model. There also exists a critical value for the noise parameter in the generative model, about which a controlled expansion of the likelihood is possible. This expansion makes clear the role of higher order statistics in determining the generic behavior of different ICA algorithms.

We acknowledge the support of Bell Laboratories, Lucent Technologies, the US-Israel Binational Science Foundation, and the Israel Science Foundation. We also thank Hagai Attias, Simon Haykin, Juha Karhunen, Te-Won Lee, Erkki Oja, Sebastian Seung, Boris Shraiman, and Oren Shriki for helpful discussions.

References

[1] Haykin, S (1999). Neural networks: a comprehensive foundation. 2nd ed., Prentice-Hall, Upper Saddle River, NJ.
[2] Jutten, C & Herault, J (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24, 1-10.
[3] Comon, P (1994). Independent component analysis: a new concept? Signal Processing 36, 287-314.
[4] Roth, Z & Baram, Y (1996). Multidimensional density shaping by sigmoids. IEEE Trans. Neural Networks 7, 1291-1298.
[5] Bell, AJ & Sejnowski, TJ (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159.
[6] Amari, S, Cichocki, A & Yang, H (1996). A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems 8, 757-763.
[7] Lee, TW, Girolami, M & Sejnowski, TJ (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation 11, 609-633.
[8] Hyvarinen, A & Oja, E (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483-1492.
[9] Hinton, G & Ghahramani, Z (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions Royal Society B 352, 1177-1190.
[10] Attias, H (1998). Independent factor analysis. Neural Computation 11, 803-851.
[11] Pearlmutter, B & Parra, L (1996). A context-sensitive generalization of ICA. In ICONIP '96, 151-157.
[12] Nadal, JP & Parga, N (1997). Redundancy reduction and independent component analysis: conditions on cumulants and adaptive approaches. Neural Computation 9, 1421-1456.
[13] Cardoso, JF (1999). High-order contrasts for independent component analysis. Neural Computation 11, 157-192.
\n\n\f", "award": [], "sourceid": 1639, "authors": [{"given_name": "Daniel", "family_name": "Lee", "institution": null}, {"given_name": "Uri", "family_name": "Rokni", "institution": null}, {"given_name": "Haim", "family_name": "Sompolinsky", "institution": null}]}