{"title": "Independent Components Analysis through Product Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 665, "page_last": 672, "abstract": "", "full_text": "Independent Components Analysis through Product Density Estimation

Trevor Hastie and Rob Tibshirani
Department of Statistics
Stanford University
Stanford, CA 94305
{hastie, tibs}@stat.stanford.edu

Abstract

We present a simple, direct approach to the ICA problem based on density estimation and maximum likelihood. Given a candidate orthogonal frame, we model each of the coordinates using a semi-parametric density estimate based on cubic splines. Since our estimates have two continuous derivatives, we can easily run a second-order search for the frame parameters. Our method performs very favorably when compared to state-of-the-art techniques.

1 Introduction

Independent component analysis (ICA) is a popular enhancement over principal component analysis (PCA) and factor analysis. In its simplest form, we observe a random vector X in R^p which is assumed to arise from a linear mixing of a latent random source vector S in R^p,

X = A S;   (1)

the components S_j, j = 1, ..., p of S are assumed to be independently distributed. The classical example of such a system is known as the "cocktail party" problem: several people are speaking, music is playing, and microphones around the room record a mix of the sounds. The ICA model is used to extract the original sources from these different mixtures.

Without loss of generality, we assume E(S) = 0 and Cov(S) = I, and hence Cov(X) = A A^T. Suppose S* = R S represents a transformed version of S, where R is p x p and orthogonal. Then with A* = A R^T we have X* = A* S* = A R^T R S = X.
Hence the second-order moments Cov(X) = A A^T = A* A*^T do not contain enough information to distinguish these two situations.

Model (1) is similar to the factor analysis model (Mardia, Kent & Bibby 1979), where S and hence X are assumed to have a Gaussian density, and inference is typically based on the likelihood of the observed data. The factor analysis model typically has fewer than p components, and includes an error component for each variable. While similar modifications are possible here as well, we focus on the full-component model in this paper. Two facts are clear:

• Since a multivariate Gaussian distribution is completely determined by its first and second moments, this model would not be able to distinguish A and A*. Indeed, in factor analysis one chooses from a family of factor rotations to select a suitably interpretable version.

• Because multivariate Gaussian distributions are completely specified by their second-order moments, if we hope to recover the original A, at least p - 1 of the components of S will have to be non-Gaussian.

Because of this lack of information in the second moments, the first step in an ICA model is typically to transform X to have a scalar covariance, i.e. to pre-whiten the data. From now on we assume Cov(X) = I, which implies that A is orthogonal. Suppose the density of S_j is f_j, j = 1, ..., p, where at most one of the f_j is Gaussian. Then the joint density of S is

f_S(s) = \prod_{j=1}^p f_j(s_j),   (2)

and since A is orthogonal, the joint density of X is

f_X(x) = \prod_{j=1}^p f_j(a_j^T x),   (3)

where a_j is the jth column of A. Equation (3) follows from S = A^T X, the orthogonality of A, and the fact that the determinant of this multivariate transformation is 1.
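Since the pre-whitening step recurs throughout the paper, a minimal NumPy sketch may help fix ideas. This is our own illustration, not the authors' Splus code, and the helper name `prewhiten` is ours:

```python
import numpy as np

def prewhiten(X):
    """Center the N x p data matrix and rotate/scale it so that the
    sample covariance is (approximately) the identity.  After this step
    the ICA mixing matrix can be taken to be orthogonal, as in the text."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; columns of U are orthonormal.
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    # sqrt(N) rescaling gives the columns unit variance.
    return np.sqrt(X.shape[0]) * U

rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(2000, 3))   # independent sources
A = rng.normal(size=(3, 3))                  # arbitrary mixing matrix
Z = prewhiten(S @ A.T)
print(np.round(np.cov(Z, rowvar=False), 2))  # close to the identity
```

After whitening, any remaining ambiguity in the frame is a rotation, which is exactly what the rest of the paper estimates.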
In this paper we fit the model (3) directly using semi-parametric maximum likelihood. We represent each of the densities f_j by an exponentially tilted Gaussian density (Efron & Tibshirani 1996),

f_j(s) = \phi(s) e^{g_j(s)},   (4)

where \phi is the standard univariate Gaussian density, and g_j is a smooth function, restricted so that f_j integrates to 1. We represent each of the functions g_j by a cubic smoothing spline, a rich class of smooth functions whose roughness is controlled by a penalty functional. These choices lead to an attractive and effective semi-parametric implementation of ICA:

• Given A, each of the components f_j in (3) can be estimated separately by maximum likelihood. Simple algorithms and standard software are available.

• The components g_j represent departures from Gaussianity, and the expected log-likelihood ratio between model (3) and the Gaussian density is given by E_X \sum_j g_j(a_j^T X), a flexible contrast function.

• Since the first and second derivatives of each of the estimated g_j are immediately available, second-order methods can be used for estimating the orthogonal matrix A. We use the fixed point algorithms described in Hyvarinen & Oja (1999).

• Our representation of the g_j as smoothing splines casts the estimation problem as density estimation in a reproducing kernel Hilbert space, an infinite family of smooth functions. This makes it directly comparable with the "Kernel ICA" approach of Bach & Jordan (2001), with the advantage that we have O(N) algorithms available for the computation of our contrast function and its first two derivatives.

In the remainder of this article, we describe the model in more detail and evaluate its performance on some simulated data.
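To make the tilted-Gaussian representation (4) concrete, here is a small numerical check (our own sketch, not part of the paper): with the tilt g(s) = \theta s - \theta^2/2, the density \phi(s) e^{g(s)} is exactly a unit-variance Gaussian with mean \theta, so it integrates to 1 and has mean \theta.

```python
import numpy as np

def phi(s):
    # standard univariate Gaussian density
    return np.exp(-0.5 * s**2) / np.sqrt(2.0 * np.pi)

def tilted(s, g):
    # exponentially tilted Gaussian density, as in (4)
    return phi(s) * np.exp(g(s))

theta = 1.3
g = lambda s: theta * s - theta**2 / 2.0   # a tilt that stays normalized

s = np.linspace(-10.0, 10.0, 20001)
ds = s[1] - s[0]
f = tilted(s, g)
print((f * ds).sum())       # ~ 1.0   (integrates to one)
print((s * f * ds).sum())   # ~ theta (mean shifted to theta)
```

For a general smooth g the normalization is not automatic, which is why the fitting procedure below imposes the integral constraint explicitly.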
2 Fitting the Product Density ICA model

Given a sample x_1, ..., x_N we fit the model (3),(4) by maximum penalized likelihood. The data are first transformed to have zero mean vector and identity covariance matrix using the singular value decomposition. We then maximize the criterion

\sum_{j=1}^p [ (1/N) \sum_{i=1}^N \{ \log \phi(a_j^T x_i) + g_j(a_j^T x_i) \} - \lambda_j \int \{g_j''(s)\}^2 ds ]   (5)

subject to

a_j^T a_k = \delta_{jk} for all j, k,   (6)

\int \phi(s) e^{g_j(s)} ds = 1 for all j.   (7)

For fixed a_j, and hence s_{ij} = a_j^T x_i, the solutions for g_j are known to be cubic splines with knots at each of the unique values of s_{ij} (Silverman 1986). The p terms decouple for fixed a_j, leaving us p separate penalized density estimation problems. We fit the functions g_j and directions a_j by optimizing (5) in an alternating fashion, as described in Algorithm 1. In step (a), we find the optimal g_j for fixed A; in step (b), we take a single fixed-point step towards the optimal A. In this sense Algorithm 1 can be seen to be maximizing the profile penalized log-likelihood w.r.t. A.

Algorithm 1 Product Density ICA algorithm

1. Initialize A (random Gaussian matrix followed by orthogonalization).
2. Alternate until convergence of A, using the Amari metric (16):
(a) Given A, optimize (5) w.r.t. g_j (separately for each j), using the penalized density estimation Algorithm 2.
(b) Given g_j, j = 1, ..., p, perform one step of the fixed point Algorithm 3 towards finding the optimal A.

2.1 Penalized density estimation

We focus on a single coordinate, with N observations s_i, i = 1, ..., N (where s_i = a_k^T x_i for some k). We wish to maximize

(1/N) \sum_{i=1}^N \{ \log \phi(s_i) + g(s_i) \} - \lambda \int \{g''(s)\}^2 ds   (8)

subject to \int \phi(s) e^{g(s)} ds = 1.
Silverman (1982) shows that one can incorporate the integration constraint by using the modified criterion (without a Lagrange multiplier)

(1/N) \sum_{i=1}^N \{ \log \phi(s_i) + g(s_i) \} - \int \phi(s) e^{g(s)} ds - \lambda \int \{g''(s)\}^2 ds.   (9)

Since (9) involves an integral, we need an approximation. We construct a fine grid of L values s_\ell^* in increments \Delta covering the observed values s_i, and let

y_\ell^* = #\{ s_i \in (s_\ell^* - \Delta/2, s_\ell^* + \Delta/2) \} / N.   (10)

Typically we pick L to be 1000, which is more than adequate. We can then approximate (9) by

\sum_{\ell=1}^L \{ y_\ell^* [ \log \phi(s_\ell^*) + g(s_\ell^*) ] - \Delta \phi(s_\ell^*) e^{g(s_\ell^*)} \} - \lambda \int \{g''(s)\}^2 ds.   (11)

This last expression can be seen to be proportional to a penalized Poisson log-likelihood with response y_\ell^*/\Delta, penalty parameter \lambda/\Delta, and mean \mu(s) = \phi(s) e^{g(s)}. This is a generalized additive model (Hastie & Tibshirani 1990), with an offset term \log \phi(s), and can be fit using a Newton algorithm in O(L) operations. As with other GAMs, the Newton algorithm is conveniently re-expressed as an iteratively reweighted penalized least squares regression problem, which we give in Algorithm 2.

Algorithm 2 Iteratively reweighted penalized least squares algorithm for fitting the tilted Gaussian spline density model.

1. Initialize g = 0.
2. Repeat until convergence:
(a) Let \mu(s_\ell^*) = \phi(s_\ell^*) e^{g(s_\ell^*)}, \ell = 1, ..., L, and w_\ell = \mu(s_\ell^*).
(b) Define the working response

z_\ell = g(s_\ell^*) + ( y_\ell^*/\Delta - \mu(s_\ell^*) ) / \mu(s_\ell^*).   (12)

(c) Update g by solving the weighted penalized least squares problem

\min_g \sum_{\ell=1}^L w_\ell ( z_\ell - g(s_\ell^*) )^2 + (2\lambda/\Delta) \int \{g''(s)\}^2 ds.   (13)

This amounts to fitting a weighted smoothing spline to the pairs (s_\ell^*, z_\ell) with weights w_\ell and tuning parameter 2\lambda/\Delta.
Although other semi-parametric regression procedures could be used in (13), the cubic smoothing spline has several advantages:

• It has knots at all L of the pseudo observation sites s_\ell^*. The values s_\ell^* can be fixed for all terms in the model (5), and so a certain amount of pre-computation can be performed. Despite the large number of knots, and hence basis functions, the local support of the B-spline basis functions allows the solution to (13) to be obtained in O(L) computations.

• The first and second derivatives of g are immediately available, and are used in the second-order search for the direction a_j in Algorithm 1.

• As an alternative to choosing a value for \lambda, we can control the amount of smoothing through the effective number of parameters, given by the trace of the linear operator matrix implicit in (13) (Hastie & Tibshirani 1990).

• It can also be shown that, because of the form of (9), the resulting density inherits the mean and variance of the data (0 and 1); details will be given in a longer version of this paper.

2.2 A fixed point method for finding the orthogonal frame

For fixed functions g_j, the penalty term in (5) does not play a role in the search for A. Since all of the columns a_j of any A under consideration are mutually orthogonal with unit norm, the Gaussian component

\sum_{i=1}^N \sum_{j=1}^p \log \phi(a_j^T x_i)

does not depend on A (for orthogonal A, \sum_j (a_j^T x_i)^2 = ||x_i||^2).
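This invariance is easy to verify numerically (our own two-line check): the Gaussian part of the log-likelihood takes the same value for any orthogonal frame.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 4))

def gauss_loglik(A, X):
    # sum_i sum_j log phi(a_j^T x_i)
    S = X @ A
    return (-0.5 * S**2 - 0.5 * np.log(2.0 * np.pi)).sum()

# two unrelated random orthogonal frames give the same value,
# since sum_j (a_j^T x)^2 = ||x||^2 when A is orthogonal
A1 = np.linalg.qr(rng.normal(size=(4, 4)))[0]
A2 = np.linalg.qr(rng.normal(size=(4, 4)))[0]
print(gauss_loglik(A1, X) - gauss_loglik(A2, X))   # ~ 0
```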
Hence what remains to be optimized can be seen as the log-likelihood ratio between the fitted model and the Gaussian model, which is simply

C(A) = \sum_{j=1}^p (1/N) \sum_{i=1}^N \hat g_j(a_j^T x_i).   (14)

Since the choice of each g_j improves the log-likelihood relative to the Gaussian, C(A) is nonnegative, and zero only if, for the particular value of A, the log-likelihood cannot distinguish the tilted model from a Gaussian model. C(A) has the form of a sum of contrast functions for detecting departures from Gaussianity. Hyvarinen, Karhunen & Oja (2001) refer to the expected log-likelihood ratio as the negentropy, and use simple contrast functions to approximate it in their FastICA algorithm. Our regularized approach can be seen as a way to construct a flexible contrast function adaptively, using a large set of basis functions.

Algorithm 3 Fixed point update for A.

1. For j = 1, ..., p:

a_j <- E\{ X \hat g_j'(a_j^T X) \} - E\{ \hat g_j''(a_j^T X) \} a_j,   (15)

where E represents expectation w.r.t. the sample x_i, and a_j is the jth column of A.

2. Orthogonalize A: compute its SVD, A = U D V^T, and replace A <- U V^T.

Since we have first and second derivatives available for each \hat g_j, we can mimic exactly the fast fixed point algorithm developed in Hyvarinen et al. (2001, page 189); see Algorithm 3. Figure 1 shows the optimization criterion C of (14), as well as the two criteria used to approximate negentropy in FastICA by Hyvarinen et al. (2001, page 184). While the latter two agree with C quite well for the uniform example (left panel), they both fail on the mixture-of-Gaussians example, while C is also successful there.
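Algorithm 3 is compact in matrix form. The sketch below is our own, with the FastICA-style surrogate g'(s) = tanh(s), g''(s) = 1 - tanh^2(s) standing in for the derivatives of a fitted spline tilt; it recovers a random 2-d rotation of uniform sources.

```python
import numpy as np

def fixed_point_update(A, X, g_prime, g_dblprime):
    """One sweep of the fixed-point update (15) plus the symmetric
    orthogonalization step, for pre-whitened data X (N x p)."""
    S = X @ A                                  # s_ij = a_j^T x_i
    A_new = np.empty_like(A)
    for j in range(A.shape[1]):
        gp, gpp = g_prime(S[:, j]), g_dblprime(S[:, j])
        # a_j <- E{ X g_j'(a_j^T X) } - E{ g_j''(a_j^T X) } a_j
        A_new[:, j] = (X * gp[:, None]).mean(axis=0) - gpp.mean() * A[:, j]
    U, _, Vt = np.linalg.svd(A_new)            # A <- U V^T
    return U @ Vt

rng = np.random.default_rng(4)
N, theta = 4000, np.pi / 6
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(N, 2))  # unit-variance sources
A0 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])           # true orthogonal frame
X = S @ A0.T                                               # mixed, already white

A = np.eye(2)
for _ in range(30):
    A = fixed_point_update(A, X, np.tanh, lambda s: 1.0 - np.tanh(s)**2)

# A should match A0 up to sign and permutation of its columns
print(np.round(A0.T @ A, 2))
```

In ProDenICA the tanh surrogate would be replaced by the derivatives of the fitted \hat g_j from Algorithm 2, re-estimated after every frame update.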
Figure 1: The optimization criteria and solutions found for two different examples in R^2 using FastICA and our ProDenICA. G1 and G2 refer to the two functions used to define negentropy in FastICA. In the left example (Uniforms) the independent components are uniformly distributed; in the right (Gaussian Mixtures), a mixture of Gaussians. In the left plot, all the procedures found the correct frame; in the right plot, only the spline-based approach was successful. The vertical lines indicate the solutions found, and the two tick marks at the top of each plot indicate the true angles.

3 Comparisons with FastICA

In this section we evaluate the performance of the product density approach (ProDenICA) by mimicking some of the simulations performed by Bach & Jordan (2001) to demonstrate their Kernel ICA approach. Here we compare ProDenICA only with FastICA; a future expanded version of this paper will include comparisons with other ICA procedures as well.

The left panel in Figure 2 shows the 18 distributions used as a basis of comparison. These exactly or very closely approximate those used by Bach & Jordan (2001). For each distribution, we generated a pair of independent components (N = 1024), and a random mixing matrix in R^2 with condition number between 1 and 2. We used our Splus implementation of the FastICA algorithm, using the negentropy criterion based on the nonlinearity G1(s) = log cosh(s), and the symmetric orthogonalization scheme as in Algorithm 3 (Hyvarinen et al. 2001, Section 8.4.3).
Our ProDenICA method is also implemented in Splus. For both methods we used five random starts (without iterations). Each of the algorithms delivers an orthogonal mixing matrix A (the data were pre-whitened), which is available for comparison with the generating orthogonalized mixing matrix A_0. We used the Amari metric (Bach & Jordan 2001) as a measure of the closeness of the two frames:

d(A_0, A) = (1/2p) \sum_{i=1}^p ( \sum_{j=1}^p |r_{ij}| / \max_j |r_{ij}| - 1 ) + (1/2p) \sum_{j=1}^p ( \sum_{i=1}^p |r_{ij}| / \max_i |r_{ij}| - 1 ),   (16)

where r_{ij} = (A_0^{-1} A)_{ij}. The right panel in Figure 2 shows boxplots of the pairwise differences d(A_0, A_F) - d(A_0, A_P) (x 100), where the subscripts F and P denote FastICA and ProDenICA respectively. ProDenICA is competitive with FastICA in all situations, and dominates in most of the mixture simulations. The average Amari error (x 100) for FastICA was 13.4 (2.7), compared with 3.0 (0.4) for ProDenICA (Bach & Jordan (2001) report averages of 6.2 for FastICA, and 3.8 and 2.9 for their two KernelICA methods).

Figure 2: The left panel shows the eighteen distributions used for comparisons. These include the "t", uniform, exponential, mixtures of exponentials, and symmetric and asymmetric Gaussian mixtures. The right panel shows boxplots of the improvement of ProDenICA over FastICA in each case, using the Amari metric, based on 30 simulations in R^2 for each distribution.

We also ran 300 simulations in R^4, using N = 1000, and selecting four of the 18 distributions at random.
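For reference, the Amari metric (16) is a few lines of NumPy (our own sketch). Note that r = A_0^{-1} A is exactly a scaled permutation, and hence the metric is zero, precisely when the two frames agree up to sign and ordering of the columns:

```python
import numpy as np

def amari_error(A0, A):
    """Amari metric (16): zero iff A equals A0 up to a signed scaling
    and permutation of its columns."""
    R = np.abs(np.linalg.inv(A0) @ A)          # |r_ij|
    p = R.shape[0]
    rows = (R.sum(axis=1) / R.max(axis=1) - 1.0).sum()
    cols = (R.sum(axis=0) / R.max(axis=0) - 1.0).sum()
    return (rows + cols) / (2.0 * p)

# sanity check: a signed permutation of the columns scores (numerically) zero
A0 = np.linalg.qr(np.random.default_rng(5).normal(size=(3, 3)))[0]
P = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])               # a signed permutation
print(amari_error(A0, A0 @ P))                # 0.0 up to round-off
```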
The average Amari error (x 100) for FastICA was 26.1 (1.5), compared with 9.3 (0.6) for ProDenICA (Bach & Jordan (2001) report averages of 19 for FastICA, and 13 and 9 for their two KernelICA methods).

4 Discussion

The ICA model stipulates that, after a suitable orthogonal transformation, the data are independently distributed. We implement this specification directly using semi-parametric product-density estimation. Our model delivers estimates of both the mixing matrix A and the densities of the independent components.

Many approaches to ICA, including FastICA, are based on minimizing approximations to entropy. The argument, given in detail in Hyvarinen et al. (2001) and reproduced in Hastie, Tibshirani & Friedman (2001), starts with minimizing the mutual information, the KL divergence between the full density and its independence version. FastICA uses very simple approximations based on a single (or a small number of) non-linear contrast functions, which work well for a variety of situations, but not at all well for the more complex Gaussian mixtures. The log-likelihood for the spline-based product-density model can be seen as a direct estimate of the mutual information; it uses the empirical distribution of the observed data to represent their joint density, and the product-density model to represent the independence density. This approach works well in both the simple and complex situations automatically, at a very modest increase in computational effort. As a side benefit, the form of our tilted Gaussian density estimate allows our log-likelihood criterion to be interpreted as an estimate of negentropy, a measure of departure from the Gaussian.
Bach & Jordan (2001) combine a nonparametric density approach (via reproducing kernel Hilbert function spaces) with a complex measure of independence based on the maximal correlation. Their procedure requires O(N^3) computations, compared to our O(N). They motivate their independence measures as approximations to the mutual information. Since smoothing splines are exactly function estimates in a RKHS, our method shares this flexibility with their Kernel approach (and is in fact a "Kernel" method). Our objective function, however, is a much simpler estimate of the mutual information. In the simulations we have performed so far, it seems we achieve comparable accuracy.

References

Bach, F. & Jordan, M. (2001), Kernel independent component analysis, Technical Report UCB/CSD-01-1166, Computer Science Division, University of California, Berkeley.

Efron, B. & Tibshirani, R. (1996), 'Using specially designed exponential families for density estimation', Annals of Statistics 24(6), 2431-2461.

Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall.

Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York.

Hyvarinen, A., Karhunen, J. & Oja, E. (2001), Independent Component Analysis, Wiley, New York.

Hyvarinen, A. & Oja, E. (1999), 'Independent component analysis: Algorithms and applications', Neural Networks.

Mardia, K., Kent, J. & Bibby, J. (1979), Multivariate Analysis, Academic Press.

Silverman, B. (1982), 'On the estimation of a probability density function by the maximum penalized likelihood method', Annals of Statistics 10(3), 795-810.

Silverman, B.
(1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall.", "award": [], "sourceid": 2155, "authors": [{"given_name": "Trevor", "family_name": "Hastie", "institution": null}, {"given_name": "Rob", "family_name": "Tibshirani", "institution": null}]}