{"title": "A Non-linear Information Maximisation Algorithm that Performs Blind Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 467, "page_last": 474, "abstract": null, "full_text": "A Non-linear Information Maximisation Algorithm that Performs Blind Separation.\n\nAnthony J. Bell\ntony@salk.edu\n\nTerrence J. Sejnowski\nterry@salk.edu\n\nComputational Neurobiology Laboratory\nThe Salk Institute\n10010 N. Torrey Pines Road\nLa Jolla, California 92037-1099\n\nand\n\nDepartment of Biology\nUniversity of California at San Diego\nLa Jolla CA 92093\n\nAbstract\n\nA new learning algorithm is derived which performs online stochastic gradient ascent in the mutual information between outputs and inputs of a network. In the absence of a priori knowledge about the 'signal' and 'noise' components of the input, propagation of information depends on calibrating network non-linearities to the detailed higher-order moments of the input density functions. By incidentally minimising mutual information between outputs, as well as maximising their individual entropies, the network 'factorises' the input into independent components. As an example application, we have achieved near-perfect separation of ten digitally mixed speech signals. Our simulations lead us to believe that our network performs better at blind separation than the Herault-Jutten network, reflecting the fact that it is derived rigorously from the mutual information objective.\n\n1 Introduction\n\nUnsupervised learning algorithms based on information theoretic principles have tended to focus on linear decorrelation (Barlow & Foldiak 1989) or maximisation of signal-to-noise ratios assuming Gaussian sources (Linsker 1992). 
With the exception of (Becker 1992), there has been little attempt to use non-linearity in networks to achieve something a linear network could not.\n\nNon-linear networks, however, are capable of computing more general statistics than those second-order ones involved in decorrelation, and as a consequence they are capable of dealing with signals (and noises) which have detailed higher-order structure. The success of the 'H-J' networks at blind separation (Jutten & Herault 1991) suggests that it should be possible to separate statistically independent components, by using learning rules which make use of moments of all orders.\n\nThis paper takes a principled approach to this problem, by starting with the question of how to maximise the information passed on in a non-linear feed-forward network. Starting with an analysis of a single unit, the approach is extended to a network mapping N inputs to N outputs. In the process, it will be shown that, under certain fairly weak conditions, the N → N network forms a minimally redundant encoding of the inputs, and that it therefore performs Independent Component Analysis (ICA).\n\n2 Information maximisation\n\nThe information that output Y contains about input X is defined as:\n\nI(Y, X) = H(Y) - H(Y|X)   (1)\n\nwhere H(Y) is the entropy (information) in the output, while H(Y|X) is whatever information the output has which didn't come from the input. In the case that we have no noise (or rather, we don't know what is noise and what is signal in the input), the mapping between X and Y is deterministic and H(Y|X) has its lowest possible value of -∞. 
Despite this, we may still differentiate eq.1 as follows (see [5]):\n\n∂/∂w I(Y, X) = ∂/∂w H(Y)   (2)\n\nThus in the noiseless case, the mutual information can be maximised by maximising the entropy alone.\n\n2.1 One input, one output.\n\nConsider an input variable, x, passed through a transforming function, g(x), to produce an output variable, y, as in Fig.1(a). In the case that g(x) is monotonically increasing or decreasing (ie: has a unique inverse), the probability density function (pdf) of the output f_y(y) can be written as a function of the pdf of the input f_x(x), (Papoulis, eq. 5-5):\n\nf_y(y) = f_x(x) / (∂y/∂x)   (3)\n\nThe entropy of the output, H(y), is given by:\n\nH(y) = -E[ln f_y(y)] = -∫ f_y(y) ln f_y(y) dy, integrated from -∞ to ∞   (4)\n\nwhere E[.] denotes expected value. Substituting eq.3 into eq.4 gives\n\nH(y) = E[ln(∂y/∂x)] - E[ln f_x(x)]   (5)\n\nThe second term on the right may be considered to be unaffected by alterations in a parameter, w, determining g(x). Therefore in order to maximise the entropy of y by changing w, we need only concentrate on maximising the first term, which is the average log of how the input affects the output. 
This can be done by considering the 'training set' of x's to approximate the density f_x(x), and deriving an 'online', stochastic gradient ascent learning rule:\n\nΔw ∝ ∂H/∂w = ∂/∂w (ln(∂y/∂x)) = (∂y/∂x)^(-1) ∂/∂w (∂y/∂x)   (6)\n\nIn the case of the logistic transfer function y = (1 + e^(-u))^(-1), u = wx + w_0, in which the input is multiplied by a weight w and added to a bias-weight w_0, the terms above evaluate as:\n\n∂y/∂x = w y(1 - y)   (7)\n\n∂/∂w (∂y/∂x) = y(1 - y)(1 + w x(1 - 2y))   (8)\n\nDividing eq.8 by eq.7 gives the learning rule for the logistic function, as calculated from the general rule of eq.6:\n\nΔw ∝ 1/w + x(1 - 2y)   (9)\n\nSimilar reasoning leads to the rule for the bias-weight:\n\nΔw_0 ∝ 1 - 2y   (10)\n\nThe effect of these two rules can be seen in Fig. 1a. For example, if the input pdf f_x(x) was gaussian, then the Δw_0-rule would centre the steepest part of the sigmoid curve on the peak of f_x(x), matching input density to output slope, in a manner suggested intuitively by eq.3. The Δw-rule would then scale the slope of the sigmoid curve to match the variance of f_x(x). For example, narrow pdfs would lead to sharply-sloping sigmoids. The Δw-rule is basically anti-Hebbian (footnote 1), with an anti-decay term. The anti-Hebbian term keeps y away from one uninformative situation: that of y being saturated to 0 or 1. But anti-Hebbian rules alone make weights go to zero, so the anti-decay term (1/w) keeps y away from the other uninformative situation: when w is so small that y stays around 0.5. The effect of these two balanced forces is to produce an output pdf f_y(y) which is close to the flat unit distribution, which is the maximum entropy distribution for a variable bounded between 0 and 1. Fig. 1b illustrates a family of these distributions, with the highest entropy one occurring at w_opt.\n\nFootnote 1: If y = tanh(wx + w_0) then Δw ∝ 1/w - 2xy.\n\nFigure 1: (a) Optimal information flow in sigmoidal neurons (Schraudolph et al 1992). Input x having density function f_x(x), in this case a gaussian, is passed through a non-linear function g(x). The information in the resulting density, f_y(y), depends on matching the mean and variance of x to the threshold and slope of g(x). In (b) f_y(y) is plotted for different values of the weight w. The optimal weight, w_opt, transmits most information.\n\nA rule which maximises information for one input and one output may be suggestive for structures such as synapses and photoreceptors which must position the gain of their non-linearity at a level appropriate to the average value and size of the input fluctuations. However, to see the advantages of this approach in artificial neural networks, we now analyse the case of multi-dimensional inputs and outputs.\n\n2.2 N inputs, N outputs.\n\nConsider a network with an input vector x, a weight matrix W and a monotonically transformed output vector y = g(Wx + w_0). Analogously to eq.3, the multivariate probability density function of y can be written (Papoulis, eq. 6-63):\n\nf_y(y) = f_x(x) / |J|   (11)\n\nwhere |J| is the absolute value of the Jacobian of the transformation. The Jacobian is the determinant of the matrix of partial derivatives:\n\n        [ ∂y_1/∂x_1  ...  ∂y_1/∂x_n ]\nJ = det [    ...     ...     ...    ]   (12)\n        [ ∂y_n/∂x_1  ...  ∂y_n/∂x_n ]\n\nThe derivation proceeds as in the previous section except instead of maximising ln(∂y/∂x), now we maximise ln|J|. For sigmoidal units, y = (1 + e^(-u))^(-1), u = Wx + w_0, the resulting learning rules are familiar in form:\n\nΔW ∝ [W^T]^(-1) + (1 - 2y) x^T   (13)\n\nΔw_0 ∝ 1 - 2y   (14)\n\nexcept that now x, y, w_0 and 1 are vectors (1 is a vector of ones), W is a matrix, and the anti-Hebbian term has become an outer product. The anti-decay term has generalised to the inverse of the transpose of the weight matrix. For an individual weight, w_ij, this rule amounts to:\n\nΔw_ij ∝ cof w_ij / det W + x_j(1 - 2y_i)   (15)\n\nwhere cof w_ij, the cofactor of w_ij, is (-1)^(i+j) times the determinant of the matrix obtained by removing the ith row and the jth column from W.\n\nThis rule is the same as the one for the single unit mapping, except that instead of w = 0 being an unstable point of the dynamics, now any degenerate weight matrix is unstable, since det W = 0 if W is degenerate. This fact enables different output units, y_i, to learn to represent different components in the input. When the weight vectors entering two output units become too similar, det W becomes small and the natural dynamic of learning causes these weight vectors to diverge. This effect is mediated by the numerator, cof w_ij. When this cofactor becomes small, it indicates that there is a degeneracy in the weight matrix of the rest of the layer (ie: those weights not associated with input x_j or output y_i). In this case, any degeneracy in W has less to do with the specific weight w_ij that we are adjusting. 
\n\n3 Finding independent components - blind separation\n\nMaximising the information contained in a layer of units involves maximising the entropy of the individual units while minimising the mutual information (the redundancy) between them. Considering two units:\n\nH(y_1, y_2) = H(y_1) + H(y_2) - I(y_1, y_2)   (16)\n\nFor I(y_1, y_2) to be zero, y_1 and y_2 must be statistically independent of each other, so that f_{y1 y2}(y_1, y_2) = f_{y1}(y_1) f_{y2}(y_2). Achieving such a representation is variously called factorial code learning, redundancy reduction (Barlow 1961, Atick 1992), or independent component analysis (ICA), and in the general case of continuously valued variables of arbitrary distributions, no learning algorithm has been shown to converge to such a representation.\n\nOur method will converge to a minimum redundancy, factorial representation as long as the individual entropy terms in eq.16 do not override the redundancy term, making an I(y_1, y_2) = 0 solution sub-optimal. One way to ensure this cannot occur is if we have a priori knowledge of the general form of the pdfs of the independent components. Then we can tailor our choice of node-function to be optimal for transmitting information about these components. For example, unit distributions require piecewise linear node-functions for highest H(y), while the more common gaussian forms require roughly sigmoidal curves. Once we have chosen our node-functions appropriately, we can be sure that an output node y cannot have higher entropy by representing some combination of independent components than by representing just one. When this condition is satisfied for all output units, the residual goal, of minimising the mutual information between the outputs, will dominate the learning. See [5] for further discussion of this.\n\n[Figure 2, panel (a): sources s pass through an unknown mixing process A to give the inputs x; blind separation with learnt weights W gives the outputs y. Panel (b): WA after learning:]\n\n-4.09   0.13   0.09  -0.07  -0.01\n 0.07  -2.92   0.00   0.02  -0.06\n 0.02  -0.02  -0.06  -0.08  -2.20\n 0.02   0.03   0.00   1.97   0.02\n-0.07   0.14  -3.50  -0.01   0.04\n\nFigure 2: (a) In blind separation, sources, s, have been linearly scrambled by a matrix, A, to form the inputs to the network, x. We must recover the sources at our output y, by somehow inverting the mapping A with our weight matrix W. The problem: we know nothing about A or the sources. (b) A successful 'unscrambling' occurs when WA is a 'permutation' matrix. This one resulted from separating five speech signals with our algorithm.\n\nWith this caveat in mind, we turn to the problem of blind separation (Jutten & Herault 1991), illustrated in Fig.2. A set of sources, s_1, ..., s_N, (different people speaking, music, noise etc) are presumed to be mixed approximately linearly so that all we receive is N superpositions of them, x_1, ..., x_N, which are input to our single-layer information maximisation network. Providing the mixing matrix, A, is non-singular then the original sources can be recovered if our weight matrix, W, is such that WA is a 'permutation' matrix containing only one high value in each row and column.\n\nUnfortunately we know nothing about the sources or the mixing matrix. 
However, if the sources are statistically independent and non-gaussian, then the information in the output nodes will be maximised when each output transmits one independent component only. This problem cannot be solved in general by linear decorrelation techniques such as those of (Barlow & Foldiak 1989) since second-order statistics can only produce symmetrical decorrelation matrices.\n\nWe have tested the algorithm in eq.13 and eq.14 on digitally mixed speech signals, and it reliably converges to separate the individual sources. In one example, five separately-recorded speech signals from three individuals were sampled at 8kHz. Three-second segments from each were linearly mixed using a matrix of random values between 0.2 and 4. Each resulting mixture formed an incomprehensible babble. Time points were generated at random, and for each, the corresponding five mixed values were entered into the network, and weights were altered according to eq.13 and eq.14. After on the order of 500,000 points were presented, the network had converged so that WA was the matrix shown in Fig.2b. As can be seen from the permutation structure of this matrix, on average, 95% of each output unit is dedicated to one source only, with each unit carrying a different source. Any residual interference from the four other sources was inaudible.\n\nWe have not yet performed any systematic studies on rates of convergence or existence of local minima. However the algorithm has converged to separate N independent components in all our tests (for 2 ≤ N ≤ 10). In contrast, we have not been able to obtain convergence of the H-J network on our data set for N > 2. 
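
One way to make the 'permutation structure' of WA concrete is sketched below in Python with NumPy. It measures, for each output unit (row of WA), the share of the total absolute weight held by the largest entry, and checks that every row picks a different source. This dominance measure, the function names and the 0.9 threshold are assumptions introduced here; the text does not say how the 95% figure was computed. The matrix is the one read from Fig.2b, with the third-row entry taken as -2.20, which is an assumption about a poorly reproduced value.

```python
import numpy as np

def dominance_fractions(wa):
    # Share of each row's total absolute weight held by its largest entry.
    mags = np.abs(np.asarray(wa, dtype=float))
    return mags.max(axis=1) / mags.sum(axis=1)

def looks_like_permutation(wa, threshold=0.9):
    # True if each row has one dominant entry and no two rows share a column.
    mags = np.abs(np.asarray(wa, dtype=float))
    winners = mags.argmax(axis=1)
    one_per_column = len(set(winners.tolist())) == mags.shape[0]
    dominant = bool(np.all(dominance_fractions(mags) >= threshold))
    return one_per_column and dominant

# WA as read from Fig.2b (entry -2.20 is an assumed reading).
WA = [[-4.09, 0.13, 0.09, -0.07, -0.01],
      [0.07, -2.92, 0.00, 0.02, -0.06],
      [0.02, -0.02, -0.06, -0.08, -2.20],
      [0.02, 0.03, 0.00, 1.97, 0.02],
      [-0.07, 0.14, -3.50, -0.01, 0.04]]

fractions = dominance_fractions(WA)
mean_fraction = float(fractions.mean())
separated = looks_like_permutation(WA)
unmixed_badly = looks_like_permutation([[1.0, 1.0], [1.0, -1.0]])
print(mean_fraction, separated, unmixed_badly)
```

On this particular measure the Fig.2b matrix gives a mean share of roughly 0.94, in the same range as the figure quoted above, though other definitions (for example in terms of signal power) would give different numbers.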
\n\nFinally, the kind of linear static mixing we have been using is not equivalent to what would be picked up by N microphones positioned around a room. However, (Platt & Faggin 1992) in their work on the H-J net, discuss extensions for dealing with time-delays and non-static filtering, which may also be applicable to our methods.\n\n4 Discussion\n\nThe problem of Independent Component Analysis (ICA) has become popular recently in the signal processing community, partly as a result of the success of the H-J network. The H-J network is identical to the linear decorrelation network of (Barlow & Foldiak 1989) except for non-linearities in the anti-Hebb rule which normally performs only decorrelation. These non-linearities are chosen somewhat arbitrarily in the hope that their polynomial expansions will yield higher-order cross-cumulants useful in converging to independence (Comon et al, 1991). The H-J algorithm lacks an objective function, but these insights have led (Comon 1994) to propose minimising mutual information between outputs (see also Becker 1992). Since mutual information cannot be expressed as a simple function of the variables concerned, Comon expands it as a function of cumulants of increasing order.\n\nIn this paper, we have shown that mutual information, and through it, ICA, can be tackled directly (in the sense of eq.16) through a stochastic gradient approach. Sigmoidal units, being bounded, are limited in their 'channel capacity'. Weights transmitting information try, by following eq.13, to project their inputs to where they can make a lot of difference to the output, as measured by the log of the Jacobian of the transformation. In the process, each set of statistically 'dependent' information is channelled to the same output unit. 
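
The statement that eq.13 follows the log of the Jacobian can be checked numerically. The sketch below, in Python with NumPy, assumes logistic units, for which ln|J| = ln|det W| + sum over i of ln(y_i(1 - y_i)), and compares a central finite-difference gradient of that quantity with the direction [W^T]^(-1) + (1 - 2y)x^T. The function names, the step size eps and the example numbers are assumptions introduced here.

```python
import numpy as np

def outputs(W, w0, x):
    return 1.0 / (1.0 + np.exp(-(W @ x + w0)))

def log_abs_jacobian(W, w0, x):
    # ln|J| for logistic units: ln|det W| + sum_i ln(y_i (1 - y_i)).
    y = outputs(W, w0, x)
    return np.log(abs(np.linalg.det(W))) + np.sum(np.log(y * (1.0 - y)))

def eq13_direction(W, w0, x):
    # Analytic gradient of ln|J| with respect to W, in the form of eq.13.
    y = outputs(W, w0, x)
    return np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x)

def numerical_gradient(W, w0, x, eps=1e-6):
    # Central finite differences, one weight at a time.
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            step = np.zeros_like(W)
            step[i, j] = eps
            up = log_abs_jacobian(W + step, w0, x)
            down = log_abs_jacobian(W - step, w0, x)
            grad[i, j] = (up - down) / (2.0 * eps)
    return grad

# Hypothetical 2 x 2 example.
W = np.array([[1.5, -0.4],
              [0.3, 0.9]])
w0 = np.array([0.1, -0.2])
x = np.array([0.8, -0.5])

gap = float(np.max(np.abs(eq13_direction(W, w0, x)
                          - numerical_gradient(W, w0, x))))
print(gap)
```

The first term of ln|J| depends only on W and pushes the weights away from degenerate configurations, while the second term depends on how far each output sits from saturation.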
\n\nThe non-linearity is crucial. If the network were just linear, the weight matrix would grow without bound since the learning rule would be:\n\nΔW ∝ [W^T]^(-1)   (17)\n\nThis reflects the fact that the information in the outputs grows with their variance. The non-linearity also supplies the higher-order cross-moments necessary to maximise the infinite-order expansion of the information. For example, when y = tanh(u), the learning rule has the form ΔW ∝ [W^T]^(-1) - 2yx^T, from which we can write that the weights stabilise (or ⟨ΔW⟩ = 0) when I = 2⟨tanh(u)u^T⟩. Since tanh is an odd function, its series expansion is of the form tanh(u) = Σ_j b_j u^(2j+1), the b_j being coefficients, and thus this convergence criterion amounts to the condition Σ_{i,j} b_ijp ⟨u_i^(2p+1) u_j⟩ = 0 for all output unit pairs i ≠ j, for p = 0, 1, 2, 3 ..., and for the coefficients b_ijp coming from the Taylor series expansion of the tanh function.\n\nThese and other issues are covered more completely in a forthcoming paper (Bell & Sejnowski 1995). Applications to blind deconvolution (removing the effects of unknown causal filtering) are also described, and the limitations of the approach are discussed.\n\nAcknowledgements\n\nWe are much indebted to Nici Schraudolph, who not only supplied the original idea in Fig.1 and shared his unpublished calculations [13], but also provided detailed criticism at every stage of the work. Much constructive advice also came from Paul Viola and Alexandre Pouget.\n\nReferences\n\n[1] Atick J.J. 1992. Could information theory provide an ecological theory of sensory processing, Network 3, 213-251\n\n[2] Barlow H.B. 1961. 
Possible principles underlying the transformation of sensory messages, in Sensory Communication, Rosenblith W.A. (ed), MIT press\n\n[3] Barlow H.B. & Foldiak P. 1989. Adaptation and decorrelation in the cortex, in Durbin R. et al (eds) The Computing Neuron, Addison-Wesley\n\n[4] Becker S. 1992. An information-theoretic unsupervised learning algorithm for neural networks, Ph.D. thesis, Dept. of Comp. Sci., Univ. of Toronto\n\n[5] Bell A.J. & Sejnowski T.J. 1995. An information-maximisation approach to blind separation and blind deconvolution, Neural Computation, in press\n\n[6] Comon P., Jutten C. & Herault J. 1991. Blind separation of sources, part II: problems statement, Signal processing, 24, 11-21\n\n[7] Comon P. 1994. Independent component analysis, a new concept? Signal processing, 36, 287-314\n\n[8] Hopfield J.J. 1991. Olfactory computation and object perception, Proc. Natl. Acad. Sci. USA, vol. 88, pp.6462-6466\n\n[9] Jutten C. & Herault J. 1991. Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture, Signal processing, 24, 1-10\n\n[10] Linsker R. 1992. Local synaptic learning rules suffice to maximise mutual information in a linear network, Neural Computation, 4, 691-702\n\n[11] Papoulis A. 1984. Probability, random variables and stochastic processes, 2nd edition, McGraw-Hill, New York\n\n[12] Platt J.C. & Faggin F. 1992. Networks for the separation of sources that are superimposed and delayed, in Moody J.E et al (eds) Adv. Neur. Inf. Proc. Sys. 4, Morgan-Kaufmann\n\n[13] Schraudolph N.N., Hart W.E. & Belew R.K. 1992. 
Optimal information flow in sigmoidal neurons, unpublished manuscript\n\n", "award": [], "sourceid": 953, "authors": [{"given_name": "Anthony", "family_name": "Bell", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}