{"title": "A Non-linear Information Maximisation Algorithm that Performs Blind Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 467, "page_last": 474, "abstract": null, "full_text": "A Non-linear Information Maximisation Algorithm that Performs Blind Separation\n\nAnthony J. Bell (tony@salk.edu) and Terrence J. Sejnowski (terry@salk.edu)\n\nComputational Neurobiology Laboratory, The Salk Institute, 10010 N. Torrey Pines Road, La Jolla, California 92037-1099\n\nand\n\nDepartment of Biology, University of California at San Diego, La Jolla, CA 92093\n\nAbstract\n\nA new learning algorithm is derived which performs online stochastic gradient ascent in the mutual information between outputs and inputs of a network. In the absence of a priori knowledge about the 'signal' and 'noise' components of the input, propagation of information depends on calibrating network non-linearities to the detailed higher-order moments of the input density functions. By incidentally minimising mutual information between outputs, as well as maximising their individual entropies, the network 'factorises' the input into independent components. As an example application, we have achieved near-perfect separation of ten digitally mixed speech signals. Our simulations lead us to believe that our network performs better at blind separation than the Herault-Jutten network, reflecting the fact that it is derived rigorously from the mutual information objective.\n\n1 Introduction\n\nUnsupervised learning algorithms based on information-theoretic principles have tended to focus on linear decorrelation (Barlow & Foldiak 1989) or maximisation of signal-to-noise ratios assuming Gaussian sources (Linsker 1992). 
With the exception of (Becker 1992), there has been little attempt to use non-linearity in networks to achieve something a linear network could not.\n\nNon-linear networks, however, are capable of computing more general statistics than the second-order ones involved in decorrelation, and as a consequence they are capable of dealing with signals (and noises) which have detailed higher-order structure. The success of the 'H-J' networks at blind separation (Jutten & Herault 1991) suggests that it should be possible to separate statistically independent components by using learning rules which make use of moments of all orders.\n\nThis paper takes a principled approach to this problem, by starting with the question of how to maximise the information passed on in a non-linear feed-forward network. Starting with an analysis of a single unit, the approach is extended to a network mapping N inputs to N outputs. In the process, it will be shown that, under certain fairly weak conditions, the N → N network forms a minimally redundant encoding of the inputs, and that it therefore performs Independent Component Analysis (ICA).\n\n2 Information maximisation\n\nThe information that output Y contains about input X is defined as:\n\nI(Y, X) = H(Y) - H(Y|X)   (1)\n\nwhere H(Y) is the entropy (information) in the output, while H(Y|X) is whatever information the output has which didn't come from the input. In the case that we have no noise (or rather, we don't know what is noise and what is signal in the input), the mapping between X and Y is deterministic and H(Y|X) has its lowest possible value of -∞. Despite this, we may still differentiate eq. 1 as follows (see [5]):\n\n∂I(Y, X)/∂w = ∂H(Y)/∂w   (2)\n\nThus in the noiseless case, the mutual information can be maximised by maximising the entropy alone.\n\n2.1 One input, one output.
\n\nConsider an input variable, x, passed through a transforming function, g(x), to produce an output variable, y, as in Fig. 1(a). In the case that g(x) is monotonically increasing or decreasing (i.e. has a unique inverse), the probability density function (pdf) of the output, fy(y), can be written as a function of the pdf of the input, fx(x) (Papoulis, eq. 5-5):\n\nfy(y) = fx(x) / |∂y/∂x|   (3)\n\nThe entropy of the output, H(y), is given by:\n\nH(y) = -E[ln fy(y)] = -∫ fy(y) ln fy(y) dy   (4)\n\nwhere E[.] denotes expected value. Substituting eq. 3 into eq. 4 gives:\n\nH(y) = E[ln |∂y/∂x|] - E[ln fx(x)]   (5)\n\nThe second term on the right may be considered to be unaffected by alterations in a parameter, w, determining g(x). Therefore in order to maximise the entropy of y by changing w, we need only concentrate on maximising the first term, which is the average log of how the input affects the output. This can be done by considering the 'training set' of x's to approximate the density fx(x), and deriving an 'online', stochastic gradient ascent learning rule:\n\nΔw ∝ ∂H/∂w = ∂/∂w (ln |∂y/∂x|) = (∂y/∂x)^{-1} ∂/∂w (∂y/∂x)   (6)\n\nIn the case of the logistic transfer function y = (1 + e^{-u})^{-1}, u = wx + w0, in which the input is multiplied by a weight w and added to a bias-weight w0, the terms above evaluate as:\n\n∂y/∂x = wy(1 - y)   (7)\n\n∂/∂w (∂y/∂x) = y(1 - y)(1 + wx(1 - 2y))   (8)\n\nDividing eq. 8 by eq. 7 gives the learning rule for the logistic function, as calculated from the general rule of eq. 6:\n\nΔw ∝ 1/w + x(1 - 2y)   (9)\n\nSimilar reasoning leads to the rule for the bias-weight:\n\nΔw0 ∝ 1 - 2y   (10)\n\nThe effect of these two rules can be seen in Fig. 1(a). 
For example, if the input pdf fx(x) were gaussian, then the Δw0-rule would centre the steepest part of the sigmoid curve on the peak of fx(x), matching input density to output slope, in a manner suggested intuitively by eq. 3. The Δw-rule would then scale the slope of the sigmoid curve to match the variance of fx(x). For example, narrow pdfs would lead to sharply-sloping sigmoids. The Δw-rule is basically anti-Hebbian¹, with an anti-decay term. The anti-Hebbian term keeps y away from one uninformative situation: that of y being saturated to 0 or 1. But anti-Hebbian rules alone make weights go to zero, so the anti-decay term (1/w) keeps y away from the other uninformative situation: when w is so small that y stays around 0.5. The effect of these two balanced forces is to produce an output pdf fy(y) which is close to the flat unit distribution, which is the maximum entropy distribution for a variable bounded between 0 and 1. Fig. 1(b) illustrates a family of these distributions, with the highest entropy one occurring at wopt.\n\n¹If y = tanh(wx + w0) then Δw ∝ 1/w - 2xy.\n\nFigure 1: (a) Optimal information flow in sigmoidal neurons (Schraudolph et al 1992). Input x having density function fx(x), in this case a gaussian, is passed through a non-linear function g(x). The information in the resulting density, fy(y), depends on matching the mean and variance of x to the threshold and slope of g(x). In (b), fy(y) is plotted for different values of the weight w. The optimal weight, wopt, transmits most information.\n\nA rule which maximises information for one input and one output may be suggestive for structures such as synapses and photoreceptors which must position the gain of their non-linearity at a level appropriate to the average value and size of the input fluctuations. 
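As an illustration (not from the paper), the one-unit rules of eqs. 9 and 10 can be simulated in a few lines. The learning rate, input statistics and iteration count below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=5000)   # gaussian input density fx(x)

w, w0 = 0.1, 0.0   # weight and bias-weight
lr = 0.01          # learning rate (arbitrary)
for xi in x:
    y = 1.0 / (1.0 + np.exp(-(w * xi + w0)))     # logistic transfer function
    w += lr * (1.0 / w + xi * (1.0 - 2.0 * y))   # eq. 9: anti-decay + anti-Hebbian
    w0 += lr * (1.0 - 2.0 * y)                   # eq. 10

# the steepest part of the sigmoid should end up near the input mean:
centre = -w0 / w
```

After training, -w0/w sits near the mean of fx(x) (here 2.0), and w has grown so that the sigmoid's slope matches the input's small variance, as described above.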
However, to see the advantages of this approach in artificial neural networks, we now analyse the case of multi-dimensional inputs and outputs.\n\n2.2 N inputs, N outputs.\n\nConsider a network with an input vector x, a weight matrix W and a monotonically transformed output vector y = g(Wx + w0). Analogously to eq. 3, the multivariate probability density function of y can be written (Papoulis, eq. 6-63):\n\nfy(y) = fx(x) / |J|   (11)\n\nwhere |J| is the absolute value of the Jacobian of the transformation. The Jacobian is the determinant of the matrix of partial derivatives:\n\nJ = det [∂yi/∂xj],  the n × n matrix whose (i, j)-th entry is ∂yi/∂xj   (12)\n\nThe derivation proceeds as in the previous section, except that instead of maximising ln |∂y/∂x|, we now maximise ln |J|. For sigmoidal units, y = (1 + e^{-u})^{-1}, u = Wx + w0, the resulting learning rules are familiar in form:\n\nΔW ∝ [W^T]^{-1} + (1 - 2y)x^T   (13)\n\nΔw0 ∝ 1 - 2y   (14)\n\nexcept that now x, y, w0 and 1 are vectors (1 is a vector of ones), W is a matrix, and the anti-Hebbian term has become an outer product. The anti-decay term has generalised to the inverse of the transpose of the weight matrix. For an individual weight, wij, this rule amounts to:\n\nΔwij ∝ cof wij / det W + xj(1 - 2yi)   (15)\n\nwhere cof wij, the cofactor of wij, is (-1)^{i+j} times the determinant of the matrix obtained by removing the ith row and the jth column from W.\n\nThis rule is the same as the one for the single unit mapping, except that instead of w = 0 being an unstable point of the dynamics, now any degenerate weight matrix is unstable, since det W = 0 if W is degenerate. This fact enables different output units, yi, to learn to represent different components in the input. 
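A direct transcription of eqs. 13 and 14 into code might look as follows (a sketch, with an arbitrary learning rate; the explicit matrix inverse is recomputed each step, as the rule prescribes):

```python
import numpy as np

def infomax_step(W, w0, x, lr=0.005):
    # one stochastic gradient step of eqs. 13-14 for logistic units
    u = W @ x + w0
    y = 1.0 / (1.0 + np.exp(-u))
    # anti-decay term [W^T]^{-1} plus the outer-product anti-Hebbian term
    W = W + lr * (np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x))
    w0 = w0 + lr * (1.0 - 2.0 * y)
    return W, w0
```

Element (i, j) of the outer product is xj(1 - 2yi), matching the per-weight form of eq. 15; the inv(W.T) term is what repels the dynamics from degenerate weight matrices.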
When the weight vectors entering two output units become too similar, det W becomes small and the natural dynamic of learning causes these weight vectors to diverge. This effect is mediated by the numerator, cof wij. When this cofactor becomes small, it indicates that there is a degeneracy in the weight matrix of the rest of the layer (i.e. those weights not associated with input xj or output yi). In this case, any degeneracy in W has less to do with the specific weight wij that we are adjusting.\n\n3 Finding independent components - blind separation\n\nMaximising the information contained in a layer of units involves maximising the entropy of the individual units while minimising the mutual information (the redundancy) between them. Considering two units:\n\nH(y1, y2) = H(y1) + H(y2) - I(y1, y2)   (16)\n\nFor I(y1, y2) to be zero, y1 and y2 must be statistically independent of each other, so that fy1y2(y1, y2) = fy1(y1)fy2(y2). Achieving such a representation is variously called factorial code learning, redundancy reduction (Barlow 1961, Atick 1992), or independent component analysis (ICA), and in the general case of continuously valued variables of arbitrary distributions, no learning algorithm has been shown to converge to such a representation.\n\nOur method will converge to a minimum redundancy, factorial representation as long as the individual entropy terms in eq. 16 do not override the redundancy term, making an I(y1, y2) = 0 solution sub-optimal. One way to ensure this cannot occur is if we have a priori knowledge of the general form of the pdfs of the independent components. Then we can tailor our choice of node-function to be optimal for transmitting information about these components. For example, unit distributions require piecewise linear node-functions for highest H(y), while the more common gaussian forms require roughly sigmoidal curves. 
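As a toy illustration of this (not the paper's ten-speaker experiment), the rule of eq. 13 can be run on a 2 × 2 digital mixture of two independent, sharply-peaked (Laplacian) sources, for which the logistic node-function is an adequate match. The mixing matrix, learning rate and source model below are arbitrary choices, and the bias-weight is omitted since the sources are zero-mean:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
s = rng.laplace(size=(2, n))               # two independent sharply-peaked sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # mixing matrix, unknown to the network
x = A @ s                                  # observed mixtures

W = np.eye(2)
lr = 0.01
for t in range(n):
    u = W @ x[:, t]
    y = 1.0 / (1.0 + np.exp(-u))           # logistic node-function
    W += lr * (np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x[:, t]))  # eq. 13

rec = W @ x                                # recovered components
```

Each row of rec should correlate strongly with one distinct source; the order and scaling of the recovered components are inherently undetermined, exactly as in the H-J setting.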
Once we have chosen our node-functions appropriately, we can be sure that an output node y cannot have higher [...]\n\nFinally, the kind of linear static mixing we have been using is not equivalent to what would be picked up by N microphones positioned around a room. However, (Platt & Faggin 1992), in their work on the H-J net, discuss extensions for dealing with time-delays and non-static filtering, which may also be applicable to our methods.\n\n4 Discussion\n\nThe problem of Independent Component Analysis (ICA) has become popular recently in the signal processing community, partly as a result of the success of the H-J network. The H-J network is identical to the linear decorrelation network of (Barlow & Foldiak 1989) except for non-linearities in the anti-Hebb rule, which normally performs only decorrelation. These non-linearities are chosen somewhat arbitrarily in the hope that their polynomial expansions will yield higher-order cross-cumulants useful in converging to independence (Comon et al. 1991). The H-J algorithm lacks an objective function, but these insights have led (Comon 1994) to propose minimising mutual information between outputs (see also Becker 1992). Since mutual information cannot be expressed as a simple function of the variables concerned, Comon expands it as a function of cumulants of increasing order.\n\nIn this paper, we have shown that mutual information, and through it, ICA, can be tackled directly (in the sense of eq. 16) through a stochastic gradient approach. Sigmoidal units, being bounded, are limited in their 'channel capacity'. Weights transmitting information try, by following eq. 13, to project their inputs to where they can make a lot of difference to the output, as measured by the log of the Jacobian of the transformation. 
In the process, each set of statistically 'dependent' information is channelled to the same output unit.\n\nThe non-linearity is crucial. If the network were just linear, the weight matrix would grow without bound, since the learning rule would be:\n\nΔW ∝ [W^T]^{-1}   (17)\n\nThis reflects the fact that the information in the outputs grows with their variance. The non-linearity also supplies the higher-order cross-moments necessary to maximise the infinite-order expansion of the information. For example, when y = tanh(u), the learning rule has the form ΔW ∝ [W^T]^{-1} - 2yx^T, from which we can write that the weights stabilise (i.e. ⟨ΔW⟩ = 0) when I = 2⟨tanh(u)u^T⟩. Since tanh is an odd function, its series expansion is of the form tanh(u) = Σp bp u^{2p+1}, the bp being coefficients, and thus this convergence criterion amounts to the condition Σ_{p=0,1,2,...} bp ⟨ui^{2p+1} uj⟩ = 0 for all output unit pairs i ≠ j, with the coefficients bp coming from the Taylor series expansion of the tanh function.\n\nThese and other issues are covered more completely in a forthcoming paper (Bell & Sejnowski 1995). Applications to blind deconvolution (removing the effects of unknown causal filtering) are also described, and the limitations of the approach are discussed.\n\nAcknowledgements\n\nWe are much indebted to Nici Schraudolph, who not only supplied the original idea in Fig. 1 and shared his unpublished calculations [13], but also provided detailed criticism at every stage of the work. Much constructive advice also came from Paul Viola and Alexandre Pouget.\n\nReferences\n\n[1] Atick J.J. 1992. Could information theory provide an ecological theory of sensory processing? Network 3, 213-251\n\n[2] Barlow H.B. 1961. Possible principles underlying the transformation of sensory messages, in Sensory Communication, Rosenblith W.A. 
(ed.), MIT Press\n\n[3] Barlow H.B. & Foldiak P. 1989. Adaptation and decorrelation in the cortex, in Durbin R. et al. (eds), The Computing Neuron, Addison-Wesley\n\n[4] Becker S. 1992. An information-theoretic unsupervised learning algorithm for neural networks, Ph.D. thesis, Dept. of Comp. Sci., Univ. of Toronto\n\n[5] Bell A.J. & Sejnowski T.J. 1995. An information-maximisation approach to blind separation and blind deconvolution, Neural Computation, in press\n\n[6] Comon P., Jutten C. & Herault J. 1991. Blind separation of sources, part II: problems statement, Signal Processing, 24, 11-21\n\n[7] Comon P. 1994. Independent component analysis, a new concept? Signal Processing, 36, 287-314\n\n[8] Hopfield J.J. 1991. Olfactory computation and object perception, Proc. Natl. Acad. Sci. USA, vol. 88, pp. 6462-6466\n\n[9] Jutten C. & Herault J. 1991. Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture, Signal Processing, 24, 1-10\n\n[10] Linsker R. 1992. Local synaptic learning rules suffice to maximise mutual information in a linear network, Neural Computation, 4, 691-702\n\n[11] Papoulis A. 1984. Probability, Random Variables and Stochastic Processes, 2nd edition, McGraw-Hill, New York\n\n[12] Platt J.C. & Faggin F. 1992. Networks for the separation of sources that are superimposed and delayed, in Moody J.E. et al. (eds), Adv. Neur. Inf. Proc. Sys. 4, Morgan-Kaufmann\n\n[13] Schraudolph N.N., Hart W.E. & Belew R.K. 1992. Optimal information flow in sigmoidal neurons, unpublished manuscript\n", "award": [], "sourceid": 953, "authors": [{"given_name": "Anthony", "family_name": "Bell", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}