{"title": "Algorithms for Independent Components Analysis and Higher Order Statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 491, "page_last": 497, "abstract": null, "full_text": "Algorithms for Independent Components \n\nAnalysis and Higher Order Statistics \n\nDaniel D. Lee \n\nBell Laboratories \n\nLucent Technologies \nMurray Hill, NJ 07974 \n\nUri Rokni and Haim Sompolinsky \n\nRacah Institute of Physics and \nCenter for Neural Computation \n\nHebrew University \n\nJerusalem, 91904, Israel \n\nAbstract \n\nA latent variable generative model with finite noise is used to de(cid:173)\nscribe several different algorithms for Independent Components Anal(cid:173)\nysis (lCA). In particular, the Fixed Point ICA algorithm is shown to \nbe equivalent to the Expectation-Maximization algorithm for maximum \nlikelihood under certain constraints, allowing the conditions for global \nconvergence to be elucidated. The algorithms can also be explained by \ntheir generic behavior near a singular point where the size of the opti(cid:173)\nmal generative bases vanishes. An expansion of the likelihood about this \nsingular point indicates the role of higher order correlations in determin(cid:173)\ning the features discovered by ICA. The application and convergence of \nthese algorithms are demonstrated on a simple illustrative example. \n\nIntroduction \n\nIndependent Components Analysis (lCA) has generated much recent theoretical and prac(cid:173)\ntical interest because of its successes on a number of different signal processing problems. \nICA attempts to decompose the observed data into components that are as statistically in(cid:173)\ndependent from each other as possible, and can be viewed as a nonlinear generalization of \nPrincipal Components Analysis (PCA). Some applications of ICA include blind separation \nof audio signals, beamforming of radio sources, and discovery of features in biomedical \ntraces [I] . 
\n\nThere have also been a number of approaches to deriving algorithms for ICA [2, 3, 4]. Fundamentally, they all consider the problem of recovering independent source signals {s} from observations {x} such that: \n\nx_i = ∑_{j=1}^{M} W_ij s_j , i = 1..N (1) \n\nHere, W_ij is an N × M mixing matrix where the number of sources M is not greater than the dimensionality N of the observations. Thus, the columns of W represent the different independent features present in the observed data. \n\nBell and Sejnowski formulated their Infomax algorithm for ICA as maximizing the mutual information between the data and a nonlinearly transformed version of the data [5]. The covariant version of this algorithm uses the natural gradient of the mutual information to iteratively update the estimate for the demixing matrix W⁻¹ in terms of the estimated components s = W⁻¹x [6]: \n\nΔW⁻¹ ∝ [I − ⟨g(s)sᵀ⟩] W⁻¹ , (2) \n\nThe nonlinearity g(s) differentiates the features learned by the Infomax ICA algorithm from those found by conventional PCA. Fortunately, the exact form of the nonlinearity used in Eq. 2 is not crucial for the success of the algorithm, as long as it preserves the sub-Gaussian or super-Gaussian nature of the sources [7]. \n\nAnother approach to ICA, due to Hyvarinen and Oja, was derived from maximizing objective functions motivated by projection pursuit [8]. Their Fixed Point ICA algorithm attempts to self-consistently solve for the extremum of a nonlinear objective function. The simplest formulation considers a single source M = 1, so that the mixing matrix is a single vector w constrained to be unit length |w| = 1. 
Assuming the data is first preprocessed and whitened, the Fixed Point ICA algorithm iteratively updates the estimate of w as follows: \n\nw ← ⟨x g(wᵀx)⟩ − λ_G w , w ← w/|w| , (3) \n\nwhere g(wᵀx) is a nonlinear function and λ_G is a constant given by the integral of g′ over the Gaussian: \n\nλ_G = ∫ dv (2π)^{−1/2} e^{−v²/2} g′(v). (4) \n\nThe Fixed Point algorithm can be extended to an arbitrary number M ≤ N of sources by using Eq. 3 in a serial deflation scheme. Alternatively, the M columns of the mixing matrix W can be updated simultaneously by orthogonalizing the N × M matrix: \n\nW ← ⟨x g(Wᵀx)⟩ − λ_G W. (5) \n\nUnder the assumption that the observed data match the underlying ICA model, x = Ws, it has been shown that the Fixed Point algorithm converges locally to the correct solution with at least quadratic convergence. However, the global convergence of the generic Fixed Point ICA algorithm is uncertain. This is in contrast to the gradient-based Infomax algorithm, whose convergence is guaranteed as long as a sufficiently small step size is chosen. \n\nIn this paper, we first review the latent variable generative model framework for Independent Components Analysis. We then consider the generative model in the presence of finite noise, and show how the Fixed Point ICA algorithm can be related to an Expectation-Maximization algorithm for maximum likelihood. This allows us to elucidate the conditions under which the Fixed Point algorithm is guaranteed to globally converge. Assuming that the data are indeed generated from independent components, we derive the optimal parameters for convergence. We also investigate how the optimal size of the ICA mixing matrix varies as a function of the added noise, and demonstrate the presence of a singular point. By expanding the likelihood about this singular point, the behavior of the ICA algorithms can be related to the higher order statistics present in the data. 
Finally, we illustrate the application and convergence of these ICA algorithms on some artificial data. \n\nGenerative model \n\nA convenient method for interpreting the different ICA algorithms is in terms of the hidden, or latent, variable generative model shown in Fig. 1 [9, 10]. The hidden variables {s_j} correspond to the different independent components and are assumed to have the factorized non-Gaussian prior probability distribution: \n\nP(s) = ∏_{j=1}^{M} e^{−F(s_j)}. (6) \n\nFigure 1: Generative model for ICA algorithms. The M hidden variables s are mapped to the N visible variables x = Ws + η, where η are additive Gaussian noise terms. \n\nOnce the hidden variables are instantiated, the visible variables {x_i} are generated via a linear mapping through the generative weights W: \n\nP(x|s) = ∏_{i=1}^{N} (2πσ²)^{−1/2} exp[ −(1/2σ²) (x_i − ∑_j W_ij s_j)² ] , (7) \n\nwhere σ² is the variance of the Gaussian noise added to the visible variables. \n\nThe probability of the data given this model is then calculated by integrating over all possible values of the hidden variables: \n\nP(x) = ∫ ds P(s) P(x|s) = (2πσ²)^{−N/2} ∫ ds exp[ −F(s) − (1/2σ²)(x − Ws)² ] (8) \n\nIn the limit that the added noise vanishes, σ² → 0, it has previously been shown that maximizing the likelihood of Eq. 8 is equivalent to the Infomax algorithm in Eq. 2 [11]. In the following analysis, we will consider the situation when the variance of the noise is nonzero, σ² ≠ 0. \n\nExpectation-Maximization \n\nWe assume that the data has initially been preprocessed and spherized: ⟨x_i x_j⟩ = δ_ij. Unfortunately, for finite noise σ² and an arbitrary prior F(s_j), deriving a learning rule for W in closed form is analytically intractable. 
However, it becomes possible to derive a simple Expectation-Maximization (EM) learning rule under the constraint: \n\nW = ξW₀ , W₀ᵀW₀ = I , (9) \n\nwhich implies that W is orthogonal, and ξ is the length of the individual columns of W. Indeed, for data that obeys the ICA model, x = Ws, it can be shown that the optimal W must satisfy this orthogonality condition. By assuming the constraint in Eq. 9 for arbitrary data, the posterior distribution P(s|x) becomes conveniently factorized: \n\nP(s|x) ∝ ∏_{j=1}^{M} exp[ −F(s_j) + (1/σ²)( (Wᵀx)_j s_j − ξ²s_j²/2 ) ]. (10) \n\nFor the E-step, this factorized form allows the expectation function ∫ ds P(s|x) s = g(Wᵀx) to be analytically evaluated. This expectation is then used in the M-step to find the new estimate W′: \n\n⟨x g(Wᵀx)ᵀ⟩ − W′Λ_S = 0, (11) \n\nwhere Λ_S is a symmetric matrix of Lagrange multipliers that constrain the new W′ to be orthogonal. Eq. 11 is easily solved by taking the reduced singular value decomposition of the rectangular matrix: \n\n⟨x g(Wᵀx)ᵀ⟩ = U D Vᵀ, (12) \n\nwhere UᵀU = VVᵀ = I and D is a diagonal M × M matrix. Then the solution for the EM estimate of the mixing matrix is given by: \n\nW′ = U Vᵀ, (13) \nΛ_S = V D Vᵀ. (14) \n\nAs a specific example, consider the following prior for binary hidden variables: P(s) = ½[δ(s − 1) + δ(s + 1)]. In this case, the expectation ∫ ds P(s|x) s = tanh(Wᵀx/σ²), and so the EM update rule is given by orthogonalizing the matrix: \n\nW ← ⟨x tanh(Wᵀx/σ²)⟩. (15) \n\nFixed Point ICA \n\nBesides the presence of the linear term λ_G W in Eq. 5, the EM update rule looks very much like that of the Fixed Point ICA algorithm. It turns out that without this linear term, the convergence of the naive EM algorithm is much slower than that of Eq. 5. Here we show that it is possible to interpret the role of this linear term in the Fixed Point ICA algorithm within the framework of this generative model. 
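The E-step and M-step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `em_update`, the data layout (one sample per row), and the noise level used in the test are our own choices.

```python
import numpy as np

def em_update(W, X, sigma2=0.5):
    """One EM step for the orthogonality-constrained generative model (Eqs. 11-15).

    X is an (n_samples, N) array of spherized data; W is N x M with
    orthonormal columns.  For the binary prior P(s) = [d(s-1)+d(s+1)]/2,
    the E-step posterior mean is <s|x> = tanh(W^T x / sigma^2); the
    M-step orthogonalizes <x g(W^T x)^T> via its reduced SVD.
    """
    G = np.tanh(X @ W / sigma2)            # E-step: posterior means, (n, M)
    C = X.T @ G / X.shape[0]               # <x g(W^T x)^T>, an N x M matrix
    U, _, Vt = np.linalg.svd(C, full_matrices=False)  # Eq. 12
    return U @ Vt                          # W' = U V^T (Eq. 13)
```

Because W′ = UVᵀ, the updated basis has exactly orthonormal columns at every iteration, as the constraint in Eq. 9 requires.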
\n\nSuppose that the distribution of the observed data P_D(x) is actually a mixture between an isotropic distribution P₀(x) and a non-isotropic distribution P₁(x): \n\nP_D(x) = αP₀(x) + (1 − α)P₁(x). (16) \n\nBecause the isotropic part does not break rotational symmetry, it does not affect the choice of the directions of the learned basis W. Thus, it is more efficient to apply the learning algorithm to only the non-isotropic portion of the distribution, P₁(x) ∝ P_D(x) − αP₀(x), rather than to the whole observed distribution P_D(x). Applying EM to P₁(x) results in a correction term arising from the subtracted isotropic distribution. With this correction, the EM update becomes: \n\nW ← ⟨x g(Wᵀx)⟩ − αλ_G W , (17) \n\nwhich is equivalent to the Fixed Point ICA algorithm when α = 1. \n\nUnfortunately, it is not clear how to compute an appropriate value of α to use in fitting data. Taking a very small value, α ≪ 1, will result in a learning rule that is very similar to the naive EM update rule. This implies that the algorithm will be guaranteed to monotonically converge, albeit very slowly, to a local maximum of the likelihood. On the other hand, choosing a large value, α ≫ 1, will result in a subtracted probability density P₁(x) that is negative everywhere. In this case, the algorithm will converge slowly to a local minimum of the likelihood. For the Fixed Point algorithm, which operates in the intermediate regime α ≈ 1, the algorithm is likely to converge most rapidly. However, it is also in this situation that the subtracted density P₁(x) could have both positive and negative regions, and the algorithm is no longer guaranteed to converge. \n\nFigure 2: Size of the optimal generative bases as a function of the added noise σ², showing the singular point behavior around σ_c² ≈ 1. 
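The corrected update of Eq. 17 can be sketched as follows. This is a hypothetical implementation: the helper names `lambda_gauss` and `modified_em_update` and the grid-based Gaussian integral are ours, and g = tanh is chosen to match the later example.

```python
import numpy as np

def lambda_gauss(gprime, n=200001, lim=8.0):
    """lambda_G (Eq. 4): the integral of g'(v) against the unit Gaussian,
    approximated here by a Riemann sum on a fine grid."""
    v = np.linspace(-lim, lim, n)
    w = np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)
    return float(np.sum(gprime(v) * w) * (v[1] - v[0]))

def modified_em_update(W, X, alpha=1.0):
    """One step of the corrected EM rule (Eq. 17): orthogonalize
    <x g(W^T x)> - alpha * lambda_G * W.  alpha = 0 gives the naive
    EM step; alpha = 1 recovers the Fixed Point ICA update."""
    lam = lambda_gauss(lambda v: 1.0 - np.tanh(v)**2)  # g = tanh, g' = 1 - tanh^2
    C = X.T @ np.tanh(X @ W) / X.shape[0] - alpha * lam * W
    U, _, Vt = np.linalg.svd(C, full_matrices=False)   # Eqs. 12-13
    return U @ Vt
```

For g = tanh the Gaussian average evaluates to λ_G ≈ 0.61, so the α = 1 correction subtracts a substantial multiple of W before the orthogonalization, which is what accelerates convergence relative to the naive EM step.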
\n\nOptimal value of α \n\nIn order to determine the optimal value of α, we make the assumption that the observed data obeys the ICA model, x = As. Note that the statistics of the sources in the data need not match the assumed prior distribution of the sources in the generative model Eq. 6. With this assumption, which is not related to the mixture assumption in Eq. 16, it is easy to show that W = A is a fixed point of the algorithm. By analyzing the behavior of the algorithm in the vicinity of this fixed point, a simple expression emerges for the change in deviations from this fixed point, δW, after a single iteration of Eq. 17: \n\nδW_ij ← [ (⟨g′(s)⟩ − αλ_G) / (⟨s g(s)⟩ − αλ_G) ] δW_ij + O(δW³) , (18) \n\nwhere the averaging here is over the true source distribution, assumed for simplicity to be identical for all sources. Thus, the algorithm converges most rapidly if one chooses: \n\nα_opt = ⟨g′(s)⟩ / λ_G , (19) \n\nso that the local convergence is cubic. From Eq. 18 one can show that the condition for the stability of the fixed point is given by α < α_c, where: \n\nα_c = (⟨s g(s)⟩ + ⟨g′(s)⟩) / (2λ_G) . (20) \n\nThus, for α = 0, the stability criterion in Eq. 18 is equivalent to ⟨s g(s)⟩ > ⟨g′(s)⟩. For the cubic nonlinearity g(s) = s³, this implies that the algorithm will find the true independent features only if the source distribution has positive kurtosis. \n\nSingular point expansion \n\nLet us now consider how the optimal size ξ of the weights W varies as a function of the noise parameter σ². For very small σ² ≪ 1, the weights W are approximately described by the Infomax algorithm of Eq. 2, and the lengths of the columns should be unity in order to match the covariance of the data. For large σ² ≫ 1, however, the optimal size of the weights should be very small because the covariance of the noise is already larger than that of the data. 
In fact, for Factor Analysis, which is a special case of the generative model with F(s) = s²/2 in Eq. 6, it can be shown that the weights are exactly zero, W = 0, for σ² > 1. \n\nThus, the size of the optimal generative weights W varies with σ² as shown qualitatively in Fig. 2. Above a certain critical noise value σ_c² ≈ 1, the weights are exactly equal to zero, W = 0. Only below this critical value do the weights become nonzero. \n\nFigure 3: Convergence of the modified EM algorithm as a function of α. With g(s) = tanh(s) as the nonlinearity, the likelihood ⟨ln cosh(Wᵀx)⟩ is plotted as a function of the iteration number. The optimal basis W is plotted on the two-dimensional data distribution when the likelihood is maximized (top) and minimized (bottom). \n\nWe expand the likelihood of the generative model in the vicinity of this singular point. This expansion is well-behaved because the size of the generative weights W acts as a small perturbative parameter. The log likelihood of the model around this singular value is then given by: \n\nL = −(1/4) Tr[ WWᵀ − (1 − σ²)I ]² + (1/4!) ∑_{ijklm} kurt(s_m) ⟨x_i x_j x_k x_l⟩_c W_im W_jm W_km W_lm + O(1 − σ²)³ , (21) \n\nwhere kurt(s_m) represents the kurtosis of the prior distribution over the hidden variables. Note that this expansion is valid for any symmetric prior, and differs from other expansions that assume small deviations from a Gaussian prior [12, 13]. Eq. 21 shows the importance of the fourth-order cumulant of the observed data in breaking the rotational degeneracy of the weights W. The generic behavior of ICA is manifest in optimizing the cumulant term in Eq. 21, and again depends crucially on the sign of the kurtosis that is used for the prior. 
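The prediction of Eq. 19 can be checked numerically for the uniform square data of the example that follows. The sketch below (function names are ours) evaluates ⟨g′(s)⟩ for g = tanh and unit-variance sources uniform on [−√3, √3], divides by λ_G, and lands close to 0.9, the value at which the fastest convergence is observed in Fig. 3.

```python
import numpy as np

def gauss_avg(f, n=200001, lim=8.0):
    """Average of f(v) over the unit Gaussian (Riemann sum on a fine grid)."""
    v = np.linspace(-lim, lim, n)
    w = np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)
    return float(np.sum(f(v) * w) * (v[1] - v[0]))

def alpha_opt_uniform_square():
    """alpha_opt = <g'(s)> / lambda_G (Eq. 19) for g(s) = tanh(s) and
    unit-variance uniform sources on [-sqrt(3), sqrt(3)].

    <1 - tanh^2(s)> over that interval integrates in closed form to
    tanh(sqrt(3))/sqrt(3); lambda_G is the Gaussian average of g' (Eq. 4).
    """
    gp_src = np.tanh(np.sqrt(3.0)) / np.sqrt(3.0)
    lam_g = gauss_avg(lambda v: 1.0 - np.tanh(v)**2)
    return gp_src / lam_g   # roughly 0.9
```

The closed-form source average uses ∫(1 − tanh²) ds = tanh(s), so no sampling is needed on the source side; only λ_G is computed numerically.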
\n\nExample with artificial data \n\nAs an illustration of the convergence of the algorithm in Eq. 17, we consider the simple two-dimensional uniform distribution: \n\nP(x₁, x₂) = 1/12 if −√3 ≤ x₁, x₂ ≤ √3, and 0 otherwise. (22) \n\nWith g(s) = tanh(s) as the nonlinearity, Fig. 3 shows how the overall likelihood converges for different values of the parameter α as the algorithm is iterated. For α ≲ 1.0, the algorithm converges to a maximum of the likelihood, with the fastest convergence at α_opt = 0.9. However, for α > 1.2, the algorithm converges to a minimum of the likelihood. At an intermediate value, α = 1.1, the likelihood does not converge at all, fluctuating wildly between the maximum and minimum likelihood solutions. The maximum likelihood solution shows the basis vectors in W aligned with the sides of the square distribution, whereas the minimum likelihood solution has the basis aligned with the diagonals. These solutions can also be understood as maximizing and minimizing the kurtosis terms in Eq. 21. \n\nDiscussion \n\nThe utility of the latent variable generative model is demonstrated by deriving algorithms for ICA. By constraining the generative weights to be orthogonal, an EM algorithm is analytically obtained. By interpreting the data to be fitted as a mixture of isotropic and non-isotropic parts, a simple correction to the EM algorithm is derived. Under certain conditions, this modified algorithm is equivalent to the Fixed Point ICA algorithm, and converges much more rapidly than the naive EM algorithm. The optimal parameter for convergence is derived assuming the data is consistent with the ICA generative model. There also exists a critical value for the noise parameter in the generative model, about which a controlled expansion of the likelihood is possible. 
This expansion makes clear the role of higher order statistics in determining the generic behavior of different ICA algorithms. \n\nWe acknowledge the support of Bell Laboratories, Lucent Technologies, the US-Israel Binational Science Foundation, and the Israel Science Foundation. We also thank Hagai Attias, Simon Haykin, Juha Karhunen, Te-Won Lee, Erkki Oja, Sebastian Seung, Boris Shraiman, and Oren Shriki for helpful discussions. \n\nReferences \n\n[1] Haykin, S (1999). Neural networks: a comprehensive foundation. 2nd ed., Prentice-Hall, Upper Saddle River, NJ. \n\n[2] Jutten, C & Herault, J (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24, 1-10. \n\n[3] Comon, P (1994). Independent component analysis: a new concept? Signal Processing 36, 287-314. \n\n[4] Roth, Z & Baram, Y (1996). Multidimensional density shaping by sigmoids. IEEE Trans. Neural Networks 7, 1291-1298. \n\n[5] Bell, AJ & Sejnowski, TJ (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159. \n\n[6] Amari, S, Cichocki, A & Yang, H (1996). A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems 8, 757-763. \n\n[7] Lee, TW, Girolami, M & Sejnowski, TJ (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation 11, 609-633. \n\n[8] Hyvarinen, A & Oja, E (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483-1492. \n\n[9] Hinton, G & Ghahramani, Z (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions Royal Society B 352, 1177-1190. \n\n[10] Attias, H (1998). Independent factor analysis. Neural Computation 11, 803-851. 
\n[11] Pearlmutter, B & Parra, L (1996). A context-sensitive generalization of ICA. In ICONIP '96, 151-157. \n\n[12] Nadal, JP & Parga, N (1997). Redundancy reduction and independent component analysis: conditions on cumulants and adaptive approaches. Neural Computation 9, 1421-1456. \n\n[13] Cardoso, JF (1999). High-order contrasts for independent component analysis. Neural Computation 11, 157-192. \n", "award": [], "sourceid": 1639, "authors": [{"given_name": "Daniel", "family_name": "Lee", "institution": null}, {"given_name": "Uri", "family_name": "Rokni", "institution": null}, {"given_name": "Haim", "family_name": "Sompolinsky", "institution": null}]}