{"title": "Multi Dimensional ICA to Separate Correlated Sources", "book": "Advances in Neural Information Processing Systems", "page_first": 993, "page_last": 1000, "abstract": null, "full_text": "Multi Dimensional ICA to Separate \n\nCorrelated Sources \n\nRoland Vollgraf, Klaus Obermayer \n\nDepartment of Electrical Engineering and Computer Science \n\nTechnical University of Berlin Germany \n\n{ vro, oby} @cs.tu-berlin.de \n\nAbstract \n\nWe present a new method for the blind separation of sources, which \ndo not fulfill the independence assumption. In contrast to standard \nmethods we consider groups of neighboring samples (\"patches\") \nwithin the observed mixtures. \nFirst we extract independent features from the observed patches. \nIt turns out that the average dependencies between these features \nin different sources is in general lower than the dependencies be(cid:173)\ntween the amplitudes of different sources. We show that it might \nbe the case that most of the dependencies is carried by only a \nsmall number of features. \nIs this case - provided these features \ncan be identified by some heuristic - we project all patches into \nthe subspace which is orthogonal to the subspace spanned by the \n\"correlated\" features. \nStandard ICA is then performed on the elements of the transformed \npatches (for which the independence assumption holds) and ro(cid:173)\nbustly yields a good estimate of the mixing matrix. \n\n1 \n\nIntroduction \n\nICA as a method for blind source separation has been proven very useful in a wide \nrange of statistical data analysis. A strong criterion, that allows to detect and \nseparate linearly mixed source signals from the observed mixtures, is the indepen(cid:173)\ndence of the source signals amplitude distribution. Many contrast functions rely on \nthis assumption, e.g. in the way, that they estimate the Kullback-Leibler distance \nto a (non-Gaussian) factorizing multivariate distribution [1 , 2, 3]. 
Others consider higher-order moments of the source estimates [4, 5]. Naturally, these algorithms fail when the independence assumption does not hold. In such situations it can be very useful to also consider temporal/spatial statistical properties of the source signals. This has been done in the form of suitable linear filtering [6] to achieve a sparse and independent representation of the signals. In [7] the author suggests modeling the sources as stochastic processes and performing ICA on the innovations rather than on the signals themselves. \nIn this work we extend ICA to multidimensional channels of neighboring realizations. The data model used is explained in detail in the following section. In section 3 it will be shown that there are optimal features that carry lower dependencies between the sources and can be used for source separation. A heuristic is introduced that allows one to discard those features that carry most of the dependencies. This leads to the Two-Step algorithm described in section 4. Our method requires (i) sources which exhibit correlations between neighboring pixels (e.g. continuous sources like images or sound signals), and (ii) sources from which sparse and almost independent features can be extracted. In section 5 we show separation results and benchmarks for linearly mixed passport photographs. The method is fast and provides good separation results even for sources whose correlation coefficient is as large as 0.9. \n\n2 Sources and observations \n\nLet us consider a set of N source signals Si(r), i = 1, ..., N of length L, where r is a discrete sample index. The sample index could be of arbitrary dimension, but we assume that it belongs to some metric space so that neighborhood relations can be defined. The sample index might be a scalar for sources which are time series and a two-dimensional vector for sources which are images1. 
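The spatial structure this method relies on can be made concrete with a toy example (ours, not from the paper): smoothing white noise with a short moving-average filter yields a signal whose neighboring samples are strongly correlated, just like the continuous sources (images, sound signals) assumed above. The filter width and signal length below are illustrative choices.

```python
# Toy illustration (not the paper's data): a moving-average filter of
# width 5 applied to white noise induces strong correlations between
# neighboring samples -- the property the patch-based method exploits.
import numpy as np

rng = np.random.default_rng(1)
L = 50_000
white = rng.standard_normal(L + 4)
kernel = np.ones(5) / 5.0
s = np.convolve(white, kernel, mode="valid")   # length L, spatially correlated

# empirical correlation between neighboring samples s(r) and s(r+1)
c = np.corrcoef(s[:-1], s[1:])[0, 1]
print(round(c, 2))   # close to the theoretical lag-1 value of 4/5
```

For spatially white sources, by contrast, c would be close to zero and neighboring samples would carry no exploitable structure.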
The sources are linearly combined by an unknown mixing matrix A of full rank to produce a set of N observations Xi(r), \n\nXi(r) = Σj=1..N Aij Sj(r) ,    (1) \n\nand we assume that the mixing process is stationary, i.e. that the mixing matrix A is independent of r. In the following we refer to the vectors S(r) = (S1(r), ..., SN(r))^T and X(r) = (X1(r), ..., XN(r))^T as a source and an observation stack. The goal is to find an appropriate demixing matrix W which - when applied to the observations X(r) - recovers good estimates Ŝ(r), \n\nŜ(r) = WX(r) ≈ S(r) ,    (2) \n\nof the original source signals (up to a permutation and scaling of the sources). Since the mixing matrix A is not known, its inverse W has to be estimated blindly, i.e. only properties of the sources which are detectable in the mixtures can be exploited. For a large class of ICA algorithms one assumes that the sources are non-Gaussian and independent, i.e. that the random vector S which is sampled by L realizations \n\nS : {S(rl), l = 1, ..., L}    (3) \n\nhas a factorizing and non-Gaussian joint probability distribution2. In situations, however, where the independence assumption does not hold, it can be helpful to take into account spatial dependencies, which can be very prominent for natural signals and have been the subject of a number of blind source separation algorithms [8, 9, 6]. Let us now consider patches si(r) of M ≪ L neighboring source samples, collected into the N x M matrix \n\ns(r) = (s1(r)^T, ..., sN(r)^T)^T ,    (4) \n\nwhose i-th row is the patch si(r). \n\n1In the following we will mostly consider images; hence we will refer to the abovementioned neighborhood relations as spatial relations. \n2In the following, symbols without a sample index will refer to the random variable rather than to the particular realization. \n\nsi(r) could be a sequence of M adjacent samples of an audio signal or a rectangular patch of M pixels in an image. Instead of L realizations of a random N-vector S (cf. eq. 
(3)) we now obtain slightly fewer than L realizations of a random N x M matrix s, \n\ns : {s(r)} .    (5) \n\nBecause of the stationarity of the mixing process we obtain \n\nx = As   and   ŝ = Wx ,    (6) \n\nwhere x is an N x M matrix of neighboring observations and where the matrices A and W operate on every column vector of s and x. \n\n3 Optimal spatial features \n\nLet us now consider a set of sources which are not statistically independent, i.e. for which \n\np(s1k, ..., sNk) ≠ Πi=1..N p(sik)   for all k = 1, ..., M .    (7) \n\nOur goal is to find in a first step a linear transformation Ω ∈ R^(M x M) which - when applied to every patch - yields transformed sources u = sΩ^T for which the independence assumption, p(u1k, ..., uNk) = Πi=1..N p(uik), does hold for all k = 1, ..., M, at least approximately. When Ω is applied to the observations x, v = xΩ^T, we obtain a modified source separation problem \n\nv = xΩ^T = AsΩ^T = Au ,    (8) \n\nwhere the demixing matrix W can be estimated from the transformed observations v in a second step using standard ICA. Eq. (7) is tantamount to positive trans-information of the source amplitudes, \n\nI(s1k, s2k, ..., sNk) = DKL( p(s1k, ..., sNk) || Πi=1..N p(sik) ) > 0 ,    (9) \n\nwhere DKL is the Kullback-Leibler distance. As all elements of the patches are identically distributed, this quantity is the same for all k. Clearly, the dependencies that are carried by single elements of the patches are also present between whole patches, i.e. I(s1, s2, ..., sN) > 0. However, since neighboring samples are correlated, it holds that \n\nI(s1, s2, ..., sN) < Σk=1..M I(s1k, s2k, ..., sNk) .    (10) \n\nOnly if the sources were spatially white and s consisted of independent column vectors would this hold with equality. When Ω is applied to the source patches, the trans-information between patches is not changed, provided Ω is a non-singular transformation. 
Neither information is introduced nor discarded by this transformation, and it holds that \n\nI(u1, u2, ..., uN) = I(s1, s2, ..., sN) .    (11) \n\nFor the optimal Ω, the column vectors of u = sΩ^T shall now be independent. From (10) and (11) it follows that \n\nI(u1, u2, ..., uN) = Σk=1..M I(u1k, u2k, ..., uNk) < Σk=1..M I(s1k, s2k, ..., sNk) .    (12) \n\nThe column vectors of u are in general not identically distributed anymore; however, the average trans-information has decreased to the level of information carried between the patches. In the experiments we shall see that this can be sufficiently small to reliably estimate the de-mixing matrix W. \nSo it remains to estimate a matrix Ω that provides a matrix u with independent columns. We approach this by estimating Ω so that it provides row vectors of u that have independent elements, i.e. p(ui) = Πk=1..M p(uik) for all i. With that, and under the assumption that all sources may come from the same distribution and that there are no \"cross dependencies\" in u (i.e. p(uik) is independent of p(ujl) for k ≠ l), independence is guaranteed also for whole column vectors of u. Thus, standard ICA can be applied to patches of sources, which yields Ω as the de-mixing matrix. For real-world applications, however, Ω has to be estimated from the observations, v = xΩ^T. The relation v = Au holds, i.e. A only interchanges rows. So column vectors of u are independent of each other if, and only if, columns of v are independent3. Thus, Ω can be computed from x as well. \nAccording to Eq. (12) the trans-information of the elements of columns of u has decreased on average, but not necessarily uniformly. One can expect some columns to have more independent elements than others. Thus, it may be advantageous to detect these columns resp. the corresponding rows of Ω and discard them prior to the second ICA step. 
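The row-selection step can be sketched in a few lines. The Ω below is a hypothetical stand-in for a first-step demixing matrix, with one row deliberately scaled down to play the role of a low-norm, dependence-carrying feature; the sizes match a 6 x 6 patch only for illustration.

```python
# Sketch of norm-based row selection on a hypothetical Omega
# (M = 36, as for 6 x 6 patches).  Row 0 is scaled down to mimic a
# dependence-carrying feature with small row norm.
import numpy as np

rng = np.random.default_rng(2)
M, M_D = 36, 1
Omega = rng.standard_normal((M, M))
Omega[0] *= 0.1                        # the artificial low-norm row

norms = np.linalg.norm(Omega, axis=1)  # Euclidean norm of every row
order = np.argsort(norms)              # ascending: smallest norm first
discard, keep = order[:M_D], order[M_D:]
print(discard)                         # -> [0]
```

The columns of v = xΩ^T corresponding to the discarded rows would then simply be excluded from the input of the second ICA.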
Each source patch si can be considered as a linear combination of independent components, given by the columns of Ω^-1, where the elements of ui are the coefficients. As a result of the ICA, the coefficients have normalized variance. Therefore, those components that have large Euclidean norm occur as features with high entropy in the source patches. At the same time it is clear that, if there are features that are responsible for the source dependencies, these features have to be present with large entropy; otherwise the source dependencies would have been low. Accordingly, we propose a heuristic that discards the rows of Ω with the smallest Euclidean norm prior to the second ICA step. How many rows have to be discarded, and whether this type of heuristic is applicable at all, depends on the statistical nature of the sources. In section 5 we show that for the test data this heuristic is well applicable and almost all dependencies are contained in one feature. \n\n4 The Two-Step algorithm \n\nThe considerations of the previous section give rise to a Two-Step algorithm. In the first step the transformation Ω has to be estimated. Standard ICA [1, 2, 5] is performed on M-dimensional patches, which are chosen with equal probability from all of the observed mixtures and at random positions. The patches may overlap each other but do not overlap the boundaries of the signals. \nThe resulting \"demixing matrix\" Ω is applied to the patches of observations, generating a matrix v(r) = x(r)Ω^T, the columns of which are candidates for the input for the second ICA. A number of MD columns that belong to rows of Ω with small norm are discarded, as they very likely represent features that carry dependencies between the sources. MD is chosen as a model parameter, or it can be determined empirically given the data at hand (for instance by detecting a major jump in the \n\n3We assume non-Gaussian distributions for u and v. 
\n\nincrease of the row norms of Ω). For the remaining columns it is not obvious which one represents the most sparse and independent feature. So any of them, with equal probability, now serves as an input sample for the second ICA, which estimates the demixing matrix W. \nWhen the number N of sources is large, the first ICA may fail to extract the independent source features because, according to the central limit theorem, the distribution of their coefficients in the mixtures may be close to a Gaussian distribution. In such a situation we recommend applying the abovementioned two steps repeatedly. The source estimates Wx(r) are used as input for the first ICA to achieve a better Ω, which in turn allows a better estimate of W. \n\nFigure 1: Results of standard and multidimensional ICA performed on a set of 8 correlated passport images. Top row: source images; Second row: linearly mixed sources; Third row: separation results using kurtosis optimization (FastICA Matlab package); Bottom row: separation results using multidimensional ICA (for explanation see text). \n\n5 Numerical experiments \n\nWe applied our method to a linear mixture of 8 passport photographs which are shown in Fig. 1, top row. The images were mixed (cf. Fig. 1, second row) using a matrix whose elements were chosen randomly from a normal distribution with mean zero and variance one. The mixing matrix had a condition number of 80. The correlation coefficients of the source images were between 0.4 and 0.9, so that standard ICA methods failed to recover the sources: Fig. 1, 3rd row, shows the results of a kurtosis optimization using the FastICA Matlab package4. \nFig. 1, bottom row, shows the result of the Two-Step multidimensional ICA described in section 4. For better comparison, images were inverted manually to appear positive. 
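For illustration, the Two-Step procedure of section 4 can be condensed into the following sketch. This is our reconstruction, not the original experiment code: it uses scikit-learn's FastICA for the standard ICA in both steps, 1-D smoothed-noise signals instead of passport images, and illustrative sizes (N = 2 channels, 8-sample patches, MD = 1 discarded row); all of these names and numbers are our choices.

```python
# Sketch of the Two-Step algorithm on 1-D signals (our reconstruction,
# with illustrative sizes; the paper uses images and 6 x 6 patches).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
N, L, M, n_patches, M_D = 2, 20_000, 8, 2000, 1

# Spatially correlated signals (smoothed noise) and a random mixing matrix.
S = np.stack([np.convolve(rng.standard_normal(L + 4), np.ones(5) / 5, mode="valid")
              for _ in range(N)])
A = rng.standard_normal((N, N))
X = A @ S                                     # observations, eq. (1)

# Step 1: standard ICA on M-sample patches from all mixtures -> Omega.
pos = rng.integers(0, L - M, size=n_patches)
chan = rng.integers(0, N, size=n_patches)
patches = np.stack([X[c, p:p + M] for c, p in zip(chan, pos)])
Omega = FastICA(n_components=M, random_state=0, max_iter=1000).fit(patches).components_

# Heuristic: drop the M_D smallest-norm rows of Omega.
keep = np.argsort(np.linalg.norm(Omega, axis=1))[M_D:]

# Step 2: pool the retained columns of v(r) = x(r) Omega^T as
# N-dimensional samples and run standard ICA again -> W.
cols = [(X[:, p:p + M] @ Omega.T)[:, keep]
        for p in rng.integers(0, L - M, size=n_patches)]
samples = np.concatenate(cols, axis=1).T      # ((M - M_D) * n_patches) x N
W = FastICA(n_components=N, random_state=0, max_iter=1000).fit(samples).components_
S_hat = W @ X                                 # source estimates, eq. (2)
```

With strongly dependent sources one would, as recommended above for large N, iterate the two steps, feeding the estimates Wx(r) back into the first ICA.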
In the first step Ω was estimated using FastICA on 10^5 patches, 6 x 6 pixels in size, which were taken with equal probability from random positions in all mixtures. The result of the first ICA is displayed in Fig. 2. The top row shows the row vectors of Ω sorted by the logarithm of their norm. The second row shows the features (the corresponding columns of Ω^-1) which are extracted by Ω. In the diagram below, the stars indicate the logarithm of the row norm, log √(Σl=1..M Ωkl²), and the squares indicate the mutual information I(u1k, u7k) between the k-th features in sources 1 and 7 5, calculated using a histogram estimator. It is quite prominent that (i) a small norm of a row vector corresponds to a strongly correlated feature, and (ii) there is only one feature which carries most of the dependencies between the sources. Thus, the first column of v was discarded. The second ICA was applied to any of the remaining components, chosen randomly and with equal probability. A comparison between Fig. 1, top and bottom rows, shows that all sources were successfully recovered. \n\nFigure 2: Result of an ICA (kurtosis optimization) performed on patches of observations (cf. Fig. 1, 2nd row), 6 x 6 pixels in size. Top row: Row vectors of the demixing matrix Ω. Second row: Corresponding column vectors of Ω^-1. Vectors are sorted by increasing norm of the row vectors; dark and bright pixels indicate positive and negative values. Bottom diagram: Logarithm of the norm of the row vectors (stars) and mutual information I(u1k, u7k) (squares) between the coefficients of the corresponding features in the source images 1 and 7. \n\n4http://www.cis.hut.fi/projects/ica/fastica/ \n\nIn the next experiment we examined the influence of selecting columns of v prior to the second ICA. In Fig. 3 we show the reconstruction error (cf. 
appendix A) that could be achieved with the second ICA when only a single column of v served as input. From the previous experiment we have seen that only the first component has considerable dependencies. As expected, only the first column yields a poor reconstruction error. Fig. 4 shows the reconstruction error vs. MD when the MD smallest-norm rows of Ω (resp. columns of v) are discarded. We see that for all values a good reconstruction is achieved (re < 0.6). Even if no row is discarded, the result is only slightly worse than for one or two discarded rows. In this case the dependencies of the first component are \"averaged out\" by the vast majority of components that carry no dependencies. The conspicuously large variance of the error for larger numbers MD might be due to convergence instabilities or close-to-Gaussian distributed columns of u. In either case this suggests discarding as few components as possible. To evaluate the influence of the patch size M, the Two-Step algorithm was applied to 9 different mixtures of the sources shown in Fig. 1, top row, using patch sizes between M = 2 x 2 and M = 6 x 6. Table 1 shows the mean and standard deviation of the achieved reconstruction error. The mixing matrix A was randomly chosen from a normal distribution with mean zero and variance one. FastICA was used for both steps, where 5·10^5 sample patches were used to extract the optimal features and 2.5·10^4 samples were used to estimate W. The smallest-norm row of Ω was always discarded. The algorithm shows quite robust performance, and even for patch sizes of 2 x 2 pixels a fairly good separation result is achieved \n\n5Images no. 1 and 7 were chosen exemplarily as the two most strongly correlated sources. 
Figure 3: Every single row of Ω used to generate input for the second ICA. Only the first (smallest-norm) row causes a bad reconstruction error in the second ICA step. \n\nFigure 4: MD rows with smallest norm discarded. All values of MD provide a good reconstruction error in the second step. Note the slightly worse result for MD = 0! \n\npatch size M    mu_re    sigma_re \n2 x 2           0.4361   0.0383 \n3 x 3           0.2322   0.0433 \n4 x 4           0.1667   0.0263 \n5 x 5           0.1408   0.0270 \n6 x 6           0.1270   0.0460 \n\nTable 1: Separation result of the Two-Step algorithm performed on a set of 8 correlated passport images (cf. Fig. 1, top row). The table shows the average reconstruction error mu_re and its standard deviation sigma_re calculated from 9 different mixtures. \n\n(Note, for comparison, that the reconstruction error of the separation in Fig. 1, bottom row, was 0.2.) \n\n6 Summary and outlook \n\nWe extended the source separation model to multidimensional channels (image patches). There are two linear transformations to be considered, one operating inside the channels (Ω) and one operating between the different channels (W). The two transformations are estimated in two consecutive ICA steps. There are mainly two advantages to be gained from the first transformation: (i) By arranging independence among the columns of the transformed patches, the average trans-information between different channels is decreased. (ii) A suitable heuristic can be applied to discard those columns of the transformed patches that carry most of the dependencies between different channels. A heuristic that identifies the dependence-carrying components by a small norm of the corresponding rows of Ω has been introduced. 
It turns out that for the image data only one component carries most of the dependencies. Due to this fact, the described method works well even when all components are taken into account. In future work, we are going to establish a Maximum Likelihood model for both transformations. We expect a performance gain due to the mutual improvement of the estimates of W and Ω during the iterations. It remains to examine what the model has to be in case some rows of Ω are discarded; in this case the transformations do not preserve the dimensionality of the observation patches. \n\nA Reconstruction error \n\nThe reconstruction error re is a measure for the success of a source separation. It compares the estimated de-mixing matrix W with the inverse of the original mixing matrix A with respect to the indeterminacies: scalings and permutations. It is always nonnegative and equals zero if, and only if, P = WA is a nonsingular permutation matrix. This is the case when in every row of P exactly one element is different from zero and the rows of P are orthogonal, i.e. PP^T is a diagonal matrix. The reconstruction error is the sum of measures for both aspects, \n\nre = 2 Σi=1..N log Σj=1..N Pij² - Σi=1..N log Σj=1..N Pij⁴ + Σi=1..N log Σj=1..N Pij² - log det PP^T \n   = 3 Σi=1..N log Σj=1..N Pij² - Σi=1..N log Σj=1..N Pij⁴ - log det PP^T .    (13) \n\nThe first two terms vanish exactly when every row of P has at most one nonzero element, and the last two vanish exactly when the rows of P are orthogonal (by Hadamard's inequality, since Σj Pij² = (PP^T)ii). \n\nAcknowledgment: This work was funded by the German Science Foundation (grant no. DFG SE 931/1-1 and DFG OB 102/3-1) and Wellcome Trust 061113/Z/00. \n\nReferences \n\n[1] Anthony J. Bell and Terrence J. Sejnowski, \"An information-maximization approach to blind separation and blind deconvolution,\" Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995. \n\n[2] S. Amari, A. Cichocki, and H. H. Yang, \"A new learning algorithm for blind signal separation,\" in Advances in Neural Information Processing Systems, D. S. 
Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., 1995, vol. 8. \n\n[3] J. F. Cardoso, \"Infomax and maximum likelihood for blind source separation,\" IEEE Signal Processing Lett., 1997. \n\n[4] Jean-François Cardoso, Sandip Bose, and Benjamin Friedlander, \"On optimal source separation based on second and fourth order cumulants,\" in Proc. IEEE Workshop on SSAP, Corfu, Greece, 1996. \n\n[5] A. Hyvärinen and E. Oja, \"A fast fixed point algorithm for independent component analysis,\" Neural Comput., vol. 9, pp. 1483-1492, 1997. \n\n[6] M. Zibulevski and B. A. Pearlmutter, \"Blind source separation by sparse decomposition in a signal dictionary,\" Neural Computation, vol. 12, no. 3, pp. 863-882, April 2001. \n\n[7] A. Hyvärinen, \"Independent component analysis for time-dependent stochastic processes,\" in Proc. Int. Conf. on Artificial Neural Networks (ICANN'98), 1998, pp. 541-546. \n\n[8] L. Molgedey and H. G. Schuster, \"Separation of a mixture of independent signals using time delayed correlations,\" Phys. Rev. Lett., vol. 72, pp. 3634-3637, 1994. \n\n[9] H. Attias and C. E. Schreiner, \"Blind source separation and deconvolution: The dynamic component analysis algorithm,\" Neural Comput., vol. 10, pp. 1373-1424, 1998. \n\n[10] Anthony J. Bell and Terrence J. Sejnowski, \"The 'independent components' of natural scenes are edge filters,\" Vision Res., vol. 37, pp. 3327-3338, 1997. \n", "award": [], "sourceid": 2046, "authors": [{"given_name": "Roland", "family_name": "Vollgraf", "institution": null}, {"given_name": "Klaus", "family_name": "Obermayer", "institution": null}]}