{"title": "Blind Source Separation via Multinode Sparse Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 1049, "page_last": 1056, "abstract": null, "full_text": "BLIND SOURCE SEPARATION VIA \n\nMULTINODE SPARSE REPRESENTATION \n\nMichael Zibulevsky \n\nPavel Kisilev \n\nDepartment of Electrical Engineering \n\nDepartment of Electrical Engineering \n\nTechnion, Haifa 32000, Israel \n\nmzib@ee.technion.ac. if \n\nTechnion, Haifa 32000, Israel \n\npaufk@tx.technion.ac. if \n\nYehoshua Y. Zeevi \n\nDepartment of Electrical Engineering \n\nTechnion, Haifa 32000, Israel \n\nzeevi@ee.technion.ac. if \n\nBarak Pearlmutter \n\nDepartment of Computer Science \n\nUniversity of New Mexico \n\nAlbuquerque, NM 87131 USA \n\nbap@cs. unm. edu \n\nAbstract \n\nWe consider a problem of blind source separation from a set of instan(cid:173)\ntaneous linear mixtures, where the mixing matrix is unknown. It was \ndiscovered recently, that exploiting the sparsity of sources in an appro(cid:173)\npriate representation according to some signal dictionary, dramatically \nimproves the quality of separation. In this work we use the property of \nmulti scale transforms, such as wavelet or wavelet packets, to decompose \nsignals into sets of local features with various degrees of sparsity. We \nuse this intrinsic property for selecting the best (most sparse) subsets of \nfeatures for further separation. The performance of the algorithm is ver(cid:173)\nified on noise-free and noisy data. Experiments with simulated signals, \nmusical sounds and images demonstrate significant improvement of sep(cid:173)\naration quality over previously reported results. \n\n1 Introduction \n\nIn the blind source separation problem an N-channel sensor signal x(~ ) is generated by \nM unknown scalar source signals srn(~) , linearly mixed together by an unknown N x M \nmixing, or crosstalk, matrix A , and possibly corrupted by additive noise n(~): \n\nx(~) = As(~) + n(~). \n\n(1) \n\nThe independent variable ~ is either time or spatial coordinates in the case of images. We \nwish to estimate the mixing matrix A and the M-dimensional source signal s(~). \nThe assumption of statistical independence of the source components Srn(~) , m = 1, ... , M \nleads to the Independent Component Analysis (lCA) [1], [2]. A stronger assumption is the \n\n\u00b0Supported in part by the Ollendorff Minerva Center, by the Israeli Ministry of Science, by NSF \n\nCAREER award 97-02-311 and by the National Foundation for Functional Brain Imaging \n\n\fsparsity of decomposition coefficients, when the sources are properly represented [3]. In \nparticular, let each 8 m (~) have a sparse representation obtained by means of its decompo(cid:173)\nsition coefficients Cmk according to a signal dictionary offunctions Y k (~): \n\n8m (~) = L Cmk Yk(~)' \n\nk \n\n(2) \n\nThe functions Yk (~ ) are called atoms or elements of the dictionary. These elements do \nnot have to be linearly independent, and instead may form an overcomplete dictionary, \ne.g. wavelet-related dictionaries (wavelet packets, stationary wavelets, etc., see for exam(cid:173)\nple [9]). Sparsity means that only a small number of coefficients Cmk differ significantly \nfrom zero. Then, unmixing of the sources is performed in the transform domain, i.e. in the \ndomain of these coefficients Cmk. The property of sparsity often yields much better source \nseparation than standard ICA, and can work well even with more sources than mixtures. 
In many cases there are distinct groups of coefficients, wherein the sources have different sparsity properties. The key idea in this study is to select only a subset of features (coefficients) which is best suited for separation, with respect to the following criteria: (1) sparsity of the coefficients; (2) separability of the sources' features. After this subset is formed, it is used in the separation process, which can be accomplished by standard ICA algorithms or by clustering. The performance of our approach is verified on noise-free and noisy data. Our experiments with 1D signals and images demonstrate that the proposed method further improves separation quality, as compared with results obtained by using the sparsity of all decomposition coefficients.

2 Two approaches to sparse source separation: InfoMax and Clustering

Sparse sources can be separated by any one of several techniques, e.g. the Bell-Sejnowski Information Maximization (BS InfoMax) approach [1], or approaches based on geometric considerations (see for example [8]). In the former case, the algorithm estimates the unmixing matrix W = A^{-1}, while in the latter case the output is the estimated mixing matrix. In both cases, these matrices can be estimated only up to a column permutation and a scaling factor [4].

InfoMax. Under the assumption of a noiseless system and a square mixing matrix in (1), BS InfoMax is equivalent to the maximum likelihood (ML) formulation of the problem [4], which is used in this section. For simplicity of presentation, let us consider the case where the dictionary of functions used in the source decomposition (2) is an orthonormal basis. (In this case, the corresponding coefficients are c_mk = <s_m, φ_k>, where <·,·> denotes the inner product.) From (1) and (2), the decomposition coefficients of the noiseless mixtures, according to the same signal dictionary of functions φ_k(ξ), are

    λ_k = A c_k,                                             (3)

where the M-dimensional vector c_k forms the k-th column of the matrix C = {c_mk}.

Let Y be the feature, or (new) data, matrix of dimension M x K, where K is the number of features. Its rows are either the samples of the sensor signals (mixtures) or their decomposition coefficients. In the latter case, the coefficients λ_k form the columns of Y. (In the following discussion we assume this setting for Y, unless stated otherwise.) We are interested in the maximum likelihood estimate of A given the data Y.

Let the corresponding coefficients c_mk be independent random variables with a probability density function (pdf) of an exponential type

    p(c_mk) ∝ exp{-ν(c_mk)},                                 (4)

where the scalar function ν(·) is a smooth approximation of the absolute value function. This kind of distribution is widely used for modeling sparsity [5]. In view of the independence of the c_mk, and (4), the prior pdf of C is

    p(C) ∝ Π_{m,k} exp{-ν(c_mk)}.                            (5)

Taking into account that Y = AC, the parametric model for the pdf of Y with respect to the parameters A is

    p_A(Y) ∝ |det A|^{-K} Π_{m,k} exp{-ν((A^{-1} Y)_mk)}.    (6)

Let W = A^{-1} be the unmixing matrix to be estimated. Then, substituting C = WY, combining (6) with (5) and taking the logarithm, we arrive at the log-likelihood function

    L_W(Y) = K log|det W| - Σ_{m=1}^{M} Σ_{k=1}^{K} ν((WY)_mk).   (7)

Maximization of L_W(Y) with respect to W is equivalent to BS InfoMax, and can be performed efficiently by the Natural Gradient algorithm [6]. We used this algorithm as implemented in the ICA/EEG Matlab toolbox [7].
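For illustration, the maximization of (7) by a natural gradient ascent can be sketched as follows. This is our sketch, not the toolbox implementation used in the paper [7]: tanh serves as a smooth stand-in for ν'(·) (it corresponds to ν(u) = log cosh u, a standard smooth approximation of |u|), and the step size and iteration count are arbitrary:

```python
import numpy as np

def natural_gradient_ica(Y, n_iter=500, lr=0.1):
    """Sketch of maximizing the log-likelihood (7) by natural gradient.

    Y : (M, K) feature matrix (mixture samples or their decomposition
    coefficients). tanh is a smooth surrogate for nu'(.) ~ sign(.).
    Illustrative only; the paper used the ICA/EEG Matlab toolbox [7].
    """
    M, K = Y.shape
    W = np.eye(M)                    # unmixing matrix estimate, W ~ A^{-1}
    for _ in range(n_iter):
        U = W @ Y                    # current coefficient estimates C = W Y
        # Natural gradient of (7): (I - nu'(U) U^T / K) W
        W += lr * (np.eye(M) - np.tanh(U) @ U.T / K) @ W
    return W
```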
Clustering. In the case of geometry-based methods, separation of sparse sources can be achieved by clustering along the orientations of data concentration in the N-dimensional space wherein each column y_k of the matrix Y represents a data point (N is the number of mixtures). Let us consider a two-dimensional noiseless case, wherein two source signals, s1(t) and s2(t), are mixed by a 2x2 matrix A, yielding two mixtures x1(t) and x2(t). (Here, the data matrix is constructed from these mixtures x1(t) and x2(t).) Typically, a scatter plot of two sparse mixtures, x1(t) versus x2(t), looks like the rightmost plot in Figure 2. If only one source, say s1(t), were present, the sensor signals would be x1(t) = a11 s1(t) and x2(t) = a21 s1(t), and the data points in the scatter diagram of x1(t) versus x2(t) would lie on the straight line along the vector [a11 a21]^T. The same thing happens when two sparse sources are present: at each particular index where a sample of the first source is large, there is a high probability that the corresponding sample of the second source is small, so the point in the scatter diagram still lies close to that straight line. The same argument holds for the second source. As a result, the data points are concentrated around two dominant orientations, which are directly related to the columns of A. Source signals are rarely sparse in their original domain. In contrast, their decomposition coefficients (2) usually show high sparsity. Therefore, we construct the data matrix Y from the decomposition coefficients of the mixtures (3), rather than from the mixtures themselves.

In order to determine the orientations of the scattered data, we project the data points onto the surface of a unit sphere by normalizing the corresponding vectors, and then apply a standard clustering algorithm. This clustering approach works efficiently even if the number of sources is greater than the number of sensors. Our clustering procedure can be summarized as follows (a code sketch follows the list):

1. Form the feature matrix Y by putting samples of the sensor signals, or a (subset of) their decomposition coefficients, into the corresponding rows of the matrix;

2. Normalize the feature vectors (columns of Y): y_k = y_k / ||y_k||_2, in order to project the data points onto the surface of a unit sphere, where ||·||_2 denotes the l2 norm. Before normalization, it is reasonable to remove data points with a very small norm, since these are very likely to be crosstalk-corrupted by small coefficients from the other sources;

3. Move the data points to a half-sphere, e.g. by forcing the sign of the first coordinate y_k^1 to be positive: IF y_k^1 < 0 THEN y_k = -y_k. Without this operation, each set of linearly (i.e., along a line) clustered data points would yield two clusters on opposite sides of the sphere;

4. Estimate the cluster centers by using a clustering algorithm. The coordinates of these centers form the columns of the estimated mixing matrix A. We used the Fuzzy C-Means (FCM) clustering algorithm as implemented in the Matlab Fuzzy Logic Toolbox.
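A compact sketch of steps 1-4, under assumptions of ours: scikit-learn's k-means stands in for the Fuzzy C-Means used in the paper (any centroid-based clustering works here), and the relative norm threshold is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans   # stand-in for the paper's Fuzzy C-Means

def estimate_mixing_by_clustering(Y, n_sources, norm_floor=1e-3):
    """Steps 1-4 above. Y is N x K with data points as columns; returns
    an N x n_sources estimate of A, up to column permutation and scale."""
    norms = np.linalg.norm(Y, axis=0)
    Yk = Y[:, norms > norm_floor * norms.max()]   # drop near-zero points
    Yk = Yk / np.linalg.norm(Yk, axis=0)          # step 2: unit sphere
    Yk = np.where(Yk[0] < 0, -Yk, Yk)             # step 3: half-sphere
    km = KMeans(n_clusters=n_sources, n_init=10).fit(Yk.T)
    return km.cluster_centers_.T                  # step 4: centers -> columns of A
```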
Source recovery. The estimated unmixing matrix A^{-1} is obtained by either BS InfoMax or the above clustering procedure, applied either to the complete data set or to some subsets of the data (to be explained in the next section). The sources are then recovered in their original domain by s(t) = A^{-1} x(t). We should stress here that if the clustering approach is used, the estimation of the sources is not restricted to the case of square mixing matrices, although source recovery is more complicated in the rectangular case (this topic is outside the scope of this paper).

[Figure 1: Random block signals (two upper) and their mixtures (two lower).]

3 Multinode based source separation

Motivating example: sparsity of random blocks in the Haar basis. To provide intuitive insight into the practical implications of our main idea, we first use 1D block functions, which are piecewise constant, with random amplitude and duration of each constant piece (Figure 1). It is known that the Haar wavelet basis provides a compact representation of such functions. Let us take a close look at the Haar wavelet coefficients at the different resolution levels j = 0, 1, ..., J. Wavelet basis functions at the finest resolution level j = J are obtained by translation of the Haar mother wavelet ψ(t):

    ψ_{J,i}(t) = ψ(t - i).
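The level-dependent sparsity is easy to observe numerically. In the sketch below, random_blocks is a hypothetical block-signal generator of ours, and PyWavelets stands in for the Matlab wavelet tools; for a piecewise-constant signal each jump touches only a few detail coefficients per level, so the fraction of significant coefficients shrinks toward the finest (and largest) levels:

```python
import numpy as np
import pywt   # PyWavelets, used here in place of the Matlab wavelet tools

rng = np.random.default_rng(1)

def random_blocks(T=1024, n_jumps=10):
    """Hypothetical generator of piecewise-constant signals as in Figure 1."""
    jumps = np.zeros(T)
    idx = rng.choice(T, size=n_jumps, replace=False)
    jumps[idx] = rng.standard_normal(n_jumps)
    return np.cumsum(jumps)

s = random_blocks()
details = pywt.wavedec(s, 'haar')[1:]   # detail coefficients, coarse -> fine
for j, c in enumerate(details, start=1):
    frac = np.mean(np.abs(c) > 1e-8 * np.abs(c).max())
    print(f"level {j}: {frac:.1%} of coefficients significantly nonzero")
```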
[Figure 2: Separation of block signals: scatter plots of the sensor signals (left) and of their wavelet coefficients (middle and right). The numbers below give the normalized mean-squared separation error (%) obtained with the Bell-Sejnowski InfoMax and with Fuzzy C-Means clustering, respectively:

             raw signals   all wavelet coefficients   finest-level coefficients
  InfoMax    1.93          0.183                      0.005
  FCM        1.78          0.058                      0.002  ]

Since the crosstalk matrix A is estimated only up to a column permutation and a scaling factor, in order to measure the separation accuracy we normalize the original sources s_m(t) and the corresponding estimated sources ŝ_m(t). The averaged (over sources) normalized squared error (NSE) is then computed as

    NSE = (1/M) Σ_{m=1}^{M} ||ŝ_m - s_m||_2^2 / ||s_m||_2^2.

The resulting separation errors for the block sources are presented in the lower part of Figure 2. The largest error (1.93%) is obtained on the raw data, and the smallest (0.005% and less) on the wavelet coefficients at the finest resolution, which have the best sparsity. Using all wavelet coefficients yields intermediate sparsity and performance.

Multinode representation. Our choice of a particular wavelet basis and of the sparsest subset of coefficients was obvious in the above example: it was based on knowledge of the structure of piecewise-constant signals. For sources having oscillatory components (like sounds, or images with textures), other systems of basis functions, such as wavelet packets and trigonometric function libraries [9], may be more appropriate. The wavelet packet library consists of the triple-indexed family of functions

    φ_{j,i,q}(t) = 2^{j/2} φ_q(2^j t - i),    j, i ∈ Z, q ∈ N,

where j and i are the scale and shift parameters, respectively, and q is the frequency parameter. [Roughly speaking, q is proportional to the number of oscillations of the mother wavelet φ_q(t).] These functions form a binary tree whose nodes are indexed by the depth of the level j and the node number q = 0, 1, 2, ..., 2^j - 1 at the specified level j. The same indexing is used for the corresponding subsets of wavelet packet coefficients (as well as in the scatter diagrams in the section on experimental results).

Adaptive selection of sparse subsets. When signals have a complex nature, it is difficult to decide in advance which nodes contain the sparsest sets of coefficients. We therefore use the following simple adaptive approach. First, for every node of the tree, we apply our clustering algorithm and compute a measure of the clusters' distortion. In our experiments we used a standard global distortion, the mean squared distance of the data points to the centers of their own (closest) clusters (here again, weights of the data points can be incorporated):

    d = Σ_{k=1}^{K} min_m ||u_m - y_k||,

where K is the number of data points, u_m is the m-th centroid, y_k is the k-th data point, and ||·|| is the sum-of-squares distance. Second, we choose a few best nodes with the minimal distortion, combine their coefficients into one data set, and apply a separation algorithm (clustering or InfoMax) to these data.
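A sketch of this selection loop, under the same stand-in assumptions as the earlier snippets (PyWavelets for the packet tree, k-means in place of FCM; scikit-learn's inertia_ is precisely the global distortion d). For brevity, only the nodes of a single decomposition level are scored here, whereas the method ranks all tree nodes:

```python
import numpy as np
import pywt
from sklearn.cluster import KMeans

def node_distortion(Y, n_sources):
    """Global distortion d = sum_k min_m ||u_m - y_k||^2 of the sphere-
    projected data points of one node (smaller = better clustered)."""
    norms = np.linalg.norm(Y, axis=0)
    Yk = Y[:, norms > 1e-3 * norms.max()]
    Yk = Yk / np.linalg.norm(Yk, axis=0)
    Yk = np.where(Yk[0] < 0, -Yk, Yk)
    return KMeans(n_clusters=n_sources, n_init=10).fit(Yk.T).inertia_

def best_nodes(x, n_sources, wavelet='db8', level=3, n_best=2):
    """Rank the wavelet-packet nodes of the mixtures x (N x T array)
    by clustering distortion and return the n_best node paths."""
    trees = [pywt.WaveletPacket(xi, wavelet, maxlevel=level) for xi in x]
    scores = []
    for node in trees[0].get_level(level, order='freq'):
        Y = np.vstack([t[node.path].data for t in trees])  # row per mixture
        scores.append((node_distortion(Y, n_sources), node.path))
    return [path for _, path in sorted(scores)[:n_best]]
```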
4 Experimental results

The proposed blind separation method, based on the wavelet-packet representation, was evaluated on several types of signals. We have already discussed the relatively simple example of a random block signal. The second type of signal is a frequency-modulated (FM) sinusoidal signal: the carrier frequency is modulated either by a sinusoidal function (FM signal) or by random blocks (BFM signal). The third type is a musical recording of flute sounds. Finally, we apply our algorithm to images. An example of such images is presented in the left part of Figure 3.

[Figure 3: Left: two source images (upper pair), their mixtures (middle pair) and the estimated images (lower pair). Right: scatter plots of the wavelet packet (WP) coefficients of the mixtures of images; subsets are indexed on the WP tree.]

In order to compare the accuracy of our adaptive best-nodes method with that attainable by standard methods, we form the following feature sets: (1) raw data; (2) Short Time Fourier Transform (STFT) coefficients (in the case of 1D signals); (3) wavelet transform coefficients; (4) wavelet packet coefficients at the best nodes found by our method, using various wavelet families with different degrees of smoothness (haar, db-4, db-8). In the case of image separation, we used the Discrete Cosine Transform (DCT) instead of the STFT, and the sym4 and sym8 mother wavelets instead of db-4 and db-8 for the wavelet transform and wavelet packets.

The right part of Figure 3 presents an example of scatter plots of the wavelet packet coefficients obtained at various nodes of the wavelet packet tree. The upper left scatter plot, marked with 'C', corresponds to the complete set of coefficients at all nodes. The rest are scatter plots of sets of coefficients indexed on the wavelet packet tree. Generally speaking, the more distinct the two dominant orientations appear on these plots, the more precise is the estimation of the mixing matrix, and, therefore, the better is the quality of separation. Note that only two nodes, C22 and C23, show clear orientations. These nodes will most likely be selected by the algorithm for the subsequent estimation process.

              raw data   STFT    WT db8   WT haar   WP db8   WP haar
  Blocks      10.16      2.669   0.174    0.037     0.073    0.002
  BFM sine    24.51      0.667   0.665    2.34      0.2      0.442
  FM sine     25.57      0.32    1.032    6.105     0.176    0.284
  Flutes      1.48       0.287   0.355    0.852     0.154    0.648

              raw data   DCT     WT sym8  WT haar   WP sym8  WP haar
  Images      4.88       1.164   3.651    1.114     0.365    0.687

Table 1: Experimental results: normalized mean-squared separation error (%) for noise-free signals and images, applying FCM separation to the raw data and to decomposition coefficients in various domains. In the case of the wavelet packets (WP), the best nodes selected by our algorithm were used.

Table 1 summarizes the results of experiments in which we applied our best-features-selection approach along with FCM separation to each noise-free feature set. In these experiments, we compared the quality of separation of deterministic signals by calculating NSEs (i.e., residual crosstalk errors).
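For reference, the error measure can be computed as below. The brute-force matching over permutations and signs is our reading of the normalization described in Section 3, not a transcription of the authors' code:

```python
import numpy as np
from itertools import permutations

def nse(S_true, S_est):
    """Normalized squared error of Section 3, averaged over sources.

    Both arrays are (M, T). Each signal is scale-normalized, and the
    column permutation / sign ambiguity of blind separation is resolved
    by brute force over permutations (fine for small M)."""
    S = S_true / np.linalg.norm(S_true, axis=1, keepdims=True)
    E = S_est / np.linalg.norm(S_est, axis=1, keepdims=True)
    best = np.inf
    for p in permutations(range(S.shape[0])):
        err = np.mean([min(np.sum((S[i] - E[j]) ** 2),
                           np.sum((S[i] + E[j]) ** 2))
                       for i, j in enumerate(p)])
        best = min(best, err)
    return best    # multiply by 100 for the % figures of Tables 1-2
```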
In the case of the random block and BFM signals, we performed 100 Monte Carlo simulations and calculated the normalized mean-squared errors (NMSE) for the above feature sets. From Table 1 it is clear that our adaptive best-nodes method outperforms all of the other feature sets (including the complete set of wavelet coefficients) for every type of signal. A similar improvement was achieved by using our method along with BS InfoMax separation, which provided even better results for images. In the case of the random block signals, using the Haar wavelet for the wavelet packet representation yields better separation than using a smooth wavelet, e.g. db-8. The reason is that these block signals, which are not natural signals, have a sparser representation in the Haar basis. In contrast, as expected, natural signals such as the flute recordings are better represented by smooth wavelets, which in turn provide better separation. This is another advantage of using sets of features at multiple nodes along with various families of 'mother' functions: one can choose the best nodes from several decomposition trees simultaneously.

In order to verify the performance of our method in the presence of noise, we added various types of noise (white Gaussian and salt & pepper) to three mixtures of three images at various signal-to-noise energy ratios (SNR). Table 2 summarizes these experiments, in which we applied our approach along with BS InfoMax separation. It turns out that the ideas used in wavelet-based signal denoising (see for example [10] and references therein) carry over to signal separation from noisy mixtures. In particular, in the case of white Gaussian noise, the noise energy is uniformly distributed over all wavelet coefficients at the various scales. Therefore, at sufficiently high SNRs, the large coefficients of the signals are only slightly distorted by the noise coefficients, and the estimation of the unmixing matrix is almost unaffected by the presence of the noise. (In contrast, BS InfoMax applied to the three noisy mixtures themselves failed completely, arriving at an NSE of 19% even in the case of SNR = 12 dB.) We should stress that, although our adaptive best-nodes method performs reasonably well in the presence of noise, it is not intended to further denoise the reconstructed images (this can be achieved by some denoising method after the source signals are separated). More experimental results, as well as the parameters of the simulations, can be found in [11].

[Table 2: Performance of the algorithm in the presence of various sources of noise in the mixtures of images: normalized mean-squared separation error (%) as a function of SNR [dB], for mixtures with white Gaussian noise and for mixtures with salt & pepper noise, applying our adaptive approach along with BS InfoMax separation.]
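The argument about white Gaussian noise is easy to verify numerically: an orthonormal wavelet transform keeps white noise white, so the noise floor spreads thinly over all coefficients, while the signal energy stays concentrated in a few large coefficients that are barely distorted. A small sketch under the same illustrative assumptions as the earlier snippets:

```python
import numpy as np
import pywt

rng = np.random.default_rng(2)
T = 4096
s = np.cumsum(rng.standard_normal(T) * (rng.random(T) < 0.01))  # block signal
noise = 0.1 * rng.standard_normal(T)                            # white Gaussian

cs = np.concatenate(pywt.wavedec(s, 'haar'))       # signal coefficients
cn = np.concatenate(pywt.wavedec(noise, 'haar'))   # noise coefficients

# The few coefficients standing far above the (uniform) noise floor
# carry nearly all of the signal energy.
big = np.abs(cs) > 10 * cn.std()
print(f"{big.mean():.2%} of coefficients hold "
      f"{(cs[big] ** 2).sum() / (cs ** 2).sum():.1%} of the signal energy")
```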
5 Conclusions

Experiments with both one- and two-dimensional, simulated and natural signals demonstrate that multinode sparse representations improve the efficiency of blind source separation. The proposed method improves the separation quality by utilizing the structure of signals, wherein several subsets of the wavelet packet coefficients have significantly better sparsity and separability than others. In this case, scatter plots of these coefficients show distinct orientations, each of which specifies a column of the mixing matrix. We choose the 'good' subsets according to the global distortion adopted as a measure of cluster quality. Finally, we combine the coefficients from the best chosen subsets into one data set, and restore the mixing matrix from this subset alone by the InfoMax algorithm or by clustering. This yields significantly better results than those obtained by applying the standard InfoMax and clustering approaches directly to the raw data. The advantage of our method is particularly noticeable in the case of noisy mixtures.

References

[1] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.

[2] A. Hyvarinen, "Survey on independent component analysis," Neural Computing Surveys, no. 2, pp. 94-128, 1999.

[3] M. Zibulevsky and B. A. Pearlmutter, "Blind separation of sources with sparse representations in a given signal dictionary," Neural Computation, vol. 13, no. 4, pp. 863-882, 2001.

[4] J.-F. Cardoso, "Infomax and maximum likelihood for blind separation," IEEE Signal Processing Letters, vol. 4, pp. 112-114, 1997.

[5] M. S. Lewicki and T. J. Sejnowski, "Learning overcomplete representations," Neural Computation, vol. 12, no. 2, pp. 337-365, 2000.

[6] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems 8, MIT Press, 1996.

[7] S. Makeig, ICA/EEG toolbox. Computational Neurobiology Laboratory, the Salk Institute, 1999. http://www.cnl.salk.edu/~tewon/ica_cnl.html

[8] A. Prieto, C. G. Puntonet, and B. Prieto, "A neural algorithm for blind separation of sources based on geometric properties," Signal Processing, vol. 64, no. 3, pp. 315-331, 1998.

[9] S. Mallat, A Wavelet Tour of Signal Processing. Academic Press, 1998.

[10] D. L. Donoho, "De-noising by soft thresholding," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 613-627, 1995.

[11] P. Kisilev, M. Zibulevsky, Y. Y. Zeevi, and B. A. Pearlmutter, "Multiresolution framework for sparse blind source separation," CCIT Report no. 317, June 2000.

", "award": [], "sourceid": 1980, "authors": [{"given_name": "Michael", "family_name": "Zibulevsky", "institution": null}, {"given_name": "Pavel", "family_name": "Kisilev", "institution": null}, {"given_name": "Yehoshua", "family_name": "Zeevi", "institution": null}, {"given_name": "Barak", "family_name": "Pearlmutter", "institution": null}]}