{"title": "Kernel Feature Spaces and Nonlinear Blind Souce Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 768, "abstract": null, "full_text": "Kernel Feature Spaces and\n\nNonlinear Blind Source Separation\n\nStefan Harmeling1(cid:3), Andreas Ziehe1, Motoaki Kawanabe1, Klaus-Robert M\u00fcller1;2\n\n1Fraunhofer FIRST.IDA, Kekul\u00e9str. 7, 12489 Berlin, Germany\n\n2University of Potsdam, Department of Computer Science,\n\nAugust-Bebel-Strasse 89, 14482 Potsdam, Germany\n\n{harmeli,ziehe,kawanabe,klaus}@first.fhg.de\n\nAbstract\n\nIn kernel based learning the data is mapped to a kernel feature space of\na dimension that corresponds to the number of training data points. In\npractice, however, the data forms a smaller submanifold in feature space,\na fact that has been used e.g. by reduced set techniques for SVMs. We\npropose a new mathematical construction that permits to adapt to the in-\ntrinsic dimension and to \ufb01nd an orthonormal basis of this submanifold.\nIn doing so, computations get much simpler and more important our\ntheoretical framework allows to derive elegant kernelized blind source\nseparation (BSS) algorithms for arbitrary invertible nonlinear mixings.\nExperiments demonstrate the good performance and high computational\nef\ufb01ciency of our kTDSEP algorithm for the problem of nonlinear BSS.\n\n1 Introduction\n\nIn a widespread area of applications kernel based learning machines, e.g. Support Vector\nMachines (e.g. [19, 6]) give excellent solutions. This holds both for problems of supervised\nand unsupervised learning (e.g. [3, 16, 12]). The general idea is to map the data xi (i =\n1; : : : ; T ) into some kernel feature space F by some mapping (cid:8) : \nexists. 
The matrix Φ_v^T Φ_v has full rank and its inverse exists. So, now we can define an orthonormal basis

    Ξ := Φ_v (Φ_v^T Φ_v)^{-1/2}                                          (5)

the column space of which is identical to the column space of Φ_v. Consequently this basis Ξ enables us to parameterize all vectors that lie in the column space of Φ_x¹ by vectors in R^d, preserving all scalar products:

    α_Φ^T Φ_x^T Φ_x β_Φ = α_Ξ^T Ξ^T Ξ β_Ξ = α_Ξ^T β_Ξ.                   (6)

The entries

    (Φ_v^T Φ_v)_ij = Φ(v_i)^T Φ(v_j) = k(v_i, v_j)   with i, j = 1…d

form a real valued d × d matrix Φ_v^T Φ_v that can be calculated efficiently using the kernel trick; by construction of v_1, …, v_d it has full rank and is thus invertible. Similarly we get

    (Φ_v^T Φ_x)_ij = Φ(v_i)^T Φ(x_j) = k(v_i, x_j)   with i = 1…d, j = 1…T,

the entries of the real valued d × T matrix Φ_v^T Φ_x. Using both matrices we finally compute the parameter matrix

    Ψ_x := Ξ^T Φ_x = (Φ_v^T Φ_v)^{-1/2} Φ_v^T Φ_x                        (7)

which is also a real valued d × T matrix; note that (Φ_v^T Φ_v)^{-1/2} is symmetric. Regarding computational costs, we have to evaluate the kernel function O(d²) + O(dT) times, and eq. (7) requires O(d³) multiplications; again note that d is much smaller than T. Furthermore, storage requirements are cheaper, since we do not have to hold the full T × T kernel matrix but only a d × T matrix. Also, kernel based algorithms often require centering in F, which in our setting is equivalent to centering in the parameter space R^d.

To determine d, we randomly sample points v_1, …, v_n from the data and compute the rank of the matrix Φ_v^T Φ_v. Repeating this random sampling process several times (e.g. 100 times) stabilizes this process in practice.

¹The column space of Φ_x is the space that is spanned by the column vectors of Φ_x, written span(Φ_x).
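The construction in eqs. (5)–(7) only ever requires kernel evaluations, never the map Φ itself. A minimal NumPy sketch of this step follows; the function names and the convention that signals are stored as columns are our own illustration, not taken from the paper:

```python
import numpy as np

def inv_sqrt(K, eps=1e-12):
    # Symmetric inverse square root (Phi_v^T Phi_v)^{-1/2} used in
    # eqs. (5) and (7), computed via an eigendecomposition of the
    # d x d kernel matrix; eps guards against tiny negative eigenvalues.
    w, U = np.linalg.eigh(K)
    return U @ np.diag(1.0 / np.sqrt(np.clip(w, eps, None))) @ U.T

def parameter_matrix(V, X, kernel):
    # Eq. (7): Psi_x = (Phi_v^T Phi_v)^{-1/2} Phi_v^T Phi_x.
    # V is n x d (basis-generating points), X is n x T (data);
    # the result is the d x T matrix of coordinates w.r.t. the basis Xi.
    Kvv = kernel(V, V)  # d x d entries k(v_i, v_j)
    Kvx = kernel(V, X)  # d x T entries k(v_i, x_j)
    return inv_sqrt(Kvv) @ Kvx
```

As a sanity check of the scalar-product preservation of eq. (6): for a linear kernel k(x, y) = x^T y and d = n, the basis Ξ is an orthogonal matrix, so Ψ_x^T Ψ_x = X^T X holds exactly.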
Then we denote by rk(n) the largest achieved rank; note that rk(n) ≤ n. Using this definition we can formulate a recipe to find d (the dimension of the subspace of F): (1) start with a large d for which rk(d) < d; (2) decrement d by one as long as rk(d) < d holds. As soon as rk(d) = d, we have found d. Choose v_1, …, v_d as the vectors that achieve rank d. As an alternative to random sampling we have also employed k-means clustering, with similar results.

3 Nonlinear blind source separation

To demonstrate the use of the orthonormal basis in F, we formulate a new nonlinear BSS algorithm based on TDSEP [21]. We start from a set of points v_1, …, v_d that are provided by the algorithm from the last section such that eq. (4) holds. Next, we use eq. (7) to compute

    Ψ_x[t] := Ξ^T Φ(x[t]) = (Φ_v^T Φ_v)^{-1/2} Φ_v^T Φ(x[t]).

4 Experiments

The source signals s1 and s2 are a sinusoidal and a saw-tooth signal with 2000 samples each. The nonlinearly mixed signals are defined as (cf. Fig. 2, upper left panel)

    x1[t] = exp(s1[t]) − exp(s2[t])
    x2[t] = exp(−s1[t]) + exp(−s2[t]).

A dimension d = 22 of the manifold in feature space was obtained by kTDSEP using a polynomial kernel k(x, y) = (x^T y + 1)^6 and sampling from the inputs. The basis-generating vectors v_1, …, v_22 are shown as big dots in the upper left panel of Figure 2. Applying TDSEP to the 22-dimensional mapped signals Ψ_x[t], we get 22 components in parameter space. A scatter plot of the two components that best match the source signals is shown in the upper right panel of Figure 2. For comparison, the lower left panel shows the two components obtained by applying linear TDSEP directly to the mixed signals x[t].
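For concreteness, the first mixture can be generated, and, since the mixing is invertible, even unmixed analytically when the mixing is known, as in the following NumPy sketch. The signal periods and all variable names are our own illustrative choices; the paper specifies only the signal types, the sample count, and the mixing equations:

```python
import numpy as np

# Source signals: a sinusoid and a saw-tooth, 2000 samples each
# (periods chosen arbitrarily for this illustration).
T = 2000
t = np.arange(T)
s1 = np.sin(2 * np.pi * t / 200.0)
s2 = 2.0 * ((t % 100) / 100.0) - 1.0

# Nonlinear mixture of the first experiment.
x1 = np.exp(s1) - np.exp(s2)
x2 = np.exp(-s1) + np.exp(-s2)

# Invertibility check: with e1 = exp(s1), e2 = exp(s2) we have
# x1 = e1 - e2 and x2 = (e1 + e2) / (e1 * e2), so p = e1 * e2 solves
# x2^2 * p^2 - 4 * p - x1^2 = 0, from which the sources are recovered.
p = (2.0 + np.sqrt(4.0 + x1**2 * x2**2)) / x2**2
s1_rec = np.log((x2 * p + x1) / 2.0)
s2_rec = np.log((x2 * p - x1) / 2.0)
```

This closed-form inversion of course presumes knowledge of the mixing; kTDSEP has to achieve the same unfolding blindly, from the mixed signals alone.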
The plots clearly indicate that kTDSEP has unfolded the nonlinearity successfully, while the linear demixing algorithm failed.

In a second experiment, two speech signals (with 20000 samples each, sampling rate 8 kHz) are nonlinearly mixed by

    x1[t] = s1[t] + s2^3[t]
    x2[t] = s1^3[t] + tanh(s2[t]).

This time we used a Gaussian RBF kernel k(x, y) = exp(−|x − y|²). kTDSEP identified d = 41 and used k-means clustering to obtain v_1, …, v_41. These points are marked as '+' in the left panel of Figure 4. An application of TDSEP to the 41-dimensional parameter space yields nonlinear components whose projections to the input space are depicted in the right lower panel. We can see that linear TDSEP (right middle panel) failed and that the directions of the best matching kTDSEP components closely resemble the sources.

To confirm this visual impression we calculated the correlation coefficients of the kTDSEP and TDSEP solutions with the source signals (cf. Table 3). Clearly, kTDSEP outperforms the linear TDSEP algorithm, which is of course what one expects.

          mixture        kTDSEP         TDSEP
          x1     x2      u1     u2      u1     u2
    s1    0.56   0.72    0.89   0.07    0.09   0.72
    s2    0.63   0.46    0.04   0.86    0.31   0.55

    Table 3: Correlation coefficients for the signals shown in Fig. 4.

5 Conclusion

Our work has two main contributions. First, we propose a new formulation in the field of kernel based learning methods that allows us to construct an orthonormal basis of the subspace of kernel feature space F in which the data lies. This technique establishes a highly useful (scalar product preserving) isomorphism between the image of the data points in F and a d-dimensional space