{"title": "Robust Spatial Filtering with Beta Divergence", "book": "Advances in Neural Information Processing Systems", "page_first": 1007, "page_last": 1015, "abstract": "The efficiency of Brain-Computer Interfaces (BCI) largely depends upon a reliable extraction of informative features from the high-dimensional EEG signal. A crucial step in this protocol is the computation of spatial filters. The Common Spatial Patterns (CSP) algorithm computes filters that maximize the difference in band power between two conditions, thus it is tailored to extract the relevant information in motor imagery experiments. However, CSP is highly sensitive to artifacts in the EEG data, i.e. few outliers may alter the estimate drastically and decrease classification performance. Inspired by concepts from the field of information geometry we propose a novel approach for robustifying CSP. More precisely, we formulate CSP as a divergence maximization problem and utilize the property of a particular type of divergence, namely beta divergence, for robustifying the estimation of spatial filters in the presence of artifacts in the data. We demonstrate the usefulness of our method on toy data and on EEG recordings from 80 subjects.", "full_text": "Robust Spatial Filtering with Beta Divergence\n\nWojciech Samek1,4 Duncan Blythe1,4 Klaus-Robert M\u00a8uller1,2 Motoaki Kawanabe3\n\n1Machine Learning Group, Berlin Institute of Technology (TU Berlin), Berlin, German\n\n2Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea\n3ATR Brain Information Communication Research Laboratory Group, Kyoto, Japan\n\n4Bernstein Center for Computational Neuroscience, Berlin, Germany\n\nAbstract\n\nThe ef\ufb01ciency of Brain-Computer Interfaces (BCI) largely depends upon a reliable\nextraction of informative features from the high-dimensional EEG signal. A cru-\ncial step in this protocol is the computation of spatial \ufb01lters. 
The Common Spatial Patterns (CSP) algorithm computes filters that maximize the difference in band power between two conditions, thus it is tailored to extract the relevant information in motor imagery experiments. However, CSP is highly sensitive to artifacts in the EEG data, i.e. a few outliers may alter the estimate drastically and decrease classification performance. Inspired by concepts from the field of information geometry we propose a novel approach for robustifying CSP. More precisely, we formulate CSP as a divergence maximization problem and utilize the property of a particular type of divergence, namely beta divergence, for robustifying the estimation of spatial filters in the presence of artifacts in the data. We demonstrate the usefulness of our method on toy data and on EEG recordings from 80 subjects.

1 Introduction

Spatial filtering is a crucial step in the reliable decoding of user intention in Brain-Computer Interfacing (BCI) [1, 2]. It reduces the adverse effects of volume conduction and simplifies the classification problem by increasing the signal-to-noise ratio. The Common Spatial Patterns (CSP) [3, 4, 5, 6] method is one of the most widely used algorithms for computing spatial filters in motor imagery experiments. A spatial filter computed with CSP maximizes the differences in band power between two conditions, thus it aims to enhance detection of the synchronization and desynchronization effects occurring over different locations of the sensorimotor cortex after performing motor imagery. It is well known that CSP may provide poor results when artifacts are present in the data or when the data is non-stationary [7, 8]. Note that artifacts in the data are often unavoidable and cannot always be removed by preprocessing, e.g.
with Independent Component Analysis. They may be due to eye movements, muscle movements, loose electrodes, sudden changes of attention, circulation, respiration, or external events, among many other possibilities. A straightforward way to robustify CSP against overfitting is to regularize the filters or the covariance matrix estimation [3, 7, 9, 10, 11]. Several other strategies have been proposed for estimating spatial filters under non-stationarity [12, 8, 13, 14].

In this work we propose a novel approach for robustifying CSP, inspired by recent results in the field of information geometry [15, 16]. We show that CSP may be formulated as a divergence maximization problem; in particular, we prove by using Cauchy's interlacing theorem [17] that the spatial filters found by CSP span a subspace with maximum symmetric Kullback-Leibler divergence between the distributions of both classes. In order to robustify the CSP algorithm against the influence of outliers we propose solving the divergence maximization problem with a particular type of divergence, namely beta divergence. This divergence has been successfully used for robustifying algorithms such as Independent Component Analysis (ICA) [18] and Non-negative Matrix Factorization (NMF) [19]. In order to capture artifacts on a trial-by-trial basis we reformulate the CSP problem as a sum of trial-wise divergences and show that our method downweights the influence of artifactual trials, thus it robustly integrates information from all trials.

The remainder of this paper is organized as follows. Section 2 introduces the divergence-based framework for CSP. Section 3 describes the beta-divergence CSP method and discusses its robustness property. Section 4 evaluates the method on toy data and EEG recordings from 80 subjects and interprets the performance improvement. Section 5 concludes the paper with a discussion.
An implementation of our method is available at http://www.divergence-methods.org.

2 Divergence-Based Framework for CSP

Spatial filters computed by the Common Spatial Patterns (CSP) [3, 4, 5] algorithm have been widely used in Brain-Computer Interfacing as they are well suited to discriminate between distinct motor imagery patterns. A CSP spatial filter w maximizes the variance of band-pass filtered EEG signals in one condition while minimizing it in the other condition. Mathematically the CSP solution can be obtained by solving the generalized eigenvalue problem

Σ_1 w_i = λ_i Σ_2 w_i,   (1)

where Σ_1 and Σ_2 are the estimated (average) D × D covariance matrices of class 1 and 2, respectively. Note that the spatial filters W = [w_1 ... w_D] can be sorted by importance α_1 = max{λ_1, 1/λ_1} > ... > α_D = max{λ_D, 1/λ_D}.

2.1 divCSP Algorithm

Information geometry [15] has provided useful frameworks for developing various machine learning (ML) algorithms, e.g. by optimizing divergences between two different probability distributions [20, 21]. In particular, a series of robust ML methods have been successfully obtained from Bregman divergences, which are generalizations of the Kullback-Leibler (KL) divergence [22]. Among them, we employ in this work the beta divergence. Before proposing our novel algorithm, we show that CSP can also be interpreted as maximization of the symmetric KL divergence.

Theorem 1: Let W = [w_1 ... w_d] be the d top (sorted by α_i) spatial filters computed by CSP and let Σ_1 and Σ_2 denote the covariance matrices of class 1 and 2. Let V^T = R̃P be a d × D dimensional matrix that can be decomposed into a whitening projection P ∈ R^{D×D} (with P(Σ_1 + Σ_2)P^T = I) and an orthogonal projection R̃ ∈ R^{d×D}. Then

span(W) = span(V*)   (2)

with V* = argmax_V D̃_kl(V^T Σ_1 V || V^T Σ_2 V),   (3)

where D̃_kl(· || ·) denotes the symmetric Kullback-Leibler divergence¹ between zero mean Gaussians and span(M) stands for the subspace spanned by the columns of matrix M. Note that [23] has provided a proof for the special case of one spatial filter, i.e. for V ∈ R^{D×1}.
Proof: See appendix and supplement material.

¹ The symmetric Kullback-Leibler divergence between distributions f(x) and g(x) is defined as D̃_kl(f(x) || g(x)) = ∫ f(x) log(f(x)/g(x)) dx + ∫ g(x) log(g(x)/f(x)) dx.

The objective function that is maximized in Eq. (3) can be written as

L_kl(V) = (1/2) tr((V^T Σ_1 V)^{-1} (V^T Σ_2 V)) + (1/2) tr((V^T Σ_2 V)^{-1} (V^T Σ_1 V)) − d.   (4)

In order to cater for artifacts on a trial-by-trial basis we need to reformulate the above objective function. Instead of maximizing the divergence between the average class distributions we propose to optimize the sum of trial-wise divergences

L_sumkl(V) = Σ_{i=1}^{N} D̃_kl(V^T Σ_1^i V || V^T Σ_2^i V),   (5)

where Σ_1^i and Σ_2^i denote the covariance matrices estimated from the i-th trial of class 1 and class 2, respectively, and N is the number of trials per class. Note that the reformulated problem is not equivalent to CSP; in Eq. (4) averaging is performed w.r.t. the covariance matrices, whereas in Eq. (5) it is performed w.r.t. the divergences. We denote the former approach by kl-divCSP and the latter one by sumkl-divCSP.
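To make Eqs. (1) and (4) concrete, the following sketch computes CSP filters via the generalized eigenproblem and evaluates the symmetric-KL objective. This is our illustration with NumPy/SciPy, not the authors' released code, and all function names are ours; by Theorem 1 the top CSP filters span the subspace maximizing L_kl, so they should score at least as high as any other projection.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(S1, S2, d):
    """Solve the generalized eigenproblem S1 w = lambda S2 w of Eq. (1)
    and keep the d filters with largest alpha = max(lambda, 1/lambda)."""
    lam, W = eigh(S1, S2)                 # eigenvalues in ascending order
    alpha = np.maximum(lam, 1.0 / lam)
    order = np.argsort(alpha)[::-1]
    return W[:, order[:d]]

def sym_kl_objective(V, S1, S2):
    """L_kl(V) of Eq. (4): symmetric KL divergence between the projected
    zero-mean Gaussians N(0, V^T S1 V) and N(0, V^T S2 V)."""
    A, B = V.T @ S1 @ V, V.T @ S2 @ V
    d = V.shape[1]
    return 0.5 * np.trace(np.linalg.solve(A, B)) \
         + 0.5 * np.trace(np.linalg.solve(B, A)) - d

rng = np.random.default_rng(0)
# two synthetic classes whose band power differs in the first and last channel
X1 = rng.standard_normal((500, 6)) * np.array([2.0, 1, 1, 1, 1, 0.5])
X2 = rng.standard_normal((500, 6)) * np.array([0.5, 1, 1, 1, 1, 2.0])
S1, S2 = np.cov(X1.T), np.cov(X2.T)

V_csp = csp_filters(S1, S2, d=2)
V_rnd = rng.standard_normal((6, 2))
# Theorem 1: the CSP subspace maximizes the symmetric KL objective
assert sym_kl_objective(V_csp, S1, S2) >= sym_kl_objective(V_rnd, S1, S2)
```

Since L_kl depends only on the subspace spanned by V, any invertible recombination of the CSP filters attains the same objective value.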
The following theorem relates both approaches in the asymptotic case.

Theorem 2: Suppose that the number of discriminative sources is one, and let c be such that D/n → c as D, n → ∞ (D dimensions, n data points per trial). Then if there exists γ(c) with N/D → γ(c) for N → ∞ (N the number of trials), the empirical maximizer of L_sumkl(v) (and of course also of L_kl(v)) converges almost surely to the true solution.
Sketched Proof: See appendix.

Thus Theorem 2 says that both divergence-based CSP variants, kl-divCSP and sumkl-divCSP, almost surely converge to the same (true) solution in the asymptotic case. The theorem can easily be extended to multiple discriminative sources.

2.2 Optimization Framework

We use the methods developed in [24], [25] and [26] for solving the maximization problems in Eq. (4) and Eq. (5). The projection V ∈ R^{D×d} to the d-dimensional subspace can be decomposed into three parts, namely V^T = I_d R P, where I_d is an identity matrix truncated to the first d rows, R is a rotation matrix with R R^T = I and P is a whitening matrix. The optimization process consists of finding the rotation R that maximizes our objective function and can be performed by gradient descent on the manifold of orthogonal matrices. More precisely, we start with an orthogonal matrix R_0 and find an orthogonal update U in the k-th step such that R_{k+1} = U R_k. The update matrix is chosen by identifying the direction of steepest descent in the set of orthogonal transformations and then performing a line search along this direction to find the optimal step. Since the basis of the extracted subspace is arbitrary (one can right multiply a rotation matrix to V without changing the divergence), we select the principal axes of the data distribution of one class (after projection) as basis in order to maximally separate the two classes.
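The decomposition V^T = I_d R P described above can be sketched as follows. This is an illustrative NumPy snippet (helper names are ours, not from the paper): any rotation R leaves the whitened composite covariance equal to the identity, which is what makes gradient search over rotations well-posed.

```python
import numpy as np

def whitener(S1, S2):
    """Whitening matrix P with P (S1 + S2) P^T = I, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S1 + S2)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def random_rotation(D, rng):
    """A random orthogonal matrix obtained from a QR decomposition."""
    Q, R = np.linalg.qr(rng.standard_normal((D, D)))
    return Q * np.sign(np.diag(R))        # fix column signs

rng = np.random.default_rng(1)
D, d = 5, 2
A = rng.standard_normal((D, D)); S1 = A @ A.T   # synthetic class covariances
B = rng.standard_normal((D, D)); S2 = B @ B.T

P = whitener(S1, S2)
R = random_rotation(D, rng)
Id = np.eye(D)[:d]            # identity truncated to the first d rows
V = (Id @ R @ P).T            # V^T = I_d R P, so V is D x d

# the composite covariance stays white under ANY rotation R
W = (R @ P) @ (S1 + S2) @ (R @ P).T
assert np.allclose(W, np.eye(D), atol=1e-8)
```

Because whitening is fixed up front, the optimization only has to search the orthogonal group for R, which is what Algorithm 1 below exploits.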
The optimization process is summarized in Algorithm 1 and explained in the supplement material of the paper.

Algorithm 1 Divergence-based Framework for CSP
1: function DIVCSP(Σ_1, Σ_2, d)
2:   Compute the whitening matrix P = Σ^{-1/2} (with Σ = Σ_1 + Σ_2)
3:   Initialise R_0 with a random rotation matrix
4:   Whiten and rotate the data Σ_c = (R_0 P) Σ_c (R_0 P)^T with c = {1, 2}
5:   repeat
6:     Compute the gradient matrix and determine the step size (see supplement material)
7:     Update the rotation matrix R_{k+1} = U R_k
8:     Apply the rotation to the data Σ_c = U Σ_c U^T
9:   until convergence
10:  Let V^T = I_d R_{k+1} P
11:  Rotate V by G ∈ R^{d×d} where G are the eigenvectors of V^T Σ_1 V
12:  return V
13: end function

3 Beta Divergence CSP

Robustness is a desirable property of algorithms that work in data setups which are known to be contaminated by outliers. For example, in the biomedical fields, signals such as EEG may be highly affected by artifacts, i.e. outliers, which may drastically influence statistical estimation. Note that both of the above approaches, kl-divCSP and sumkl-divCSP, are not robust w.r.t. artifacts as they perform simple (non-robust) averaging of the covariance matrices and of the divergence terms, respectively. In this section we show that by using beta divergence we robustify the averaging of the divergence terms, as beta divergence downweights the influence of outlier trials.

Beta divergence was proposed in [16, 27] and is defined (for β > 0) as

D_β(f(x) || g(x)) = (1/β) ∫ (f^β(x) − g^β(x)) f(x) dx − (1/(β+1)) ∫ (f^{β+1}(x) − g^{β+1}(x)) dx,   (6)

where f(x) and g(x) are two probability distributions. Like every statistical divergence it is always non-negative and equals zero iff g = f [15].
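A quick numerical sanity check of Eq. (6), and of the β → 0 limit stated just below, can be sketched as follows (our illustration; the grid-based quadrature and all function names are ours):

```python
import numpy as np

def gauss(x, var):
    """Density of a zero-mean Gaussian with the given variance."""
    return np.exp(-0.5 * x ** 2 / var) / np.sqrt(2 * np.pi * var)

def beta_div(f, g, dx, beta):
    """D_beta(f || g) of Eq. (6), approximated on a uniform grid."""
    t1 = np.sum((f ** beta - g ** beta) * f) * dx / beta
    t2 = np.sum(f ** (beta + 1) - g ** (beta + 1)) * dx / (beta + 1)
    return t1 - t2

def kl_div(f, g, dx):
    """Kullback-Leibler divergence on the same grid."""
    return np.sum(f * np.log(f / g)) * dx

x = np.linspace(-30.0, 30.0, 200001)
dx = x[1] - x[0]
f, g = gauss(x, 1.0), gauss(x, 4.0)

# beta divergence approaches the KL divergence as beta -> 0 ...
assert abs(beta_div(f, g, dx, beta=1e-4) - kl_div(f, g, dx)) < 1e-3
# ... and it vanishes when both densities coincide
assert abs(beta_div(f, f, dx, beta=0.5)) < 1e-12
```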
The symmetric version of beta divergence, D̃_β(f(x) || g(x)) = D_β(f(x) || g(x)) + D_β(g(x) || f(x)), can be interpreted as a discrepancy between two probability distributions. One can show easily that beta and Kullback-Leibler divergence coincide as β → 0.

In the context of parameter estimation, one can show that minimizing the divergence function from an empirical distribution p to the statistical model q(φ) is equivalent to maximizing the Ψ-likelihood L̄_Ψβ(φ):

argmin_{q(φ)} D_β(p || q(φ)) = argmax_{q(φ)} L̄_Ψβ(q(φ))   (7)

with L̄_Ψβ(q(φ)) = (1/n) Σ_{i=1}^{n} Ψ_β(ℓ(x_i, q(φ))) − b_Ψβ(φ)  and  Ψ_β(z) = (exp(βz) − 1)/β,   (8)

where ℓ(x_i, q(φ)) denotes the log-likelihood of observation x_i under distribution q(φ), and b_Ψβ(φ) := (β+1)^{-1} ∫ q(φ)^{β+1} dx. Basu et al. [27] showed that the Ψ-likelihood method weights each observation according to the magnitude of the likelihood evaluated at the observation; if an observation is an outlier, i.e. of lower likelihood, then it is downweighted. Thus, beta divergence allows one to construct robust estimators as samples with low likelihood are downweighted (see also M-estimators [28]).

β-divCSP Algorithm

We propose applying beta divergence to the objective function in Eq. (5) in order to downweight the influence of artifacts in the computation of spatial filters. An overview of the different divergence-based CSP variants is provided in Figure 1. The objective function of our β-divCSP approach is

L_β(V) = Σ_{i=1}^{N} D̃_β(V^T Σ_1^i V || V^T Σ_2^i V)   (9)

= Σ_{i=1}^{N} (1/β) ( ∫ f_i^{β+1} dx − ∫ f_i^β g_i dx + ∫ g_i^{β+1} dx − ∫ f_i g_i^β dx ),   (10)

with f_i = N(0, Σ̄_1^i) and g_i = N(0, Σ̄_2^i) being the zero-mean Gaussian distributions with projected covariances Σ̄_1^i = V^T Σ_1^i V ∈ R^{d×d} and Σ̄_2^i = V^T Σ_2^i V ∈ R^{d×d}, respectively. One can show easily (see the supplement file to this paper) that L_β(V) has an explicit form

L_β(V) = Σ_{i=1}^{N} γ ( |Σ̄_1^i|^{−β/2} + |Σ̄_2^i|^{−β/2} − (β+1)^{d/2} ( |β Σ̄_1^i + Σ̄_2^i|^{−1/2} |Σ̄_1^i|^{(1−β)/2} + |β Σ̄_2^i + Σ̄_1^i|^{−1/2} |Σ̄_2^i|^{(1−β)/2} ) ),   (11)

with γ = (1/β) √( 1 / ((2π)^{βd} (β+1)^{d}) ). We use Algorithm 1 to maximize the objective function of β-divCSP.

In the following we show that the robustness property of β-divCSP can be directly understood from inspection of its objective function. Assume Σ̄_1^i and Σ̄_2^i are full rank d × d covariance matrices. We investigate the behaviour of the objective functions of β-divCSP and sumkl-divCSP when Σ̄_2^i becomes very large, e.g. because it is affected by artifacts. It is not hard to see that for β > 0 the objective function L_β does not go to infinity but is constant as Σ̄_2^i becomes arbitrarily large. The first term of the objective function, |Σ̄_1^i|^{−β/2}, is constant with respect to changes of Σ̄_2^i, and all the other terms go to zero as Σ̄_2^i increases. Thus the influence function of the β-divCSP estimator is bounded w.r.t. changes in Σ̄_2^i (the same argument holds for changes of Σ̄_1^i). Note that this robustness property vanishes when applying the Kullback-Leibler divergence of Eq. (4), as the trace term tr((Σ̄_1^i)^{-1} Σ̄_2^i) is not bounded when Σ̄_2^i becomes arbitrarily large, thus this artifactual term will dominate the solution.

Figure 1: Relation between the different CSP formulations outlined in this paper.

4 Experimental Evaluation

4.1 Simulations

In order to investigate the effects of artifactual trials on CSP and β-divCSP we generate data x(t) using the following mixture model

x(t) = A [s_dis(t); s_ndis(t)] + ε,   (12)

where A ∈ R^{10×10} is a random orthogonal mixing matrix, s_dis is a discriminative source sampled from a zero mean Gaussian with variance 1.8 in one condition and 0.2 in the other one, s_ndis are 9 sources with variance 1 in both conditions and ε is a noise variable with variance 2. We generate 100 trials per condition, each consisting of 200 data points. Furthermore we randomly add artifacts with variance 10 independently to each data dimension (i.e. virtual electrode) and trial with varying probability and evaluate the angle between the true filter extracting the source activity of s_dis and the spatial filter computed by CSP and β-divCSP. The median angles over 100 repetitions are shown in Figure 2. One can clearly see that the angle error between the spatial filter extracted by CSP and the true one increases with larger artifact probability.
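The boundedness argument above can be illustrated numerically. The snippet below is our sketch: it assumes the closed-form symmetric beta divergence between zero-mean Gaussians stated in Eq. (11) (function names are ours) and shows that the beta-divergence term saturates as one projected covariance blows up, while the symmetric-KL trace term of Eq. (4) keeps growing with the outlier variance.

```python
import numpy as np

def sym_beta_div_gauss(A, B, beta):
    """Symmetric beta divergence between N(0, A) and N(0, B): the
    closed-form per-trial term of Eq. (11) (our transcription)."""
    d = A.shape[0]
    gamma = np.sqrt(1.0 / ((2 * np.pi) ** (beta * d) * (beta + 1) ** d)) / beta
    det = np.linalg.det
    return gamma * (det(A) ** (-beta / 2) + det(B) ** (-beta / 2)
                    - (beta + 1) ** (d / 2)
                    * (det(beta * A + B) ** -0.5 * det(A) ** ((1 - beta) / 2)
                       + det(beta * B + A) ** -0.5 * det(B) ** ((1 - beta) / 2)))

def sym_kl_gauss(A, B):
    """Symmetric KL divergence between N(0, A) and N(0, B), cf. Eq. (4)."""
    d = A.shape[0]
    return 0.5 * np.trace(np.linalg.solve(A, B)) \
         + 0.5 * np.trace(np.linalg.solve(B, A)) - d

A = np.eye(2)
scales = [1e2, 1e4, 1e6]   # an increasingly extreme "artifactual" trial covariance
beta_terms = [sym_beta_div_gauss(A, s * np.eye(2), beta=0.5) for s in scales]
kl_terms = [sym_kl_gauss(A, s * np.eye(2)) for s in scales]

# the beta-divergence term saturates (bounded influence of the outlier trial) ...
assert beta_terms[2] / beta_terms[1] < 1.1
# ... while the KL trace term grows linearly with the outlier variance
assert kl_terms[2] / kl_terms[1] > 50
```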
Furthermore, one can see from the figure that using very small β values does not attenuate the artefact problem, but rather increases the error by adding up trial-wise divergences without downweighting outliers. However, as the β value increases the artifactual trials are downweighted and a robust average is computed over the trial-wise divergence terms. This increased robustness significantly reduces the angle error.

Figure 2: Angle between the true spatial filter and the filter computed by CSP and β-divCSP for different probabilities of artifacts (panels show outlier probabilities 0, 0.001, 0.005, 0.01, 0.02 and 0.05 for β values between 0.001 and 2). The robustness of our approach increases with the β value and significantly outperforms the CSP solution.

4.2 Data Sets and Experimental Setup

The data set [29] used for the evaluation contains EEG recordings from 80 healthy BCI-inexperienced volunteers performing motor imagery tasks with the left and right hand or feet. The subjects performed motor imagery first in a calibration session and then in a feedback mode in which they were required to control a 1D cursor application. Activity was recorded from the scalp with multi-channel EEG amplifiers using 119 Ag/AgCl electrodes in an extended 10-20 system, sampled at 1000 Hz (downsampled to 100 Hz) with a band-pass from 0.05 to 200 Hz.
Three runs with 25 trials of each motor condition were recorded in the calibration session and the two best classes were selected; the subjects performed feedback with three runs of 100 trials. Both sessions were recorded on the same day.

For the offline analysis we manually select 62 electrodes densely covering the motor cortex, extract a time segment located from 750 ms to 3500 ms after the cue indicating the motor imagery class and filter the signal in 8-30 Hz using a 5th-order Butterworth filter. We do not apply manual or automatic rejection of trials or electrodes and use six spatial filters for feature extraction. For classification we apply Linear Discriminant Analysis (LDA) after computing the logarithm of the variance of the spatially filtered data. We measure performance as misclassification rate and normalize the covariance matrices by dividing them by their traces. The parameter β is selected from the set of 15 candidates {0, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.5, 0.75, 1, 1.5, 2, 5} by 5-fold cross-validation on the calibration data using minimal training error rate as selection criterion. For faster convergence we use the rotation part of the CSP solution as the initial rotation matrix R_0.

4.3 Results

We compare our β-divCSP method with three CSP baselines using different estimators for the covariance matrices. The first baseline uses the standard empirical estimator, the second one applies a standard analytic shrinkage estimator [9] and the third one relies on the minimum covariance determinant (MCDE) estimate [30]. Note that the shrinkage estimator usually provides better estimates in small-sample settings, whereas MCDE is robust to outliers. In order to perform a fair comparison we applied MCDE over a range of parameters [0, 0.05, 0.1, ..., 0.5] and selected the best one by cross-validation (as with β-divCSP).
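The feature-extraction and classification pipeline described above (log-variance features followed by LDA) can be sketched as follows. This is our illustration on synthetic trials with placeholder spatial filters, not the authors' code; all names are ours.

```python
import numpy as np

def log_var_features(trials, V):
    """Project each trial (channels x time) through spatial filters V and
    take the log of the variance of each filtered signal."""
    return np.array([np.log(np.var(V.T @ X, axis=1)) for X in trials])

def fit_lda(F, y):
    """Two-class LDA: w = Sw^-1 (m1 - m0), threshold at the class midpoint."""
    m0, m1 = F[y == 0].mean(0), F[y == 1].mean(0)
    Sw = np.cov(F[y == 0].T) + np.cov(F[y == 1].T)   # within-class scatter
    w = np.linalg.solve(Sw, m1 - m0)
    b = -0.5 * w @ (m0 + m1)
    return w, b

rng = np.random.default_rng(2)
V = np.eye(4)[:, :2]                  # placeholder spatial filters (not CSP)
trials, y = [], []
for label in (0, 1):
    for _ in range(50):               # synthetic band-power difference per class
        scale = np.array([2.0, 0.5, 1, 1]) if label else np.array([0.5, 2.0, 1, 1])
        trials.append(rng.standard_normal((4, 200)) * scale[:, None])
        y.append(label)
y = np.array(y)

F = log_var_features(trials, V)
w, b = fit_lda(F, y)
pred = (F @ w + b > 0).astype(int)
assert (pred == y).mean() > 0.9       # the two classes separate cleanly
```

In the paper's actual setup the filters V come from CSP or β-divCSP and the misclassification rate is measured on feedback data; here the synthetic variance difference only serves to exercise the pipeline.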
The MCDE parameter determines the expected proportion of artifacts in the data. The results are shown in Figure 3. Each circle denotes the error rate of one subject. One can see that the β-divCSP method outperforms the baselines as most circles are below the solid line. Furthermore, the performance increases are significant according to the one-sided Wilcoxon signed-rank test as the p-values are smaller than 0.05.

Figure 3: Performance results of the CSP, shrinkage + CSP and MCDE + CSP baselines compared to β-divCSP. Each circle represents the error rate of one subject. Our method outperforms the baselines for circles that are below the solid line. The p-values of the one-sided Wilcoxon signed-rank test (0.0005, 0.0178 and 0.0407, respectively) are shown in the lower right corner.

We made an interesting observation when analysing the subject with the largest improvement over the CSP baseline; the error rates were 48.6% (CSP), 48.6% (MCDE+CSP) and 11.0% (β-divCSP). Over all ranges of MCDE parameters this subject has an error rate higher than 48%, i.e. MCDE was not able to help in this case. This example shows that β-divCSP and MCDE are not equivalent. Enforcing robustness in the CSP algorithm may in some cases be better than enforcing robustness when estimating the covariance matrices.

In the following we study the robustness property of the β-divCSP method on subject 74, the user with the largest improvement (CSP error rate: 48.6% and β-divCSP error rate: 11.0%). The left panel of Figure 4 displays the activity pattern associated with the most important CSP filter of subject 74. One can clearly see that the pattern does not encode neurophysiologically relevant activity, but focuses on a single electrode, namely FFC6.
When analysing the (filtered) EEG signal of this electrode one can identify a strong artifact in one of the trials. Since neither the empirical covariance estimator nor the CSP algorithm is robust to this kind of outlier, it dominates the solution. However, the resulting pattern is meaningless as it does not capture motor imagery related activity. The right panel of Figure 4 shows the relative importance of the divergence term of the artifactual trial with respect to the average divergence terms of the other trials. One can see that the divergence term computed from the artifactual trial is over 1800 times larger than the average of the other trials. This ratio decreases rapidly for larger β values, thus the influence of the artifact decreases. Thus, our experiments provide an excellent example of the robustness property of the β-divCSP approach.

Figure 4: Left: The CSP pattern of subject 74 does not reflect neurophysiological activity but represents the artifact (red ellipse) in electrode FFC6. Right: The relative importance of this artifactual trial decreases with the β parameter. The relative importance is measured as the quotient between the divergence term of the artifactual trial and the average divergence terms of the other trials.

5 Discussion

Analysis of EEG data is challenging because the signal of interest is typically present with a low signal to noise ratio. Moreover, artifacts and non-stationarity require robust algorithms. This paper has placed its focus on robust estimation and proposed a novel algorithm family giving rise to a beta divergence algorithm which allows robust spatial filter computation for BCI.
In the very common setting where EEG electrodes become loose or movement related artifacts occur in some trials, it is a practical necessity to either ignore these trials (which reduces an already small sample size further) or to enforce intrinsic invariance to these disturbances in the learning procedures. Here, we have used CSP, the standard filtering technique in BCI, as a starting point and reformulated it in terms of an optimization problem maximizing the divergence between the class distributions that correspond to two cognitive states. By borrowing the concept of beta divergences, we could adapt the optimization problem and arrive at a robust spatial filter computation based on CSP. We showed that our novel method can reduce the influence of artifacts in the data significantly and thus allows robust extraction of relevant filters for BCI applications.

In future work we will investigate the properties of other divergences for Brain-Computer Interfacing and consider further applications like ERP-based BCIs [31] and beyond the neurosciences.

Acknowledgment: We thank Daniel Bartz and Frank C. Meinecke for valuable discussions. This work was supported by the German Research Foundation (GRK 1589/1), by the Federal Ministry of Education and Research (BMBF) under the project Adaptive BCI (FKZ 01GQ1115) and by the Brain Korea 21 Plus Program through the National Research Foundation of Korea funded by the Ministry of Education.

Appendix

Sketch of proof of Theorem 1
Cauchy's interlacing theorem [17] establishes a relation between the eigenvalues µ_1 ≤ ... ≤ µ_D of the original covariance matrix Σ and the eigenvalues ν_1 ≤ ... ≤ ν_d of the projected one, VΣV^T. The theorem says that

µ_j ≤ ν_j ≤ µ_{D−d+j}.

In the proof we split the optimal projection V* into two parts U_1 ∈ R^{k×D} and U_2 ∈ R^{(d−k)×D} based on whether the first or second trace term in Eq. (4) is larger when applying the spatial filters. By using Cauchy's theorem we then show that L_kl(U) ≤ L_kl(W), where W consists of the k eigenvectors with largest eigenvalues; equality only holds if U and W coincide (up to linear transformations). We show an analogous relation for U_2 and conclude that V* must be the CSP solution (up to linear transformations). See the full proof in the supplement material.

Sketch of the proof of Theorem 2
Since there is only one discriminative direction we may perform the analysis in a basis whereby the covariances of both classes have the form diag(a, 1, ..., 1) and diag(b, 1, ..., 1). If we show that consistency holds in this basis then it is a simple matter to prove consistency in the original basis. We want to show that as the number of trials N increases, the filter provided by sumkl-divCSP converges to the true solution v*. If the support of the density of the eigenvalues includes a region around 0, then there is no hope of showing that the matrix inversion is stable. However, it has been shown in the random matrix theory literature [32] that if D and n tend to ∞ in a ratio c = D/n, then all of the eigenvalues apart from the largest lie between (1 − √c)² and (1 + √c)², whereas the largest sample eigenvalue converges almost surely to α + cα/(α − 1) (α denotes the true non-unit eigenvalue) provided α > 1 + √c, independently of the distribution of the data; a similar result applies if one true eigenvalue is smaller than the rest. This implies that for sufficient discriminability in the true distribution and sufficiently many data points per trial, each filter maximizing each term in the sum has non-zero dot-product with the true maximizing filter. But since the trials are independent, this implies that in the limit of N trials the maximizing filter corresponds to the true filter. Note that the full proof goes well beyond the scope of this contribution.

References
[1] G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K.-R. Müller, Eds., Toward Brain-Computer Interfacing. Cambridge, MA: MIT Press, 2007.
[2] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, "Brain-computer interfaces for communication and control," Clin. Neurophysiol., vol. 113, no. 6, pp. 767-791, 2002.
[3] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K.-R. Müller, "Optimizing spatial filters for robust EEG single-trial analysis," IEEE Signal Proc. Magazine, vol. 25, no. 1, pp. 41-56, 2008.
[4] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller, "Optimal spatial filtering of single trial EEG during imagined hand movement," IEEE Trans. Rehab. Eng., vol. 8, no. 4, pp. 441-446, 1998.
[5] L. C. Parra, C. D. Spence, A. D. Gerson, and P. Sajda, "Recipes for the linear analysis of EEG," NeuroImage, vol. 28, pp. 326-341, 2005.
[6] S. Lemm, B. Blankertz, T. Dickhaus, and K.-R. Müller, "Introduction to machine learning for brain imaging," NeuroImage, vol. 56, no. 2, pp. 387-399, 2011.
[7] F. Lotte and C. Guan, "Regularizing common spatial patterns to improve BCI designs: Unified theory and new algorithms," IEEE Trans. Biomed. Eng., vol. 58, no. 2, pp. 355-362, 2011.
[8] W. Samek, C. Vidaurre, K.-R. Müller, and M.
Kawanabe, "Stationary common spatial patterns for brain-computer interfacing," Journal of Neural Engineering, vol. 9, no. 2, p. 026013, 2012.

[9] O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004.

[10] H. Lu, H.-L. Eng, C. Guan, K. Plataniotis, and A. Venetsanopoulos, "Regularized common spatial pattern with aggregation for EEG classification in small-sample setting," IEEE Transactions on Biomedical Engineering, vol. 57, no. 12, pp. 2936–2946, 2010.

[11] D. Devlaminck, B. Wyns, M. Grosse-Wentrup, G. Otte, and P. Santens, "Multi-subject learning for common spatial patterns in motor-imagery BCI," Computational Intelligence and Neuroscience, vol. 2011, no. 217987, pp. 1–9, 2011.

[12] B. Blankertz, M. Kawanabe, R. Tomioka, F. U. Hohlefeld, V. Nikulin, and K.-R. Müller, "Invariant common spatial patterns: Alleviating nonstationarities in brain-computer interfacing," in Advances in Neural Information Processing Systems 20, 2008, pp. 113–120.

[13] W. Samek, F. C. Meinecke, and K.-R. Müller, "Transferring subspaces between subjects in brain-computer interfacing," IEEE Transactions on Biomedical Engineering, vol. 60, no. 8, pp. 2289–2298, 2013.

[14] M. Arvaneh, C. Guan, K. K. Ang, and C. Quek, "Optimizing spatial filters by minimizing within-class dissimilarities in electroencephalogram-based brain-computer interface," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 4, pp. 610–619, 2013.

[15] S. Amari, H. Nagaoka, and D. Harada, Methods of Information Geometry. American Mathematical Society, 2000.

[16] S. Eguchi and Y. Kano, "Robustifying maximum likelihood estimation," Tokyo Institute of Statistical Mathematics, Tokyo, Japan, Tech. Rep., 2001.

[17] R. Bhatia, Matrix Analysis, ser. Graduate Texts in Mathematics.
Springer, 1997, vol. 169.

[18] M. Mihoko and S. Eguchi, "Robust blind source separation by beta divergence," Neural Comput., vol. 14, no. 8, pp. 1859–1886, Aug. 2002.

[19] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Comput., vol. 23, no. 9, pp. 2421–2456, Sep. 2011.

[20] A. Hyvärinen, "Survey on independent component analysis," Neural Computing Surveys, vol. 2, pp. 94–128, 1999.

[21] M. Kawanabe, W. Samek, P. von Bünau, and F. Meinecke, "An information geometrical view of stationary subspace analysis," in Artificial Neural Networks and Machine Learning – ICANN 2011, ser. LNCS, vol. 6792. Springer Berlin/Heidelberg, 2011, pp. 397–404.

[22] N. Murata, T. Takenouchi, and T. Kanamori, "Information geometry of U-Boost and Bregman divergence," Neural Computation, vol. 16, pp. 1437–1481, 2004.

[23] H. Wang, "Harmonic mean of Kullback-Leibler divergences for optimizing multi-class EEG spatio-temporal filters," Neural Processing Letters, vol. 36, no. 2, pp. 161–171, 2012.

[24] P. von Bünau, F. C. Meinecke, F. C. Király, and K.-R. Müller, "Finding stationary subspaces in multivariate time series," Physical Review Letters, vol. 103, no. 21, p. 214101, 2009.

[25] P. von Bünau, "Stationary subspace analysis – towards understanding non-stationary data," Ph.D. dissertation, Technische Universität Berlin, 2012.

[26] W. Samek, M. Kawanabe, and K.-R. Müller, "Divergence-based framework for common spatial patterns algorithms," IEEE Reviews in Biomedical Engineering, 2014, in press.

[27] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, "Robust and efficient estimation by minimising a density power divergence," Biometrika, vol. 85, no. 3, pp.
549–559, 1998.

[28] P. J. Huber, Robust Statistics, ser. Wiley Series in Probability and Statistics. Wiley-Interscience, 1981.

[29] B. Blankertz, C. Sannelli, S. Halder, E. M. Hammer, A. Kübler, K.-R. Müller, G. Curio, and T. Dickhaus, "Neurophysiological predictor of SMR-based BCI performance," NeuroImage, vol. 51, no. 4, pp. 1303–1309, 2010.

[30] P. J. Rousseeuw and K. V. Driessen, "A fast algorithm for the minimum covariance determinant estimator," Technometrics, vol. 41, no. 3, pp. 212–223, 1999.

[31] B. Blankertz, S. Lemm, M. S. Treder, S. Haufe, and K.-R. Müller, "Single-trial analysis and classification of ERP components – a tutorial," NeuroImage, vol. 56, no. 2, pp. 814–825, 2011.

[32] J. Baik and J. Silverstein, "Eigenvalues of large sample covariance matrices of spiked population models," Journal of Multivariate Analysis, vol. 97, no. 6, pp. 1382–1408, 2006.
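Both ingredients of the proof sketches above are easy to check numerically: the interlacing inequality µ_j ≤ ν_j ≤ µ_{D−d+j} used for Theorem 1, and the spiked-population eigenvalue limit α + cα/(α − 1) from [32] used for Theorem 2. The following NumPy sketch is purely illustrative; the dimensions, the spike strength α = 5, and the ratio c = 0.5 are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-9  # numerical slack for the inequality checks

# --- Cauchy interlacing, the main tool in the proof of Theorem 1 ---
# If V (d x D) has orthonormal rows, the ascending eigenvalues nu_1, ..., nu_d
# of V @ Sigma @ V.T satisfy mu_j <= nu_j <= mu_{D-d+j}, where mu_1 <= ... <= mu_D
# are the ascending eigenvalues of Sigma.
D, d = 8, 3
A = rng.standard_normal((D, D))
Sigma = A @ A.T                                   # random symmetric PSD matrix
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))  # D x d, orthonormal columns
V = Q.T                                           # d x D, orthonormal rows
mu = np.linalg.eigvalsh(Sigma)                    # ascending order
nu = np.linalg.eigvalsh(V @ Sigma @ V.T)
interlaced = all(mu[j] <= nu[j] + eps and nu[j] <= mu[D - d + j] + eps
                 for j in range(d))

# --- Spiked-covariance limit from random matrix theory [32] ---
# For n samples from N(0, diag(alpha, 1, ..., 1)) with D/n = c, the bulk of the
# sample eigenvalues stays inside [(1 - sqrt(c))^2, (1 + sqrt(c))^2], while the
# largest converges to alpha + c * alpha / (alpha - 1) whenever alpha > 1 + sqrt(c).
alpha, n, c = 5.0, 2000, 0.5
Dn = int(c * n)
X = rng.standard_normal((n, Dn))
X[:, 0] *= np.sqrt(alpha)                 # plant one spiked direction
sample_eig = np.linalg.eigvalsh(X.T @ X / n)
predicted_top = alpha + c * alpha / (alpha - 1.0)
bulk_edge = (1.0 + np.sqrt(c)) ** 2

print(interlaced)                         # interlacing holds
print(sample_eig[-1], predicted_top)      # top eigenvalue vs. predicted limit
print(sample_eig[-2], bulk_edge)          # second eigenvalue stays near the bulk edge
```

At these (finite) sample sizes the top sample eigenvalue agrees with the almost-sure limit only up to fluctuations of order n^(−1/2), which is the gap the consistency argument for Theorem 2 closes by letting the number of trials N grow.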