{"title": "Logistic Regression for Single Trial EEG Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1377, "page_last": 1384, "abstract": null, "full_text": "Logistic Regression for Single Trial EEG Classification
Ryota Tomioka, Kazuyuki Aihara, Dept. of Mathematical Informatics, IST, The University of Tokyo, 113-8656 Tokyo, Japan. ryotat@first.fhg.de aihara@sat.t.u-tokyo.ac.jp Klaus-Robert Müller, Dept. of Computer Science, Technical University of Berlin, Franklinstr. 28/29, 10587 Berlin, Germany. klaus@first.fhg.de

Abstract
We propose a novel framework for the classification of single trial ElectroEncephaloGraphy (EEG), based on regularized logistic regression. Framed in this robust statistical framework, no prior feature extraction or outlier removal is required. We present two variations of parameterizing the regression function: (a) with a full rank symmetric matrix coefficient and (b) as a difference of two rank=1 matrices. In the first case, the problem is convex and the logistic regression is optimal under a generative model. The latter case is shown to be related to the Common Spatial Pattern (CSP) algorithm, which is a popular technique in Brain Computer Interfacing. The regression coefficients can also be topographically mapped onto the scalp similarly to CSP projections, which allows neuro-physiological interpretation. Simulations on 162 BCI datasets demonstrate that classification accuracy and robustness compare favorably against conventional CSP based classifiers.

1 Introduction

The goal of Brain-Computer Interface (BCI) research [1, 2, 3, 4, 5, 6, 7] is to provide a direct control pathway from human intentions reflected in brain signals to computers. Such a system will not only provide disabled people more direct and natural control over a neuroprosthesis or over a computer application (e.g. 
[2]) but also opens up a further channel of man-machine interaction for healthy people to communicate solely by their intentions. Machine learning approaches to BCI have proven to be effective by requiring less subject training and by compensating for the high inter-subject variability. In this field, a number of studies have focused on constructing better low dimensional representations that combine various features of brain activities [3, 4], because the problem of classifying EEG signals is intrinsically high dimensional. In particular, efforts have been made to reduce the number of electrodes by eliminating electrodes recursively [8] or by decomposition techniques, e.g., ICA, which only uses the marginal distribution, or Common Spatial Patterns (CSP) [9], which additionally takes the labels into account. In practice, a BCI system has often been constructed by combining a feature extraction step and a classification step. Our contribution is a logistic regression classifier that integrates both steps under the roof of a single minimization problem and uses well controlled regularization. Moreover, the classifier output has a probabilistic interpretation. We study a BCI based on the motor imagination paradigm. Motor imagination can be captured through spatially localized bandpower modulation in the μ- (10-15Hz) or β- (20-30Hz) band characterized by the second-order statistics of the signal; the underlying neuro-physiology is well known as Event Related Desynchronization (ERD) [10].

(Affiliation footnotes: Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany; ERATO Aihara Complexity Modeling Project, JST, 153-8505 Tokyo, Japan.)

1.1 Problem setting

Let us denote by X \in R^{d \times T} the EEG signal of a single trial of an imaginary motor movement^1, where d is the number of electrodes and T is the number of sampled time-points in a trial. We consider a binary classification problem where each class, e.g. 
right or left hand imaginary movement, is called positive (+) or negative (-) class. Let y \in \{+1, -1\} be the class label. Given a set of trials and labels \{X_i, y_i\}_{i=1}^n, the task is to predict the class label y for an unobserved trial X.

1.2 Conventional method: classifying with CSP features

In the motor-imagery EEG signal classification, Common Spatial Pattern (CSP) based classifiers have proven to be powerful [11, 3, 6]. CSP is a decomposition method proposed by Koles [9] that finds a set of projections that simultaneously diagonalize the covariance matrices corresponding to two brain states. Formally, the covariance matrices^2 are defined as:

    \Sigma_c = \frac{1}{|I_c|} \sum_{i \in I_c} X_i X_i^\top \quad (c \in \{+, -\}),   (1)

where I_c is the set of indices belonging to a class c \in \{+, -\}; thus I_+ \cup I_- = \{1, . . . , n\}. Then, the simultaneous diagonalization is achieved by solving the following generalized eigenvalue problem:

    \Sigma_+ w = \lambda \Sigma_- w.   (2)

Note that for each pair of eigenvector and eigenvalue (w_j, \lambda_j), the equality \lambda_j = (w_j^\top \Sigma_+ w_j) / (w_j^\top \Sigma_- w_j) holds. Therefore, the eigenvector with the largest eigenvalue corresponds to the projection with the maximum ratio of power for the \"+\" class and the \"-\" class, and the other way around for the eigenvector with the smallest eigenvalue. In this paper, we call these eigenvectors filters^3; we call the eigenvector of an eigenvalue smaller (or larger) than one a filter for the \"+\" class (or the \"-\" class), respectively, because the signal projected with them optimally (in the spirit of eigenvalues) captures the task related de-synchronization in each class. It is common practice that only the n_of eigenvectors with the largest eigenvalues and the n_of eigenvectors with the smallest eigenvalues are used to construct a low dimensional feature representation. The feature vector consists of logarithms of the projected signal powers, and a Linear Discriminant Analysis (LDA) classifier is trained on the resulting feature vector. 
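As a concrete illustration, the CSP decomposition of Eqs. (1) and (2) can be sketched in a few lines. This is a minimal example and not the authors' implementation; the function name csp_filters and the synthetic trials are our own, and we use SciPy's generalized symmetric eigensolver.

```python
# Sketch of CSP: class-wise covariances (Eq. (1)) and the generalized
# eigenvalue problem Sigma_+ w = lambda Sigma_- w (Eq. (2)).
import numpy as np
from scipy.linalg import eigh

def csp_filters(X_pos, X_neg, n_of=1):
    """X_pos, X_neg: lists of (d x T) band-pass filtered, centered trials."""
    def cov(trials):
        return sum(X @ X.T for X in trials) / len(trials)  # Eq. (1)
    # eigh(A, B) solves A w = lambda B w; eigenvalues come back ascending
    lams, W = eigh(cov(X_pos), cov(X_neg))
    # keep n_of filters from each end of the spectrum (J = 2 n_of)
    idx = list(range(n_of)) + list(range(len(lams) - n_of, len(lams)))
    return W[:, idx], lams[idx]

# synthetic trials: channel 0 is weak for class "+" (ERD-like), channel 1 for "-"
rng = np.random.RandomState(0)
d, T = 4, 200
X_pos = [np.diag([0.3, 3.0, 1.0, 1.0]) @ rng.randn(d, T) for _ in range(30)]
X_neg = [np.diag([3.0, 0.3, 1.0, 1.0]) @ rng.randn(d, T) for _ in range(30)]
W, lams = csp_filters(X_pos, X_neg, n_of=1)
```

The smallest eigenvalue (well below one) belongs to the \"+\"-class filter and the largest to the \"-\"-class filter, matching the convention above; log-powers of the projected trials would then feed the LDA classifier.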
To summarize, the conventional CSP based classifier can be constructed as follows:

How to build a CSP based classifier:
1. Solve the generalized eigenvalue problem Eq. (2).
2. Take the n_of largest and the n_of smallest eigenvectors \{w_j\}_{j=1}^J (J = 2 n_of).
3. x_i := ( \log w_j^\top X_i X_i^\top w_j )_{j=1}^J  (i = 1, . . . , n).
4. Train an LDA classifier on \{x_i, y_i\}_{i=1}^n.

^1 For simplicity, we assume that the signal is already band-pass filtered and each trial is centered and scaled as X = \frac{1}{\sqrt{T}} X_{original} ( I_T - \frac{1}{T} 1 1^\top ).
^2 Although it is convenient to call Eq. (1) a covariance matrix, calling it an averaged cross power matrix gives better insight into the nature of the problem, because we are focusing on the task related modulation of rhythmic activities.
^3 According to the convention by [12].

2 Theory

2.1 The model

We consider the following discriminative model; we model the symmetric logit transform of the posterior class probability to be a linear function with respect to the second order statistics of the EEG signal:

    \log \frac{P(y=+1|X)}{P(y=-1|X)} = f(X; \theta) := \mathrm{tr}( W X X^\top ) + b,   (3)

where \theta := (W, b) \in Sym(d) \times R, W is a symmetric d \times d matrix and b is the bias term. The model (3) can be derived by assuming a zero-mean Gaussian distribution with no temporal correlation with a covariance matrix \Sigma_c for each class, as follows:

    \log \frac{P(y=+1|X)}{P(y=-1|X)} = \frac{1}{2} \mathrm{tr}( ( \Sigma_-^{-1} - \Sigma_+^{-1} ) X X^\top ) + const.   (4)

However, training of a discriminative model is robust to misspecification of the marginal distribution P(X) [13]. In other words, the marginal distribution P(X) is a nuisance parameter; we maximize the joint log-likelihood, which is decomposed as \log P(y, X | \theta) = \log P(y | X, \theta) + \log P(X), only with respect to \theta [14]. Therefore, no assumption about the generative model is necessary. Note that from Eq. (4) normally the optimal W has both positive and negative eigenvalues.

2.2 Logistic regression

2.2.1 Linear logistic regression

We minimize the negative log-likelihood of Eq.
(3) with an additional regularization term, which is written as follows:

    \min_{W \in Sym(d), b \in R} \frac{1}{n} \sum_{i=1}^n \log( 1 + e^{-y_i f(X_i; \theta)} ) + \frac{C}{2n} ( \mathrm{tr}( \Sigma_P W \Sigma_P W ) + b^2 ).   (5)

Here, the pooled covariance matrix \Sigma_P := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top is introduced in the regularization term in order to make the regularization invariant to linear transformation of the data; if we rewrite W with \tilde{W} defined through W = \Sigma_P^{-1/2} \tilde{W} \Sigma_P^{-1/2}, one can easily see that the regularization term is simply the squared Frobenius norm of the symmetric matrix \tilde{W}; the transformation corresponds to the whitening of the signal, \tilde{X} = \Sigma_P^{-1/2} X. By simple calculation, one can see that the loss term is the negative logarithm of the conditional likelihood \prod_{i=1}^n 1/(1 + e^{-y_i f(X_i; \theta)}), in other words the probability of observing head (y_i = +1) or tail (y_i = -1) by tossing n coins with probability P(y = +1 | X = X_i, \theta) (i = 1, . . . , n) for the head. From a general point of view, the loss term of Eq. (5) converges asymptotically to the true loss, where the empirical average is replaced by the expectation over X and y, whose minimum over functions in L^2(P_X) is achieved by the symmetric logit transform of P(y = +1 | X) [15]. Note that the problem Eq. (5) is convex. The problem of classifying motor imagery EEG signals is now addressed under a single loss function. Based on the criterion (Eq. (5)) we can say how good a solution is, and we know how to properly regularize it.

2.2.2 Rank=2 approximation of the linear logistic regression

Here we present a rank=2 approximation of the regression function (3). Using this approximation we can greatly reduce the number of parameters to be estimated, from a symmetric matrix coefficient to a pair of projection coefficients, and additionally gain insight into the relevant feature the classifier has found. The rank=2 approximation of the regression function (3) is written as follows:

    f(X; \theta) := \frac{1}{2} \mathrm{tr}( ( -w_1 w_1^\top + w_2 w_2^\top ) X X^\top ) + b,   (6)

where \theta := (w_1, w_2, b) \in R^d \times R^d \times R. 
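To make the optimization problem concrete, here is a minimal sketch (ours, not the authors' code) of the objective in Eq. (5), written for already-whitened trials so that \Sigma_P = I and the penalty reduces to the squared Frobenius norm of W plus b^2.

```python
# Sketch of the regularized negative log-likelihood of Eq. (5),
# assuming whitened trials (Sigma_P = I).
import numpy as np

def f(X, W, b):
    return np.trace(W @ X @ X.T) + b          # Eq. (3)

def objective(trials, labels, W, b, C):
    n = len(trials)
    loss = np.mean([np.log1p(np.exp(-y * f(X, W, b)))
                    for X, y in zip(trials, labels)])
    penalty = C / (2 * n) * (np.sum(W * W) + b ** 2)  # ||W||_F^2 + b^2
    return loss + penalty

rng = np.random.RandomState(1)
trials = [rng.randn(2, 50) / np.sqrt(50) for _ in range(4)]
labels = [1, -1, 1, -1]
print(objective(trials, labels, np.zeros((2, 2)), 0.0, 1.0))  # log 2 when f == 0
```

Both the loss and the penalty are convex in (W, b), so any gradient-based solver finds the global optimum; the classifier output 1/(1 + e^{-f}) is directly the class probability.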
The rationale for choosing this special form of function is that the Bayes optimal regression coefficient in Eq. (4) is the difference of two positive definite matrices; therefore two bases with opposite signs are at least necessary to capture the nature of Eq. (4) (incorporating more bases goes beyond the scope of this contribution). The rank=2 parameterized logistic regression can be obtained by minimizing the sum of the logistic regression loss and regularization terms similarly to Eq. (5):

    \min_{w_1, w_2 \in R^d, b \in R} \frac{1}{n} \sum_{i=1}^n \log( 1 + e^{-y_i f(X_i; \theta)} ) + \frac{C}{2n} ( w_1^\top \Sigma_P w_1 + w_2^\top \Sigma_P w_2 + b^2 ).   (7)

Here, again the pooled covariance matrix \Sigma_P is used as a metric in order to ensure the invariance to linear transformations. Note that the bases \{w_1, w_2\} give projections of the signal into a two dimensional feature space in a similar manner as CSP (see Sec. 1.2). We call w_1 and w_2 filters corresponding to the \"+\" and \"-\" classes, respectively, similarly to CSP. The filters can be topographically mapped onto the scalp, from which insight into the classifier can be obtained. However, the major difference between CSP and the rank=2 parameterized logistic regression (Eq. (7)) is that in our new approach there is no distinction between the feature extraction step and the classifier training step. The coefficient that linearly combines the features (i.e., the norm of w_1 and w_2) is optimized in the same optimization problem (Eq. (7)).

3 Results

3.1 Experimental settings

We compare the logistic regression classifiers (Eqs. (3) and (6)) against CSP based classifiers with n_of = 1 (2 filters in total) and n_of = 3 (6 filters in total). The comparison is a chronological validation: all methods are trained on the first half of the samples and applied on the second half. 
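The rank=2 regression function and the two-dimensional feature space spanned by the filters can be sketched as follows (an illustrative reading of Eq. (6) with names of our own choosing; the sign convention puts the \"+\"-class filter w1 with a negative sign, as in the Bayes optimal coefficient of Eq. (4)).

```python
# Sketch of the rank-2 regression function and its CSP-like projections.
import numpy as np

def f_rank2(X, w1, w2, b):
    # Eq. (6): f(X) = 1/2 tr[(-w1 w1^T + w2 w2^T) X X^T] + b
    S = X @ X.T
    return 0.5 * (-(w1 @ S @ w1) + (w2 @ S @ w2)) + b

def projected_powers(X, w1, w2):
    # the two filters map a trial to a 2-D feature space, as CSP does
    S = X @ X.T
    return np.array([w1 @ S @ w1, w2 @ S @ w2])
```

In Eq. (7) the norms of w1 and w2 play the role of the linear weights that a separate classifier would otherwise have to learn on CSP features, which is why no separate feature-extraction step is needed.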
We use 60 BCI experiments [6] from 29 subjects, where the subjects performed three imaginary movements, namely \"right hand\" (R), \"left hand\" (L) and \"foot\" (F), according to the visual cue presented on the screen, except in 9 experiments where only two classes were performed. Since we focus on binary classification, all the pairwise combinations of the performed classes produced 162 (= 51 × 3 + 9) datasets. Each dataset contains 70 to 600 trials (median 280) of imaginary movements. All the recordings come from the calibration measurements, i.e. no feedback was presented to the subjects. The signal was recorded from the scalp with multi-channel EEG amplifiers using 32, 64 or 128 channels. The signal was sampled at 1000Hz and down-sampled to 100Hz before the processing. The signal is band-pass filtered at 7-30Hz, and the interval 500-3500ms after the appearance of the visual cue is cut out from the continuous EEG signal as a trial X. The training data is whitened before minimizing Eqs. (5) and (7) because both problems become considerably simpler when \Sigma_P is an identity matrix. For the prediction of test data, the coefficients including the whitening operation, W = \Sigma_P^{-1/2} \tilde{W} \Sigma_P^{-1/2} for Eq. (3) and w_j = \Sigma_P^{-1/2} \tilde{w}_j (j = 1, 2) for Eq. (6), are used, where \tilde{W} and \tilde{w}_j denote the minimizers of Eqs. (5) and (7) for the whitened data. Note that we did not whiten the training and test data jointly, which could have improved the performance. The regularization constant C for the proposed method is chosen by 5 × 10 cross-validation on the training set.

3.2 Classification performance

In Fig. 1, logistic regression (LR) classifiers with the full rank parameterization (Eq. (3); left column) and the rank=2 parameterization (Eq. (6); right column) are compared against CSP based classifiers with 6 filters (top row) and 2 filters (bottom row). Each plot shows the bit-rates achieved by CSP (horizontal) and LR (vertical) for each dataset as a circle. 
Here the bit-rate (per decision) is defined based on the classification test error p_err as the capacity of a binary symmetric channel with the same error probability:

    1 - ( p_{err} \log_2 \frac{1}{p_{err}} + (1 - p_{err}) \log_2 \frac{1}{1 - p_{err}} ).

[Figure 1: four scatter plots of bit-rates; vertical axes LR (full rank) and LR (rank=2), horizontal axes CSP (6 filters) and CSP (2 filters); above/below-diagonal proportions: 43%/48% (full rank vs. 6 filters), 52%/38% (rank=2 vs. 6 filters), 52%/43% (full rank vs. 2 filters), 64%/28% (rank=2 vs. 2 filters).]

Figure 1: Comparison of bit-rates achieved by the CSP based classifiers and the logistic regression (LR) classifiers. The bit-rates achieved by the conventional CSP based classifier and the proposed LR classifier are shown as a circle for each dataset. The proportion of datasets lying above/below the diagonal is shown at the top-left/bottom-right corner of each plot, respectively. Only the difference between CSP with 2 filters and rank=2 approximated LR (lower right) is significant based on the Fisher sign test at the 5% level.

The proposed method improves upon the conventional method for datasets lying above the diagonal. Note that our proposed logistic regression ansatz is significantly better only in the lower right plot. Figure 2 shows examples of spatial filter coefficients obtained by CSP (6 filters) and rank=2 parameterized logistic regression. The CSP filters for subject A (see Fig. 2(a)) include typical cases (the first filter for the \"left hand\" class and the first two filters for the \"right hand\" class) of filters corrupted by artifacts, e.g., muscle movements. The CSP filters for the \"foot\" class in subject B (see Fig. 2(b)) are corrupted by strong occipital α-activity, which might have been weakly correlated to the labels by chance. Note that CSP with 2 filters only uses the first filter for each class, which corresponds to the first row in Figs. 2(a) and 2(b). 
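For reference, the bit-rate formula above can be computed as follows (a small helper of our own naming):

```python
# Bit-rate per decision: capacity of a binary symmetric channel with
# crossover probability p_err (the classification test error).
import math

def bit_rate(p_err):
    if p_err in (0.0, 1.0):      # perfect (or perfectly inverted) decisions
        return 1.0
    return 1.0 - (p_err * math.log2(1.0 / p_err)
                  + (1.0 - p_err) * math.log2(1.0 / (1.0 - p_err)))
```

bit_rate(0.5) is 0: a classifier at chance level transmits no information, while bit_rate(0.1) is about 0.53 bits per decision.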
On the other hand, the filter coefficients obtained by the logistic regression are clearly focused on the area physiologically corresponding to ERD in the motor cortex (see Figs. 2(c) and (d)).

4 Discussion

4.1 Relation to CSP

Here, we show that at the optimum of Eq. (7) the regression coefficients w_1 and w_2 are generalized eigenvectors of two uncertainty weighted covariance matrices corresponding to the two motor imagery classes, which are weighted by the uncertainty of the decision 1 - P(y = y_i | X = X_i) for each sample. Samples that are easily explained by the regression function are weighted low, whereas those lying close to the decision boundary or those lying on the wrong side of the boundary are weighted high. Although both CSP and the rank=2 approximated logistic regression can be understood as generalized eigenvalue decompositions, the classification-optimized weighting in the logistic regression yields filters that focus on the task related modulation of rhythmic activities more clearly when compared to CSP, as shown in Fig. 2. Differentiating Eq. (7) with respect to either w_1 or w_2, we obtain the following equality, which holds at the optimum:

    \pm \sum_{i=1}^n \frac{e^{-z_i}}{1 + e^{-z_i}} y_i X_i X_i^\top w_j^* + C \Sigma_P w_j^* = 0 \quad (j = 1, 2),   (8)

where we define the shorthand z_i := y_i f(X_i; \theta) and \pm denotes + and - for j = 1, 2, respectively. Moreover, Eq. (8) can be rewritten as follows:

    \Theta_-(\theta, 0) w_1^* = \Theta_+(\theta, C) w_1^*,   (9)
    \Theta_+(\theta, 0) w_2^* = \Theta_-(\theta, C) w_2^*,   (10)

where we define the uncertainty weighted covariance matrix as:

    \Theta_\pm(\theta, C) = \sum_{i \in I_\pm} \frac{e^{-z_i}}{1 + e^{-z_i}} X_i X_i^\top + \frac{C}{n} \sum_{i=1}^n X_i X_i^\top.

Note that increasing the regularization constant C biases the uncertainty weighted covariance matrix towards the pooled covariance matrix \Sigma_P; the regularization only affects the right-hand side of Eqs. (9) and (10). If C > 0, the optimal filter coefficients w_j^* (j = 1, 2) are the generalized eigenvectors of Eqs. (9) and (10), respectively. 
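The uncertainty weighted covariance matrix above can be sketched as follows (illustrative code of ours; theta_cov and its arguments are names we introduce). The weight e^{-z_i}/(1 + e^{-z_i}) equals 1 - P(y = y_i | X_i): confidently classified trials contribute little, uncertain or misclassified ones a lot.

```python
import numpy as np

def theta_cov(trials, labels, z, C, sign):
    """Theta_{+/-}(theta, C): trials is a list of (d x T) arrays,
    z[i] = y_i f(X_i; theta), sign selects the class (+1 or -1)."""
    n = len(trials)
    d = trials[0].shape[0]
    Sigma_P = sum(X @ X.T for X in trials) / n          # pooled covariance
    Theta = np.zeros((d, d))
    for X, y, zi in zip(trials, labels, z):
        if y == sign:
            weight = np.exp(-zi) / (1.0 + np.exp(-zi))  # 1 - P(y=y_i|X_i)
            Theta += weight * (X @ X.T)
    return Theta + C * Sigma_P

X = np.eye(2)
trials, labels = [X, 2.0 * X], [+1, -1]
Theta_plus = theta_cov(trials, labels, z=[0.0, 0.0], C=0.0, sign=+1)
```

With z_i = 0 (a trial exactly on the decision boundary) the weight is 1/2; as z_i grows, the trial fades out of Theta. This is the classification-optimized weighting that distinguishes Eqs. (9)-(10) from plain CSP.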
4.2 CSP is not optimal

When first proposed, CSP was a decomposition technique rather than a classification technique (see [9]). After being introduced to the BCI community by [11], it also proved to be powerful in classifying imaginary motor movements [3, 6]. However, since it is not optimized for the classification problem, there are two major drawbacks. Firstly, the selection of \"good\" CSP components is usually done somewhat arbitrarily. A widely used heuristic is to choose several generalized eigenvectors from both ends of the eigenvalue spectrum. However, as for subject B in Fig. 2, it is often observed that filters corresponding to overwhelmingly strong power come to the top of the spectrum even though they are not strongly correlated to the label. In practice, an experienced investigator can choose good filters by looking at them; however, the validity of the selection cannot be assessed, because the manual selection cannot be done inside the cross-validation. Secondly, simultaneous diagonalization of covariance matrices can suffer greatly from a few outlier trials, as seen for subject A in Fig. 2. Again, in practice one can inspect the EEG signals to detect outliers; however, a manual outlier detection is also a somewhat arbitrary, non-reproducible process, which cannot be validated.

5 Conclusion

In this paper, we have proposed a unified framework for single trial classification of motor-imagery EEG signals. The problem is addressed as a single minimization problem without any prior feature extraction or outlier removal steps. The task is to minimize a logistic regression loss with a regularization term. The regression function is a linear function with respect to the second order statistics of the EEG signal. We have tested the proposed method on 162 BCI datasets. By parameterizing the whole regression coefficients directly, we have obtained classification accuracy comparable with CSP based classifiers. 
By parameterizing the regression coefficients as the difference of two rank-one matrices, an improvement over CSP based classifiers was obtained. We have shown that in the rank=2 parameterization of the logistic regression function, the optimal filter coefficients have an interpretation as a solution to a generalized eigenvalue problem similarly to CSP. However, the difference is that in the case of logistic regression every sample is weighted according to its importance to the overall classification problem, whereas in CSP all the samples have uniform importance.

The proposed framework provides a basis for various future directions. For example, incorporating more than two filters will connect the two parameterizations of the regression function shown in this paper, and it may allow us to investigate how many filters are sufficient for good classification. Since the classifier output is the logit transform of the class probability, it is straightforward to generalize the method to multi-class problems. Also non-stationarities, e.g. caused by a covariate shift (see [16, 17]) in the density P(X) from one session to another, could be corrected by adapting the likelihood model.

Acknowledgments: This research was partially supported by MEXT, Grant-in-Aid for JSPS fellows, 17-11866 and Grant-in-Aid for Scientific Research on Priority Areas, 17022012, by BMBF-grant FKZ 01IBE01A, and by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References
[1] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, \"Brain-computer interfaces for communication and control\", Clin. Neurophysiol., 113: 767-791, 2002.
[2] N. Birbaumer, N. Ghanayim, T. Hinterberger, I. Iversen, B. Kotchoubey, A. Kübler, J. Perelmouter, E. Taub, and H. Flor, \"A spelling device for the paralysed\", Nature, 398: 297-298, 1999.
[3] G. Pfurtscheller, C. Neuper, C. 
Guger, W. Harkam, R. Ramoser, A. Schlögl, B. Obermaier, and M. Pregenzer, \"Current Trends in Graz Brain-Computer Interface (BCI)\", IEEE Trans. Rehab. Eng., 8(2): 216-219, 2000.
[4] B. Blankertz, G. Curio, and K.-R. Müller, \"Classifying Single Trial EEG: Towards Brain Computer Interfacing\", in: T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Inf. Proc. Systems (NIPS 01), vol. 14, 157-164, 2002.
[5] B. Blankertz, G. Dornhege, C. Schäfer, R. Krepki, J. Kohlmorgen, K.-R. Müller, V. Kunzmann, F. Losch, and G. Curio, \"Boosting Bit Rates and Error Detection for the Classification of Fast-Paced Motor Commands Based on Single-Trial EEG Analysis\", IEEE Trans. Neural Sys. Rehab. Eng., 11(2): 127-131, 2003.
[6] B. Blankertz, G. Dornhege, M. Krauledat, K.-R. Müller, V. Kunzmann, F. Losch, and G. Curio, \"The Berlin Brain-Computer Interface: EEG-based communication without subject training\", IEEE Trans. Neural Sys. Rehab. Eng., 14(2): 147-152, 2006.
[7] G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K.-R. Müller, eds., Towards Brain-Computer Interfacing, MIT Press, 2006, in press.
[8] T. N. Lal, M. Schröder, T. Hinterberger, J. Weston, M. Bogdan, N. Birbaumer, and B. Schölkopf, \"Support Vector Channel Selection in BCI\", IEEE Transactions on Biomedical Engineering, 51(6): 1003-1010, 2004.
[9] Z. J. Koles, \"The quantitative extraction and topographic mapping of the abnormal components in the clinical EEG\", Electroencephalogr. Clin. Neurophysiol., 79: 440-447, 1991.
[10] G. Pfurtscheller and F. H. L. da Silva, \"Event-related EEG/MEG synchronization and desynchronization: basic principles\", Clin. Neurophysiol., 110(11): 1842-1857, 1999.
[11] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller, \"Optimal spatial filtering of single trial EEG during imagined hand movement\", IEEE Trans. Rehab. Eng., 8(4): 441-446, 2000.
[12] N. J. Hill, J. Farquhar, T. N. Lal, and B. 
Schölkopf, \"Time-dependent demixing of task-relevant EEG sources\", in: Proceedings of the 3rd International Brain-Computer Interface Workshop and Training Course 2006, Verlag der Technischen Universität Graz, 2006.
[13] B. Efron, \"The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis\", J. Am. Stat. Assoc., 70(352): 892-898, 1975.
[14] T. Minka, \"Discriminative models, not discriminative training\", Tech. Rep. TR-2005-144, Microsoft Research Cambridge, 2005.
[15] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
[16] H. Shimodaira, \"Improving predictive inference under covariate shift by weighting the log-likelihood function\", Journal of Statistical Planning and Inference, 90: 227-244, 2000.
[17] M. Sugiyama and K.-R. Müller, \"Input-Dependent Estimation of Generalization Error under Covariate Shift\", Statistics and Decisions, 23(4): 249-279, 2005.

[Figure 2: four panels of scalp maps. (a) Subject A, CSP filter coefficients for \"left hand\" and \"right hand\"; (b) Subject B, CSP filter coefficients for \"left hand\" and \"foot\"; (c) Subject A, logistic regression (rank=2) filter coefficients; (d) Subject B, logistic regression (rank=2) filter coefficients. Bracketed values shown in the panels: 2.40, 0.33, 7.11, 2.04, 0.41, 4.74, 1.88, 0.59, 3.19, 0.70, 0.67, 0.61.]

Figure 2: Examples of spatial filter coefficients obtained by CSP and the rank=2 parameterized logistic regression. (a) Subject A. Some CSP filters are corrupted by artifacts. (b) Subject B. Some CSP filters are corrupted by strong occipital α-activity. (c) Subject A. Logistic regression coefficients are focusing on the physiologically expected \"left hand\" and \"right hand\" areas. (d) Subject B. Logistic regression coefficients are focusing on the \"left hand\" and \"foot\" areas. Electrode positions are marked with crosses in every plot. For CSP filters, the generalized eigenvalues (Eq. 
(2)) are shown inside brackets.
", "award": [], "sourceid": 3134, "authors": [{"given_name": "Ryota", "family_name": "Tomioka", "institution": null}, {"given_name": "Kazuyuki", "family_name": "Aihara", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}]}