{"title": "Robust Fisher Discriminant Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 659, "page_last": 666, "abstract": null, "full_text": "Robust Fisher Discriminant Analysis\n\nSeung-Jean Kim\n\nAlessandro Magnani\n\nStephen P. Boyd\n\nInformation Systems Laboratory Electrical Engineering Department, Stanford University Stanford, CA 94305-9510 sjkim@stanford.edu alem@stanford.edu boyd@stanford.edu\n\nAbstract\nFisher linear discriminant analysis (LDA) can be sensitive to the problem data. Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. The main contribution of this paper is show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. The method is demonstrated with some numerical examples. Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., robust Fisher LDA in a high dimensional feature space.\n\n1\n\nIntroduction\n\nFisher linear discriminant analysis (LDA), a widely-used technique for pattern classification, finds a linear discriminant that yields optimal discrimination between two classes which can be identified with two random variables, say X and Y in Rn . For a (linear) discriminant characterized by w Rn , the degree of discrimination is measured by the Fisher discriminant ratio f (w, x , y , x , y ) = wT (x - y )(x - y )T w (wT (x - y ))2 =T , T ( + )w w w (x + y )w x y\n\nwhere x and x (y and y ) denote the mean and covariance of X (Y). 
A discriminant that maximizes the Fisher discriminant ratio is given by $w_{\mathrm{nom}} = (\Sigma_x + \Sigma_y)^{-1}(\mu_x - \mu_y)$, which attains the maximum Fisher discriminant ratio\n\n\[ (\mu_x - \mu_y)^T(\Sigma_x + \Sigma_y)^{-1}(\mu_x - \mu_y) = \max_{w \neq 0} f(w, \mu_x, \mu_y, \Sigma_x, \Sigma_y). \]\n\nIn applications, the problem data $\mu_x$, $\mu_y$, $\Sigma_x$, and $\Sigma_y$ are not known but are estimated from sample data. Fisher LDA can be sensitive to the problem data: the discriminant $w_{\mathrm{nom}}$ computed from an estimate of the parameters $\mu_x$, $\mu_y$, $\Sigma_x$, and $\Sigma_y$ can give very poor discrimination for another set of problem data that is also a reasonable estimate of the parameters. In this paper, we attempt to systematically alleviate this sensitivity problem by explicitly incorporating a model of data uncertainty in the classification problem and optimizing for the worst-case scenario under this model. We assume that the problem data $\mu_x$, $\mu_y$, $\Sigma_x$, and $\Sigma_y$ are uncertain, but known to belong to a convex compact subset $\mathcal{U}$ of $\mathbf{R}^n \times \mathbf{R}^n \times \mathbf{S}^n_{++} \times \mathbf{S}^n_{++}$. Here we use $\mathbf{S}^n_{++}$ ($\mathbf{S}^n_{+}$) to denote the set of all $n \times n$ symmetric positive definite (semidefinite) matrices. We make one technical assumption: for each $(\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in \mathcal{U}$, we have $\mu_x \neq \mu_y$. This assumption simply means that for each possible value of the means and covariances, the classes are distinguishable via Fisher LDA. The worst-case analysis problem of finding the worst-case means and covariances for a given discriminant $w$ can be written as\n\n\[ \begin{array}{ll} \mbox{minimize} & f(w, \mu_x, \mu_y, \Sigma_x, \Sigma_y) \\ \mbox{subject to} & (\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in \mathcal{U}, \end{array} \qquad (1) \]\n\nwith variables $\mu_x$, $\mu_y$, $\Sigma_x$, and $\Sigma_y$. The optimal value of this problem is the worst-case Fisher discriminant ratio (over the class $\mathcal{U}$ of possible means and covariances), and any optimal points for this problem are called worst-case means and covariances. These depend on $w$. We will show in Section 2 that (1) is a convex optimization problem, since the Fisher discriminant ratio is a convex function of $\mu_x$, $\mu_y$, $\Sigma_x$, $\Sigma_y$ for a given discriminant $w$. 
As a result, it is computationally tractable to find the worst-case performance of a discriminant $w$ over the set of possible means and covariances. The robust Fisher LDA problem is to find a discriminant that maximizes the worst-case Fisher discriminant ratio. This can be cast as the optimization problem\n\n\[ \begin{array}{ll} \mbox{maximize} & \min_{(\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in \mathcal{U}} f(w, \mu_x, \mu_y, \Sigma_x, \Sigma_y) \\ \mbox{subject to} & w \neq 0, \end{array} \qquad (2) \]\n\nwith variable $w$. We denote any optimal $w$ for this problem as $w^\star$. Here we choose a linear discriminant that maximizes the Fisher discriminant ratio, with the worst possible means and covariances that are consistent with our data uncertainty model. The main result of this paper is an effective method for solving the robust Fisher LDA problem (2). We will show in Section 2 that the robust optimal Fisher discriminant $w^\star$ can be found as follows. First, we solve the (convex) optimization problem\n\n\[ \begin{array}{ll} \mbox{minimize} & (\mu_x - \mu_y)^T(\Sigma_x + \Sigma_y)^{-1}(\mu_x - \mu_y) \; \big( = \max_{w \neq 0} f(w, \mu_x, \mu_y, \Sigma_x, \Sigma_y) \big) \\ \mbox{subject to} & (\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in \mathcal{U}, \end{array} \qquad (3) \]\n\nwith variables $(\mu_x, \mu_y, \Sigma_x, \Sigma_y)$. Let $(\mu_x^\star, \mu_y^\star, \Sigma_x^\star, \Sigma_y^\star)$ denote any optimal point. Then the discriminant\n\n\[ w^\star = (\Sigma_x^\star + \Sigma_y^\star)^{-1}(\mu_x^\star - \mu_y^\star) \qquad (4) \]\n\nis a robust optimal Fisher discriminant, i.e., it is optimal for (2). Moreover, we will see that $\mu_x^\star$, $\mu_y^\star$ and $\Sigma_x^\star$, $\Sigma_y^\star$ are worst-case means and covariances for the robust optimal Fisher discriminant $w^\star$. Since convex optimization problems are tractable, this means that we have a tractable general method for computing a robust optimal Fisher discriminant. A robust Fisher discriminant problem of modest size can be solved by standard convex optimization methods, e.g., interior-point methods [3]. For some special forms of the uncertainty model, the robust optimal Fisher discriminant can be computed more efficiently than by a general convex optimization formulation. 
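The two-step recipe above (solve problem (3) over the uncertainty set, then form the discriminant (4)) can be sketched numerically. The following is our own simplified illustration, not the paper's experimental setup: we fix the covariances and let the means range over a small box, a particularly simple convex uncertainty set, and minimize the objective of (3) with SciPy.

```python
import numpy as np
from scipy.optimize import minimize

# Fixed covariance sum B = Sigma_x + Sigma_y; means uncertain in a box
# (a simplified convex uncertainty set of our own choosing).
Sigma = np.array([[1.5, 0.1], [0.1, 1.0]])
mu_x0, mu_y0 = np.array([1.0, 0.5]), np.array([-1.0, -0.5])
delta = 0.3                                   # box half-width on each mean entry

Sinv = np.linalg.inv(Sigma)

def objective(z):
    # z stacks (mu_x, mu_y); the objective of problem (3):
    # (mu_x - mu_y)^T (Sigma_x + Sigma_y)^{-1} (mu_x - mu_y)
    mu_x, mu_y = z[:2], z[2:]
    a = mu_x - mu_y
    return a @ Sinv @ a

z0 = np.concatenate([mu_x0, mu_y0])
bounds = [(m - delta, m + delta) for m in z0]
res = minimize(objective, z0, bounds=bounds)  # convex quadratic over a box

mu_x_star, mu_y_star = res.x[:2], res.x[2:]
w_star = Sinv @ (mu_x_star - mu_y_star)       # discriminant, as in equation (4)
```

The minimizer pulls the two means toward each other within their boxes, i.e., it finds the least favorable (worst-case) means, and the robust discriminant is then the nominal-form discriminant evaluated at those worst-case parameters.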
In Section 3, we consider an important special form of $\mathcal{U}$ for which a more efficient formulation can be given.\n\nIn comparison with the `nominal' Fisher LDA, which is based on the means and covariances estimated from the sample data set without considering the estimation error, robust Fisher LDA performs well even when the sample size used to estimate the means and covariances is small, so that the estimates are not accurate. This will be demonstrated with some numerical examples in Section 4. Recently, there has been growing interest in kernel Fisher discriminant analysis, i.e., Fisher LDA in a higher-dimensional feature space, e.g., [7]. Our results can be extended to robust kernel Fisher discriminant analysis under certain uncertainty models. This will be briefly discussed in Section 5. Various types of robust classification problems have been considered in the prior literature, e.g., [2, 5, 6]. Most of this research has focused on formulating robust classification problems that can be efficiently solved via convex optimization. In particular, the robust classification method developed in [6] is based on the criterion\n\n\[ g(w, \mu_x, \mu_y, \Sigma_x, \Sigma_y) = \frac{|w^T(\mu_x - \mu_y)|}{(w^T \Sigma_x w)^{1/2} + (w^T \Sigma_y w)^{1/2}}, \]\n\nwhich is similar to the Fisher discriminant ratio $f$. With a specific uncertainty model on the means and covariances, the robust classification problem with discrimination criterion $g$ can be cast as a second-order cone program, a special type of convex optimization problem [5]. With general uncertainty models, however, it is not clear whether robust discriminant analysis with $g$ can be performed via convex optimization.\n\n2 Robust Fisher LDA\n\nWe first consider the worst-case analysis problem (1). Here we consider the discriminant $w$ as fixed, and the parameters $\mu_x$, $\mu_y$, $\Sigma_x$, and $\Sigma_y$ are variables, constrained to lie in the convex uncertainty set $\mathcal{U}$. 
To show that (1) is a convex optimization problem, we must show that the Fisher discriminant ratio is a convex function of $\mu_x$, $\mu_y$, $\Sigma_x$, and $\Sigma_y$. To this end, we express the Fisher discriminant ratio $f$ as the composition $f(w, \mu_x, \mu_y, \Sigma_x, \Sigma_y) = g(H(\mu_x, \mu_y, \Sigma_x, \Sigma_y))$, where $g(u, t) = u^2/t$ and $H$ is the function $H(\mu_x, \mu_y, \Sigma_x, \Sigma_y) = (w^T(\mu_x - \mu_y), \; w^T(\Sigma_x + \Sigma_y)w)$. The function $H$ is linear (as a mapping from $\mu_x$, $\mu_y$, $\Sigma_x$, and $\Sigma_y$ into $\mathbf{R}^2$), and the function $g$ is convex (provided $t > 0$, which holds here). Thus, the composition $f$ is a convex function of $\mu_x$, $\mu_y$, $\Sigma_x$, and $\Sigma_y$. (See [3].) Now we turn to the main result of this paper. Consider a function of the form\n\n\[ R(w, a, B) = \frac{(w^T a)^2}{w^T B w}, \qquad (5) \]\n\nwhich is the Rayleigh quotient for the matrix pair $aa^T \in \mathbf{S}^n_{+}$ and $B \in \mathbf{S}^n_{++}$, evaluated at $w$. The robust Fisher LDA problem (2) is equivalent to a problem of the form\n\n\[ \begin{array}{ll} \mbox{maximize} & \min_{(a, B) \in \mathcal{V}} R(w, a, B) \\ \mbox{subject to} & w \neq 0, \end{array} \qquad (6) \]\n\nwhere $a = \mu_x - \mu_y$, $B = \Sigma_x + \Sigma_y$, and\n\n\[ \mathcal{V} = \{ (\mu_x - \mu_y, \; \Sigma_x + \Sigma_y) \mid (\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in \mathcal{U} \}. \qquad (7) \]\n\n(This equivalence means that robust Fisher LDA is a special type of robust matched filtering problem studied in the 1980s; see, e.g., [8] for more on robust matched filtering.) We will prove a `nonconventional' minimax theorem for a Rayleigh quotient of the form (5), which will establish the main result described in Section 1. To do this, we consider a problem of the form\n\n\[ \begin{array}{ll} \mbox{minimize} & a^T B^{-1} a \\ \mbox{subject to} & (a, B) \in \mathcal{V}, \end{array} \qquad (8) \]\n\nwith variables $a \in \mathbf{R}^n$ and $B \in \mathbf{S}^n_{++}$, where $\mathcal{V}$ is a convex compact subset of $\mathbf{R}^n \times \mathbf{S}^n_{++}$ such that for each $(a, B) \in \mathcal{V}$, $a$ is nonzero. The objective of this problem is a matrix fractional function and so is convex on $\mathbf{R}^n \times \mathbf{S}^n_{++}$; see [3, Section 3.1.7]. Our problem (3) is the same as (8), with (7). It follows that (3) is a convex optimization problem. The following theorem states the minimax theorem for the function $R$. While $R$ is convex in $(a, B)$ for fixed $w$, it is not concave in $w$ for fixed $(a, B)$, so conventional convex-concave minimax theorems do not apply here. Theorem 1. 
Let $(a^\star, B^\star)$ be an optimal solution of the problem (8), and let $w^\star = B^{\star -1} a^\star$. Then $(w^\star, a^\star, B^\star)$ satisfies the minimax property\n\n\[ R(w^\star, a^\star, B^\star) = \max_{w \neq 0} \min_{(a, B) \in \mathcal{V}} R(w, a, B) = \min_{(a, B) \in \mathcal{V}} \max_{w \neq 0} R(w, a, B), \qquad (9) \]\n\nand the saddle point property\n\n\[ R(w, a^\star, B^\star) \leq R(w^\star, a^\star, B^\star) \leq R(w^\star, a, B), \quad \forall w \in \mathbf{R}^n \setminus \{0\}, \; \forall (a, B) \in \mathcal{V}. \qquad (10) \]\n\nProof. It suffices to prove (10), since the saddle point property (10) implies the minimax property (9) [1, Section 2.6]. We start by observing that $R(w, a^\star, B^\star)$ is maximized over nonzero $w$ by $w^\star = B^{\star -1} a^\star$ (by the Cauchy-Schwarz inequality). What remains is to show that\n\n\[ \min_{(a, B) \in \mathcal{V}} R(w^\star, a, B) = R(w^\star, a^\star, B^\star). \qquad (11) \]\n\nSince $a^\star$ and $B^\star$ are optimal for the convex problem (8) (by definition), they must satisfy the optimality condition\n\n\[ \big\langle \nabla_a (a^T B^{-1} a)\big|_{(a^\star, B^\star)}, \; a - a^\star \big\rangle + \big\langle \nabla_B (a^T B^{-1} a)\big|_{(a^\star, B^\star)}, \; B - B^\star \big\rangle \geq 0, \quad \forall (a, B) \in \mathcal{V} \]\n\n(see [3, Section 4.2.3]). Using $\nabla_a (a^T B^{-1} a) = 2 B^{-1} a$, $\nabla_B (a^T B^{-1} a) = -B^{-1} a a^T B^{-1}$, and $\langle X, Y \rangle = \mathbf{Tr}(XY)$ for $X, Y \in \mathbf{S}^n$, where $\mathbf{Tr}$ denotes trace, we can express the optimality condition as\n\n\[ 2 a^{\star T} B^{\star -1}(a - a^\star) - \mathbf{Tr}\big( B^{\star -1} a^\star a^{\star T} B^{\star -1} (B - B^\star) \big) \geq 0, \quad \forall (a, B) \in \mathcal{V}, \]\n\nor equivalently,\n\n\[ 2 w^{\star T}(a - a^\star) - w^{\star T}(B - B^\star) w^\star \geq 0, \quad \forall (a, B) \in \mathcal{V}. \qquad (12) \]\n\nNow we turn to the convex optimization problem\n\n\[ \begin{array}{ll} \mbox{minimize} & R(w^\star, a, B) \\ \mbox{subject to} & (a, B) \in \mathcal{V}, \end{array} \qquad (13) \]\n\nwith variables $(a, B)$. We will show that $(a^\star, B^\star)$ is optimal for this problem, which will establish (11). A pair $(\bar{a}, \bar{B})$ is optimal for (13) if and only if\n\n\[ \Big\langle \nabla_a \frac{(w^{\star T} a)^2}{w^{\star T} B w^\star}\Big|_{(\bar{a}, \bar{B})}, \; a - \bar{a} \Big\rangle + \Big\langle \nabla_B \frac{(w^{\star T} a)^2}{w^{\star T} B w^\star}\Big|_{(\bar{a}, \bar{B})}, \; B - \bar{B} \Big\rangle \geq 0, \quad \forall (a, B) \in \mathcal{V}. \]\n\nUsing\n\n\[ \nabla_a \frac{(w^{\star T} a)^2}{w^{\star T} B w^\star} = 2 \frac{a^T w^\star}{w^{\star T} B w^\star} \, w^\star, \qquad \nabla_B \frac{(w^{\star T} a)^2}{w^{\star T} B w^\star} = -\frac{(a^T w^\star)^2}{(w^{\star T} B w^\star)^2} \, w^\star w^{\star T}, \]\n\nthe optimality condition can be written as\n\n\[ 2 \frac{\bar{a}^T w^\star}{w^{\star T} \bar{B} w^\star} \, w^{\star T}(a - \bar{a}) - \frac{(\bar{a}^T w^\star)^2}{(w^{\star T} \bar{B} w^\star)^2} \, w^{\star T}(B - \bar{B}) w^\star \geq 0, \quad \forall (a, B) \in \mathcal{V}. \]\n\nSubstituting $\bar{a} = a^\star$, $\bar{B} = B^\star$, and noting that $a^{\star T} w^\star / w^{\star T} B^\star w^\star = 1$, the optimality condition reduces to\n\n\[ 2 w^{\star T}(a - a^\star) - w^{\star T}(B - B^\star) w^\star \geq 0, \quad \forall (a, B) \in \mathcal{V}, \]\n\nwhich is precisely (12). 
Thus, we have shown that $(a^\star, B^\star)$ is optimal for (13), which in turn establishes (11).\n\n3 Robust Fisher LDA with product form uncertainty models\n\nIn this section, we focus on robust Fisher LDA with the product form uncertainty model\n\n\[ \mathcal{U} = \mathcal{M} \times \mathcal{S}, \qquad (14) \]\n\nwhere $\mathcal{M}$ is the set of possible means and $\mathcal{S}$ is the set of possible covariances. For this model, the worst-case Fisher discriminant ratio can be written as\n\n\[ \min_{(\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in \mathcal{U}} f(w, \mu_x, \mu_y, \Sigma_x, \Sigma_y) = \min_{(\mu_x, \mu_y) \in \mathcal{M}} \frac{(w^T(\mu_x - \mu_y))^2}{\max_{(\Sigma_x, \Sigma_y) \in \mathcal{S}} w^T(\Sigma_x + \Sigma_y)w}. \]\n\nIf we can find an analytic expression for $\max_{(\Sigma_x, \Sigma_y) \in \mathcal{S}} w^T(\Sigma_x + \Sigma_y)w$ (as a function of $w$), we can simplify the robust Fisher LDA problem. As a more specific example, we consider the case in which $\mathcal{S}$ is given by\n\n\[ \mathcal{S} = \mathcal{S}_x \times \mathcal{S}_y, \quad \mathcal{S}_x = \{ \Sigma_x \mid \Sigma_x \succeq 0, \; \| \Sigma_x - \bar{\Sigma}_x \|_F \leq \rho_x \}, \quad \mathcal{S}_y = \{ \Sigma_y \mid \Sigma_y \succeq 0, \; \| \Sigma_y - \bar{\Sigma}_y \|_F \leq \rho_y \}, \qquad (15) \]\n\nwhere $\rho_x$, $\rho_y$ are positive constants, $\bar{\Sigma}_x, \bar{\Sigma}_y \in \mathbf{S}^n_{+}$, and $\| A \|_F$ denotes the Frobenius norm of $A$, i.e., $\| A \|_F = ( \sum_{i,j=1}^n A_{ij}^2 )^{1/2}$. For this case, we have\n\n\[ \max_{(\Sigma_x, \Sigma_y) \in \mathcal{S}} w^T(\Sigma_x + \Sigma_y)w = w^T(\bar{\Sigma}_x + \bar{\Sigma}_y + (\rho_x + \rho_y)I)w. \qquad (16) \]\n\nHere we have used the fact that for given $\bar{\Sigma} \in \mathbf{S}^n_{+}$, $\max_{\Sigma \succeq 0, \, \|\Sigma - \bar{\Sigma}\|_F \leq \rho} x^T \Sigma x = x^T(\bar{\Sigma} + \rho I)x$ (see, e.g., [6]). The worst-case Fisher discriminant ratio can therefore be expressed as\n\n\[ \min_{(\mu_x, \mu_y) \in \mathcal{M}} \frac{(w^T(\mu_x - \mu_y))^2}{w^T(\bar{\Sigma}_x + \bar{\Sigma}_y + (\rho_x + \rho_y)I)w}. \]\n\nThis is the same worst-case Fisher discriminant ratio obtained for a problem in which the covariances are certain, i.e., fixed to be $\bar{\Sigma}_x + \rho_x I$ and $\bar{\Sigma}_y + \rho_y I$, and the means lie in the set $\mathcal{M}$. We conclude that a robust optimal Fisher discriminant with the uncertainty model (14), in which $\mathcal{S}$ has the form (15), can be found by solving a robust Fisher LDA problem with these fixed values for the covariances. 
From the general solution method described in Section 1, it is given by\n\n\[ w^\star = \big( \bar{\Sigma}_x + \bar{\Sigma}_y + (\rho_x + \rho_y)I \big)^{-1} (\mu_x^\star - \mu_y^\star), \]\n\nwhere $\mu_x^\star$ and $\mu_y^\star$ solve the convex optimization problem\n\n\[ \begin{array}{ll} \mbox{minimize} & (\mu_x - \mu_y)^T \big( \bar{\Sigma}_x + \bar{\Sigma}_y + (\rho_x + \rho_y)I \big)^{-1} (\mu_x - \mu_y) \\ \mbox{subject to} & (\mu_x, \mu_y) \in \mathcal{M}, \end{array} \qquad (17) \]\n\nwith variables $\mu_x$ and $\mu_y$. The problem (17) is relatively simple: it involves minimizing a convex quadratic function over the set of possible $\mu_x$ and $\mu_y$. For example, if $\mathcal{M}$ is a product of two ellipsoids (e.g., $\mu_x$ and $\mu_y$ each lie in some confidence ellipsoid), the problem (17) is to minimize a convex quadratic subject to two convex quadratic constraints. Such a problem is readily solved in $O(n^3)$ flops, since the dual problem has two variables, and evaluating the dual function and its derivatives can be done in $O(n^3)$ flops [3]. Thus, the effort required to solve the robust Fisher LDA problem is of the same order (i.e., $n^3$) as solving the nominal Fisher LDA problem (but with a substantially larger constant).\n\n4 Numerical results\n\nTo demonstrate robust Fisher LDA, we use the sonar and ionosphere benchmark problems from the UCI repository (www.ics.uci.edu/mlearn/MLRepository.html). The two benchmark problems have 208 and 351 points, respectively, and the dimension of each data point is 60 and 34, respectively. Each data set is randomly partitioned into a training set and a test set. We use the training set to compute the optimal discriminant and then test its performance using the test set. A larger training set typically gives better test performance. We let $\alpha$ denote the size of the training set, as a fraction of the total number of data points. For example, $\alpha = 0.3$ means that 30% of the data points are used for training, and 70% are used to test the resulting discriminant. For various values of $\alpha$, we generate 100 random partitions of the data (for each of the two benchmark problems), and collect the results. 
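The product form solution above (fixed, ridge-regularized covariances) can be sketched in a few lines. This is our own minimal illustration with toy data: we take $\mathcal{M}$ to be a singleton, i.e., we treat the nominal means as exact, so that only the covariance uncertainty enters.

```python
import numpy as np

def robust_fisher_discriminant(mu_x, mu_y, Sigma_x, Sigma_y, rho_x, rho_y):
    """Robust discriminant for Frobenius-ball covariance uncertainty:
    the worst-case covariances are Sigma + rho*I, so the discriminant is
    w = (Sigma_x + Sigma_y + (rho_x + rho_y) I)^{-1} (mu_x - mu_y)."""
    n = len(mu_x)
    B = Sigma_x + Sigma_y + (rho_x + rho_y) * np.eye(n)
    return np.linalg.solve(B, mu_x - mu_y)

# Toy data (illustrative values only).
mu_x = np.array([1.0, 0.5])
mu_y = np.array([-0.5, 0.0])
Sigma_x = np.array([[1.0, 0.3], [0.3, 0.7]])
Sigma_y = np.array([[0.9, -0.1], [-0.1, 0.4]])

w_robust = robust_fisher_discriminant(mu_x, mu_y, Sigma_x, Sigma_y, 0.5, 0.5)
w_nominal = robust_fisher_discriminant(mu_x, mu_y, Sigma_x, Sigma_y, 0.0, 0.0)
```

With $\rho_x = \rho_y = 0$ this recovers the nominal discriminant; larger radii shrink the discriminant toward a ridge-regularized direction, which is exactly the regularization effect of the Frobenius-ball covariance model.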
We use the following uncertainty models for the means $\mu_x$, $\mu_y$ and the covariances $\Sigma_x$, $\Sigma_y$:\n\n\[ (\mu_x - \bar{\mu}_x)^T P_x (\mu_x - \bar{\mu}_x) \leq 1, \quad (\mu_y - \bar{\mu}_y)^T P_y (\mu_y - \bar{\mu}_y) \leq 1, \quad \| \Sigma_x - \bar{\Sigma}_x \|_F \leq \rho_x, \quad \| \Sigma_y - \bar{\Sigma}_y \|_F \leq \rho_y. \]\n\nHere the vectors $\bar{\mu}_x$, $\bar{\mu}_y$ represent the nominal means, the matrices $\bar{\Sigma}_x$, $\bar{\Sigma}_y$ represent the nominal covariances, and the matrices $P_x$, $P_y$ and the constants $\rho_x$ and $\rho_y$ determine the confidence regions. The parameters are estimated through a resampling technique [4] as follows. For a given training set we create 100 new sets by resampling the original training set with a uniform distribution over all the data points. For each of these sets we estimate its mean and covariance, and then take their average values as the nominal mean and covariance. We also evaluate the covariance of all the means obtained with the resampling; denoting these covariances by $\Lambda_x$ and $\Lambda_y$, we take $P_x = \Lambda_x^{-1}/n$ and $P_y = \Lambda_y^{-1}/n$. This choice corresponds to a 50% confidence ellipsoid in the case of a Gaussian distribution. The parameters $\rho_x$ and $\rho_y$ are taken to be the maximum deviations, in the Frobenius norm sense, between the covariances and the average covariances over the resampling of the training set.\n\nFigure 1: Test-set accuracy (TSA) for the sonar and ionosphere benchmarks versus the size $\alpha$ of the training set. The solid line represents the robust Fisher LDA results and the dotted line the nominal Fisher LDA results. The vertical bars represent the standard deviation.\n\nFigure 1 summarizes the classification results. For each of our two problems, and for each value of $\alpha$, we show the average test-set accuracy (TSA), as well as the standard deviation (over the 100 instances of each problem with the given value of $\alpha$). 
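The resampling estimation of the uncertainty model can be sketched as follows. This is our own minimal implementation; in particular, `Lambda` stands for the covariance of the resampled means and the division by the dimension `n` is our reading of the 50% confidence-ellipsoid choice described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_uncertainty(data, n_resamples=100):
    """Bootstrap estimates of the nominal mean/covariance, the ellipsoid
    matrix P (from the covariance of the resampled means), and the
    Frobenius radius rho (max covariance deviation over resamples)."""
    N, n = data.shape
    means, covs = [], []
    for _ in range(n_resamples):
        sample = data[rng.integers(0, N, size=N)]   # resample with replacement
        means.append(sample.mean(axis=0))
        covs.append(np.cov(sample, rowvar=False))
    mu_bar = np.mean(means, axis=0)                 # nominal mean
    Sigma_bar = np.mean(covs, axis=0)               # nominal covariance
    Lambda = np.cov(np.array(means), rowvar=False)  # covariance of resampled means
    P = np.linalg.inv(Lambda) / n                   # ~50% confidence ellipsoid (Gaussian)
    rho = max(np.linalg.norm(C - Sigma_bar) for C in covs)  # Frobenius radius
    return mu_bar, Sigma_bar, P, rho

data = rng.normal(size=(200, 3))                    # stand-in for one class's training data
mu_bar, Sigma_bar, P, rho = bootstrap_uncertainty(data)
```

Running this once per class yields all the quantities $(\bar{\mu}, \bar{\Sigma}, P, \rho)$ needed to instantiate the uncertainty model above.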
The plots show that robust Fisher LDA performs substantially better than nominal Fisher LDA for small training sets, and that this performance gap disappears as the training set becomes larger.\n\n5 Robust kernel Fisher discriminant analysis\n\nIn this section we show how to `kernelize' robust Fisher LDA. We will consider only a specific class of uncertainty models; the arguments we develop here can be extended to more general cases. In the kernel approach we map the problem to a higher-dimensional space $\mathbf{R}^f$ via a mapping $\phi : \mathbf{R}^n \to \mathbf{R}^f$, so that the new decision boundary is more general and possibly nonlinear. Let the data be mapped as $x \mapsto \phi(x) \sim (\mu_{\phi(x)}, \Sigma_{\phi(x)})$, $y \mapsto \phi(y) \sim (\mu_{\phi(y)}, \Sigma_{\phi(y)})$. The uncertainty model we consider has the form\n\n\[ \mu_{\phi(x)} - \mu_{\phi(y)} = \bar{\mu}_{\phi(x)} - \bar{\mu}_{\phi(y)} + P u_f, \quad \| u_f \| \leq 1, \quad \| \Sigma_{\phi(x)} - \bar{\Sigma}_{\phi(x)} \|_F \leq \rho_x, \quad \| \Sigma_{\phi(y)} - \bar{\Sigma}_{\phi(y)} \|_F \leq \rho_y. \qquad (18) \]\n\nHere the vectors $\bar{\mu}_{\phi(x)}$, $\bar{\mu}_{\phi(y)}$ represent the nominal means, the matrices $\bar{\Sigma}_{\phi(x)}$, $\bar{\Sigma}_{\phi(y)}$ represent the nominal covariances, and the (positive semidefinite) matrix $P$ and the constants $\rho_x$ and $\rho_y$ determine the confidence regions in the feature space. The worst-case Fisher discriminant ratio in the feature space is then given by\n\n\[ \min_{\| u_f \| \leq 1, \; \| \Sigma_{\phi(x)} - \bar{\Sigma}_{\phi(x)} \|_F \leq \rho_x, \; \| \Sigma_{\phi(y)} - \bar{\Sigma}_{\phi(y)} \|_F \leq \rho_y} \; \frac{\big( w_f^T (\bar{\mu}_{\phi(x)} - \bar{\mu}_{\phi(y)} + P u_f) \big)^2}{w_f^T (\Sigma_{\phi(x)} + \Sigma_{\phi(y)}) w_f}. \]\n\nThe robust kernel Fisher discriminant analysis problem is to find the discriminant in the feature space that maximizes this ratio. Using the technique described in Section 3, we can see that the robust kernel Fisher discriminant analysis problem can be cast as\n\n\[ \begin{array}{ll} \mbox{maximize} & \min_{\| u_f \| \leq 1} \frac{\big( w_f^T (\bar{\mu}_{\phi(x)} - \bar{\mu}_{\phi(y)} + P u_f) \big)^2}{w_f^T (\bar{\Sigma}_{\phi(x)} + \bar{\Sigma}_{\phi(y)} + (\rho_x + \rho_y) I) w_f} \\ \mbox{subject to} & w_f \neq 0, \end{array} \qquad (19) \]\n\nwhere the discriminant $w_f \in \mathbf{R}^f$ is defined in the new feature space. To apply the kernel trick to the problem (19), the nonlinear decision boundary should be expressed entirely in terms of inner products of the mapped data. The following proposition gives a set of conditions under which this can be done.\n\nProposition 1. 
Given the sample points $\{x_i\}_{i=1}^{N_x}$ and $\{y_i\}_{i=1}^{N_y}$, suppose that $\bar{\mu}_{\phi(x)}$, $\bar{\mu}_{\phi(y)}$, $\bar{\Sigma}_{\phi(x)}$, $\bar{\Sigma}_{\phi(y)}$, and $P$ can be written as\n\n\[ \begin{array}{l} \bar{\mu}_{\phi(x)} = \sum_{i=1}^{N_x} \lambda_i \phi(x_i), \qquad \bar{\mu}_{\phi(y)} = \sum_{i=1}^{N_y} \lambda_{i+N_x} \phi(y_i), \qquad P = U \Lambda U^T, \\ \bar{\Sigma}_{\phi(x)} = \sum_{i=1}^{N_x} \Gamma_{i,i} \, (\phi(x_i) - \bar{\mu}_{\phi(x)})(\phi(x_i) - \bar{\mu}_{\phi(x)})^T, \\ \bar{\Sigma}_{\phi(y)} = \sum_{i=1}^{N_y} \Gamma_{i+N_x, i+N_x} \, (\phi(y_i) - \bar{\mu}_{\phi(y)})(\phi(y_i) - \bar{\mu}_{\phi(y)})^T, \end{array} \]\n\nwhere $\lambda \in \mathbf{R}^{N_x+N_y}$, $\Lambda \in \mathbf{S}^{N_x+N_y}_{+}$, $\Gamma \in \mathbf{S}^{N_x+N_y}_{+}$ is a diagonal matrix, and $U$ is the matrix whose columns are the vectors $\{\phi(x_i) - \bar{\mu}_{\phi(x)}\}_{i=1}^{N_x}$ and $\{\phi(y_i) - \bar{\mu}_{\phi(y)}\}_{i=1}^{N_y}$. Denote by $\Phi$ the matrix whose columns are the vectors $\{\phi(x_i)\}_{i=1}^{N_x}$, $\{\phi(y_i)\}_{i=1}^{N_y}$, and define\n\n\[ \begin{array}{ll} D_1 = K \tilde{\lambda}, & D_2 = K (I - \lambda \mathbf{1}_N^T) \Lambda (I - \lambda \mathbf{1}_N^T)^T K^T, \\ D_4 = K, & D_3 = K (I - \lambda \mathbf{1}_N^T) \Gamma (I - \lambda \mathbf{1}_N^T)^T K^T + (\rho_x + \rho_y) K, \end{array} \]\n\nwhere $K$ is the kernel matrix $K_{ij} = (\Phi^T \Phi)_{ij}$, $\mathbf{1}_N$ is a vector of ones of length $N_x + N_y$, and $\tilde{\lambda} \in \mathbf{R}^{N_x+N_y}$ is such that $\tilde{\lambda}_i = \lambda_i$ for $i = 1, \ldots, N_x$ and $\tilde{\lambda}_i = -\lambda_i$ for $i = N_x + 1, \ldots, N_x + N_y$. Let $\nu^\star$ be an optimal solution of the problem\n\n\[ \begin{array}{ll} \mbox{maximize} & \min_{u^T D_4 u \leq 1} \frac{\nu^T (D_1 + D_2 u)(D_1 + D_2 u)^T \nu}{\nu^T D_3 \nu} \\ \mbox{subject to} & \nu \neq 0. \end{array} \qquad (20) \]\n\nThen $w_f = \Phi \nu^\star$ is an optimal solution of the problem (19). Moreover, for every point $z \in \mathbf{R}^n$,\n\n\[ w_f^T \phi(z) = \sum_{i=1}^{N_x} \nu_i^\star K(z, x_i) + \sum_{i=1}^{N_y} \nu_{i+N_x}^\star K(z, y_i). \qquad (21) \]\n\nAlong the lines of the proof of Corollary 5 in [6], we can prove this proposition.\n\nReferences\n\n[1] D. Bertsekas, A. Nedic, and A. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.\n[2] C. Bhattacharyya. Second order cone programming formulations for feature selection. Journal of Machine Learning Research, 5:1417-1433, 2004.\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[4] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, London, UK, 1993.\n[5] K. Huang, H. Yang, I. King, M. Lyu, and L. Chan. The minimum error minimax probability machine. Journal of Machine Learning Research, 5:1253-1286, 2004.\n[6] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555-582, 2002.\n[7] S. Mika, G. Ratsch, and K. Muller. 
A mathematical programming approach to the kernel Fisher algorithm. In Advances in Neural Information Processing Systems 13, pages 591-597. MIT Press, 2001.\n[8] S. Verdu and H. Poor. On minimax robustness: A general approach and applications. IEEE Transactions on Information Theory, 30(2):328-340, 1984.\n", "award": [], "sourceid": 2792, "authors": [{"given_name": "Seung-jean", "family_name": "Kim", "institution": null}, {"given_name": "Alessandro", "family_name": "Magnani", "institution": null}, {"given_name": "Stephen", "family_name": "Boyd", "institution": null}]}