{"title": "Invariant Feature Extraction and Classification in Kernel Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 526, "page_last": 532, "abstract": null, "full_text": "U nmixing Hyperspectral Data \n\nLucas Parra, Clay Spence, Paul Sajda \n\nSarnoff Corporation, CN-5300, Princeton, NJ 08543, USA \n\n{lparra, cspence,psajda} @sarnoff.com \n\nAndreas Ziehe, Klaus-Robert Miiller \n\nGMD FIRST.lDA, Kekulestr. 7, 12489 Berlin, Germany \n\n{ziehe,klaus}@first.gmd.de \n\nAbstract \n\nIn hyperspectral imagery one pixel typically consists of a mixture \nof the reflectance spectra of several materials, where the mixture \ncoefficients correspond to the abundances of the constituting ma(cid:173)\nterials. We assume linear combinations of reflectance spectra with \nsome additive normal sensor noise and derive a probabilistic MAP \nframework for analyzing hyperspectral data. As the material re(cid:173)\nflectance characteristics are not know a priori, we face the problem \nof unsupervised linear unmixing. The incorporation of different \nprior information (e.g. positivity and normalization of the abun(cid:173)\ndances) naturally leads to a family of interesting algorithms, for \nexample in the noise-free case yielding an algorithm that can be \nunderstood as constrained independent component analysis (ICA). \nSimulations underline the usefulness of our theory. \n\n1 \n\nIntroduction \n\nCurrent hyperspectral remote sensing technology can form images of ground surface \nreflectance at a few hundred wavelengths simultaneously, with wavelengths ranging \nfrom 0.4 to 2.5 J.Lm and spatial resolutions of 10-30 m. The applications of this \ntechnology include environmental monitoring and mineral exploration and mining. \nThe benefit of hyperspectral imagery is that many different objects and terrain \ntypes can be characterized by their spectral signature. 
\nThe first step in most hyperspectral image analysis systems is to perform a spectral \nunmixing to determine the original spectral signals of some set of prime materials. \nThe basic difficulty is that for a given image pixel the spectral reflectance patterns \nof the surface materials is in general not known a priori. However there are gen(cid:173)\neral physical and statistical priors which can be exploited to potentially improve \nspectral unmixing. In this paper we address the problem of unmixing hyperspectral \nimagery through incorporation of physical and statistical priors within an unsuper(cid:173)\nvised Bayesian framework. \n\nWe begin by first presenting the linear superposition model for the reflectances \nmeasured. We then discuss the advantages of unsupervised over supervised systems. \n\n\fUnmixing Hyperspectral Data \n\n943 \n\nWe derive a general maximum a posteriori (MAP) framework to find the material \nspectra and infer the abundances. Interestingly, depending on how the priors are \nincorporated, the zero noise case yields (i) a simplex approach or (ii) a constrained \nleA algorithm. Assuming non-zero noise our MAP estimate utilizes a constrained \nleast squares algorithm. The two latter approaches are new algorithms whereas the \nsimplex algorithm has been previously suggested for the analysis of hyperspectral \ndata. \n\nLinear Modeling To a first approximation the intensities X (Xi>.) measured in \neach spectral band A = 1, ... , L for a given pixel i = 1, ... , N are linear combi(cid:173)\nnations of the reflectance characteristics S (8m >.) of the materials m = 1, ... , M \npresent in that area. Possible errors of this approximation and sensor noise are \ntaken into account by adding a noise term N (ni>'). 
In matrix form this can be \nsummarized as \n\nX = AS + N, subject to: AIM = lL, A ~ 0, \n\n(1) \n\nwhere matrix A (aim) represents the abundance of material m in the area cor(cid:173)\nresponding to pixel i, with positivity and normalization constraints. Note that \nground inclination or a changing viewing angle may cause an overall scale factor for \nall bands that varies with the pixels. This can be incorporated in the model by sim(cid:173)\nply replacing the constraint AIM = lL with AIM ~ lL which does does not affect \nthe discussion in the remainder of the paper. This is clearly a simplified model of \nthe physical phenomena. For example, with spatially fine grained mixtures, called \nintimate mixtures, multiple reflectance may causes departures from this first or(cid:173)\nder model. Additionally there are a number of inherent spatial variations in real \ndata, such as inhomogeneous vapor and dust particles in the atmosphere, that will \ncause a departure from the linear model in equation (1). Nevertheless, in practical \napplications a linear model has produced reasonable results for areal mixtures. \n\nSupervised vs. Unsupervised techniques Supervised spectral un mixing re(cid:173)\nlies on the prior knowledge about the reflectance patterns S of candidate surface \nmaterials, sometimes called endmembers, or expert knowledge and a series of semi(cid:173)\nautomatic steps to find the constituting materials in a particular scene. Once the \nuser identifies a pixel i containing a single material, i.e. aim = 1 for a given m and \ni, the corresponding spectral characteristics of that material can be taken directly \nfrom the observations, i.e., 8 m >. = Xi>. [4]. Given knowledge about the endmembers \none can simply find the abundances by solving a constrained least squares problem. 
\nThe problem with such supervised techniques is that finding the correct S may re(cid:173)\nquire substantial user interaction and the result may be error prone, as a pixel that \nactually contains a mixture can be misinterpreted as a pure endmember. Another \napproach obtains endmembers directly from a database. This is also problematic \nbecause the actual surface material on the ground may not match the database en(cid:173)\ntries, due to atmospheric absorption or other noise sources. Finding close matches \nis an ambiguous process as some endmembers have very similar reflectance charac(cid:173)\nteristics and may match several entries in the database. \nUnsupervised unmixing, in contrast, tries to identify the endmembers and mixtures \ndirectly from the observed data X without any user interaction. There are a variety \nof such approaches. In one approach a simplex is fit to the data distribution [7, 6, 2]. \nThe resulting vertex points of the simplex represent the desired endmembers, but \nthis technique is very sensitive to noise as a few boundary points can potentially \nchange the location of the simplex vertex points considerably. Another approach by \nSzu [9] tries to find abundances that have the highest entropy subject to constraints \nthat the amount of materials is as evenly distributed as possible - an assumption \n\n\f944 \n\nL. Parra, C. D. Spence, P Sajda, A. Ziehe and K.-R. Muller \n\nwhich is clearly not valid in many actual surface material distributions. A relatively \nnew approach considers modeling the statistical information across wavelength as \nstatistically independent AR processes [1]. This leads directly to the contextual \nlinear leA algorithm [5]. However, the approach in [1] does not take into account \nconstraints on the abundances, noise, or prior information. Most importantly, the \nmethod [1] can only integrate information from a small number of pixels at a time \n(same as the number of endmembers). 
Typically however we will have only a few \nendmembers but many thousand pixels. \n\n2 The Maximum A Posterior Framework \n\n2.1 A probabilistic model of unsupervised spectral unmixing \n\nOur model has observations or data X and hidden variables A, S, and N that \nare explained by the noisy linear model (1). We estimate the values of the hidden \nvariables by using MAP \n\n(A SIX) = p(XIA, S)p(A, S) = Pn(XIA, S)Pa(A)ps(S) \np \n\np(X) \n\n, \n\np(X) \n\n(2) \n\nwith Pa(A), Ps(S), Pn(N) as the a priori assumptions of the distributions. With \nMAP we estimate the most probable values for given priors after observing the data, \n\nA MAP , SMAP = argmaxp(A, SIX) \n\nA,S \n\n(3) \n\nNote that for maximization the constant factor p(X) can be ignored. Our first as(cid:173)\nsumption, which is indicated in equation (2) is that the abundances are independent \nof the reflectance spectra as their origins are completely unrelated: (AO) A and S \nare independent. \n\nThe MAP algorithm is entirely defined by the choices of priors that are guided by \nthe problem of hyperspectral unmixing: (AI) A represent probabilities for each \npixel i. (A2) S are independent for different material m. (A3) N are normal i.i.d. \nfor all i, A. In summary, our MAP framework includes the assumptions AO-A3. \n\n2.2 \n\nIncluding Priors \n\nPriors on the abundances Positivity and normalization of the abundances can \nbe represented as, \n\n(4) \nwhere 60 represent the Kronecker delta function and eo the step function. With \nthis choice a point not satisfying the constraint will have zero a posteriori probabil(cid:173)\nity. This prior introduces no particular bias of the solutions other then abundance \nconstraints. It does however assume the abundances of different pixels to be inde(cid:173)\npendent. \n\nPrior on spectra Usually we find systematic trends in the spectra that cause \nsignificant correlation. 
However such an overall trend can be subtracted and/or \nfiltered from the data leaving only independent signals that encode the variation \nfrom that overall trend. For example one can capture the conditional dependency \nstructure with a linear auto-regressive (AR) model and analyze the resulting \"inno(cid:173)\nvations\" or prediction errors [3]. In our model we assume that the spectra represent \nindependent instances of an AR process having a white innovation process em.>. dis(cid:173)\ntributed according to Pe(e). With a Toeplitz matrix T of the AR coefficients we \n\n\fUnmixing Hyperspectral Data \n\n945 \n\ncan write, em = Sm T. The AR coefficients can be found in a preprocessing step on \nthe observations X. If S now represents the innovation process itself, our prior can \nbe represented as, \n\nPe (S) .d>.>.,) , \n\nM \n\nL \n\nL \n\nm=1 >.=1 \n\n>.'=1 \n\n(5) \n\nAdditionally Pe (e) is parameterized by a mean and scale parameter and potentially \nparameters determining the higher moments of the distributions. For brevity we \nignore the details of the parameterization in this paper. \n\nPrior on the noise As outlined in the introduction there are a number of prob(cid:173)\nlems that can cause the linear model X = AS to be inaccurate (e.g. multiple \nreflections, inhomogeneous atmospheric absorption, and detector noise.) As it is \nhard to treat all these phenomena explicitly, we suggest to pool them into one noise \nvariable that we assume for simplicity to be normal distributed with a wavelength \ndependent noise variance a>., \n\np(XIA, S) = Pn(N) = N(X - AS,~) = II N(x>. - As>., a>.l) , \n\nL \n\n(6) \n\nwhere N (', .) represents a zero mean Gaussian distribution, and 1 the identity matrix \nindicating the independent noise at each pixel. \n\n>.=1 \n\n2.3 MAP Solution for Zero Noise Case \n\nLet us consider the noise-free case. 
Although this simplification may be inaccurate it \nwill allow us to greatly reduce the number of free hidden variables - from N M + M L \nto M2 . In the noise-free case the variables A, S are then deterministically dependent \non each other through a N L-dimensional 8-distribution, Pn(XIAS) = 8(X - AS). \nWe can remove one of these variables from our discussion by integrating (2). It is \ninstructive to first consider removing A \n\np(SIX) M. In that case the observations X can be \nmapped into a M dimensional subspace using the singular value decomposition (SVD) , \nX = UDVT , The discussion applies then to the reduced observations X = u1x with \nU M being the first M columns of U . \n\n\f946 \n\nL. Parra. C. D. Spence. P Sajda. A. Ziehe and K.-R. Muller \n\nthe prior on A will restrict the solutions to satisfy the abundance constraints and \nbias the result depending on the detailed choice of Pa(A), so we are led to con(cid:173)\nstrained ICA. \nIn summary, depending on which variable we integrate out we obtain two methods \nfor solving the spectral unmixing problem: the known technique of simplex fitting \nand a new constrained ICA algorithm. \n\n2.4 MAP Solution for the Noisy Case \n\nCombining the choices for the priors made in section 2.2 (Eqs.(4), (5) and (6)) with \n(2) and (3) we obtain \n\nAMAP, SMAP = \"''i~ax ft {g N(x\", - a,s\" a,) ll. P,(t. 'm,d\",) } , \n\n(9) \n\nsubject to AIM = lL, A 2: O. The logarithm of the cost function in (9) is denoted \nby L = L(A, S). Its gradient with respect to the hidden variables is \n\n88L = _AT nm diag(O')-l -\nSm \n\nfs(sm) \n\n(10) \n\nwhere N = X - AS, nm are the M column vectors of N, fs(s) = - olnc;(s). In (10) \nfs is applied to each element of Sm. \nThe optimization with respect to A for given S can be implemented as a standard \nweighted least squares (L8) problem with a linear constraint and positivity bounds. 
\nSince the constraints apply for every pixel independently one can solve N separate \nconstrained LS problems of M unknowns each. We alternate between gradient steps \nfor S and explicit solutions for A until convergence. Any additional parameters of \nPe(e) such as scale and mean may be obtained in a maximum likelihood (ML) sense \nby maximizing L. Note that the nonlinear optimization is not subject to constraints; \nthe constraints apply only in the quadratic optimization. \n\n3 Experiments \n\n3.1 Zero Noise Case: Artificial Mixtures \n\nIn our first experiment we use mineral data from the United States Geological Sur(cid:173)\nvey (USGS)2 to build artificial mixtures for evaluating our unsupervised unmixing \nframework. Three target endmembers where chosen (Almandine WS479, Montmo(cid:173)\nrillonite+Illi CM42 and Dickite NMNH106242). A spectral scene of 100 samples \nwas constructed by creating a random mixture of the three minerals. Of the 100 \nsamples, there were no pure samples (Le. no mineral had more than a 80% abun(cid:173)\ndance in any sample). Figure 1A is the spectra of the endmembers recovered by the \nconstrained ICA technique of section 2.3, where the constraints were implemented \nwith penalty terms added to the conventional maximum likelihood ICA algorithm. \nThese are nearly identical to the spectra of the true endmembers, shown in fig(cid:173)\nure 1B, which were used for mixing. Interesting to note is the scatter-plot of the \n100 samples across two bands. The open circles are the absorption values at these \ntwo bands for endmembers found by the MAP technique. Given that each mixed \nsample consists of no more than 80% of any endmember, the endmember points \non the scatter-plot are quite distant from the cluster. A simplex fitting technique \nwould have significant difficulty recovering the endmembers from this clustering. \n\n2see http://speclab.cr . 
usgs.gov /spectral.lib.456.descript/ decript04.html \n\n\fUnmixing Hyperspectral Data \n\n947 \n\nfound endmembers \n\ntarget endmembers \n\nobserved X and found S \n\no \n\ng 0.8 \n~ \n., \n~0.6 \n~ \n~ 0.4 \n\no \n\nO~------' \n\n50 \n\n100 150 200 \n\n50 \n\n100 150 200 \n\nO~------' \n\n0.2'---~------' \n\nwavelength \n\nA \n\nwavelength \n\nB \n\n0.4 \n\n0.6 \n\n0.8 \nwavelength=30 \n\nC \n\nFigure 1: Results for noise-free artificial mixture. A recovered endmembers using \nMAP technique. B \"true\" target endmembers. C scatter plot of samples across 2 \nbands showing the absorption of the three endmembers computed by MAP (open \ncircles). \n\n3.2 Noisy Case: Real Mixtures \n\nTo validate the noise model MAP framework of section 2.4 we conducted an ex(cid:173)\nperiment using ground truthed USGS data representing real mixtures. We selected \nlOxl0 blocks of pixels from three different regions3 in the AVIRIS data of the \nCuprite, Nevada mining district. We separate these 300 mixed spectra assuming \ntwo endmembers and an AR detrending with 5 AR coefficients and the MAP tech(cid:173)\nniques of section 2.4. Overall brightness was accounted for as explain in the linear \nmodeling of section 1. The endmembers are shown in figure 2A and B in comparison \nto laboratory spectra from the USGS spectral library for these minerals [8J . Figure \n2C shows the corresponding abundances, which match the ground truth; region \n(III) mainly consists of Muscovite while regions (1)+(I1) contain (areal) mixtures of \nKaolinite and Muscovite. \n\n4 Discussion \n\nHyperspectral unmixing is a challenging practical problem for unsupervised learn(cid:173)\ning. Our probabilistic approach leads to several interesting algorithms: (1) simplex \nfitting, (2) constrained ICA and (3) constrained least squares that can efficiently use \nmulti-channel information. An important element of our approach is the explicit \nuse of prior information. 
Our simulation examples show that we can recover the \nendmembers, even in the presence of noise and model uncertainty. The approach \ndescribed in this paper does not yet exploit local correlations between neighboring \npixels that are well known to exist. Future work will therefore exploit not only \nspectral but also spatial prior information for detecting objects and materials. \n\nAcknowledgments \n\nWe would like to thank Gregg Swayze at the USGS for assistance in obtaining the \ndata. \n\n3The regions were from the image plate2.cuprite95.alpha.2um.image.wlocals.gif in \nftp:/ /speclab.cr.usgs.gov /pub/cuprite/gregg.thesis.images/, at the coordinates (265,710) \nand (275,697), which contained Kaolinite and Muscovite 2, and (143,661), which only \ncontained Muscovite 2. \n\n\f948 \n\n0.65 \n\n0.6 \n\n0.55 \n\n0.5 \n\n0.45 \n\nL. Parra, C. D, Spence, P Sajda, A. Ziehe and K-R. Muller \n\nMuscovite \n\nKaolinite \n\n0.8 \n\n0.7 \n\n0.6 \n\n0.4 \n\n0.3 \n\n'c .\u2022.\u2022 \", \"'0 .. \n' ., \n\n0.4,--~--:-:-:-\"-~----:--:--~ \n220 \n\n210 \n\n160 \n\n190 \n200 \nwaveleng1h \n\n180 \n\n190 \n\n200 \nwavelength \n\n210 \n\n220 \n\nA \n\nB \n\nC \n\nFigure 2: A Spectra of computed endmember (solid line) vs Muscovite sample \nspectra from the USGS data base library. Note we show only part of the spectrum \nsince the discriminating features are located only between band 172 and 220. B \nComputed endmember (solid line) vs Kaolinite sample spectra from the USGS data \nbase library. C Abundances for Kaolinite and Muscovite for three regions (lighter \npixels represent higher abundance). Region 1 and region 2 have similar abundances \nfor Kaolinite and Muscovite, while region 3 contains more Muscovite. \n\nReferences \n\n[1] J. Bayliss, J. A. Gualtieri, and R. Cromp. Analyzing hyperspectral data with \nindependent component analysis. In J. M. Selander, editor, Proc. SPIE Applied \nImage and Pattern Recognition Workshop, volume 9, P.O. 
Box 10, Bellingham \nWA 98227-0010, 1997. SPIE. \n\n[2] J.W. Boardman and F.A. Kruse. Automated spectral analysis: a geologic exam(cid:173)\n\nple using AVIRIS data, north Grapevine Mountains, Nevada. In Tenth Thematic \nConference on Geologic Remote Sensing, pages 407-418, Ann arbor, MI, 1994. \nEnvironmental Research Institute of Michigan. \n\n[3] S. Haykin. Adaptive Filter Theory. Prentice Hall, 1991. \n[4] F. Maselli, , M. Pieri, and C. Conese. Automatic identification of end-members \nfor the spectral decomposition of remotely sensed scenes. Remote Sensing for \nGeography, Geology, Land Planning, and Cultural Heritage (SPIE) , 2960:104-\n109,1996. \n\n[5] B. Pearlmutter and L. Parra. Maximum likelihood blind source separation: A \ncontext-sensitive generalization ofICA. In M. Mozer, M. Jordan, and T. Petsche, \neditors, Advances in Neural Information Processing Systems 9, pages 613-619, \nCambridge MA, 1997. MIT Press. \n\n[6] J.J. Settle. Linear mixing and the estimation of ground cover proportions. In(cid:173)\n\nternational Journal of Remote Sensing, 14:1159-1177,1993. \n\n[7] M.O. Smith, J .B. Adams, and A.R. Gillespie. Reference endmembers for spectral \nmixture analysis. In Fifth Australian remote sensing conference, volume 1, pages \n331-340, 1990. \n\n[8] U.S. Geological Survey. USGS digital spectral library. Open File Report 93-592, \n\n1993. \n\n[9] H. Szu and C. Hsu. Landsat spectral demixing a la superresolution of blind \nmatrix inversion by constraint MaxEnt neural nets. In Wavelet Applications \nIV, volume 3078, pages 147-160. SPIE, 1997. \n\n\fInvariant Feature Extraction and \n\nClassification in Kernel Spaces \n\nSebastian Mikal , Gunnar Ratschl , Jason Weston2 , \n\nBernhard Sch8lkopf3, Alex Smola4 , and Klaus-Robert Mullerl \n\n1 GMD FIRST, Kekulestr. 
7,12489 Berlin, Germany \n\n2 Barnhill BioInformatics, 6709 Waters Av., Savannah, GR 31406, USA \n3 Microsoft Research Ltd., 1 Guildhall Street, Cambridge CB2 3NH, UK \n\n4 Australian National University, Canberra, 0200 ACT, Australia \n\n{mika, raetsch, klaus }@first.gmd.de, jasonw@dcs.rhbnc.ac.uk \n\nbsc@microsoft.com, Alex.Smola.anu.edu.au \n\nAbstract \n\nWe incorporate prior knowledge to construct nonlinear algorithms \nfor invariant feature extraction and discrimination. Employing a \nunified framework in terms of a nonlinear variant of the Rayleigh \ncoefficient, we propose non-linear generalizations of Fisher's dis(cid:173)\ncriminant and oriented PCA using Support Vector kernel functions . \nExtensive simulations show the utility of our approach. \n\n1 \n\nIntroduction \n\nIt is common practice to preprocess data by extracting linear or nonlinear features. \nThe most well-known feature extraction technique is principal component analysis \nPCA (e.g. [3]). It aims to find an orthonormal, ordered basis such that the i-th \ndirection describes as much variance as possible while maintaining orthogonality to \nall other directions. However, since PCA is a linear technique, it is too limited to \ncapture interesting nonlinear structure in a data set and nonlinear generalizations \nhave been proposed, among them Kernel PCA [14], which computes the principal \ncomponents of the data set mapped nonlinearly into some high dimensional feature \nspace F. \nOften one has prior information, for instance, we might know that the sample is \ncorrupted by noise or that there are invariances under which a classification should \nnot change. For feature extraction, the concepts of known noise or transformation \ninvariance are to a certain degree equivalent, i.e. they can both be interpreted as \ncausing a change in the feature which ought to be minimized. 
Clearly, invariance \nalone is not a sufficient condition for a good feature, as we could simply take the \nconstant function. What one would like to obtain is a feature which is as invariant \nas possible while still covering as much of the information necessary for describing \nthe particular data. Considering only one (linear) feature vector wand restricting \nto first and second order statistics of the data one arrives at a maximization of the \nso called Rayleigh coefficient \n\n(1) \n\n\fInvariant Feature Extraction and Classification in Kernel Spaces \n\n527 \n\nwhere w is the feature vector and Sf, SN are matrices describing the desired and \nundesired properties of the feature , respectively (e.g. information and noise). If S/ \nis the data covariance and SN the noise covariance, we obtain oriented PCA [3J . \nIf we leave the field of data description to perform supervised classification, it is \ncommon to choose S / as the separability of class centers (between class variance) \nand SN to be the within class variance. In that case , we recover the well known \nFisher Discriminant [7J. The ratio in (1) is maximized when we cover much of \nthe information coded by S/ while avoiding the one coded by SN . The problem is \nknown to be solved, in analogy to PCA , by a generalized symmetric eigenproblem \nS/w = >\"SNW [3], where>.. E ~ is the corresponding (biggest) eigenvalue. \nIn this paper we generalize this setting to a nonlinear one. In analogy to [8, 14J \nwe first map the data via some nonlinear mapping to some high-dimensional fea(cid:173)\nture space F and then optimize (1) in F . To avoid working with the mapped data \nexplicitly (which might be impossible if F is infinite dimensional) we introduce sup(cid:173)\nport vector kernel functions [11], the well-known kernel trick. These kernel functions \nk(x , y) compute a dot product in some feature space F , i.e. k(x , y) = ((x)\u00b7 (y)) . 
\nFormulating the algorithms in Fusing only in dot products, we can replace any \noccurrence of a dot product by the kernel function k. Possible choices for k which \nhave proven useful e.g. in Support Vector Machines [2] or Kernel PCA [14J are Gaus(cid:173)\nsian RBF, k(x , y) = exp( -llx - yI12/c), or polynomial kernels , k(x , y) = (x\u00b7 y)d , \nfor some positive constants c E ~ and dEN, respectively. \nThe remainder of this paper is organized as follows: The next section shows how to \nformulate the optimization problem induced by (1) in feature space. Section 3 con(cid:173)\nsiders various ways to find Fisher's Discriminant in F; we conclude with extensive \nexperiments in section 4 and a discussion of our findings. \n\n2 Kernelizing the Rayleigh Coefficient \n\nTo optimize (1) in some kernel feature space F we need to find a formulation which \nuses only dot products of -images. As numerator and denominator are both scalars \nthis can be done independently. Furthermore, the matrices S/ and SN are basically \ncovariances and thus the sum over outer products of -images. Therefore, and due \nto the linear nature of (1) every solution W E F can be written as an expansion in \nterms of mapped training datal, i.e. \n\nl \n\nW = L Cti(Xi). \n\ni=l \n\n(2) \n\nTo define some common choices in F let X = {Xl , .. . ,xe} be our training sample \nand, where appropriate, Xl U X2 = X , Xl n X2 = 0, two subclasses (with I Xi I = \u00a3i). \nWe get the full covariance of X by \n\nC = f L ((x) - m)((x) - m)T with m = f L (x) , \n\n1 \n\n1 \n\n(3) \n\n~EX \n\n~EX \n\nI SB and Sw are operators on a (finite-dimensional) subspace spanned by the CP(Xi) (in \na possibly infinite space). Let w = VI + V2, where VI E Span(CP(Xi) : i = 1, .. . ,f) and \nV2 1. Span(CP(xi) : i = 1, ... , f) . 
Then for S = Sw or S = SB (which are both symmetric) \n\n(w , Sw) \n\n((VI + V2) , S(VI + V2)) \n((VI + V2)S, VI) \n(VI , SVI) \n\nAs VI lies in the span of the cp(Xi) and S only operates on this subspace there exist an \nexpansion of w which maximizes J(w) . \n\n\f528 \n\nS. Mika. G. Riitsch. J. Weston. B. Scholkopj, A. J. Smola and K.-R. Muller \n\nwhich could be used as Sf in oriented Kernel PCA. For SN we could use an estimate \nof the noise covariance, analogous to the definition of C but over mapped patterns \nsampled from the assumed noise distribution. The standard formulation of the \nFisher discriminant in F, yielding the Kernel Fisher Discriminant (KFD) [8] is \ngiven by \n\nSw = L \n\nL \n\n(cJ>(x) - mi)(cJ>(x) - mdT \n\nand SB = (m2 - mt}(m2 - ml)T, \n\ni=I,2 xEX; \n\nthe within-class scatter Sw (as S N), and the between class scatter S B ( as Sf). Here \nmi is the sample mean for patterns from class i. \nTo incorporate a known invariance e.g. in oriented Kernel PCA, one could use the \ntangent covariance matrix [12], \n\n1 \n\nT = ft2 L \n\n(cJ>(x) - cJ>(\u00a3tx))(cJ>(x) - cJ>(\u00a3tx))T for some small t> O. \n\n(4) \n\n:IlEX \n\nHere \u00a3t is a local I-parameter transformation. T is a finite difference approximation \nt of the covariance of the tangent of \u00a3t at point cJ>(x) (details e.g. in [12]). Using \nSf = C and SN = T in oriented Kernel PCA, we impose invariance under the local \ntransformation \u00a3t. Crucially, this matrix is not only constructed from the training \npatterns X. Therefore, the argument used to find the expansion (2) is slightly \nincorrect. Neverthless, we can assume that (2) is a reasonable approximation for \ndescribing the variance induced by T. \nMultiplying either of these matrices from the left and right with the expansion (2), \nwe can find a formulation which uses only dot products. For the sake of brevity, we \nonly give the explicit formulation of (1) in F for KFD (cf. [8] for details) . 
Defining \n(I-'i)j = t L:IlEXi k(xj,x) we can write (1) for KFD as \naTMa \naTNa' \n\n(5) \nwhere N = KKT - Li=1,2fil-'iI-'T, I-' = 1-'2 - 1-'1 ' M = I-'I-'T, and Kij = k(xi,xj). \nThe results for other choices of Sf and S N in F as for the cases of oriented kernel \nPCA or transformation invariance can be obtained along the same lines. Note that \nwe still have to maximize a Rayleigh coefficient. However, now it is a quotient in \nterms of expansion coefficients a, and not in terms of w E F which is a potentially \ninfinite-dimensional space. Furthermore, it is well known that the solution for this \nspecial eigenproblem is in the direction of N-1 (1-'2 - 1-'1) [7), which can be solved \nusing e.g. a Cholesky factorization of N. The projection of a new pattern x onto \nw in F can then be computed by \n\nJ(a) = (aTI-') 2 \naTNa \n\n(w\u00b7 cJ>(x)) = LQik(xi'x). \n\nl \n\ni=1 \n\n(6) \n\n3 Algorithms \n\nEstimating a covariance matrix with rank up to f from f samples is ill-posed. Fur(cid:173)\nthermore, by performing an explicit centering in F each covariance matrix loses one \nmore dimension, i.e. it has only rank f - 1 (even worse, for KFD the matrix N has \nrank f - 2). Thus the ratio in (1) is not well defined anymore, as the denomina(cid:173)\ntor might become zero. In the following we will propose several ways to deal with \nthis problem in KFD. Furthermore we will tackle the question how to solve the \noptimization problem of KFD more efficiently. So far, we have an eigenproblem of \nsize .e x .e. If .e becomes large this is numerically demanding. Reformulations of the \noriginal problem allow to overcome some of these limitations. Finally, we describe \nthe connection between KFD and RBF networks. \n\n\fInvariant Feature Extraction and Classification in Kernel Spaces \n\n529 \n\n3.1 Regularization and Solution on a Subspace \n\nAs noted before, the matrix N has only rank \u00a3 - 2. 
Besides numerical problems \nwhich can cause the matrix N to be not even positive, we could think of imposing \nsome regularization to control capacity in F. To this end, we simply add a mUltiple \nof the identity matrix to N, Le. replace N by NJ1. where \n\nNJ1. := N + /-LI. \n\n(7) \nThis can be viewed in different ways: (i) for /-L > 0 it makes the problem feasible \nand numerically more stable as NJ1. becomes positive; (ii) it can be seen as decreas(cid:173)\ning the bias in sample based estimation of eigenvalues (cf. [6)); (iii) it imposes a \nregularization on 110112, favoring solutions with small expansion coefficients. fur(cid:173)\nthermore, one could use other regularization type additives to N, e.g. penalizing \nIIwl12 in analogy to SVM (by adding the kernel matrix Kij = k(xi' Xj)). \nTo optimize (5) we need to solve an \u00a3 x \u00a3 eigenproblem, which might be intractable \nfor large \u00a3. As the solutions are not sparse one can not directly use efficient algo(cid:173)\nrithms like chunking for Support Vector Machines (cf. [13]). To this end, we might \nrestrict the solution to lie in a subspace, Le. instead of expanding w by (2) we write \n\n(8) \n\ni=l \n\nwith m < l. The patterns Zi could either be a subset of the training patterns X \nor e.g. be estimated by some clustering algorithm. The derivation of (5) does not \nchange, only K is now m x \u00a3 and we end up with m x m matrices N and M. Another \nadvantage is, that it increases the rank of N (relative to its size) although there \nstill might be some need for regularization. \n\n3.2 Quadratic optimization and Sparsification \n\nEven if N has full rank, maximizing (5) is underdetermined: if 0 is optimal, then so \nis any multiple thereof. Since 0 T M 0 = (0 T J..L)2, M has rank one. Thus we can seek \nfor a vector 0, such that oTNo is minimal for fixed OTJ..L (e.g. to 1). 
The solution is unique, and we can find the optimal α by solving the quadratic optimization problem

    minimize α^T N α   subject to   α^T μ = 1.    (9)

Although the quadratic optimization problem is not easier to solve than the eigenproblem, it has an appealing interpretation. The constraint α^T μ = 1 ensures that the average class distance, projected onto the direction of discrimination, is constant, while the intra-class variance is minimized, i.e. we maximize the average margin. In contrast, the SVM approach [2] optimizes for a large minimal margin. Considering (9), we are able to overcome another shortcoming of KFD. The solutions α are not sparse, and thus evaluating (6) is expensive. To solve this we can add an l1-regularizer λ‖α‖₁ to the objective function, where λ is a regularization parameter allowing us to adjust the degree of sparseness.

3.3 Connection to RBF Networks

Interestingly, there exists a close connection between RBF networks (e.g. [9, 1]) and KFD. If we add no regularization and expand in all training patterns, we find that an optimal α is given by α = K^{-1} y, where K is the symmetric, positive matrix of all kernel elements k(x_i, x_j) and y is the ±1 label vector.² An RBF network with the same kernel at each sample and fixed kernel width gives the same solution if the mean squared error between labels and outputs is minimized. Also for the case of restricted expansions (8) there exists a connection to RBF networks with a smaller number of centers (cf. [4]).

² To see this, note that N can be written as N = KDK, where D = I − y₁y₁^T − y₂y₂^T has rank ℓ − 2, while y_i is the vector with entries 1/√ℓ_i for patterns from class i and zero otherwise.

530    S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. J. Smola and K.-R. Müller

Table 1: Comparison between KFD, a single RBF classifier, AdaBoost (AB), regularized AdaBoost (ABR) and SVMs (see text). Best result in bold face, second best in italics.

             RBF          AB           ABR          SVM          KFD
  Banana     10.8±0.06    12.3±0.07    10.9±0.04    11.5±0.07    10.8±0.05
  B.Cancer   27.6±0.47    30.4±0.47    26.5±0.45    26.0±0.47    25.8±0.46
  Diabetes   24.3±0.19    26.5±0.23    23.8±0.18    23.5±0.17    23.2±0.16
  German     24.7±0.24    27.5±0.25    24.3±0.21    23.6±0.21    23.1±0.22
  Heart      17.6±0.33    20.3±0.34    16.5±0.35    16.0±0.33    16.1±0.34
  Image       3.3±0.06     2.7±0.07     2.7±0.06     3.0±0.06     4.8±0.06
  Ringnorm    1.7±0.02     1.9±0.03     1.6±0.01     1.7±0.01     1.5±0.01
  F.Sonar    34.4±0.20    35.7±0.18    34.2±0.22    32.4±0.18    33.2±0.17
  Splice     10.0±0.10    10.1±0.05     9.5±0.07    10.9±0.07    10.5±0.06
  Thyroid     4.5±0.21     4.4±0.22     4.6±0.22     4.8±0.22     4.2±0.21
  Titanic    23.3±0.13    22.6±0.12    22.6±0.12    22.4±0.10    23.2±0.20
  Twonorm     2.9±0.03     3.0±0.03     2.7±0.02     3.0±0.02     2.6±0.02
  Waveform   10.7±0.11    10.8±0.06     9.8±0.08     9.9±0.04     9.9±0.04

4 Experiments

Kernel Fisher Discriminant. Figure 1 shows an illustrative comparison of the features found by KFD and kernel PCA. The KFD feature discriminates the two classes; the first kernel PCA feature picks up the important nonlinear structure.
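The connection to RBF networks stated in section 3.3 can also be checked numerically. The sketch below, our own construction on synthetic data, builds N = KDK as in footnote 2 and verifies that α = K^{-1} y makes the denominator α^T N α vanish while α^T μ stays away from zero:

```python
import numpy as np

def kernel(A, B, gamma=5.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, (15, 2))
X2 = rng.normal(2.0, 1.0, (25, 2))
X = np.vstack([X1, X2])
l1, l2, l = len(X1), len(X2), len(X)
y = np.concatenate([-np.ones(l1), np.ones(l2)])    # the +-1 label vector

K = kernel(X, X)                                   # full rank for distinct points
alpha = np.linalg.solve(K, y)                      # the claimed optimum K^{-1} y

# Footnote 2: N = K D K with D = I - y1 y1^T - y2 y2^T, where y_i holds
# 1/sqrt(l_i) on the patterns of class i and zero elsewhere.
y1 = np.concatenate([np.full(l1, 1.0 / np.sqrt(l1)), np.zeros(l2)])
y2 = np.concatenate([np.zeros(l1), np.full(l2, 1.0 / np.sqrt(l2))])
D = np.eye(l) - np.outer(y1, y1) - np.outer(y2, y2)
N = K @ D @ K

mu = K[:, l1:].mean(axis=1) - K[:, :l1].mean(axis=1)
quad = alpha @ N @ alpha          # = y^T D y = 0, since D y = 0
gap = alpha @ mu                  # = 2, in particular nonzero
```

Since D y = 0, the denominator vanishes for this α regardless of the kernel, while the constraint value α^T μ stays at 2, which is why α = K^{-1} y is optimal up to rescaling.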
To evaluate the performance of the KFD on real data sets, we performed an extensive comparison to other state-of-the-art classifiers, whose details are reported in [8].³ We compared the Kernel Fisher Discriminant and Support Vector Machines, both with Gaussian kernel, to AdaBoost [5] and regularized AdaBoost [10] (cf. Table 1). For KFD we used the regularized within-class scatter (7) and computed projections onto the optimal direction w ∈ F by means of (6). To use w for classification, we have to estimate a threshold. This can be done e.g. by trying all thresholds between two outputs on the training set and selecting the median of those with the smallest empirical error, or (as we did here) by computing the threshold which maximizes the margin on the outputs, in analogy to a Support Vector Machine, where we deal with errors on the training set by using the SVM soft-margin approach. A disadvantage of the latter is, however, that we have to control the regularization constant for the slack variables. The results in Table 1 show the average test error and the standard

If K has full rank, the null space of D, which is spanned by y₁ and y₂, is the null space of N. For α = K^{-1} y we get α^T N α = 0 and α^T μ ≠ 0. As we are free to fix the constraint α^T μ to any positive constant (not just 1), this α is also feasible.

³ The breast cancer domain was obtained from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for the data. All data sets used in the experiments can be obtained via http://www.first.gmd.de/~raetsch/.

Figure 1: Comparison of the feature found by KFD (left) and the first kernel PCA feature (right). Depicted are the two classes (information used only by KFD) as dots and crosses, and levels of equal feature value.
Both with a polynomial kernel of degree two; KFD with the regularized within-class scatter (7) (μ = 10⁻³).

deviation of the averages' estimation, over 100 runs with different realizations of the data sets. To estimate the necessary parameters, we ran 5-fold cross-validation on the first five realizations of the training sets and took the model parameters to be the median over the five estimates (see [10] for details of the experimental setup).

Using prior knowledge. A toy example (figure 2) shows a comparison of kernel PCA and oriented kernel PCA, which used as S_I the full covariance (3) and as noise matrix S_N the tangent covariance (4) of (i) rotated patterns and (ii) patterns translated along the x-axis. The toy example shows how imposing the desired invariance yields meaningful invariant features.

In another experiment we incorporated prior knowledge into KFD. We used the USPS database of handwritten digits, which consists of 7291 training and 2007 test patterns, each a 256-dimensional gray-scale image of the digits 0-9. We used the regularized within-class scatter (7) (μ = 10⁻³) as S_N and added to it a multiple λ of the tangent covariance (4), i.e. S_N = N_μ + λT. As invariance transformations we chose horizontal and vertical translation, rotation, and thickening (cf. [12]), where we simply averaged the matrices corresponding to each transformation. The feature was extracted by using the restricted expansion (8), where the patterns z_i were the first 3000 training samples. As kernel we chose a Gaussian of width 0.3 · 256, which is optimal for SVMs [12]. For each class we trained one KFD which classified that class against the rest, and computed the 10-class error by the winner-takes-all scheme.
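As an aside, the first of the two threshold heuristics mentioned in the comparison experiments (scan the midpoints between consecutive training outputs and take the median of the empirical-error minimizers) can be written down directly; the "outputs" below are synthetic stand-ins for KFD projections, not data from the paper:

```python
import numpy as np

def median_min_error_threshold(outputs, labels):
    """Median of the candidate thresholds (midpoints between consecutive
    sorted outputs) that achieve the smallest training error."""
    s = np.sort(outputs)
    candidates = (s[:-1] + s[1:]) / 2.0
    errors = np.array([np.mean((outputs > t) != (labels > 0))
                       for t in candidates])
    return np.median(candidates[errors == errors.min()])

rng = np.random.default_rng(3)
out = np.concatenate([rng.normal(-1.0, 0.5, 50),   # projections of class -1
                      rng.normal(+1.0, 0.5, 50)])  # projections of class +1
lab = np.concatenate([-np.ones(50), np.ones(50)])

b = median_min_error_threshold(out, lab)
train_err = np.mean((out > b) != (lab > 0))
```

Because the empirical error is piecewise constant in the threshold, only the midpoints between consecutive outputs need to be examined; the median picks a central representative when several candidates tie.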
The threshold was estimated by minimizing the empirical risk on the normalized outputs of KFD.

Without invariances, i.e. λ = 0, we achieved a test error of 3.7%, slightly better than a plain SVM with the same kernel (4.2%) [12]. For λ = 10⁻³, using the tangent covariance matrix led to a very slight improvement to 3.6%. That the result was not significantly better than the corresponding one for KFD (3.7%) can be attributed to the fact that we used the same expansion coefficients in both cases. The tangent covariance matrix, however, lives in a slightly different subspace. And indeed, a subsequent experiment in which we used vectors obtained by clustering a larger data set, including virtual examples generated by the appropriate invariance transformations, led to 3.1%, comparable to an SVM using prior knowledge (e.g. [12]; best SVM result 2.9%, with local kernel and virtual support vectors).

5 Conclusion

In the task of learning from data, having prior knowledge about e.g. invariances is equivalent to having prior knowledge about specific sources of noise. In the case of feature extraction, we seek features which are sufficiently (noise-)invariant while still describing interesting structure. Oriented PCA and, closely related, Fisher's discriminant use particularly simple features, since they consider only first- and second-order statistics for maximizing the Rayleigh coefficient (1). Since linear methods can be too restricted in many real-world applications, we used support vector kernel functions to obtain nonlinear versions of these algorithms, namely oriented kernel PCA and kernel Fisher discriminant analysis.

Our experiments show that the kernel Fisher discriminant is competitive or in

Figure 2: Comparison of the first features found by kernel PCA and oriented kernel PCA (see text); from left to right: KPCA, OKPCA with rotation and with translation invariance; all with Gaussian kernel.
\n\n\f532 \n\nS. Mika, G. Riitsch, J. Weston, B. SchOlkopf, A. J. Smola and K.-R. Muller \n\nsome cases even superior to the other state-of-the-art algorithms tested. Interest(cid:173)\ningly, both SVM and KFD construct a hyperplane in :F which is in some sense \noptimal. In many cases, the one given by the solution w of KFD is superior to \nthe one of SVMs. Encouraged by the preliminary results for digit recognition, we \nbelieve that the reported results can be improved, by incorporating different invari(cid:173)\nances and using e.g. local kernels [12]. \nFuture research will focus on further improvements on the algorithmic complexity \nof our new algorithms, which is so far larger than the one of the SVM algorithm, \nand on the connection between KFD and Support Vector Machines (cf. [16, 15]) . \n\nAcknowledgments This work was partially supported by grants of the DFG (JA \n379/5-2,7-1,9-1) and the EC STORM project number 25387 and carried out while \nBS and AS were with GMD First. \n\nReferences \n[1) C.M. Bishop. Neural Networks for Pattern Recognition. Oxford Univ. Press, 1995. \n[2] B. Boser, 1. Guyon, and V.N. Vapnik. A training algorithm for optimal margin \n\nclassifiers. In D. Haussler, editor, Proc. COLT, pages 144- 152. ACM Press, 1992. \n\n[3) K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. Wiley, New \n\nYork,1996. \n\n[4] B.Q. Fang and A.P. Dawid. Comparison of full bayes and bayes-least squares criteria \nfor normal discrimination. Chinese Journal of Applied Probability and Statistics, \n12:401- 410, 1996. \n\n[5] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning \n\nand an application to boosting. In EuroCOLT 94. LNCS, 1994. \n\n[6] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical \n\nAssociation, 84(405):165- 175, 1989. \n\n[7] K Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San \n\nDiego, 2nd edition, 1990. \n\n[8) S. 
Mika, G. Ratsch, J. Weston, B. Scholkopf, and K-R. Muller. Fisher discriminant \nanalysis with kernels. In Y.-H. Hu, J . Larsen, E. Wilson, and S. Douglas, editors, \nNeural Networks for Signal Processing IX, pages 41-48. IEEE, 1999. \n\n[9] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. \n\nNeural Computation, 1(2):281-294, 1989. \n\n[10] G. Ratsch, T. Onoda, and K-R. Muller. Soft margins for adaboost. Technical Report \n\nNC-TR-1998-021, Royal Holloway College, University of London, UK, 1998. \n\n[11] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific \n\n& Technical, Harlow, England, 1988. \n\n[12] B. Scholkopf. Support vector learning. Oldenbourg Verlag, 1997. \n[13) B. Scholkopf, C.J.C. Burges, and A.J. Smola, editors. Advances in Kernel Methods -\n\nSupport Vector Learning. MIT Press, 1999. \n\n[14] B. Scholkopf, A.J. Smola, and K-R. Muller. Nonlinear component analysis as a kernel \n\neigenvalue problem. Neural Computation, 10:1299-1319, 1998. \n\n[15] A. Shashua. On the relationship between the support vector machine for classification \nand sparsified fisher's linear discriminant. Neural Processing Letters, 9(2):129- 139, \nApril 1999. \n\n[16) S. Tong and D. Koller. Bayes optimal hyperplanes --+ maximal margin hy-\nSubmitted to IJCA1'99 WorkshOp on Support Vector Machines \n\nperplanes. \n(robotics. stanford. edurkoller/), 1999. \n\n\f", "award": [], "sourceid": 1715, "authors": [{"given_name": "Sebastian", "family_name": "Mika", "institution": null}, {"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}]}