{"title": "Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers", "book": "Advances in Neural Information Processing Systems", "page_first": 913, "page_last": 920, "abstract": null, "full_text": "Feature Selection and Classi\ufb02cation on\nMatrix Data: From Large Margins To\n\nSmall Covering Numbers\n\nSepp Hochreiter and Klaus Obermayer\n\nDepartment of Electrical Engineering and Computer Science\n\nTechnische Universit\u02dcat Berlin\n\n10587 Berlin, Germany\n\nfhochreit,obyg@cs.tu-berlin.de\n\nAbstract\n\nWe investigate the problem of learning a classi\ufb02cation task for\ndatasets which are described by matrices. Rows and columns of\nthese matrices correspond to objects, where row and column ob-\njects may belong to di\ufb01erent sets, and the entries in the matrix\nexpress the relationships between them. We interpret the matrix el-\nements as being produced by an unknown kernel which operates on\nobject pairs and we show that - under mild assumptions - these ker-\nnels correspond to dot products in some (unknown) feature space.\nMinimizing a bound for the generalization error of a linear classi-\n\ufb02er which has been obtained using covering numbers we derive an\nobjective function for model selection according to the principle of\nstructural risk minimization. The new objective function has the\nadvantage that it allows the analysis of matrices which are not pos-\nitive de\ufb02nite, and not even symmetric or square. We then consider\nthe case that row objects are interpreted as features. We suggest an\nadditional constraint, which imposes sparseness on the row objects\nand show, that the method can then be used for feature selection.\nFinally, we apply this method to data obtained from DNA microar-\nrays, where \\column\" objects correspond to samples, \\row\" objects\ncorrespond to genes and matrix elements correspond to expression\nlevels. 
Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results.

1 Introduction

Many properties of sets of objects can be described by matrices, whose rows and columns correspond to objects and whose elements describe the relationship between them. One typical case is so-called pairwise data, where rows as well as columns of the matrix represent the objects of the dataset (Fig. 1a) and where the entries of the matrix denote similarity values which express the relationships between objects.

[Figure 1 shows two example matrices: (a) a 12 x 12 pairwise similarity matrix and (b) a 12 x 7 feature-vector matrix; the numerical entries are omitted here.]

Figure 1: Two typical examples of matrix data (see text). (a) Pairwise data. Row (A-L) and column (A-L) objects coincide. (b) Feature vectors. Column objects (A-G) differ from row objects (a-l). The latter are interpreted as features.

Another typical case occurs if objects are described by a set of features (Fig. 1b). In this case, the column objects are the objects to be characterized, the row objects correspond to their features, and the matrix elements denote the strength with which a feature is expressed in a particular object.

In the following we consider the task of learning a classification problem on matrix data. We consider the case that class labels are assigned to the column objects of the training set. Given the matrix and the class labels, we then want to construct a classifier with good generalization properties.
From all the possible choices we select classifiers from the support vector machine (SVM) family [1, 2] and we use the principle of structural risk minimization [15] for model selection - because of its recent success [11] and its theoretical properties [15].

Previous work on large margin classifiers for datasets, where objects are described by feature vectors and where SVMs operate on the column vectors of the matrix, is abundant. However, there is one serious problem which arises when the number of features becomes large and comparable to the number of objects: Without feature selection, SVMs are prone to overfitting, despite the complexity regularization which is implicit in the learning method [3]. Rather than being sparse in the number of support vectors, the classifier should be sparse in the number of features used for classification. This relates to the result [15] that the number of features provides an upper bound on the number of "essential" support vectors.

Previous work on large margin classifiers for datasets, where objects are described by their mutual similarities, was centered around the idea that the matrix of similarities can be interpreted as a Gram matrix (see e.g. Hochreiter & Obermayer [7]). Work along this line, however, was so far restricted to the case (i) that the Gram matrix is positive definite (although methods have been suggested to modify indefinite Gram matrices in order to restore positive definiteness [10]) and (ii) that row and column objects are from the same set (pairwise data) [7].

In this contribution we extend the Gram matrix approach to matrix data, where row and column objects belong to different sets. Since we can no longer expect that the matrices are positive definite (or even square), a new objective function must be derived.
This is done in the next section, where an algorithm for the construction of linear classifiers is derived using the principle of structural risk minimization. Section 3 is concerned with the question under what conditions matrix elements can indeed be interpreted as vector products in some feature space. The method is specialized to pairwise data in Section 4. A sparseness constraint for feature selection is introduced in Section 5. Section 6, finally, contains an evaluation of the new method for DNA microarray data as well as benchmark results with standard classifiers which are based on standard feature selection procedures.

2 Large Margin Classifiers for Matrix Data

In the following we consider two sets X and Z of objects, which are described by feature vectors x and z. Based on the feature vectors x we construct a linear classifier defined through the classification function

$f(x) = \langle w, x \rangle + b,$   (1)

where $\langle \cdot, \cdot \rangle$ denotes a dot product. The zero isoline of f is a hyperplane which is parameterized by its unit normal vector $\hat{w}$ and by its perpendicular distance $b / \|w\|_2$ from the origin. The hyperplane's margin $\gamma$ with respect to X is given by

$\gamma = \min_{x \in X} \left| \langle \hat{w}, x \rangle + b / \|w\|_2 \right|.$   (2)

Setting $\gamma = \|w\|_2^{-1}$ allows us to treat normal vectors w which are not normalized, if the margin is normalized to 1. According to [15] this is called the "canonical form" of the separating hyperplane.
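The relation between the margin (2) and the canonical form can be checked numerically. A minimal sketch, assuming a hypothetical 2-D point set and hyperplane (numpy only):

```python
import numpy as np

# Hypothetical toy points (as columns) and a separating hyperplane <w, x> + b = 0.
X = np.array([[2.0, 3.0, -2.0, -3.0],
              [1.0, 0.0, -1.0,  0.0]])
w = np.array([1.0, 0.5])
b = 0.0

# Margin gamma per eq. (2): distance of the closest point to the hyperplane.
w_hat = w / np.linalg.norm(w)
gamma = np.min(np.abs(w_hat @ X + b / np.linalg.norm(w)))

# Canonical form: rescale (w, b) so that min_i |<w, x_i> + b| = 1;
# the margin then equals 1 / ||w||_2.
scale = np.min(np.abs(w @ X + b))
w_c, b_c = w / scale, b / scale
assert np.isclose(gamma, 1.0 / np.linalg.norm(w_c))
```

Rescaling (w, b) by a common factor leaves the hyperplane, and hence the margin, unchanged; the canonical normalization just fixes that free factor.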
The hyperplane with largest margin is then obtained by minimizing $\|w\|_2^2$ for a margin which equals 1.

It has been shown [14, 13, 12] that the generalization error of a linear classifier, eq. (1), can be bounded from above with probability $1 - \delta$ by the bound B,

$B(L, a/\gamma, \delta) = \frac{2}{L} \left( \log_2 \mathcal{EN}\!\left( \frac{\gamma}{2a}, F, 2L \right) + \log_2\!\left( \frac{4 L a}{\delta \gamma} \right) \right),$   (3)

provided that the training classification error is zero and f(x) is bounded by $-a \le f(x) \le a$ for all x drawn iid from the (unknown) distribution of objects. L denotes the number of training objects x, $\gamma$ denotes the margin, and $\mathcal{EN}(\epsilon, F, L)$ the expected $\epsilon$-covering number of a class F of functions that map data objects from T to [0, 1] (see Theorem 7.7 in [14] and Proposition 19 in [12]). In order to obtain a classifier with good generalization properties we suggest to minimize $a/\gamma$ under proper constraints. a is not known in general, however, because the probability distribution of objects (in particular its support) is not known. In order to avoid this problem we approximate a by the range $m = 0.5\,(\max_i \langle \hat{w}, x^i \rangle - \min_i \langle \hat{w}, x^i \rangle)$ of values in the training set and minimize the quantity $B(L, m/\gamma, \delta)$ instead of eq. (3).

Let $X := (x^1, x^2, \ldots, x^L)$ be the matrix of feature vectors of L objects from the set X and $Z := (z^1, z^2, \ldots, z^P)$ be the matrix of feature vectors of P objects from the set Z. The objects of set X are labeled, and we summarize all labels using a label matrix $Y$: $[Y]_{ij} := y_i \delta_{ij} \in \mathbb{R}^{L \times L}$, where $\delta$ is the Kronecker delta. Let us consider the case that the feature vectors X and Z are unknown, but that we are given the matrix $K := X^T Z$ of the corresponding scalar products. The training set is then given by the data matrix K and the corresponding label matrix Y.
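The quantities just introduced can be assembled in a few lines. A sketch with hypothetical small feature matrices (in practice X and Z are unknown and only K and Y are observed):

```python
import numpy as np

# Hypothetical feature vectors: L = 4 column objects and P = 3 row objects,
# both living in a d = 2 dimensional feature space.
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [0.5, 1.0, -0.5, -1.0]])          # d x L
Z = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0,  1.0]])                # d x P
y = np.array([1.0, 1.0, -1.0, -1.0])            # labels of the column objects

K = X.T @ Z          # L x P matrix of scalar products <x^i, z^j>
Y = np.diag(y)       # label matrix [Y]_ij = y_i * delta_ij
```

Note that K is 4 x 3 here: nothing in the construction requires K to be square, symmetric, or positive definite.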
The principle of structural risk minimization is implemented by minimizing an upper bound on $(m/\gamma)^2$ given by $\|X^T w\|_2^2$, as can be seen from $m/\gamma \le \|w\|_2 \max_i |\langle \hat{w}, x^i \rangle| \le \sqrt{\sum_i \langle w, x^i \rangle^2} = \|X^T w\|_2$. The constraints $f(x^i) = y_i$ imposed by the training set are taken into account using the expressions $1 - \xi_i^+ \le y_i (\langle w, x^i \rangle + b) \le 1 + \xi_i^-$, where $\xi_i^+, \xi_i^- \ge 0$ are slack variables which should also be minimized. We thus obtain the optimization problem

$\min_{w, b, \xi^+, \xi^-} \; \frac{1}{2} \|X^T w\|_2^2 + M^+ \mathbf{1}^T \xi^+ + M^- \mathbf{1}^T \xi^-$   (4)

s.t. $Y^{-1} (X^T w + b \mathbf{1}) - \mathbf{1} + \xi^+ \ge 0$,
     $Y^{-1} (X^T w + b \mathbf{1}) - \mathbf{1} - \xi^- \le 0$,
     $\xi^+, \xi^- \ge 0$.

$M^+$ penalizes wrong classification and $M^-$ absolute values exceeding 1. For classification $M^-$ may be set to zero. Note that the quadratic expression in the objective function is convex, which follows from $\|X^T w\|_2^2 = w^T X X^T w$ and the fact that $X X^T$ is positive semidefinite.

Let $\tilde{\alpha}^+, \tilde{\alpha}^-$ be the dual variables for the constraints imposed by the training set, $\tilde{\alpha} := \tilde{\alpha}^+ - \tilde{\alpha}^-$, and $\alpha$ a vector with $\tilde{\alpha} = Y^{-1} (X^T Z)\, \alpha$. Two cases must be treated: $\alpha$ is not unique or does not exist. First, if $\alpha$ is not unique we choose $\alpha$ according to Section 5. Second, if $\alpha$ does not exist we set $\alpha = (Z^T X Y^{-T} Y^{-1} X^T Z)^{-1} Z^T X Y^{-T} \tilde{\alpha}$, where $Y^{-T} Y^{-1}$ is the identity.
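For illustration, the classifier can be fitted by handing the dual of problem (4) (eq. (5) below) to a generic solver. This is a sketch, not the authors' implementation: scipy's SLSQP stands in for a proper QP solver, the pairwise toy data are hypothetical, and b is set to the label mean, i.e. the $(\mathbf{1}^T Y \mathbf{1})/(\mathbf{1}^T \mathbf{1})$ expression from the KKT discussion below.

```python
import numpy as np
from scipy.optimize import minimize

def fit_matrix_classifier(K, y, C=10.0):
    """Sketch of the dual problem (eq. (5)):
       min_a 0.5 a^T K^T K a - 1^T Y K a  s.t.  1^T K a = 0, |a_i| <= C."""
    L, P = K.shape
    H = K.T @ K                        # P x P quadratic term K^T K
    c = (y[:, None] * K).sum(axis=0)   # row vector 1^T Y K, with Y = diag(y)
    res = minimize(lambda a: 0.5 * a @ H @ a - c @ a,
                   np.zeros(P),
                   jac=lambda a: H @ a - c,
                   method="SLSQP",
                   bounds=[(-C, C)] * P,
                   constraints=[{"type": "eq",
                                 "fun": lambda a: K.sum(axis=0) @ a}])
    alpha = res.x
    b = y.mean()                       # (1^T Y 1)/(1^T 1), valid for balanced toy data
    return alpha, b

# Hypothetical pairwise toy case: row and column objects coincide, K = X^T X.
X = np.array([[2.0, 3.0, 2.5, -2.0, -3.0, -2.0],
              [0.0, 1.0, 0.5,  0.0, -1.0,  1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
K = X.T @ X
alpha, b = fit_matrix_classifier(K, y)
pred = np.sign(K @ alpha + b)          # eq. (6) evaluated on the training objects
assert np.array_equal(pred, y)
```

Evaluating `K @ alpha + b` is exactly the expansion $f(u) = \sum_i \alpha_i \langle z^i, u \rangle + b$ of eq. (6), since the rows of K hold the scalar products with the row objects.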
The optimality conditions require that the following derivatives of the Lagrangian $\mathcal{L}$ are zero: $\partial \mathcal{L} / \partial b = \mathbf{1}^T Y^{-1} \tilde{\alpha}$, $\partial \mathcal{L} / \partial w = X X^T w - X Y^{-1} \tilde{\alpha}$, $\partial \mathcal{L} / \partial \xi^{\pm} = M^{\pm} \mathbf{1} - \tilde{\alpha}^{\pm} + \mu^{\pm}$, where $\mu^+, \mu^- \ge 0$ are the Lagrange multipliers for the slack variables. We obtain $Z^T X X^T (w - Z \alpha) = 0$, which is ensured by $w = Z \alpha$, $0 = \mathbf{1}^T (X^T Z)\, \alpha$, $\tilde{\alpha}_i \le M^+$, and $-\tilde{\alpha}_i \le M^-$. The Karush-Kuhn-Tucker conditions give $b = (\mathbf{1}^T Y \mathbf{1}) / (\mathbf{1}^T \mathbf{1})$ if $\tilde{\alpha}_i < M^+$ and $-\tilde{\alpha}_i < M^-$.

In the following we set $M^+ = M^- = M$ and $C := M \, \|Y^{-1} (X^T Z)\|_{\mathrm{row}}^{-1}$, so that $\|\alpha\|_\infty \le C$ implies $\|\tilde{\alpha}\|_\infty \le \|Y^{-1} (X^T Z)\|_{\mathrm{row}} \|\alpha\|_\infty \le M$, where $\|\cdot\|_{\mathrm{row}}$ is the row-sum norm. We then obtain the following dual problem of eq. (4):

$\min_{\alpha} \; \frac{1}{2} \alpha^T K^T K \alpha - \mathbf{1}^T Y K \alpha$   (5)
subject to $\mathbf{1}^T K \alpha = 0$, $|\alpha_i| \le C$.

If $M^+ \ne M^-$ we must add another constraint. For $M^- = 0$, for example, we have to add $Y K (\alpha^+ - \alpha^-) \ge 0$. If a classifier has been selected according to eq. (5), a new example u is classified according to the sign of

$f(u) = \langle w, u \rangle + b = \sum_{i=1}^{P} \alpha_i \langle z^i, u \rangle + b.$   (6)

The optimal classifier is selected by optimizing eq. (5), and as long as a = m holds true for all possible objects x (which are assumed to be drawn iid), the generalization error is bounded by eq. (3). If outliers are rejected, the condition a = m can always be enforced. For large training sets the number of rejections is small: The probability $P\{|\langle w, x \rangle| > m\}$ that an outlier occurs can be bounded with confidence $1 - \delta$ using the additive Chernoff bounds (e.g.
[15]):

$P\{|\langle w, x \rangle| > m\} \le \sqrt{\frac{-\log \delta}{2L}}.$   (7)

But note that not all outliers are misclassified, and the trivial bound on the generalization error is still of the order $L^{-1}$.

3 Kernel Functions, Measurements and Scalar Products

In the last section we have assumed that the matrix K is derived from scalar products between the feature vectors x and z which describe the objects from the sets X and Z. For all practical purposes, however, the only information available is summarized in the matrices K and Y. The feature vectors are not known, and it is even unclear whether they exist. In order to apply the results of Section 2 to practical problems the following question remains to be answered: What are the conditions under which the measurement operator $k(\cdot, z)$ can indeed be interpreted as a scalar product between feature vectors and under which the matrix K can be interpreted as a matrix of kernel evaluations?

In order to answer these questions, we make use of the following theorems. Let $L_2(H)$ denote the set of functions h from H with $\int h^2(x)\, dx < \infty$ and $\ell_2$ the set of infinite vectors $(a_1, a_2, \ldots)$ for which $\sum_i a_i^2$ converges.

Theorem 1 (Singular Value Expansion) Let $H_1$ and $H_2$ be Hilbert spaces. Let $\alpha$ be from $L_2(H_1)$ and let k be a kernel from $L_2(H_2, H_1)$ which defines a Hilbert-Schmidt operator $T_k : H_1 \to H_2$,

$(T_k \alpha)(x) = f(x) = \int k(x, z)\, \alpha(z)\, dz.$   (8)

Then there exists an expansion $k(x, z) = \sum_n s_n e_n(z) g_n(x)$ which converges in the $L_2$-sense. The $s_n \ge 0$ are the singular values of $T_k$, and $e_n \in H_1$, $g_n \in H_2$ are the corresponding orthonormal functions.

Corollary 1 (Linear Classification in $\ell_2$) Let the assumptions of Theorem 1 hold and let $\int_{H_1} (k(x, z))^2\, dz \le K^2$ for all x. Let $\langle \cdot, \cdot \rangle_{H_1}$ be the dot product in $H_1$. We define $w := (\langle \alpha, e_1 \rangle_{H_1}, \langle \alpha, e_2 \rangle_{H_1}, \ldots)$ and $\phi(x) := (s_1 g_1(x), s_2 g_2(x), \ldots)$.

Then the following holds true:

- $w, \phi(x) \in \ell_2$, where $\|w\|_{\ell_2}^2 = \|\alpha\|_{H_1}^2$, and
- $\|f\|_{H_2}^2 = \langle T_k^* T_k \alpha, \alpha \rangle_{H_1}$, where $T_k^*$ is the adjoint operator of $T_k$,

and the following sum converges absolutely and uniformly:

$f(x) = \langle w, \phi(x) \rangle_{\ell_2} = \sum_n s_n \langle \alpha, e_n \rangle_{H_1} g_n(x).$   (9)

Eq. (9) is a linear classifier in $\ell_2$. $\phi$ maps vectors from $H_2$ into the feature space. We define a second mapping from $H_1$ to the feature space by $\omega(z) := (e_1(z), e_2(z), \ldots)$. For $\alpha = \sum_{i=1}^{P} \alpha_i \delta(z^i)$, where $\delta(z^i)$ is the Dirac delta, we recover the discrete classifier (6) and $w = \sum_{i=1}^{P} \alpha_i \omega(z^i)$. We observe that $\|f\|_{H_2}^2 = \alpha^T K^T K \alpha = \|X^T w\|_2^2$. A problem may arise if $z^i$ belongs to a set of measure zero which does not obey the singular value decomposition of k. If this occurs, $\delta(z^i)$ may be set to the zero function.

Theorem 1 tells us that any measurement kernel k applied to objects x and z can be expressed for almost all x and z as $k(x, z) = \langle \phi(x), \omega(z) \rangle$, where $\langle \cdot, \cdot \rangle$ defines a dot product in some feature space. Hence, we can define a matrix $X := (\phi(x^1), \phi(x^2), \ldots, \phi(x^L))$ of feature vectors for the L column objects and a matrix $Z := (\omega(z^1), \omega(z^2), \ldots, \omega(z^P))$ of feature vectors for the P row objects and apply the results of Section 2.

4 Pairwise Data

An interesting special case occurs if row and column objects coincide. This kind of data is known as pairwise data [5, 4, 8], where the objects to be classified serve as features and vice versa. Like in Section 3 we can expand the measurement kernel via singular value decomposition, but that would introduce two different mappings ($\phi$ and $\omega$) into the feature space.
We will use one map for row and column objects and perform an eigenvalue decomposition. The consequence is that eigenvalues may be negative (see the following theorem).

Theorem 2 (Eigenvalue Expansion) Let definitions and assumptions be as in Theorem 1. Let $H_1 = H_2 = H$ and let k be symmetric. Then there exists an expansion $k(x, z) = \sum_n \nu_n e_n(z) e_n(x)$ which converges in the $L_2$-sense. The $\nu_n$ are the eigenvalues of $T_k$ with the corresponding orthonormal eigenfunctions $e_n$.

Corollary 2 (Minkowski Space Classification) Let the assumptions of Theorem 2 and $\int_H (k(x, z))^2\, dz \le K^2$ for all x hold true. We define $w := (\sqrt{|\nu_1|} \langle \alpha, e_1 \rangle_H, \sqrt{|\nu_2|} \langle \alpha, e_2 \rangle_H, \ldots)$, $\phi(x) := (\sqrt{|\nu_1|} e_1(x), \sqrt{|\nu_2|} e_2(x), \ldots)$, and $\ell_2^S$ to denote $\ell_2$ with a given signature $S = (\mathrm{sign}(\nu_1), \mathrm{sign}(\nu_2), \ldots)$.

Then the following holds true:

- $\|w\|_{\ell_2^S}^2 = \sum_n \mathrm{sign}(\nu_n) \left( \sqrt{|\nu_n|} \langle \alpha, e_n \rangle_H \right)^2 = \sum_n \nu_n \langle \alpha, e_n \rangle_H^2 = \langle T_k \alpha, \alpha \rangle_H$,
- $\|\phi(x)\|_{\ell_2^S}^2 = \sum_n \nu_n e_n(x)^2 = k(x, x)$ in the $L_2$ sense,

and the following sum converges absolutely and uniformly:

$f(x) = \langle w, \phi(x) \rangle_{\ell_2^S} = \sum_n \nu_n \langle \alpha, e_n \rangle_H e_n(x).$   (10)

Eq. (10) is a linear classifier in the Minkowski space $\ell_2^S$. For the discrete case $\alpha = \sum_{i=1}^{P} \alpha_i \delta(z^i)$, the normal vector is $w = \sum_{i=1}^{P} \alpha_i \phi(z^i)$. In comparison to Corollary 1, we have $\|w\|_{\ell_2^S}^2 = \alpha^T K \alpha$ and must assume that $\|\phi(x)\|_{\ell_2^S}^2$ converges. Unfortunately, this can be assured in general only for almost all x. If k is both continuous and positive definite and if H is compact, then the sum converges uniformly and absolutely for all x (Mercer).

5 Sparseness and Feature Selection

As mentioned in the text after optimization problem (4), $\alpha$ may not be unique and an additional regularization term is needed. We choose the regularization term such that it enforces sparseness and such that it can also be used for feature selection. We choose $\epsilon \|\alpha\|_1$, where $\epsilon$ is the regularization parameter. We separate $\alpha$ into a positive part $\alpha^+$ and a negative part $\alpha^-$ with $\alpha = \alpha^+ - \alpha^-$ and $\alpha_i^+, \alpha_i^- \ge 0$ [11]. The dual optimization problem is then given by

$\min_{\alpha} \; \frac{1}{2} (\alpha^+ - \alpha^-)^T K^T K (\alpha^+ - \alpha^-) - \mathbf{1}^T Y K (\alpha^+ - \alpha^-) + \epsilon \mathbf{1}^T (\alpha^+ + \alpha^-)$   (11)
s.t. $\mathbf{1}^T K (\alpha^+ - \alpha^-) = 0$, $C \mathbf{1} \ge \alpha^+, \alpha^- \ge 0$.

If $\alpha$ is sparse, i.e. if many $\alpha_i = \alpha_i^+ - \alpha_i^-$ are zero, the classification function $f(u) = \langle w, u \rangle + b = \sum_{i=1}^{P} (\alpha_i^+ - \alpha_i^-) \langle z^i, u \rangle + b$ contains only few terms. This saves on the number of measurements $\langle z^i, u \rangle$ for new objects and leads to improved classification performance due to the reduced number of features $z^i$ [15].

6 Application to DNA Microarray Data

We apply our new method to the DNA microarray data published in [9]. Column objects are samples from different brain tumors of the medulloblastoma kind. The samples were obtained from 60 patients, who were treated in a similar way, and the samples were labeled according to whether a patient responded well to chemo- or radiation therapy. Row objects correspond to genes.
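The feature-selection effect of the sparse dual (11) can be illustrated on a miniature, hypothetical "sample x gene" matrix in which only the first two genes carry the labels. This is a sketch only: scipy's SLSQP stands in for a QP solver, and all names and data are illustrative. Increasing the l1 weight eps shrinks the expansion and drives the coefficients of uninformative genes towards zero.

```python
import numpy as np
from scipy.optimize import minimize

def sparse_fit(K, y, eps, C=10.0):
    """Sketch of the sparse dual (11): split alpha = ap - am with ap, am >= 0
       and add the l1 penalty eps * 1^T (ap + am)."""
    L, P = K.shape
    H = K.T @ K
    c = (y[:, None] * K).sum(axis=0)      # 1^T Y K, with Y = diag(y)

    def obj(v):
        a = v[:P] - v[P:]                 # alpha = alpha^+ - alpha^-
        return 0.5 * a @ H @ a - c @ a + eps * v.sum()

    res = minimize(obj, np.zeros(2 * P), method="SLSQP",
                   bounds=[(0.0, C)] * (2 * P),
                   constraints=[{"type": "eq",
                                 "fun": lambda v: K.sum(axis=0) @ (v[:P] - v[P:])}])
    return res.x[:P] - res.x[P:]

# Hypothetical sample x gene matrix: only genes 0 and 1 carry the labels.
rng = np.random.default_rng(1)
L, P = 20, 10
y = np.array([1.0, -1.0] * (L // 2))
K = rng.normal(size=(L, P))
K[:, 0] = 2.0 * y + 0.1 * rng.normal(size=L)
K[:, 1] = -1.5 * y + 0.1 * rng.normal(size=L)

alpha_dense = sparse_fit(K, y, eps=0.0)
alpha_sparse = sparse_fit(K, y, eps=5.0)
selected = np.flatnonzero(np.abs(alpha_sparse) > 1e-3)   # surviving "genes"

# The informative gene 0 survives, and the l1 term shrinks the expansion.
assert np.abs(alpha_sparse[0]) > 1e-3
assert np.abs(alpha_sparse).sum() < np.abs(alpha_dense).sum()
```

The surviving indices in `selected` play the role of the selected genes in the two-step procedure described below: a large eps in step one prunes the feature set, after which the classifier is refitted on the reduced matrix.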
Transcriptions of 7,129 genes were tagged with fluorescent dyes and used as a probe in a binding assay. For every sample-gene pair, the fluorescence of the bound transcripts - a snapshot of the level of gene expression - was measured. This gave rise to a 60 x 7,129 real-valued sample-gene matrix where each entry represents the level of gene expression in the corresponding sample. For more details see [9].

The task is now to construct a classifier which predicts therapy outcome on the basis of samples taken from new patients. The major problem of this classification task is the limited number of samples - given the large number of genes. Therefore, feature selection is a prerequisite for good generalization [6, 16]. We construct the classifier using a two-step procedure. In a first step, we apply our new method on a 59 x 7,129 matrix, where one column object was withheld to avoid biased feature selection. We choose $\epsilon$ to be fairly large in order to obtain a sparse set of features. In a second step, we use the selected features only and apply our method once more on the reduced sample-gene matrix, but now with a small value of $\epsilon$. The C-parameter is used for regularization instead.

Feature Selection / Classification   C      # F        # E
P-SVM / C-SVM                        1.0    40/45/50   5/4/5
P-SVM / C-SVM                        0.01   40/45/50   5/5/5
P-SVM / P-SVM                        0.1    40/45/50   4/4/5

Feature Selection / Classification   # F    # E
TrkC                                 1      20
statistic / SVM                      8      15
statistic / Comb1                    8      14
statistic / KNN                      8      13
statistic / Comb2                    8      12

Table 1: Benchmark results for DNA microarray data (for explanations see text). The table shows the classification error given by the number of wrong classifications ("E") for different numbers of selected features ("F") and for different values of the parameter C.
The feature selection method is the signal-to-noise statistic or t-statistic, denoted by "statistic", or our method P-SVM. Data are provided for "TrkC"-gene classification, standard SVMs, weighted "TrkC"/SVM (Comb1), K-nearest neighbor (KNN), combined SVM/TrkC/KNN (Comb2), and our procedure (P-SVM) used for classification. Except for our method (P-SVM), results were taken from [9].

Table 1 shows the result of a leave-one-out cross-validation procedure, where the classification error is given for different numbers of selected features. Our method (P-SVM) is compared with "TrkC"-gene classification (one-gene classification), standard SVMs, weighted "TrkC"/SVM classification, K-nearest neighbor (KNN), and a combined SVM/TrkC/KNN classifier. For the latter methods, feature selection was based on the correlation of features with classes using signal-to-noise statistics and t-statistics [3]. For our method we use C = 1.0 and 0.1 <= $\epsilon$ <= 1.5 for feature selection in step one, which gave rise to 10 - 1000 selected features. The feature selection procedure (also a classifier) had its lowest misclassification rate between 20 and 40 features. For the construction of the classifier we used $\epsilon$ = 0.01 in step two. Our feature selection method clearly outperforms standard methods: the number of misclassifications is down by a factor of 3 (for 45 selected genes).

Acknowledgments

We thank the anonymous reviewers for their hints to improve the paper. This work was funded by the DFG (SFB 618).

References

[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, Pittsburgh, PA, 1992.

[2] C. Cortes and V. N. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[3] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, 1999.

[4] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. In NIPS 11, pages 438-444, 1999.

[5] T. Graepel, R. Herbrich, B. Schölkopf, A. J. Smola, P. L. Bartlett, K.-R. Müller, K. Obermayer, and R. C. Williamson. Classification on proximity data with LP-machines. In ICANN 99, pages 304-309, 1999.

[6] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Mach. Learn., 46:389-422, 2002.

[7] S. Hochreiter and K. Obermayer. Classification of pairwise proximity data with support vectors. In The Learning Workshop. Y. LeCun and Y. Bengio, 2002.

[8] T. Hofmann and J. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Trans. on Pat. Analysis and Mach. Intell., 19(1):1-14, 1997.

[9] S. L. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. H. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870):436-442, 2002.

[10] V. Roth, J. Buhmann, and J. Laub. Pairwise clustering is equivalent to classical k-means. In The Learning Workshop. Y. LeCun and Y. Bengio, 2002.

[11] B. Schölkopf and A. J. Smola. Learning with Kernels - Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, 2002.

[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A framework for structural risk minimisation. In Comp. Learn. Th., pages 68-76, 1996.

[13] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44:1926-1940, 1998.

[14] J. Shawe-Taylor and N. Cristianini. On the generalisation of soft margin algorithms. Technical Report NC2-TR-2000-082, NeuroCOLT2, Department of Computer Science, Royal Holloway, University of London, 2000.

[15] V. Vapnik. The Nature of Statistical Learning Theory. Springer, NY, 1995.

[16] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In NIPS 12, pages 668-674, 2000.
", "award": [], "sourceid": 2321, "authors": [{"given_name": "Sepp", "family_name": "Hochreiter", "institution": null}, {"given_name": "Klaus", "family_name": "Obermayer", "institution": null}]}