{"title": "PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data", "book": "Advances in Neural Information Processing Systems", "page_first": 881, "page_last": 888, "abstract": null, "full_text": "PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

Mario Marchand
IFT-GLO, Université Laval
Sainte-Foy (QC) Canada, G1K-7P4
Mario.Marchand@ift.ulaval.ca

Mohak Shah
SITE, University of Ottawa
Ottawa, Ont. Canada, K1N-6N5
mshah@site.uottawa.ca

Abstract

We propose a "soft greedy" learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets.

1 Introduction

An important challenge in the classification of high-dimensional data is to design a learning algorithm that constructs an accurate classifier depending on the smallest possible number of attributes. For example, when classifying gene-expression data from DNA micro-arrays, if one can find a classifier that depends on a small number of genes and that accurately predicts whether a DNA micro-array sample originates from cancer tissue or normal tissue, then there is hope that the genes used by the classifier play a crucial role in the development of cancer and may be relevant for future therapies.

The standard methods used for classifying high-dimensional data are often characterized as either "filters" or "wrappers".
A filter is an algorithm used to "filter out" irrelevant attributes before running a base learning algorithm, such as the support vector machine (SVM), that was not designed to perform well in the presence of many irrelevant attributes. A wrapper, on the other hand, is used in conjunction with the base learning algorithm: typically it recursively removes the attributes that received a small "weight" in the classifier obtained from the base learner. The recursive feature elimination method is an example of a wrapper; it was used by Guyon et al. (2002) in conjunction with the SVM for classification of micro-array data. For the same task, Furey et al. (2000) used a filter which ranks the attributes (gene expressions) as a function of the difference between the positive-example mean and the negative-example mean. Both filters and wrappers have sometimes produced good empirical results, but they are not theoretically justified. What we really need is a learning algorithm with provably good guarantees in the presence of many irrelevant attributes. One of the first learning algorithms proposed by the COLT community has such a guarantee for the class of conjunctions: if there exists a conjunction that depends on r out of the n input attributes and that correctly classifies a training set of m examples, then the greedy covering algorithm of Haussler (1988) will find a conjunction of at most r ln m attributes that makes no training errors. Note the absence of dependence on the number n of input attributes.
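The greedy covering idea can be sketched in a few lines; this is our own illustrative code, not Haussler's (the function names and the restriction to non-negated Boolean literals are our simplifications). At each step we keep only the literals consistent with every positive example and pick the one that rejects the most remaining negative examples.

```python
# Hypothetical sketch of greedy covering for Boolean conjunctions.
# Names and the non-negated-literal restriction are our assumptions.

def greedy_conjunction(positives, negatives, n):
    """positives/negatives: lists of n-bit tuples. Returns the sorted list of
    attribute indices whose literals x_i form the learned conjunction."""
    # Only literals true on every positive example may enter the conjunction.
    candidates = {i for i in range(n) if all(x[i] == 1 for x in positives)}
    chosen, remaining = [], list(negatives)
    while remaining and candidates:
        # Greedy step: pick the literal that rejects the most remaining negatives.
        best = max(candidates, key=lambda i: sum(x[i] == 0 for x in remaining))
        chosen.append(best)
        candidates.discard(best)
        remaining = [x for x in remaining if x[best] == 1]
    return sorted(chosen)

def predict(chosen, x):
    # Conjunction output: 1 iff every chosen literal is satisfied.
    return int(all(x[i] == 1 for i in chosen))
```

On a separable sample, the loop terminates with zero training errors and, by the covering argument, at most r ln m chosen literals.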
In contrast, the mistake bound of the Winnow algorithm (Littlestone, 1988) has a logarithmic dependence on n, and Winnow builds a classifier on all n attributes.

Motivated by this theoretical result, and by the fact that simple conjunctions of gene-expression levels seem an interesting learning bias for the classification of DNA micro-arrays, we propose a "soft greedy" learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the proposed soft greedy algorithm on four DNA micro-array data sets.

2 Definitions

The input space X consists of all n-dimensional vectors x = (x_1, ..., x_n) where each real-valued component x_i \in [A_i, B_i] for i = 1, ..., n. Hence, A_i and B_i are, respectively, the a priori lower and upper bounds on the values of x_i. The output space Y is the set of classification labels that can be assigned to any input vector x \in X. We focus here on binary classification problems, so Y = {0, 1}. Each example z = (x, y) is an input vector x with its classification label y \in Y. In the probably approximately correct (PAC) setting, we assume that each example z is generated independently according to the same (but unknown) distribution D. The (true) risk R(f) of a classifier f : X \to Y is defined to be the probability that f misclassifies z on a random draw according to D:

    R(f) \stackrel{\mathrm{def}}{=} \Pr_{(x,y)\sim D}(f(x) \neq y) = \mathbf{E}_{(x,y)\sim D}\, I(f(x) \neq y)

where I(a) = 1 if predicate a is true and 0 otherwise. Given a training set S = (z_1, ..., z_m) of m examples, the task of a learning algorithm is to construct a classifier with the smallest possible risk without any information about D. To achieve this goal, the learner can compute the empirical risk R_S(f) of any given classifier f according to:

    R_S(f) \stackrel{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^{m} I(f(x_i) \neq y_i) \stackrel{\mathrm{def}}{=} \mathbf{E}_{(x,y)\sim S}\, I(f(x) \neq y)

We focus on learning algorithms that construct a conjunction of rays from a training set. Each ray is just a threshold classifier defined on a single attribute (component) x_i. More formally, a ray is identified by an attribute index i \in {1, ..., n}, a threshold value t \in [A_i, B_i], and a direction d \in {-1, +1} (which specifies whether class 1 lies on the largest or the smallest values of x_i). Given any input example x, the output r^i_{td}(x) of a ray is defined as:

    r^i_{td}(x) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } (x_i - t)d > 0 \\ 0 & \text{if } (x_i - t)d \leq 0 \end{cases}

To specify a conjunction of rays we first need to list all the attributes whose ray is present in the conjunction. For this purpose, we use a vector i \stackrel{\mathrm{def}}{=} (i_1, ..., i_{|i|}) of attribute indices i_j \in {1, ..., n} such that i_1 < i_2 < ... < i_{|i|}, where |i| is the number of indices present in i (and thus the number of rays in the conjunction)^1. To complete the specification of a conjunction of rays, we need a vector t = (t_{i_1}, t_{i_2}, ..., t_{i_{|i|}}) of threshold values and a vector d = (d_{i_1}, d_{i_2}, ..., d_{i_{|i|}}) of directions, where i_j \in {1, ..., n} for j \in {1, ..., |i|}. On any input example x, the output C^i_{td}(x) of a conjunction of rays is given by:

    C^i_{td}(x) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } r^j_{t_j d_j}(x) = 1 \ \forall j \in i \\ 0 & \text{if } \exists j \in i : r^j_{t_j d_j}(x) = 0 \end{cases}

Finally, any algorithm that builds a conjunction can be used to build a disjunction just by exchanging the roles of the positive and negative labelled examples.
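The ray and conjunction-of-rays definitions translate directly into code; the minimal sketch below uses our own names and is given only for concreteness.

```python
# Sketch of a ray r^i_{t,d} and a conjunction of rays (our names, not the
# authors' code): a ray outputs 1 iff (x_i - t) * d > 0, and the conjunction
# outputs 1 only when every chosen ray outputs 1.

def ray(x, i, t, d):
    """Threshold classifier on attribute i with threshold t and direction d."""
    return 1 if (x[i] - t) * d > 0 else 0

def conjunction_of_rays(x, rays):
    """rays: list of (i, t, d) triples with distinct attribute indices."""
    return int(all(ray(x, i, t, d) == 1 for (i, t, d) in rays))
```

For example, with rays = [(0, 0.5, +1), (2, 1.0, -1)], an input is classified 1 only if its attribute 0 exceeds 0.5 and its attribute 2 lies below 1.0.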
Due to lack of space, we describe here only the case of a conjunction.

3 A PAC-Bayes Risk Bound

The PAC-Bayes approach, initiated by McAllester (1999), aims at providing PAC guarantees to "Bayesian" learning algorithms. These algorithms are specified in terms of a prior distribution P over a space of classifiers, which characterizes our prior belief about good classifiers (before observation of the data), and a posterior distribution Q (over the same space of classifiers), which takes into account the additional information provided by the training data. A remarkable result that came out of this line of research, known as the "PAC-Bayes theorem", provides a tight upper bound on the risk of a stochastic classifier called the Gibbs classifier. Given an input example x, the label G_Q(x) assigned to x by the Gibbs classifier is defined by the following process: we first choose a classifier h according to the posterior distribution Q and then use h to assign the label h(x) to x. The risk of G_Q is defined as the expected risk of classifiers drawn according to Q:

    R(G_Q) \stackrel{\mathrm{def}}{=} \mathbf{E}_{h\sim Q} R(h) = \mathbf{E}_{h\sim Q} \mathbf{E}_{(x,y)\sim D}\, I(h(x) \neq y)

The PAC-Bayes theorem was first proposed by McAllester (2003). The version presented here is due to Seeger (2002) and Langford (2003).

Theorem 1 Given any space H of classifiers, for any data-independent prior distribution P over H and for any (possibly data-dependent) posterior distribution Q over H, with probability at least 1 - \delta over the random draws of training sets S of m examples:

    kl(R_S(G_Q) \,\|\, R(G_Q)) \leq \frac{KL(Q\|P) + \ln\frac{m+1}{\delta}}{m}

where KL(Q\|P) is the Kullback-Leibler divergence between distributions^2 Q and P:

    KL(Q\|P) \stackrel{\mathrm{def}}{=} \mathbf{E}_{h\sim Q} \ln\frac{Q(h)}{P(h)}

and where kl(q\|p) is the Kullback-Leibler divergence between the Bernoulli distributions with probabilities of success q and p:

    kl(q\|p) \stackrel{\mathrm{def}}{=} q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p} \qquad \text{for } q < p

^1 Although it is possible to use up to two rays on any attribute, we limit ourselves here to the case where each attribute can be used for only one ray.

^2 Here Q(h) denotes the probability density function associated with Q, evaluated at h.

The bound given by the PAC-Bayes theorem for the risk of Gibbs classifiers can be turned into a bound for the risk of Bayes classifiers in the following way. Given a posterior distribution Q, the Bayes classifier B_Q performs a majority vote (under measure Q) of binary classifiers in H. When B_Q misclassifies an example x, at least half of the binary classifiers (under measure Q) misclassify x. It follows that the error rate of G_Q is at least half the error rate of B_Q. Hence R(B_Q) \leq 2R(G_Q).

In our case, we have seen that ray conjunctions are specified in terms of a mixture of discrete parameters i and d and continuous parameters t.
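The bound of Theorem 1 is implicit in R(G_Q): in practice one inverts the binomial KL divergence numerically. The bisection routine below is a standard sketch (our code, not the authors'): given the empirical Gibbs risk q and the right-hand side of the theorem, the risk bound is the largest p with kl(q||p) below that value.

```python
# Sketch of evaluating the Theorem 1 bound by KL inversion (our code).
import math

def kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12  # clamp away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, rhs, tol=1e-9):
    """sup { p in [q, 1) : kl(q||p) <= rhs }, found by bisection
    (kl(q||p) is increasing in p for p >= q)."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl(q, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_bound(emp_risk, kl_qp, m, delta):
    """Upper bound on R(G_Q) from Theorem 1, given KL(Q||P) = kl_qp."""
    rhs = (kl_qp + math.log((m + 1) / delta)) / m
    return kl_inverse(emp_risk, rhs)
```

As expected from the theorem, the bound tightens as m grows and loosens as KL(Q||P) grows.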
If we denote by P_{i,d}(t) the probability density function associated with a prior P over the class of ray conjunctions, we consider here priors of the form:

    P_{i,d}(t) = \binom{n}{|i|}^{-1} p(|i|)\, \frac{1}{2^{|i|}} \prod_{j\in i} \frac{1}{B_j - A_j}\,; \qquad \forall t_j \in [A_j, B_j]

If I denotes the set of all 2^n possible attribute index vectors and D_i denotes the set of all 2^{|i|} binary direction vectors d of dimension |i|, we have that:

    \sum_{i\in I} \sum_{d\in D_i} \prod_{j\in i} \int_{A_j}^{B_j} dt_j\, P_{i,d}(t) = 1

whenever \sum_{e=0}^{n} p(e) = 1.

The reasons motivating this choice of prior are the following. The first two factors come from the belief that the final classifier, constructed from the group of attributes specified by i, should depend only on the number |i| of attributes in this group. If we are completely ignorant about the number of rays the final classifier is likely to have, we should choose p(e) = 1/(n+1) for e \in {0, 1, ..., n}. However, we should choose a p that decreases with e if we have reasons to believe that the number of rays of the final classifier will be much smaller than n. The third factor of P_{i,d}(t) gives equal prior probability to each of the two possible values of direction d_j. Finally, for each ray, every possible threshold value t should have the same prior probability of being chosen if we have no prior knowledge favoring some values over others. Since each attribute value x_i is constrained, a priori, to be in [A_i, B_i], we have chosen a uniform probability density on [A_i, B_i] for each t_i such that i \in i. This explains the last factors of P_{i,d}(t).

Given a training set S, the learner will choose an attribute group i and a direction vector d. For each attribute x_i \in [A_i, B_i] : i \in i, a margin interval [a_i, b_i] \subseteq [A_i, B_i] will also be chosen by the learner.
A deterministic ray-conjunction classifier is then specified by choosing the threshold values t_i \in [a_i, b_i]. It is tempting at this point to choose t_i = (a_i + b_i)/2 \forall i \in i (i.e., the middle of each interval). However, we will see shortly that the PAC-Bayes theorem offers a better guarantee for another type of deterministic classifier.

The Gibbs classifier is defined with a posterior distribution Q having all its weight on the same i and d as chosen by the learner, but where each t_i is chosen uniformly in [a_i, b_i]. The KL divergence between this posterior Q and the prior P is then given by:

    KL(Q\|P) = \prod_{j\in i} \int_{a_j}^{b_j} \frac{dt_j}{b_j - a_j} \ln\left(\frac{\prod_{i\in i}(b_i - a_i)^{-1}}{P_{i,d}(t)}\right)
             = \ln\binom{n}{|i|} + \ln\frac{1}{p(|i|)} + |i|\ln 2 + \sum_{i\in i} \ln\frac{B_i - A_i}{b_i - a_i}

Hence, we see that the KL divergence between the "continuous components" of Q and P (given by the last term) vanishes when [a_i, b_i] = [A_i, B_i] \forall i \in i. Furthermore, the KL divergence between the "discrete components" of Q and P is small for small values of |i| (whenever p(|i|) is not too small). Hence, this KL divergence between our choices for Q and P exhibits a tradeoff between margins (large values of b_i - a_i) and sparsity (small values of |i|) for Gibbs classifiers.
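The closed form for KL(Q||P) is easy to evaluate numerically. The sketch below is our own code; it assumes the uniform choice p(e) = 1/(n+1) discussed above, which is an illustrative default rather than the only option.

```python
# Numerical sketch of KL(Q||P) = ln C(n,|i|) + ln(1/p(|i|)) + |i| ln 2
#                                + sum_i ln((B_i - A_i)/(b_i - a_i)).
# Our code; p(e) = 1/(n+1) is an assumed default.
import math

def kl_q_p(n, intervals, full_ranges, p_of_e=None):
    """intervals: {attr_index: (a_i, b_i)} chosen margin intervals;
    full_ranges: {attr_index: (A_i, B_i)} a-priori attribute ranges."""
    k = len(intervals)
    if p_of_e is None:
        p_of_e = lambda e: 1.0 / (n + 1)  # complete ignorance about |i|
    kl = math.log(math.comb(n, k)) + math.log(1.0 / p_of_e(k)) + k * math.log(2)
    for i, (a, b) in intervals.items():
        A, B = full_ranges[i]
        # Margin term: vanishes when [a_i, b_i] = [A_i, B_i].
        kl += math.log((B - A) / (b - a))
    return kl
```

Shrinking a margin interval [a_i, b_i] strictly increases KL(Q||P), which is the margins-sparsity tradeoff in numeric form.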
According to Theorem 1, the Gibbs classifier with the smallest guarantee on the risk R(G_Q) should minimize a non-trivial combination of KL(Q\|P) (the margins-sparsity tradeoff) and the empirical risk R_S(G_Q).

Since the posterior Q is identified by an attribute group vector i, a direction vector d, and intervals [a_i, b_i] \forall i \in i, we will refer to the Gibbs classifier G_Q as G^{id}_{ab}, where a and b are the vectors formed by the a_i's and b_i's respectively. We can obtain a closed-form expression for R_S(G^{id}_{ab}) by first considering the risk R_{(x,y)}(G^{id}_{ab}) on a single example (x, y), since R_S(G^{id}_{ab}) = \mathbf{E}_{(x,y)\sim S} R_{(x,y)}(G^{id}_{ab}). From our definition of Q, we find that:

    R_{(x,y)}(G^{id}_{ab}) = (1 - 2y) \prod_{i\in i} \sigma^{d_i}_{a_i,b_i}(x_i) + y

where we have used the following piecewise-linear functions:

    \sigma^{+}_{a,b}(x) \stackrel{\mathrm{def}}{=} \begin{cases} 0 & \text{if } x < a \\ \frac{x-a}{b-a} & \text{if } a \leq x \leq b \\ 1 & \text{if } b < x \end{cases} \qquad (1)

    \sigma^{-}_{a,b}(x) \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } x < a \\ \frac{b-x}{b-a} & \text{if } a \leq x \leq b \\ 0 & \text{if } b < x \end{cases} \qquad (2)

Hence we notice that R_{(x,1)}(G^{id}_{ab}) = 1 (and R_{(x,0)}(G^{id}_{ab}) = 0) whenever there exists i \in i : \sigma^{d_i}_{a_i,b_i}(x_i) = 0. This occurs iff there exists a ray which outputs 0 on x. We can also verify that the expression for R_{(x,y)}(C^i_{td}) is identical to the expression for R_{(x,y)}(G^{id}_{ab}) except that the piecewise-linear functions \sigma^{d_i}_{a_i,b_i}(x_i) are replaced by the indicator functions I((x_i - t_i)d_i > 0).

The PAC-Bayes theorem provides a risk bound for the Gibbs classifier G^{id}_{ab}. Since the Bayes classifier B^{id}_{ab} just performs a majority vote under the same posterior distribution as the one used by G^{id}_{ab}, we have that B^{id}_{ab}(x) = 1 iff the probability that G^{id}_{ab} classifies x as positive exceeds 1/2. Hence, it follows that:

    B^{id}_{ab}(x) = \begin{cases} 1 & \text{if } \prod_{i\in i} \sigma^{d_i}_{a_i,b_i}(x_i) > 1/2 \\ 0 & \text{if } \prod_{i\in i} \sigma^{d_i}_{a_i,b_i}(x_i) \leq 1/2 \end{cases} \qquad (3)

Note that B^{id}_{ab} has a hyperbolic decision surface. Consequently, B^{id}_{ab} is not representable as a conjunction of rays. There is, however, no computational difficulty in obtaining the output B^{id}_{ab}(x) for any x \in X.

From the relation between B^{id}_{ab} and G^{id}_{ab}, it also follows that R_{(x,y)}(B^{id}_{ab}) \leq 2R_{(x,y)}(G^{id}_{ab}) for any (x, y). Consequently, R(B^{id}_{ab}) \leq 2R(G^{id}_{ab}). Hence, we have our main theorem:

Theorem 2 Given all our previous definitions, for any \delta \in (0, 1], and for any p satisfying \sum_{e=0}^{n} p(e) = 1, we have:

    \Pr_{S\sim D^m}\Bigl( \forall i, d, a, b :\ R(G^{id}_{ab}) \leq \sup\Bigl\{ \epsilon : kl(R_S(G^{id}_{ab}) \,\|\, \epsilon) \leq \frac{1}{m}\Bigl[ \ln\binom{n}{|i|} + \ln\frac{1}{p(|i|)} + |i|\ln 2 + \sum_{i\in i} \ln\frac{B_i - A_i}{b_i - a_i} + \ln\frac{m+1}{\delta} \Bigr] \Bigr\} \Bigr) \geq 1 - \delta

Furthermore: R(B^{id}_{ab}) \leq 2R(G^{id}_{ab}) \forall i, d, a, b.

4 A Soft Greedy Learning Algorithm

Theorem 2 suggests that the learner should try to find the Bayes classifier B^{id}_{ab} that uses a small number of attributes (i.e., a small |i|), each with a large separating margin (b_i - a_i), while keeping the empirical Gibbs risk R_S(G^{id}_{ab}) at a low value. To achieve this goal, we have adapted the greedy algorithm for the set covering machine (SCM) proposed by Marchand and Shawe-Taylor (2002). It consists of choosing the feature (here a ray) i with the largest utility U_i, where:

    U_i = |Q_i| - p|R_i|

where Q_i is the set of negative examples covered (classified as 0) by feature i, R_i is the set of positive examples
misclassified by this feature, and p is a learning parameter that gives a penalty p for each misclassified positive example. Once the feature with the largest U_i is found, we remove Q_i and R_i from the training set S and then repeat (on the remaining examples) until either no more negative examples are present or a maximum number s of features has been reached.

In our case, however, we need to keep the Gibbs risk on S low instead of the risk of a deterministic classifier. Since the Gibbs risk is a "soft measure" that uses the piecewise-linear functions \sigma^{d}_{a,b} instead of the "hard" indicator functions, we need a "softer" version of the utility function U_i. Indeed, a negative example that falls in the linear region of a \sigma^{d}_{a,b} is in fact partly covered. Following this observation, let k be the vector of indices of the attributes that we have used so far in the construction of the classifier. Let us first define the covering value C(G^{kd}_{ab}) of G^{kd}_{ab} as the "amount" of negative examples assigned to class 0 by G^{kd}_{ab}:

    C(G^{kd}_{ab}) \stackrel{\mathrm{def}}{=} \sum_{(x,y)\in S} (1 - y) \Bigl[ 1 - \prod_{j\in k} \sigma^{d_j}_{a_j,b_j}(x_j) \Bigr]

We also define the positive-side error E(G^{kd}_{ab}) of G^{kd}_{ab} as the "amount" of positive examples assigned to class 0:

    E(G^{kd}_{ab}) \stackrel{\mathrm{def}}{=} \sum_{(x,y)\in S} y \Bigl[ 1 - \prod_{j\in k} \sigma^{d_j}_{a_j,b_j}(x_j) \Bigr]

We now want to add another ray on another attribute, call it i, to obtain a new vector k' containing this new attribute in addition to those present in k.
Hence, we now introduce the covering contribution of ray i as:

    C^{kd}_{ab}(i) \stackrel{\mathrm{def}}{=} C(G^{k'd'}_{a'b'}) - C(G^{kd}_{ab}) = \sum_{(x,y)\in S} (1 - y) \Bigl[ 1 - \sigma^{d_i}_{a_i,b_i}(x_i) \Bigr] \prod_{j\in k} \sigma^{d_j}_{a_j,b_j}(x_j)

and the positive-side error contribution of ray i as:

    E^{kd}_{ab}(i) \stackrel{\mathrm{def}}{=} E(G^{k'd'}_{a'b'}) - E(G^{kd}_{ab}) = \sum_{(x,y)\in S} y \Bigl[ 1 - \sigma^{d_i}_{a_i,b_i}(x_i) \Bigr] \prod_{j\in k} \sigma^{d_j}_{a_j,b_j}(x_j)

Typically, the covering contribution of ray i should increase its "utility" and its positive-side error should decrease it. Moreover, we want to decrease the "utility" of ray i by an amount that becomes large whenever it has a small separating margin. Our expression for KL(Q\|P) suggests that this amount should be proportional to \ln((B_i - A_i)/(b_i - a_i)). Furthermore, we should compare this margin term with the fraction of the remaining negative examples that ray i has covered (instead of the absolute amount of negative examples covered). Hence the covering contribution C^{kd}_{ab}(i) of ray i should be divided by the amount N^{kd}_{ab} of negative examples that remain to be covered before considering ray i:

    N^{kd}_{ab} \stackrel{\mathrm{def}}{=} \sum_{(x,y)\in S} (1 - y) \prod_{j\in k} \sigma^{d_j}_{a_j,b_j}(x_j)

which is simply the amount of negative examples that have been assigned to class 1 by G^{kd}_{ab}. If P denotes the set of positive examples, we define the utility U^{kd}_{ab}(i) of adding ray i to G^{kd}_{ab} as:

    U^{kd}_{ab}(i) \stackrel{\mathrm{def}}{=} \frac{C^{kd}_{ab}(i)}{N^{kd}_{ab}} - p\,\frac{E^{kd}_{ab}(i)}{|P|} - \eta \ln\frac{B_i - A_i}{b_i - a_i}

where the parameter p represents the penalty for misclassifying a positive example and \eta is another parameter that controls the importance of having a large margin. These learning parameters can be chosen by cross-validation.
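The utility of a candidate ray can be computed with one pass over the sample. The sketch below is our own code (all names are ours); it evaluates the covering contribution, the positive-side error contribution, and the margin penalty for a single candidate, given the rays chosen so far.

```python
# Sketch of the soft utility of one candidate ray (our code, our names).
import math

def sigma(x, a, b, d):
    """Piecewise-linear sigma^+ (d = +1) rising from 0 to 1 over [a, b],
    or its mirror sigma^- (d = -1)."""
    if d == +1:
        if x < a: return 0.0
        if x > b: return 1.0
        return (x - a) / (b - a)
    else:
        if x < a: return 1.0
        if x > b: return 0.0
        return (b - x) / (b - a)

def utility(S, current_rays, cand, p, eta):
    """S: list of (x, y); current_rays: list of (i, a, b, d);
    cand: (i, a, b, d, A, B) where (A, B) is the a-priori range of x_i."""
    i, a, b, d, A, B = cand
    n_pos = sum(y for _, y in S)
    cover = err = n_neg_remaining = 0.0
    for x, y in S:
        prod = 1.0  # probability that the current rays all output 1 on x
        for (j, aj, bj, dj) in current_rays:
            prod *= sigma(x[j], aj, bj, dj)
        contrib = (1.0 - sigma(x[i], a, b, d)) * prod
        if y == 0:
            cover += contrib            # newly covered "amount" of negatives
            n_neg_remaining += prod     # negatives still classified 1
        else:
            err += contrib              # newly erred "amount" of positives
    return (cover / n_neg_remaining
            - p * err / n_pos
            - eta * math.log((B - A) / (b - a)))
```

The greedy loop would evaluate this utility for every candidate (attribute, interval, direction) triple and add the maximizer.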
For fixed values of these parameters, the "soft greedy" algorithm simply consists of adding, to the current Gibbs classifier, a ray with maximum added utility until either the maximum number s of rays has been reached or all the negative examples have been (totally) covered. It is understood that, during this soft greedy algorithm, we can remove an example (x, y) from S whenever it is totally covered. This occurs whenever \prod_{j\in k} \sigma^{d_j}_{a_j,b_j}(x_j) = 0.

5 Results for Classification of DNA Micro-Arrays

We have tested the soft greedy learning algorithm on the four DNA micro-array data sets shown in Table 1. The colon tumor data set (Alon et al., 1999) provides the expression levels of 40 tumor and 22 normal colon tissues measured for 6500 human genes. The ALL/AML data set (Golub et al., 1999) provides the expression levels of 7129 human genes for 47 samples of patients with acute lymphoblastic leukemia (ALL) and 25 samples of patients with acute myeloid leukemia (AML). The B MD and C MD data sets (Pomeroy et al., 2002) are micro-array samples containing the expression levels of 6817 human genes. Data set B contains 25 classic and 9 desmoplastic medulloblastomas, whereas data set C contains 39 medulloblastoma survivors and 21 treatment failures (non-survivors).

We have compared the soft greedy learning algorithm with a linear-kernel soft-margin SVM trained both on all the attributes (gene expressions) and on a subset of attributes chosen by the filter method of Golub et al. (1999). The filter consists of ranking the attributes as a function of the difference between the positive-example mean and the negative-example mean and then using only the first ℓ attributes. The resulting learning algorithm, named SVM+gs in Table 1, is basically the one used by Furey et al. (2000) for the same task. Guyon et al.
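The nested cross-validation protocol used below can be sketched generically; this is our own illustrative code (fold splitting and the learner interface are assumptions, not the authors' implementation). The outer folds estimate the test error; within each outer training set, an inner 5-fold CV selects the learning parameter.

```python
# Generic sketch of nested k-fold cross-validation (our code, our names).
from itertools import chain

def k_folds(n, k=5):
    """Deterministic round-robin split of indices 0..n-1 into k folds."""
    idx = list(range(n))
    return [idx[f::k] for f in range(k)]

def nested_cv_errors(X, y, train_and_count_errors, params, k=5):
    """train_and_count_errors(train_ids, test_ids, param) -> #test errors.
    Returns total outer-fold test errors with inner-CV-chosen parameters."""
    folds = k_folds(len(X), k)
    total = 0
    for f in range(k):
        test_ids = folds[f]
        train_ids = list(chain(*(folds[g] for g in range(k) if g != f)))
        # Inner CV: split the outer training set only.
        inner = [train_ids[j::k] for j in range(k)]
        def inner_err(p):
            s = 0
            for g in range(k):
                it = inner[g]
                tr = list(chain(*(inner[h] for h in range(k) if h != g)))
                s += train_and_count_errors(tr, it, p)
            return s
        best = min(params, key=inner_err)  # parameter chosen on training data only
        total += train_and_count_errors(train_ids, test_ids, best)
    return total
```

Because parameters are selected inside each outer training set, the reported error avoids the selection bias discussed by Ambroise and McLachlan (2002).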
(2002) claimed to obtain better results with the recursive feature elimination method but, as pointed out by Ambroise and McLachlan (2002), their work contained a methodological flaw and, consequently, the superiority of this wrapper method is questionable.

Each algorithm was tested with the 5-fold cross-validation (CV) method. Each of the five training sets and testing sets was the same for all algorithms. The learning parameters of all algorithms and the gene subsets (for SVM+gs) were chosen from the training sets only. This was done by performing a second (nested) 5-fold CV on each training set. For the gene subset selection procedure of SVM+gs, we considered the first ℓ = 2^i genes (for i = 0, 1, ..., 12) ranked according to the criterion of Golub et al. (1999) and chose the i value that gave the smallest 5-fold CV error on the training set.

Data Set              SVM     SVM+gs          Soft Greedy
Name      #exs   |   errs  |  size  errs  |  size  ratio   G-errs  B-errs  Bound
Colon      62    |    12   |   256    11  |    1   0.42      12       9      18
B MD       34    |    12   |    32     6  |    1   0.10       6       6      20
C MD       60    |    29   |  1024    21  |    3   0.077     24      22      40
ALL/AML    72    |    18   |    64    10  |    2   0.002     19      17      38

Table 1: DNA micro-array data sets and results.

For each algorithm, the "errs" columns of Table 1 contain the 5-fold CV error expressed as the sum of errors over the five testing sets, and the "size" columns contain the number of attributes used by the classifier, averaged over the five testing sets. The "G-errs" and "B-errs" columns refer to the Gibbs and Bayes error rates. The "ratio" column refers to the average value of (b_i - a_i)/(B_i - A_i) obtained for the rays used by the classifiers, and the "Bound" column refers to the average risk bound of Theorem 2 multiplied by the total number of examples.
We see that the gene selection filter generally improves the error rate of the SVM and that the Bayes error rate is slightly better than the Gibbs error rate. Finally, the error rates of Bayes and SVM+gs are competitive, but the number of genes selected by the soft greedy algorithm is always much smaller.

References

U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS USA, 96:6745–6750, 1999.

C. Ambroise and G. J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA, 99:6562–6566, 2002.

T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16:906–914, 2000.

T.R. Golub, D.K. Slonim, and Many More Authors. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.

D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177–221, 1988.

John Langford. Tutorial on practical prediction theory for classification. http://hunch.net/~jl/projects/prediction_bounds/tutorial/tutorial.ps, 2003.

N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

Mario Marchand and John Shawe-Taylor. The set covering machine.
Journal of Machine Learning Research, 3:723–746, 2002.

David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999.

David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003. A preliminary version appeared in the proceedings of COLT'99.

S. L. Pomeroy, P. Tamayo, and Many More Authors. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415:436–442, 2002.

Matthias Seeger. PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002.
", "award": [], "sourceid": 2593, "authors": [{"given_name": "Mario", "family_name": "Marchand", "institution": null}, {"given_name": "Mohak", "family_name": "Shah", "institution": null}]}