{"title": "The Decision List Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 945, "page_last": 952, "abstract": null, "full_text": "The Decision List Machine\n\nMarina Sokolova\nSITE, University of Ottawa\nOttawa, Ont. Canada, K1N-6N5\nsokolova@site.uottawa.ca\n\nMario Marchand\nSITE, University of Ottawa\nOttawa, Ont. Canada, K1N-6N5\nmarchand@site.uottawa.ca\n\nNathalie Japkowicz\nSITE, University of Ottawa\nOttawa, Ont. Canada, K1N-6N5\nnat@site.uottawa.ca\n\nJohn Shawe-Taylor\nRoyal Holloway, University of London\nEgham, UK, TW20-0EX\njst@cs.rhul.ac.uk\n\nAbstract\n\nWe introduce a new learning algorithm for decision lists to allow\nfeatures that are constructed from the data and to allow a trade-off\nbetween accuracy and complexity. We bound its generalization\nerror in terms of the number of errors and the size of the classifier\nit finds on the training data. We also compare its performance\non some natural data sets with the set covering machine and the\nsupport vector machine.\n\n1 Introduction\n\nThe set covering machine (SCM) has recently been proposed by Marchand and\nShawe-Taylor (2001, 2002) as an alternative to the support vector machine (SVM)\nwhen the objective is to obtain a sparse classifier with good generalization. Given\na feature space, the SCM tries to find the smallest conjunction (or disjunction) of\nfeatures that gives a small training error. In contrast, the SVM tries to find the\nmaximum soft-margin separating hyperplane on all the features. 
Hence, the two\nlearning machines are fundamentally different in what they are trying to achieve on\nthe training data.\n\nTo investigate if it is worthwhile to consider larger classes of functions than just the\nconjunctions and disjunctions that are used in the SCM, we focus here on the class\nof decision lists introduced by Rivest (1987) because this class strictly includes both\nconjunctions and disjunctions and is strictly included in the class of linear threshold\nfunctions (Marchand and Golea, 1993). Hence, we denote by decision list machine\n(DLM) any classifier which computes a decision list of Boolean-valued features,\nincluding features that are possibly constructed from the data. In this paper, we\nuse the set of features introduced by Marchand and Shawe-Taylor (2001, 2002)\nknown as data-dependent balls. By extending the sample compression technique\nof Littlestone and Warmuth (1986), we bound the generalization error of the DLM\nwith data-dependent balls in terms of the number of errors and the number of balls\nit achieves on the training data. We also show that the DLM with balls can provide\nbetter generalization than the SCM with this same set of features on some natural\ndata sets.\n\n2 The Decision List Machine\n\nLet $x$ denote an arbitrary $n$-dimensional vector of the input space $X$ which could be\narbitrary subsets of $\\mathbb{R}^n$. We consider binary classification problems for which the\ntraining set $S = P \\cup N$ consists of a set $P$ of positive training examples and a set $N$\nof negative training examples. We define a feature as an arbitrary Boolean-valued\nfunction that maps $X$ onto $\\{0, 1\\}$. Given any set $H = \\{h_i(x)\\}_{i=1}^{|H|}$ of features $h_i(x)$\nand any training set $S$, the learning algorithm returns a small subset $R \\subset H$ of\nfeatures. Given that subset $R$, and an arbitrary input vector $x$, the output $f(x)$ of\nthe Decision List Machine (DLM) is defined to be:\n\nIf ($h_1(x)$) then $b_1$\n\nElse If ($h_2(x)$) then $b_2$\n\n. . 
.\n\nElse If ($h_r(x)$) then $b_r$\n\nElse $b_{r+1}$\n\nwhere each $b_i \\in \\{0, 1\\}$ defines the output of $f(x)$ if and only if $h_i$ is the first feature\nto be satisfied on $x$ (i.e. the smallest $i$ for which $h_i(x) = 1$). The constant $b_{r+1}$\n(where $r = |R|$) is known as the default value. Note that $f$ computes a disjunction\nof the $h_i$s whenever $b_i = 1$ for $i = 1 \\ldots r$ and $b_{r+1} = 0$. To compute a conjunction\nof $h_i$s, we simply place in $f$ the negation of each $h_i$ with $b_i = 0$ for $i = 1 \\ldots r$ and\n$b_{r+1} = 1$. Note, however, that a DLM $f$ that contains one or many alternations\n(i.e. a pair $(b_i, b_{i+1})$ for which $b_i \\neq b_{i+1}$ for $i < r$) cannot be represented as a (pure)\nconjunction or disjunction of $h_i$s (and their negations). Hence, the class of decision\nlists strictly includes conjunctions and disjunctions.\n\nFrom this definition, it seems natural to use the following greedy algorithm for\nbuilding a DLM from a training set. For a given set $S' = P' \\cup N'$ of examples\n(where $P' \\subseteq P$ and $N' \\subseteq N$) and a given set $H$ of features, consider only the\nfeatures $h_i \\in H$ which make no errors on either $P'$ or $N'$. If $h_i$ makes no error with\n$P'$, let $Q_i$ be the subset of examples of $N'$ on which $h_i$ makes no errors. Otherwise,\nif $h_i$ makes no error with $N'$, let $Q_i$ be the subset of examples of $P'$ on which $h_i$\nmakes no errors. In both cases we say that $h_i$ is covering $Q_i$. The greedy algorithm\nstarts with $S' = S$ and an empty DLM. Then it finds the $h_i$ with the largest $|Q_i|$\nand appends this $h_i$ to the DLM. It then removes $Q_i$ from $S'$ and repeats to find\nthe $h_k$ with the largest $|Q_k|$ until either $P'$ or $N'$ is empty. It finally assigns $b_{r+1}$\nto the class label of the remaining non-empty set.\n\nFollowing Rivest (1987), this greedy algorithm is assured to build a DLM that\nmakes no training errors whenever there exists a DLM on a set $E \\subseteq H$ of features\nthat makes zero training errors. 
However, this constraint is not really required in\npractice since we do want to permit the user of a learning algorithm to control the\ntradeoff between the accuracy achieved on the training data and the complexity\n(here the size) of the classifier. Indeed, a small DLM which makes a few errors\non the training set might give better generalization than a larger DLM (with more\nfeatures) which makes zero training errors. One way to include this flexibility is to\nearly-stop the greedy algorithm when there remain a few more training examples\nto be covered. But a further reduction in the size of the DLM can be accomplished\nby considering features $h_i$ that do make a few errors on $P'$ (or $N'$) if many more\nexamples $Q_i \\subseteq N'$ (or $Q_i \\subseteq P'$) can be covered.\n\nAlgorithm BuildDLM($P, N, p_p, p_n, s, H$)\n\nInput: A set $P$ of positive examples, a set $N$ of negative examples, the penalty values\n$p_p$ and $p_n$, a stopping point $s$, and a set $H = \\{h_i(x)\\}_{i=1}^{|H|}$ of Boolean-valued features.\n\nOutput: A decision list $f$ consisting of a set $R = \\{(h_i, b_i)\\}_{i=1}^{r}$ of features $h_i$ with\ntheir corresponding output values $b_i$, and a default value $b_{r+1}$.\n\nInitialization: $R = \\emptyset$, $P' = P$, $N' = N$\n\n1. For each $h_i \\in H$, let $P_i$ and $N_i$ be respectively the subsets of $P'$ and $N'$\n   correctly classified by $h_i$. For each $h_i$ compute $U_i$, where:\n   $U_i \\stackrel{\\text{def}}{=} \\max \\{ |P_i| - p_n \\cdot |N' - N_i|, \\; |N_i| - p_p \\cdot |P' - P_i| \\}$\n2. Let $h_k$ be a feature with the largest value of $U_k$.\n3. If ($|P_k| - p_n \\cdot |N' - N_k| \\geq |N_k| - p_p \\cdot |P' - P_k|$) then $R = R \\cup \\{(h_k, 1)\\}$,\n   $P' = P' - P_k$, $N' = N_k$.\n4. If ($|P_k| - p_n \\cdot |N' - N_k| < |N_k| - p_p \\cdot |P' - P_k|$) then $R = R \\cup \\{(\\neg h_k, 0)\\}$,\n   $N' = N' - N_k$, $P' = P_k$.\n5. Let $r = |R|$. If ($r < s$ and $P' \\neq \\emptyset$ and $N' \\neq \\emptyset$) then go to step 1.\n6. Set $b_{r+1} = \\neg b_r$. Return $f$.\n\nFigure 1: The learning algorithm for the Decision List Machine\n\nHence, to include this flexibility in choosing the proper tradeoff between complexity\nand accuracy, we propose the following modification of the greedy algorithm. For\nevery feature $h_i$, let us denote by $P_i$ the subset of $P'$ on which $h_i$ makes no errors\nand by $N_i$ the subset of $N'$ on which $h_i$ makes no error. The above greedy algorithm\nis considering only features for which we have either $P_i = P'$ or $N_i = N'$, but to\nallow small deviation from these choices, we define the usefulness $U_i$ of feature $h_i$\nby\n\n$$U_i \\stackrel{\\text{def}}{=} \\max \\{ |P_i| - p_n \\cdot |N' - N_i|, \\; |N_i| - p_p \\cdot |P' - P_i| \\}$$\n\nwhere $p_n$ denotes the penalty of making an error on a negative example whereas $p_p$\ndenotes the penalty of making an error on a positive example.\n\nHence, each greedy step will be modified as follows. For a given set $S' = P' \\cup N'$,\nwe will select the feature $h_i$ with the largest value of $U_i$ and append this $h_i$ in the\nDLM. If $|P_i| - p_n \\cdot |N' - N_i| \\geq |N_i| - p_p \\cdot |P' - P_i|$, we will then remove from $S'$\nevery example in $P_i$ (since they are correctly classified by the current DLM) and\nwe will also remove from $S'$ every example in $N' - N_i$ (since a DLM with this\nfeature is already misclassifying $N' - N_i$, and, consequently, the training error of\nthe DLM will not increase if later features err on examples in $N' - N_i$). Otherwise\nif $|P_i| - p_n \\cdot |N' - N_i| < |N_i| - p_p \\cdot |P' - P_i|$, we will then remove from $S'$ examples in\n$N_i \\cup (P' - P_i)$. Hence, we recover the simple greedy algorithm when $p_p = p_n = \\infty$.\n\nThe formal description of our learning algorithm is presented in Figure 1. 
The\npenalty parameters $p_p$ and $p_n$ and the early stopping point $s$ are the model-selection\nparameters that give the user the ability to control the proper tradeoff between the\ntraining accuracy and the size of the DLM. Their values could be determined either\nby using k-fold cross-validation, or by computing our bound (see section 4) on\nthe generalization error. Our algorithm therefore generalizes the learning algorithm of Rivest\n(1987) by providing this complexity-accuracy tradeoff and by permitting the use of\nany kind of Boolean-valued features, including those that are constructed from the\ndata. Finally let us mention that Dhagat and Hellerstein (1994) did propose an\nalgorithm for learning decision lists of few relevant attributes but this algorithm is\nnot practical in the sense that it provides no tolerance to noise and does not easily\naccommodate parameters to provide a complexity-accuracy tradeoff.\n\n3 Data-Dependent Balls\n\nFor each training example $x_i$ with label $y_i \\in \\{0, 1\\}$ and (real-valued) radius $\\rho$, we\ndefine feature $h_{i,\\rho}$ to be the following data-dependent ball centered on $x_i$:\n\n$$h_{i,\\rho}(x) \\stackrel{\\text{def}}{=} h_\\rho(x, x_i) = \\begin{cases} y_i & \\text{if } d(x, x_i) \\leq \\rho \\\\\\\\ \\bar{y}_i & \\text{otherwise} \\end{cases}$$\n\nwhere $\\bar{y}_i$ denotes the Boolean complement of $y_i$ and $d(x, x')$ denotes the distance\nbetween $x$ and $x'$. Note that any metric can be used for $d$. So far, we have used\nonly the $L_1$, $L_2$ and $L_\\infty$ metrics but it is certainly worthwhile to try to use metrics\nthat actually incorporate some knowledge about the learning task. Moreover, we\ncould use metrics that are obtained from the definition of an inner product $k(x, x')$.\n\nGiven a set $S$ of $m$ training examples, our initial set of features consists, in principle,\nof $H = \\bigcup_{i \\in S} \\bigcup_{\\rho \\in [0, \\infty[} h_{i,\\rho}$. But obviously, for each training example $x_i$, we need\nonly to consider the set of $m - 1$ distances $\\{d(x_i, x_j)\\}_{j \\neq i}$. This reduces our initial\nset $H$ to $O(m^2)$ features. 
In fact, from the description of the DLM in the previous\nsection, it follows that the ball with the largest usefulness belongs to one of the\nfollowing types of balls: type Pi, Po, Ni, and No.\n\nBalls of type Pi (positive inside) are balls having a positive example $x$ for its center\nand a radius given by $\\rho = d(x, x') - \\epsilon$ for some negative example $x'$ (that we call a\nborder point) and very small positive number $\\epsilon$. Balls of type Po (positive outside)\nhave a negative example center $x$ and a radius $\\rho = d(x, x') + \\epsilon$ given by a negative\nborder $x'$. Balls of type Ni (negative inside) have a negative center $x$ and a radius\n$\\rho = d(x, x') - \\epsilon$ given by a positive border $x'$. Balls of type No (negative outside)\nhave a positive center $x$ and a radius $\\rho = d(x, x') + \\epsilon$ given by a positive border $x'$.\n\nThis proposed set of features, constructed from the training data, provides to the\nuser full control for choosing the proper tradeoff between training accuracy and\nfunction size.\n\n4 Bound on the Generalization Error\n\nNote that we cannot use the \"standard\" VC theory to bound the expected loss of\nDLMs with data-dependent features because the VC dimension is a property of a\nfunction class defined on some input domain without reference to the data. Hence,\nwe propose another approach.\n\nSince our learning algorithm tries to build a DLM with the smallest number of data-dependent\nballs, we seek a bound that depends on this number and, consequently,\non the number of examples that are used in the final classifier (the hypothesis).\nWe can thus think of our learning algorithm as compressing the training set into\na small subset of examples that we call the compression set. It was shown by Littlestone\nand Warmuth (1986) and Floyd and Warmuth (1995) that we can bound\nthe generalization error of the hypothesis $f$ if we can always reconstruct $f$ from\nthe compression set. 
Hence, the only requirement is the existence of such a reconstruction\nfunction and its only purpose is to permit the exact identification of the\nhypothesis from the compression set and, possibly, additional bits of information.\nNot surprisingly, the bound on the generalization error increases rapidly in terms\nof these additional bits of information. So we must make minimal usage of them.\n\nWe now describe our reconstruction function and the additional information that\nit needs to assure, in all cases, the proper reconstruction of the hypothesis from a\ncompression set. Our proposed scheme works in all cases provided that the learning\nalgorithm returns a hypothesis that always correctly classifies the compression set\n(but not necessarily all of the training set). Hence, we need to add this constraint\nin BuildDLM for our bound to be valid but, in practice, we have not seen any\nsignificant performance variation introduced by this constraint. We first describe\nthe simpler case where only balls of types Pi and Ni are permitted and, later,\ndescribe the additional requirements that are introduced when we also permit balls\nof types Po and No.\n\nGiven a compression set $\\Lambda$ (returned by the learning algorithm), we first partition it\ninto four disjoint subsets $C_p$, $C_n$, $B_p$, and $B_n$ consisting of positive ball centers, negative\nball centers, positive borders, and negative borders respectively. Each example\nin $\\Lambda$ is specified only once. When only balls of type Pi and Ni are permitted, the\ncenter of a ball cannot be the center of another ball since the center is removed from\nthe remaining examples to be covered when a ball is added to the DLM. But a center\ncan be the border of a previous ball in the DLM and a border can be the border of\nmore than one ball. Hence, points in $B_p \\cup B_n$ are examples that are borders without\nbeing the center of another ball. 
Because of the crucial importance of the ordering\nof the features in a decision list, these sets do not provide enough information by\nthemselves to be able to reconstruct the hypothesis. To specify the ordering of each\nball center it is sufficient to provide $\\log_2(r!)$ bits of additional information where the\nnumber $r$ of balls is given by $r = c_p + c_n$ for $c_p = |C_p|$ and $c_n = |C_n|$. To find the radius\n$\\rho_i$ for each center $x_i$ we start with $C'_p = C_p$, $C'_n = C_n$, $B'_p = B_p$, $B'_n = B_n$, and\ndo the following, sequentially from the first center to the last. If center $x_i \\in C'_p$, then\nthe radius is given by $\\rho_i = \\min_{x_j \\in C'_n \\cup B'_n} d(x_i, x_j) - \\epsilon$ and we remove center $x_i$ from\n$C'_p$ and any other point from $B'_p$ covered by this ball (to find the radius of the other\nballs). If center $x_i \\in C'_n$, then the radius is given by $\\rho_i = \\min_{x_j \\in C'_p \\cup B'_p} d(x_i, x_j) - \\epsilon$\nand we remove center $x_i$ from $C'_n$ and any other point from $B'_n$ covered by this\nball. The output $b_i$ for each ball $h_i$ is 1 if the center $x_i \\in C_p$ and 0 otherwise.\nThis reconstructed decision list of balls will be the same as the hypothesis if and\nonly if the compression set is always correctly classified by the learning algorithm.\nOnce we can identify the hypothesis from the compression set, we can bound its\ngeneralization error.\n\nTheorem 1 Let $S = P \\cup N$ be a training set of positive and negative examples\nof size $m = m_p + m_n$. Let $A$ be the learning algorithm BuildDLM that uses\ndata-dependent balls of type Pi and Ni for its set of features with the constraint\nthat the returned function $A(S)$ always correctly classifies every example in the\ncompression set. Suppose that $A(S)$ contains $r$ balls, and makes $k_p$ training errors\non $P$, $k_n$ training errors on $N$ (with $k = k_p + k_n$), and has a compression set\n$\\Lambda = C_p \\cup C_n \\cup B_p \\cup B_n$ (as defined above) of size $\\lambda = c_p + c_n + b_p + b_n$. 
With\nprobability $1 - \\delta$ over all random training sets $S$ of size $m$, the generalization error\n$er(A(S))$ of $A(S)$ is bounded by\n\n$$er(A(S)) \\leq 1 - \\exp\\left\\{ \\frac{-1}{m - \\lambda - k} \\left( \\ln B_\\lambda + \\ln(r!) + \\ln\\frac{1}{\\delta_\\lambda} \\right) \\right\\}$$\n\nwhere\n\n$$\\delta_\\lambda \\stackrel{\\text{def}}{=} \\left( \\frac{\\pi^2}{6} \\right)^{-6} \\cdot \\left( (c_p + 1)(c_n + 1)(b_p + 1)(b_n + 1)(k_p + 1)(k_n + 1) \\right)^{-2} \\cdot \\delta$$\n\nand where\n\n$$B_\\lambda \\stackrel{\\text{def}}{=} \\binom{m_p}{c_p} \\binom{m_p - c_p}{b_p} \\binom{m_n}{c_n} \\binom{m_n - c_n}{b_n} \\binom{m_p - c_p - b_p}{k_p} \\binom{m_n - c_n - b_n}{k_n}$$\n\nProof Let $X$ be the set of training sets of size $m$. Let us first bound the probability\n\n$$P_{\\mathbf{m}} \\stackrel{\\text{def}}{=} P\\{ S \\in X : er(A(S)) \\geq \\epsilon \\mid \\mathbf{m}(S) = \\mathbf{m} \\}$$\n\ngiven that $\\mathbf{m}(S)$ is fixed to some value $\\mathbf{m}$, where $\\mathbf{m} \\stackrel{\\text{def}}{=} (m, m_p, m_n, c_p, c_n, b_p, b_n, k_p, k_n)$. For this, denote by $E_p$ the subset\nof $P$ on which $A(S)$ makes an error and similarly for $E_n$. Let $I$ be the message of\n$\\log_2(r!)$ bits needed to specify the ordering of the balls (as described above). Now\ndefine $P'_{\\mathbf{m}}$ to be\n\n$$P'_{\\mathbf{m}} \\stackrel{\\text{def}}{=} P\\{ S \\in X : er(A(S)) \\geq \\epsilon \\mid C_p = S_1, C_n = S_2, B_p = S_3, B_n = S_4, E_p = S_5, E_n = S_6, I = I_0, \\mathbf{m}(S) = \\mathbf{m} \\}$$\n\nfor some fixed set of disjoint subsets $\\{S_i\\}_{i=1}^{6}$ of $S$ and some fixed information message\n$I_0$. Since $B_\\lambda$ is the number of different ways of choosing the different compression\nsubsets and set of error points in a training set of fixed $\\mathbf{m}$, we have:\n\n$$P_{\\mathbf{m}} \\leq (r!) \\cdot B_\\lambda \\cdot P'_{\\mathbf{m}}$$\n\nwhere the first factor comes from the additional information that is needed to specify\nthe ordering of $r$ balls. Note that the hypothesis $f \\stackrel{\\text{def}}{=} A(S)$ is fixed in $P'_{\\mathbf{m}}$ (because\nthe compression set is fixed and the required information bits are given). 
To bound\n$P'_{\\mathbf{m}}$, we make the standard assumption that each example $x$ is independently and\nidentically generated according to some fixed but unknown distribution. Let $p$\nbe the probability of obtaining a positive example, let $\\alpha$ be the probability that\nthe fixed hypothesis $f$ makes an error on a positive example, and let $\\beta$ be the\nprobability that $f$ makes an error on a negative example. Let $t_p \\stackrel{\\text{def}}{=} c_p + b_p + k_p$ and\nlet $t_n \\stackrel{\\text{def}}{=} c_n + b_n + k_n$. We then have:\n\n$$\\begin{aligned} P'_{\\mathbf{m}} &= (1 - \\alpha)^{m_p - t_p} (1 - \\beta)^{m - t_n - m_p} \\binom{m - t_n - t_p}{m_p - t_p} p^{m_p - t_p} (1 - p)^{m - t_n - m_p} \\\\\\\\ &\\leq \\sum_{m' = t_p}^{m - t_n} (1 - \\alpha)^{m' - t_p} (1 - \\beta)^{m - t_n - m'} \\binom{m - t_n - t_p}{m' - t_p} p^{m' - t_p} (1 - p)^{m - t_n - m'} \\\\\\\\ &= [(1 - \\alpha) p + (1 - \\beta)(1 - p)]^{m - t_n - t_p} = (1 - er(f))^{m - t_n - t_p} \\\\\\\\ &\\leq (1 - \\epsilon)^{m - t_n - t_p} \\end{aligned}$$\n\nConsequently:\n\n$$P_{\\mathbf{m}} \\leq (r!) \\cdot B_\\lambda \\cdot (1 - \\epsilon)^{m - t_n - t_p}.$$\n\nThe theorem is obtained by bounding this last expression by the proposed value for\n$\\delta_\\lambda(\\mathbf{m})$ and solving for $\\epsilon$ since, in that case, we satisfy the requirement that\n\n$$\\begin{aligned} P\\{ S \\in X : er(A(S)) \\geq \\epsilon \\} &= \\sum_{\\mathbf{m}} P_{\\mathbf{m}} \\, P\\{ S \\in X : \\mathbf{m}(S) = \\mathbf{m} \\} \\\\\\\\ &\\leq \\sum_{\\mathbf{m}} \\delta_\\lambda(\\mathbf{m}) \\, P\\{ S \\in X : \\mathbf{m}(S) = \\mathbf{m} \\} \\leq \\sum_{\\mathbf{m}} \\delta_\\lambda(\\mathbf{m}) = \\delta \\end{aligned}$$\n\nwhere the sums are over all possible realizations of $\\mathbf{m}$ for a fixed $m_p$ and $m_n$.\nWith the proposed value for $\\delta_\\lambda(\\mathbf{m})$, the last equality follows from the fact that\n$\\sum_{i=1}^{\\infty} (1/i^2) = \\pi^2/6$.\n\nThe use of balls of type Po and No introduces a few more difficulties that are\ntaken into account by sending more bits to the reconstruction function. First, the\ncenter of a ball of type Po and No can be used for more than one ball since the\ncovered examples are outside the ball. 
Hence, the number $r$ of balls can now exceed\n$c_p + c_n = c$. So, to specify $r$, we can send $\\log_2(\\lambda)$ bits. Then, for each ball,\nwe can send $\\log_2 c$ bits to specify which center this ball is using and another bit\nto specify if the examples covered are inside or outside the ball. Using the same\nnotation as before, the radius $\\rho_i$ of a center $x_i$ of a ball of type Po is given by\n$\\rho_i = \\max_{x_j \\in C'_n \\cup B'_n} d(x_i, x_j) + \\epsilon$, and for a center $x_i$ of a ball of type No, its radius is\ngiven by $\\rho_i = \\max_{x_j \\in C'_p \\cup B'_p} d(x_i, x_j) + \\epsilon$. With these modifications, the same proof\nof Theorem 1 can be used to obtain the next theorem.\n\nTheorem 2 Let $A$ be the learning algorithm BuildDLM that uses data-dependent\nballs of type Pi, Ni, Po, and No for its set of features. Consider all the definitions\nused for Theorem 1 with $c \\stackrel{\\text{def}}{=} c_p + c_n$. With probability $1 - \\delta$ over all random training\nsets $S$ of size $m$, we have\n\n$$er(A(S)) \\leq 1 - \\exp\\left\\{ \\frac{-1}{m - \\lambda - k} \\left( \\ln B_\\lambda + \\ln \\lambda + r \\ln(2c) + \\ln\\frac{1}{\\delta_\\lambda} \\right) \\right\\}$$\n\nBasically, our bound states that good generalization is expected when we can find a\nsmall DLM that makes few training errors. In principle, we could use it as a guide\nfor choosing the model selection parameters $s$, $p_p$, and $p_n$ since it depends only on\nwhat the hypothesis has achieved on the training data.\n\n5 Empirical Results on Natural data\n\nWe have compared the practical performance of the DLM with the support vector\nmachine (SVM) equipped with a Radial Basis Function kernel of variance $1/\\gamma$. The\ndata sets used and the results obtained are reported in Table 1. All these data\nsets were obtained from the machine learning repository at UCI. 
For each data\nset, we have removed all examples that contained attributes with unknown values\n(this has reduced substantially the \"votes\" data set) and we have removed examples\nwith contradictory labels (this occurred only for a few examples in the Haberman\ndata set). The remaining number of examples for each data set is reported in\nTable 1. No other preprocessing of the data (such as scaling) was performed. For\nall these data sets, we have used the 10-fold cross validation error as an estimate\nof the generalization error. The values reported are expressed as the total number\nof errors (i.e. the sum of errors over all testing sets). We have ensured that each\ntraining set and each testing set, used in the 10-fold cross validation process, was\nthe same for each learning machine (i.e. each machine was trained on the same\ntraining sets and tested on the same testing sets).\n\nThe results reported for the SVM are only those obtained for the best values of the\nkernel parameter $\\gamma$ and the soft margin parameter $C$ found among an exhaustive\nlist of many values. The values of these parameters are reported in Marchand and\nShawe-Taylor (2002). The \"size\" column refers to the average number of support\nvectors contained in SVM machines obtained from the 10 different training sets of\n10-fold cross-validation.\n\nWe have reported the results for the SCM (Marchand and Shawe-Taylor, 2002) and\nthe DLM when both machines are equipped with data-dependent balls under the\n$L_2$ metric. 
For the SCM, the T column refers to the type of the best machine found\n(c for conjunction, and d for disjunction), the p column refers to the best value found\nfor the penalty parameter, and the s column refers to the best stopping point in\nterms of the number of balls. The same definitions apply also for DLMs except\nthat two different penalty values ($p_p$ and $p_n$) are used. In the T column of the DLM\nresults, we have specified by s (simple) when the DLM was trained by using only\nballs of type Pi and Ni and by c (complex) when the four possible types of balls\nwere used (see section 3). Again, only the values that gave the smallest 10-fold\ncross-validation error are reported.\n\nData Set        SVM           SCM with balls     DLM with balls\nName      #exs  size  errors  T  p    s  errors  T  pp   pn   s   errors\nBreastW   683   58    19      c  1.8  2  15      c  2.1  1    2   14\nVotes     52    18    3       d  0.9  1  6       s  0.1  0.3  1   3\nPima      768   526   203     c  1.1  3  189     c  1.5  1.5  6   189\nHaberman  294   146   71      c  1.4  1  71      s  2    3    7   65\nBupa      345   266   107     d  2.8  9  106     c  2    2    4   108\nGlass     214   125   34      d  ∞    2  36      c  4.8  ∞    12  28\nCredit    653   423   190     d  1.2  4  194     c  ∞    ∞    11  197\n\nTable 1: Data sets and results for SVMs, SCMs, and DLMs.\n\nThe most striking feature in Table 1 is the level of sparsity achieved by the SCM and\nthe DLM in comparison with the SVM. This difference is huge. The other important\nfeature is that DLMs often provide slightly better generalization than SCMs and\nSVMs. Hence, DLMs can provide a good alternative to SCMs and SVMs.\n\nAcknowledgments\n\nWork supported by NSERC grant OGP0122405 and, in part, by the EU under the\nNeuroCOLT2 Working Group, No EP 27150.\n\nReferences\n\nAditi Dhagat and Lisa Hellerstein. PAC learning with irrelevant attributes. In\nProc. of the 35th Annual Symposium on Foundations of Computer Science, pages\n64-74. 
IEEE Computer Society Press, Los Alamitos, CA, 1994.\n\nSally Floyd and Manfred Warmuth. Sample compression, learnability, and the\nVapnik-Chervonenkis dimension. Machine Learning, 21(3):269-304, 1995.\n\nN. Littlestone and M. Warmuth. Relating data compression and learnability. Technical\nreport, University of California Santa Cruz, 1986.\n\nMario Marchand and Mostefa Golea. On learning simple neural concepts: from\nhalfspace intersections to neural decision lists. Network: Computation in Neural\nSystems, 4:67-85, 1993.\n\nMario Marchand and John Shawe-Taylor. Learning with the set covering machine.\nProceedings of the Eighteenth International Conference on Machine Learning\n(ICML 2001), pages 345-352, 2001.\n\nMario Marchand and John Shawe-Taylor. The set covering machine. Journal of\nMachine Learning Research (to appear), 2002.\n\nRonald L. Rivest. Learning decision lists. Machine Learning, 2:229-246, 1987.\n", "award": [], "sourceid": 2235, "authors": [{"given_name": "Marina", "family_name": "Sokolova", "institution": null}, {"given_name": "Mario", "family_name": "Marchand", "institution": null}, {"given_name": "Nathalie", "family_name": "Japkowicz", "institution": null}, {"given_name": "John", "family_name": "Shawe-taylor", "institution": null}]}