{"title": "The Decision List Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 945, "page_last": 952, "abstract": null, "full_text": "The Decision List Machine\n\nMarina Sokolova\nSITE, University of Ottawa\nOttawa, Ont., Canada, K1N-6N5\nsokolova@site.uottawa.ca\n\nMario Marchand\nSITE, University of Ottawa\nOttawa, Ont., Canada, K1N-6N5\nmarchand@site.uottawa.ca\n\nNathalie Japkowicz\nSITE, University of Ottawa\nOttawa, Ont., Canada, K1N-6N5\nnat@site.uottawa.ca\n\nJohn Shawe-Taylor\nRoyal Holloway, University of London\nEgham, UK, TW20-0EX\njst@cs.rhul.ac.uk\n\nAbstract\n\nWe introduce a new learning algorithm for decision lists to allow features that are constructed from the data and to allow a trade-off between accuracy and complexity. We bound its generalization error in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine.\n\n1 Introduction\n\nThe set covering machine (SCM) has recently been proposed by Marchand and Shawe-Taylor (2001, 2002) as an alternative to the support vector machine (SVM) when the objective is to obtain a sparse classifier with good generalization. Given a feature space, the SCM tries to find the smallest conjunction (or disjunction) of features that gives a small training error. In contrast, the SVM tries to find the maximum soft-margin separating hyperplane on all the features. 
Hence, the two learning machines are fundamentally different in what they are trying to achieve on the training data.\n\nTo investigate if it is worthwhile to consider larger classes of functions than just the conjunctions and disjunctions that are used in the SCM, we focus here on the class of decision lists introduced by Rivest (1987) because this class strictly includes both conjunctions and disjunctions and is strictly included in the class of linear threshold functions (Marchand and Golea, 1993). Hence, we denote by decision list machine (DLM) any classifier which computes a decision list of Boolean-valued features, including features that are possibly constructed from the data. In this paper, we use the set of features introduced by Marchand and Shawe-Taylor (2001, 2002) known as data-dependent balls. By extending the sample compression technique of Littlestone and Warmuth (1986), we bound the generalization error of the DLM with data-dependent balls in terms of the number of errors and the number of balls it achieves on the training data. We also show that the DLM with balls can provide better generalization than the SCM with this same set of features on some natural data sets.\n\n2 The Decision List Machine\n\nLet x denote an arbitrary n-dimensional vector of the input space X which could be arbitrary subsets of