{"title": "Boolean Decision Rules via Column Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 4655, "page_last": 4665, "abstract": "This paper considers the learning of Boolean rules in either disjunctive normal form (DNF, OR-of-ANDs, equivalent to decision rule sets) or conjunctive normal form (CNF, AND-of-ORs) as an interpretable model for classification. An integer program is formulated to optimally trade classification accuracy for rule simplicity. Column generation (CG) is used to efficiently search over an exponential number of candidate clauses (conjunctions or disjunctions) without the need for heuristic rule mining. This approach also bounds the gap between the selected rule set and the best possible rule set on the training data. To handle large datasets, we propose an approximate CG algorithm using randomization. Compared to three recently proposed alternatives, the CG algorithm dominates the accuracy-simplicity trade-off in 8 out of 16 datasets. When maximized for accuracy, CG is competitive with rule learners designed for this purpose, sometimes finding significantly simpler solutions that are no less accurate.", "full_text": "Boolean Decision Rules via Column Generation\n\nSanjeeb Dash, Oktay G\u00fcnl\u00fck, Dennis Wei\n\nIBM Research\n\nYorktown Heights, NY 10598, USA\n\n{sanjeebd,gunluk,dwei}@us.ibm.com\n\nAbstract\n\nThis paper considers the learning of Boolean rules in either disjunctive normal\nform (DNF, OR-of-ANDs, equivalent to decision rule sets) or conjunctive normal\nform (CNF, AND-of-ORs) as an interpretable model for classi\ufb01cation. An integer\nprogram is formulated to optimally trade classi\ufb01cation accuracy for rule simplicity.\nColumn generation (CG) is used to ef\ufb01ciently search over an exponential number\nof candidate clauses (conjunctions or disjunctions) without the need for heuristic\nrule mining. 
This approach also bounds the gap between the selected rule set and the best possible rule set on the training data. To handle large datasets, we propose an approximate CG algorithm using randomization. Compared to three recently proposed alternatives, the CG algorithm dominates the accuracy-simplicity trade-off in 8 out of 16 datasets. When maximized for accuracy, CG is competitive with rule learners designed for this purpose, sometimes finding significantly simpler solutions that are no less accurate.

1 Introduction

Interpretability has become a well-recognized goal for machine learning models. The need for interpretable models is certain to increase as machine learning pushes further into domains such as medicine, criminal justice, and business, where such models complement human decision-makers and decisions can have major consequences on human lives. Transparency is thus required for domain experts to understand, critique, and trust models, and reasoning is required to explain individual decisions.

This paper considers Boolean rules in either disjunctive normal form (DNF, OR-of-ANDs) or conjunctive normal form (CNF, AND-of-ORs) as a class of interpretable models for binary classification. An example of a DNF rule with two clauses is "IF (# accounts < 5) OR (# accounts ≥ 7 AND debt > $1000) THEN risk = high". Particularly desirable for interpretability are compact Boolean rules with few clauses and few conditions in each clause.

DNF classification rules are also referred to as decision rule sets, where each conjunction is considered an individual rule, rules are unordered, and a positive prediction is made when at least one of the rules is satisfied. Rule sets stand in contrast to decision lists [44, 35, 49, 3, 34, 53], where rules are ordered in an IF-ELSE sequence, and decision trees [11, 43, 6], where they are organized into a tree structure.
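The OR-of-ANDs semantics just described are simple to state in code. The following is a minimal sketch of our own (not the authors' implementation); the feature names and the encoding of the example rule are hypothetical.

```python
# Minimal sketch of DNF (OR-of-ANDs) classification: each clause is a list
# of conditions, and the prediction is positive iff at least one clause is
# fully satisfied. Feature names are hypothetical, chosen to mirror the rule
# "IF (# accounts < 5) OR (# accounts >= 7 AND debt > $1000) THEN risk = high".

def predict_dnf(sample, clauses):
    """Return True iff at least one clause (conjunction) is satisfied."""
    return any(all(cond(sample) for cond in clause) for clause in clauses)

rule_set = [
    [lambda s: s["num_accounts"] < 5],                               # clause 1
    [lambda s: s["num_accounts"] >= 7, lambda s: s["debt"] > 1000],  # clause 2
]

print(predict_dnf({"num_accounts": 8, "debt": 1500}, rule_set))  # True
print(predict_dnf({"num_accounts": 6, "debt": 500}, rule_set))   # False
```

A sample is thus classified positive exactly when it is "covered" by some rule, which is the coverage notion used throughout the paper.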
While the latter two classes are also considered interpretable, the metrics for measuring\ntheir complexity are different and not directly comparable [27]. Moreover, a user study [33] has\nquanti\ufb01ed the extra effort involved in understanding decision lists due to the need to account for the\nnegations of all preceding rules.\nThe learning of Boolean rules and rule sets has an extensive history spanning multiple \ufb01elds. DNF\nlearning theory (e.g. [47, 32, 24]) focuses on the ideal noiseless setting (sometimes allowing arbitrary\nqueries) and is less relevant to the practice of learning compact models from noisy data. Predominant\npractical approaches include a covering or separate-and-conquer strategy ([15, 14, 16, 26, 28, 40],\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fsee also the survey [30]) of learning rules one by one and removing \u201ccovered\u201d examples, a bottom-\nup strategy of combining more speci\ufb01c rules into more general ones [45, 22, 41], and associative\nclassi\ufb01cation in which association rule mining is followed by rule selection using various criteria\n[38, 36, 54, 50, 12, 13]. Broadly speaking, these approaches employ heuristics and/or multiple\ncriteria not directly related to classi\ufb01cation accuracy. Moreover, they do not explicitly consider model\ncomplexity, a problem that has been noted especially with associative classi\ufb01cation. 
Rule set models\nhave been generalized to rule ensembles [17, 29, 20], using boosting and linear combination rather\nthan logical disjunction; the interpretability of such models is again not comparable to rule sets.\nModels produced by logical analysis of data [9, 31] from the operations research community are\nsimilarly weighted linear combinations.\nIn recent years, spurred by the demand for interpretable models, several researchers have revisited\nBoolean and rule set models and proposed methods that jointly optimize accuracy and simplicity\nwithin a single objective function. These works however have both restricted the problem and\napproximated its solution. In [33, 52, 51], frequent rule miners are \ufb01rst used to produce a set of\ncandidate rules. A greedy forward-backward algorithm [33], simulated annealing [52], or integer\nprogramming (IP) (in an unpublished manuscript [51]) are then used to select rules from the candidates.\nThe drawback of rule mining is that it limits the search space while often still producing a large\nnumber of rules, which then have to be \ufb01ltered using criteria such as information gain. [51] also\npresented an IP formulation (but no computational results) that jointly constructs and selects rules\nwithout pre-mining. [46] developed an IP formulation for DNF and CNF learning in which the\nnumber of clauses (conjunctions or disjunctions) is \ufb01xed. The problem is then solved approximately\nby decomposing into subproblems and applying a linear programming (LP) method [39], which\nrequires rounding of fractional solutions.\nIn this paper, we also propose an IP formulation for Boolean rule (DNF or CNF) learning but one that\navoids the above limitations. Rather than mining rules, we use the large-scale optimization technique\nof column generation (CG) to intelligently search over the exponential number of all possible clauses,\nwithout enumerating even a pre-mined subset (which can be large). 
Instead, only those clauses that can improve the current solution are generated on the fly. In practice, our approach solves the IP formulation to provable optimality for smaller datasets. For large datasets we employ an approximate version of CG by randomly selecting samples and candidate features that can be used in a clause. To speed up computation, we also generate additional clauses using a greedy algorithm that still optimizes the correct objective.

A numerical evaluation is presented using 16 datasets, including one from the ongoing FICO Explainable Machine Learning Challenge [1]. In terms of the trade-off achieved between accuracy and rule simplicity, our CG algorithm dominates three other recent proposals on 8 datasets, whereas each of the others dominates on at most two. When optimized for accuracy using cross-validation, CG remains competitive with rule learners such as RIPPER [16] that are designed for maximum accuracy. In some instances it provides significantly less complex models with no sacrifice in accuracy.

We note that CG has been proposed for other machine learning tasks such as boosting [21, 7] and hash learning [37]. In [21] however, the pricing problem (see Section 2.2) is solved approximately by a weak learning algorithm ("weak" in the boosting sense), not IP, whereas in [7], pricing can be done tractably through enumeration.

2 Problem formulation

We consider supervised binary classification given a training dataset of n samples (x_i, y_i), i = 1, ..., n with labels y_i ∈ {0, 1}. Let the set {1, ..., n} be partitioned into P ∪ Z, where P contains the indices of the samples with label y_i = 1 and Z contains the indices of those with label y_i = 0. For the problem formulation in this section, all features X_j, j ∈ J = {1, ..., d}, are assumed to be binary-valued as well; binarization of numerical and categorical features is discussed in Section 4.

The presentation focuses on the problem of learning a Boolean classifier ŷ(x) in DNF (OR-of-ANDs). Given a DNF and binary-valued features, a clause corresponds to a conjunction of features, and a sample satisfies a clause if it has all features contained in the clause (i.e. x_ij = 1 for all such features j). Since a DNF classifier is equivalent to a rule set, the terms clause, conjunction, and (single) rule (within a rule set) are used interchangeably. As shown in [46] using De Morgan's laws, the same formulation applies equally well to CNF learning by negating both labels y_i and features x_i. The method can also be extended to multi-class classification in the usual one-versus-rest manner.

2.1 An integer program to minimize Hamming loss

Our objective is to minimize the Hamming loss of the rule set, as is also done in [46, 33]. For each incorrectly classified sample, the Hamming loss counts the number of clauses that have to be selected or removed to classify it correctly. More precisely, it is equal to the number of samples with label 1 that are classified incorrectly (false negatives) plus the sum, over samples with label 0, of the number of selected clauses that each such sample satisfies. Thus while each false negative contributes one unit to this loss function, representing a single clause that needs to be selected, a false positive contributes more than one unit if it satisfies multiple clauses, which must all be removed.

We bound the complexity of the rule set by a given parameter C, both to prevent over-fitting and to control complexity.
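The Hamming loss just described can be stated compactly in code. The sketch below is ours (not the paper's implementation); each clause is represented as a set of binary-feature indices that must all equal 1.

```python
def hamming_loss(X, y, clauses):
    """Hamming loss of a rule set on binarized data.

    X: list of 0/1 feature vectors; y: 0/1 labels; clauses: list of sets of
    feature indices, each read as a conjunction. A positive sample covered by
    no clause adds 1 (one clause would have to be selected); a negative
    sample adds one unit per selected clause it satisfies (all of which
    would have to be removed)."""
    loss = 0
    for x, label in zip(X, y):
        n_satisfied = sum(all(x[j] for j in clause) for clause in clauses)
        loss += (n_satisfied == 0) if label == 1 else n_satisfied
    return loss

X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
y = [1, 1, 0, 0]
# Clause {0, 1} covers samples 0 and 3: the second positive is a false
# negative (+1), and negative sample 3 satisfies one clause (+1).
print(hamming_loss(X, y, [{0, 1}]))  # 2
```

Note the asymmetry: a false negative always costs 1, while a false positive costs as many units as the number of clauses covering it.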
For concreteness, we define the complexity of a clause to be a fixed cost of one plus the number of conditions in the clause; other linear combinations can be handled equally well. The total complexity of a rule set is defined as the sum of the complexities of its clauses. Alternatively, it is possible to include an additional term in the objective function to penalize complexity, but we find it more natural to explicitly bound the maximum complexity, as this can offer better control in applications where interpretable rules are preferred. Clearly it is also possible to use both a constraint and a penalty term.

We express the above notions of Hamming loss and complexity in an integer program (IP) that is not practical for real-life datasets as written but is useful to explain the conceptual framework behind our approach. Let K denote the collection of all possible (exponentially many) clauses involving X_j, j ∈ J, and let K_i ⊆ K contain the clauses satisfied by sample i, for all i ∈ P ∪ Z. Note that as the features X_j are binary, |K| is indeed bounded. Letting decision variable w_k for k ∈ K denote whether clause k is used in the rule set, c_k denote the complexity of clause k ∈ K, and ξ_i for i ∈ P indicate the positive samples classified incorrectly, we have the following IP:

    z_{MIP} = min  \sum_{i \in P} \xi_i + \sum_{i \in Z} \sum_{k \in K_i} w_k                      (1)
    s.t.  \xi_i + \sum_{k \in K_i} w_k \ge 1,  \quad \xi_i \ge 0,  \quad i \in P                   (2)
          \sum_{k \in K} c_k w_k \le C                                                             (3)
          w_k \in \{0, 1\},  \quad k \in K.                                                        (4)

The objective function (1) is the Hamming loss as described. Constraints (2) identify false negatives, which have \sum_{k \in K_i} w_k = 0 and are therefore not "covered" by any selected clauses. Note that w_k being binary implies that \xi_i \in \{0, 1\} in any optimal solution because of the objective function. Constraint (3) bounds the complexity of the rule set.
We call this formulation the Master IP (MIP) and call its linear programming (LP) relaxation, obtained by dropping the integrality constraint (4), the Master LP (MLP), denoting its optimal value by z_{MLP}. It is also possible to weight the two terms in the objective (1) differently, for example to balance unequal classes, but we do not pursue that variation here.

2.2 Column generation framework

Clearly it is only practical to solve the Master IP for very small datasets. Moreover, even solving the Master LP explicitly is often intractable because it has exponentially many variables. An effective way to solve such large LPs is the column generation framework [4, 18], where only a small subset of all possible w_k variables (clauses) is generated explicitly and the optimality of the LP is guaranteed by iteratively solving a pricing problem.

To apply this framework to the MIP, the first step is to restrict the formulation by replacing the set K with a very small subset of it and explicitly solve the LP relaxation of the resulting smaller problem, which we call the Restricted MLP. Any optimal solution of the Restricted MLP can be extended to a solution of the MLP with the same objective value by setting all missing w_k variables to zero, and thus provides an upper bound on z_{MLP}. Such a solution can potentially be improved by augmenting the Restricted MLP with additional variables corresponding to some of the missing clauses. The second step is to identify such clauses without explicitly considering all of them. Repeating these steps until there are no improving clauses (i.e.
variables missing from the Restricted MLP that can reduce the cost) solves the MLP to optimality.

To find the missing clauses that can potentially improve the value of the Restricted MLP, one needs to check if there are variables missing from the Restricted MLP that have negative reduced cost [5]. The reduced cost of a missing variable gives the maximum possible change in objective value per unit increase in that variable's value (when it is included in the formulation). Therefore, if all missing variables have non-negative reduced cost, then the current Restricted MLP cannot be improved and its optimal solution yields an optimal solution of the MLP. Furthermore, it is desirable to identify missing variables that have large negative reduced costs, as they are more likely to improve the objective value of the Restricted MLP. To this end, we next formulate an optimization problem that uses the optimal dual solution to the Restricted MLP. Let \mu_i \ge 0 for i \in P denote the dual variables associated with constraints (2) and \lambda \ge 0 the dual variable associated with (3). Let \delta_i \in \{0, 1\} denote whether the i-th sample satisfies the missing clause in question. If we let c denote the complexity of the clause, then its reduced cost is equal to

    \sum_{i \in Z} \delta_i - \sum_{i \in P} \mu_i \delta_i + \lambda c.                           (5)

The first term in (5) is the cost of the missing clause in the objective function (1), expressed in terms of \delta_i. The second term is the sum of the dual variables associated with the constraints (2) in which the clause appears. The last term is the dual variable associated with constraint (3) multiplied by the complexity of the clause.

We now formulate an IP to express clauses as conjunctions of the original features X_j, j \in J. Let the decision variable z_j \in \{0, 1\} denote whether feature j \in J is selected in the clause. Let S_i denote the set of zero-valued features in sample i \in P \cup Z, S_i = \{j : x_{ij} = 0\}.
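As a concrete illustration (our own sketch, with hypothetical argument names), expression (5) can be evaluated directly for a candidate clause given the duals:

```python
def reduced_cost(clause, X, y, mu, lam):
    """Evaluate the reduced cost of a candidate clause, following (5).

    clause: set of feature indices (a conjunction); X: 0/1 feature vectors;
    y: 0/1 labels; mu: dict mapping each positive-sample index i to its dual
    mu_i >= 0 for constraint (2); lam: dual of complexity constraint (3).
    Clause complexity is c = 1 + number of selected features."""
    cost = lam * (1 + len(clause))          # lambda * c
    for i, (x, label) in enumerate(zip(X, y)):
        if all(x[j] for j in clause):       # delta_i = 1
            cost += 1 if label == 0 else -mu[i]
    return cost

X = [[1, 1], [1, 0], [1, 1]]
y = [1, 1, 0]
mu = {0: 0.8, 1: 0.5}
r = reduced_cost({0, 1}, X, y, mu, lam=0.1)
print(round(r, 6))  # 0.5 -> non-negative, so this clause is not improving
```

A clause with a negative value here would be an improving column; the pricing IP below searches for the clause minimizing this quantity.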
Then the Pricing Problem below identifies the clause missing from the Restricted MLP that has the lowest reduced cost.

    z_{CG} = min  \lambda (1 + \sum_{j \in J} z_j) - \sum_{i \in P} \mu_i \delta_i + \sum_{i \in Z} \delta_i    (6)
    s.t.  \delta_i + z_j \le 1,  \quad j \in S_i,  \; i \in P                                      (7)
          \delta_i \ge 1 - \sum_{j \in S_i} z_j,  \quad \delta_i \ge 0,  \quad i \in Z             (8)
          \sum_{j \in J} z_j \le D                                                                 (9)
          z_j \in \{0, 1\},  \quad j \in J.                                                        (10)

The first term in (6) expresses the complexity c_k in terms of the number of selected features. Constraints (7), (8) ensure that the clause acts as a conjunction, i.e. it is satisfied (\delta_i = 1) only if no zero-valued features are selected (z_j = 0 for j \in S_i). Similar to \xi_i in the MIP, the variables \delta_i do not have to be explicitly defined as binary due to the objective function. Constraint (9) bounds the number of features allowed in any clause in the rule set. Parameter D can be set to C - 1 to relax this constraint, or it can be set to a smaller number if desired to limit the clause complexity.

The optimal solution to the Pricing Problem above gives the clause with the minimum reduced cost that is missing from the Restricted MLP. The reduced cost of this clause equals z_{CG}, and if z_{CG} < 0, then the corresponding variable is added to the Restricted MLP. More generally, any feasible solution to the Pricing Problem that has a negative objective function value gives a clause with a negative reduced cost and can therefore be added to the Restricted MLP to improve its value.

2.3 Optimality guarantees and bounds

When the column generation framework described above is repeated until z_{CG} \ge 0, none of the variables missing from the Restricted MLP have a negative reduced cost, and the optimal solutions of the MLP and the Restricted MLP coincide. In addition, if the optimal solution of the Restricted MLP turns out to be integral, then it is also an optimal solution to the MIP, and therefore the MIP is solved to optimality.
If the optimal solution of the Restricted MLP is fractional, then one may have to use column generation within an enumeration framework to solve the MIP to optimality. This approach is called branch-and-price [4] and is quite computationally intensive.

However, even when the optimal solution to the MLP is fractional, \lceil z_{MLP} \rceil provides a lower bound on z_{MIP}, as the objective function (1) has integer coefficients. This lower bound can be compared to the cost of any feasible solution to the MIP. If the latter equals \lceil z_{MLP} \rceil, then, once again, the MIP is solved to optimality. As one example, a feasible solution to the MIP can be obtained by solving the Restricted MIP obtained by imposing (4) on the variables present in the Restricted MLP. More generally, any heuristic method can generate feasible solutions to the MIP.

Finally, we note that even when the MLP is not solved to optimality and the column generation procedure is terminated prematurely, a valid lower bound on z_{MIP} can be obtained as \lceil z_{RMLP} + (C/2) z_{CG} \rceil, where z_{RMLP} is the objective value of the last Restricted MLP solved to optimality. This bound follows from the fact that c_k \ge 2 for any clause, so there can be at most C/2 missing variables, each with reduced cost no less than z_{CG}, that could be added to the Restricted MLP [48].

3 Computational Approach

The previous section provides a sound theoretical framework for finding an optimal rule set for the training data. For small datasets, defined loosely as having less than a couple of thousand samples and less than a few hundred binary (binarized) features (this includes the mushroom and tic-tac-toe UCI datasets appearing in Section 4), it is computationally feasible to employ this optimization framework as described in Section 2. However, to handle larger datasets within a time limit of 10 or 20 minutes, one has to sacrifice the optimality guarantees of the framework.
We next describe our computational\napproach to deal with larger datasets, which can be seen as an optimization-based heuristic. We\ncall a dataset medium if it has more than a couple of thousand samples but less than a few hundred\nbinary features. We call it large if it has many thousands of samples and more than several hundred\nbinary features. The separation of datasets into small, medium and large is done based on empirical\nexperiments to improve the likelihood that the Pricing Problem can produce negative reduced cost\nsolutions.\nFor medium and large datasets, the number of non-zeros in the Pricing Problem (de\ufb01ned as the sum\nof the numbers of variables appearing in the constraints of the formulation) is at least 100,000 and\nsolving this integer problem in a reasonable amount of time is not always feasible. Consequently\nsolving the MLP to proven optimality is not likely. To deal with this practical issue, we terminate\nthe Pricing problem if a \ufb01xed time limit is exceeded. We use a standard mixed-integer programming\nsolver (CPLEX 12.7.1) to which a time limit can be provided.\nWhile the solver is \ufb01nding negative reduced cost clauses from the Pricing Problem, the presence of the\ntime limit matters little. If the Pricing Problem is solved to optimality within the time limit, then we\nobtain a minimum reduced cost clause. Moreover, the solver might discover several negative reduced\ncost clauses within the time limit and it is possible to recover all these solutions at termination (due\nto optimality or time limit). To speed up the overall solution process, we add all the negative reduced\ncost clauses returned by the solver to the Restricted MLP. As long as one variable with a negative\nreduced cost is obtained, the column generation process continues.\nEventually, the solver will fail to \ufb01nd a negative reduced cost solution within the time limit. 
If the solver proves that there is no such solution to the Pricing Problem, then the MLP is solved to optimality. However, if non-existence cannot be proved within the time limit, then column generation using the Pricing Problem has to terminate without an optimality guarantee or a valid lower bound on the MIP. In this case, we employ a fast heuristic algorithm to continue searching for negative reduced cost solutions and extend the process.

Our heuristic algorithm only explores clauses that have up to a fixed number of features (five in our experiments), and works as follows. We create all one-term clauses that can potentially be extended to negative reduced cost clauses, and assign each of them a score equal to the objective function of the Pricing Problem applied to the clause. For each clause size l from 1 up to the limit, we do the following: we process all generated clauses that have l features in increasing order of their score, and for each such clause we create new clauses by appending additional features. Whenever we find a clause with negative reduced cost, we add it to a list of candidate solutions, and when our enumeration terminates (we impose an upper bound on the number of generated clauses), we return the best clauses generated by the heuristic before proceeding to the next value of l.

In addition to the time limit on the Pricing Problem, we also impose a time limit on the overall column generation process. Thus column generation terminates in two cases: 1) when an improving clause cannot be found, either because one is proven not to exist or because one cannot be found within the Pricing Problem time budget and the heuristic also fails to find one, or 2) when the overall time limit is met.
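A stylized version of this heuristic can be written as follows. This is our own reconstruction with simplified bookkeeping; the names `max_size` and `beam` are ours, standing in for the paper's bounds on clause size and on the number of generated clauses.

```python
import heapq

def heuristic_clauses(X, y, mu, lam, max_size=5, beam=50):
    """Grow clauses one feature at a time, scoring each candidate by the
    Pricing Problem objective (its reduced cost), and collect every clause
    found with a negative score. A sketch of the idea only."""
    d = len(X[0])

    def score(clause):
        s = lam * (1 + len(clause))
        for i, x in enumerate(X):
            if all(x[j] for j in clause):
                s += 1 if y[i] == 0 else -mu[i]
        return s

    frontier = [frozenset([j]) for j in range(d)]
    negative = set()
    for _ in range(max_size):
        # process the best-scoring clauses of the current size first
        frontier = heapq.nsmallest(beam, frontier, key=score)
        negative.update(c for c in frontier if score(c) < 0)
        # extend each surviving clause by one additional feature
        frontier = [c | {j} for c in frontier for j in range(d) if j not in c]
    return sorted(negative, key=score)

X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
y = [1, 1, 0, 0]
best = heuristic_clauses(X, y, mu={0: 1.0, 1: 1.0}, lam=0.01)
print(best[0])  # frozenset({0})
```

Unlike the IP-based pricing, this search carries no proof of non-existence, but any clause it returns is a genuine improving column because it is scored with the exact reduced-cost objective.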
At this point, we solve the Restricted MIP (the integral version of the Restricted MLP) using\nCPLEX, and use the solution as our classi\ufb01er.\nFor large datasets, the Pricing Problem can have more than a million non-zeros and even solving its LP\nrelaxation becomes challenging. In this case the solver can rarely produce any negative reduced-cost\nsolutions within the time limit. To deal with this, we formulate an approximate Pricing Problem by\nrandomly selecting a limited number of features and samples. We pick samples uniformly with a\nprobability that on average leads to a formulation with a couple of thousand samples. If the resulting\nPricing Problem has more than a hundred thousand non-zeros, then we also limit the candidate\nfeatures that can form a clause. The candidate features are selected uniformly with a probability that\nleads to a formulation with one hundred thousand non-zeros. We also note that for large datasets the\nRestricted MLP can easily have more than one million non-zeros after generating several hundred\ncolumns and it is faster to solve it with the interior point algorithm in CPLEX instead of simplex .\n\n4 Numerical Evaluation\n\nEvaluations were conducted on 15 classi\ufb01cation datasets from the UCI repository [23] that have\nbeen used in recent works on rule set/Boolean classi\ufb01ers [39, 19, 46, 52]. In addition, we used\nrecently released data from the FICO Explainable Machine Learning Challenge [1]. It contains 23\nnumerical features of the credit history of 10, 459 individuals (9871 after removing records with all\nentries missing) for predicting repayment risk (good/bad). The domain of \ufb01nancial services and the\nclear meanings of the features combine to make it a good candidate for a rule set model. Details of\nhow missing and special values were treated can be found in the supplementary material (SM). 
Test\nperformance on all datasets is estimated using 10-fold strati\ufb01ed cross-validation (CV).\nFor comparison with our column generation (CG) algorithm, we considered three recently proposed\nalternatives that also aim to control rule complexity: Bayesian Rule Sets (BRS) [52] and the alter-\nnating minimization (AM) and block coordinate descent (BCD) algorithms from [46]. Additional\ncomparisons include the WEKA [25] JRip implementation of RIPPER [16], a rule set learner that is\nstill state-of-the-art in accuracy, and scikit-learn [42] implementations of the decision tree learner\nCART [11] and Random Forests (RF) [10]. The last is an uninterpretable model intended as a\nbenchmark for accuracy. The SM includes further comparisons to logistic regression (LR) and\nsupport vector machines (SVM). The parameters of BRS and FPGrowth [8], the frequent rule miner\nthat BRS relies on, were set as recommended in [52] and the associated code (see SM for details).\nFor AM and BCD, the number of clauses was \ufb01xed at 10 with the option to disable unused clauses;\ninitialization and BCD updating are done as in [46]. While both [46] and our method are equally\ncapable of learning CNF rules, for these experiments we restricted both to learning DNF rules only.\nWe also experimented with code made available by the authors of [33]. Unfortunately, we were\nunable to execute this code with practical running time when the number of mined candidate rules\nexceeded 1000. Furthermore, the code was primarily designed to handle the interval representation\nof numerical features and not (\uf8ff, >) comparisons (see next paragraph). These limitations prevented\nus from making a full comparison. The SM includes partial results from [33] that are inferior to those\nfrom the other methods.\nWe used standard \u201cdummy\u201d/\u201cone-hot\u201d coding to binarize categorical variables into multiple Xj = x\nindicators, one for each category x, as well as their negations Xj 6= x. 
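The categorical coding just described can be sketched as follows (our illustration; the column naming is arbitrary):

```python
def binarize_categorical(values, categories):
    """'Dummy'/one-hot coding with negations: for each category x, emit an
    indicator column for X == x and one for its negation X != x."""
    columns = {}
    for x in categories:
        columns[f"=={x}"] = [int(v == x) for v in values]
        columns[f"!={x}"] = [int(v != x) for v in values]
    return columns

cols = binarize_categorical(["red", "blue", "red"], ["red", "blue"])
print(cols["==red"])   # [1, 0, 1]
print(cols["!=blue"])  # [1, 0, 1]
```

Including negated indicators lets a conjunction express conditions such as "color != blue" directly, without requiring one clause per remaining category.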
For numerical features, there are two common approaches. The first is to discretize by binning into intervals and then encode as above with categorical features. The second is to compare with a sequence of thresholds, again including negations (e.g. X_j \le 1, X_j \le 2 and X_j > 1, X_j > 2). For these experiments, we used the second comparison method, as also recommended in [52, 46], with sample deciles as thresholds. Furthermore, features were binarized in the same way for all classifiers in this comparison, which all rely on discretization (but not for LR and SVM in the SM). Thus the evaluation controls for binarization method in addition to using the same training-test splits for all classifiers.

(a) Heart disease  (b) FICO Explainable Machine Learning Challenge  (c) MAGIC gamma telescope  (d) Musk molecules

Figure 1: Rule complexity-test accuracy trade-offs on 4 datasets. Pareto efficient points are connected by line segments. Horizontal and vertical bars represent standard errors in the means. Overall, the proposed CG algorithm dominates the others on 8 of 16 datasets (see the SM for the full set).

We first evaluated the accuracy-simplicity trade-offs achieved by our CG algorithm as well as BRS, AM, and BCD, methods that explicitly perform this trade-off. For CG, we used an overall time limit of 300 seconds for training and a time limit of 45 seconds for solving the Pricing Problem in each iteration. Low time limits were chosen partly due to practical considerations of running the algorithm multiple times (e.g. for CV) on many datasets, and partly to demonstrate the viability of IP with limited computation. As in Section 2, complexity is measured as the number of rules in the rule set plus the total number of conditions in the rules.
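Both the threshold binarization and the complexity measure used in these experiments are easy to make concrete. The sketch below is ours, using the standard library's quantile routine; the exact handling of thresholds in the paper's experiments may differ.

```python
from statistics import quantiles

def binarize_numeric(values, n=10):
    """Threshold coding with sample quantiles (deciles for n=10): for each
    threshold t, emit indicator columns X <= t and its negation X > t."""
    columns = {}
    for t in quantiles(values, n=n):
        columns[f"<={t}"] = [int(v <= t) for v in values]
        columns[f">{t}"] = [int(v > t) for v in values]
    return columns

def rule_set_complexity(clauses):
    """Complexity as measured in the experiments: the number of rules plus
    the total number of conditions, i.e. the sum of (1 + len(clause))."""
    return sum(1 + len(clause) for clause in clauses)

cols = binarize_numeric(list(range(1, 11)), n=2)  # n=2 keeps the demo tiny
print(cols["<=5.5"])                       # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(rule_set_complexity([{0, 1}, {2}]))  # 2 rules + 3 conditions = 5
```

With deciles, each numeric feature yields at most 18 binary columns (nine thresholds, each with its negation), which keeps the pricing formulation's feature set manageable.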
For each algorithm, the parameter controlling model complexity (the bound C in (3), the regularization parameter θ in [46], and the multiplier in the prior hyperparameter of [52]) is varied, resulting in a set of complexity-test accuracy pairs. A sample of these plots is shown in Figure 1, with the full set in the SM. Line segments connect points that are Pareto efficient, i.e., not dominated by solutions that are more accurate and at least as simple or vice versa. CG dominates the other algorithms in 8 out of 16 datasets in the sense that its Pareto front is consistently higher; it nearly does so on a 9th dataset (tic-tac-toe), and on a 10th (banknote) all algorithms are very similar. BRS, AM, and BCD each achieve (co-)dominance only one or two times, e.g. in Figure 1d for AM. Among the cases where CG does not dominate are the highest-dimensional datasets (musk and gas, although for the latter CG does attain the highest accuracy given sufficient complexity) and ones where AM and/or BCD are more accurate at the lowest complexities. BRS solutions tend to cluster in a narrow range despite the prior multiplier being varied from 10^-3 to 10^3.

In a second experiment, nested CV was used to select values of C for CG and θ for AM and BCD to maximize accuracy on each training set. The selected model was then applied to the test set. In these experiments, CG was given an overall time limit of 120 seconds for each candidate value of C, and the time limit for the Pricing Problem was set to 30 seconds. To offset the decrease in the time limit, we performed a second pass for each dataset, solving the Restricted MIP with all the clauses generated for all possible choices of C. Mean test accuracy (over 10 partitions) and rule set complexity are reported in Tables 1 and 2. For BRS, we fixed the prior multiplier at 1, as optimizing it did not improve

Table 1: Mean test accuracy (%, standard error in parentheses).
Bold: Best among interpretable models; Italics: Best overall.

dataset     | CG          | BRS         | AM          | BCD         | RIPPER      | CART        | RF
banknote    | 99.1 (0.3)  | 99.1 (0.2)  | 98.5 (0.4)  | 98.7 (0.2)  | 99.2 (0.2)  | 96.8 (0.4)  | 99.5 (0.1)
heart       | 78.9 (2.4)  | 78.9 (2.4)  | 72.9 (1.8)  | 74.2 (1.9)  | 79.3 (2.2)  | 81.6 (2.4)  | 82.5 (0.7)
ILPD        | 69.6 (1.2)  | 69.8 (0.8)  | 71.5 (0.1)  | 71.5 (0.1)  | 69.8 (1.4)  | 67.4 (1.6)  | 69.8 (0.5)
ionosphere  | 90.0 (1.8)  | 86.9 (1.7)  | 90.9 (1.7)  | 91.5 (1.7)  | 88.0 (1.9)  | 87.2 (1.8)  | 93.6 (0.7)
liver       | 59.7 (2.4)  | 53.6 (2.1)  | 55.7 (1.3)  | 51.9 (1.9)  | 57.1 (2.8)  | 55.9 (1.4)  | 60.0 (0.8)
pima        | 74.1 (1.9)  | 74.3 (1.2)  | 73.2 (1.7)  | 73.4 (1.7)  | 73.4 (2.0)  | 72.1 (1.3)  | 76.1 (0.8)
tic-tac-toe | 100.0 (0.0) | 99.9 (0.1)  | 84.3 (2.4)  | 81.5 (1.8)  | 98.2 (0.4)  | 90.1 (0.9)  | 98.8 (0.1)
transfusion | 77.9 (1.4)  | 76.6 (0.2)  | 76.2 (0.1)  | 76.2 (0.1)  | 78.9 (1.1)  | 78.7 (1.1)  | 77.3 (0.3)
WDBC        | 94.0 (1.2)  | 94.7 (0.6)  | 95.8 (0.5)  | 95.8 (0.5)  | 93.0 (0.9)  | 93.3 (0.9)  | 97.2 (0.2)
adult       | 83.5 (0.3)  | 81.7 (0.5)  | 83.0 (0.2)  | 82.4 (0.2)  | 83.6 (0.3)  | 83.1 (0.3)  | 84.7 (0.1)
bank-mkt    | 90.0 (0.1)  | 87.4 (0.2)  | 90.0 (0.1)  | 89.7 (0.1)  | 89.9 (0.1)  | 89.1 (0.2)  | 88.7 (0.0)
gas         | 98.0 (0.1)  | 92.2 (0.3)  | 97.6 (0.2)  | 97.0 (0.3)  | 99.0 (0.1)  | 95.4 (0.1)  | 99.7 (0.0)
magic       | 85.3 (0.3)  | 82.5 (0.4)  | 80.7 (0.2)  | 80.3 (0.3)  | 84.5 (0.3)  | 82.8 (0.2)  | 86.6 (0.1)
mushroom    | 100.0 (0.0) | 99.7 (0.1)  | 99.9 (0.0)  | 99.9 (0.0)  | 100.0 (0.0) | 96.2 (0.3)  | 99.9 (0.0)
musk        | 95.6 (0.2)  | 93.3 (0.2)  | 96.9 (0.7)  | 92.1 (0.2)  | 95.9 (0.2)  | 90.1 (0.3)  | 86.2 (0.4)
FICO        | 71.7 (0.5)  | 71.2 (0.3)  | 71.2 (0.4)  | 70.9 (0.4)  | 71.8 (0.2)  | 70.9 (0.3)  | 73.1 (0.1)

accuracy on the whole (as can be expected from Figure 1). Tables 1 and 2 also include results from RIPPER, CART, and RF. We tuned the minimum number of samples per leaf for CART and RF, used 100 trees for RF, and otherwise kept the default settings.
The complexity values for CART result from a straightforward conversion of leaves to rules (for the simpler of the two classes) and are meant only for rough comparison.

Table 2: Mean complexity (# clauses + total # conditions, standard error in parentheses)

dataset       CG             BRS           AM             BCD           RIPPER         CART
banknote      24.2 (1.5)     30.4 (1.1)    25.0 (1.9)     21.3 (1.9)    28.6 (1.1)     51.8 (1.4)
heart         11.5 (3.0)     24.0 (1.6)    11.3 (1.8)     15.4 (2.9)    16.0 (1.5)     32.0 (8.1)
ILPD          0.0 (0.0)      4.4 (0.4)     10.9 (2.7)     0.0 (0.0)     9.5 (2.5)      56.5 (10.9)
ionosphere    16.0 (1.5)     12.0 (1.6)    12.3 (3.0)     14.6 (1.4)    14.6 (1.2)     46.1 (4.2)
liver         8.7 (1.8)      15.1 (1.3)    5.2 (1.2)      4.0 (1.1)     5.4 (1.3)      60.2 (15.6)
pima          2.7 (0.6)      17.4 (0.8)    4.5 (1.3)      2.1 (0.1)     17.0 (2.9)     34.7 (5.8)
tic-tac-toe   24.9 (3.1)     32.0 (0.0)    32.0 (0.0)     12.6 (1.1)    32.9 (0.7)     67.2 (5.0)
transfusion   0.0 (0.0)      6.0 (0.7)     5.6 (1.2)      0.0 (0.0)     6.8 (0.6)      14.3 (2.3)
WDBC          11.6 (2.2)     16.0 (0.7)    13.9 (2.4)     17.3 (2.5)    16.8 (1.5)     15.6 (2.2)
adult         15.0 (0.0)     39.1 (1.3)    88.0 (11.4)    13.2 (0.2)    133.3 (6.3)    95.9 (4.3)
bank-mkt      6.8 (0.7)      13.2 (0.6)    9.9 (0.1)      2.1 (0.1)     56.4 (12.8)    3.0 (0.0)
gas           62.4 (1.9)     22.4 (2.0)    123.9 (6.5)    27.8 (2.5)    145.3 (4.2)    104.7 (1.0)
magic         11.5 (0.2)     97.2 (5.3)    93.0 (10.7)    9.0 (0.0)     177.3 (8.9)    125.5 (3.2)
mushroom      15.4 (0.6)     17.5 (0.4)    17.8 (0.3)     14.6 (0.6)    17.0 (0.4)     9.3 (0.2)
musk          101.3 (11.6)   33.9 (1.3)    123.9 (6.5)    24.4 (1.9)    143.4 (5.5)    17.0 (0.7)
FICO          8.7 (0.4)      23.2 (1.4)    13.3 (4.1)     4.8 (0.3)     88.1 (7.0)     155.0 (27.5)

The superiority of CG over BRS, AM, and BCD carries over into Table 1, especially for the larger datasets (bottom partition of the table). Compared to RIPPER, which is designed to maximize accuracy, CG is very competitive.
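As a brief aside on how the trade-off curves are obtained: the Pareto-efficient points plotted in Figure 1 can be extracted with a single sorted scan over (complexity, accuracy) pairs. The sketch below is illustrative only, with made-up data; it is not the code used in our experiments.

```python
def pareto_front(points):
    """Keep the (complexity, accuracy) pairs not dominated by a point
    that is more accurate and at least as simple, or vice versa."""
    best_acc = float("-inf")
    front = []
    # Scan in order of increasing complexity; break complexity ties by
    # decreasing accuracy so only the best point at each complexity survives.
    for comp, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            front.append((comp, acc))
            best_acc = acc
    return front

# Hypothetical (complexity, test accuracy) pairs for one algorithm
pairs = [(5, 0.80), (3, 0.70), (5, 0.75), (10, 0.85), (12, 0.84)]
print(pareto_front(pairs))  # → [(3, 0.7), (5, 0.8), (10, 0.85)]
```

Line segments drawn between consecutive points of this front give the curves compared in Figure 1.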
The head-to-head "win-loss" record is nearly even, and on no dataset is CG less accurate by more than 1%, whereas RIPPER is worse by about 2% on ionosphere, liver, and tic-tac-toe. Moreover, on larger datasets, CG tends to learn significantly simpler rule sets that are nearly as or even more accurate than RIPPER's, e.g., on bank-marketing, magic, and FICO. CART, on the other hand, is less competitive in this experiment. Tic-tac-toe is notable in admitting an exact rule set solution, corresponding to all positions with three x's or o's in a row. CG succeeds in finding this rule set, whereas the other algorithms, including RF, cannot quite do so.

Given our use of IP, a relevant question is whether certifiably optimal or near-optimal solutions to the Master IP can be obtained in practice. Such guarantees are most interesting when the achieved training accuracies are low, as they rule out the existence of much better solutions. Among the small instances where the training accuracy is below 90% for CG, we are able to obtain optimal or near-optimal solutions to the training problem for heart, liver, and transfusion. For example, for transfusion, we can certify that the optimality gap is at most 0.7% when the bound C on the complexity of the rule set is set to 15. Note that our IP formulation (1)-(4) solves the training problem with the Hamming loss objective.
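In this loss, each positive example covered by no clause contributes one unit, and each negative example contributes one unit per selected clause that covers it. The sketch below makes this concrete for a small DNF rule set; it is illustrative only, and the clause encoding and data are hypothetical rather than our implementation.

```python
# Illustrative sketch: a clause is a list of predicates over a feature dict,
# and a sample satisfies the clause if all predicates hold.

def covers(clause, x):
    return all(pred(x) for pred in clause)

def dnf_predict(clauses, x):
    # DNF (OR-of-ANDs): predict positive if any clause is satisfied
    return any(covers(c, x) for c in clauses)

def hamming_loss(clauses, X, y):
    loss = 0
    for x, label in zip(X, y):
        n_cov = sum(covers(c, x) for c in clauses)
        if label == 1 and n_cov == 0:
            loss += 1      # uncovered positive: one unit
        elif label == 0:
            loss += n_cov  # covered negative: one unit per covering clause
    return loss

# Hypothetical two-clause rule set in the style of the FICO example
clauses = [
    [lambda x: x["NumSatTrades"] >= 23,
     lambda x: x["ExtRiskEstimate"] >= 70,
     lambda x: x["NetFracRevolvBurden"] <= 63],
    [lambda x: x["NumSatTrades"] <= 22,
     lambda x: x["ExtRiskEstimate"] >= 76,
     lambda x: x["NetFracRevolvBurden"] <= 78],
]
X = [{"NumSatTrades": 30, "ExtRiskEstimate": 80, "NetFracRevolvBurden": 50},
     {"NumSatTrades": 10, "ExtRiskEstimate": 60, "NetFracRevolvBurden": 90}]
y = [1, 1]
# first sample is covered by the first clause; the second by neither
print(hamming_loss(clauses, X, y))  # → 1
```

Structurally, swapping `all` and `any` in the two evaluation functions gives the CNF (AND-of-ORs) case.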
For the medium to large datasets, we are unable to obtain such guarantees because we cannot solve the Pricing Problem to optimality or near-optimality within our specified time limits.

We conclude this section with an example of a DNF rule learned by CG, specifically the one that maximizes accuracy on the FICO data with two simple clauses:

   (NumSatTrades ≥ 23) AND (ExtRiskEstimate ≥ 70) AND (NetFracRevolvBurden ≤ 63)
OR (NumSatTrades ≤ 22) AND (ExtRiskEstimate ≥ 76) AND (NetFracRevolvBurden ≤ 78).

According to the data dictionary provided with the FICO challenge [1], "NumSatTrades" is the number of satisfactory accounts, "ExtRiskEstimate" is a consolidated version of some risk markers, and "NetFracRevolvBurden" is the ratio of revolving balance to credit limit. The rules thus identify two groups, one with more accounts and less revolving debt, the other with fewer accounts and somewhat more revolving debt. A slightly higher (better) "ExtRiskEstimate" is required for the second, riskier group.

5 Conclusion

We have developed a column generation algorithm for learning interpretable DNF or CNF classification rules that efficiently searches the space of rules without pre-mining or other restrictions. Experiments have borne out the superiority of the accuracy-simplicity trade-offs achieved. While the results in Table 1 are competitive with RIPPER, in some instances they fall short of the potential suggested in the first accuracy-complexity trade-off experiment. For example, on the heart disease dataset, Figure 1a shows a maximum accuracy of 81.3% while the value resulting from CV in Table 1 is only 78.9%. For small datasets, the challenge is variability in estimating test accuracy. For
For\nlarge datasets, although we have proposed measures such as time limits and sampling to reduce the\ncomputational burden, these measures are applied more aggressively during cross-validation when\nmany more instances need to be solved, thus affecting solution quality. We leave as future work\nimproved procedures for optimizing parameter C for accuracy.\n\nReferences\n[1] FICO Explainable Machine Learning Challenge. https://community.fico.com/community/xml.\n\nLast accessed 2018-05-16.\n\n[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. Int.\n\nConf. Very Large Data Bases (VLDB), pages 487\u2013499, 1994.\n\n[3] Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning\ncerti\ufb01ably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on\nKnowledge Discovery and Data Mining (KDD), pages 35\u201344, 2017.\n\n[4] C. Barnhart, E. L. Johnson, G.L. Nemhauser, M.W.F . Savelsbergh, and P.H. Vance. Branch-and- price:\n\nColumn generation for solving huge integer programs. Operations Research, 46:316\u2013329, 1998.\n\n[5] Mokhtar S. Bazaraa, John Jarvis, and Hanif D. Sherali. Linear programming and network \ufb02ows. Wiley,\n\n2010.\n\n[6] Dimitris Bertsimas and Jack Dunn. Optimal classi\ufb01cation trees. Mach. Learn., 106(7):1039\u20131082, July\n\n2017.\n\n[7] Jinbo Bi, Tong Zhang, and Kristin P. Bennett. Column-generation boosting methods for mixture of kernels.\nIn Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data\nMining (KDD), pages 521\u2013526, August 2004.\n\n[8] Christian Borgelt. An implementation of the FP-growth algorithm. In Proc. Workshop on Open Source\n\nData Mining Software (OSDM), pages 1\u20135, 2005.\n\n9\n\n\f[9] Endre Boros, Peter L. Hammer, Toshihide Ibaraki, Alexander Kogan, Eddy Mayoraz, and Ilya Muchnik.\nAn implementation of logical analysis of data. 
IEEE Transactions on Knowledge and Data Engineering,\n12(2):292\u2013306, Mar/Apr 2000.\n\n[10] Leo Breiman. Random forests. Machine Learning, 45(1):5\u201332, October 2001.\n\n[11] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classi\ufb01cation and Regression\n\nTrees. Chapman & Hall/CRC, 1984.\n\n[12] Guoqing Chen, Hongyan Liu, Lan Yu, Qiang Wei, and Xing Zhang. A new approach to classi\ufb01cation\n\nbased on association rule mining. Decis. Support Syst., 42(2):674\u2013689, November 2006.\n\n[13] Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu. Discriminative frequent pattern analysis for\n\neffective classi\ufb01cation. In Proc. IEEE Int. Conf. Data Eng. (ICDE), pages 716\u2013725, 2007.\n\n[14] Peter Clark and Robin Boswell. Rule induction with CN2: Some recent improvements. In Proceedings of\n\nthe European Working Session on Machine Learning (EWSL), pages 151\u2013163, 1991.\n\n[15] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261\u2013283, Mar 1989.\n\n[16] William W. Cohen. Fast effective rule induction. In Proc. Int. Conf. Mach. Learn. (ICML), pages 115\u2013123,\n\n1995.\n\n[17] William W. Cohen and Yoram Singer. A simple, fast, and effective rule learner. In Proc. Conf. Artif. Intell.\n\n(AAAI), pages 335\u2013342, 1999.\n\n[18] Michele Conforti, Gerard Cornuejols, and Giacomo Zambelli. Integer programming. Springer, 2014.\n\n[19] Sanjeeb Dash, Dmitry M. Malioutov, and Kush R. Varshney. Screening for learning classi\ufb01cation rules via\nBoolean compressed sensing. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pages\n3360\u20133364, 2014.\n\n[20] Krzysztof Dembczy\u00b4nski, Wojciech Kot\u0142owski, and Roman S\u0142owi\u00b4nski. ENDER: a statistical framework for\n\nboosting decision rules. Data Mining and Knowledge Discovery, 21(1):52\u201390, Jul 2010.\n\n[21] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. 
Linear programming boosting via column generation. Mach. Learn., 46(1-3):225–254, January 2002.

[22] Pedro Domingos. Unifying instance-based and rule-based induction. Mach. Learn., 24(2):141–168, 1996.

[23] Dheeru Dua and Efi Karra Taniskidou. UCI machine learning repository, 2017.

[24] Vitaly Feldman. Learning DNF expressions from Fourier spectrum. In Proc. Conf. Learn. Theory (COLT), pages 17.1–17.19, 2012.

[25] Eibe Frank, Mark A. Hall, and Ian H. Witten. The WEKA workbench. In Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Morgan Kaufmann, 4th edition, 2016.

[26] Eibe Frank and Ian H. Witten. Generating accurate rule sets without global optimization. In Proc. Int. Conf. Mach. Learn. (ICML), pages 144–151, 1998.

[27] Alex A. Freitas. Comprehensible classification models – a position paper. ACM SIGKDD Explor., 15(1):1–10, 2014.

[28] Jerome H. Friedman and Nicholas I. Fisher. Bump hunting in high-dimensional data. Statistics and Computing, 9(2):123–143, April 1999.

[29] Jerome H. Friedman and Bogdan E. Popescu. Predictive learning via rule ensembles. Annals of Applied Statistics, 2(3):916–954, July 2008.

[30] Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. Foundations of Rule Learning. Springer-Verlag, Berlin, 2014.

[31] Peter L. Hammer and Tibérius O. Bonates. Logical analysis of data—an overview: From combinatorial optimization to medical applications. Annals of Operations Research, 148(1):203–225, November 2006.

[32] Adam R. Klivans and Rocco A. Servedio. Learning DNF in time 2^Õ(n^{1/3}). J. Comput. Syst. Sci., 68(2):303–318, March 2004.

[33] Himabindu Lakkaraju, Stephen H. Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc.
Data Mining (KDD), pages 1675–1684, 2016.

[34] Himabindu Lakkaraju and Cynthia Rudin. Learning cost-effective and interpretable treatment regimes. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 166–175, Fort Lauderdale, FL, USA, 20–22 April 2017.

[35] Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Stat., 9(3):1350–1371, September 2015.

[36] Wenmin Li, Jiawei Han, and Jian Pei. CMAR: accurate and efficient classification based on multiple class-association rules. In Proc. IEEE Int. Conf. Data Min. (ICDM), pages 369–376, 2001.

[37] Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Anthony Dick. Learning hash functions using column generation. In Proc. Int. Conf. Mach. Learn. (ICML), pages I-142–I-150, 2013.

[38] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Min. (KDD), pages 80–86, 1998.

[39] Dmitry M. Malioutov and Kush R. Varshney. Exact rule learning via Boolean compressed sensing. In Proc. Int. Conf. Mach. Learn. (ICML), pages 765–773, 2013.

[40] Mario Marchand and John Shawe-Taylor. The set covering machine. J. Mach. Learn. Res., 3:723–746, 2002.

[41] Marco Muselli and Diego Liberati. Binary rule generation via Hamming clustering. IEEE Transactions on Knowledge and Data Engineering, 14(6):1258–1268, 2002.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825\u20132830, 2011.\n\n[43] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco,\n\nCA, USA, 1993.\n\n[44] Ronald L. Rivest. Learning decision lists. Machine Learning, 2(3):229\u2013246, 1987.\n\n[45] Steven Salzberg. A nearest hyperrectangle learning method. Mach. Learn., 6(3):251\u2013276, 1991.\n\n[46] Guolong Su, Dennis Wei, Kush R. Varshney, and Dmitry M. Malioutov. Learning sparse two-level Boolean\nrules. In Proc. IEEE Int. Workshop Mach. Learn. Signal Process. (MLSP), pages 1\u20136, September 2016.\n\n[47] Leslie G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134\u20131142, November 1984.\n\n[48] Fran\u00e7ois Vanderbeck and Laurence A. Wolsey. An exact algorithm for IP column generation. Oper. Res.\n\nLett., 19(4):151\u2013159, 1996.\n\n[49] Fulton Wang and Cynthia Rudin. Falling rule lists. In Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), pages\n\n1013\u20131022, 2015.\n\n[50] Jianyong Wang and George Karypis. HARMONY: Ef\ufb01ciently mining the best rules for classi\ufb01cation. In\n\nProc. SIAM Int. Conf. Data Min. (SDM), pages 205\u2013216, 2005.\n\n[51] Tong Wang and Cynthia Rudin. Learning Optimized Or\u2019s of And\u2019s, November 2015. arXiv:1511.02210.\n\n[52] Tong Wang, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klamp\ufb02, and Perry MacNeille. A\nBayesian framework for learning rule sets for interpretable classi\ufb01cation. Journal of Machine Learning\nResearch, 18(70):1\u201337, 2017.\n\n[53] Hongyu Yang, Cynthia Rudin, and Margo Seltzer. Scalable Bayesian rule lists. In Proc. Int. Conf. Mach.\n\nLearn. (ICML), pages 1013\u20131022, 2017.\n\n[54] Xiaoxin Yin and Jiawei Han. CPAR: Classi\ufb01cation based on predictive association rules. In Proc. SIAM\n\nInt. Conf. Data Min. 
(SDM), pages 331–335, 2003.