{"title": "Optimal Sparse Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 7267, "page_last": 7275, "abstract": "Decision tree algorithms have been among the most popular algorithms for interpretable (transparent) machine learning since the early 1980's. The problem that has plagued decision tree algorithms since their inception is their lack of optimality, or lack of guarantees of closeness to optimality: decision tree algorithms are often greedy or myopic, and sometimes produce unquestionably suboptimal models. Hardness of decision tree optimization is both a theoretical and practical obstacle, and even careful mathematical programming approaches have not been able to solve these problems efficiently. This work introduces the first practical algorithm for optimal decision trees for binary variables. The algorithm is a co-design of analytical bounds that reduce the search space and modern systems techniques, including data structures and a custom bit-vector library. We highlight possible steps to improving the scalability and speed of future generations of this algorithm based on insights from our theory and experiments.", "full_text": "Optimal Sparse Decision Trees\n\nXiyang Hu1, Cynthia Rudin2, Margo Seltzer3\u2217\n1Carnegie Mellon University, xiyanghu@cmu.edu\n\n2Duke University, cynthia@cs.duke.edu\n\n3The University of British Columbia, mseltzer@cs.ubc.ca\n\nAbstract\n\nDecision tree algorithms have been among the most popular algorithms for inter-\npretable (transparent) machine learning since the early 1980\u2019s. 
The problem that has plagued decision tree algorithms since their inception is their lack of optimality, or lack of guarantees of closeness to optimality: decision tree algorithms are often greedy or myopic, and sometimes produce unquestionably suboptimal models. Hardness of decision tree optimization is both a theoretical and practical obstacle, and even careful mathematical programming approaches have not been able to solve these problems efficiently. This work introduces the first practical algorithm for optimal decision trees for binary variables. The algorithm is a co-design of analytical bounds that reduce the search space and modern systems techniques, including data structures and a custom bit-vector library. Our experiments highlight advantages in scalability, speed, and proof of optimality.

1 Introduction

Interpretable machine learning has been growing in importance as society has begun to realize the dangers of using black box models for high stakes decisions: complications with confounding have haunted our medical machine learning models [22], bad predictions from black boxes have announced to millions of people that their dangerous levels of air pollution were safe [15], high-stakes credit risk decisions are being made without proper justification, and black box risk predictions have been wreaking havoc with the perception of fairness of our criminal justice system [10]. In all of these applications – medical imaging, pollution modeling, recidivism risk, credit scoring – accurate interpretable models have been created (by the Centers for Disease Control and Prevention, Arnold Foundation, and others). However, such interpretable-yet-accurate models are not generally easy to construct. If we want people to replace their black box models with interpretable models, the tools to build these interpretable models must first exist.
Decision trees are one of the leading forms of interpretable models.
Despite several attempts over the\nlast several decades to improve the optimality of decision tree algorithms, the CART [7] and C4.5\n[19] decision tree algorithms (and other greedy tree-growing variants) have remained as dominant\nmethods in practice. CART and C4.5 grow decision trees from the top down without backtracking,\nwhich means that if a suboptimal split was introduced near the top of the tree, the algorithm could\nspend many extra splits trying to undo the mistake it made at the top, leading to less-accurate and\nless-interpretable trees. Problems with greedy splitting and pruning have been known since the\nearly 1990\u2019s, when mathematical programming tools had started to be used for creating optimal\nbinary-split decision trees [3, 4], in a line of work [5, 6, 16, 18] until the present [20]. However, these\ntechniques use all-purpose optimization toolboxes and tend not to scale to realistically-sized problems\nunless simpli\ufb01ed to trees of a speci\ufb01c form. Other works [11] make overly strong assumptions (e.g.,\nindependence of all variables) to ensure optimal trees are produced using greedy algorithms.\n\n\u2217Authors are listed alphabetically.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWe produce optimal sparse decision trees taking a different approach than mathematical programming,\ngreedy methods, or brute force. We \ufb01nd optimal trees according to a regularized loss function that\nbalances accuracy and the number of leaves. Our algorithm is computationally ef\ufb01cient due to a\ncollection of analytical bounds to perform massive pruning of the search space. Our implementation\nuses specialized data structures to store intermediate computations and symmetries, a bit-vector\nlibrary to evaluate decision trees more quickly, fast search policies, and computational reuse. 
Despite\nthe hardness of \ufb01nding optimal solutions, our algorithm is able to locate optimal trees and prove\noptimality (or closeness of optimality) in reasonable amounts of time for datasets of the sizes used in\nthe criminal justice system (tens of thousands or millions of observations, tens of features).\nBecause we \ufb01nd provably optimal trees, our experiments show where previous studies have claimed\nto produce optimal models yet failed; we show speci\ufb01c cases where this happens. We test our method\non benchmark data sets, as well as criminal recidivism and credit risk data sets; these are two of\nthe high-stakes decision problems where interpretability is needed most in AI systems. We provide\nablation experiments to show which of our techniques is most in\ufb02uential at reducing computation\nfor various datasets. As a result of this analysis, we are able to pinpoint possible future paths to\nimprovement for scalability and computational speed. Our contributions are: (1) The \ufb01rst practical\noptimal binary-variable decision tree algorithm to achieve solutions for nontrivial problems. (2) A\nseries of analytical bounds to reduce the search space. (3) Algorithmic use of a tree representation\nusing only its leaves. (4) Implementation speedups saving 97% run time. 
(5) We present the first optimal sparse binary split trees ever published for the COMPAS and FICO datasets.
The code and the supplementary materials are available at https://github.com/xiyanghu/OSDT.

2 Related Work

Optimal decision trees have quite a long history [3], so we focus on closely related techniques. There are efficient algorithms that claim to generate optimal sparse trees, but do not optimally balance the criteria of optimality and sparsity; instead they pre-specify the topology of the tree (i.e., they know a priori exactly what the structure of the splits and leaves is, even though they do not know which variables are split) and only find the optimal tree of the given topology [16]. This is not the problem we address, as we do not know the topology of the optimal tree in advance. The most successful algorithm of this variety is BinOCT [20], which searches for a complete binary tree of a given depth; we discuss BinOCT shortly. Some exploration of learning optimal decision trees is based on Boolean satisfiability (SAT) [17], but again, this work looks only for the optimal tree of a given number of nodes. The DL8 algorithm [18] optimizes a ranking function to find a decision tree under constraints on size, depth, accuracy and leaves. DL8 creates trees from the bottom up, meaning that trees are assembled out of all possible leaves, which are itemsets pre-mined from the data [similarly to 2]. DL8 does not have publicly available source code, and its authors warn about running out of memory when storing all partially-constructed trees. Some works consider oblique trees [6], where splits involve several variables; oblique trees are not addressed here, as they can be less interpretable.
The most recent mathematical programming algorithms are OCT [5] and BinOCT [20]. Example figures from the OCT paper [5] show decision trees that are clearly suboptimal.
However, as the code was not made public, the work in the OCT paper [5] is not easily reproducible, so it is not clear where the problem occurred. We discuss this in §4. Verwer and Zhang's mathematical programming formulation for BinOCT is much faster [20], and their experiments indicate that BinOCT outperforms OCT, but since BinOCT is constrained to create complete binary trees of a given depth rather than optimally sparse trees, it sometimes creates unnecessary leaves in order to complete a tree at a given depth, as we show in §4. BinOCT solves a dramatically easier problem than the method introduced in this work. As it turns out, the search space of perfect binary trees of a given depth is much smaller than that of binary trees with the same number of leaves. For instance, the number of different unlabeled binary trees with 8 leaves is Catalan(7) = 429, but the number of unlabeled perfect binary trees with 8 leaves is only 1. In our setting, we penalize (but do not fix) the number of leaves, which means that our search space contains all trees, though we can bound the maximum number of leaves based on the size of the regularization parameter. Therefore, our search space is much larger than that of BinOCT.
Our work builds upon the CORELS algorithm [1, 2, 13] and its predecessors [14, 21], which create optimal decision lists (rule lists). Applying those ideas to decision trees is nontrivial.
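These counts are easy to check numerically. The sketch below (helper names are ours, not the paper's) reproduces the Catalan count just mentioned and, assuming the level-by-level recursion we read out of Eq. (1) and Table 1 in the next section, also reproduces the tabulated search-space sizes.

```python
from math import comb, perm

def catalan(n):
    # Number of distinct unlabeled binary tree shapes with n internal
    # nodes, hence n + 1 leaves: catalan(7) counts trees with 8 leaves.
    return comb(2 * n, n) // (n + 1)

def num_trees(p, depth):
    # Distinct depth-`depth` trees over p binary features, mirroring the
    # level-by-level recursion of Eq. (1): level i picks n_i nodes to split
    # among the 2 * n_{i-1} children of the previous level, and each split
    # uses one of the features still unused along its path (an assumption
    # we made to match the tabulated values).
    def level(i, n_prev):
        if i == depth:
            return 1
        return sum(comb(2 * n_prev, n) * (p - i) ** n * level(i + 1, n)
                   for n in range(1, 2 * n_prev + 1))
    return p * level(1, 1)

def tree_space(p, d):
    # Search space of all trees up to depth d (the "Trees" columns of Table 1).
    return sum(num_trees(p, dt) for dt in range(1, d + 1))

def corels_space(p, d):
    # Rule-list search space: D = p + C(p, 2) pre-mined rules (c = 2 clauses),
    # summed over lists of 1..d rules (the "Rule Lists" columns of Table 1).
    D = p + comb(p, 2)
    return sum(perm(D, k) for k in range(1, d + 1))

print(catalan(7))           # 429 unlabeled binary trees with 8 leaves
print(tree_space(10, 3))    # 5329000, i.e., 5.329e6 as in Table 1
print(corels_space(10, 3))  # 160435, i.e., 1.604e5 as in Table 1
```

By contrast, a perfect binary tree with 8 leaves has exactly one shape, which is why BinOCT's fixed-topology search space is so much smaller.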
The rule list optimization problem is much easier, since the rules are pre-mined (there is no mining of rules in our decision tree optimization). Rule list optimization involves selecting an optimal subset and an optimal permutation of the rules in each subset. Decision tree optimization involves considering every possible split of every possible variable and every possible shape and size of tree. This is an exponentially harder optimization problem with a huge number of symmetries to consider. In addition, in CORELS, the maximum number of clauses per rule is set to be c = 2. For a data set with p binary features, there would be D = p + C(p, 2) rules in total (where C(m, k) denotes the binomial coefficient), and the number of distinct rule lists with dr rules is P(D, dr), where P(m, k) is the number of k-permutations of m. Therefore, the search space of CORELS is Σ_{dr=1}^{d} P(D, dr). But, for a full binary tree with depth dt and data with p binary features, the number of distinct trees is:

N_{dt} = Σ_{n_0=1}^{1} Σ_{n_1=1}^{2n_0} ··· Σ_{n_{dt−1}=1}^{2n_{dt−2}} p × C(2n_0, n_1)(p − 1)^{n_1} × ··· × C(2n_{dt−2}, n_{dt−1})(p − (dt − 1))^{n_{dt−1}},    (1)

and the search space of decision trees up to depth d is Σ_{dt=1}^{d} N_{dt}. Table 1 shows how the search spaces of rule lists and decision trees grow as the tree depth increases. The search space of the trees is massive compared to that of the rule lists.

Search Space of CORELS and Decision Trees

             p = 10                             p = 20
 d   Rule Lists      Trees            Rule Lists      Trees
 1   5.500 × 10^1    1.000 × 10^1     2.100 × 10^2    2.000 × 10^1
 2   3.025 × 10^3    1.000 × 10^3     4.410 × 10^4    8.000 × 10^3
 3   1.604 × 10^5    5.329 × 10^6     9.173 × 10^6    9.411 × 10^8
 4   8.345 × 10^6    9.338 × 10^20    1.898 × 10^9    9.204 × 10^28
 5   4.257 × 10^8    “Inf”            3.911 × 10^11   “Inf”

Table 1: Search spaces of rule lists and decision trees with number of variables p = 10, 20 and depth d = 1, 2, 3, 4, 5. The search space of the trees explodes in comparison.

Applying techniques from rule lists to decision trees necessitated new designs for the data structures, splitting mechanisms and bounds. An important difference between rule lists and trees is that during the growth of rule lists, we add only one new rule to the list each time, but for the growth of trees, we need to split existing leaves and add a new pair of leaves for each. This leads to several bounds that are quite different from those in CORELS, i.e., Theorem 3.4, Theorem 3.5 and Corollary E.1, which consider a pair of leaves rather than a single leaf. In this paper, we introduce bounds only for the case of one split at a time; however, in our implementation, we can split more than one leaf at a time, and the bounds are adapted accordingly.

3 Optimal Sparse Decision Trees (OSDT)

We focus on binary classification, although it is possible to generalize this framework to multiclass settings. We denote training data as {(xn, yn)}_{n=1}^N, where xn ∈ {0, 1}^M are binary features and yn ∈ {0, 1} are labels. Let x = {xn}_{n=1}^N and y = {yn}_{n=1}^N, and let x_{n,m} denote the m-th feature of xn. For a decision tree, its leaves are conjunctions of predicates. Their order does not matter in evaluating the accuracy of the tree, and a tree grows only at its leaves. Thus, within our algorithm, we represent a tree as a collection of leaves. A leaf set d = (p1, p2, . . . , pH) of length H ≥ 0 is an H-tuple containing H distinct leaves, where pk is the classification rule of the path from the root to leaf k.
Here, pk is a Boolean assertion, which evaluates to either true or false for each datum xn, indicating whether it is classified by leaf k. Here, ŷ(leaf)_k is the label for all points so classified.
We explore the search space by considering which leaves of the tree can be beneficially split. The leaf set d = (p1, p2, . . . , pK, pK+1, . . . , pH) is the H-leaf tree, where the first K leaves may not be split, and the remaining H − K leaves can be split. We alternately represent this leaf set as d = (dun, δun, dsplit, δsplit, K, H), where dun = (p1, . . . , pK) are the unchanged leaves of d, δun = (ŷ(leaf)_1, . . . , ŷ(leaf)_K) ∈ {0, 1}^K are the predicted labels of leaves dun, dsplit = (pK+1, . . . , pH) are the leaves we are going to split, and δsplit = (ŷ(leaf)_{K+1}, . . . , ŷ(leaf)_H) ∈ {0, 1}^{H−K} are the predicted labels of leaves dsplit. We call dun a K-prefix of d, which means its leaves are a size-K unchanged subset of (p1, . . . , pK, . . . , pH). If we have a new prefix d′un which is a superset of dun, i.e., d′un ⊇ dun, then we say d′un starts with dun. We define σ(d) to be all descendants of d:

σ(d) = {(d′un, δ′un, d′split, δ′split, K′, Hd′) : d′un ⊇ dun, d′ ⊃ d}.    (2)

If we have two trees d = (dun, δun, dsplit, δsplit, K, H) and d′ = (d′un, δ′un, d′split, δ′split, K′, H′), where H′ = H + 1, d′ ⊃ d, and d′un ⊇ dun, i.e., d′ contains one more leaf than d and d′un starts with dun, then we define d′ to be a child of d and d to be a parent of d′.
Note that two trees with identical leaf sets, but different assignments to dun and dsplit, are different trees. Further, a child tree can only be generated through splitting leaves of its parent tree within dsplit. A tree d classifies datum xn by providing the label prediction ŷ(leaf)_k of the leaf whose pk is true for xn. Here, the leaf label ŷ(leaf)_k is the majority label of data captured by the leaf k. If pk evaluates to true for xn, we say the leaf k of leaf set dun captures xn. In our notation, all the data captured by a prefix's leaves are also captured by the prefix itself.
Let β be a set of leaves.
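The leaf-set representation can be made concrete with a minimal sketch; the toy data and all names here are hypothetical illustrations, not the paper's implementation:

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 binary features.
x = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 1])

# A 3-leaf tree as a leaf set: each leaf is the conjunction of
# (feature, value) predicates on the path from the root to that leaf.
leaves = (((0, 1),),           # feature 0 == 1
          ((0, 0), (1, 0)),    # feature 0 == 0 and feature 1 == 0
          ((0, 0), (1, 1)))    # feature 0 == 0 and feature 1 == 1

def captures(leaf, xn):
    # pk evaluates to true iff xn satisfies every predicate on the path.
    return all(xn[f] == v for f, v in leaf)

# Each leaf predicts the majority label of the data it captures.
labels = []
for leaf in leaves:
    caught = y[[captures(leaf, xn) for xn in x]]
    labels.append(int(caught.sum() * 2 >= len(caught)))

def predict(xn):
    # Leaf order is irrelevant: the leaves partition the data.
    for leaf, label in zip(leaves, labels):
        if captures(leaf, xn):
            return label

print([predict(xn) for xn in x])  # [1, 1, 0, 1] -- matches y on this toy set
```

Splitting a leaf replaces it with two children whose predicate lists each extend the parent's by one new (feature, value) pair, which is exactly how child trees are generated from dsplit.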
We define cap(xn, β) = 1 if a leaf in β captures datum xn, and 0 otherwise. For example, let d and d′ be leaf sets such that d′ ⊃ d; then d′ captures all the data that d captures: {xn : cap(xn, d)} ⊆ {xn : cap(xn, d′)}.
The normalized support of β, denoted supp(β, x), is the fraction of data captured by β:

supp(β, x) = (1/N) Σ_{n=1}^{N} cap(xn, β).    (3)

3.1 Objective Function

For a tree d = (dun, δun, dsplit, δsplit, K, Hd), we define its objective function as a combination of the misclassification error and a sparsity penalty on the number of leaves:

R(d, x, y) = ℓ(d, x, y) + λHd.    (4)

R(d, x, y) is a regularized empirical risk. The loss ℓ(d, x, y) is the misclassification error of d, i.e., the fraction of training data with incorrectly predicted labels. Hd is the number of leaves in the tree d. λHd is a regularization term that penalizes bigger trees. Statistical learning theory provides guarantees for this problem; minimizing the loss subject to a (soft or hard) constraint on model size leads to a low upper bound on test error from the Occam's Razor Bound.

3.2 Optimization Framework

We minimize the objective function based on a branch-and-bound framework. We propose a series of specialized bounds that work together to eliminate a large part of the search space. These bounds are discussed in detail in the following paragraphs. Proofs are in the supplementary materials.
Some of our bounds could be adapted directly from CORELS [2], namely these two:
(Hierarchical objective lower bound) Lower bounds of a parent tree also hold for every child tree of that parent (§3.3, Theorem 3.1).
(Equivalent points bound) For a given dataset, if there are multiple samples with exactly the same features but different labels, then no matter how we build our classifier, we will always make mistakes. The lower bound on the number of mistakes is therefore the number of such samples with minority class labels (§B, Theorem B.2).
Some of our bounds adapt from CORELS [1] with minor changes: (Objective lower bound with one-step lookahead) With respect to the number of leaves, if a tree does not achieve enough accuracy, we can prune all its child trees (§3.3, Lemma 3.2). (A priori bound on the number of leaves) For an optimal decision tree, we provide an a priori upper bound on the maximum number of leaves (§C, Theorem C.3). (Lower bound on node support) For an optimal tree, the support traversing through each internal node must be at least 2λ (§3.4, Theorem 3.3).
Some of our bounds are distinct from CORELS, because they are only relevant to trees and not to lists: (Lower bound on incremental classification accuracy) Each split must result in sufficient reduction of the loss. Thus, if the loss reduction of a split is no more than the regularization term, then at least one of the new child leaves must be split further in the search for the optimal tree (§3.4, Theorem 3.4). (Leaf permutation bound) We need to consider only one permutation of leaves in a tree; we do not need to consider other permutations (explained in §E, Corollary E.1). (Leaf accurate support bound) For each leaf in an optimal decision tree, the number of correctly classified samples must be above a threshold (§3.4, Theorem 3.5). The supplement contains an additional set of bounds on the number of remaining tree evaluations.

3.3 Hierarchical Objective Lower Bound

The loss can be decomposed into two parts corresponding to the unchanged leaves and the leaves to be split: ℓ(d, x, y) ≡ ℓp(dun, δun, x, y) + ℓq(dsplit, δsplit, x, y), where dun = (p1, . . . , pK), δun = (ŷ(leaf)_1, . . . , ŷ(leaf)_K), dsplit = (pK+1, . . . , pHd), and δsplit = (ŷ(leaf)_{K+1}, . . . , ŷ(leaf)_{Hd});

ℓp(dun, δun, x, y) = (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} cap(xn, pk) ∧ 1[ŷ(leaf)_k ≠ yn]

is the proportion of data in the unchanged leaves that are misclassified, and

ℓq(dsplit, δsplit, x, y) = (1/N) Σ_{n=1}^{N} Σ_{k=K+1}^{Hd} cap(xn, pk) ∧ 1[ŷ(leaf)_k ≠ yn]

is the proportion of data in the leaves we are going to split that are misclassified. We define a lower bound b(dun, x, y) on the objective by leaving out the latter loss,

b(dun, x, y) ≡ ℓp(dun, δun, x, y) + λHd ≤ R(d, x, y),    (5)

where the leaves dun are kept and the leaves dsplit are going to be split. Here, b(dun, x, y) gives a lower bound on the objective of any child tree of d.
Theorem 3.1 (Hierarchical objective lower bound). Define b(dun, x, y) = ℓp(dun, δun, x, y) + λHd, as in (5). Define σ(d) to be the set of all d's child trees whose unchanged leaves contain dun, as in (2). For tree d = (dun, δun, dsplit, δsplit, K, Hd) with unchanged leaves dun, let d′ = (d′un, δ′un, d′split, δ′split, K′, Hd′) ∈ σ(d) be any child tree such that its unchanged leaves d′un contain dun and K′ ≥ K, Hd′ ≥ Hd; then b(dun, x, y) ≤ R(d′, x, y).
Consider a sequence of trees, where each tree is the parent of the following tree. In this case, the lower bounds of these trees increase monotonically, which is amenable to branch-and-bound. We illustrate our framework in Algorithm 1 in Supplement A.
According to Theorem 3.1, we can hierarchically prune the search space. During the execution of the algorithm, we cache the current best (smallest) objective Rc, which is dynamic and monotonically decreasing. In this process, when we generate a tree whose unchanged leaves dun correspond to a lower bound satisfying b(dun, x, y) ≥ Rc, then according to Theorem 3.1, we do not need to consider any child tree d′ ∈ σ(d) of this tree whose d′un contains dun.
Based on Theorem 3.1, we describe a consequence in Lemma 3.2.
Lemma 3.2 (Objective lower bound with one-step lookahead). Let d be an Hd-leaf tree with a K-leaf prefix, and let Rc be the current best objective. If b(dun, x, y) + λ ≥ Rc, then for any child tree d′ ∈ σ(d), its prefix d′un starts with dun and K′ > K, Hd′ > Hd, and it follows that R(d′, x, y) ≥ Rc.
This bound tends to be very powerful in practice in pruning the search space, because it states that even though we might have a tree with unchanged leaves dun whose lower bound satisfies b(dun, x, y) ≤ Rc, if b(dun, x, y) + λ ≥ Rc, we can still prune all of its child trees.

3.4 Lower Bounds on Node Support and Classification Accuracy

We provide three lower bounds on the fraction of correctly classified data and the normalized support of leaves in any optimal tree. All of them depend on λ.
Theorem 3.3 (Lower bound on node support). Let d∗ = (dun, δun, dsplit, δsplit, K, Hd∗) be any optimal tree with objective R∗, i.e., d∗ ∈ argmin_d R(d, x, y). For an optimal tree, the support traversing through each internal node must be at least 2λ.
That is, for each child leaf pair pk, pk+1 of a split, the sum of normalized supports of pk, pk+1 should be no less than twice the regularization parameter, i.e., 2λ:

2λ ≤ supp(pk, x) + supp(pk+1, x).    (6)

Therefore, for a tree d, if any of its internal nodes capture less than a fraction 2λ of the samples, it cannot be an optimal tree, even if b(dun, x, y) < R∗. None of its child trees would be an optimal tree either. Thus, after evaluating d, we can prune tree d.

Figure 1: Training accuracy of OSDT, CART, BinOCT on different datasets (time limit: 30 minutes). Horizontal lines indicate the accuracy of the best OSDT tree. On most datasets, all trees of BinOCT and CART are below this line.

Theorem 3.4 (Lower bound on incremental classification accuracy). Let d∗ = (dun, δun, dsplit, δsplit, K, Hd∗) be any optimal tree with objective R∗, i.e., d∗ ∈ argmin_d R(d, x, y). Let d∗ have leaves dun = (p1, . . . , pHd∗) and labels δun = (ŷ(leaf)_1, . . . , ŷ(leaf)_{Hd∗}). For each leaf pair pk, pk+1 in d∗ with corresponding labels ŷ(leaf)_k, ŷ(leaf)_{k+1}, and their parent node (the leaf in the parent tree) pj with its label ŷ(leaf)_j, define ak to be the incremental classification accuracy of splitting pj to get pk, pk+1:

ak ≡ (1/N) Σ_{n=1}^{N} {cap(xn, pk) ∧ 1[ŷ(leaf)_k = yn] + cap(xn, pk+1) ∧ 1[ŷ(leaf)_{k+1} = yn] − cap(xn, pj) ∧ 1[ŷ(leaf)_j = yn]}.    (7)

In this case, λ provides a lower bound, λ ≤ ak.
Thus, when we split a leaf of the parent tree, if the incremental fraction of data that are correctly classified after this split is less than a fraction λ, we need to further split at least one of the two child leaves to search for the optimal tree. Thus, we apply Theorem 3.3 when we split the leaves.
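Both checks reduce to a few lines; the function names and toy masks below are ours, with the `cap_*` arguments standing for boolean capture vectors as in §3:

```python
import numpy as np

def support_ok(cap_k, cap_k1, lam):
    # Theorem 3.3: a split is only worth exploring if the pair of child
    # leaves together capture a normalized support of at least 2 * lam.
    n = len(cap_k)
    return (cap_k.sum() + cap_k1.sum()) / n >= 2 * lam

def incremental_accuracy(y, cap_j, yhat_j, cap_k, yhat_k, cap_k1, yhat_k1):
    # a_k from Eq. (7): the gain in correctly classified fraction obtained
    # by splitting parent leaf j into child leaves k and k+1.
    n = len(y)
    gained = (cap_k & (y == yhat_k)).sum() + (cap_k1 & (y == yhat_k1)).sum()
    had = (cap_j & (y == yhat_j)).sum()
    return (gained - had) / n

# Toy check: splitting a half-wrong parent into two pure children.
y = np.array([0, 0, 1, 1])
parent = np.array([True, True, True, True])
left = np.array([True, True, False, False])
right = np.array([False, False, True, True])
a = incremental_accuracy(y, parent, 0, left, 0, right, 1)
print(a)  # 0.5
```

If the returned ak is below λ, the split does not pay for its own regularization penalty, and by Theorem 3.4 at least one of the two new leaves must itself be split in any optimal tree containing that split.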
We need only split leaves whose normalized supports are no less than 2λ. We apply Theorem 3.4 when constructing the trees. For every new split, we check the incremental accuracy of this split. If it is less than λ, we further split at least one of the two child leaves. Both Theorem 3.3 and Theorem 3.4 are bounds for pairs of leaves. We give a bound on a single leaf's classification accuracy in Theorem 3.5.
Theorem 3.5 (Lower bound on classification accuracy). Let d∗ = (dun, δun, dsplit, δsplit, K, Hd∗) be any optimal tree with objective R∗, i.e., d∗ ∈ argmin_d R(d, x, y). For each leaf (pk, ŷ(leaf)_k) in d∗, the fraction of correctly classified data in leaf k should be no less than λ:

λ ≤ (1/N) Σ_{n=1}^{N} cap(xn, pk) ∧ 1[ŷ(leaf)_k = yn].    (8)

Thus, if in a leaf we consider extending the tree by splitting on a particular feature, and that proposed split leads to less than a fraction λ of correctly classified data going to either side of the split, then this split can be excluded, and we can exclude that feature anywhere further down the tree extending that leaf.

3.5 Incremental Computation

Much of our implementation effort revolves around exploiting incremental computation, designing data structures, and ordering the worklist. Together, these ideas save more than 97% of execution time. We provide the details of our implementation in the supplement.

4 Experiments

We address the following questions through experimental analysis: (1) Do existing methods achieve optimal solutions, and if not, how far are they from optimal? (2) How fast does our algorithm converge given the hardness of the problem it is solving? (3) How much does each of the bounds contribute to the performance of our algorithm?
(4) What do optimal trees look like?
The results of the per-bound performance and memory improvement experiment (Table 2 in the supplement) were run on an m5a.4xlarge instance of AWS's Elastic Compute Cloud (EC2). The instance has 16 2.5GHz virtual CPUs (although we run single-threaded on a single core) and 64 GB of RAM. All other results were run on a personal laptop with a 2.4GHz i5-8259U processor and 16GB of RAM.
We used 7 datasets: five of them are from the UCI Machine Learning Repository [8] (Tic Tac Toe, Car Evaluation, Monk1, Monk2, Monk3). The other two datasets are the ProPublica recidivism data set [12] and the Fair Isaac (FICO) credit risk dataset [9]. We predict which individuals are arrested within two years of release (N = 7,215) for the recidivism data set, and whether an individual will default on a loan for the FICO dataset.
Accuracy and optimality: We tested the accuracy of our algorithm against baseline methods CART and BinOCT [20].
BinOCT is the most recent publicly available method for learning optimal classi\ufb01cation\ntrees and was shown to outperform other previous methods. As far as we know, there is no public\ncode for most of the other relevant baselines, including [5, 6, 16]. One of these methods, OCT [5],\nreports that CART often dominates their performance (see Fig. 4 and Fig. 5 in their paper). Our\nmodels can never be worse than CART\u2019s models even if we stop early, because in our implementation,\nwe use the objective value of CART\u2019s solution as a warm start to the objective value of the current\nbest. Figure 1 shows the training accuracy on each dataset. The time limits for both BinOCT and our\nalgorithm are set to be 30 minutes.\nMain results: (i) We can now evaluate how close\nto optimal other methods are (and they are often\nclose to optimal or optimal). (ii) Sometimes, the\nbaselines are not optimal. Recall that BinOCT\nsearches only for the optimal tree given the\ntopology of the complete binary tree of a cer-\ntain depth. This restriction on the topology mas-\nsively reduces the search space so that BinOCT\nruns quickly, but in exchange, it misses optimal\nsparse solutions that our method \ufb01nds. (iii) Our\nmethod is fast. Our method runs on only one\nthread (we have not yet parallelized it) whereas\nBinOCT is highly optimized; it makes use of\neight threads. Even with BinOCT\u2019s 8-thread par-\nallelism, our method is competitive.\nConvergence: Figure 2 illustrates the behavior of\nOSDT for the ProPublica COMPAS dataset with\n\u03bb = 0.005, for two different scheduling policies\n(curiosity and lower bound, see supplement).\nThe charts show how the current best objective\nvalue Rc and the lower bound b(dun, x, y) vary\nas the algorithm progresses. 
When we schedule\nusing the lower bound, the lower bounds of eval-\nuated trees increase monotonically, and OSDT\ncerti\ufb01es optimality only when the value of the\nlower bound becomes large enough that we can\nprune the remaining search space or when the\nqueue is empty, whichever is reached earlier. Us-\ning curiosity, OSDT \ufb01nds the optimal tree much\nmore quickly than when using the lower bound.\nScalability: Figure 3 shows the scalability of\nOSDT with respect to the number of samples\nand the number of features. Runtime can theoretically grow exponentially with the number of features.\nHowever, as we add extra features that differ from those in the optimal tree, we can reach the optimum\nmore quickly, because we are able to prune the search space more ef\ufb01ciently as the number of\nextra features grows. For example, with 4 features, it spends about 75% of the runtime to reach the\noptimum; with 12 features, it takes about 5% of the runtime to reach the optimum.\n\n(a) This is based on all the\n12 features\nFigure 3: Scalability with respect to number of\nsamples and number of features using (multiples\nof) the ProPublica data set. (\u03bb = 0.005). Note that\nall these executions include the 4 features of the\noptimal tree, and the data size are increased by\nduplicating the whole data set multiple times.\n\nFigure 2: Example OSDT execution traces (COM-\nPAS data, \u03bb = 0.005). 
Lines are the objective value and dashes are the lower bound for OSDT. For each scheduling policy, the time to optimum and the optimal objective value are marked with a star.

(b) The 4 features are those in Figure 4.

Figure 4: An optimal decision tree generated by OSDT on the COMPAS dataset (λ = 0.005, accuracy: 66.90%).

(a) BinOCT (accuracy: 76.722%) (b) OSDT (accuracy: 82.881%)
Figure 5: Eight-leaf decision trees generated by BinOCT and OSDT on the Tic-Tac-Toe data. Trees of BinOCT must be complete binary trees, while OSDT can generate binary trees of any shape.

(a) BinOCT (accuracy: 91.129%) (b) OSDT (accuracy: 100%)
Figure 6: Decision trees generated by BinOCT and OSDT on the Monk1 dataset. The tree generated by BinOCT includes two useless splits (the left and right splits), while OSDT avoids this problem. BinOCT is 91% accurate; OSDT is 100% accurate.

Ablation experiments: Appendix I shows that the lookahead and equivalent-points bounds are, by far, the most significant of our bounds, reducing time to optimum by at least two orders of magnitude and reducing memory consumption by more than one order of magnitude.
Trees: We provide illustrations of the trees produced by OSDT and the baseline methods in Figures 4, 5 and 6. OSDT generates trees of any shape, and our objective penalizes trees with more leaves, so it never introduces splits that produce a pair of leaves with the same label. In contrast, BinOCT trees are always complete binary trees of a given depth.
This limitation on the tree shape can prevent BinOCT from finding the globally optimal tree. In fact, BinOCT often produces useless splits, leading to trees with more leaves than necessary to achieve the same accuracy.
Additional experiments: It is well established that simpler models such as small decision trees generalize well; a set of cross-validation experiments in the supplement demonstrates this.
Conclusion: Our work shows the possibility of optimal (or provably near-optimal) sparse decision trees. It is the first work to balance the accuracy and the number of leaves optimally in a practical amount of time. We have reason to believe this framework can be extended to much larger datasets. Theorem F.1 identifies a key mechanism for scaling these algorithms up: it suggests a bound stating that highly correlated features can substitute for each other, leading to similar model accuracies. Applications of this bound allow for the elimination of features throughout the entire execution, allowing for more aggressive pruning. Our experience to date shows that supporting such bounds with the right data structures can lead to dramatic increases in performance and scalability.

References
[1] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin. Learning certifiably optimal rule lists for categorical data. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2017.
[2] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin.
Learning certifiably optimal rule lists for categorical data. Journal of Machine Learning Research, 18(234):1–78, 2018.
[3] K. Bennett. Decision tree construction via linear programming. In Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, Utica, Illinois, 1992.
[4] K. P. Bennett and J. A. Blue. Optimal decision trees. Technical report, R.P.I. Math Report No. 214, Rensselaer Polytechnic Institute, 1996.
[5] D. Bertsimas and J. Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
[6] R. Blanquero, E. Carrizosa, C. Molero-Río, and D. R. Morales. Optimal randomized classification trees. Aug. 2018.
[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
[8] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
[9] FICO, Google, Imperial College London, MIT, University of Oxford, UC Irvine, and UC Berkeley. Explainable Machine Learning Challenge. https://community.fico.com/s/explainable-machine-learning-challenge, 2018.
[10] A. W. Flores, C. T. Lowenkamp, and K. Bechtel. False positives, false negatives, and false analyses: A rejoinder to "Machine bias: There's software used across the country to predict future criminals". Federal Probation, 80(2), September 2016.
[11] A. R. Klivans and R. A. Servedio. Toward attribute efficient learning of decision lists and parities. Journal of Machine Learning Research, 7:587–602, 2006.
[12] J. Larson, S. Mattu, L. Kirchner, and J. Angwin. How we analyzed the COMPAS recidivism algorithm. ProPublica, 2016.
[13] N. Larus-Stone, E. Angelino, D. Alabi, M. Seltzer, V. Kaxiras, A. Saligrama, and C. Rudin. Systems optimizations for learning certifiably optimal rule lists. In Proc. Conference on Systems and Machine Learning (SysML), 2018.
[14] B. Letham, C. Rudin, T. H.
McCormick, and D. Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
[15] M. McGough. How bad is Sacramento's air, exactly? Google results appear at odds with reality, some say. Sacramento Bee, 2018.
[16] M. Menickelly, O. Günlük, J. Kalagnanam, and K. Scheinberg. Optimal decision trees for categorical data via integer programming. Preprint at arXiv:1612.03225, Jan. 2018.
[17] N. Narodytska, A. Ignatiev, F. Pereira, and J. Marques-Silva. Learning optimal decision trees with SAT. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pages 1362–1368, 2018.
[18] S. Nijssen and E. Fromont. Mining optimal decision trees from itemset lattices. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 530–539. ACM, 2007.
[19] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[20] S. Verwer and Y. Zhang. Learning optimal classification trees using a binary linear program formulation. In 33rd AAAI Conference on Artificial Intelligence, 2019.
[21] H. Yang, C. Rudin, and M. Seltzer. Scalable Bayesian rule lists. In International Conference on Machine Learning (ICML), 2017.
[22] J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med., 15(e1002683), 2018.