{"title": "Pruning Random Forests for Prediction on a Budget", "book": "Advances in Neural Information Processing Systems", "page_first": 2334, "page_last": 2342, "abstract": "We propose to prune a random forest (RF) for resource-constrained prediction. We first construct a RF and then prune it to optimize expected feature cost & accuracy. We pose pruning RFs as a novel 0-1 integer program with linear constraints that encourages feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm, scalable to large datasets. In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down acquiring features based on their utility value and is generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.", "full_text": "Pruning Random Forests for Prediction on a Budget\n\nFeng Nan\n\nSystems Engineering\n\nBoston University\nfnan@bu.edu\n\nJoseph Wang\n\nElectrical Engineering\n\nBoston University\njoewang@bu.edu\n\nVenkatesh Saligrama\nElectrical Engineering\n\nBoston University\n\nsrv@bu.edu\n\nAbstract\n\nWe propose to prune a random forest (RF) for resource-constrained prediction. We\n\ufb01rst construct a RF and then prune it to optimize expected feature cost & accuracy.\nWe pose pruning RFs as a novel 0-1 integer program with linear constraints that\nencourages feature re-use. We establish total unimodularity of the constraint set\nto prove that the corresponding LP relaxation solves the original integer program.\nWe then exploit connections to combinatorial optimization and develop an ef\ufb01cient\nprimal-dual algorithm, scalable to large datasets. 
In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down, acquiring features based on their utility value, and are generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.

1 Introduction

Many modern classification systems, including internet applications (such as web-search engines, recommendation systems, and spam filtering) and security & surveillance applications (such as wide-area surveillance and classification on large video corpora), face the challenge of prediction-time budget constraints [21]. Prediction-time budgets can arise due to monetary costs associated with acquiring information, or due to the computation time (or delay) involved in extracting features and running the algorithm. We seek to learn a classifier, by training on fully annotated datasets, that maintains high accuracy while meeting average resource constraints at prediction time. We consider a system that adaptively acquires features as needed, depending on the instance (example), to achieve high classification accuracy at reduced feature acquisition cost.

We propose a two-stage algorithm. In the first stage, we train a random forest (RF) of trees using an impurity function such as entropy or a more specialized cost-adaptive impurity [16]. Our second stage takes a RF as input and jointly prunes the trees in the forest to meet global resource constraints. At prediction time, an example is routed through all the trees in the ensemble to the corresponding leaf nodes, and the final prediction is based on a majority vote. The total feature cost for a test example is the sum of acquisition costs of the unique features¹ acquired for the example over the entire ensemble of trees in the forest.²
We derive an efficient scheme to learn a globally optimal pruning of a RF that minimizes the empirical error and the incurred average cost. We formulate the pruning problem as a 0-1 integer linear program that incorporates feature-reuse constraints. By establishing total unimodularity of the constraint set, we show that solving the linear program relaxation of the integer program yields the optimal solution to the integer program, resulting in a polynomial time algorithm for optimal pruning.

¹When an example arrives at an internal node, the feature associated with the node is used to direct the example. If the feature has never been acquired for the example, an acquisition cost is incurred. Otherwise, no acquisition cost is incurred, as we assume that feature values are stored once computed.

²For time-sensitive cases such as web-search we parallelize the implementation by creating parallel jobs across all features and trees. We can then terminate jobs based on what features are returned.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Table 1: Typical feature usage in a 40 tree RF before and after pruning (our algorithm) on the MiniBooNE dataset. Columns 2-4 list the percentage of test examples that do not use the feature, use it 1 to 7 times, and use it more than 7 times, respectively. Before pruning, 91% of examples use the feature only a few (1 to 7) times, paying a significant cost for its acquisition; after pruning, 68% of the examples no longer use this feature, reducing cost with minimal error increase. Column 5 is the average feature cost (the average number of unique features used by test examples). Column 6 is the test error of the RFs. Overall, pruning dramatically reduces average feature cost while maintaining the same error level.

             | No Usage | 1-7   | > 7  | Cost | Error
Unpruned RF  | 7.3%     | 91.7% | 1%   | 42.0 | 6.6%
BudgetPrune  | 68.3%    | 31.5% | 0.2% | 24.3 | 6.7%

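The per-example cost accounting just described (each unique feature is paid for at most once across the whole ensemble, then cached; see footnote 1) can be sketched as follows. The nested-dict tree encoding and the toy costs are illustrative assumptions, not the paper's implementation.

```python
def forest_feature_cost(trees, x, costs):
    """Total acquisition cost for one example: each feature is charged
    at most once across the entire ensemble (caching assumption)."""
    acquired = set()
    for tree in trees:
        node = tree["root"]
        while node["feature"] is not None:      # internal node: a feature test
            acquired.add(node["feature"])       # cached after first acquisition
            branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
            node = node[branch]
    return sum(costs[k] for k in acquired)

# Toy ensemble of two small trees; both test feature 0, one may also test feature 1.
leaf = {"feature": None}
t1 = {"root": {"feature": 0, "threshold": 0.5, "left": leaf, "right": leaf}}
t2 = {"root": {"feature": 0, "threshold": 0.2, "left": leaf,
               "right": {"feature": 1, "threshold": 0.0, "left": leaf, "right": leaf}}}
costs = {0: 1.0, 1: 3.0}
print(forest_feature_cost([t1, t2], {0: 0.9, 1: 0.1}, costs))  # feature 0 charged once, plus feature 1 -> 4.0
```

Note that an example routed left in the second tree would never reach the feature-1 test and would pay only for feature 0; this adaptivity is what the pruning objective exploits.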
We develop a primal-dual algorithm by leveraging results from network-flow theory to scale the linear program to large datasets. Empirically, this pruning outperforms state-of-the-art resource-efficient algorithms on benchmarked datasets.

Our approach is motivated by the following considerations: (i) RFs are scalable to large datasets and produce flexible decision boundaries, yielding high prediction-time accuracy. The sequential feature usage of decision trees lends itself to adaptive feature acquisition. (ii) RF feature usage is superfluous, utilizing features with introduced randomness to increase diversity and generalization. By selectively pruning features that are sparsely used across trees, pruning can yield significant cost reduction with negligible performance loss (thanks to the majority vote). See Table 1. (iii) Optimal pruning encourages examples to use features either a large number of times, allowing for complex decision boundaries in the space of those features, or not at all, avoiding the cost of acquisition. It exploits the fact that once a feature is acquired for an example, repeated use incurs no additional acquisition cost; intuitively, features should be reused to increase discriminative ability without incurring further cost. (iv) Resource-constrained prediction has conventionally been viewed as a top-down (tree-growing) approach, wherein new features are acquired based on their utility value. This is often an intractable problem with combinatorial (feature subsets) and continuous (classifiers) components, requiring several relaxations and heuristics. In contrast, ours is a bottom-up approach that starts with a good initialization (the RF) and prunes to realize the optimal cost-accuracy tradeoff. 
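A minimal numerical illustration of this cost-accuracy tradeoff, phrased as the error + λ·cost objective formalized in Eqn. (1) below. The candidate (error, average cost) pairs are hypothetical, loosely echoing the Table 1 numbers.

```python
# Hedged sketch: how a tradeoff parameter lam selects among candidate prunings,
# each summarized by a hypothetical (empirical error, average feature cost) pair.
def best_pruning(candidates, lam):
    # minimize empirical error + lam * expected feature cost
    return min(candidates, key=lambda ec: ec[0] + lam * ec[1])

candidates = [(0.066, 42.0), (0.067, 24.3), (0.10, 5.0)]  # (error, avg cost), made up
print(best_pruning(candidates, lam=0.0))    # no cost penalty: most accurate pruning wins
print(best_pruning(candidates, lam=0.001))  # moderate penalty: a much cheaper pruning wins
```

Sweeping `lam` from 0 upward traces out a sequence of operating points from "accurate but expensive" to "cheap but coarse", which is how tradeoff curves are generated later in the experiments.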
While we do not pursue it here, our approach can also be used in conjunction with existing approaches.

Related Work: Learning decision rules to minimize error subject to a budget constraint at prediction time is an area of recent interest, and many approaches have been proposed to solve the prediction-time budget constrained problem [9, 22, 19, 20, 12]. These approaches focus on learning complex adaptive decision functions and can be viewed as orthogonal to our work. Conceptually, these are top-down "growing" methods, as described earlier (see (iv)). Ours is a bottom-up approach that seeks to prune complex classifiers to trade off cost against accuracy.

Our work is based on RF classifiers [3]. Traditionally, feature cost is not incorporated when constructing RFs; however, recent work has approximated budget constraints to learn budgeted RFs [16]. The tree-growing algorithm in [16] does not take feature re-use into account. Rather than attempting to approximate the budget constraint during tree construction, our work focuses on pruning ensembles of trees subject to a budget constraint. Methods such as traditional ensemble learning and budgeted random forests can thus be viewed as complementary to ours.

Decision tree pruning has been studied extensively to improve generalization performance, but we are not aware of any existing pruning method that takes feature costs into account. A popular method for pruning to reduce generalization error is Cost-Complexity Pruning (CCP), introduced by Breiman et al. [4]. CCP trades off classification ability for tree size; however, it does not account for feature costs. As pointed out by Li et al. 
[15], CCP has undesirable "jumps" in the sequence of pruned tree sizes. To alleviate this, they proposed a Dynamic-Program-based Pruning (DPP) method for binary trees. The DPP algorithm is able to obtain optimally pruned trees of all sizes; however, it faces the curse of dimensionality when pruning an ensemble of decision trees while taking feature cost into account. [23, 18] proposed to solve the pruning problem as a 0-1 integer program; again, their formulations do not account for the feature costs that we focus on in this paper. The coupled nature of feature usage makes our problem much harder. In general, pruning RFs has not been a focus of attention, as it is assumed that overfitting can be avoided by constructing an ensemble of trees. While this is true, it often leads to extremely large prediction-time costs. Kulkarni and Sinha [11] provide a survey of methods to prune RFs in order to reduce ensemble size. However, these methods do not explicitly account for feature costs.

2 Learning with Resource Constraints

In this paper, we consider solving the Lagrangian relaxation of the problem of learning under prediction-time resource constraints, also known as the error-cost tradeoff problem:

    min_{f \in F}  E_{(x,y) ~ P}[ err(y, f(x)) ] + \lambda E_{x ~ P_x}[ C(f, x) ],        (1)

where example/label pairs (x, y) are drawn from a distribution P; err(y, ŷ) is the error function; C(f, x) is the cost of evaluating the classifier f on example x; and \lambda is a tradeoff parameter. A larger \lambda places a larger penalty on cost, pushing the classifier towards smaller cost. By adjusting \lambda we can obtain a classifier satisfying the budget constraint. The family of classifiers F in our setting is the space of RFs, and each RF f is composed of T decision trees T_1, ..., T_T.

Our approach: Rather than attempting to construct the optimal ensemble by solving Eqn. 
(1) directly, we instead propose a two-step algorithm: we first construct an ensemble with low prediction error, then prune it by solving Eqn. (1) to produce a pruned ensemble given the input ensemble. By adopting this two-step strategy, we obtain an ensemble with low expected cost while simultaneously preserving low prediction error.

There are many existing methods to construct RFs; the focus of this paper is on the second step, where we propose a novel approach to prune RFs to solve the tradeoff problem of Eqn. (1). Our pruning algorithm can take any RF as input, offering the flexibility to incorporate any state-of-the-art RF algorithm.

3 Pruning with Costs

In this section, we treat the error-cost tradeoff problem of Eqn. (1) as an RF pruning problem. Our key contribution is to formulate pruning as a 0-1 integer program with totally unimodular constraints.

We first define the notation used throughout the paper. A training sample S = {(x^(i), y^(i)) : i = 1, ..., N} is generated i.i.d. from an unknown distribution, where x^(i) \in R^K is the feature vector, with a cost assigned to each of the K features, and y^(i) is the label of the ith example. In the case of multi-class classification, y \in {1, ..., M}, where M is the number of classes. Given a decision tree T, we index its nodes as h \in {1, ..., |T|}, where node 1 is the root node. Let \tilde{T} denote the set of leaf nodes of tree T. The corresponding definitions for T extend to an ensemble of T decision trees {T_t : t = 1, ..., T} by adding a subscript t.

Pruning Parametrization: In order to model ensemble pruning as an optimization problem, we parametrize the space of all prunings of an ensemble. Pruning a decision tree T at an internal node h collapses the subtree of T rooted at h, making h a leaf node. 
We say a pruned tree T^(p) is a valid pruned tree of T if (1) T^(p) is a subtree of T containing the root node 1, and (2) for any h ≠ 1 contained in T^(p), the sibling nodes (the nodes that share the same immediate parent as h in T) are also contained in T^(p). Specifying a pruning is equivalent to specifying the nodes that are leaves in the pruned tree. We therefore introduce, for each node h \in T, the binary variable

    z_h = 1 if node h is a leaf in the pruned tree, and 0 otherwise.

We call the set {z_h, \forall h \in T} the node variables, as they are associated with each node in the tree. Along any root-to-leaf path in a tree T, there should be exactly one node that is a leaf in the pruned tree. Let p(h) denote the set of predecessor nodes: the set of nodes (including h) that lie on the path from the root node to h. The set of valid pruned trees can then be represented as the set of node variables satisfying the constraints \sum_{u \in p(h)} z_u = 1, \forall h \in \tilde{T}. Given a valid pruning for a tree, we now seek to parametrize the error of the pruning.

Pruning error: As in most supervised empirical risk minimization problems, we aim to minimize the error on training data as a surrogate for the expected error. In a decision tree T, each node h is associated with a predicted label, the majority label among the training examples that fall into node h. Let S_h denote the subset of examples in S routed to or through node h in T, and let Pred_h denote the predicted label at h. The number of misclassified examples at h is therefore e_h = \sum_{i \in S_h} 1_{[y^(i) ≠ Pred_h]}. We can thus estimate the error of tree T in terms of the number of misclassified examples in its leaf nodes: (1/N) \sum_{h \in \tilde{T}} e_h, where N = |S| is the total number of examples.

Our goal is to minimize the expected test error of the trees in the random forest, which we empirically approximate, based on the aggregated probability distribution in Step (6) of Algorithm 1, by (1/(TN)) \sum_{t=1}^{T} \sum_{h \in \tilde{T}_t} e_h. We can express this error in terms of the node variables: (1/(TN)) \sum_{t=1}^{T} \sum_{h \in T_t} e_h z_h.

Pruning cost: Assume the acquisition costs of the K features, {c_k : k = 1, ..., K}, are given. The feature acquisition cost incurred by an example is the sum of the acquisition costs of the unique features acquired in the process of running the example through the forest. This cost structure arises from the assumption that an acquired feature is cached, so subsequent usage by the same example incurs no additional cost. Formally, the feature cost of classifying an example i on the ensemble T_[T] is given by C_feature(T_[T], x^(i)) = \sum_{k=1}^{K} c_k w_{k,i}, where the binary variables w_{k,i} serve as indicators:

    w_{k,i} = 1 if feature k is used by x^(i) in any T_t, t = 1, ..., T, and 0 otherwise.

The expected feature cost of a test example can then be approximated by (1/N) \sum_{i=1}^{N} \sum_{k=1}^{K} c_k w_{k,i}.

In some scenarios it is useful to account for computation cost along with feature acquisition cost at prediction time. In an ensemble, this corresponds to the expected number of Boolean operations required to run a test example through the trees, which equals the expected depth of the trees. This can be modeled as (1/N) \sum_{t=1}^{T} \sum_{h \in T_t} |S_h| d_h z_h^(t), where d_h is the depth of node h.

Putting it together: Having modeled the pruning constraints, prediction performance, and costs, we formulate the pruning problem using the relationship between the node variables z_h and the feature usage variables w_{k,i}. Given a tree T, feature k, and example x^(i), let u_{k,i} be the first node associated with feature k on the root-to-leaf path the example follows in T. Feature k is used by x^(i) if and only if none of the nodes between the root and u_{k,i} is a leaf. We represent this by the constraint w_{k,i} + \sum_{h \in p(u_{k,i})} z_h = 1 for every feature k used by example x^(i) in T. Recall that w_{k,i} indicates whether feature k is used by example i, and p(u_{k,i}) denotes the set of predecessor nodes of u_{k,i}. Intuitively, this constraint says that either the tree is pruned along the path followed by example i before feature k is acquired, in which case z_h = 1 for some node h \in p(u_{k,i}) and w_{k,i} = 0; or w_{k,i} = 1, indicating that feature k is acquired for example i. We extend the notation to ensemble pruning with tree index t: z_h^(t) indicates whether node h in T_t is a leaf after pruning; w_{k,i}^(t) indicates whether feature k is used by the ith example in T_t; w_{k,i} indicates whether feature k is used by the ith example in any of the T trees T_1, ..., T_T; u_{t,k,i} is the first node associated with feature k on the root-to-leaf path the example follows in T_t; and K_{t,i} denotes the set of features the ith example uses in tree T_t. We arrive at the following integer program:

    min_{z_h^(t), w_{k,i}^(t), w_{k,i} \in {0,1}}
        (1/(NT)) \sum_{t=1}^{T} \sum_{h \in T_t} e_h^(t) z_h^(t)                                  [error]
        + \lambda ( \sum_{k=1}^{K} c_k (1/N) \sum_{i=1}^{N} w_{k,i}                               [feature acquisition cost]
                  + (1/N) \sum_{t=1}^{T} \sum_{h \in T_t} |S_h| d_h z_h^(t) )                     [computational cost]        (IP)

    s.t.  \sum_{u \in p(h)} z_u^(t) = 1,                          \forall h \in \tilde{T}_t, \forall t \in [T],   (feasible prunings)
          w_{k,i}^(t) + \sum_{h \in p(u_{t,k,i})} z_h^(t) = 1,    \forall k \in K_{t,i}, \forall i \in S, \forall t \in [T],   (feature usage / tree)
          w_{k,i}^(t) \le w_{k,i},                                \forall k \in [K], \forall i \in S, \forall t \in [T].   (global feature usage)

Totally unimodular constraints: Even though integer programs are NP-hard to solve in general, we show that (IP) can be solved exactly via its LP relaxation. We prove this in two steps: first, we examine the special structure of the equality constraints; then we examine the inequality constraints that couple the trees. Recall that a network matrix is one in which each column has exactly one element equal to 1, one element equal to -1, and all remaining elements equal to 0. A network matrix defines a directed graph with the nodes in the rows and arcs in the columns. 
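To make (IP) concrete, the following sketch solves its LP relaxation with scipy for a tiny single-tree instance (the five-node tree used in Figure 1 below: feature 1 at the root node 1, feature 2 at node 3, one example routed to node 4). The per-node error counts and feature costs are made-up numbers; per Theorem 3.2 below, the LP optimum comes out integral.

```python
import numpy as np
from scipy.optimize import linprog

# Variables: [z1, z2, z3, z4, z5, w_{1,i}, w_{2,i}]
e = np.array([10.0, 2.0, 5.0, 1.0, 1.0])   # hypothetical error counts e_h per node
c_feat = np.array([1.0, 1.0])              # hypothetical acquisition costs of features 1, 2

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0],   # leaf 2:  z1 + z2 = 1           (feasible prunings)
    [1, 0, 1, 1, 0, 0, 0],   # leaf 4:  z1 + z3 + z4 = 1
    [1, 0, 1, 0, 1, 0, 0],   # leaf 5:  z1 + z3 + z5 = 1
    [1, 0, 0, 0, 0, 1, 0],   # feature 1 at root:   w1 + z1 = 1      (feature usage)
    [1, 0, 1, 0, 0, 0, 1],   # feature 2 at node 3: w2 + z1 + z3 = 1
])
b_eq = np.ones(5)

def prune(lam):
    obj = np.concatenate([e, lam * c_feat])   # error + lam * feature cost
    res = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 7, method="highs")
    return np.round(res.x).astype(int)

print(prune(lam=0.0))    # keep the full tree: z = [0,1,0,1,1], both features used
print(prune(lam=100.0))  # heavy cost penalty: prune to the root, no features acquired
```

Despite relaxing the 0-1 constraints to [0, 1], the solver returns 0-1 solutions at both extremes, which is exactly the total-unimodularity phenomenon the paper establishes.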
We have the following lemma.

[Figure 1: A decision tree example with node numbers 1-5 and associated features as subscripts (feature 1 at node 1, feature 2 at node 3), together with the constraint matrix and its equivalent network-matrix form.]

Lemma 3.1 The equality constraints in (IP) can be turned into an equivalent network matrix form for each tree.

Proof We observe that the first constraint, \sum_{u \in p(h)} z_u^(t) = 1, requires the sum of the node variables along a path to be 1. The second constraint, w_{k,i}^(t) + \sum_{h \in p(u_{t,k,i})} z_h^(t) = 1, has a similar sum, except for the variable w_{k,i}^(t). Imagine w_{k,i}^(t) as yet another node variable for a fictitious child node of u_{t,k,i}; the two equations are then essentially equivalent. The rest of the proof follows directly from the construction in Proposition 3 of [18].

Figure 1 illustrates such a construction. The nodes are numbered 1 to 5. The subscripts at nodes 1 and 3 are the indices of the features used at those nodes. Since the equality constraints in (IP) separate across trees, for simplicity we consider only one tree and one example, routed to node 4. The equality constraints can be organized in matrix form as shown in the middle of Figure 1. Through row operations, the constraint matrix can be transformed into an equivalent network matrix. 
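As an illustrative sanity check on this structure (not part of the paper's proof), total unimodularity of a small constraint matrix can be verified by brute force: every square submatrix must have determinant -1, 0, or 1. Here we check the 5x7 equality-constraint matrix of the Figure 1 example.

```python
import itertools
import numpy as np

def is_totally_unimodular(A):
    """Brute force TU check: every square submatrix has det in {-1, 0, 1}.
    Only feasible for tiny matrices like this illustrative one."""
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in itertools.combinations(range(m), k):
            for cols in itertools.combinations(range(n), k):
                d = round(np.linalg.det(A[np.ix_(rows, cols)]))
                if d not in (-1, 0, 1):
                    return False
    return True

# Equality constraints of the Figure 1 example (columns z1..z5, w1, w2).
A = np.array([
    [1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1],
])
print(is_totally_unimodular(A))  # True
```

Reordering the rows of this matrix gives it the consecutive-ones (interval) property in every column, a classical sufficient condition for total unimodularity, which is consistent with the check returning True.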
Such a transformation always works as long as the leaf nodes are arranged in pre-order. Next, we deal with the inequality constraints and obtain our main result.

Theorem 3.2 The LP relaxation of (IP), where the 0-1 integer constraints are relaxed to interval constraints [0, 1] for all integer variables, has integral optimal solutions.

Due to space limits, the proof can be found in the Suppl. Material. The main idea is to show that the constraints remain totally unimodular even after adding the coupling constraints, so the relaxed polyhedron has only integral extreme points [17]. As a result, solving the LP relaxation yields the optimal solution to the integer program (IP), allowing for polynomial time optimization.³

³The totally unimodular structure is due to our specific formulation. See the Suppl. Material for an alternative formulation that does not have this property.

Algorithm 1 BUDGETPRUNE
During Training: input - ensemble (T_1, ..., T_T), training/validation data with labels, \lambda
1: initialize dual variables \beta_{k,i}^(t) <- 0.
2: update z_h^(t), w_{k,i}^(t) for each tree t (shortest-path algorithm); set w_{k,i} = 0 if \mu_{k,i} > 0, w_{k,i} = 1 if \mu_{k,i} < 0.
3: \beta_{k,i}^(t) <- [\beta_{k,i}^(t) + \gamma (w_{k,i}^(t) - w_{k,i})]_+ for step size \gamma, where [.]_+ = max{0, .}.
4: go to Step 2 until the duality gap is small enough.
During Prediction: input - test example x
5: Run x through each tree to a leaf; obtain the probability distribution p_t over label classes at the leaf.
6: Aggregate p = (1/T) \sum_{t=1}^{T} p_t. Predict the class with the highest probability in p.

4 A Primal-Dual Algorithm

Even though we can solve (IP) via its LP relaxation, the resulting LP can be too large in practical applications for any general-purpose LP solver. In particular, the number of variables and constraints is roughly O(T x |T_max| + N x T x K_max), where T is the number of trees, |T_max| is the maximum number of nodes in a tree, N is the number of examples, and K_max is the maximum number of features an example uses in a tree. The runtime of the LP thus scales as O(T^3) with the number of trees in the ensemble, limiting the application to only small ensembles. In this section we propose a primal-dual approach that effectively decomposes the optimization into many sub-problems. Each sub-problem corresponds to a tree in the ensemble and can be solved efficiently as a shortest path problem. The runtime per iteration is O((T/p)(|T_max| + N x K_max) log(|T_max| + N x K_max)), where p is the number of processors. We can thus massively parallelize the optimization and scale to much larger ensembles, as the runtime depends only linearly on T/p. To this end, we assign dual variables \beta_{k,i}^(t) to the inequality constraints w_{k,i}^(t) \le w_{k,i} and derive the dual problem:

    max_{\beta_{k,i}^(t) \ge 0}  min_{z_h^(t), w_{k,i}^(t), w_{k,i} \in [0,1]}
        (1/(NT)) \sum_{t=1}^{T} \sum_{h \in T_t} ê_h^(t) z_h^(t)
        + \lambda ( \sum_{k=1}^{K} c_k (1/N) \sum_{i=1}^{N} w_{k,i} )
        + \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{k \in K_{t,i}} \beta_{k,i}^(t) ( w_{k,i}^(t) - w_{k,i} )

    s.t.  \sum_{u \in p(h)} z_u^(t) = 1,                        \forall h \in \tilde{T}_t, \forall t \in [T],
          w_{k,i}^(t) + \sum_{h \in p(u_{t,k,i})} z_h^(t) = 1,  \forall k \in K_{t,i}, \forall i \in S, \forall t \in [T],

where for simplicity we have combined the coefficients of z_h^(t) in the objective of (IP) into ê_h^(t). The primal-dual algorithm is summarized in Algorithm 1. It alternates between updating the primal and the dual variables. The key observation is that, given the dual variables, the primal problem (the inner minimization) decomposes across the trees of the ensemble and can be solved in parallel as shortest path problems, due to Lemma 3.1. 
(See also the Suppl. Material.) The primal variables w_{k,i} can be solved for in closed form: simply compute \mu_{k,i} = \lambda c_k / N - \sum_{t \in T_{k,i}} \beta_{k,i}^(t), where T_{k,i} is the set of trees in which example i encounters feature k. Then w_{k,i} should be set to 0 if \mu_{k,i} > 0 and to 1 if \mu_{k,i} < 0.

Note that our prediction rule aggregates the leaf distributions from all trees instead of just their predicted labels. In the case where the leaves are pure (each leaf contains only one class of examples), this rule coincides with the majority vote commonly used in random forests. Whenever the leaves contain mixed classes, this rule takes into account the prediction confidence of each tree, in contrast to majority voting. Empirically, this rule consistently gives lower prediction error than majority voting with pruned trees.

5 Experiments

We test our pruning algorithm BUDGETPRUNE on four benchmark datasets used for prediction-time budget algorithms. The first two datasets have unknown feature acquisition costs, so we assign a cost of 1 to all features; the aim is to show that BUDGETPRUNE successfully selects a sparse subset of features on average to classify each example with high accuracy.⁴ The last two datasets have real feature acquisition costs measured in units of CPU time; BUDGETPRUNE achieves high prediction accuracy while spending much less CPU time on feature acquisition.

For each dataset we first train a RF and apply BUDGETPRUNE to it using different \lambda's to obtain various points on the accuracy-cost tradeoff curve. We use in-bag data to estimate the error probability at each node and the validation data for the feature cost variables w_{k,i}. We implement BUDGETPRUNE using the CPLEX [1] network flow solver for the primal update step. 
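The dual update (Step 3 of Algorithm 1) and the closed-form primal update for the global usage variables described above amount to simple vector operations; a minimal sketch follows. The per-tree shortest-path solves for z and w (Step 2) are abstracted away, and all numbers are illustrative, not from the paper.

```python
import numpy as np

def dual_step(beta, w_tree, w_global, gamma):
    """Projected dual ascent (Step 3 of Algorithm 1):
    beta <- [beta + gamma * (w_tree - w_global)]_+"""
    return np.maximum(0.0, beta + gamma * (w_tree - w_global))

def closed_form_w(lam, c, beta, trees_using):
    """Closed-form update for a single global usage variable w_{k,i}:
    mu = lam * c_k / N - sum of duals over trees where example i meets feature k
    (here with N = 1 for simplicity); w = 0 if mu > 0, else 1."""
    mu = lam * c - beta[trees_using].sum()
    return 0.0 if mu > 0 else 1.0

# Toy numbers for one (feature, example) pair across a 3-tree ensemble:
beta = np.array([0.2, 0.0, 0.5])        # duals beta^{(t)}_{k,i}
w_tree = np.array([1.0, 0.0, 1.0])      # per-tree usage from the shortest-path solves
w_global = np.array([0.0, 0.0, 0.0])    # current global usage, broadcast per tree
print(dual_step(beta, w_tree, w_global, gamma=0.1))            # duals grow where trees disagree with w
print(closed_form_w(lam=1.0, c=0.5, beta=beta, trees_using=[0, 2]))  # mu = 0.5 - 0.7 < 0, so w = 1.0
```

Intuitively, the duals accumulate a "price" wherever a tree wants to use a feature that the global variable says is unpaid for; once that price exceeds the feature's amortized cost, the closed-form rule flips w to 1.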
The running time is significantly reduced (from hours down to minutes) compared to directly solving the LP relaxation of (IP) using standard solvers such as Gurobi [10]. Furthermore, the standard solvers simply break on the larger experiments, whereas BUDGETPRUNE handles them with ease. We run the experiments 10 times and report means and standard deviations. Details of the datasets and the parameter settings of the competing methods are included in the Suppl. Material.

Competing methods: We compare against four other approaches. (i) BUDGETRF [16]: the recursive node splitting process for each tree is stopped as soon as node impurity (entropy or Pairs) falls below a threshold. The threshold is a measure of the impurity tolerated in the leaf nodes. This can be considered a naive pruning method, as it reduces feature acquisition cost while maintaining low impurity in the leaves. (ii) Cost-Complexity Pruning (CCP) [4]: it iteratively prunes subtrees such that the resulting tree has low error and small size. We perform CCP on individual trees to different levels to obtain various points on the accuracy-cost tradeoff curve. CCP does not take feature costs into account. (iii) GREEDYPRUNE: a greedy global feature pruning strategy that we propose; at each iteration it attempts to remove all nodes corresponding to one feature from the RF such that the resulting pruned RF has the lowest training error and average feature cost. The process terminates in at most K iterations, where K is the number of features. The idea is to reduce feature costs by successively removing features that yield large cost reductions at small accuracy loss. We also compare against the state-of-the-art method in budgeted learning, (iv) GREEDYMISER [22]: a modification of gradient boosted regression trees [8] that incorporates feature cost. Specifically, each weak learner (a low-depth decision tree) is built to minimize the squared loss with respect to the current gradient at the training examples plus the feature acquisition cost. When building each weak learner, the costs of features already used in previous weak learners are set to zero. Other prediction-time budget algorithms such as ASTC [12], CSTC [21] and cost-weighted l-1 classifiers have been shown to perform strictly worse than GREEDYMISER by a significant amount [12, 16], so we omit them from our plots. Since only the feature acquisition costs are standardized, for fair comparison we do not include the computation cost term in the objective of (IP) and focus instead on feature acquisition costs.

⁴In contrast to traditional sparse feature selection, our algorithm allows adaptivity, meaning different examples use different subsets of features.

[Figure 2: Comparison of BUDGETPRUNE against CCP, BUDGETRF with early stopping, GREEDYPRUNE and GREEDYMISER on 4 real world datasets: (a) MiniBooNE, (b) Forest Covertype, (c) Yahoo! Rank, (d) Scene15. BUDGETPRUNE (red) outperforms competing state-of-the-art methods. GREEDYMISER dominates ASTC [12], CSTC [21] and DAG [20] significantly on all datasets; we omit them in the plots to clearly depict the differences between competing methods.]

MiniBooNE Particle Identification and Forest Covertype datasets [7]: Feature costs are uniform in both datasets. Our base RF consists of 40 trees using the entropy split criterion and choosing from the full set of features at each split. As shown in (a) and (b) of Figure 2, BUDGETPRUNE (in red) achieves the best accuracy-cost tradeoff. The advantage of BUDGETPRUNE is particularly large in (b). GREEDYMISER has lower accuracy in the high budget region compared to BUDGETPRUNE in (a) and significantly lower accuracy in (b). The gap between BUDGETPRUNE and the other pruning methods is small in (a) but much larger in (b), indicating large gains from globally encouraging feature sharing in (b) compared to (a). In both datasets, BUDGETPRUNE successfully prunes away a large number of features while maintaining high accuracy. For example, in (a), using only 18 unique features on average instead of 40, we get essentially the same accuracy as the original RF.

Yahoo! Learning to Rank [6]: This ranking dataset consists of 473134 web documents and 19944 queries. Each example in the dataset contains features of a query-document pair together with the relevance rank of the document to the query. There are 141397/146769/184968 examples in the training/validation/test sets. There are 519 features for each example; each feature is associated with an acquisition cost in the set {1, 5, 20, 50, 100, 150, 200}, which represents the units of CPU time required to extract the feature and is provided by a Yahoo! employee. The labels are binarized so that a document is either relevant or not relevant to the query. The task is to learn a model that takes a new query and its associated set of documents and produces an accurate ranking using as little feature cost as possible. 
As in [16], we use Average Precision@5 as the performance metric, which gives a high reward for ranking the relevant documents on top. Our base RF consists of 140 trees built with the cost-weighted entropy split criterion of [16], choosing from a random subset of 400 features at each split. As shown in (c) of Figure 2, BUDGETPRUNE achieves ranking accuracy similar to GREEDYMISER at only 30% of its cost.
Scene15 [13]: This scene recognition dataset contains 4485 images from 15 scene classes (labels). Following [22], we divide it into 1500/300/2685 examples for the training/validation/test sets. We use a diverse set of visual descriptors and object detectors from the Object Bank [14]. We treat each individual detector as an independent descriptor, for a total of 184 visual descriptors. The acquisition costs of these visual descriptors range from 0.0374 to 9.2820. For each descriptor we train 15 one-vs-rest kernel SVMs and use the outputs (margins) as features. Once any feature corresponding to a visual descriptor is used for a test example, the acquisition cost of that descriptor is incurred, and subsequent usage of features from the same group is free for that example. Our base RF consists of 500 trees built with the entropy split criterion, choosing from a random subset of 20 features at each split. As shown in (d) of Figure 2, BUDGETPRUNE and GREEDYPRUNE significantly outperform the other competing methods. BUDGETPRUNE has the same accuracy at a cost of 9 as at the full cost of 32. BUDGETPRUNE and GREEDYPRUNE perform similarly, indicating that the greedy approach happens to solve the global optimization problem for this particular initial RF.
5.1 Discussion & Concluding Comments

We have empirically evaluated several resource-constrained learning algorithms, including BUDGETPRUNE and its variations, on benchmarked datasets here and in the Suppl. Material. We highlight key features of our approach below. (i) STATE-OF-THE-ART METHODS. Recent work has established that GREEDYMISER and BUDGETRF are among the state-of-the-art methods, dominating a number of other methods [12, 21, 20] on these benchmarked datasets. GREEDYMISER requires building class-specific ensembles; it tends to perform poorly and is increasingly difficult to tune in multi-class settings. RF, by its nature, handles multi-class settings efficiently. On the other hand, as described earlier, [12, 20, 21] are fundamentally "tree-growing" approaches: top-down methods that acquire features sequentially based on a surrogate utility value. This is a fundamentally combinatorial problem known to be NP-hard [5, 21], and thus requires a number of relaxations and heuristics with no guarantees on performance. In contrast, our pruning strategy is initialized to realize good performance (RF initialization) and we are able to globally optimize the cost-accuracy objective. (ii) VARIATIONS ON PRUNING. By explicitly modeling feature costs, BUDGETPRUNE outperforms other pruning methods, such as early stopping of BUDGETRF and CCP, that do not consider costs. GREEDYPRUNE performs well, validating our intuition (see Table 1) that pruning sparsely occurring feature nodes utilized by a large fraction of examples can improve the test-time cost-accuracy tradeoff. Nevertheless, BUDGETPRUNE outperforms GREEDYPRUNE, indicating that apart from obvious high-budget regimes, node pruning must account for how the removal of one node may adversely impact another downstream node. (iii) SENSITIVITY TO IMPURITY, FEATURE COSTS, & OTHER INPUTS. We explore these issues in the Suppl. Material. We experiment with BUDGETPRUNE using different impurity functions, such as the entropy and Pairs [16] criteria. Pairs impurity tends to build RFs with lower cost but also lower accuracy than entropy, and so has poorer performance. We also explored how non-uniform costs can impact the cost-accuracy tradeoff.
An elegant approach has been suggested by [2], who proposes an adversarial feature cost proportional to feature utility value. We find that BUDGETPRUNE is robust to such costs. Other RF parameters, including the number of trees and the feature subset size at each split, impact the cost-accuracy tradeoff in the obvious ways: more trees and a moderate feature subset size improve prediction accuracy while incurring higher cost.
Acknowledgment: We thank Dr. Kilian Weinberger for helpful discussions and Dr. David Castanon for insights on the primal-dual algorithm. This material is based upon work supported in part by NSF Grants CCF: 1320566, CNS: 1330008, CCF: 1527618, DHS 2013-ST-061-ED0001, ONR Grant 50202168 and US AF contract FA8650-14-C-1728.

References

[1] IBM ILOG CPLEX Optimizer. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/, 2010.

[2] Djalel Benbouzid. Sequential prediction for budgeted learning: Application to trigger design. Thesis, Université Paris Sud - Paris XI, February 2014.

[3] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[4] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC Press, 1984.

[5] Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, Pranjal Awasthi, and Mukesh K. Mohania. Decision trees for entity identification: Approximation algorithms and hardness results. ACM Trans. Algorithms, 7(2):15:1–15:22, March 2011.

[6] O. Chapelle, Y. Chang, and T. Liu, editors. Proceedings of the Yahoo! Learning to Rank Challenge, held at ICML 2010, Haifa, Israel, June 25, 2010, 2011.

[7] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[8] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.

[9] T. Gao and D. Koller. Active classification based on value of classifier.
In Advances in Neural Information Processing Systems (NIPS), 2011.

[10] Gurobi Optimization Inc. Gurobi optimizer reference manual, 2015.

[11] V. Y. Kulkarni and P. K. Sinha. Pruning of random forest classifiers: A survey and future directions. In International Conference on Data Science Engineering (ICDSE), 2012.

[12] M. Kusner, W. Chen, Q. Zhou, E. Zhixiang, K. Weinberger, and Y. Chen. Feature-cost sensitive learning with submodular trees of classifiers. In AAAI, 2014.

[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE CVPR, 2006.

[14] L. J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object Bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010.

[15] X. Li, J. Sweigart, J. Teng, J. Donohue, and L. Thombs. A dynamic programming based pruning method for decision trees. INFORMS J. on Computing, 13(4):332–344, September 2001.

[16] F. Nan, J. Wang, and V. Saligrama. Feature-budgeted random forest. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015.

[17] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.

[18] H. D. Sherali, A. G. Hobeika, and C. Jeenanunta. An optimal constrained pruning strategy for decision trees. INFORMS Journal on Computing, 21(1):49–61, 2009.

[19] K. Trapeznikov and V. Saligrama. Supervised sequential classification under budget constraints. In International Conference on Artificial Intelligence and Statistics, pages 581–589, 2013.

[20] J. Wang, K. Trapeznikov, and V. Saligrama. Efficient learning by directed acyclic graph for resource constrained prediction. In Advances in Neural Information Processing Systems, 2015.

[21] Z. Xu, M. Kusner, M. Chen, and K. Q. Weinberger. Cost-sensitive tree of classifiers. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[22] Z. E. Xu, K. Q. Weinberger, and O. Chapelle. The greedy miser: Learning under test-time budgets. In Proceedings of the International Conference on Machine Learning, ICML, 2012.

[23] Yi Zhang and Huang Huei-chuen. Decision tree pruning via integer programming. Working paper, 2005.