{"title": "Cost efficient gradient boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1551, "page_last": 1561, "abstract": "Many applications require learning classifiers or regressors that are both accurate and cheap to evaluate. Prediction cost can be drastically reduced if the learned predictor is constructed such that on the majority of the inputs, it uses cheap features and fast evaluations. The main challenge is to do so with little loss in accuracy. In this work we propose a budget-aware strategy based on deep boosted regression trees. In contrast to previous approaches to learning with cost penalties, our method can grow very deep trees that on average are nonetheless cheap to compute. We evaluate our method on a number of datasets and find that it outperforms the current state of the art by a large margin. Our algorithm is easy to implement and its learning time is comparable to that of the original gradient boosting. Source code is made available at http://github.com/svenpeter42/LightGBM-CEGB.", "full_text": "Cost ef\ufb01cient gradient boosting\n\nSven Peter\n\nHeidelberg Collaboratory for Image Processing\nInterdisciplinary Center for Scienti\ufb01c Computing\n\nUniversity of Heidelberg\n\n69115 Heidelberg, Germany\n\nsven.peter@iwr.uni-heidelberg.de\n\nFerran Diego\n\nRobert Bosch GmbH\n\nRobert-Bosch-Stra\u00dfe 200\n\n31139 Hildesheim, Germany\n\nferran.diegoandilla@de.bosch.com\n\nFred A. Hamprecht\n\nHeidelberg Collaboratory for Image Processing\nInterdisciplinary Center for Scienti\ufb01c Computing\n\nUniversity of Heidelberg\n\n69115 Heidelberg, Germany\n\nfred.hamprecht@iwr.uni-heidelberg.de\n\nBoaz Nadler\n\nDepartment of Computer Science\nWeizmann Institute of Science\n\nRehovot 76100, Israel\n\nboaz.nadler@weizmann.ac.il\n\nAbstract\n\nMany applications require learning classi\ufb01ers or regressors that are both accurate\nand cheap to evaluate. 
Prediction cost can be drastically reduced if the learned\npredictor is constructed such that on the majority of the inputs, it uses cheap features\nand fast evaluations. The main challenge is to do so with little loss in accuracy. In\nthis work we propose a budget-aware strategy based on deep boosted regression\ntrees. In contrast to previous approaches to learning with cost penalties, our method\ncan grow very deep trees that on average are nonetheless cheap to compute. We\nevaluate our method on a number of datasets and \ufb01nd that it outperforms the\ncurrent state of the art by a large margin. Our algorithm is easy to implement and\nits learning time is comparable to that of the original gradient boosting. Source\ncode is made available at http://github.com/svenpeter42/LightGBM-CEGB.\n\n1\n\nIntroduction\n\nMany applications need classi\ufb01ers or regressors that are not only accurate, but also cheap to evaluate\n[33, 30]. Prediction cost usually consists of two different components: The acquisition or computation\nof the features used to predict the output, and the evaluation of the predictor itself. A common\napproach to construct an accurate predictor with low evaluation cost is to modify the classical\nempirical risk minimization objective, such that it includes a prediction cost penalty, and optimize\nthis modi\ufb01ed functional [33, 30, 23, 24].\nIn this work we also follow this general approach, and develop a budget-aware strategy based on\ndeep boosted regression trees. Despite the recent re-emergence and popularity of neural networks,\nour choice of boosted regression trees is motivated by three observations:\n\n(i) Given ample training data and computational resources, deep neural networks often give the most\naccurate results. However, standard feed-forward architectures route a single input component (for\nexample, a single coef\ufb01cient in the case of vectorial input) through most network units. 
While the\ncomputational cost can be mitigated by network compression or quantization [14], in the extreme\ncase to binary activations only [16], the computational graph is fundamentally dense. In a standard\ndecision tree, on the other hand, each sample is routed along a single path from the root to a leaf, thus\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fvisiting typically only a small subset of all split nodes, the \"units\" of a decision tree. In the extreme\ncase of a balanced binary tree, each sample visits only log(N ) out of a total of N nodes.\n(ii) Individual decision trees and their ensembles, such as Random Forest [4] and Gradient Boosting\n[12], are still among the most useful and highly competitive methods in machine learning, particularly\nin the regime of limited training data, little training time and little expertise for parameter tuning [11].\n(iii) When features and/or decisions come at a premium, it is convenient but wasteful to assume that\nall instances in a data set are created equal (even when assumed i.i.d.). Some instances may be easy\nto classify based on reading a single measurement / feature, while others may require a full battery of\ntests before a decision can be reached with con\ufb01dence [35]. Decision trees naturally lend themselves\nto such a \"sequential experimental design\" setup: after \ufb01rst using cheap features to split all instances\ninto subsets, the subsequent decisions can be based on more expensive features which are, however,\nonly elicited if truly needed. Importantly, the set of more expensive features is requested conditionally\non the values of features used earlier in the tree.\n\nIn this work we address the challenge of constructing an ensemble of trees that is both accurate and\nyet cheap to evaluate. We \ufb01rst describe the problem setup in Section 2, and discuss related work in\nSection 3. 
Our key contribution appears in Section 4, where we propose an extension of gradient boosting [12] that takes prediction-time penalties into account. In contrast to previous approaches to learning with cost penalties, our method can grow very deep trees that on average are nonetheless cheap to compute. Our algorithm is easy to implement and its learning time is comparable to that of the original gradient boosting. As illustrated in Section 5, on a number of datasets our method outperforms the current state of the art by a large margin.

2 Problem setup

Consider a regression problem where the response Y ∈ R and each instance X is represented by M features, X ∈ RM. Let L : R × R → R be a loss function, and T be a set of admissible functions. In supervised learning, given a training set of N pairs (xi, yi) sampled i.i.d. from (X, Y), a classical approach to learn a predictor T ∈ T is to minimize the empirical loss L on the training set,

$$\min_{T \in \mathcal{T}} \sum_{i=1}^{N} L(y_i, T(x_i)). \qquad (1)$$

In this paper we restrict ourselves to the set T that consists of ensembles of trees, namely predictors of the form $T(x) = \sum_{k=1}^{K} t_k(x)$. Each single decision tree tk can be represented as a collection of Lk leaf nodes with corresponding responses $\omega_k = (\omega_{k,1}, \ldots, \omega_{k,L_k}) \in \mathbb{R}^{L_k}$ and a function qk : RM → {1, . . . , Lk} that encodes the tree structure and maps an input to its corresponding terminal leaf index. The output of the tree is $t_k(x) = \omega_{k,q_k(x)}$.
Learning even a single tree that exactly minimizes the functional in Eq. (1) is NP-hard under several aspects of optimality [15, 19, 25, 36]. Yet, single trees and ensembles of trees are among the most successful predictors in machine learning, and there are multiple greedy methods to construct tree ensembles that approximately solve Eq. 
(1) [4, 12, 11].
In many practical applications, however, it is important that the predictor T is not only accurate but also fast to compute. Given a prediction cost function Ψ : T × RM → R+, a standard approach is to add a penalty to the empirical risk minimization above [33, 30, 35, 23, 24]:

$$\min_{T \in \mathcal{T}} \sum_{i} \left[ L(y_i, T(x_i)) + \lambda \Psi(T, x_i) \right]. \qquad (2)$$

The parameter λ controls the tradeoff between accuracy and prediction cost.
Typically, the prediction cost function Ψ consists of two components. The first is the cost of acquiring or computing relevant input features. For example, think of a patient at the emergency room, where taking their temperature and blood oxygen levels is cheap, but a CT scan is expensive. The second component is the cost of evaluating the function T, which in our case is the sum of the costs of evaluating the K individual trees tk.
In more detail, the first component, feature computation cost, may also depend on the specific prediction problem. In some scenarios, test instances are independent of each other and the features can be computed for each input instance on demand. But there are also other settings. In image processing, for example, the input is an image consisting of many pixels and the task is to predict some function at all pixels. In such cases, even though specific features can be computed for each pixel independently, it may be cheaper or more efficient to compute the same feature, such as a separable convolution filter, at all pixels at once [1, 13]. The cost function Ψ may then be dominated by the second component: the time it takes to evaluate the trees.
After discussing related work in Section 3, in Section 4 we present a general adaptation of gradient boosting [12] to minimize Eq. 
(2), which takes into account both prediction cost components.

3 Related work

The problem of learning with prediction cost penalties has been extensively studied. One particular case is that of class imbalance, where one class is extremely rare and yet it is important to annotate it accurately. For example, the famous Viola-Jones cascades [31] use cheap features to discard examples belonging to the negative class. Later stages requiring expensive features are only used for the rare suspected positive class. While such an approach is very successful, due to its early-exit strategy it cannot use expensive features for different inputs [20, 30, 9].
To overcome the limitations imposed by early-exit strategies, various methods [34, 35, 18, 32] proposed single-tree constructions with more complicated decisions at the individual split nodes. The tree is first learned without taking prediction cost into account, followed by an optimization step that includes this cost. Unfortunately, in practice these single-tree methods are inferior to current state-of-the-art algorithms that construct tree ensembles [23, 24].
BUDGETRF [23] is based on Random Forests and modifies the impurity function that decides which split to make, so as to take feature costs into account. BUDGETRF has several limitations: First, it assumes that tree evaluation cost is negligible compared to feature acquisition, and hence is not suitable for problems where features are cheap to compute and the prediction cost is dominated by predictor evaluation, or where both components contribute equally. Second, during its training phase, each usage of a feature incurs its acquisition cost, so repeated feature usage is not modeled, and the probability of reaching a node is not taken into account. At test time, in contrast, they do allow "free" reuse of expensive features and do compute the precise cost of reaching various tree branches. 
BUDGETRF thus typically does not yield deep but expensive branches that are only seldom reached.
BUDGETPRUNE [24] is a pruning scheme for ensembles of decision trees. It aims to mitigate limitations of BUDGETRF by pruning expensive branches from the individual trees. An Integer Linear Program is formulated and efficiently solved to take repeated feature usage and the probabilities of reaching different branches into account. This method results in a better tradeoff, but it still cannot create deep and expensive branches that are only seldom reached if these were not present in the original ensemble. This method is considered to be state of the art when prediction cost is dominated by the feature acquisition cost [24]. We show in Section 5 that constructing deeper trees with our method results in significantly better performance.
GREEDYMISER [33], which is most similar to our work, is a stage-wise gradient-boosting type algorithm that also aims to minimize Eq. (2) using an ensemble of regression trees. When both prediction cost components are assumed equally significant, GREEDYMISER is considered state of the art. Yet, GREEDYMISER also has a few limitations: First, all trees are assumed to have the same prediction cost for all inputs. Second, by design it constructs shallow trees, all having the same depth. We instead consider individual costs for each leaf and thus allow construction of deeper trees. Our experiments in Section 5 suggest that constructing deeper trees with our proposed method significantly outperforms GREEDYMISER.

4 Gradient boosting with cost penalties

We build on the gradient boosting framework [12] and adapt it to allow optimization with cost penalties. First we briefly review the original algorithm. 
We then present our cost penalty in Section 4.1, the step-wise optimization in Section 4.2 and finally, in Section 4.3, our tree growing algorithm that builds trees with deep branches but low expected depth and feature cost (such a tree is shown in Figure 1b and compared to a shallow tree that is more expensive and less accurate in Figure 1a).

[Figure 1: two tree diagrams, (a) Other methods and (b) CEGB.]

Figure 1: Illustration of trees generated by the different methods: Split nodes are numbered in the order they have been created; leaves are represented with letters. The vertical position of a node corresponds to the feature cost required for each sample, and an edge's thickness represents the number of samples moving along that edge. A tree constructed by GreedyMiser is shown in (a): the majority of samples travel along a path requiring a very expensive feature. BudgetPrune could only prune away leaves E, F, G and H, which does not correspond to a large reduction in costs. CEGB, however, only uses two very cheap splits for almost all samples (leaves A and B) and builds a complex subtree for the minority that is hard to classify. The constructed tree shown in (b) is deep but nevertheless cheap to evaluate on average.

Gradient boosting tries to minimize the empirical risk of Eq. (1) by constructing a linear combination of K weak predictors tk : RM → R from a set F of admissible functions (not necessarily decision trees). Starting with T0(x) = 0, each iteration k > 0 constructs a new weak function tk aiming to reduce the current loss. These boosting updates can be interpreted as approximations of the gradient descent direction in function space. 
We follow the notation of [8], who use gradient boosting with weak predictors tk from the set of regression trees T to minimize the regularized empirical risk

$$\min_{t_1,\ldots,t_K \in \mathcal{T}} \left[ \sum_{i=1}^{N} L\!\left(y_i, \sum_{k=1}^{K} t_k(x_i)\right) \right] + \sum_{k=1}^{K} \Omega(t_k). \qquad (3)$$

The regularization term Ω(tk) penalizes the complexity of the regression tree functions. They assume that Ω(tk) only depends on the number of leaves Lk and the leaf responses ωk, and derive a simple algorithm to directly learn these. We instead use a more complicated prediction cost penalty Ψ and a different tree construction algorithm that allows optimization with cost penalties.

4.1 Prediction cost penalty

Recall that for each individual tree the prediction cost penalty Ψ consists of two components: (i) the feature acquisition cost Ψf and (ii) the tree evaluation cost Ψev. However, this prediction cost for the k-th tree, which is fitted to the residual of all previous iterations, depends on the earlier trees. Specifically, for any input x, features used in the trees of the previous iterations do not contribute to the cost penalty again. We thus use the indicator function C : N0≤K × N≤N × N≤M → {0, 1} with C(k, i, m) = 1 if and only if feature m was used to predict xi by any tree constructed prior to and including iteration k. Furthermore, βm ≥ 0 is the cost for computing or acquiring feature m for a single input x. The feature cost contribution Ψf : N0≤K × N≤N → R+ of xi for the first k trees is then calculated as

$$\Psi_f(k, i) = \sum_{m=1}^{M} \beta_m \, C(k, i, m). \qquad (4)$$

Features computed for all inputs at once (e.g. separable convolution filters) contribute to the penalty independent of the instance x being evaluated. 
For those we use γm as their total computation cost and define the indicator function D : N0≤K × N≤M → {0, 1} with D(k, m) = 1 if and only if feature m was used for any input x in any tree constructed prior to and including iteration k. Then

$$\Psi_c(k) = \sum_{m=1}^{M} \gamma_m \, D(k, m). \qquad (5)$$

The evaluation cost Ψev,k : N≤Lk → R+ for a single input x passing through a tree is the number of split nodes between the root node and the input's terminal leaf qk(x), multiplied by a suitable constant α ≥ 0 which captures the cost of evaluating a single split. The total cost Ψev : N0≤K × N≤N → R+ for the first k trees is the sum of the costs of the individual trees:

$$\Psi_{ev}(k, i) = \sum_{\tilde{k}=1}^{k} \Psi_{ev,\tilde{k}}(q_{\tilde{k}}(x_i)). \qquad (6)$$

4.2 Tree Boosting with Prediction Costs

We have now defined all components of Eq. (2). Simultaneous optimization of all trees tk is intractable. Instead, as in gradient boosting, we minimize the objective by starting with T0(x) = 0 and iteratively adding a new tree at each iteration.
At iteration k we construct the k-th regression tree tk by minimizing the objective

$$O_k = \sum_{i=1}^{N} \left[ L(y_i, T_{k-1}(x_i) + t_k(x_i)) + \lambda \Psi(k, x_i) \right] + \lambda \Psi_c(k) \qquad (7)$$

with Ψ(k, xi) = Ψev(k, i) + Ψf(k, i). Note that the penalty Ψc(k) for features which are computed for all inputs at once does not depend on x but only on the structure of the current and previous trees.
Directly optimizing the objective Ok w.r.t. the tree tk is difficult since the argument tk appears inside the loss function. 
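The cost bookkeeping of Eqs. (4)-(6) for a single instance can be illustrated with a minimal Python sketch. This is our own illustration, not the paper's implementation; the function name and data layout are assumptions.

```python
def prediction_cost(used_per_tree, depth_per_tree, ensemble_features,
                    beta, gamma, alpha):
    """Sketch of the total prediction cost of one instance after k trees.

    used_per_tree:     per tree, the set of feature indices this instance
                       touched on its root-to-leaf path.
    depth_per_tree:    per tree, the instance's leaf depth (= split count).
    ensemble_features: features computed once for all inputs (D(k, m) = 1).
    beta / gamma:      per-feature costs beta_m and gamma_m.
    alpha:             cost of evaluating a single split node.
    """
    # Psi_f, Eq. (4): beta_m is charged exactly once, the first time any
    # tree uses feature m for this instance (indicator C(k, i, m)).
    seen, psi_f = set(), 0.0
    for feats in used_per_tree:
        psi_f += sum(beta[m] for m in feats - seen)
        seen |= feats
    # Psi_c, Eq. (5): one-off cost of features computed for all inputs.
    psi_c = sum(gamma[m] for m in ensemble_features)
    # Psi_ev, Eq. (6): alpha per split node visited in each tree.
    psi_ev = alpha * sum(depth_per_tree)
    return psi_f + psi_c + psi_ev
```

For example, an instance that pays for features with β = 1 and β = 10, one shared feature with γ = 4, and visits five split nodes in total at α = 0.5 incurs a cost of 11 + 4 + 2.5 = 17.5; reusing a feature in a later tree adds nothing.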
Following [8], we use a second-order Taylor expansion of the loss around Tk−1(xi). Removing constant terms from earlier iterations, the objective function can be approximated by

$$O_k \approx \tilde{O}_k = \sum_{i=1}^{N} \left[ g_i t_k(x_i) + \frac{1}{2} h_i t_k^2(x_i) + \lambda \Delta\Psi(x_i) \right] + \lambda \Delta\Psi_c \qquad (8)$$

where

$$g_i = \left.\partial_{\hat{y}_i} L(y_i, \hat{y}_i)\right|_{\hat{y}_i = T_{k-1}(x_i)}, \qquad (9a)$$
$$h_i = \left.\partial^2_{\hat{y}_i} L(y_i, \hat{y}_i)\right|_{\hat{y}_i = T_{k-1}(x_i)}, \qquad (9b)$$
$$\Delta\Psi(x_i) = \Psi(k, x_i) - \Psi(k-1, x_i), \qquad (9c)$$
$$\Delta\Psi_c = \Psi_c(k) - \Psi_c(k-1). \qquad (9d)$$

As in [8] we rewrite Eq. (8) for a decision tree $t_k(x) = \omega_{k,q_k(x)}$ with a fixed structure qk,

$$\tilde{O}_k = \sum_{l=1}^{L_k} \left[ \left( \sum_{i\in I_l} g_i \right) \omega_{k,l} + \frac{1}{2} \left( \sum_{i\in I_l} h_i \right) \omega_{k,l}^2 + \lambda \sum_{i\in I_l} \Delta\Psi(x_i) \right] + \lambda \Delta\Psi_c \qquad (10)$$

with the set Il = {i | qk(xi) = l} containing the inputs in leaf l. For this fixed structure the optimal weights and the corresponding best objective reduction can be calculated explicitly:

$$\omega^*_{k,l} = - \frac{\sum_{i\in I_l} g_i}{\sum_{i\in I_l} h_i}, \qquad (11a)$$

$$\tilde{O}^*_k = \sum_{l=1}^{L_k} \left[ -\frac{1}{2} \frac{\left(\sum_{i\in I_l} g_i\right)^2}{\sum_{i\in I_l} h_i} + \lambda \sum_{i\in I_l} \Delta\Psi(x_i) \right] + \lambda \Delta\Psi_c. \qquad (11b)$$

As we shall see in the next section, our cost-aware impurity function depends on the change of Eq. (10) that results from replacing a terminal leaf with a split node [8]. Let p be any leaf of the tree that can be converted to a split node with two new children r and l; then the difference of Eq. 
(10) evaluated for the original and the modified tree is

$$\Delta\tilde{O}^{\text{split}}_k = \frac{1}{2} \left[ \frac{\left(\sum_{i\in I_r} g_i\right)^2}{\sum_{i\in I_r} h_i} + \frac{\left(\sum_{i\in I_l} g_i\right)^2}{\sum_{i\in I_l} h_i} - \frac{\left(\sum_{i\in I_p} g_i\right)^2}{\sum_{i\in I_p} h_i} \right] - \lambda \, \Delta\Psi^{\text{split}}_k. \qquad (12)$$

Let m be the feature used by the node that we are considering to split. Then

$$\Delta\Psi^{\text{split}}_k = \underbrace{|I_p|\,\alpha}_{\Psi^{\text{split}}_{ev,k}} + \underbrace{\overbrace{\gamma_m \,(1 - D(k, m))}^{\text{is feature } m \text{ used for the first time?}} + \beta_m \overbrace{\sum_{i\in I_p} (1 - C(k, i, m))}^{\text{is feature } m \text{ of instance } x_i \text{ used for the first time?}}}_{\Psi^{\text{split}}_{f,k}}. \qquad (13)$$

4.3 Learning a weak regressor with cost penalties

With these preparations we can now construct the regression trees. As mentioned above, this is an NP-hard problem. We use a greedy algorithm to grow a tree that approximately minimizes Eq. (10). Standard algorithms that grow trees start from a single leaf containing all inputs. The tree is then iteratively expanded by replacing a single leaf with a split node and two new child leaves [4]. Typically this expansion happens in a predefined leaf order (breadth- or depth-first), and splits are only evaluated locally at a single leaf to select the best feature. The expansion is stopped once leaves are pure or once a maximum depth has been reached. Here, in contrast, we adopt the approach of [29] and grow the tree in a best-first order: splits are evaluated for all current leaves, and the one with the best objective reduction according to Eq. (12) is chosen. The tree can thus grow at any location. 
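The best-first, cost-aware growth just described can be sketched compactly in Python. This is a simplification we wrote for exposition, not the CEGB implementation: a single tree, an exhaustive threshold scan, a small epsilon guard on the hessian sums, and a feature cost charged once per root-to-leaf path; all names are ours.

```python
import heapq
import numpy as np

def grow_tree_best_first(X, g, h, beta, alpha, lam, max_leaves=8):
    """Grow one tree best-first using the penalized gain of Eq. (12).

    X: (N, M) features; g, h: per-instance gradient/hessian statistics;
    beta: per-feature acquisition cost; alpha: per-split evaluation cost;
    lam: accuracy/cost tradeoff. Returns a list of (indices, leaf_weight).
    """
    def leaf_term(idx):               # (sum g)^2 / (sum h), cf. Eq. (11b)
        return g[idx].sum() ** 2 / max(h[idx].sum(), 1e-12)

    def best_split(idx, used):
        # Scan all features/thresholds of a leaf; return the candidate with
        # the largest positive penalized gain (Eq. 12), or None.
        best = None
        for m in range(X.shape[1]):
            order = idx[np.argsort(X[idx, m])]
            for pos in range(1, len(order)):
                if X[order[pos - 1], m] == X[order[pos], m]:
                    continue
                l, r = order[:pos], order[pos:]
                gain = 0.5 * (leaf_term(l) + leaf_term(r) - leaf_term(idx))
                cost = len(idx) * alpha            # evaluation cost term
                if m not in used:                  # feature paid once per path
                    cost += len(idx) * beta[m]
                score = gain - lam * cost
                if score > 0 and (best is None or score > best[0]):
                    thresh = 0.5 * (X[order[pos - 1], m] + X[order[pos], m])
                    best = (score, m, thresh, l, r)
        return best

    root = np.arange(len(g))
    leaves = {0: (root, frozenset())}  # leaf id -> (indices, features used)
    heap, counter = [], 1
    cand = best_split(root, frozenset())
    if cand:
        heapq.heappush(heap, (-cand[0], 0, cand))
    while heap and len(leaves) < max_leaves:
        # Pop the split with the best penalized gain over ALL current leaves.
        _, leaf_id, (_, m, thresh, l, r) = heapq.heappop(heap)
        _, used = leaves.pop(leaf_id)
        for child in (l, r):
            cid, counter = counter, counter + 1
            child_used = used | {m}
            leaves[cid] = (child, child_used)
            c = best_split(child, child_used)
            if c:
                heapq.heappush(heap, (-c[0], cid, c))
    # Optimal leaf weights, Eq. (11a): -sum g / sum h.
    return [(idx, -g[idx].sum() / max(h[idx].sum(), 1e-12))
            for idx, _ in leaves.values()]
```

Because a candidate split is only enqueued when its penalized gain is positive, growth stops by itself on leaves whose accuracy improvement does not justify the extra feature or evaluation cost, which is exactly what produces deep but cheap-on-average trees.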
This allows comparing splits across different leaves and features at the same time (Figure 1b shows an example of a best-first tree, while Figure 1a shows a tree constructed in breadth-first order). Instead of limiting the depth, we limit the number of leaves in each tree to prevent overfitting.
This procedure has an important advantage when optimizing with cost penalties: Growing in a predefined order usually leads to balanced trees, since all branches are grown independent of the cost. Deep and expensive branches using only a tiny subset of inputs are not easily possible. In contrast, growing at the leaf that promises the best tradeoff as given by Eq. (12) encourages growth on branches that contain few instances or growth using cheap features. Growth on branches that contain many instances, or growth that requires expensive features, is penalized. This strategy results in deep trees that are nevertheless cheap to compute on average. Figure 1 compares an individual tree constructed by other methods to the deeper tree constructed by CEGB.
We briefly compare our proposed strategy to GREEDYMISER: When we limit Eq. (8) to first-order terms only, use breadth-first instead of best-first growth, assume that features always have to be computed for all instances at once, and limit the tree depth to four, we minimize Eq. (18) from [33]. GreedyMiser can therefore be represented as a special case of our proposed algorithm.

5 Experiments

The Yahoo! Learning to Rank (Yahoo! LTR) challenge dataset [7] consists of 473134 training, 71083 validation and 165660 test document-query pairs with labels {0, 1, 2, 3, 4}, where 0 means the document is irrelevant and 4 that it is highly relevant to the query. 
Computation costs for the 519 features used in the dataset are provided [33] and take values in {1, 5, 10, 20, 50, 100, 150, 200}. Prediction performance is evaluated using the Average Precision@5 metric, which only considers the five most relevant documents returned for a query by the regressor [33, 23, 24]. We use the dataset provided by [7] and used in [23, 24].
We consider two different settings for our experiments: (i) feature acquisition and classifier evaluation time both contribute to prediction cost, and (ii) classifier evaluation time is negligible w.r.t. feature acquisition cost.

[Figure 2: six panels (a)-(f) comparing accuracy versus prediction cost.]

Figure 2: Comparison against state-of-the-art algorithms: The Yahoo! LTR dataset has been used for (2a) and (2b) in different settings. In (2a) both tree evaluation and feature acquisition costs are considered; in (2b) only feature acquisition cost is shown. (2c) shows results on the MiniBooNE dataset with uniform feature costs. GREEDYMISER and BUDGETPRUNE results for (2b), (2c) and (2d) are from [24]. BUDGETPRUNE did not finish training on the HEPMASS datasets due to their size and the associated CPU time and RAM requirements. CEGB is our proposed method.

The first setting is used by GREEDYMISER. Regression trees with depth four are constructed and assumed to cost approximately as much as features with feature cost βm = 1. We therefore set the split cost α = 1/4 to allow a fair comparison with our trees, which will contain deeper branches. We also use our algorithm to construct trees similar to GREEDYMISER by limiting the trees to 16 leaves with a maximum branch depth of four. Figure 2a shows that even these shallow trees are already always strictly better than GREEDYMISER. This happens because our algorithm correctly accounts for the different probabilities of reaching different leaves (see also Figure 1). 
When we allow deep branches, the proposed method gives significantly better results than GREEDYMISER and learns a predictor with better accuracy at a much lower cost.
The second setting is considered by BUDGETPRUNE. It assumes that feature computation is much more expensive than classifier evaluation. We set α = 0 to adapt our algorithm to this setting.

[Figure 3: two panels (a) and (b).]

Figure 3: In (3a) we study the influence of the feature penalty on the learned classifier. (3b) shows how best-first training results in better precision given the same cost budget.

The dataset is additionally binarized by setting all targets y > 0 to y = 1. GREEDYMISER is at a disadvantage in this setting since it works on the assumption that the cost of each tree is independent of the input x; we still include it in our comparison as a baseline. Figure 3b shows that our proposed method again performs significantly better than the others. This confirms that we learn a classifier with a very cheap expected prediction cost in terms of both feature acquisition and classifier evaluation time.
The MiniBooNE dataset [27, 21] consists of 45523 training, 19510 validation and 65031 test instances with labels {0, 1} and 50 features. The Forest Covertype dataset [3, 21] consists of 36603 training, 15688 validation and 58101 test instances with 54 features, restricted to two classes as done in [24]. Feature costs are not available for either dataset and are assumed to be uniform, i.e. βm = 1. Since no relation between classifier evaluation and feature cost is known, we only compute the latter to allow a fair comparison, as in [24]. Figures 2c and 2d show that our proposed method again results in a significantly better predictor than both GREEDYMISER and BUDGETPRUNE.
We additionally use the HEPMASS-1000 and HEPMASS-not1000 datasets [2, 21]. 
Similar to MiniBooNE, no feature costs are known and we again uniformly set them to one for all features, i.e. βm = 1. Both datasets contain over ten million instances, which we split into 3.5 million training, 1.4 million validation and 5.6 million test instances. These datasets are much larger than the others, and we did not manage to successfully run BUDGETPRUNE due to its RAM and CPU time requirements. We only report results for GREEDYMISER and our algorithm, in Figures 2e and 2f. CEGB again results in a classifier with a better tradeoff than GREEDYMISER.

5.1 Influence of feature cost and tradeoff parameters

We use the Yahoo! LTR dataset to study the influence of the feature costs β and the tradeoff parameter λ on the learned regressor. Figure 3a shows that regressors learned with a large λ reach similar accuracy as those with smaller λ at a much cheaper cost. Only λ = 0.001 converges to a lower accuracy, while the others approximately reach the same final accuracy: its tradeoff is shifted too strongly towards using cheap features. Such a regressor is nevertheless useful when the problem requires very cheap predictions and the final improvement in accuracy does not matter.
Next, we set all βm = 1 during training time only and use the original costs during test time. The learned regressor behaves similarly to one learned with λ = 0. This shows that the regressors save most of the cost by limiting the usage of expensive features to a small subset of inputs.
Finally, we compare breadth-first to best-first training in Figure 3b. We use the same number of leaves and trees and try to build a classifier that is as cheap as possible. Best-first training always reaches a higher accuracy for a given prediction cost budget. 
This supports our observation that deep trees which are cheap to evaluate on average are important for constructing cheap and accurate predictors.

5.2 Multi-scale classification / tree structure optimization

In image processing, classification using multiple scales has been extensively studied and used to build fast or more accurate classifiers [6, 31, 10, 26]. The basic idea of these schemes is that a large image is downsampled to increasingly coarse resolutions. A multi-scale classifier first analyzes the coarsest resolution and decides whether a pixel on the coarse level represents a block of homogeneous pixels at the original resolution, or if analysis at a less coarse resolution is required. Efficiency results from the ability to label many pixels at the original resolution at once by labeling a single pixel on a coarser image.

[Figure 4: three panels (a)-(c).]

Figure 4: Multi-scale classification: (4a) shows a single frame from the dataset we used. (4b) shows how our proposed algorithm CEGB is able to build significantly cheaper trees than normal gradient boosting. (4c) zooms into the region showing the differences between the various patch sizes.

We use this setting as an example to show how our algorithm is also capable of optimizing problems where feature cost is negligible compared to predictor evaluation cost. Inspired by average pooling layers in neural networks [28] and image pyramids [5], we first compute the average pixel values across non-overlapping 2x2, 4x4 and 8x8 blocks of the original image. We compute several commonly used and very fast convolutional filters at each of those resolutions. We then replicate these feature values at the original resolution, e.g. the feature response of a single pixel on the 8x8-averaged image is used for all 64 pixels. We modify Eq. (12) and set ∆Ψsplit_k = |Ip| α ηm, where ηm is the number of pixels that share this feature value, e.g. ηm = 64 when feature m was computed on the coarse 8x8-averaged image.
We use forty frames with a resolution of 1024x1024 pixels taken from a video studying fly ethology. Our goal here is to detect flies as quickly as possible, as preprocessing for subsequent tracking. A single frame is shown in Figure 4a. We use twenty of these frames for training and twenty for evaluation. Accuracy is evaluated using the SEGMeasure score as defined in [22]. Comparison is done against regular gradient boosting by setting Ψ = 0.
Figure 4b shows that our algorithm constructs an ensemble that is able to reach similar accuracy at a significantly smaller evaluation cost. Figure 4c shows more clearly how the different available resolutions influence the learned ensemble. Coarser resolutions allow very efficient prediction at the cost of accuracy. Overall, these experiments show that our algorithm is also capable of learning predictors that are cheap while maintaining accuracy, even when their evaluation cost dominates w.r.t. the feature acquisition cost.

6 Conclusion

We presented an adaptation of gradient boosting that includes prediction cost penalties, and devised fast methods to learn an ensemble of deep regression trees. A key feature of our approach is its ability to construct deep trees that are nevertheless cheap to evaluate on average. In the experimental part we demonstrated that this approach is capable of handling various settings of prediction cost penalties consisting of feature cost and tree evaluation cost. 
Specifically, our method significantly outperformed the state-of-the-art algorithms GREEDYMISER and BUDGETPRUNE when feature cost either dominates or contributes equally to the total cost. We additionally showed an example where we optimize the decision structure of the trees themselves when their evaluation is the limiting factor.\n\nOur algorithm can be easily implemented using any gradient boosting library and does not slow down training significantly. For these reasons we believe it will be highly valuable for many applications. Source code based on LightGBM [17] is available at http://github.com/svenpeter42/LightGBM-CEGB.\n\nReferences\n\n[1] Gholamreza Amayeh, Alireza Tavakkoli, and George Bebis. Accurate and efficient computation of Gabor features in real-time applications. Advances in Visual Computing, pages 243–252, 2009.\n\n[2] Pierre Baldi, Kyle Cranmer, Taylor Faucett, Peter Sadowski, and Daniel Whiteson. Parameterized machine learning for high-energy physics. arXiv preprint arXiv:1601.07913, 2016.\n\n[3] Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151, 1999.\n\n[4] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.\n\n[5] Peter Burt and Edward Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983.\n\n[6] Vittorio Castelli, Chung-Sheng Li, John Turek, and Ioannis Kontoyiannis. Progressive classification in the compressed domain for large EOS satellite databases. In Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, volume 4, pages 2199–2202. IEEE, 1996.\n\n[7] Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. In Yahoo!
Learning to Rank Challenge, pages 1–24, 2011.\n\n[8] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.\n\n[9] Giulia DeSalvo, Mehryar Mohri, and Umar Syed. Learning with deep cascades. In International Conference on Algorithmic Learning Theory, pages 254–269. Springer, 2015.\n\n[10] Piotr Dollár, Serge J Belongie, and Pietro Perona. The fastest pedestrian detector in the west. In BMVC, volume 2, page 7. Citeseer, 2010.\n\n[11] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res., 15(1):3133–3181, 2014.\n\n[12] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.\n\n[13] Pascal Getreuer. A survey of Gaussian convolution algorithms. Image Processing On Line, 2013:286–310, 2013.\n\n[14] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR), 2016.\n\n[15] Thomas Hancock, Tao Jiang, Ming Li, and John Tromp. Lower bounds on learning decision lists and trees. Information and Computation, 126(2):114–122, 1996.\n\n[16] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4107–4115. Curran Associates, Inc., 2016.\n\n[17] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu.
LightGBM: A highly efficient gradient boosting decision tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3149–3157. Curran Associates, Inc., 2017.\n\n[18] Matt J Kusner, Wenlin Chen, Quan Zhou, Zhixiang Eddie Xu, Kilian Q Weinberger, and Yixin Chen. Feature-cost sensitive learning with submodular trees of classifiers. In AAAI, pages 1939–1945, 2014.\n\n[19] Laurent Hyafil and Ronald L Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.\n\n[20] Leonidas Lefakis and François Fleuret. Joint cascade optimization using a product of boosted classifiers. In Advances in Neural Information Processing Systems, pages 1315–1323, 2010.\n\n[21] M. Lichman. UCI machine learning repository, 2013.\n\n[22] Martin Maška, Vladimír Ulman, David Svoboda, Pavel Matula, Petr Matula, Cristina Ederra, Ainhoa Urbiola, Tomás España, Subramanian Venkatesan, Deepak MW Balak, et al. A benchmark for comparison of cell tracking algorithms. Bioinformatics, 30(11):1609–1617, 2014.\n\n[23] Feng Nan, Joseph Wang, and Venkatesh Saligrama. Feature-budgeted random forest. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1983–1991, Lille, France, 07–09 Jul 2015. PMLR.\n\n[24] Feng Nan, Joseph Wang, and Venkatesh Saligrama. Pruning random forests for prediction on a budget. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2334–2342. Curran Associates, Inc., 2016.\n\n[25] GE Naumov. NP-completeness of problems of construction of optimal decision trees.
In Soviet Physics Doklady, volume 36, page 270, 1991.\n\n[26] Marco Pedersoli, Andrea Vedaldi, Jordi Gonzalez, and Xavier Roca. A coarse-to-fine approach for fast deformable object detection. Pattern Recognition, 48(5):1844–1853, 2015.\n\n[27] Byron P. Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. Boosted decision trees, an alternative to artificial neural networks. Nucl. Instrum. Meth., A543(2-3):577–584, 2005.\n\n[28] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. Artificial Neural Networks–ICANN 2010, pages 92–101, 2010.\n\n[29] Haijian Shi. Best-first decision tree learning. PhD thesis, The University of Waikato, 2007.\n\n[30] Kirill Trapeznikov and Venkatesh Saligrama. Supervised sequential classification under budget constraints. In AISTATS, pages 581–589, 2013.\n\n[31] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, 2001.\n\n[32] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. Efficient learning by directed acyclic graph for resource constrained prediction. In Advances in Neural Information Processing Systems, pages 2152–2160, 2015.\n\n[33] Zhixiang Xu, Kilian Weinberger, and Olivier Chapelle. The greedy miser: Learning under test-time budgets. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML '12, pages 1175–1182, July 2012.\n\n[34] Zhixiang Eddie Xu, Matt J Kusner, Kilian Q Weinberger, and Minmin Chen. Cost-sensitive tree of classifiers.
In ICML (1), pages 133–141, 2013.\n\n[35] Zhixiang Eddie Xu, Matt J Kusner, Kilian Q Weinberger, Minmin Chen, and Olivier Chapelle. Classifier cascades and trees for minimizing feature evaluation cost. Journal of Machine Learning Research, 15(1):2113–2144, 2014.\n\n[36] Hans Zantema and Hans L Bodlaender. Finding small equivalent decision trees is hard. International Journal of Foundations of Computer Science, 11(02):343–354, 2000.