{"title": "Adaptive Classification for Prediction Under a Budget", "book": "Advances in Neural Information Processing Systems", "page_first": 4727, "page_last": 4737, "abstract": "We propose a novel adaptive approximation approach for test-time resource-constrained prediction motivated by Mobile, IoT, health, security and other applications, where constraints in the form of computation, communication, latency and feature acquisition costs arise. We learn an adaptive low-cost system by training a gating and prediction model that limits utilization of a high-cost model to hard input instances and gates easy-to-handle input instances to a low-cost model. Our method is based on adaptively approximating the high-cost model in regions where low-cost models suffice for making highly accurate predictions. We pose an empirical loss minimization problem with cost constraints to jointly train gating and prediction models. On a number of benchmark datasets our method outperforms state-of-the-art achieving higher accuracy for the same cost.", "full_text": "Adaptive Classi\ufb01cation for Prediction Under a Budget\n\nFeng Nan\n\nSystems Engineering\n\nBoston University\nBoston, MA 02215\n\nfnan@bu.edu\n\nVenkatesh Saligrama\nElectrical Engineering\n\nBoston University\nBoston, MA 02215\n\nsrv@bu.edu\n\nAbstract\n\nWe propose a novel adaptive approximation approach for test-time resource-\nconstrained prediction motivated by Mobile, IoT, health, security and other ap-\nplications, where constraints in the form of computation, communication, latency\nand feature acquisition costs arise. We learn an adaptive low-cost system by train-\ning a gating and prediction model that limits utilization of a high-cost model to\nhard input instances and gates easy-to-handle input instances to a low-cost model.\nOur method is based on adaptively approximating the high-cost model in regions\nwhere low-cost models suf\ufb01ce for making highly accurate predictions. 
We pose an\nempirical loss minimization problem with cost constraints to jointly train gating\nand prediction models. On a number of benchmark datasets our method outper-\nforms state-of-the-art achieving higher accuracy for the same cost.\n\n1\n\nIntroduction\n\nResource costs arise during test-time prediction in a number of machine learning applications. Fea-\nture costs in Internet, Healthcare, and Surveillance applications arise due to to feature extraction\ntime [23], and feature/sensor acquisition [19]. In addition to feature acquisition costs, communica-\ntion and latency costs pose a key challenge in the design of mobile computing, or the Internet-of-\nThings(IoT) applications, where a large number of sensors/camera/watches/phones (known as edge\ndevices) are connected to a cloud.\nAdaptive System: Rather than having the edge devices constantly transmit measurements/images\nto the cloud where a centralized model makes prediction, a more ef\ufb01cient approach is to allow\nthe edge devices make predictions locally [12], whenever possible, saving the high communication\ncost and reducing latency. Due to the memory, computing and battery constraints, the prediction\nmodels on the edge devices are limited to low complexity. Consequently, to maintain high-accuracy,\nadaptive systems are desirable. Such systems identify easy-to-handle input instances where local\nedge models suf\ufb01ce, thus limiting the utilization cloud services for only hard instances. We propose\nto learn an adaptive system by training on fully annotated training data. Our objective is to maintain\nhigh accuracy while meeting average resource constraints during prediction-time.\nThere have been a number of promising approaches that focus on methods for reducing costs while\nimproving overall accuracy [9, 24, 19, 20, 13, 15]. These methods are adaptive in that, at test-\ntime, resources (features, computation etc) are allocated adaptively depending on the dif\ufb01culty of\nthe input. 
Many of these methods train models in a top-down manner, namely, they attempt to build out the model by selectively adding the most cost-effective features to improve accuracy.\nIn contrast we propose a novel bottom-up approach. We train adaptive models on annotated training data by selectively identifying parts of the input space for which high accuracy can be maintained at a lower cost. The principal advantage of our method is twofold. First, our approach can be readily applied to cases where it is desirable to reduce the costs of an existing high-cost legacy system. Second, training top-down models in the case of feature costs leads to fundamental combinatorial issues in multi-stage search over all feature subsets (see Sec. 2). In contrast, we bypass many of these issues by posing a natural adaptive approximation objective to partition the input space into easy and hard cases.\nIn particular, when no legacy system is available, our method consists of first learning a high-accuracy model that minimizes the empirical loss regardless of costs. The resulting high prediction-cost model (HPC) can be readily trained using any of the existing methods. For example, this could be a large neural network in the cloud that achieves state-of-the-art accuracy. Next, we jointly learn a low-cost gating function as well as a low prediction-cost (LPC) model so as to adaptively approximate the high-accuracy model by identifying regions of the input space where a low-cost gating and LPC model are adequate to achieve high accuracy. In IoT applications, such low-complexity models can be deployed on the edge devices to perform gating and prediction. At test-time, for each input instance, the gating function decides whether or not the LPC model is adequate for accurate classification. Intuitively, “easy” examples can be correctly classified using only an LPC model while “hard” examples require the HPC model. 
By identifying which of the input\ninstances can be classi\ufb01ed accurately with LPCs we by-\npass the utilization of HPC model, thus reducing average\nprediction cost. The upper part of Figure 1 is a schematic\nof our approach, where x is feature vector and y is the\npredicted label; we aim to learn g and an LPC model to\nadaptively approximate the HPC. The key observation as\ndepicted in the lower \ufb01gure is that the probability of cor-\nrect classi\ufb01cation given x for a HPC model is in general\na highly complex function with higher values than that of\na LPC model. Yet there exists regions of the input space\nwhere the LPC has competitive accuracy (as shown to the\nright of the gating threshold). Sending examples in such\nregions (according to the gating function) to the LPC re-\nsults in no loss of prediction accuracy while reducing pre-\ndiction costs.\nThe problem would be simpler if our task were to pri-\nmarily partition the input space into regions where LPC\nmodels would suf\ufb01ce. The dif\ufb01culty is that we must also\nlearn a low-cost gating function capable of identifying in-\nput instances for which LPC suf\ufb01ces. Since both prediction and gating account for cost, we favor\ndesign strategies that lead to shared features and decision architectures between the gating function\nand the LPC model. We pose the problem as a discriminative empirical risk minimization problem\nthat jointly optimizes for gating and prediction models in terms of a joint margin-based objective\nfunction. The resulting objective is separately convex in gating and prediction functions. We propose\nan alternating minimization scheme that is guaranteed to converge since with appropriate choice of\nloss-functions (for instance, logistic loss), each optimization step amounts to a probabilistic approx-\nimation/projection (I-projection/M-projection) onto a probability space. 
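At test time the composite system reduces to a one-line routing rule. The following is a minimal sketch of that rule, not the paper's code: adaptive_predict, threshold and the callables g, f1, f0 are illustrative stand-ins for the learned gate and models, under the convention that a larger gate score means "send to the high-cost model".

```python
def adaptive_predict(x, g, f1, f0, threshold=0.0):
    """Route one input through the composite system: the low-cost gate g
    decides whether the low prediction-cost model f1 suffices, so the
    high prediction-cost model f0 is only invoked on 'hard' inputs."""
    if g(x) > threshold:  # gate flags x as hard: fall back to the HPC model
        return f0(x)
    return f1(x)          # 'easy' input: answered by the low-cost model
```

Only the gate and, on easy inputs, the low-cost model ever run, which is where the average cost saving comes from.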
While our method can be\nrecursively applied in multiple stages to successively approximate the adaptive system obtained in\nthe previous stage, thereby re\ufb01ning accuracy-cost trade-off, we observe that on benchmark datasets\neven a single stage of our method outperforms state-of-art in accuracy-cost performance.\n\nFigure 1: Upper: single stage schematic\nof our approach. We learn low-cost gating\ng and a LPC model to adaptively approx-\nimate a HPC model. Lower: Key insight\nfor adaptive approximation. x-axis repre-\nsents feature space; y-axis represents condi-\ntional probability of correct prediction; LPC\ncan match HPC\u2019s prediction in the input re-\ngion corresponding to the right of the gat-\ning threshold but performs poorly otherwise.\nOur goal is to learn a low-cost gating func-\ntion that attempts to send examples on the\nright to LPC and the left to HPC.\n\n2 Related Work\n\nLearning decision rules to minimize error subject to a budget constraint during prediction-time is an\narea of active interest[9, 17, 24, 19, 22, 20, 21, 13, 16]. Pre-trained Models: In one instantiation\n\n2\n\n\fof these methods it is assumed that there exists a collection of prediction models with amortized\ncosts [22, 19, 1] so that a natural ordering of prediction models can be imposed. In other instances,\nthe feature dimension is assumed to be suf\ufb01ciently low so as to admit an exhaustive enumeration of\nall the combinatorial possibilities [20, 21]. These methods then learn a policy to choose amongst\nthe ordered prediction models. In contrast we do not impose any of these restrictions. Top-Down\nMethods: For high-dimensional spaces, many existing approaches focus on learning complex adap-\ntive decision functions top-down [9, 24, 13, 21]. Conceptually, during training, top-down methods\nacquire new features based on their utility value. 
This requires exploration of partitions of the input\nspace together with different combinatorial low-cost feature subsets that would result in higher accu-\nracy. These methods are based on multi-stage exploration leading to combinatorially hard problems.\nDifferent novel relaxations and greedy heuristics have been developed in this context. Bottom-up\nMethods: Our work is somewhat related to [16], who propose to prune a fully trained random forests\n(RF) to reduce costs. Nevertheless, in contrast to our adaptive system, their perspective is to com-\npress the original model and utilize the pruned forest as a stand-alone model for test-time prediction.\nFurthermore, their method is speci\ufb01cally tailored to random forests.\nAnother set of related work includes classi\ufb01er cascade [5] and decision DAG [3], both of which aim\nto re-weight/re-order a set of pre-trained base learners to reduce prediction budget. Our method,\non the other hand, only requires to pre-train a high-accuracy model and jointly learns the low-cost\nmodels to approximate it; therefore ours can be viewed as complementary to the existing work.\nThe teacher-student framework [14] is also related to our bottom-up approach; a low-cost student\nmodel learns to approximate the teacher model so as to meet test-time budget. However, the goal\nthere is to learn a better stand-alone student model.\nIn contrast, we make use of both the low-\ncost (student) and high-accuracy (teacher) model during prediction via a gating function, which\nlearns the limitation of the low-cost (student) model and consult the high-accuracy (teacher) model if\nnecessary, thereby avoiding accuracy loss. Our composite system is also related to HME [10], which\nlearns the composite system based on max-likelihood estimation of models. A major difference\nis that HME does not address budget constraints. 
A fundamental aspect of budget constraints is the resulting asymmetry, whereby we start with an HPC model and sequentially approximate it with LPCs. This asymmetry leads us to propose a bottom-up strategy where the high-accuracy predictor can be separately estimated and is critical to posing a direct empirical loss minimization problem.\n\n3 Problem Setup\n\nWe consider the standard learning scenario of resource-constrained prediction with feature costs. A training sample S = {(x(i), y(i)) : i = 1, . . . , N} is generated i.i.d. from an unknown distribution, where x(i) ∈ ℝ^K is the feature vector with an acquisition cost cα ≥ 0 assigned to each of the features α = 1, . . . , K, and y(i) is the label for the ith example. In the case of multi-class classification y ∈ {1, . . . , M}, where M is the number of classes. Let us consider a single stage of our training method in order to formalize our setup. The model, f0, is a high prediction-cost (HPC) model, which is either a priori known, or which we train to high accuracy regardless of cost considerations. We would like to learn an alternative low prediction-cost (LPC) model f1. Given an example x, at test-time, we have the option of selecting which model, f0 or f1, to utilize to make a prediction. The accuracy of a prediction model fz is modeled by a loss function ℓ(fz(x), y), z ∈ {0, 1}. We exclusively employ the logistic loss function in binary classification: ℓ(fz(x), y) = log(1 + exp(−y fz(x))), although our framework allows other loss models. For a given x, we assume that once it pays the cost to acquire a feature, its value can be efficiently cached; its subsequent use does not incur additional cost. 
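The loss and the caching convention can be stated concretely. A minimal sketch (logistic_loss and prediction_cost are hypothetical helper names, not from the paper; costs maps a feature index α to its acquisition cost cα):

```python
import math

def logistic_loss(y, score):
    """l(f_z(x), y) = log(1 + exp(-y * f_z(x))) for labels y in {-1, +1}."""
    return math.log1p(math.exp(-y * score))

def prediction_cost(features_used, costs):
    """c(f_z, x): total acquisition cost of the *unique* features a model
    touches; a feature acquired once is cached, so repeats are free."""
    return sum(costs[a] for a in set(features_used))
```

For example, prediction_cost([0, 1, 1, 2], {0: 1.0, 1: 5.0, 2: 2.0}) charges feature 1 only once.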
Thus, the cost of utilizing a particular prediction model, denoted by c(fz, x), is computed as the sum of the acquisition costs of the unique features required by fz.\nOracle Gating: Consider a general gating likelihood function q(z|x) with z ∈ {0, 1}, that outputs the likelihood of sending the input x to a prediction model, fz. The overall empirical loss is:\n\nE_Sn E_q(z|x)[ℓ(fz(x), y)] = E_Sn[ℓ(f0(x), y)] + E_Sn[ q(1|x) (ℓ(f1(x), y) − ℓ(f0(x), y)) ],\n\nwhere the second term is the excess loss. The first term only depends on f0, and is from our perspective a constant. Similarly to the average loss, we can write the average cost as (assuming gating cost is negligible for now):\n\nE_Sn E_q(z|x)[c(fz, x)] = E_Sn[c(f0, x)] − E_Sn[ q(1|x) (c(f0, x) − c(f1, x)) ],\n\nwhere the second term is the cost reduction and the first term is again constant. We can characterize the optimal gating function (see [19]) that minimizes the overall average loss subject to the average cost constraint:\n\nℓ(f1, x) − ℓ(f0, x)  ≷  η (c(f0, x) − c(f1, x)),\n\nwith q(1|x) = 0 when the excess loss on the left exceeds η times the cost reduction on the right, and q(1|x) = 1 otherwise, for a suitable choice η ∈ ℝ. This characterization encodes the important principle that if the marginal cost reduction is smaller than the excess loss, we opt for the HPC model. Nevertheless, this characterization is generally infeasible. Note that the LHS depends on knowing how well the HPC performs on the input instance. 
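The oracle rule itself is compact; a sketch (oracle_gate is an illustrative name, and the per-example excess loss and cost reduction are precisely the quantities unavailable at test time, which is why this oracle cannot be implemented directly):

```python
def oracle_gate(excess_loss, cost_reduction, eta):
    """Oracle routing: choose the low-cost model f1 (z = 1) iff its excess
    loss  l(f1, x) - l(f0, x)  is below eta times the cost reduction
    c(f0, x) - c(f1, x); otherwise pay for the HPC model f0 (z = 0)."""
    return 1 if excess_loss < eta * cost_reduction else 0
```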
Since this information is unavailable, this target can be unreachable with low-cost gating.\nGating Approximation: Rather than directly enforcing a low-cost structure on q, we decouple the constraint and introduce a parameterized family of gating functions g ∈ G that attempts to mimic (or approximate) q. To ensure such approximation, we can minimize some distance measure D(q(·|x), g(x)). A natural choice for an approximation metric is the Kullback-Leibler (KL) divergence, although other choices are possible. The KL divergence between q and g is given by D_KL(q(·|x) ‖ g(x)) = Σ_z q(z|x) log( q(z|x) / σ(sgn(0.5 − z) g(x)) ), where σ(s) = 1/(1 + e^{−s}) is the sigmoid function. Besides the KL divergence, we have also proposed another, symmetrized metric that fits g directly to the log-odds ratio of q. See Suppl. Material for details.\nBudget Constraint: With the gating function g, the cost of predicting x depends on whether the example is sent to f0 or f1. Let c(f0, g, x) denote the feature cost of passing x to f0 through g. As discussed, this is equal to the sum of the acquisition costs of the unique features required by f0 and g for x. Similarly c(f1, g, x) denotes the cost if x is sent to f1 through g. In many cases the cost c(fz, g, x) is independent of the example x and depends primarily on the model being used. This is true for linear models, where each x must be processed through the same collection of features. For these cases c(fz, g, x) ≜ c(fz, g). The total budget simplifies to: E_Sn[q(0|x)] c(f0, g) + (1 − E_Sn[q(0|x)]) c(f1, g) = c(f1, g) + E_Sn[q(0|x)] (c(f0, g) − c(f1, g)). The budget thus depends on 3 quantities: E_Sn[q(0|x)], c(f1, g) and c(f0, g). 
Often f0 is a high-cost model that requires most, if not all, of the features, so c(f0, g) can be considered a large constant.\nThus, to meet the budget constraint, we would like to have (a) low-cost g and f1 (small c(f1, g)); and (b) a small fraction of examples being sent to the high-accuracy model (small E_Sn[q(0|x)]). We can therefore split the budget constraint into two separate objectives: (a) ensure low cost through the penalty Ω(f1, g) = γ Σ_α c_α ‖V_α + W_α‖_0, where γ is a tradeoff parameter and the indicator variables V_α, W_α ∈ {0, 1} denote whether or not feature α is required by f1 and g, respectively. Depending on the model parameterization, we can approximate Ω(f1, g) using a group-sparse norm or in a stage-wise manner, as we will see in Algorithms 1 and 2. (b) Ensure only a Pfull fraction of examples are sent to f0 via the constraint E_Sn[q(0|x)] ≤ Pfull.\nPutting It Together: We are now ready to pose our general optimization problem:\n\nmin_{f1 ∈ F, g ∈ G, q}  E_Sn[ Σ_z q(z|x) ℓ(fz(x), y) + D(q(·|x), g(x)) ] + Ω(f1, g)      (OPT)\nsubject to: E_Sn[q(0|x)] ≤ Pfull  (fraction sent to f0),\n\nwhere the three parts of the objective are the losses, the gating approximation and the costs, respectively. The objective function penalizes excess loss and ensures through the second term that this excess loss can be enforced through admissible gating functions. The third term penalizes the feature cost usage of f1 and g. The budget constraint limits the fraction of examples sent to the costly model f0.\nRemark 1: Directly parameterizing q leads to non-convexity. The average loss is a q-weighted sum of losses from the HPC and LPC; while the space of probability distributions is convex, a finite-dimensional parameterization is generally non-convex (e.g. sigmoid). 
What we have done is to keep q in non-parametric form to avoid non-convexity and only parameterize g, connecting both via a KL term. Thus, (OPT) is convex with respect to f1 and g for a fixed q. It is again convex in q for fixed f1 and g. Otherwise it would introduce non-convexity, as in prior work. For instance, in [5] a non-convex problem is solved in each inner loop iteration (line 7 of their Algorithm 1).\nRemark 2: We presented the case for a single-stage approximation system. However, it is straightforward to recursively continue this process. We can then view the composite system f0 ≜ (g, f1, f0) as a black-box predictor and train a new pair of gating and prediction models to approximate the composite system.\nRemark 3: To limit the scope of our paper, we focus on reducing feature acquisition cost during prediction as it is a more challenging (combinatorial) problem. However, other prediction-time costs such as computation cost can be encoded in the choice of the functional classes F and G in (OPT).\nSurrogate Upper Bound of Composite System: We can get better insight into the first two terms of the objective in (OPT) if we view z ∈ {0, 1} as a latent variable and consider the composite system Pr(y|x) = Σ_z Pr(z|x; g) Pr(y|x, fz). A standard application of Jensen's inequality reveals that −log(Pr(y|x)) ≤ E_q(z|x)[ℓ(fz(x), y)] + D_KL(q(z|x) ‖ Pr(z|x; g)). Therefore, the conditional entropy of the composite system is bounded by the expected value of our loss function (we overload notation and represent random variables in lower-case format):\n\nH(y | x) ≜ E[−log(Pr(y|x))] ≤ E_{x×y}[ E_q(z|x)[ℓ(fz(x), y)] + D_KL(q(z|x) ‖ Pr(z|x; g)) ].\n\nThis implies that the first two terms of our objective attempt to bound the loss of the composite system; the third term in the objective together with the constraint serves to enforce budget limits on the composite system.\nGroup Sparsity: Since the cost for feature re-use is zero, we encourage feature re-use among the gating and prediction models. So the fundamental question here is: how do we choose a common, sparse (low-cost) subset of features on which both g and f1 operate, such that g can effectively gate examples between f1 and f0 for accurate prediction? This is a hard combinatorial problem. The main contribution of our paper is to address it using the general optimization framework of (OPT).\n\n4 Algorithms\n\nAlgorithm 1 ADAPT-LIN\nInput: (x(i), y(i)), Pfull, γ\nTrain f0. Initialize g, f1.\nrepeat\n  Solve (OPT1) for q given g, f1.\n  Solve (OPT2) for g, f1 given q.\nuntil convergence\n\nTo be concrete, we instantiate our general framework (OPT) into two algorithms via different parameterizations of g, f1: ADAPT-LIN for the linear class and ADAPT-GBRT for the non-parametric class. Both of them use the KL divergence as the distance measure. We also provide a third algorithm, ADAPT-LSTSQ, that uses the symmetrized distance, in the Suppl. Material. All of the algorithms perform alternating minimization of (OPT) over q, g, f1. Note that convergence of the alternating minimization follows as in [8]. Common to all of our algorithms, we use two parameters to control cost: Pfull and γ. 
In practice they are swept to generate various cost-accuracy tradeoffs and we choose the best one satisfying the budget B using validation data.\n\nAlgorithm 2 ADAPT-GBRT\nInput: (x(i), y(i)), Pfull, γ\nTrain f0. Initialize g, f1.\nrepeat\n  Solve (OPT1) for q given g, f1.\n  for t = 1 to T do\n    Find f1^t using CART to minimize (1). f1 = f1 + f1^t. For each feature α used, set u_α = 0.\n    Find g^t using CART to minimize (2). g = g + g^t. For each feature α used, set u_α = 0.\n  end for\nuntil convergence\n\nADAPT-LIN: Let g(x) = g^T x and f1(x) = f1^T x be linear classifiers. A feature is used if the corresponding component is non-zero: V_α = 1 if f1,α ≠ 0, and W_α = 1 if g_α ≠ 0. The minimization for q solves the following problem:\n\nmin_q  (1/N) Σ_{i=1..N} [ (1 − q_i) A_i + q_i B_i − H(q_i) ]\ns.t.  (1/N) Σ_{i=1..N} q_i ≤ Pfull,      (OPT1)\n\nwhere we have used the shorthand notations q_i = q(z = 0|x(i)), H(q_i) = −q_i log(q_i) − (1 − q_i) log(1 − q_i), A_i = log(1 + e^{−y(i) f1^T x(i)}) + log(1 + e^{g^T x(i)}) and B_i = −log p(y(i)|z(i) = 0; f0) + log(1 + e^{−g^T x(i)}). This optimization has a closed-form solution: q_i = 1/(1 + e^{B_i − A_i + β}) for some non-negative constant β such that the constraint is satisfied. This optimization is also known as I-projection in information geometry because of the entropy term [8]. Having optimized q, we hold it constant and minimize with respect to g, f1 by solving the problem (OPT2), where we have relaxed the non-convex cost Σ_α c_α ‖V_α + W_α‖_0 into an L2,1 norm for group sparsity, with a tradeoff parameter γ to make sure the feature budget is satisfied:\n\nmin_{g, f1}  (1/N) Σ_{i=1..N} [ (1 − q_i) ( log(1 + e^{−y(i) f1^T x(i)}) + log(1 + e^{g^T x(i)}) ) + q_i log(1 + e^{−g^T x(i)}) ] + γ Σ_α c_α √(g_α² + f1,α²).      (OPT2)\n\nOnce we solve for g, f1, we can hold them constant and minimize with respect to q again. ADAPT-LIN is summarized in Algorithm 1.\nADAPT-GBRT: We can also consider a non-parametric family of classifiers such as gradient boosted trees [7]: g(x) = Σ_{t=1..T} g^t(x) and f1(x) = Σ_{t=1..T} f1^t(x), where g^t and f1^t are limited-depth regression trees. Since the trees are limited to low depth, we assume that the feature utility of each tree is example-independent: V_{α,t}(x) ≈ V_{α,t}, W_{α,t}(x) ≈ W_{α,t}, ∀x. V_{α,t} = 1 if feature α appears in f1^t, otherwise V_{α,t} = 0, and similarly for W_{α,t}. The optimization over q still solves (OPT1). We modify A_i = log(1 + e^{−y(i) f1(x(i))}) + log(1 + e^{g(x(i))}) and B_i = −log p(y(i)|z(i) = 0; f0) + log(1 + e^{−g(x(i))}). Next, to minimize over g, f1, denote the loss\n\nℓ(f1, g) = (1/N) Σ_{i=1..N} [ (1 − q_i) ( log(1 + e^{−y(i) f1(x(i))}) + log(1 + e^{g(x(i))}) ) + q_i log(1 + e^{−g(x(i))}) ],\n\nwhich is essentially the same as the first part of the objective in (OPT2). Thus, we need to minimize ℓ(f1, g) + Ω(f1, g) with respect to f1 and g. Since both f1 and g are gradient boosted trees, we naturally adopt a stage-wise approximation for the objective. 
In particular, we define an impurity function which on the one hand approximates the negative gradient of ℓ(f1, g) with the squared loss, and on the other hand penalizes the initial acquisition of features by their cost c_α. To capture the initial acquisition penalty, we let u_α ∈ {0, 1} indicate whether feature α has already been used in previous trees (u_α = 0) or not (u_α = 1); u_α is updated after adding each tree. Thus we arrive at the following impurities for f1 and g, respectively:\n\n(1/2) Σ_{i=1..N} ( −∂ℓ(f1, g)/∂f1(x(i)) − f1^t(x(i)) )² + γ Σ_α u_α c_α V_{α,t},      (1)\n\n(1/2) Σ_{i=1..N} ( −∂ℓ(f1, g)/∂g(x(i)) − g^t(x(i)) )² + γ Σ_α u_α c_α W_{α,t}.      (2)\n\nMinimizing such impurity functions balances minimizing the loss against re-using the already acquired features. Classification and Regression Trees (CART) [2] can be used to construct decision trees with such an impurity function. ADAPT-GBRT is summarized in Algorithm 2. Note that a similar impurity is used in GREEDYMISER [24]. Interestingly, if Pfull is set to 0, so that all the examples are forced to f1, then ADAPT-GBRT exactly recovers GREEDYMISER. In this sense, GREEDYMISER is a special case of our algorithm. 
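The negative gradients that the trees f1^t and g^t are fitted to in (1) and (2) have simple closed forms under the logistic losses above. A sketch assuming numpy (boosting_residuals is an illustrative name; q holds q_i = q(z = 0 | x_i) and y is in {-1, +1}):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def boosting_residuals(y, f1_scores, g_scores, q):
    """Negative gradients of the loss l(f1, g): each new tree f1^t (resp. g^t)
    is fit to r_f1 (resp. r_g) with squared loss, as in impurities (1)-(2)."""
    # d/df1 of (1 - q_i) * log(1 + exp(-y_i * f1(x_i))), negated:
    r_f1 = (1.0 - q) * y * sigmoid(-y * f1_scores)
    # d/dg of (1 - q_i) * log(1 + exp(g)) + q_i * log(1 + exp(-g)), negated:
    r_g = q * sigmoid(-g_scores) - (1.0 - q) * sigmoid(g_scores)
    return r_f1, r_g
```

A useful sanity check: when σ(g(x_i)) already equals q_i, the gate residual r_g vanishes, i.e. g matches the target routing distribution.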
As we will see in the next section, thanks to the\nbottom-up approach, ADAPT-GBRT bene\ufb01ts from high-accuracy initialization and is able to perform\naccuracy-cost tradeoff in accuracy levels beyond what is possible for GREEDYMISER.\n\n5 Experiments\n\nBASELINE ALGORITHMS: We consider the following simple L1 baseline approach for learning\nf1 and g: \ufb01rst perform a L1-regularized logistic regression on all data to identify a relevant, sparse\nsubset of features; then learn f1 using training data restricted to the identi\ufb01ed feature(s); \ufb01nally,\nlearn g based on the correctness of f1 predictions as pseudo labels (i.e. assign pseudo label 1 to\nexample x if f1(x) agrees with the true label y and 0 otherwise). We also compare with two state-\nof-the-art feature-budgeted algorithms: GREEDYMISER[24] - a top-down method that builds out an\n\n6\n\n\fensemble of gradient boosted trees with feature cost budget; and BUDGETPRUNE[16] - a bottom-up\nmethod that prunes a random forest with feature cost budget. A number of other methods such as\nASTC [13] and CSTC [23] are omitted as they have been shown to under-perform GREEDYMISER\non the same set of datasets [15]. Detailed experiment setups can be found in the Suppl. Material.\nWe \ufb01rst visualize/verify the adaptive approximation ability of ADAPT-LIN and ADAPT-GBRT on the\nSynthetic-1 dataset without feature costs. Next, we illustrate the key difference between ADAPT-LIN\nand the L1 baseline approach on the Synthetic-2 as well as the Letters datasets. Finally, we compare\nADAPT-GBRT with state-of-the-art methods on several resource constraint benchmark datasets.\n\n(a) Input Data\n\n(b) Lin Initialization\n\n(c) Lin after 10 iterations\n\n(d) RBF Contour\n\n(e) Gbrt Initialization\n\n(f) Gbrt after 10 iterations\n\nFigure 2: Synthetic-1 experiment without feature cost. (a): input data. (d): decision contour of\nRBF-SVM as f0. 
(b) and (c): decision boundaries of linear g and f1 at initialization and after 10 iterations of ADAPT-LIN. (e) and (f): decision boundaries of boosted-tree g and f1 at initialization and after 10 iterations of ADAPT-GBRT. Examples in the beige areas are sent to f0 by g.\n\nPOWER OF ADAPTATION: We construct a 2D binary classification dataset (Synthetic-1) as shown in (a) of Figure 2. We learn an RBF-SVM as the high-accuracy classifier f0, as in (d). To better visualize the adaptive approximation process in 2D, we turn off the feature costs (i.e. set Ω(f1, g) to 0 in (OPT)) and run ADAPT-LIN and ADAPT-GBRT. The initialization of g and f1 in (b) results in wrong predictions for many red points in the blue region. After 10 iterations of ADAPT-LIN, f1 adapts much better to the local region assigned by g, while g sends about 60% (Pfull) of the examples to f0. Similarly, the initialization in (e) results in wrong predictions in the blue region. ADAPT-GBRT is able to identify the ambiguous region in the center and send those examples to f0 via g. Both of our algorithms maintain the same level of prediction accuracy as f0 yet are able to classify large fractions of examples via much simpler models.\nPOWER OF JOINT OPTIMIZATION: We return to the problem of prediction under feature budget constraints. We illustrate why a simple L1 baseline approach for learning f1 and g would not work using a 2D dataset (Synthetic-2), as shown in Figure 3 (left). The data points are distributed in four clusters, with black triangles and red circles representing the two class labels. Let both features 1 and 2 carry unit acquisition cost. A complex classifier f0 that acquires both features can achieve full accuracy at a cost of 2. In our synthetic example, clusters 1 and 2 are given more data points so that the L1-regularized logistic regression would produce the vertical red dashed line, separating cluster 1 from the others. 
So feature 1 is acquired for both g and f1. The best such an adaptive system can do is to send cluster 1 to f1 and the other three clusters to the complex classifier f0, incurring an average cost of 1.75, which is sub-optimal. ADAPT-LIN, on the other hand, optimizing over q, g, f1 in an alternating manner, is able to recover the horizontal lines in Figure 3 (left) for g and f1: g sends the first two clusters to the full classifier and the last two clusters to f1, and f1 correctly classifies clusters 3 and 4. So all of the examples are correctly classified by the adaptive system; yet only feature 2 needs to be acquired for clusters 3 and 4, so the overall average feature cost is 1.5, as shown by the solid curve in the accuracy-cost tradeoff plot on the right of Figure 3. This example shows that the L1 baseline approach is sub-optimal as it does not optimize the selection of feature subsets jointly for g and f1.\n\nFigure 3: A 2-D synthetic example for adaptive feature acquisition. On the left: data distributed in four clusters. The two features correspond to the x and y coordinates, respectively. On the right: accuracy-cost tradeoff curves. Our algorithm can recover the optimal adaptive system whereas an L1-based approach cannot.\n\n(a) MiniBooNE (b) Forest Covertype (c) Yahoo! Rank (d) CIFAR10\n\nFigure 4: Comparison of ADAPT-GBRT against GREEDYMISER and BUDGETPRUNE on four benchmark datasets. RF is used as f0 for ADAPT-GBRT in (a-c) while an RBF-SVM is used as f0 in (d). ADAPT-GBRT achieves better accuracy-cost tradeoff than the other methods. The gap is significant in (b), (c) and (d). Note the accuracy of GREEDYMISER in (b) never exceeds 0.86 and its precision in (c) slowly rises to 0.138 at a cost of 658. 
We limit the cost range for a clearer comparison.

REAL DATASETS: We test various aspects of our algorithms and compare with state-of-the-art feature-budgeted algorithms on five real-world benchmark datasets: Letters, MiniBooNE Particle Identification and Forest Covertype from the UCI repository [6], CIFAR-10 [11] and Yahoo! Learning to Rank [4]. Yahoo! is a ranking dataset where each example is associated with the features of a query-document pair together with the relevance rank of the document to the query. There are 519 such features in total; each is associated with an acquisition cost in the set {1,5,20,50,100,150,200}, which represents the units of CPU time required to extract the feature and is provided by a Yahoo! employee. The labels are binarized into relevant or not relevant. The task is to learn a model that takes a new query and its associated documents and produces a relevance ranking so that the relevant documents come on top, using as little feature cost as possible. The performance metric is Average Precision @ 5, following [16]. The other datasets have unknown feature costs, so we assign a cost of 1 to all features; the aim is to show that ADAPT-GBRT successfully selects a sparse subset of "useful" features for f1 and g.

Table 1: Dataset Statistics

Dataset    | #Train | #Validation | #Test  | #Features | Feature Costs
Letters    | 12000  | 4000        | 4000   | 16        | Uniform
MiniBooNE  | 45523  | 19510       | 65031  | 50        | Uniform
Forest     | 36603  | 15688       | 58101  | 54        | Uniform
CIFAR10    | 19761  | 8468        | 10000  | 400       | Uniform
Yahoo!     | 141397 | 146769      | 184968 | 519       | CPU units
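The average feature cost an adaptive system pays follows the accounting used in the Synthetic-2 example: every example pays for the features acquired by the gate g and the low-cost model f1, and examples routed to f0 additionally pay for the remaining features f0 needs. A small sketch of this arithmetic (the function and variable names are ours; the routing vectors mirror the four-cluster example, one entry per cluster with unit feature costs):

```python
def average_feature_cost(routed_to_full, costs, cheap_features, full_features):
    """Average per-example acquisition cost of an adaptive system.
    routed_to_full: 0/1 flag per example (1 = sent to f0 by the gate).
    cheap_features: indices of features used by g and f1.
    full_features:  indices of features used by f0."""
    cheap_cost = sum(costs[j] for j in cheap_features)
    extra_cost = sum(costs[j] for j in full_features - cheap_features)
    n_full = sum(routed_to_full)
    return cheap_cost + extra_cost * n_full / len(routed_to_full)

# Synthetic-2 accounting, features {0, 1} standing for features 1 and 2:
costs = [1.0, 1.0]
# L1 baseline: g and f1 use feature 1; clusters 2-4 (3 of 4) go to f0.
l1_cost = average_feature_cost([0, 1, 1, 1], costs, {0}, {0, 1})     # 1.75
# Joint optimization: g and f1 use feature 2; clusters 1-2 go to f0.
joint_cost = average_feature_cost([1, 1, 0, 0], costs, {1}, {0, 1})  # 1.5
```

The same formula applies on the real datasets, with the uniform or CPU-unit costs from Table 1 in place of the unit costs above.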
We summarize the statistics of these datasets in Table 1. Next, we highlight the key insights from the real-dataset experiments.

Generality of Approximation: Our framework allows approximation of powerful classifiers such as RBF-SVM and Random Forests, shown in Figure 5 as red and black curves, respectively. In particular, ADAPT-GBRT maintains high accuracy well while reducing cost. This is a key advantage for our algorithms because we can choose to approximate whichever f0 achieves the best accuracy.

ADAPT-LIN Vs L1: Figure 5 shows that ADAPT-LIN outperforms the L1 baseline method on real datasets as well. Again, this confirms the intuition from the Synthetic-2 example: ADAPT-LIN is able to iteratively select the common subset of features jointly for g and f1.

ADAPT-GBRT Vs ADAPT-LIN: ADAPT-GBRT leads to significantly better performance than ADAPT-LIN in approximating both RBF-SVM and RF, as shown in Figure 5. This is expected, as the non-parametric non-linear classifiers are much more powerful than linear ones.

ADAPT-GBRT Vs BUDGETPRUNE: Both are bottom-up approaches that benefit from good initializations. In (a), (b) and (c) of Figure 4 we let f0 in ADAPT-GBRT be the same RF that BUDGETPRUNE starts with. ADAPT-GBRT is able to maintain high accuracy longer as the budget decreases. Thus, ADAPT-GBRT improves on the state-of-the-art bottom-up method. Notice in (c) of Figure 4 that around the cost of 100, BUDGETPRUNE has a spike in precision.
We believe this is because the initial pruning improved the generalization performance of the RF. But in the cost region of 40-80, ADAPT-GBRT maintains much better accuracy than BUDGETPRUNE. Furthermore, ADAPT-GBRT has the freedom to approximate the best f0 for the given problem. So in (d) of Figure 4 we see that with f0 being an RBF-SVM, ADAPT-GBRT can achieve much higher accuracy than BUDGETPRUNE.

ADAPT-GBRT Vs GREEDYMISER: ADAPT-GBRT outperforms GREEDYMISER on all the datasets. The gaps in Figure 5 and in (b), (c) and (d) of Figure 4 are especially significant.

Significant Cost Reduction: Without sacrificing top accuracies (within 1%), ADAPT-GBRT reduces average feature costs during test-time by around 63%, 32%, 58%, 12% and 31% on the MiniBooNE, Forest, Yahoo!, CIFAR10 and Letters datasets, respectively.

Figure 5: Comparison of the L1 baseline approach, ADAPT-LIN and ADAPT-GBRT, based on RBF-SVM and RF as f0's, on the Letters dataset.

6 Conclusions

We presented an adaptive approximation approach to account for prediction costs that arise in various applications. At test-time our method uses a gating function to identify, among a collection of models, a prediction model that is adapted to the input. The overall goal is to reduce costs without sacrificing accuracy. We learn gating and prediction models by means of a bottom-up strategy that trains low prediction-cost models to approximate high prediction-cost models in regions where the low-cost models suffice. On a number of benchmark datasets our method leads to an average of 40% cost reduction without sacrificing test accuracy (within 1%). It outperforms state-of-the-art top-down and bottom-up budgeted learning algorithms, with a significant margin in several cases.

Acknowledgments

Feng Nan would like to thank Dr Ofer Dekel for ideas and discussions on resource-constrained machine learning during an internship at Microsoft Research in summer 2016.
Familiarity and intuition gained during the internship contributed to the motivation and formulation in this paper. We also thank Dr Joseph Wang and Tolga Bolukbasi for discussions and help with the experiments. This material is based upon work supported in part by NSF Grants CCF: 1320566, CNS: 1330008, CCF: 1527618, DHS 2013-ST-061-ED0001, NGA Grant HM1582-09-1-0037 and ONR Grant N00014-13-C-0288.

References

[1] Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for efficient inference. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 527–536, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[2] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC press, 1984.

[3] Róbert Busa-Fekete, Djalel Benbouzid, and Balázs Kégl. Fast classification using sparse decision dags. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.

[4] O Chapelle, Y Chang, and T Liu, editors. Proceedings of the Yahoo! Learning to Rank Challenge, held at ICML 2010, Haifa, Israel, June 25, 2010, 2011.

[5] Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, Olivier Chapelle, and Dor Kedem. Classifier cascade for minimizing feature evaluation cost. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, April 21-23, 2012, pages 218–226, 2012.

[6] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[7] Jerome H. Friedman.
Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.

[8] Kuzman Ganchev, Ben Taskar, and João Gama. Expectation maximization and posterior constraints. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 569–576. Curran Associates, Inc., 2008.

[9] T. Gao and D. Koller. Active classification based on value of classifier. In Advances in Neural Information Processing Systems (NIPS 2011), 2011.

[10] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Comput., 6(2):181–214, March 1994.

[11] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, 2009.

[12] Ashish Kumar, Saurabh Goyal, and Manik Varma. Resource-efficient machine learning in 2 KB RAM for the internet of things. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1935–1944, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[13] M Kusner, W Chen, Q Zhou, E Zhixiang, K Weinberger, and Y Chen. Feature-cost sensitive learning with submodular trees of classifiers. In AAAI Conference on Artificial Intelligence, 2014.

[14] D. Lopez-Paz, B. Schölkopf, L. Bottou, and V. Vapnik. Unifying distillation and privileged information. In International Conference on Learning Representations, 2016.

[15] Feng Nan, Joseph Wang, and Venkatesh Saligrama. Feature-budgeted random forest. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1983–1991. JMLR Workshop and Conference Proceedings, 2015.

[16] Feng Nan, Joseph Wang, and Venkatesh Saligrama. Pruning random forests for prediction on a budget.
In Advances in Neural Information Processing Systems 29, pages 2334–2342. Curran Associates, Inc., 2016.

[17] Feng Nan, Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. Fast margin-based cost-sensitive classification. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, 2014.

[18] Daniel P. Robinson and Suchi Saria. Trading-off cost of deployment versus accuracy in learning predictive models. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 1974–1982. AAAI Press, 2016.

[19] K Trapeznikov and V Saligrama. Supervised sequential classification under budget constraints. In International Conference on Artificial Intelligence and Statistics, pages 581–589, 2013.

[20] Joseph Wang, Tolga Bolukbasi, Kirill Trapeznikov, and Venkatesh Saligrama. Model Selection by Linear Programming, pages 647–662. Springer International Publishing, Cham, 2014.

[21] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. Efficient learning by directed acyclic graph for resource constrained prediction. In Advances in Neural Information Processing Systems 28, pages 2143–2151. Curran Associates, Inc., 2015.

[22] D. Weiss, B. Sapp, and B. Taskar. Dynamic structured model selection. In 2013 IEEE International Conference on Computer Vision, pages 2656–2663, Dec 2013.

[23] Z Xu, M Kusner, M Chen, and K. Q Weinberger. Cost-sensitive tree of classifiers. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[24] Zhixiang Eddie Xu, Kilian Q. Weinberger, and Olivier Chapelle.
The greedy miser: Learning under test-time budgets. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.