{"title": "Faster Boosting with Smaller Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 11371, "page_last": 11380, "abstract": "State-of-the-art implementations of boosting, such as XGBoost and LightGBM, can process large training sets extremely fast. However, this performance requires that the memory size is sufficient to hold a 2-3 multiple of the training set size. This paper presents an alternative approach to implementing the boosted trees, which achieves a significant speedup over XGBoost and LightGBM, especially when the memory size is small. This is achieved using a combination of three techniques: early stopping, effective sample size, and stratified sampling. Our experiments demonstrate a 10-100 speedup over XGBoost when the training data is too large to fit in memory.", "full_text": "Faster Boosting with Smaller Memory\n\nJulaiti Alafate\n\nDepartment of Computer Science and Engineering\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92093\n\nYoav Freund\n\nDepartment of Computer Science and Engineering\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92093\n\nAbstract\n\nState-of-the-art implementations of boosting, such as XGBoost and LightGBM,\ncan process large training sets extremely fast. However, this performance requires\nthat the memory size is suf\ufb01cient to hold a 2-3 multiple of the training set size. This\npaper presents an alternative approach to implementing the boosted trees, which\nachieves a signi\ufb01cant speedup over XGBoost and LightGBM, especially when the\nmemory size is small. This is achieved using a combination of three techniques:\nearly stopping, effective sample size, and strati\ufb01ed sampling. Our experiments\ndemonstrate a 10-100 speedup over XGBoost when the training data is too large to\n\ufb01t in memory.\n\n1\n\nIntroduction\n\nBoosting [7, 16], and in particular gradient boosted trees [9], are some of the most popular learning\nalgorithms used in practice. 
There are several highly optimized implementations of boosting, among which XGBoost [5] and LightGBM [12] are two of the most popular. These implementations can train models with hundreds of trees using millions of training examples in a matter of minutes. However, a significant limitation of these methods is that all of the training examples must be stored in main memory. For LightGBM this requirement is strict. XGBoost can operate in disk mode, which makes it possible to use machines with less memory than the training set size. However, this comes at the cost of much longer training time.\n\nIn this paper, we present a new implementation of boosted trees1. This implementation can run efficiently on machines whose memory is much smaller than the training set. This is achieved with no loss in accuracy, and with a speedup of 10-100x over XGBoost in disk mode.\n\nOur method is based on the observation that each boosting step corresponds to an estimation of the gradient along the axis defined by a weak rule. The common approach to performing this estimation is to scan all of the training examples so as to minimize the estimation error. This operation is very expensive, especially when the training set does not fit in memory.\n\nWe reduce the number of examples scanned in each boosting iteration by combining two ideas. First, we use early stopping [19] to minimize the number of examples scanned at each boosting iteration.\n\n1The source code of the implementation is released at\n\n.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSecond, we keep in memory only a sample of the training set, and we replace the sample when the sample in memory is a poor representation of the complete training set. 
We exploit the fact that boosting tends to place large weights on a small subset of the training set, thereby reducing the effectiveness of the memory-resident training set. We propose a measure for quantifying the variation in weights called the effective number of examples. We also describe an efficient sampling algorithm, stratified weighted sampling.\n\nEarly stopping for boosting was studied in previous work [6, 4]. The other two techniques are, to the best of our knowledge, novel. In the following paragraphs, we give a high-level description of these three ideas, which will be elaborated on in the rest of the paper.\n\nEarly Stopping We use early stopping to reduce the number of examples that the boosting algorithm reads from memory to the CPU. A boosting algorithm iteratively adds a weak rule to the combined strong rule. In most implementations, the algorithm searches for the best weak rule, which requires scanning all of the training examples. However, the theory of boosting only requires the added weak rule to be significantly better than random guessing, which does not require scanning all of the training examples. Instead, our approach is to read just as many examples as needed to identify a weak rule that is significantly better than random guessing.\n\nOur approach is based on sequential analysis and early stopping [19]. Using sequential analysis methods, we designed a stopping rule to decide when to stop reading more examples without increasing the chance of over-fitting.\n\nEffective Number of Examples Boosting assigns different weights to different examples. The weight of an example represents the magnitude of its "influence" on the estimate of the gradient. However, when the weight distribution of a training set is dominated by a small number of "heavy" examples, the variance of the gradient estimates is high. This leads to over-fitting, and effectively reduces the size of the training set. 
We quantify this reduction using the effective number of examples, neff. To get reliable estimates, neff should be close to the size of the current training set in memory, n. When neff/n is small, we flush the current training set and get a new sample using weighted sampling.\n\nStratified Weighted Sampling While there are well-known methods for weighted sampling, all of the existing methods (that we know of) are inefficient when the weights are highly skewed. In such cases, most of the scanned examples are rejected, which leads to very slow sampling. To increase the sampling efficiency, we introduce a technique we call stratified weighted sampling. It generates the same sampled distribution while guaranteeing that the fraction of rejected examples is no larger than 1/2.\n\nWe implemented a new boosted tree algorithm with these three techniques, called Sparrow. We compared its performance to the performance of XGBoost and LightGBM on two large datasets: one with 50 million examples (the human acceptor splice site dataset [18, 1]), the other with over 600 million examples (the bathymetry dataset [11]). We show that Sparrow can achieve a 10-20x speed-up over LightGBM and XGBoost, especially in limited-memory settings.\n\nThe rest of the paper is organized as follows. In Section 2 we discuss the related work. In Section 3 we review the confidence-based boosting algorithm. In Section 4 we describe the statistical theory behind the design of Sparrow. In Section 5 we describe the design of our implementation. In Section 6 we describe our experiments. We conclude with future work directions in Section 7.\n\n2 Related Work\n\nThere are several methods that use sampling to reduce the training time of boosting. Both Friedman et al. 
[8] and LightGBM [12] use a fixed threshold to filter out the light-weight examples: the former discards the examples whose weights are smaller than the threshold; the latter accepts all examples whose gradients exceed the threshold, and accepts the remaining examples with a fixed probability. The major difference from Sparrow is that their sampling methods are biased, while Sparrow does not change the original data distribution. Appel et al. [2] use small samples to prune weak rules associated with unpromising features, and only scan all samples to evaluate the remaining ones. The major difference from Sparrow is that they focus on finding the "best" weak rule, while Sparrow tries to find a "statistically significant" one. Scanning all examples is required for the former, while with a stopping rule our algorithm often stops after reading a small fraction of the examples.\n\nThe idea of accelerating boosting with stopping rules was also studied by Domingo and Watanabe [6] and Bradley and Schapire [4]. Our contribution is in using a tighter stopping rule. Our stopping rule is tighter because it takes into account the dependence on the variance of the sample weights.\n\nThere are several techniques that speed up boosting by taking advantage of the sparsity of the dataset [5, 12]. We will consider those techniques in future work.\n\n3 Confidence-Rated Boosting\n\nWe start with a brief description of the confidence-rated boosting algorithm under the AdaBoost framework (Algorithm 9.1 on page 274 of [16]).\n\nLet $\vec{x} \in X$ be the feature vectors and let the output be $y \in Y = \{-1, +1\}$. For a joint distribution D over $X \times Y$, our goal is to find a classifier $c : X \to Y$ with small error:\n\n$$\mathrm{err}_D(c) \doteq P_{(\vec{x},y)\sim D}\left[c(\vec{x}) \neq y\right].$$\n\nWe are given a set H of base classifiers (weak rules) $h : X \to [-1, +1]$. 
We want to generate a score function which is a weighted sum of T rules from H:\n\n$$S_T(\vec{x}) = \sum_{t=1}^{T} \alpha_t h_t(\vec{x}).$$\n\nThe term $\alpha_t$ is the weight with which each base classifier contributes to the final prediction, and is decided by the specific boosting paradigm. Finally, we have the strong classifier as the sign of the score function: $H_T = \mathrm{sign}(S_T)$.\n\nAdaBoost can be viewed as a coordinate-wise gradient descent algorithm [15]. The algorithm iteratively finds the direction (weak rule) which maximizes the decrease of the average potential function, and then adds this weak rule to $S_T$ with a weight that is proportional to the magnitude of the gradient. The potential function used in AdaBoost is $\phi(\vec{x}, y) = e^{-S_T(\vec{x})y}$. Other potential functions have been studied (e.g. [9]). In this work we focus on the potential function used in AdaBoost.\n\nWe distinguish between two types of average potentials: the expected potential or true potential:\n\n$$\Phi(S_T) = E_{(\vec{x},y)\sim D}\left[e^{-S_T(\vec{x})y}\right],$$\n\nand the average potential or empirical potential:\n\n$$\hat{\Phi}(S_T) = \frac{1}{n}\sum_{i=1}^{n} e^{-S_T(\vec{x}_i)y_i}.$$\n\nThe ultimate goal of the boosting algorithm is to minimize the expected potential, which determines the true error rate. However, most boosting algorithms, including XGBoost and LightGBM, focus on minimizing the empirical potential $\hat{\Phi}(S_T)$, and rely on the limited capacity of the weak rules to guarantee that the true potential is also small. Sparrow takes a different approach. It uses an estimator of the true edge (explained below) to identify weak rules that reduce the true potential with high probability.\n\nAdding a weak rule $h_t$ to the score function $S_{t-1}$ gives $S_t = S_{t-1} + \alpha_t h_t$. 
Taking the partial derivative of the average potential with respect to $\alpha_t$, we get\n\n$$\frac{\partial}{\partial \alpha_t}\Phi(S_{t-1} + \alpha_t h)\Big|_{\alpha_t=0} = -E_{(\vec{x},y)\sim D_{t-1}}\left[h(\vec{x})y\right], \quad (1)$$\n\nwhere\n\n$$D_{t-1} = \frac{D}{Z_{t-1}}\exp\left(-S_{t-1}(\vec{x})y\right), \quad (2)$$\n\nand $Z_{t-1}$ is a normalization factor that makes $D_{t-1}$ a distribution.\n\nBoosting algorithms perform coordinate-wise gradient descent on the average potential, where each coordinate corresponds to one weak rule. Using equation (1), we can express the gradient with respect to the weak rule h as a correlation, which we call the true edge:\n\n$$\gamma_t(h) \doteq \mathrm{corr}_{D_{t-1}}(h) \doteq E_{(\vec{x},y)\sim D_{t-1}}\left[h(\vec{x})y\right], \quad (3)$$\n\nwhich is not directly measurable. Given n i.i.d. samples, an unbiased estimate for the true edge is the empirical edge:\n\n$$\hat{\gamma}_t(h) \doteq \widehat{\mathrm{corr}}_{D_{t-1}}(h) \doteq \sum_{i=1}^{n}\frac{w_i}{Z_{t-1}}\, h(\vec{x}_i)y_i, \quad (4)$$\n\nwhere $w_i = e^{-S_{t-1}(\vec{x}_i)y_i}$ and $Z_{t-1} = \sum_{i=1}^{n} w_i$.\n\n4 Theory\n\nTo decrease the expected potential, we want to find a weak rule with a large edge (and add it to the score function). XGBoost and LightGBM do this by searching for the weak rule with the largest empirical edge. Sparrow finds a weak rule which, with high probability, has a significantly large true edge. Next, we explain the statistical techniques for identifying such weak rules while minimizing the number of examples needed to compute the estimates.\n\n4.1 Effective Number of Examples\n\nEquation 4 defines $\hat{\gamma}(h)$, which is an unbiased estimate of $\gamma(h)$. How accurate is this estimate? A standard quantifier is the variance of the estimator. Suppose the true edge of a weak rule h is $\gamma$. Then the expected (normalized) correlation between the predictions of h and the true labels, $\frac{w}{Z}yh(\vec{x})$, is $2\gamma$. The variance of this correlation can be written as $\frac{1}{n}\left(\frac{E(w^2)}{E^2(w)} - 4\gamma^2\right)$. Ignoring the second term (because $\gamma$ is usually close to zero) and the variance in E(w), we approximate the variance of the edge to be\n\n$$\mathrm{Var}(\hat{\gamma}) \approx \frac{\sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}. \quad (5)$$\n\nIf all of the weights are equal then $\mathrm{Var}(\hat{\gamma}) = 1/n$. This corresponds to a standard deviation of $1/\sqrt{n}$, which is the expected relation between the sample size and the error. If the weights are not equal then the variance is larger and thus the estimate is less accurate. We define the effective number of examples neff to be $1/\mathrm{Var}(\hat{\gamma})$, specifically,\n\n$$n_{\mathrm{eff}} \doteq \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}. \quad (6)$$\n\nTo see that the name "effective number of examples" makes sense, consider n weights where $w_1 = \cdots = w_k = 1/k$ and $w_{k+1} = \cdots = w_n = 0$. It is easy to verify that in this case $n_{\mathrm{eff}} = k$, which agrees with our intuition, namely that examples with zero weight do not affect the estimate.\n\nSuppose the memory is only large enough to store n examples. If $n_{\mathrm{eff}} \ll n$ then we are wasting valuable memory space on examples with small weights, which can significantly increase the chance of over-fitting. We can fix this problem by using weighted sampling. In this way we repopulate memory with n equally weighted examples, and make it possible to learn without over-fitting.\n\n4.2 Weighted Sampling\n\nWhen Sparrow detects that neff is much smaller than the memory size n, it clears the memory and collects a new sample from disk using weighted sampling. The specific sampling algorithm that Sparrow uses is minimal variance weighted sampling [13]. This method reads from disk one example $(\vec{x}, y)$ at a time, calculates the weight of that example, and accepts the example with probability proportional to its weight. Accepted examples are stored in memory with an initial weight of 1. 
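To make these quantities concrete, the empirical edge of equation (4) and the effective number of examples of equation (6) can each be computed in a single pass over the in-memory weights. The following is a minimal pure-Python sketch of our own (the Sparrow implementation itself is in Rust); the function names are ours, not the paper's:

```python
import math

def empirical_edge(h_preds, labels, scores):
    """Empirical edge (eq. 4): weighted correlation between a weak rule's
    predictions h(x_i) in [-1, +1] and labels y_i in {-1, +1}, under the
    boosting weights w_i = exp(-S(x_i) * y_i)."""
    w = [math.exp(-s * y) for s, y in zip(scores, labels)]
    return sum(wi * h * y for wi, h, y in zip(w, h_preds, labels)) / sum(w)

def effective_sample_size(weights):
    """Effective number of examples (eq. 6): (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Before any boosting (all scores 0) the weights are uniform, so the
# edge is the plain correlation between predictions and labels:
labels = [+1, +1, -1, -1]
preds = [+1, +1, -1, +1]          # rule is right on 3 of 4 examples
edge = empirical_edge(preds, labels, [0.0] * 4)            # -> 0.5
# k equal nonzero weights among n examples give n_eff = k:
n_eff = effective_sample_size([0.1] * 10 + [0.0] * 990)    # -> 10.0
```

The zero-weight sanity check matches the text: examples with zero weight contribute nothing, so only the k "live" examples count.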
Resampling in this way increases the effective sample size from neff back to n, thereby reducing the chance of over-fitting.\n\nTo gain some intuition regarding this effect, consider the following setup of an imbalanced classification problem. Suppose that the training set size is N = 100,000, of which 0.01N are positive and 0.99N are negative. Suppose we can store n = 2,000 examples in memory. The number of memory-resident positive examples is then 0.01n = 20. Clearly, with such a small number of positive examples, there is a danger of over-fitting. However, an (almost) all-negative rule is 99% correct. If we then reweigh the examples using the AdaBoost rule, we will give half of the total weight to the positives and the other half to the negatives. The value of neff will drop to about 80. This triggers a resampling step, which generates a training set with 1000 positives and 1000 negatives. This allows us to find additional weak rules with little danger of over-fitting.\n\nThis process continues as long as Sparrow is making progress and the weights are becoming increasingly skewed. When the skew is large, neff is small and Sparrow draws a new sample with uniform weights.\n\nSparrow uses weighted sampling to achieve high disk-to-memory efficiency. In addition, Sparrow achieves high memory-to-CPU efficiency by reading from memory the minimal number of examples necessary to establish that a particular weak rule has a significant edge. This is done using sequential analysis and early stopping.\n\n4.3 Sequential Analysis\n\nSequential analysis was introduced by Wald in the 1940s [19]. Suppose we want to estimate the expected loss of a model. In the standard large deviation analysis, we assume that the loss is bounded in some range, say [-M, +M], and that the size of the training set is n. This implies that the standard deviation of the training loss is at most $M/\sqrt{n}$. 
To make this standard deviation smaller than some $\epsilon > 0$, we need $n > (M/\epsilon)^2$. While this analysis is optimal in the worst case, it can be improved if we have additional information about the standard deviation. We can glean such information from the observed losses by using the following sequential analysis method.\n\nInstead of choosing n ahead of time, the algorithm computes the loss one example at a time. It uses a stopping rule to decide whether, conditioned on the sequence of losses seen so far, the difference between the average loss and the true loss is smaller than $\epsilon$ with large probability. The result is that when the standard deviation is significantly smaller than $M/\sqrt{n}$, the number of examples needed in the estimate is much smaller than $(M/\epsilon)^2$.\n\nWe use a stopping rule based on Theorem 1 in Appendix B, which depends on both the mean and the variance of the weighted correlation [3]. Fixing the current strong rule H (i.e. the score function), we define an (unnormalized) weight for each example, denoted as $w(\vec{x}, y) = e^{-H(\vec{x})y}$. Consider a particular candidate weak rule h and a sequence of labeled examples $\{(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots\}$. For some $\gamma > 0$, we define two cumulative quantities (after seeing n examples from the sequence):\n\n$$M_t \doteq \sum_{i=1}^{n} w(\vec{x}_i, y_i)\left(h_t(\vec{x}_i)y_i - \gamma\right), \quad \text{and} \quad V_t \doteq \sum_{i=1}^{n} w(\vec{x}_i, y_i)^2. \quad (7)$$\n\n$M_t$ is an estimate of the difference between the true correlation of h and $\gamma$. $V_t$ quantifies the variance of this estimate.\n\nThe goal of the stopping rule is to identify a weak rule h whose true edge is larger than $\gamma$. The rule is defined to be $t > t_0$ and\n\n$$M_t > C\sqrt{V_t\left(\log\log\frac{V_t}{M_t} + B\right)}, \quad (8)$$\n\nwhere $t_0$, C, and B are parameters. If both conditions of the stopping rule are true, we claim that the true edge of h is larger than $\gamma$ with high probability. 
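The stopping rule of equations (7)-(8) only requires two running sums per candidate rule, so it is cheap to check as examples stream by. Below is an illustrative sketch; the parameter values for t0, C, and B are placeholders of ours (not the values Sparrow uses), and the clamp on the log-log term is our own guard for small ratios, not part of the paper's statement:

```python
import math

def stopping_rule_fires(m_t, v_t, t, t0=100, C=1.1, B=1.0):
    """Check the stopping rule of equation (8).

    m_t: running sum M_t = sum_i w_i * (h(x_i) * y_i - gamma)
    v_t: running sum V_t = sum_i w_i ** 2
    t:   number of examples read so far
    """
    if t <= t0 or m_t <= 0.0:
        return False
    # log log(V_t / M_t) is undefined for V_t / M_t < e; clamp it to zero.
    loglog = math.log(math.log(max(v_t / m_t, math.e)))
    return m_t > C * math.sqrt(v_t * (loglog + B))

# A rule whose weighted correlation is well above gamma fires quickly:
fires = stopping_rule_fires(m_t=900.0, v_t=1000.0, t=1000)     # True
# A rule with no advantage keeps M_t near zero and never fires:
no_fire = stopping_rule_fires(m_t=0.0, v_t=1000.0, t=1000)     # False
```

In the scanner, m_t and v_t would be updated incrementally for every candidate rule as each example is read, with the rule checked periodically.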
The proof of this test can be found in [3]. Note that our stopping rule depends on the cumulative variance $V_t$, which is closely related to 1/neff. If neff is large, say neff = n when a new sample is placed in memory, the stopping rule stops quickly. On the other hand, when the weights diverge, neff becomes smaller than n, and the stopping rule requires proportionally more examples before stopping.\n\nFigure 1: The Sparrow system architecture. Left: The workflow of the Scanner and the Sampler. Right: Partitioning of the examples stored on disk according to their weights.\n\nThe relationship between martingales, sequential analysis, and stopping rules has been studied in previous work [19]. Briefly, when the edge of a rule is smaller than $\gamma$, the sequence is a supermartingale. If it is larger than $\gamma$, it is a submartingale. The only assumption is that the examples are sampled i.i.d. Theorem 1 in Appendix B guarantees two things about the stopping rule defined in Equation 8: (1) if the true edge is smaller than $\gamma$, the stopping rule will never fire (with high probability); (2) if the stopping rule fires, the true edge of the rule h is larger than $\gamma$.\n\n5 System Design and Algorithms\n\nIn this section we describe Sparrow. As Sparrow consists of a number of concurrent threads and many queues, we chose to implement it in the Rust programming language for the benefits of its memory-safety and thread-safety guarantees [14]. We use a bold letter in parentheses to refer to the corresponding component in the workflow diagram in Figure 1. We also provide the pseudo-code in Appendix C.\n\nThe main procedure of Sparrow generates a sequence of weak rules $h_1, \ldots, h_k$ and combines them into a strong rule $H_k$. 
It calls two subroutines that execute in parallel: a Scanner and a Sampler.\n\nScanner The task of a scanner (the upper part of the workflow diagram in Figure 1) is to read training examples sequentially and stop when it has identified one of the rules to be a good rule. At any point in time, the scanner maintains the current strong rule $H_t$, a set of candidate weak rules W, and a target edge $\gamma_{t+1}$. For example, when training boosted decision trees, the scanner maintains the current strong rule $H_t$, which consists of a set of decision trees, a set of candidate weak rules W, which is the set of candidate splits on all features, and $\gamma_{t+1} \in (0, 0.5)$.\n\nInside the scanner, a booster (d) scans the training examples stored in main memory (c) sequentially, one at a time. It computes the weight of each example it reads using $H_t$ and then updates a running estimate of the edge of each weak rule $h \in W$ accordingly. Periodically, it feeds these running estimates into the stopping rule, and stops scanning when the stopping rule fires.\n\nThe stopping rule is designed such that if it fires, then the true edge of a particular weak rule $h_{t+1}$ is, with high probability, larger than the set threshold $\gamma_{t+1}$. The booster then adds the identified weak rule $h_{t+1}$ (f) to the current strong rule $H_t$ to create a new strong rule $H_{t+1}$ (g). The booster decides the weight of the weak rule $h_{t+1}$ in $H_{t+1}$ based on $\gamma_{t+1}$ (a lower bound on its accuracy). It could underestimate the weight. However, if the underestimate is large, the weak rule $h_{t+1}$ is likely to be "re-discovered" later, which will effectively increase its weight.\n\nLastly, the scanner falls into the Failed state if, after exhausting all examples in the current sample set, no weak rule with an advantage larger than the target threshold $\gamma_{t+1}$ is detected. When this happens, the scanner shrinks the value of $\gamma_{t+1}$ and restarts scanning. 
More precisely, it keeps track of the empirical edges $\hat{\gamma}(h)$ of all weak rules h. When the failure state happens, it resets the threshold $\gamma_{t+1}$ to just below the current maximum empirical edge over all weak rules.\n\nTo illustrate the relationship between the target threshold and the empirical edge of the detected weak rule, we compare their values in Figure 2.\n\nFigure 2: The empirical edge and the corresponding target edge of the weak rules being added to the ensemble. Sparrow adds new weak rules with a weight calculated using the value of $\gamma$ at the time of their detection, and shrinks $\gamma$ when it cannot detect a rule with an edge over $\gamma$.\n\nFigure 3: Accuracy comparison on the CoverType dataset. For uniform sampling, we trained XGBoost on a uniformly sampled dataset with the same sample fraction set in Sparrow. The accuracy is evaluated with the same number of boosting iterations.\n\nThe empirical edge $\hat{\gamma}(h_{t+1})$ of the detected weak rules is usually larger than $\gamma_{t+1}$. The weak rules are then added to the strong rule with a weight corresponding to $\gamma_{t+1}$ (the lower bound on their true edge) to avoid over-estimation. Lastly, the value of $\gamma_{t+1}$ shrinks over time when no weak rule with a larger edge exists.\n\nSampler Our assumption is that the entire training dataset does not fit into the main memory and is therefore stored in external storage (a). As boosting progresses, the weights of the examples become increasingly skewed, making the dataset in memory effectively smaller. To counteract that skew, the Sampler prepares a new training set, in which all of the examples have equal weights, by using selective sampling. 
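Sparrow's selective sampling is minimal variance sampling [13]; one standard way to realize it is systematic resampling over the example stream, where a single running counter replaces the independent coin flips of rejection sampling. The sketch below uses a simplified interface of our own (a stream of (x, y, w) tuples and a fixed sampling rate); it is not Sparrow's actual code:

```python
import random

def minimal_variance_sample(stream, rate):
    """Accept each example with probability proportional to its weight,
    advancing one running counter by rate * w per example and accepting
    on every integer crossing; accepted examples restart with weight 1."""
    sample = []
    counter = random.random()          # random phase in [0, 1)
    for x, y, w in stream:
        counter += rate * w
        while counter >= 1.0:          # one accept per integer crossing
            counter -= 1.0
            sample.append((x, y, 1.0))
    return sample

random.seed(7)                         # reproducible demo
stream = [(i, +1, 1.0) for i in range(400)]
picked = minimal_variance_sample(stream, rate=0.25)   # 100 accepted
```

With uniform weights the counter crosses an integer exactly rate * n times, so the sample size is exact rather than merely correct in expectation, which is the "less variation in the sampled set" property the text attributes to minimal variance sampling.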
When the effective sample size neff associated with the old training set becomes too small, the scanner stops using the old training set and starts using the new one2.\n\nThe sampler uses selective sampling, by which we mean that the probability of an example (x, y) being added to the sample is proportional to its weight w(x, y). Each added example is assigned an initial weight of 1. There are several known algorithms for selective sampling. The best-known one is rejection sampling, in which a biased coin is flipped for each example. We use a method known as minimal variance sampling [13] because it produces less variation in the sampled set.\n\nStratified Storage and Stratified Sampling The standard approach to sampling reads examples one at a time, calculates the weight of the example, and accepts the example into memory with probability proportional to its weight, otherwise rejecting it. Let the largest weight be wmax and the average weight be wmean; then the maximal rate at which examples are accepted is wmean/wmax. If the weights are highly skewed, this ratio can be arbitrarily small, which means that only a small fraction of the evaluated examples are accepted. As evaluation is time-consuming, this process becomes a computational bottleneck.\n\nWe propose a stratified sampling mechanism to address this issue (the right part of Figure 1). It applies incremental updates to reduce the computational cost of making predictions with a large model, and uses a stratified data organization to reduce the rejection rate.\n\nTo implement incremental updates we store for each example, whether it is on disk or in memory, the result of the latest update. Specifically, we store each training example as a tuple (x, y, Hl, wl), where x, y are the feature vector and the label, Hl is the last strong rule used to calculate the weight of the example, and wl is the weight last calculated. 
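The stratified organization can be sketched as follows: weights are binned into powers of two, a stratum is chosen in proportion to its total weight, and rejection sampling inside a stratum accepts with probability at least 1/2. The class below is a simplified, single-threaded illustration of ours; Sparrow's actual structure is concurrent and managed mostly on disk:

```python
import math
import random
from collections import defaultdict

class StratifiedSampler:
    """Stratum k holds examples with weight in [2^k, 2^(k+1)), so the
    within-stratum ratio w_mean/w_max is at least 1/2."""

    def __init__(self):
        self.strata = defaultdict(list)    # k -> list of (example, weight)
        self.totals = defaultdict(float)   # k -> total weight in stratum k

    def insert(self, example, weight):
        k = math.floor(math.log2(weight))
        self.strata[k].append((example, weight))
        self.totals[k] += weight

    def sample_one(self):
        # 1) Pick a stratum with probability proportional to its total weight.
        r = random.random() * sum(self.totals.values())
        for k, total in self.totals.items():
            r -= total
            if r <= 0:
                break
        # 2) Rejection-sample within the stratum; the acceptance probability
        #    w / 2^(k+1) is at least 1/2, so at most half the reads are wasted.
        while True:
            example, w = random.choice(self.strata[k])
            if random.random() < w / 2.0 ** (k + 1):
                return example

random.seed(0)                  # reproducible demo
s = StratifiedSampler()
s.insert("light", 1.0)          # stratum 0
s.insert("heavy", 8.0)          # stratum 3
draws = [s.sample_one() for _ in range(2000)]
# "heavy" is drawn about 8 times as often as "light" (8/9 of draws).
```

The two-level scheme preserves the overall weighted distribution: the stratum choice accounts for between-stratum weight mass, and the bounded within-stratum skew keeps the rejection loop short.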
In this way, both the scanner and the sampler only need to compute the incremental change to the model since the last time an example's weight was calculated.\n\nTo reduce the rejection rate, we want the sampler to avoid reading examples that it is likely to reject. We organize examples in a stratified structure, where stratum k contains examples whose weights are in $[2^k, 2^{k+1})$. This limits the skew of the weights within each stratum so that $w_{\mathrm{mean}}/w_{\mathrm{max}} \geq 1/2$. In addition, the sampler maintains the (estimated) total weight of the examples in each stratum. It then associates a probability with each stratum by normalizing the total weights to 1.\n\n2The sampler and scanner can run in parallel on a multi-core machine, or run on two different machines. In our experiments, we keep them on one machine.\n\nTo sample a new example, the sampler first samples the stratum to read from, then reads examples from the selected stratum until one of them is accepted. For each example, the sampler first updates its weight, then decides whether or not to accept the example, and finally writes it back to the stratum it belongs to according to its updated weight. As a result, the rejection rate is at most 1/2, which greatly improves the speed of sampling.\n\nLastly, since the stratified structure contains all of the examples, it is managed mostly on disk, with a small in-memory buffer to speed up I/O operations.\n\n6 Experiments\n\nIn this section we describe the experimental results of Sparrow. In all experiments, we use trees as weak rules. First, we use the forest cover type dataset [10] to evaluate the sampling effectiveness. It contains 581 K samples. We performed an 80/20 random split for training and testing. In addition, we use two large datasets to evaluate the overall performance of Sparrow. The first large dataset is the splice site dataset for detecting human acceptor splice sites [18, 1]. 
We use the same training dataset of 50 M samples as in prior work [18, 1], and validate the model on a testing set of 4.6 M samples. The training dataset takes over 39 GB on disk. The second large dataset is the bathymetry dataset for detecting human mislabeling in bathymetry data [11]. We use a training dataset of 623 M samples, and validate the model on a testing dataset of 83 M samples. The training dataset takes 100 GB on disk. Both learning tasks are binary classification.\n\nThe experiments on large datasets are all conducted on EC2 instances with attached SSD storage from Amazon Web Services. We ran the evaluations on five different instance types with increasing memory capacities, ranging from 8 GB to 244 GB (for details see Appendix A).\n\n6.1 Effectiveness of Weighted Sampling\n\nWe evaluate the effectiveness of weighted sampling by comparing it to uniform sampling. The comparison is over the model accuracy on the testing data when both are trained for 500 boosting iterations on the cover type dataset. For both methods, we generate trees of depth 5 as weak rules. For uniform sampling, we first randomly sample from the training data at each sampling ratio, and use XGBoost to train the models. We evaluated the model performance at sampling ratios ranging from 0.1 to 0.5, and repeated each evaluation 10 times. The results are shown in Figure 3. We can see that the accuracy of Sparrow is higher with the same number of boosting iterations and the same sampling ratio. In addition, the variance of the model accuracy is also smaller. This demonstrates that the weighted sampling method used in Sparrow is more effective and more stable than uniform sampling.\n\n6.2 Training on Large Datasets\n\nWe compare Sparrow on the two large datasets, using XGBoost and LightGBM as baselines since they out-perform other boosting implementations [5, 12]. 
The comparison was done in terms of the reduction in the exponential loss, which is what boosting minimizes directly, and in terms of AUROC, which is often more relevant in practice. We include the data loading time in the reported training time.\n\nThere are two popular tree-growth strategies: depth-wise and leaf-wise [17]. Both Sparrow and LightGBM grow trees leaf-wise. XGBoost uses the depth-wise method by default. In all experiments, we grow trees with at most 4 leaves, i.e. of depth two. We chose to train smaller trees in these experiments because training otherwise takes a very long time.\n\nFor XGBoost, we chose the approximate greedy algorithm, which is its fastest training method. LightGBM supports using sampling in training, which it calls Gradient-based One-Side Sampling (GOSS). GOSS keeps a fixed percentage of the examples with large gradients, and randomly samples from the remaining examples. We selected GOSS as the tree construction algorithm for LightGBM. In addition, we also enabled the option in LightGBM that reduces its memory footprint.\n\nFigure 4: Time-AUROC curve on the splice site detection dataset, higher is better, clipped on the right and bottom. The (S) suffix is for training on 30.5 GB memory, and the (L) suffix is for training on 61 GB memory.\n\nFigure 5: Time-AUROC curve on the bathymetry dataset, higher is better, clipped on the right and bottom. The (S) suffix is for training on 61 GB memory, and the (L) suffix is for training on 244 GB memory.\n\nThe memory requirement of Sparrow is determined by the sample size, which is a configurable parameter. XGBoost supports external-memory training when the memory is too small to fit the training dataset. The in-memory version of XGBoost is used for training whenever possible. If it runs out of memory, we train the model using the external-memory version of XGBoost instead. Unlike XGBoost, LightGBM does not support external-memory execution. 
Lastly, all algorithms in this comparison optimize the exponential loss as defined in AdaBoost.
Due to the space limit, we defer the detailed summary of the experiment results to Table 1 and Table 2 in Appendix A. We evaluated each algorithm in terms of AUROC on the testing dataset as a function of training time. The results are given in Figure 4 and Figure 5.
On the splice site dataset, Sparrow is able to run on instances with as little as 8 GB of memory. The external-memory version of XGBoost can execute with a reasonable amount of memory (though still no smaller than 15 GB) but takes about 3x longer to train. However, we also noticed that Sparrow does not have an advantage over the other two boosting implementations when the memory is large enough to load the entire training dataset.
On the bathymetry dataset, Sparrow consistently outperforms XGBoost and LightGBM, even when the memory size is larger than the dataset size. In extreme cases, Sparrow takes 10x-20x less training time and achieves better accuracy. In addition, both LightGBM and the in-memory version of XGBoost crash when trained with less than 244 GB of memory.
We observed that properly initializing the value of the target advantage and setting a reasonable sample set size can have a great impact on the performance of Sparrow. If the stopping rule frequently fails to fire, it can introduce significant overhead to the training process. Specific to boosted trees, one heuristic we find useful is to initialize it to the maximum advantage of the tree nodes in the previous tree. A more systematic approach for deciding this value and the sample set size is left as future work.

7 Conclusion and Future Work

In this paper, we have proposed a boosting algorithm that combines three techniques: effective number of examples, weighted sampling, and early stopping.
Our preliminary results show that these techniques can dramatically speed up boosting on large real-world datasets, especially when the data size exceeds the memory capacity. For future work, we are developing a parallelized version of Sparrow that uses a novel asynchronous communication protocol: it uses the stopping rule to decide model updates, which relaxes the need for frequent communication between workers, especially when training on large datasets; we believe this is a better parallel learning paradigm.

Acknowledgements
We are grateful to David Sandwell and Brook Tozer for providing the bathymetry dataset. This work was supported by the NIH (grant U19 NS107466).

References
[1] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15:1111-1133, 2014.
[2] Ron Appel, Thomas Fuchs, Piotr Dollár, and Pietro Perona. Quickly boosting decision trees: pruning underachieving features early. In International Conference on Machine Learning, pages 594-602, 2013.
[3] Akshay Balsubramani. Sharp Finite-Time Iterated-Logarithm Martingale Concentration. arXiv:1405.2639 [cs, math, stat], May 2014.
[4] Joseph K. Bradley and Robert E. Schapire. FilterBoost: Regression and Classification on Large Datasets. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, pages 185-192, USA, 2007. Curran Associates Inc.
[5] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785-794, New York, NY, USA, 2016. ACM.
[6] Carlos Domingo and Osamu Watanabe. Scaling Up a Boosting-Based Learner via Adaptive Sampling. In Knowledge Discovery and Data Mining.
Current Issues and New Applications, Lecture Notes in Computer Science, pages 317-328. Springer, Berlin, Heidelberg, April 2000.
[7] Yoav Freund and Robert E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.
[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337-407, April 2000.
[9] Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5):1189-1232, 2001.
[10] João Gama, Ricardo Rocha, and Pedro Medas. Accurate decision trees for mining high-speed data streams. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 523-528. ACM, 2003.
[11] Japan Agency for Marine-Earth Science and Technology (JAMSTEC). Data and sample research system for whole cruise information in JAMSTEC (DARWIN), 2016.
[12] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3146-3154. Curran Associates, Inc., 2017.
[13] Genshiro Kitagawa. Monte Carlo Filter and Smoother for Non-Gaussian Nonlinear State Space Models. Journal of Computational and Graphical Statistics, 5(1):1-25, 1996.
[14] Steve Klabnik and Carol Nichols. The Rust Programming Language. No Starch Press, 2018.
[15] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting Algorithms As Gradient Descent.
In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, pages 512-518, Cambridge, MA, USA, 1999. MIT Press.
[16] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[17] Haijian Shi. Best-first decision tree learning. PhD thesis, The University of Waikato, 2007.
[18] Soeren Sonnenburg and Vojtěch Franc. COFFIN: A Computational Framework for Linear SVMs. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 999-1006, USA, 2010. Omnipress.
[19] Abraham Wald. Sequential Analysis. Courier Corporation, 1973.