{"title": "Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale", "book": "Advances in Neural Information Processing Systems", "page_first": 3817, "page_last": 3825, "abstract": "Deep distributed decision trees and tree ensembles have grown in importance due to the need to model increasingly large datasets. However, PLANET, the standard distributed tree learning algorithm implemented in systems such as XGBoost and Spark MLlib, scales poorly as data dimensionality and tree depths grow. We present Yggdrasil, a new distributed tree learning method that outperforms existing methods by up to 24x. Unlike PLANET, Yggdrasil is based on vertical partitioning of the data (i.e., partitioning by feature), along with a set of optimized data structures to reduce the CPU and communication costs of training. Yggdrasil (1) trains directly on compressed data for compressible features and labels; (2) introduces efficient data structures for training on uncompressed data; and (3) minimizes communication between nodes by using sparse bitvectors. Moreover, while PLANET approximates split points through feature binning, Yggdrasil does not require binning, and we analytically characterize the impact of this approximation. We evaluate Yggdrasil against the MNIST 8M dataset and a high-dimensional dataset at Yahoo; for both, Yggdrasil is faster by up to an order of magnitude.", "full_text": "Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale\n\nFiras Abuzaid1, Joseph Bradley2, Feynman Liang3, Andrew Feng4, Lee Yang4, Matei Zaharia1, Ameet Talwalkar5\n\n1MIT CSAIL, 2Databricks, 3University of Cambridge, 4Yahoo, 5UCLA\n\nAbstract\n\nDeep distributed decision trees and tree ensembles have grown in importance due\nto the need to model increasingly large datasets. However, PLANET, the standard\ndistributed tree learning algorithm implemented in systems such as XGBOOST\nand Spark MLLIB, scales poorly as data dimensionality and tree depths grow. 
We\npresent YGGDRASIL, a new distributed tree learning method that outperforms\nexisting methods by up to 24\u00d7. Unlike PLANET, YGGDRASIL is based on ver-\ntical partitioning of the data (i.e., partitioning by feature), along with a set of\noptimized data structures to reduce the CPU and communication costs of train-\ning. YGGDRASIL (1) trains directly on compressed data for compressible features\nand labels; (2) introduces ef\ufb01cient data structures for training on uncompressed\ndata; and (3) minimizes communication between nodes by using sparse bitvectors.\nMoreover, while PLANET approximates split points through feature binning, YG-\nGDRASIL does not require binning, and we analytically characterize the impact of\nthis approximation. We evaluate YGGDRASIL against the MNIST 8M dataset and\na high-dimensional dataset at Yahoo; for both, YGGDRASIL is faster by up to an\norder of magnitude.\n\n1\n\nIntroduction\n\nDecision tree-based methods, such as random forests and gradient-boosted trees, have a rich and\nsuccessful history in the machine learning literature. They remain some of the most widely-used\nmodels for both regression and classi\ufb01cation tasks, and have proven to be practically advantageous\nfor several reasons: they are arbitrarily expressive, can naturally handle categorical features, and are\nrobust to a wide range of hyperparameter settings [4].\nAs datasets have grown in scale, there is an increasing need for distributed algorithms to train decision\ntrees. Google\u2019s PLANET framework [12] has been the de facto approach for distributed tree learning,\nwith several popular open source implementations, including Apache Mahout, Spark MLLIB, and\nXGBOOST [1, 11, 7]. 
PLANET partitions the training instances across machines and parallelizes the\ncomputation of split points and stopping criteria over them, thus effectively leveraging a large cluster.\nWhile PLANET works well for shallow trees and small numbers of features, it has high communication\ncosts when tree depths and data dimensionality grow. PLANET\u2019s communication cost is linear in the\nnumber of features p, and is linear in $2^D$, where D is the tree depth. As demonstrated by several\nstudies [13, 3, 8], datasets have become increasingly high-dimensional (large p) and complex, often\nrequiring high-capacity models (e.g., deep trees with large D) to achieve good predictive accuracy.\nWe present YGGDRASIL, a new distributed tree learning system that scales well to high-dimensional\ndata and deep trees. Unlike PLANET, YGGDRASIL is based on vertical partitioning of the data [5]: it\nassigns a subset of the features to each worker machine, and asks it to compute an optimal split for\neach of its features. These candidate splits are then sent to a master, which selects the best one. On\ntop of the basic idea of vertical partitioning, YGGDRASIL introduces three novel optimizations:\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f\u2022 Training on compressed data without decompression: YGGDRASIL compresses features via\nrun-length encoding and encodes labels using dictionary compression. We design a novel split-\n\ufb01nding scheme that trains directly on compressed data for compressible features, which reduces\nruntime by up to 20%.\n\u2022 Ef\ufb01cient training on uncompressed data: YGGDRASIL\u2019s data structures let each worker implic-\nitly store the split history of the tree without introducing any memory overheads. 
Each worker\nrequires only a sequential scan over the data to perform greedy split-\ufb01nding across all leaf nodes\nin the tree, and only one set of suf\ufb01cient statistics is kept in memory at a time.\n\u2022 Minimal communication between nodes: YGGDRASIL uses sparse bit vectors to reduce inter-\nmachine communication costs during training.\n\nTogether, these optimizations yield an algorithm that is asymptotically less expensive than PLANET\non high-dimensional data and deep trees: YGGDRASIL\u2019s communication cost is $O(2^D + Dn)$, in\ncontrast to $O(2^D p)$ for PLANET-based methods, and its data structure optimizations yield up to 2\u00d7\nsavings in memory and 40% savings in time over a naive implementation of vertical partitioning.\nThese optimizations enable YGGDRASIL to scale up to thousands of features and tree depths up to 20.\nOn tree depths greater than 10, YGGDRASIL outperforms MLLIB and XGBOOST by up to 6\u00d7 on the\nMNIST 8M dataset, and up to 24\u00d7 on a dataset with 2 million training examples and 3500 features\nmodeled after the production workload at Yahoo.\nNotation We de\ufb01ne n and p as the number of instances and features in the training set, D as the\nmaximum depth of the tree, B as the number of histogram buckets to use in PLANET, k as the number\nof workers in the cluster, and $W_j$ as the jth worker.\n\n2 PLANET: Horizontal Partitioning\n\nWe now describe the standard algorithm for training a decision tree in a distributed fashion via\nhorizontal partitioning, inspired by PLANET [12]. We assume that B potential thresholds for each\nof the p features are considered; thus, we de\ufb01ne S as the set of cardinality pB containing all split\ncandidates. For a given node i, de\ufb01ne the set I as the instances belonging to this node. 
We can then\nexpress the splitting criterion Split(\u00b7) for node i as $\\mathrm{Split}(i) = \\arg\\max_{s \\in S} f\\big(\\sum_{x \\in I} g(x, s)\\big)$ for\nfunctions f and g where $f : \\mathbb{R}^c \\to \\mathbb{R}$, $g : \\mathbb{R}^p \\times \\mathbb{N} \\to \\mathbb{R}^c$, and $c \\in O(1)$. Intuitively, for each split\ncandidate s \u2208 S, g(x, s) computes the c suf\ufb01cient statistics for each point x; f (\u00b7) aggregates these\nsuf\ufb01cient statistics to compute the node purity for candidate s; and Split(i) returns the split candidate\nthat maximizes the node purity. Hence, if worker j contains instances in the set J , we have:\n\n$$\\mathrm{Split}(i) = \\arg\\max_{s \\in S} f\\Big(\\sum_{j=1}^{k} g_j(s)\\Big) \\quad \\text{where} \\quad g_j(s) = \\sum_{x \\in I \\cap J} g(x, s) \\qquad (1)$$\n\nThis observation suggests a natural distributed algorithm. We assume a star-shaped distributed\narchitecture with a master node and k worker nodes. Data are distributed using horizontal partitioning;\ni.e., each worker node stores a subset of the n instances. For simplicity, we assume that we train our\ntree to a \ufb01xed depth D. On the tth iteration of the algorithm, we compute the optimal splits for all\nnodes on the tth level of the tree via a single round trip of communication between the master and the\nworkers. Each tree node i is split as follows:\n1. The jth worker locally computes suf\ufb01cient statistics $g_j(s)$ from Equation 1 for all s \u2208 S.\n2. Each worker communicates all statistics $g_j(s)$ to the master (Bp in total).\n3. The master computes the best split s\u2217 = Split(i) from Equation 1.\n4. The master broadcasts s\u2217 to the workers, who update their local states to keep track of which\ninstances are assigned to which child nodes.\n\nOverall, the computation is linear in n, p, and D, and is trivially parallelizable. Yet the algorithm\nis communication-intensive. For each tree node, step 2 above requires communicating kBp tuples\nof size c. 
With $2^D$ total nodes, the total communication is $2^D kpBc$ \ufb02oating point values, which is\nexponential in tree depth D, and linear in B, the number of thresholds considered for each feature.\nMoreover, using B < n thresholds results in an approximation of the tree trained on a single machine,\nand can result in adverse statistical consequences, as noted empirically by [7]. We present a theoretical\nanalysis of the impact of this approximation in Section 4.\n\n3 YGGDRASIL: Vertical Partitioning\n\nWe propose an alternative algorithm to address the aforementioned shortcomings. Rather than\npartition the data by instance, we partition by feature: each of the k worker machines stores all feature\nvalues for $\\lceil p/k \\rceil$ of the features, as well as the labels for all instances. This organizational strategy has\ntwo crucial bene\ufb01ts: (1) each worker can locally compute the node purity for a subset of the split\ncandidates, which signi\ufb01cantly reduces the communication bottleneck; and (2) we can ef\ufb01ciently\nconsider all possible B = n \u2212 1 splits.\nWe can derive an expression for Split(i) with vertical partitioning analogous to Equation 1. Rede\ufb01n-\ning J to be the set of features stored on the jth worker, we have\n\n$$\\mathrm{Split}(i) = \\arg\\max_{j} f_j \\quad \\text{where} \\quad f_j = \\arg\\max_{s \\in J} f\\Big(\\sum_{x \\in I} g(x, s)\\Big) \\qquad (2)$$\n\nIntuitively, each worker identi\ufb01es its top split candidate among its set of features, and the master then\nchooses the best split candidate among these top k candidates.\nAs with horizontal partitioning, computation is linear in n, p, and D, and is easily parallelizable.\nHowever, the communication pro\ufb01le is quite different, with two major sources of communication.\nFor each node, each worker communicates one tuple of size c, resulting in $2^D kc$ communication\nfor all nodes. 
When training each level of the tree, n bits are communicated to indicate the split\ndirection (left/right) for each training point. Hence, the overall communication is $O(2^D k + Dnk)$.\nIn contrast to the $O(2^D kpB)$ communication cost of horizontal partitioning, vertical partitioning has\nno dependence on p, and, for large n, the $O(Dnk)$ term will likely be the bottleneck.\n\n(a) Regimes of (n, p) where each partitioning\nstrategy dominates for D = 15, k = 16, B =\n32.\n\n(b) Regimes of (n, D) where each partitioning\nstrategy dominates for p = 2500, k = 16, B =\n32.\n\nFigure 1: Communication cost tradeoffs between vertical and horizontal partitioning\n\nThus, there exists a set of tradeoffs between horizontal and vertical partitioning across different\nregimes of n, p, and D, as illustrated in Figure 1. The overall trend is clear: for large p and D, vertical\npartitioning can drastically reduce communication.\n\n3.1 Algorithm\n\nThe YGGDRASIL algorithm works as follows: at iteration t, we compute the optimal splits for all\nnodes on the tth level of the tree via two round trips of communication between the master and the\nworkers. Like PLANET, all splits for a single depth t are computed at once. For each node i at depth\nt, the following steps are performed:\nComputeBestSplit(i):\n\u2022 The jth worker locally computes $f_j$ from Equation 2 and sends this to the master.\n\u2022 The master selects s\u2217 = Split(i). 
Let $f_j^*$ denote the optimal feature selected for s\u2217, and let $W_j^*$ be\nthe worker containing this optimal feature: $f_j^* \\in W_j^*$.\nbitVector = CollectBitVector($W_j^*$):\n\u2022 The master requests a bitvector from $W_j^*$ in order to determine which child node (either left or\nright) each training point x \u2208 I should be assigned to.\nBroadcastSplitInfo(bitVector):\n\u2022 The master then broadcasts the bitvector to all k workers. Each worker then updates its internal\nstate to prepare for the next iteration of training.\n\n[Figure 1 plots: x-axis Num. Instances (log scale); y-axis Num. Features in panel (a) and Tree Depth in panel (b); regions marked Horizontal Partitioning Better and Vertical Partitioning Better.]\n\n3.2 Optimizations\n\nAs we previously showed, vertical partitioning leads to asymptotically lower communication costs as\np and D increase. However, this asymptotic behavior does not necessarily translate to more ef\ufb01cient\ntree learning; on the contrary, a naive implementation may easily lead to high CPU and memory\noverheads, communication overhead, and poor utilization of the CPU cache. In YGGDRASIL, we\nintroduce three novel optimizations for vertically partitioned tree learning that signi\ufb01cantly improve\nits scalability, memory usage and performance.\n\n3.2.1 Sparse Bitvectors for Reduced Communication Overhead\nOnce the master has found the optimal split s\u2217 for each leaf node i in the tree, each worker must\nthen update its local features to re\ufb02ect that the instances have been divided into new child nodes.\nTo accomplish this while minimizing communication, the workers and master communicate using\nbitvectors. Speci\ufb01cally, after \ufb01nding the optimal split, the master requests from worker $W_j^*$ a\ncorresponding bitvector for s\u2217; this bitvector encodes the partitioning of instances between the two\nchildren of i. 
Once the master has collected all optimal splits for all leaf nodes, it broadcasts the\nbitvectors out to all workers. This means that (assuming a fully balanced tree), for every depth t\nduring training, $2^t$ bitvectors \u2013 for a total of n bits \u2013 are sent from the k workers.\nAdditionally, the n bits are encoded in a sparse format [6], which offers much better compression\nvia packed arrays than a naive bitvector. This sparse encoding is particularly useful for imbalanced\ntrees: rather than allocate memory to encode a potential split for all nodes at depth t, we only allocate\nmemory for the nodes in which an optimal split was found. By taking advantage of sparsity, we can\nsend the n bits between the master and the workers at only a fraction of the cost.\n\n3.2.2 Training on Compressed Data without Decompression\n\nIn addition to its more favorable communication cost for large p and D, YGGDRASIL\u2019s vertical\npartitioning strategy presents a unique optimization opportunity: the ability to ef\ufb01ciently compress\ndata by feature. Furthermore, because the feature values must be in sorted order to perform greedy\nsplit-\ufb01nding, we can use this to our advantage to perform lossless compression without sacri\ufb01cing\nrecoverability. This leads to a clear optimization: feature compression via run-length encoding (RLE),\nan idea that has been explored extensively in column-store databases [10, 14]. In addition to the\nobvious in-memory savings, this technique also impacts the runtime performance of split-\ufb01nding,\nsince the vast majority of feature values are now able to reside in the L3 cache. To the best of our\nknowledge, YGGDRASIL is the \ufb01rst system to apply this optimization to decision tree learning.\nMany features compress well using RLE: sparse features, continuous features with few distinct\nvalues, and categorical features with low arity. 
However, to train directly on compressed data without\ndecompressing, we must maintain the feature in sorted order throughout the duration of training,\na prerequisite for RLE. Therefore, to compute all splits for a given depth t, we introduce a data\nstructure to record the most recent splits at depth t \u2212 1. Speci\ufb01cally, we create a mapping between\neach feature value and the node i at depth t that it is currently assigned to.\nAt the end of an iteration of training, each worker updates this data structure by applying the bitvector\nit receives from the master, which requires a single sequential scan over the data. All random accesses\nare con\ufb01ned to the labels, which we also encode (when feasible) using dictionary compression. This\ngives us much better cache density during split-\ufb01nding: all random accesses no longer touch DRAM\nand instead read from the last-level cache.\nTo minimize the number of additional passes, we compute the optimal split across all leaf nodes as\nwe iterate over a given feature. This means that each feature requires only two sequential scans over\nthe data for each iteration of training: one to update the value-node mapping, and one to compute\nthe entire set of optimal splits for iteration t + 1. However, as a tradeoff, we must maintain the\n\n4\n\n\fsuf\ufb01cient statistics for all splits in memory as we scan over the feature. For categorical features\n(especially those with high arity), this cost in memory overhead proves to be too exorbitant, and the\nruntime performance suffers despite obtaining excellent compression. 
For sparse continuous features,\nhowever, the improvements are signi\ufb01cant: on MNIST 8M, we achieve 2\u00d7 compression (including\nthe auxiliary data structure) and obtain a 20% reduction in runtime.\n\n3.2.3 Ef\ufb01cient Training on Uncompressed Data\n\nFor features that aren\u2019t highly compressible, YGGDRASIL uses a different scheme that, in contrast,\ndoes not use any auxiliary data structures to keep track of the split history. Since features no longer\nneed to stay sorted in perpetuity, YGGDRASIL implicitly encodes the split partitions by recursively\ndividing its features into sub-arrays \u2013 each feature value is assigned to a sub-array based on the\nbit assigned to it and its previous sub-array assignment. Because the feature is initially sorted,\na sequential scan over the sub-arrays maintains the sorted-order invariant, and we construct the\nsub-arrays for the next iteration of training in O(n) time, requiring only a single pass over the feature.\nBy using this implicit representation of the split history, we\u2019re left only with the feature values and\nlabel indices stored in memory. Therefore, the memory load does not increase during training for\nuncompressed features \u2013 it remains constant at 2\u00d7.\nThis scheme yields another additional bene\ufb01t: when computing the next iteration of splits for depth\nt + 1, YGGDRASIL only maintains the suf\ufb01cient statistics for one node at a time, rather than for\nall leaf nodes. Furthermore, YGGDRASIL still only requires a single sequential scan through the\nentire feature to compute all splits. This means that, as was the case for compressed features, every\niteration of training requires only two sequential scans over each feature, and all random accesses are\nagain con\ufb01ned to the dictionary-compressed labels. 
Finally, for imbalanced trees, we can skip entire\nsub-arrays that no longer need to be split, which saves additional time as trees grow deeper.\n\nFigure 2: Overview of one iteration of uncompressed training in YGGDRASIL. Left side: Root node\n$i_0$ is split into nodes $i_1$ and $i_2$; the split is encoded by a bitvector. Right side: Prior to training, the\nfeature $c_i$ is sorted to optimize split-\ufb01nding. Once a split has been found, $c_i$ is re-sorted into two\nsub-arrays: the 1st, 4th, and last values (the \u201con\u201d bits) are sorted into $i_1$\u2019s sub-array, and the \u201coff\u201d bits\nare sorted into $i_2$\u2019s sub-array. Each sub-array is in sorted order for the next iteration of training.\n\n4 Discretization Error Analysis\n\nHorizontal partitioning requires each worker to communicate the impurity on its subset of data for\nall candidate splits associated with each of the p features, and to do so for all $2^D$ tree nodes. For\ncontinuous features where each training instance could have a distinct value, up to n candidate splits\nare possible so the communication cost is $O(2^D kpn)$. To improve ef\ufb01ciency, continuous features\nare instead commonly discretized to B discrete bins such that only B rather than n \u2212 1 candidate\nsplits are considered at each tree node [9]. In contrast, discretization of continuous features is not\nrequired in YGGDRASIL, since all n values for a particular feature are stored on a single worker (due\nto vertical partitioning). Hence, the impurity for the best split rather than all splits is communicated.\nThis discretization heuristic results in the approximation of continuous values by discrete repre-\nsentatives, and can adversely impact the statistical performance of the resulting decision tree, as\ndemonstrated empirically by [7]. 
Prior work has shown that the number of bins can be chosen such\nthat the decrease in information gain at any internal tree node between the continuous and discretized\nfeature can be made arbitrarily small [2]. However, their proof does not quantify the effects of\ndiscretization on a decision tree\u2019s performance.\n\n[Figure 2 diagram labels: bitVector = 100101; steps 1) original feature, 2) sorted feature before training (sort by value), 3) after 1st iteration of training (split found, sort by bitvector); regions Entire Cluster and Single Worker.]\n\nTo understand the impact of discretization on accuracy, we analyze the simpli\ufb01ed setting of training\na decision stump classi\ufb01er on a single continuous feature x \u2208 [0, 1]. Suppose the feature data is\ndrawn i.i.d. from a uniform distribution, i.e., $x^{(i)} \\overset{\\text{iid}}{\\sim} U[0, 1]$ for i = 1, 2, . . . , n, and that labels are\ngenerated according to some threshold $t_{\\text{truth}} \\sim U[0, 1]$, i.e., $y^{(i)} = \\mathrm{sgn}(x^{(i)} - t_{\\text{truth}})$. The decision\nstump training criterion seeks to choose a splitting threshold at one of the training instances $x^{(i)}$ in\norder to minimize $\\hat{t}_n = \\arg\\max_{t \\in \\{x^{(i)}\\}} f(t)$, where f (t) is some purity measure. In our analysis,\nwe will de\ufb01ne the purity measure to be information gain. In our simpli\ufb01ed setting, we show that there\nis a natural relationship between the misclassi\ufb01cation probability $P_{\\text{err}}(t)$, and the approximation\nerror of our decision stump, i.e., $|t - t_{\\text{truth}}|$. 
All proofs are deferred to the appendix.\nObservation 1. For an undiscretized decision stump, as $n \\to \\infty$, $P_{\\text{err}}(\\hat{t}_n) \\xrightarrow{\\text{a.s.}} 0$.\nObservation 2. Maximizing information gain is equivalent to minimizing absolute distance, i.e.,\n\n$$\\hat{t}_n = \\arg\\max_{t \\in \\{x^{(i)}\\}_{i=1}^{n}} f(t) = \\arg\\min_{t \\in \\{x^{(i)}\\}_{i=1}^{n}} |t - t_{\\text{truth}}|.$$\n\nMoreover, $P_{\\text{err}}(t) = |t - t_{\\text{truth}}|$.\nWe now present our main result. 
This intuitive result shows that increasing the number of discretization\nbins B leads to a reduction in the expected probability of error.\nTheorem 1. Let $\\hat{t}_{N,B}$ denote the threshold learned by a decision stump on n training instances\ndiscretized to B + 1 levels. Then $E\\big[P_{\\text{err}}(\\hat{t}_{N,B})\\big] \\xrightarrow{\\text{a.s.}} \\frac{1}{4B}$.\n\n5 Evaluation\n\nWe developed YGGDRASIL on top of Spark 1.6.0 with an API compatible with MLLIB. Our\nimplementation is 1385 lines of code, excluding comments and whitespace. Our implementation\nis open-source and publicly available.1 Our experimental results show that, for large p and D,\nYGGDRASIL outperforms PLANET by an order of magnitude, corroborating our analysis in Section 3.\n\n5.1 Experimental Setup\n\nWe benchmarked YGGDRASIL against two implementations of PLANET: Spark MLLIB v1.6.0, and\nXGBOOST4J-SPARK v0.47. These two implementations are slightly different from the algorithm\nfrom Panda et al. In particular, the original PLANET algorithm has separate subroutines for distributed\nvs. \u201clocal\u201d training. By default, PLANET executes the horizontally partitioned algorithm in Section 2\nusing on-disk data; however, if the instances assigned to a given tree node \ufb01t in-memory on a single\nworker, then PLANET moves all the data for that node to one worker and switches to in-memory\ntraining on that worker. In contrast, MLLIB loads all the data into distributed memory across the\ncluster at the beginning and executes all training passes in memory. XGBOOST extends PLANET\nwith several additional optimizations; see [7] for details.\nWe ran all experiments on 16 Amazon EC2 r3.2xlarge machines. Each machine has an Intel Xeon\nE5-2670 v2 CPU, 61 GB of memory, and 1 Gigabit Ethernet connectivity. Prior to our experiments,\nwe tuned Spark\u2019s memory con\ufb01guration (heap memory used for storage, number of partitions, etc.)\nfor optimal performance. 
All results are averaged over \ufb01ve trials.\n\n5.2 Large-scale experiments\n\nTo examine the performance of YGGDRASIL and PLANET, we trained a decision tree on two large-\nscale datasets: the MNIST 8 million dataset, and another modeled after a private Yahoo dataset that\nis used for search ranking. Table 1 summarizes the parameters of these datasets.\n\n1Yggdrasil has been published as a Spark package at the following URL: https://spark-packages.org/package/fabuzaid21/yggdrasil\n\nFigure 3: Training time vs. tree depth for MNIST 8M and Yahoo 2M.\n\nTable 1: Parameters of the datasets for our experiments\n\nDataset | # instances | # features | Size | Task\nMNIST 8M | $8.1 \\times 10^6$ | 784 | 18.2 GiB | classi\ufb01cation\nYahoo 2M | $2 \\times 10^6$ | 3500 | 52.2 GiB | regression\n\nFigure 3 shows the training time across various tree depths for MNIST 8M and Yahoo 2M. For both\ndatasets, we carefully tuned XGBOOST to run on the maximum number of threads and the optimal\nnumber of partitions. Despite this, XGBOOST was unable to train trees deeper than D = 13 without\ncrashing due to OutOfMemory exceptions. While Spark MLLIB\u2019s implementation of PLANET is\nmarginally faster for shallow trees, its runtime increases exponentially as D increases. YGGDRASIL,\non the other hand, scales well up to D = 20, for which it runs up to 6\u00d7 faster. For the Yahoo dataset,\nYGGDRASIL\u2019s speed-up is even greater because of the higher number of features p \u2013 recall that the\ncommunication cost for PLANET is proportional to $2^D$ and p. Thus, for D = 18, YGGDRASIL is up\nto 24\u00d7 faster than Spark MLLIB.\n5.3 Study of Individual Optimizations\n\nTo understand the impacts of the optimizations in Section 3.2, we measure each optimization\u2019s effect\non YGGDRASIL runtime. To fully evaluate our optimizations \u2013 including feature compression \u2013 we\nchose MNIST 8M, whose features are all sparse, for this study. 
The results are in Figure 6: we see\nthat the total improvement from the naive baseline to the fully optimized algorithm is a 40% reduction\nin runtime. Using sparse bitvectors reduces the communication overhead between the master and the\nworkers, giving a modest speedup. Encoding the labels and compressing the features via run-length\nencoding each yield 20% improvements. As discussed, these speedups are due to improved cache\nutilization: encoding the labels via dictionary compression reduces their size in memory by 8\u00d7; as a\nresult, the labels entirely \ufb01t in the last-level cache. The feature values also \ufb01t in cache after applying\nRLE, and we gain 2\u00d7 in memory overhead once we factor in needed auxiliary data structures.\n5.4 Scalability experiments\n\nTo further demonstrate the scalability of YGGDRASIL vs. PLANET for high-dimensional datasets,\nwe measured the training time on a series of synthetic datasets parameterized by p. For each dataset,\napproximately $p/2$ features were categorical, while the remaining features were continuous. From\nFigure 4, we see that YGGDRASIL scales much more effectively as p increases, especially for larger\nD. In particular, for D = 15, YGGDRASIL is initially 3\u00d7 faster than PLANET for p = 500, but is\nmore than 8\u00d7 faster for p = 4000. This con\ufb01rms our asymptotic analysis in Section 3.\n6 Related Work\n\nVertical Partitioning. Several authors have proposed partitioning data by feature for training\ndecision trees; to our knowledge, none of these systems perform the communication and data\nstructure optimizations in YGGDRASIL, and none report results at the same scale. 
Svore and Burges\n[15] treat data as vertical columns, but place a full copy of the dataset on every worker node, an\napproach that is not scalable for large datasets. Caragea et al. [5] analyze the costs of horizontal\nand vertical partitioning but do not include an implementation. Ye et al. [16] implement vertical\npartitioning using MapReduce or MPI and benchmark data sizes up to 1.2 million rows and 520\nfeatures. None of these systems compress columnar data on each node, communicate using sparse\nbitvectors, or optimize for cache locality as YGGDRASIL does (Section 3.2). These optimizations\nyield signi\ufb01cant speedups over a basic implementation of vertical partitioning (Section 5.3).\n\n[Figure 3 plots: panels Yggdrasil vs. PLANET and XGBoost on MNIST 8M and on Yahoo 2M; x-axis Tree Depth; y-axis Training Time (s); series MLlib, Yggdrasil, XGBoost.]\n\nFigure 4: Training time vs. number of features for $n = 2 \\times 10^6$, k = 16, B = 32. Because the communication cost of PLANET scales linearly with p, the total runtime increases at a much faster rate.\n\nFigure 5: Number of bytes sent vs. tree depth for $n = 2 \\times 10^6$, k = 16, B = 32. For YGGDRASIL, the communication cost is the same for all p; each worker sends its best local feature to the master.\n\nDistributed Tree Learning. The most widely used\ndistributed tree learning method is PLANET ([12]),\nwhich is also implemented in open-source libraries\nsuch as Apache Mahout ([1]) and MLLIB ([11]). As\nshown in Figure 1, PLANET works well for shallow\ntrees and small numbers of features, but its cost grows\nquickly with tree depth and is proportional to the num-\nber of features and the number of bins used for dis-\ncretization. This makes it suboptimal for some large-\nscale tree learning problems.\nXGBOOST ([7]) uses a partitioning scheme similar\nto PLANET, but uses a compressed, sorted columnar\nformat inside each \u201cblock\u201d of data. Its communica-\ntion cost is therefore similar to PLANET, but its mem-\nory consumption is smaller. XGBOOST is optimized\nfor gradient-boosted trees, in which case each tree\nis relatively shallow. 
It does not perform as well as\nYGGDRASIL on deeper trees, such as those needed\nfor random forests, as shown in our evaluation. XG-\nBOOST also lacks some of the processing optimizations in YGGDRASIL, such as label encoding to\nmaximize cache density and training directly on run-length encoded features without decompressing.\n\nFigure 6: YGGDRASIL runtime improve-\nments from speci\ufb01c optimizations, on\nMNIST 8M at D = 10.\n\n7 Conclusion\n\nDecision trees and tree ensembles are an important class of models, but previous distributed training\nalgorithms were optimized for small numbers of features and shallow trees. We have presented\nYGGDRASIL, a new distributed tree learning system optimized for deep trees and thousands of\nfeatures. Through vertical partitioning of the data and a set of data structure and algorithmic optimiza-\ntions, YGGDRASIL outperforms existing tree learning systems by up to 24\u00d7, while simultaneously\neliminating the need to approximate data through binning. YGGDRASIL is easily implementable on\nparallel engines like MapReduce and Spark.\n\n[Figure 4 plot: x-axis Num. Features; y-axis Training Time (s); series MLlib and Yggdrasil at D = 13 and D = 15.]\n[Figure 5 plot: x-axis Tree Depth; y-axis Number of Bytes Sent (log scale); series MLlib at p = 1K, 2K, 4K and Yggdrasil at p = {1K, 2K, 4K}.]\n[Figure 6 bars, training time: uncompressed training 134 s; uncompressed + sparse bitvectors 125 s; uncompressed + sparse bitvectors + label encoding 101 s; RLE + sparse bitvectors + label encoding 81 s.]\n\nReferences\n[1] Apache Mahout. https://mahout.apache.org/, 2015.\n\n[2] Y. Ben-Haim and E. Tom-Tov. A streaming parallel decision tree algorithm. The Journal of\n\nMachine Learning Research, 11:849\u2013872, 2010.\n\n[3] L. Breiman. Random forests. Machine learning, 45(1):5\u201332, 2001.\n\n[4] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. 
Classification and regression trees. CRC press, 1984.\n\n[5] D. Caragea, A. Silvescu, and V. Honavar. A framework for learning from distributed data using sufficient statistics and its application to learning decision trees. International Journal of Hybrid Intelligent Systems, 1(1, 2):80\u201389, 2004.\n\n[6] S. Chambi, D. Lemire, O. Kaser, and R. Godin. Better bitmap performance with roaring bitmaps. Software: Practice and Experience, 2015.\n\n[7] T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. arXiv preprint arXiv:1603.02754, 2016.\n\n[8] C. Cortes, M. Mohri, and U. Syed. Deep boosting. In ICML, 2014.\n\n[9] U. M. Fayyad and K. B. Irani. On the handling of continuous-valued attributes in decision tree generation. Mach. Learn., 8(1):87\u2013102, Jan. 1992. ISSN 0885-6125. doi: 10.1023/A:1022638503176. URL http://dx.doi.org/10.1023/A:1022638503176.\n\n[10] A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, and C. Bear. The vertica analytic database: C-store 7 years later. Proceedings of the VLDB Endowment, 5(12):1790\u20131801, 2012.\n\n[11] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in apache spark. arXiv:1505.06807, 2015.\n\n[12] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. Planet: Massively parallel learning of tree ensembles with mapreduce. International Conference on Very Large Data Bases, 2009.\n\n[13] S. R. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3):660\u2013674, 1991.\n\n[14] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O\u2019Neil, et al. C-store: a column-oriented dbms. 
In Proceedings of the 31st international conference on Very large data bases, pages 553\u2013564. VLDB Endowment, 2005.\n\n[15] K. M. Svore and C. Burges. Large-scale learning to rank using boosted decision trees. Scaling Up Machine Learning: Parallel and Distributed Approaches, 2, 2011.\n\n[16] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM \u201909, pages 2061\u20132064, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-512-3. doi: 10.1145/1645953.1646301. URL http://doi.acm.org/10.1145/1645953.1646301.", "award": [], "sourceid": 1900, "authors": [{"given_name": "Firas", "family_name": "Abuzaid", "institution": "MIT"}, {"given_name": "Joseph", "family_name": "Bradley", "institution": "Databricks"}, {"given_name": "Feynman", "family_name": "Liang", "institution": "Cambridge University Engineering Department"}, {"given_name": "Andrew", "family_name": "Feng", "institution": "Yahoo!"}, {"given_name": "Lee", "family_name": "Yang", "institution": "Yahoo!"}, {"given_name": "Matei", "family_name": "Zaharia", "institution": "MIT"}, {"given_name": "Ameet", "family_name": "Talwalkar", "institution": "UCLA"}]}