| Paper ID: | 1900 |
|---|---|
| Title: | Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale |
The paper presents a distributed tree learning method, Yggdrasil. The method is based on vertical partitioning of the dataset, and reports very good speedups for deep decision trees (D > 15) and datasets with many features compared to existing methods based on horizontal partitioning. Yggdrasil is implemented on top of Apache Spark, with an API compatible with MLlib.
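For context, the MLlib decision-tree API that Yggdrasil aims to stay compatible with looks roughly like the standard spark.mllib call below (this is plain MLlib usage, not Yggdrasil-specific code; the input path is a placeholder):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Standard spark.mllib decision-tree training call, shown only to illustrate
// the API surface the paper claims compatibility with.
object MllibTreeExample {
  def train(sc: SparkContext): Unit = {
    // Placeholder dataset path; any LibSVM-formatted file would do.
    val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val model = DecisionTree.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map.empty[Int, Int],
      impurity = "gini",
      maxDepth = 15, // deep trees are where the paper reports its largest gains
      maxBins = 32)
    println(model.toDebugString)
  }
}
```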
The paper is well written and easy to read. The study describes Yggdrasil, a method (and implementation!) for training deep decision trees with many features on distributed systems. The paper contains several novel contributions to the design of systems for distributed decision-tree training. The main performance improvements come from vertical partitioning of the dataset, which results in lower communication costs, the ability to work on compressed data, and the use of sparse bitvectors to further reduce communication. Personally, I like papers about frameworks, implementations, etc., since they provide experimental tools for other researchers to use and build their work on. An important result of this paper is the Yggdrasil implementation, which has already been publicly released. The experimental methodology is sound, and Yggdrasil is compared to relevant state-of-the-art systems and frameworks. However, as is often the case, one would like to see results from more datasets; the paper only presents results from two, one public (MNIST 8M) and one proprietary dataset from a web company.
2-Confident (read it all; understood it all reasonably well)
This paper presents a new system for performing distributed decision-tree training efficiently, in a classical master-workers setting. Compared to PLANET, the algorithm used in standard implementations, the proposed method splits the data vertically, distributing distinct features to workers (as opposed to distributing training instances). The purpose of the method is to remove the bottleneck created by the expensive communication required by horizontal approaches (like PLANET), scaling up to larger numbers of features and deeper decision trees.
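To make the contrast concrete, here is a minimal sketch of feature-wise (vertical) partitioning; the names (`featureColumns`, `numWorkers`) are illustrative, not taken from the paper:

```scala
// Minimal sketch of vertical (feature-wise) partitioning: each worker owns
// whole feature columns rather than a subset of the training instances.
object VerticalPartitionSketch {
  // featureColumns(j) holds the values of feature j for all n instances.
  def partitionByFeature(featureColumns: Array[Array[Double]],
                         numWorkers: Int): Map[Int, Seq[Int]] =
    featureColumns.indices
      .groupBy(j => j % numWorkers)                         // round-robin assignment
      .map { case (worker, feats) => worker -> feats.toSeq }

  def main(args: Array[String]): Unit = {
    val columns = Array.fill(8)(Array(0.1, 0.7, 0.3, 0.9))  // 4 instances, 8 features
    partitionByFeature(columns, numWorkers = 3).toSeq.sortBy(_._1).foreach {
      case (w, feats) => println(s"worker $w owns features ${feats.mkString(", ")}")
    }
  }
}
```

With this layout, each worker can evaluate candidate splits for its own features locally and only needs to exchange the chosen split and per-instance routing decisions.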
The paper is well written, and its structure suits the content. Upon reading the paper, one might think that the contribution resides in the vertical splitting of the data over the workers, but the related-work discussion presented later shows that this idea by itself is not new. The novelty comes from combining it with sparse bit vectors for inter-node communication, feature compression with custom data structures, and training directly on compressed data. The paper shows formally and experimentally how the proposed heuristics significantly reduce communication between the nodes and speed up training.

The remark that run-length encoding the features allows them to fit in the L3 cache, thus decreasing the number of DRAM accesses, does not seem to always hold. The paper should explain under which conditions this is true (size of the cache, size of the data, number and type of features, etc.).

The heuristics are presented for a fixed type of architecture: one master connected to a certain number of workers in a star configuration. Is this the only setting used for large-scale distributed computing, in machine learning or in fields with similar needs? Would changing this architecture remove some of the existing bottlenecks?

The experimental results compare Yggdrasil against different implementations of PLANET. On the two datasets, the method shows important improvements in computation time over the other methods. Here, each dataset is used for a different task (classification and regression), so it is hard to tell whether the speed gain is general or specific to the dataset and the task. It would be useful to see the performance of Yggdrasil on some other datasets as well. Yggdrasil could have an impact in the community using distributed architectures; while the experimental results show a clear improvement over state-of-the-art methods, they would be more convincing over a larger number of datasets.

Minor remarks: I am not sure that the NIPS format allows text to wrap around figures (Figure 3 in this case). End of Eq. (1): x should perhaps be in bold?
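To make the bit-vector point concrete, the following is my own illustration (not the authors' code) of how a worker owning the winning split feature could encode per-instance left/right decisions as a bitset before communicating them:

```scala
import scala.collection.immutable.BitSet

// Illustrative sketch: encode, for each instance, whether it is routed to the
// left child (bit set) or the right child (bit clear). Only this n-bit vector
// then needs to cross the network; a sparse encoding of the set bits can be
// smaller still when the split is unbalanced.
object SplitBitVectorSketch {
  def encodeSplit(featureColumn: Array[Double], threshold: Double): BitSet =
    BitSet(featureColumn.indices.filter(i => featureColumn(i) <= threshold): _*)

  def main(args: Array[String]): Unit = {
    val column = Array(0.2, 0.9, 0.4, 0.7, 0.1)
    val goesLeft = encodeSplit(column, threshold = 0.5)
    println(s"instances routed left: ${goesLeft.mkString(", ")}")
  }
}
```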
2-Confident (read it all; understood it all reasonably well)
This paper proposes a "vertical splitting" (per predictor) method instead of "horizontal splitting" (per instance), thereby improving distributed computation for decision-tree-based methods.
Technical quality: All results are appropriate. Novelty: The approach does seem novel, although, according to the paper, Ye et al. also carried out "vertical splitting", but without the proposed improvements: sparse bitvectors for communication, feature compression, and dividing features into sub-arrays for uncompressed data. Impact: The paper makes a sound algorithmic contribution to distributed computation for classification trees, with a substantial improvement in runtime. Clarity: The paper could be improved by providing more pseudo-code in Section 3.2.
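As an example of the kind of pseudo-code that would help in Section 3.2, here is a rough sketch, under my own assumptions rather than the authors' exact algorithm, of evaluating split candidates directly on a run-length-encoded, value-sorted feature column (each `Run` carries the bin value, the run length, and the sum of labels of the instances in that run):

```scala
// Rough sketch (not the authors' code): score split thresholds on a
// run-length-encoded feature column without decompressing it, using a simple
// variance-reduction criterion for regression.
object SplitOnCompressedSketch {
  final case class Run(value: Int, length: Int, labelSum: Double)

  def bestSplit(runs: Seq[Run]): (Int, Double) = {
    val totalCount = runs.map(_.length).sum
    val totalSum   = runs.map(_.labelSum).sum
    var leftCount  = 0
    var leftSum    = 0.0
    var best: (Int, Double) = (runs.head.value, Double.NegativeInfinity)
    for (run <- runs.init) {           // splitting after the last run is trivial
      leftCount += run.length
      leftSum   += run.labelSum
      val rightCount = totalCount - leftCount
      val rightSum   = totalSum - leftSum
      // Each threshold is scored from run-level aggregates only, so the cost is
      // proportional to the number of runs, not the number of instances.
      val score = leftSum * leftSum / leftCount + rightSum * rightSum / rightCount
      if (score > best._2) best = (run.value, score)
    }
    best
  }
}
```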
2-Confident (read it all; understood it all reasonably well)
This paper mainly addresses the poor scalability of distributed tree learning with respect to data dimensionality and tree depth. Specifically, it employs vertical partitioning instead of the horizontal partitioning used by Google's PLANET project. This technique works better mainly when data dimensionality and tree depth grow significantly. There are also several engineering techniques that further enhance the performance of the Yggdrasil framework, e.g., 1) run-length encoding allows training directly on compressed data without decompression; 2) to minimize communication cost, the authors use sparse bit vectors. The framework combines several smart techniques and works at production scale in industry. I think this paper will raise a lot of interest in the related research community and should definitely be included in the NIPS proceedings.
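As a toy illustration of the run-length-encoding point (my own example, not one from the paper), a binned feature column with long runs of repeated values compresses to a short list of (value, count) pairs, which is what makes cache-resident, decompression-free training plausible:

```scala
// Toy example: run-length encode a discretized (binned) feature column.
object RleToySketch {
  def runLengthEncode(column: Seq[Int]): Seq[(Int, Int)] =
    column.foldLeft(List.empty[(Int, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest // extend current run
      case (acc, x)                      => (x, 1) :: acc      // start a new run
    }.reverse

  def main(args: Array[String]): Unit = {
    val binned = Seq(3, 3, 3, 3, 7, 7, 1, 1, 1, 1, 1)
    println(runLengthEncode(binned)) // List((3,4), (7,2), (1,5))
  }
}
```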
UPDATE: The authors are encouraged to add some toy examples to the descriptions of their optimization techniques. I believe this would help general audiences better understand the paper. The problem targeted in this paper is very interesting and important. The techniques presented are smart and efficient. The experiments successfully validate the effectiveness of all parts of the framework. My major concern is when and where the code will be open-sourced, and how the framework can be used. Minor suggestion: it would be better if the authors could include the strategies used for tuning the Spark configuration.
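On the Spark-tuning point, even listing the relevant configuration keys and values would help reproducibility; something along these lines (the values here are placeholders, not the authors' actual settings):

```scala
import org.apache.spark.SparkConf

// Illustrative only: the kind of Spark settings whose tuning would be worth
// reporting alongside the experiments.
object SparkTuningSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("yggdrasil-experiments")
    .set("spark.executor.memory", "16g")  // executor heap size
    .set("spark.executor.cores", "8")     // cores per executor
    .set("spark.driver.memory", "8g")     // driver (master) heap size
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
}
```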
2-Confident (read it all; understood it all reasonably well)
The paper presents a new system, Yggdrasil, for training decision trees with large depths and feature counts. In contrast to the current state of the art, such as PLANET and XGBoost, which partition the data by example, Yggdrasil partitions the data vertically, by feature. As a result, the amount of communication does not depend on $2^{\text{depth}}$ or on the number of features, which can become a bottleneck with example-partitioning schemes. The paper argues that the binning strategy deployed to remove this inefficiency in other methods leads to a loss in performance. The main contribution of the work is a set of optimized data structures and compression schemes for efficient training. Specifically, run-length encoding is used to compress features, and optimized data structures are used to find splits directly on the compressed data. Results show training-time improvements over PLANET and XGBoost at large depths.
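The depth dependence can be made explicit with a back-of-the-envelope comparison (my paraphrase of the paper's argument, not its exact expressions), with $n$ instances, $p$ features, $B$ histogram bins, $k$ workers, and tree depth $D$:

```latex
% Per-level communication, to first order (paraphrase; constants omitted):
\underbrace{O\bigl(2^{D}\, k\, p\, B\bigr)}_{\text{horizontal: per-node, per-feature histograms}}
\quad\text{vs.}\quad
\underbrace{O\bigl(2^{D}\, k + n\, k\bigr)}_{\text{vertical: candidate splits plus per-instance bit vectors}}
```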
First, the positive: for building an optimized system, the optimizations proposed by the work are important and useful. However, I have several issues and concerns.

a) The exposition is not clear to me at all. The main contribution of the paper, Section 3.2, is very difficult to follow because of its verbose nature and the lack of any mathematical equations or expressions, which makes it hard to understand exactly what the authors are doing. For example, details on how these data structures are actually used for feature splits on compressed data are missing entirely. Section 3.2.3 in particular is not comprehensible: what are these sub-arrays? Again, precise mathematical expressions would have helped. The writing changes required for a proper explanation are non-trivial.

b) Although a lot of useful system and data optimizations are presented, I do not see any algorithmic novelty in the paper, which I believe is required for a conference like NIPS.

c) I am not convinced by the experiments. First of all, XGBoost runs on a different Spark version than Yggdrasil; XGBoost failing for depth > 13 could be due to inefficiencies of the earlier Spark version, so an apples-to-apples comparison is needed. Moreover, the whole argument rests on the claim that binning severely impacts performance, yet no experiments are done to demonstrate or quantify this. For example, if communication time or total running time is made the same for all algorithms by changing the bin size, what is the resulting performance? In fact, no test accuracy numbers are given at all. One can also trade off increasing depth against increasing the number of trees; what is the performance in this scenario? Given that the algorithmic novelty is limited, a comprehensive set of experiments is a must for such a work.
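On the sub-array question in a): my reading, which may well differ from the authors' intent and is exactly why the text should be clarified, is that each uncompressed feature column is kept partitioned into contiguous per-node sub-arrays, so that split finding for a node scans one contiguous, cache-friendly range. A sketch under that assumption (all names are mine, not the paper's):

```scala
// Sketch under an assumed interpretation of "sub-arrays": after a node is
// split, its contiguous slice of the feature column is rewritten so that each
// child owns a contiguous sub-array.
object SubArraySketch {
  final case class NodeRange(start: Int, end: Int) // [start, end) into the column

  def splitNode(column: Array[Double],
                node: NodeRange,
                goesLeft: Int => Boolean): (NodeRange, NodeRange) = {
    val slice = column.slice(node.start, node.end)
    val (left, right) =
      slice.zipWithIndex.partition { case (_, i) => goesLeft(node.start + i) }
    // Rewrite the slice: left child's values first, then the right child's.
    (left.map(_._1) ++ right.map(_._1)).copyToArray(column, node.start)
    val mid = node.start + left.length
    (NodeRange(node.start, mid), NodeRange(mid, node.end))
  }
}
```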
3-Expert (read the paper in detail, know the area, quite certain of my opinion)