Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper proposes a few practical tricks to accelerate the implementation of parallel SCD on NUMA architecture. I do like these practical implementation tricks. (I personally uses some of them before.) My concern on this paper is that the technical contribution of this paper looks weak to me. It might be not sufficient to support a publication in NIPS. I personally suggest that authors may consider selling this paper from the perspective of empirical study. For example, training a huge benchmark dataset in a XXX hours or minutes. =========== after rebuttal ============ I would say that this paper is very different from the typical ones I reviewed in this track. I would not be upset if we decide to accept this paper, but I still have the following concerns that should be addressed before publication in anywhere. The theory part directly applies existing paper. The implicit assumption is not clear. Without making appropriately assumption, I even do not think the theoretical result in the early literature can be applied. I agree that this paper designs a few useful strategies to incorporate with the real system, which is very important to practice. Some of them are actually not new - people who implemented CD on the real system mostly used some them but just sell the paper more from the theoretical perspective. Authors should realize this. This paper mainly uses Heish 15's paper as the bench marker for comparison. A better implementation was done in his 17 nips's paper "Asynchronous parallel greedy coordinate descent", which was not compared. Based on the makeup experiments, the speedup curve does not look very well. To my personal experience, it could be even better. Of course, it could be due to the difference on hardwares. Authors need to clarify this.
The analysis of the main bottlenecks and scalabillity issues of state-of-the-art implementations of sequential and parallel coordinate descent algorithms -- in particular in realistic numa configurations -- is well-organized, clear and the flow of the paper relies on this excellent introduction and description. The algorithmic techniques proposed in this work seem new, and impactful. First, one of the main issues is accessing the model vector, which can be cache-line inefficient. The authors introduce the concept of buckets, which is new as a tool to solve the latter issue. In addition, this new concept serves at reducing the computational time of randomizing all the coordinates during the optimization. Second, in order to overcome the issue of accessing efficiently the shared predictions' vector -- which, the authors argue, can cause the algorithm to diverge when scaling to a large number of threads, and which effect is accentuated in a numa configuration -- the authors propose a way to replicate the vector across threads. Third, they carefully adapt their two previous ideas to numa architectures. As illustrated by extensive numerical simulations, they are able to overcome scalability issues of some parallel coordinate descent algorithms. Finally, the algorithm is proved to converge. Overall, I think this is a very good paper, well-written, and I recommend it for acceptance.
The authors perform system-level optimisation of (block) coordinate descent algorithms on multi-core NUMA architectures. They isolate critical bottlenecks, propose both data structure algorithm modifications to overcome these limitations, and explore trade-offs in extensive experiments. A theoretical convergence proof is given for strongly convex functions, and simple to implement tuning rules for the free algorithm parameters are suggested. Overall, I found this paper interesting and refreshing. The experiments are thorough and the improvements suggested by the authors are well motivated. The resulting performance is impressive, with significant speedups compared to many of the standard production codes that are widely available. The theoretical contributions are perhaps not groundbreaking (key algorithm components and convergence proofs already exists in the literature), but the system-level optimization is a major contribution. I do not really have that many comments or criticism on this paper, other than that it would have been nice to scale the experiments to even more threads. Now, scaling appears to be linear and there is no clear diminishing return. Have you tried running even more threads? When does the linear scalability break down? I would also encourage the authors to move Remark 1 from the supplementary into the main text. The additional tuning parameters was one of my main concerns before shifting to read the supplementary. ** Update after rebuttals **. Thank you for addressing my concerns in your rebuttal letter. I look forward to reading the final version of this paper.