Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper proposes a method that reduces memory consumption of back prop, and as a result allows the use of larger batch sizes. The authors state that this is significant e.g. with batch norm, where batch size matters. In my own experience, I believe this is also significant for improved GPU utilization and data-parallel training. The paper is original in that I have not seen a similar treatment (although the actual solution may have been relatively straight-forward; asking the right question was an important part of this). The paper is well written and easy to follow. In Section 3, the paper formally defines the problem as a minimization problem under specified constraints, and then goes on to solve it in Section 4 via BP. I appreciate the systematic approach. The results are great, with in some benchmarks achieving an up to 50% further memory reduction compared to the baseline method (Chen's). I consider the validation of the method good and convincing. 4.4 "If the solution to the time-centric strategy does not exist, then try the memory-centric strategy next to prioritize memory reduction." I don't understand this. Time-centric refers to the optimal strategy. If that does not exist, that means a solution does not exist. Then why try the other strategy? Section 4.2 is very clear that DP is a correct solution to the problem. So please clarify 4.4.
The paper focuses on a broad problem that plagues many researchers today. The size of the models is often limited by the memory of the accelerator (GPU or TPU), the researcher is limited in what they can do. While there have been prior attempts to solve this problem in specific settings, they have been too limited to specific kinds of models and haven't seen broad usage or availability across ML tools and libraries. The paper does a good job of explaining the problem, coming up with a general solution that seems to apply to all kinds of directed graphs (most of deep learning), and empirically demonstrate the wins over existing strategies. While at first glance the submission seems incremental, the significance of it comes from the fact that it could now be more easily integrated into the underlying ML software and be made available to a lot more researchers.
First of all, I do believe that the general treatment of arbitrary computational graph is helpful. The authors build the constraints into the algorithm and uses a DP based algorithm to solve the planning under the given memory constraint. Given that the proposed algorithm generalize beyond what Chen’s algorithm can do, I would recommend the authors to include experiments on models that cannot be handled in Chen’s algorithm. This will help to strengthen paper. I want to point out that there are related treatment of using a tree decomposition https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9 While I know we are not supposed to treat a blog post as existing literature since blogs are not peer-reviewed, the authors should still try to discuss it and give pointers to the related works. Finally, although this is a borderline paper, I do think it has some values and might be interesting to NeurIPS audiences. comment: I have changed the score to marginally above acceptance after reading authors feedback