Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The theoretical study of instance shrinkage in Pegasos is, as far as I know, novel and interesting. Especially interesting is that instance shrinkage does not affect the solution the model converges to, which justifies the later experiments that ignore importance sampling in deep nets. Similarly, the idea of training a small assistant model just to predict the loss of the base model on unseen examples is straightforward and potentially useful. The algorithm is clearly described, including all hyperparameters, and it does look like it should be possible to replicate the experiments.

It's unclear from reading the experimental section, however, that this algorithm is actually an improvement over regular training with no curriculum attached. No experiment in the paper with a deep net shows a learning curve for training with no curriculum, so it's unclear whether adding an assistant helps at all, or by how much. The paper is also fairly light on how the hyperparameters were tuned; ideally, the hyperparameters would be tuned on the baseline SGD/momentum to minimize convergence time to a given target accuracy, and then the performance of the AutoAssist method would be plotted next to the tuned SGD/momentum baseline and other curriculum learning methods. The paper also completely omits any analysis of how the characteristics of the assistant model affect the performance of AutoAssist. Some questions I'd like to understand better are:

1. How simple can the assistant model be while still yielding an improvement in training time?
2. How aggressive can the example filtering be before convergence is affected?
3. What is the ideal complexity of the assistant model? Does it make sense to have an assistant almost as complex as the base model itself?
4. Is the behavior of AutoAssist affected by the batch size used?
5. Some learning tasks (like CIFAR) are often overfit by state-of-the-art models, while others (like LM1B language modeling) are underfit; does AutoAssist behave differently in the overfit vs. underfit regime? (Specifically, does AutoAssist degrade in performance as the training loss goes to zero on all training set examples, as happens with large nets on CIFAR?)

As the paper currently stands, it's hard to judge whether AutoAssist is an improvement even over the baseline SGD/momentum, and it's impossible to tell how AutoAssist would behave in practice. I don't think all the questions above need to be answered to make this paper acceptable, but some information on more of them would be very helpful.

---- After reading the author feedback, I revise my score to 7, assuming the authors clarify the paper to cover the points I raised.
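For concreteness, the assistant idea discussed above — a lightweight model trained online to predict which examples the base model will find hard, so that easy examples can be mostly skipped — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the logistic-regression assistant, the loss `threshold`, and the `keep_floor` probability are all placeholder choices.

```python
import numpy as np

def assistant_filter(features, base_losses, weights, rng,
                     threshold=1.0, lr=0.1, keep_floor=0.05):
    """One step of an assistant-style filter: a logistic-regression
    'assistant' predicts whether the base model's loss on each example
    exceeds `threshold`; examples predicted easy are mostly skipped."""
    # Assistant's predicted probability that each example is "hard".
    logits = features @ weights
    p_hard = 1.0 / (1.0 + np.exp(-logits))

    # Keep hard-looking examples; keep easy-looking ones only with a
    # small floor probability, so the assistant keeps seeing the full
    # data distribution and can correct itself online.
    keep = (p_hard > 0.5) | (rng.random(len(p_hard)) < keep_floor)

    # Online update of the assistant as a binary classifier, with
    # hard/easy labels derived from the base model's actual losses.
    labels = (base_losses > threshold).astype(float)
    grad = features.T @ (p_hard - labels) / len(labels)
    return keep, weights - lr * grad
```

In a real training loop, the base model would then take a gradient step only on the examples where `keep` is true, and the assistant weights would be carried over to the next batch.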
Most of my comments are reflected in Section 1 (Contributions) and Section 5 (Improvements). I will just provide more points here:

1. In general, the problem this paper wishes to address is very important, especially now that we are in the era of big models + big data (such as BERT). The angle from which this paper tries to solve the problem also makes sense: abandon the useless data points and focus only on the important ones. In particular, I like the idea of using a shallow model as a proxy for the true "utility" and casting the training process as a binary classification task.
2. The paper is also presented in a reasonably clear manner.
3. I have major concerns about the experimental setup, most of which are reflected in Section 5 of this review. Apart from those points, the improvements do not seem that significant, especially on the WMT14 dataset (a larger-scale one) in Fig. 4. I would also suggest moving Fig. 6 from the appendix to the main text, since we care about the final evaluation measures (e.g., accuracy and BLEU) much more than training loss/perplexity.
4. Several representative related works are ignored in the paper. Indeed, there is a rich literature on using importance sampling to speed up ML model training that is largely missing. For example, please check papers [1-4].

[1] Loshchilov, I. and Hutter, F., 2015. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343.
[2] Fan, Y., Tian, F., Qin, T., Li, X.Y. and Liu, T.Y., 2018. Learning to Teach. ICLR.
[3] Tsvetkov, Y., Faruqui, M., Ling, W., MacWhinney, B. and Dyer, C., 2016. Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 130-139.

I also remember a paper from NIPS 2017 on dynamic and automatic batch selection for neural networks, but I am sorry that I do not remember the exact name. I will update the review once I find that paper.

________________________________________________________________________________________________________

Post rebuttal: I thank the authors for their response. I still do not think that a paper claiming faster training convergence can establish itself at NeurIPS if 1) no convincing results on large-scale tasks are demonstrated, and 2) no thorough discussion of and comparison with the literature is provided.
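For context on the batch-selection literature cited in the review above, the scheme of Loshchilov & Hutter (2015) picks training examples with probability tied to their most recently observed loss. The sketch below is a loss-proportional simplification (their actual method uses rank-based, exponentially decaying probabilities); the smoothing constant `eps` is an added illustrative choice, not from the paper.

```python
import numpy as np

def sample_batch(last_losses, batch_size, eps=1e-3, rng=None):
    """Sample a batch of example indices with probability proportional
    to each example's most recently observed loss, plus `eps` so every
    example keeps a nonzero chance of being revisited."""
    rng = rng if rng is not None else np.random.default_rng()
    probs = np.asarray(last_losses, dtype=float) + eps
    probs /= probs.sum()
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)
```

After each training step, the sampled examples' entries in `last_losses` would be refreshed with their new losses, so the selection distribution tracks the model as it learns.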
The paper proposes a lightweight assistant model to filter training samples. The lightweight model is learned online alongside the "Boss" model. The paper is well written and technically sound. Experiments show a higher speedup than previous works.

*** Update: I keep my original rating, and ask the authors to follow through on their statement that "We will include raw ImageNet in the revised version."