Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
As far as I know, this is the first paper I have encountered on progressive training of neural networks with both solid theoretical backing and good empirical performance. I did not check all the proofs in the appendix, but they appear reasonable to me. On the experimental side, I think the paper lacks a training-efficiency comparison between splitting GD and existing pruning methods; only an accuracy comparison is given.
The paper introduces an algorithm for splitting neurons when training neural networks with gradient descent, in order to progressively grow a network architecture for a better fit to the data. Some theoretical connections to Wasserstein-infinity gradient steps are presented, along with multiple experiments on learning interpretable or lightweight neural networks. The proposed method is original and interesting, and the experiments suggest that it is promising for various applications. However, the method as currently proposed seems far from practical in large-scale settings, and the theoretical results presented are quite limited:

* Computationally, the algorithm as presented in Algorithm 1 requires (i) checking whether a local optimum has been reached on the full training loss (and it appears that gradient updates are performed on the full training loss, which is costly), (ii) eigenvector computations on splitting matrices, and (iii) handling dynamically growing weight matrices. These aspects seem to make the algorithm impractical compared to simply running SGD on an over-parameterized network.
* Theoretically, splitting may help escape some local minima by changing the loss landscape, but this only happens when the splitting matrix has negative eigenvalues, which need not hold in practice. The algorithm may therefore still get stuck (at points that are "splitting stable") and fail to converge to the smaller loss that might have been reached by training with many more neurons from the beginning (e.g. in the over-parameterized setting of Mei et al. or Chizat and Bach, where global convergence is attainable). Perhaps the authors could give examples where the Hessian is positive but S(theta) is negative, to highlight situations where this approach would be useful.
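The splitting criterion the review is questioning can be sketched in a few lines. This is a minimal illustration under my own assumptions (the function name, the step size, and the handling of output weights are all hypothetical), not the paper's actual algorithm: a neuron is split only when its splitting matrix has a negative eigenvalue, and otherwise the point is "splitting stable" and the method cannot escape.

```python
import numpy as np

def try_split(theta, S, step=1e-2):
    """Hypothetical sketch of the splitting test discussed above.

    theta : (d,) parameter vector of one neuron
    S     : (d, d) symmetric splitting matrix for that neuron
    Returns a list with either the original neuron (splitting stable)
    or two offspring copies displaced along the descent direction.
    """
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    lam_min, v_min = eigvals[0], eigvecs[:, 0]
    if lam_min >= 0:
        # no negative eigenvalue: splitting cannot decrease the loss here
        return [theta]
    # negative eigenvalue: split into two copies displaced along the
    # minimum eigenvector (output-weight bookkeeping omitted)
    return [theta + step * v_min, theta - step * v_min]
```

This makes the review's concern concrete: the `eigh` call is the per-neuron eigenvector computation flagged as a cost, and the `lam_min >= 0` branch is exactly the "splitting stable" case where the method may remain stuck.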
* The links with Wasserstein-infinity steepest descent are also unclear: Theorem 2.5(a) suggests that the equivalence is with *normalized* gradient updates (G(theta) / ||G(theta)||), while the algorithm seems to rely on standard gradient updates (corresponding to Wasserstein-2). The claim about the "non-asymptotic" nature of the result is misleading, since it only applies locally to a single update (rather than to global convergence), something that would also hold for the Wasserstein-2 setup, even with empirical measures.

Minor comments/typos:
- L48: inherent -> inherit
- L58/59: one could argue that SGD, rather than GD, is the workhorse of deep learning?
- L72: add a reference for the O(eps^2) claim? Footnote: add a reference for "happens rarely"?
- Algorithm 1, point 1: more details would be helpful (what updates? how do you check local optimality?)
- L179: "often not large in practice": clarify?
- L206-208: is this referring to finite p or infinite? What is meant by "no practical difference"? (The number of particles needed for a good approximation may be very large, and thus impractical.)

======= after rebuttal ========

I am increasing my score, as the proposed method seems more usable in practice than I initially thought, and the theory might be useful for the community. Perhaps the authors could provide more information on the final architectures found by their approach, for instance in the experiments of Figure 3. In particular, I wonder whether the final architectures have layers of reasonably similar sizes, in which case the baseline of gradually increasing the size of all layers equally and retraining from scratch (which requires no splitting) might be quite competitive, given the good performance of the 'retrain' strategy.
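The distinction the review draws between the two kinds of updates can be made explicit with a minimal sketch (function names are my own; this is an illustration, not the paper's implementation): a standard gradient step scales with the gradient's magnitude, whereas the normalized step in Theorem 2.5(a) moves a fixed distance eps along the gradient direction.

```python
import numpy as np

def gd_step(theta, grad, eps):
    # standard gradient update (the Wasserstein-2-style step
    # the algorithm appears to use)
    return theta - eps * grad

def normalized_gd_step(theta, grad, eps):
    # normalized update G(theta) / ||G(theta)|| (the form Theorem 2.5(a)
    # associates with Wasserstein-infinity steepest descent)
    n = np.linalg.norm(grad)
    return theta if n == 0 else theta - eps * grad / n
```

For a gradient of norm 5 and eps = 0.1, the standard step moves a distance of 0.5 while the normalized step moves exactly 0.1, which is why the two update rules are not interchangeable in the equivalence claim.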
Before beginning, I should note that I am not very familiar with the related work, so my assessment of this work's novelty is based primarily on the authors' own claims in the paper, although I did some additional reading of the related literature while evaluating this submission.

I enjoyed reading the paper. It is technically sound, and all claims are supported by both theoretical analysis and empirical experiments. It is clearly written and well organized, and easy to read and understand even for someone outside the field. The results of the numerical experiments are very well illustrated. The new approach can be expected to have practical impact, especially for learning lightweight neural architectures.

Some minor comments / typos:
- Figure 1(a): f(x) is difficult to see, even on a color print-out
- Line 102: "can be derive" -> "can be derived"
- Line 447: "Nesterove" -> "Nesterov"

========== After author feedback ==========

I read the author feedback carefully. During the discussions, I realized that I had misinterpreted some of the reported results as showing that the proposed approach outperforms retraining the final architecture from scratch. As a result, I overrated the practical significance of the approach in my initial comments, and I am slightly decreasing my score to reflect this. Nevertheless, the proposed approach still has an adequate level of practical interest, and I vote and argue for the acceptance of the manuscript.