NeurIPS 2020

Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics

Review 1

Summary and Contributions: Summary: Analyzing deep learning optimization and connection to generalization error is a hot topic. Existing frameworks (mean field theory and Neural Tangent Kernel theory) require taking limit of infinite width to show (global) convergence. The manuscript explores a novel approach: describe deep NN training as estimating a transportation map and show convergence by analyzing a (infinite dimensional) Langevin dynamics. The authors derive excess risk bounds and even show fast rates.

Strengths: - timely and significant - wealth of methodological insights - addresses important questions

Weaknesses: Almost Nothing to declare Needs to be completed with concentration bounds (Gaussian concentration should be helpful in a first place).

Correctness: So far so good. Really checking to soundness of the paper would require learning or relearning a lot of things, months of work.

Clarity: As clear and simple a paper can be when it relies on such a variety of material. There are some unenglish sentences that could be cleaned up.

Relation to Prior Work: Yes Possible references to add: Da Prato, Giuseppe; Zabczyk, Jerzy Stochastic equations in infinite dimensions. Second edition. Encyclopedia of Mathematics and its Applications, __152__. Cambridge University Press, Cambridge, 2014. xviii+493 pp. Using Langevin dynamics to investigate training neural networks has been in the air since the days of multilayered neural networks (see Statistical mechanics of learning from examples by H. S. Seung, H. Sompolinsky, and N. Tishby Phys. Rev. A 45, 6056 – Published 1 April 1992). A. Dieuleveut's Doctoral dissertation about Stochastic Approximation in Hilbert Spaces contains relevant matrial. See

Reproducibility: Yes

Additional Feedback: Stochastic Differential Equations are not the easiest topics. Especially in functional spaces. Give more hints about the spaces where solutions are supposed to live, about domain and range of operators.

Review 2

Summary and Contributions: This paper models the training process of neural networks as the estimation of a transportation map, the one taking the initial values of the weights to their values at time t (in a mean-field model of a two-layer network). The evolution of this map is studied as an infinite-dimensional Langevin dynamics and the authors show that (i) the dynamics converges to the invariant distribution under standard assumptions of ergodicity, (ii) compute the convergence rate to this invariant distribution, and (iii) compute the generalization error of solution at time k along with a bound on the excess risk. Examples are provided for binary classifier with the logistic loss and regression problems. This paper is a very terse read and a tour de force. This is a theoretical paper which could be influential.

Strengths: - The point of view of the authors, that of writing down the evolution of map as a stochastic differential equation is refreshing and quite different from the two prevalent themes (mean-field dynamics and neural tangent kernel) in the literature. - The mathematics for the excess risk bound seems novel to my understanding. - The discussion about achieving a minimax optimal "fast" rate for non-parametric regression is wonderful.

Weaknesses: - The expected excess risk bound in Theorem 2 is a lot to unpack. The Assumptions 1, 2, especially the former, should be made easier to understand. - What prevents the authors from applying this result to a non-residual deep network? It is important to discuss this.

Correctness: I have followed the main text of the paper and the theoretical results do not seem to have obvious holes.

Clarity: The current draft is a squished 8-page version of what looks like a very long and complex paper and I would encourage the authors to improve the exposition in the conference version potentially at the cost of some results.

Relation to Prior Work: This is adequately discussed.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper views the training procedure as a mapping W on the parameter space. More precisely, the mapping W itself can be viewed as one from a stochastic process of functions, induced by the training algorithm such as Langevin dynamics. Under the newly proposed framework, the optimization and generalization results are presented.

Strengths: A lifted view from the mean-field theory is very novel. Since the initial distribution can be on finite support, this framework can also apply to finite-width networks, which is an advantage compared with mean-field methods.

Weaknesses: The idea of 'lift' could cost the readers much effort to understand, especially when the idea is not clearly presented in the paper. While the new framework allows small width (compared with mean-field requiring large width to approximate the distribution), it instead needs to average over the randomness from the training algorithms. To be specific, in mean-field theory, with (undesirably) large width, the network's loss approximates the expected loss; in transportation map estimation proposed here, the expected loss of a finite-width network can be indeed bounded, but it seems not easy to guarantee the loss in one real training process is close to its expectation. Averaging over multiple training processes can address this issue, of course, but potentially it will introduces enough computation cost to make it no better than mean field.

Correctness: While I cannot go over all proof and cited results, most of the proof are correct.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Post-rebuttal: The authors' response addressed most of my concerns. I will raise my score to 6. ================================ While I find the idea of this paper very fascinating, some concerns need to be addressed. I am very willing to update my over all score given the feedback. Major concerns: 1) As mentioned in the weakness session, it seems what this paper does is moving the width requirement in the mean-field theory to the requirements of training many times. The optimization result in Eqn (7) is in expectation. Is it possible to derive a high probability bound on the training loss of one training trajectory? Instead of in expectation? If the answer is no I think the claimed improvement could look very suspicious and considered as a weakness. 2) Under your framework, the (expected) performance of the neural network function f_W depends highly on the initial distribution \rho_0. How is this reflected in your results on excessive risk? 3) In Assumption 1 (ii) , the gradient is assumed to be bounded for any w and x, would you provide an example that satisfies this? So as in Assumption 1 (iii). Minor issues: 1) The definition of ResNet in line 188-192 is very strange. How would you convince people that setting is in resemblance of any real ResNet? Also, how is the formulation applied to optimal transport? Typoes: 1) In Eqn (7) the notation for loss function L is not in \mathcal mode?

Review 4

Summary and Contributions: By formulating the neural network training as an optimal transport mapping finding problem of the parameters, this paper resolves several difficulties: (i) diverging width against sample size and (ii) curse of dimensionality. This paper mainly contains three parts: 1. It formulates neural network training as a transportation map learning of weights (parameters) and solves this problem by infinite-dimensional gradient Langevin dynamics in RKHS. 2. It shows the global convergence for the problem. 3. It also gives the generalization error bound of the estimator obtained by the proposed optimization framework.

Strengths: 1. The proposed theoretical framework covers the DNNs with finite width, which is excluded by the NTK and mean-field methods. 2. For optimization, it gives out a size-independent convergence rate. 3. For the generalization bound on the regression and the classification problem, this paper obtains the minimax and the exponential convergence, which may be hard to show in other general theoretical frameworks (without regularization).

Weaknesses: 1. The optimization part seems mainly comes from the results given by [36], and the results benefit from the regularization and the Eigen decay assumption, which may be a strong assumption since it directly helps to erase the dependence of the dimension for the bound. From this perspective, the error bound may not be so impressive. 2. The generalization conclusion also seems to profit from the regularization. With regularizer, it is always convenient to obtain certain tight generalization bounds.

Correctness: Seems correct (didn't check all the details).

Clarity: Clear writing. Minor issue: Line 312: missing reference.

Relation to Prior Work: Almost clear

Reproducibility: Yes

Additional Feedback: -------------------------------- The problem I concern is fairly discussed. I notice that the other review finds that the paper may have a gap between the analyzed SDE and the practical algorithm. Considering all the advantages and disadvantages of this paper, I insist on my original score (7).