NeurIPS 2020

Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model


Review 1

Summary and Contributions: The paper studies a linear regression model in a Hilbert space in which the regressors are random but there is no additive noise, and applies stochastic gradient iterations to optimize the objective. While the presence of additive noise leads to well-known convergence rates, the paper shows that under the noiseless model convergence can be significantly faster. The simulation experiments support the theoretical results.

Strengths: In a field of well-known algorithms and established results, the paper introduces a new view on a classical problem, and a better understanding of linear regression is important. The focus on infinite-dimensional spaces makes the paper's results applicable to kernel regression problems. Upper and lower convergence bounds are established both for the expected value of the covariance-weighted risk and for the estimation error. The paper presents a simulated example of function interpolation in a high-dimensional space, in which the observed convergence rates appear to be very close to the theoretical bounds.

Weaknesses: While the paper is written very clearly, there are several questions I'd like to raise. Firstly, in discussing the applicability of the results, the paper mentions 'some basic vision or sound recognition tasks' (line 33); I would like to see some examples of such tasks. Secondly, looking at the statement of Theorem 1, it seems that it should also apply in finite-dimensional spaces with invertible covariance matrices. If so, I do not understand the results: in particular, for X distributed with finite support and identity covariance matrix, conditions (a) and (b) hold for arbitrarily large positive \alpha, yet the theorem statement then implies that the estimates go to zero at an arbitrarily fast polynomial rate, which is not true. The paper does not give any theoretical argument for the 'tightness' of the proposed bounds, so I would suggest changing the title. On line 45, one should correct the formulas to Tr() = Tr(X X^T \Sigma^{-\alpha}) = E X^T \Sigma^{-\alpha} X. The product notation on line 47 is not introduced.
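To spell the correction out (assuming, as I believe the paper does, that \Sigma = E[X X^T], so that the first term is presumably Tr(\Sigma^{1-\alpha})), the intended chain of equalities would read
$$\operatorname{Tr}(\Sigma^{1-\alpha}) = \operatorname{Tr}\big(\mathbb{E}[X X^\top]\,\Sigma^{-\alpha}\big) = \mathbb{E}\big[\operatorname{Tr}(X X^\top \Sigma^{-\alpha})\big] = \mathbb{E}\big[X^\top \Sigma^{-\alpha} X\big],$$
which follows from linearity of the expectation and the cyclic property of the trace.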

Correctness: There is one question about the statement of the theorem that I would like to have answered.

Clarity: The text is clear and easy to understand.

Relation to Prior Work: The paper does not describe the related work in full detail, which is understandable, because linear regression is a well-studied topic. However, I would suggest adding the known results under the additive-noise assumption in infinite-dimensional spaces, to put the proposed model into perspective.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: This work studies convergence rates for SGD in RKHSs under the noiseless setting and derives convergence rates faster than $O(1/n)$. ----- After reading the response: the authors have addressed my concerns, and I would like to keep my score.

Strengths: Traditionally, the convergence rates of SGD in terms of generalization have been studied in the noisy setting, with only a few exceptions. In those analyses, rates slower than $O(1/n)$ are known to be optimal in general. This paper thus shows that convergence is accelerated thanks to the low noise in the regression problem.

Weaknesses: Although the contribution of this work is significant, as commented above, I think the claim about Sobolev smoothness is an overstatement, because it depends on a somewhat strong condition (lines 215-216) that strongly restricts the class of kernels. If there is any misunderstanding on my part, I would appreciate it if you could point it out.

Correctness: The claims seem correct, but I did not check the detailed proofs.

Clarity: The paper is well written.

Relation to Prior Work: The difference from previous studies is well discussed. Additionally, it would be nice if the authors could explain the differences from [18], which also shows acceleration in low-noise settings.

Reproducibility: Yes

Additional Feedback: The averaging technique is known to stabilize and accelerate the convergence of SGD in RKHSs. How does averaging behave in the noiseless setting?
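To make the question concrete, here is a minimal sketch (not taken from the paper; the dimension, eigenvalue decay, step size, horizon, and optimum are arbitrary choices of mine) that compares the last SGD iterate with its Polyak-Ruppert average on a noiseless linear model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                                # hypothetical dimension
eigvals = 1.0 / np.arange(1, d + 1) ** 2              # assumed polynomially decaying covariance spectrum
theta_star = rng.standard_normal(d) / np.arange(1, d + 1)   # hypothetical optimum

gamma = 1.0 / (2.0 * eigvals.sum())                   # conservative constant step size
theta = np.zeros(d)
theta_avg = np.zeros(d)

for n in range(1, 100001):
    x = np.sqrt(eigvals) * rng.standard_normal(d)     # features with E[x x^T] = diag(eigvals)
    y = x @ theta_star                                # noiseless observation
    theta -= gamma * (x @ theta - y) * x              # plain SGD step on the square loss
    theta_avg += (theta - theta_avg) / n              # running Polyak-Ruppert average

def excess_risk(t):
    # population excess risk E[(x^T (t - theta_star))^2] under the feature model above
    return float(np.sum(eigvals * (t - theta_star) ** 2))

print("last iterate:", excess_risk(theta))
print("averaged    :", excess_risk(theta_avg))
```

Reporting both quantities as a function of n in the noiseless setting would help clarify whether averaging still helps or instead slows down the last-iterate convergence.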


Review 3

Summary and Contributions: The paper analyzes the convergence of SGD for the noiseless linear regression model with a non-linear transformation of the data. The authors claim that under the noiseless linear regression model, the generalization error of SGD vanishes as the number of samples increases. The authors then provide upper and lower bounds on the performance of SGD with a small gap between them. Post Rebuttal: The authors have acknowledged my concerns and I would like to keep my current score.

Strengths: The strength of this work lies in the (almost) matching upper and lower bounds it provides for noiseless linear regression. These results do not require any strong assumptions and are obtained by assuming only regularity of the estimator and of the data. The experiments in the paper also back up the theoretical claims about the possibility of achieving polynomial convergence rates.

Weaknesses: It seems that [18] does not require the optimal predictor to lie entirely outside the kernel space. In that case, how do the bounds hold up against [18] when the predictor is in the kernel space?

Correctness: The claims seem correct

Clarity: Yes

Relation to Prior Work: The authors do a good job of differentiating their work from previous contributions

Reproducibility: No

Additional Feedback: Line 35: a citation for the setting in which this is called multiplicative noise would be great. Line 306: typo in 'connections'.


Review 4

Summary and Contributions: This paper analyses the convergence rate of stochastic gradient descent in the deceptively simple case of zero noise and square loss, in which there exists a deterministic linear relationship between random feature vectors (not necessarily linear in the inputs) and random outputs. It is shown that the convergence rate is given by the minimum between a parameter characterizing the optimum and the power norm of the covariance matrix (taken as a measure of the regularity of the feature vectors). Theoretical and empirical evidence is also provided for the robustness of the findings in the case where the generalisation error is not negligible.

Strengths: The results presented in the paper are theoretically sound, and the simulations corroborate the findings. The paper sheds light on how the convergence rate of SGD depends on the regularity of the optimum and of the feature vectors.

Weaknesses: N.A. At first sight the setting appears deceptively simple; in practice it applies to a very wide range of application scenarios in which perception is involved.

Correctness: Yes

Clarity: The paper is written very well.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: