Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
In the past year, a number of papers have given theoretical insight into "double descent" risk, where the minimum risk is (counterintuitively) found at some $p > n$. This paper provides "double descent" results under slightly different settings from the preceding literature. Originality: The idea of double descent is not new, but the authors explore this behavior in a slightly different setting compared to preceding literature. One useful advantage of this paper is that there is no minimal eigenvalue assumption. Quality: Claims are supported by theoretical analysis and complete. Simulations to illustrate the expected behavior would be welcome. Clarity: The paper is properly motivated, well-written, and placed within the surrounding literature. Significance: The paper is an interesting addition to the interpolation / double descent literature, but aside from analyzing its behavior under different assumptions does not pose many new ideas. *** I thank the authors for their careful reading of our reviews and for addressing my concerns.
Authors analyze the behavior of principal component regression estimates when p>n. They provide theory on the asymptotic value of the L2 error with the sample size n growing to infinity and fixed ratios between the number of features and n and the number of PCs and n. While the work is interesting and focuses on the unintuitive phenomenon, in the current form it's practical to use is not obvious. Results obtained under the assumption of polynomial decay of eigenvalues are solid. However, asymptotics there is not that interesting since at some point, the number of required eigenvectors is clearly much smaller than the number of features and the problem becomes easy even if p>n. From my understanding, the decay of eigenvalues and the growth of p is the key component making the theory hold true. It would be interesting to see where the intuition breaks or what happens when new features are equally important (i.e. have equally large eigenvalues) -- what if all eigenvalues are separated from zero. Despite a catchy title suggesting applied work, in the current version, authors do not answer explicitly the question they ask in the title. While clearly, the paper discusses the choice of p, under a set of assumptions introduced in theorems there is no clear guideline on the choice of p. The paper is well-written and easy to read. My two main concerns are: - it seems that it solves a very particular case of a much bigger problem. - it's not actionable despite a suggestive title.
In the paper, the authors discussed PCR, a well-know variant of regression models, and showed the existence of a "double descent" phenomenon. The paper is technically sound and relatively well-written. I check most of the math and they are correct and reasonable to follow. I do have some concern that too much of the space is taken by the algebra which could make it difficult for readers to grasp the high-level intuition, specifically if they do not have enough time to plough through the equations. Considering the space limit for a NeurIPS submission, I think it's better to reorganize some of the proofs to the appendix, and add a discussion/conclusion session to highlight more about the intuitions. Back to the content of the paper: the PCR model dealt in this model is different from traditional regression analysis, in that - It is high-dimensional, with n,p,N all going to infinity - It mainly concerns the generalization (i.e. expected risk, or out-of-sample performance) rather than in-sample performance (e.g. how well the model fits existing data). Now I think both differences are worth highlighting (which are actually direct causes of the "double descent" phenomenon). And these two differences extend the traditional analysis into a relevant and modern regime, especially in our current era (high-dimensional data, over-parametrized neural nets, etc.). Although linear models are simpler, understanding of their behavior in this regime could possibly shed light on the bigger questions (why over-parametrized nnets generalizes well?) I do want to discuss a bit of the high-level intuitions here, especially how the "double descent" actually arises. a) p\infty so one would pick p=N here, and the expected risk goes to 0 as n->\infty as we will eventually learn the right model. So it was actually surprising to see that the generalization risk increases as α->β which actually seems to diverge (Figure 1). Which also means the two key characteristics (high-dimensional, out-of-sample) MUST be playing important roles here. After a careful examination one would conclude that the high-dimensionality is killing the generalization risk when α->β. I believe the following point is crucial: The design matrix X, when α is close to β, is going to be ill-conditioned with high probability (think of marcenko-pastur law of its spectrum). When α=β, it is asymptotically guaranteed (even in the almost sure sense) that X will have singular values arbitrarily close to 0 as n->\infty. This essentially makes estimation very hard. b) p>n, or the second descent: This observation naturally leads to the second descent as p increases pass n. p>n actually serves as an "implicit ridge regularization" here, since it tries to look for the solution with small l2 norm. This can also be seen from the above point: as p moves away from n, X is becoming better-conditioned again. This "over-parametrization serves as implicit regularization" is a pretty interesting phenomenon, as it introduces bias to the problem (which is essentially non-exist when p