NeurIPS 2020

The Statistical Cost of Robust Kernel Hyperparameter Tuning

Review 1

Summary and Contributions: This submission proves that the sample complexity of learning kernel hyperparameters under adversarial noise is not much larger than the sample complexity for the setting with known optimal parameters. This work not only has theoretical contributions, but also can inspire practitioners to develop more effective algorithms to tune the Spectral Mixture kernel in the future. Overall, I think this is solid work.

Strengths: This work is technically sound. The conclusion of this submission reveals that the key to robust kernel hyperparameter tuning is not a larger amount of data, but the development of more effective algorithms. It makes theoretical contributions and provides guidance to practitioners.

Weaknesses: This work is an extension of [AKM+19]. The novelty and impact of this work are unclear to me since I am not familiar with this field.

Correctness: Yes, I think this method is technically sound.

Clarity: Yes, this submission is well written and well organized.

Relation to Prior Work: Yes, the authors clearly discussed the related works. To my understanding, this work is an extension of [AKM+19].

Reproducibility: Yes

Additional Feedback: As the authors discussed in the conclusion section, this submission does not provide a polynomial-time algorithm to solve the problem. It would be nice if the authors could discuss possible directions for developing such an algorithm in this paper. Thank you to the authors for addressing my concerns in the rebuttal. I keep my score unchanged since the authors' responses do not change my evaluation of the overall quality of this submission.

Review 2

Summary and Contributions: This is a theoretical paper that analyzes the statistical cost of hyperparameter tuning. The authors give a background to previous work and present a series of theoretical results. === Post-rebuttal update === Thank you for your response! I've decided to keep my score as I'm still concerned about the broader impact and potential applications of this work. While I understand that this is a theoretical paper, I would still like to see comparisons to other methods to get a better understanding of its impact.

Strengths: I'm not familiar with this area of research, but the paper considers an interesting problem, is very well-written, and seems theoretically sound.

Weaknesses: I'm not convinced that this work carries much practical importance, and there are no numerical experiments in the paper. In fact, the authors say "This is a theoretical paper, so discussion of broader impacts has to speculate on future applications for this broad line of research.", which makes it sound like it isn't clear what the potential applications are, and the lack of experiments indicates that there were no clear applications to study. Other questions and comments:
- Can you be a bit more explicit about what it means to use O(1) in the bounds? I understand it as a positive constant that doesn't depend on y, z, etc., but being explicit would be helpful. Also, in Problem 2 restated, state the dependencies on C.
- Giving the reader some intuition of what energy means when Problems 1 and 2 are introduced will avoid jumping between sections to understand what is going on. Drawing the connection to inner products in the RKHS would be useful too.
- What is the intuition behind the statistical dimension, and how does it vary between some of the commonly used stationary kernels (RBF, Matérn, etc.)?
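On the statistical dimension question: a common formalization is the ridge effective dimension s_lam(K) = tr(K (K + lam I)^{-1}). A minimal numerical sketch (my own illustration, not code from the paper; the kernel choice, lengthscales, and ridge level are all assumptions) shows how it grows as an RBF lengthscale shrinks:

```python
# Sketch: ridge effective dimension tr(K (K + lam I)^{-1}) of an RBF kernel
# on a small 1-D grid. Shorter lengthscales -> richer kernel -> larger
# statistical dimension. All values are illustrative.
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    # Pairwise squared-exponential kernel matrix on 1-D points x.
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def statistical_dimension(K, lam):
    # Effective dimension at ridge level lam: tr(K (K + lam I)^{-1}).
    n = K.shape[0]
    return np.trace(K @ np.linalg.inv(K + lam * np.eye(n)))

x = np.linspace(0.0, 10.0, 200)
for ls in (0.5, 2.0):
    K = rbf_kernel(x, lengthscale=ls)
    print(ls, statistical_dimension(K, lam=1e-2))
```

The quantity is always between 0 and n, and intuitively counts how many kernel eigendirections survive the ridge penalty; a Matérn kernel's slower spectral decay would give a larger value than an RBF kernel at the same lengthscale.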

Correctness: I believe the claims are correct

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: Within a Kernel Ridge Regression setting, this paper analyses the problem of interpolating a signal with adversarial noise to any arbitrary precision while learning the kernel hyperparameters. The paper proves that this problem can be solved with high probability if the observation positions are drawn from a kernel-independent distribution and the number of samples exceeds a log-linear bound on the statistical dimension of the kernel family. These bounds, for the case of the Spectral Mixture (SM) kernel and the universal sampling distribution, are the main contribution of the paper. For this, the paper generalizes the results of [1] by removing the fixed-kernel assumption and replacing it with a parametrized infinite kernel space. Similarly to [1], the proof proceeds by first showing that the active regression problem with adversarial noise can be solved by finding a near-optimal solution to the Fourier Fitting (FF) problem with ridge regularization. Then, the paper proves the existence of near-optimal solutions to the FF problem by first reducing it to the case where the hyperparameter space is finite and showing that, when observation positions are sampled from a kernel-independent distribution, a solution to the FF problem exists with high probability. The paper then proposes a discretization scheme for the infinite hyperparameter space case. [1] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. A universal sampling method for reconstructing signals with simple Fourier transforms. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1051–1063. ACM, 2019.
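To fix ideas on the object being analyzed: the Spectral Mixture kernel (Wilson & Adams, 2013) in 1-D is k(t) = sum_q w_q exp(-2 pi^2 v_q t^2) cos(2 pi mu_q t), and the regression step once hyperparameters are chosen is ordinary kernel ridge regression. The following is my own illustrative sketch of that pipeline, not the paper's algorithm; all parameter values are made up.

```python
# Sketch: a q-component Spectral Mixture kernel plugged into plain kernel
# ridge regression. Hyperparameters (weights, means, variances) are the
# quantities whose tuning cost the paper analyzes; here they are fixed by hand.
import numpy as np

def sm_kernel(x1, x2, weights, means, variances):
    # 1-D Spectral Mixture kernel: sum of Gaussian-windowed cosines.
    t = x1[:, None] - x2[None, :]
    K = np.zeros_like(t)
    for w, mu, v in zip(weights, means, variances):
        K += w * np.exp(-2.0 * np.pi**2 * v * t**2) * np.cos(2.0 * np.pi * mu * t)
    return K

def krr_fit_predict(x, y, x_test, kern, lam=1e-3):
    # Kernel ridge regression: solve (K + lam I) alpha = y, predict K_* alpha.
    K = kern(x, x)
    alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)
    return kern(x_test, x) @ alpha

# Illustrative use: recover a pure cosine with one matching SM component.
x = np.linspace(0.0, 1.0, 50)
y = np.cos(2.0 * np.pi * x)
kern = lambda a, b: sm_kernel(a, b, weights=[1.0], means=[1.0], variances=[0.1])
pred = krr_fit_predict(x, y, x, kern)
```

When the component mean matches the signal frequency, the fit is near-exact; the paper's question is how many (actively chosen) samples suffice to find such hyperparameters when they are unknown and the observations are adversarially noisy.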

Strengths: The main strength of the paper is the explicit bounds on the number of samples needed for achieving arbitrarily low error when using the SM kernel, which could potentially give insights into improving the training of spectral kernels in general. This is relevant since spectral kernels are notoriously difficult to train, yet of interest to the kernel community due to their universal approximation properties.

Weaknesses: Lack of experimental validation, probably due, as the paper mentions, to a missing algorithm for solving the discretized hyperparameter space problem.

Correctness: The paper is technically correct.

Clarity: The presentation of the paper is very clear, and the outline of the proof is well presented.

Relation to Prior Work: The paper outlines clearly its differences with previous work.

Reproducibility: Yes

Additional Feedback: Post rebuttal: While the paper is promising and lays out the theoretical framework needed for tackling the complexity of fitting an SM kernel, its broader applications are still unclear. Thus, I am keeping the score as it is.

Review 4

Summary and Contributions: This paper considered kernel hyperparameter tuning of the Gaussian Mixture kernel in active regression, with an eps-net-style analysis based on a result in [AKM+19]. The authors showed polynomial sample complexity w.r.t. the number of Gaussian components, the Gaussian length-scales, and the input range, and claimed the sample complexity is tight up to logarithmic factors. Compared with learning under given hyperparameters, hyperparameter tuning only requires an additional multiplicative factor equal to the number of Gaussian components if we ignore logarithmic factors, which shows that hyperparameter tuning needs only a multiplicative-factor increase in samples in the active regression setting.

Strengths: Theoretical results show that, under the active regression setting, hyperparameter tuning is not much harder than learning with given hyperparameters.

Weaknesses: As the authors admitted, the time complexity is still exponential w.r.t. the number of Gaussian components, as the current algorithm needs to enumerate all combinations of hyperparameters.
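For concreteness, the exponential blow-up referred to here can be sketched as follows (my own illustration with made-up grid values, not the paper's discretization): with a grid of size g per hyperparameter and three hyperparameters (weight, mean, variance) per component, brute-force enumeration must consider g^(3q) candidate kernels for q components.

```python
# Sketch: brute-force enumeration over a discretized hyperparameter space.
# The candidate count grows as |grid|^(3q) in the number of components q.
from itertools import product

def candidate_kernels(q, grid):
    # Each component has (weight, mean, variance): |grid|^3 choices,
    # and full tuning takes the Cartesian product over all q components.
    per_component = list(product(grid, grid, grid))
    return product(per_component, repeat=q)

grid = [0.1, 1.0, 10.0]
for q in (1, 2, 3):
    n = sum(1 for _ in candidate_kernels(q, grid))
    print(q, n)  # 27, 729, 19683
```

Even a coarse 3-point grid exceeds 19,000 candidates at q = 3, which is why a polynomial-time search procedure matters in practice.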

Correctness: The derivations are correct, or easy to correct without technical issues, provided the previous results in [AKM+19] are correct.

Clarity: The presentation is clear.

Relation to Prior Work: The authors discussed the relation with prior work clearly.

Reproducibility: Yes

Additional Feedback: Overall the idea is clear and easy to follow. The authors focused on the specific setting of active regression with a Gaussian Mixture kernel and demonstrated that we don't need many more samples compared with learning with pre-fixed hyperparameters, which is a satisfactory result if we focus on the active regression setting. One of my main concerns, as the authors pointed out, is the lack of a practical algorithm when we need to use multiple Gaussian components. And I'm curious whether the techniques can be generalized to the setting with fixed data, which is more general and also indicates some fundamental limit in kernel learning.