Paper ID: | 330 |
---|---|

Title: | Uniform Error Bounds for Gaussian Process Regression with Application to Safe Control |

This paper is a joy to read: clear, insightful and well-written. EDIT: I'm happy with how the authors addressed these concerns in the rebuttal. It is also not starved for space. Given that, perhaps it is a good idea to add a definition of what a uniform error bound is. Readers may have a wrong working definition of it. This is not too severe, as the theorems are rigorously stated and do not _use_ the definition of a uniform bound. However, they _are_ uniform bounds and it would be good to explain why. In the experiments, you observe the dynamical systems with some noise at every time step, and do regression from one step to the next. This is very sensible, since in a dynamical system, the measurements are usually taken with the same sensors at every step. However, note that doing this violates Assumption 3.1. This is because now you are also observing the _inputs_ of your regression with noise. You observe x_i + ε_i, so you do not know x_i. In practice in the experiments the noise is so small that it does not matter: the bounds can still be heuristically applied, and will be only a little bit incorrect. Regarding Section 3.3, why is it in the paper? Is it to show off the power of Theorem 3.1? It doesn't seem connected to the rest. As I mentioned earlier, it is unclear how to make sure the variance follows the decrease schedule O(log(N)^(-1/2-ε)). You state that it "depends on the structure of the covariance kernel". It would be good to add an example to this section, using something simple such as the RBF or linear kernel.

The paper provides bounds on the Lipschitz constants of sample functions from Gaussian processes. It also provides bounds on the prediction errors and on the Lipschitz constants of the conditional means and variances. Then, an application is safe control is provided. Originality: The bounds for Gaussian processes are relatively similar to the existing ones, from the references that are given in the paper. The proofs also appear to use standard arguments, for instance the classical bound on the mean value of the supremum of a Gaussian process and the Borel TIS inequality. The authors write that a novelty of their bounds is that they do not rely on unknown quantities. More precisely, the previous bounds hold for fixed functions assumed to belong to RKHSs while their bounds hold in probability w.r.t. the distribution of the Gaussian process. This type of analysis can also have its limits, because the distribution of the Gaussian process (which may need to be selected arbitrarily) can assign very small probabilities to functions that look like the unknown function. Hence, I would not say that the framework of the paper is better (or worse) than the one from previous references. It is just of different nature. The authors write that some references, for instance [7], require bounded observation errors and exclude Gaussian errors. I am not sure that this is true, because I think that these references require Gaussian observation errors. Quality: The proofs appear to be correct but relatively standard. Clarity: The paper is rather clear. Perhaps the authors could explain better the statement of Theorem 3.3. The speak of a process generating infinitely many functions with given Lipschitz constant. This can not be a Gaussian process, since the Lipschitz constant of a Gaussian process is usually not a bounded random variable. Significance: I think the paper provides a welcome contribution to the analysis of Gaussian process extrema, of the prediction error and of safe control. The results are nevertheless not very different from previously existing ones.

This article presents bounds on the absolute difference between a sample from a (possibly noisy) Gaussian process (GP) and the predictive mean. The methodology relies on deriving probabilistic Lipschitz constants for the GP, and does not consider the RKHS associated with the covariance kernel as is generally done. As a result, the bounds are easier to use in practice, as shown on two examples. The proposed Lispschitz bounds on GPs are an interesting result by themselves. But it should be related to and compared with previous derivations, see, e.g., González, J., Dai, Z., Hennig, P., & Lawrence, N. (2016, May). Batch Bayesian optimization via local penalization. In Artificial intelligence and statistics (pp. 648-657). Concerning the main theorem 3.3, could you precise how to achieve the condition on \sigma_N in practice. Simply adding more training points (P7L252) in an arbitrary way would not be sufficient, expecially with noise. Overall the paper is clear and well-written, with potential for both theoretical works and practical ones. Minor point: - Proof if Theorem 3.2 on continuity of samples: early results on this can be found, e.g., in Cramer, H., & Leadbetter, M. R. (1967). Stationary and Related Stochastic Processes-Sample Function Properties and their Applications. Typos: P3L83: on the one hand? P3L87: kernel is usual small P7L272: The simulations ## Added after rebuttal I would like to thank the authors for their response, which addresses most of my concerns. Accordingly I increased my score.