NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 7725
Title: Bayesian Optimization under Heavy-tailed Payoffs

Reviewer 1

This work considers the challenging and important setting of heavy-tailed rewards and combines several existing techniques and ideas in a novel way: the regret lower bound proof (Scarlett et al.), adaptive truncation in feature space (Shao et al.), and kernel approximation techniques (Mutny et al., Calandriello et al.). While the same problem has been considered before in the k-armed bandit setting (and linear bandits), the extension to the kernelized bandit setting is new and introduces additional challenges that require a different set of ideas.

The paper is well-organized and clearly written. The exposition is clear, and a great number of remarks help to place the obtained results in the context of previous works. The paper structure makes a lot of sense: a suboptimal algorithm that is easy to understand -> regret lower bound -> main algorithm and almost optimal regret upper bound. The related work is also adequately cited in my opinion. Apart from the previously mentioned works, the related work on kernel ridge regression, GP-UCB, the relevant stochastic bandit literature, and other BO works is well covered.

Unlike some previous works in BO that used kernel approximation schemes to reduce the computational cost of inverting kernel matrices (e.g., works on high-dimensional BO), this work uses kernel approximation schemes to obtain almost optimal regret bounds. This is done by adaptive truncation of features while at the same time keeping the approximation error low (the latter was also done in, e.g., Calandriello et al.). The connection between these two ideas is novel to my knowledge.

I did not check the proofs in the appendix, but the main presented results seem convincing and the overall analysis certainly looks non-trivial. The novelty in the lower bound proof is the construction of the reward distribution, which is specific to this particular setting, while the rest of the analysis closely follows the one from Scarlett et al.
When it comes to TGP-UCB, I also agree that blowing up the confidence bounds by a multiplicative factor of b_t in data space provides valid confidence intervals, but I am wondering whether switching to the feature space (plus kernel approximation) is indeed necessary to obtain the optimal regret (and confidence interval) bounds, and if so, why precisely that would be the case. In ATA-GP-UCB, the ridge-leverage-scores-based weighting of different directions is used together with the adaptive truncation to compute approximate rewards. However, TGP-UCB performs truncation only, which leads to suboptimal regret bounds (e.g., in the case of the SE kernel) even when reduced to the sub-Gaussian environment. This makes me wonder whether some other raw-observation truncation ideas could lead to at least a better (if not optimal) regret bound than the one obtained for TGP-UCB.

Minor comments:
- Unlike the exploration (confidence) parameter in TGP-UCB, ATA-GP-UCB requires knowledge of the time horizon T.
- Unlike the sub-Gaussian environment, here one needs to estimate two parameters, alpha and v. It may be worth testing empirically (on synthetic data) the robustness of the obtained confidence sets (e.g., from Lemma 1) with respect to misestimation of these two parameters.

Finally, this work also complements the standard confidence bound results (from Srinivas et al., Abbasi et al., Chowdhury et al., Durand et al., etc.) with new ones for the case of heavy-tailed rewards. Both the result and the technique used to obtain the tighter confidence sets that depend on the approximation error (similarly to Mutny et al. and Calandriello et al.) can be of interest to the BO/GP bandit community.

----------------------------

Edit after author rebuttal: After reading the rebuttal and the other reviews, I remain confident that this is high-quality work and should be accepted.
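For concreteness, the data-space truncation discussed above can be sketched as follows. This is a minimal illustration with hypothetical names and a symmetric threshold, not the paper's actual implementation:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between row-stacked point sets.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def truncated_ucb_scores(X_obs, y_obs, X_cand, b_t, beta=2.0, lam=1.0):
    """GP-UCB acquisition scores from raw observations truncated at b_t.

    Observations exceeding the threshold in magnitude are zeroed before
    the kernel ridge regression, and the confidence width is inflated by
    the multiplicative factor b_t, in the spirit of the data-space
    truncation discussed above.
    """
    y_trunc = np.where(np.abs(y_obs) <= b_t, y_obs, 0.0)
    reg = rbf_kernel(X_obs, X_obs) + lam * np.eye(len(X_obs))
    k_star = rbf_kernel(X_cand, X_obs)                  # (n_cand, n_obs)
    mean = k_star @ np.linalg.solve(reg, y_trunc)       # posterior mean
    # posterior variance: k(x, x) - k_*(x) (K + lam I)^{-1} k_*(x)^T
    var = 1.0 - np.einsum("ij,ji->i", k_star, np.linalg.solve(reg, k_star.T))
    return mean + b_t * np.sqrt(beta * np.clip(var, 0.0, None))
```

Here b_t would grow with t according to some truncation schedule; the point is only that the width inflation happens entirely in data space, without any feature-space embedding.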

Reviewer 2

The paper focuses on Bayesian optimisation with heavy-tailed noise. The goal is to maximize a function f over an action space X in a bandit setting. When choosing the action x, the player observes f(x) plus some heavy-tailed noise. f has some regularity: it has bounded norm in some RKHS. The authors first present a simple algorithm which is far from optimal. They give lower bounds for specific kernels: SE and Matern. They then present an algorithm with two different embedding strategies, give their regret bounds, and compare them empirically.

The methods used are very interesting and black-box optimization is a significant problem. The theoretical results are nice. However, I have the general feeling that the authors tend to oversell their results: the algorithm does not seem to be adapted to a continuous set of arms and does not seem to have a nice complexity in practice either. I am not yet convinced of the improvement brought by this work, especially compared to the "bandits with heavy tail" paper. For these reasons, I give a weak reject score to the paper.

-------------------------------------------------

Major comments:

1. You first consider the problem with a possibly continuous set of arms. However, the computation of your algorithm scales linearly with the cardinality of this set, meaning that your algorithm only works for a finite set of arms. Besides losing much of the interest of the original problem, this is the kind of oversold result I mentioned.

2. Let us now consider only a finite set of actions. The bandit setting then seems to be a more general formulation than your problem: no regularity of f is assumed. In this case, you need to compare your results (theoretically and empirically) with what is obtained in the bandit setting. More generally, it is hard to place this paper within the related literature. I am especially eager to know the performance of the "bandits with heavy tail" algorithm in the experiments with real data.

3. The complexity of the algorithm (see page 7) is very large (besides being linear in the cardinality of X), and this would make your algorithm not very practical. Moreover, you claim some complexity if gamma << T. First, what does this mean? Second, gamma can be a power of T (e.g., for the Matern kernel according to line 127), so this is an overstatement to me. Also, why isn't the per-step space complexity of the Nystrom embedding scaling with t (but only with m_t)? We indeed need to store all x_{t'} for t' <= t.
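On the Nystrom storage question: a minimal sketch of a Nystrom feature map (illustrative names, fixed inducing set; not the paper's code) shows that embedding a new point only needs the m inducing points and an m x m whitening factor, which may be the accounting behind the stated m_t-dependent per-step space complexity; maintaining the inducing set itself is a separate matter.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel matrix between row-stacked point sets.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)

class NystromEmbedding:
    """Nystrom feature map built from m inducing points.

    Only the inducing points Z and the m x m whitening factor are stored,
    so embedding a new point costs O(m d + m^2) time and O(m^2) memory,
    independent of the total number of observations t.
    """
    def __init__(self, Z, jitter=1e-8):
        self.Z = Z
        Kmm = rbf(Z, Z) + jitter * np.eye(len(Z))
        # inverse square root of Kmm via eigendecomposition
        w, V = np.linalg.eigh(Kmm)
        self.W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    def __call__(self, X):
        # phi(x) such that phi(x) . phi(x') = k_m(x) Kmm^{-1} k_m(x')
        return rbf(X, self.Z) @ self.W  # shape (n, m)
```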

Reviewer 3

This paper addresses Bayesian optimization for an objective function belonging to the RKHS induced by some kernel, with heavy-tailed, non-sub-Gaussian noise. The first algorithm simply applies GP-UCB with truncated rewards, and achieves sublinear regret for general kernels, though it is not optimal. The second algorithm approximates the kernel by features with an appropriately chosen dimension and truncates each feature separately, which matches the regret lower bound for some kernels.

The formulation of the problem is well-motivated, and the regret lower and upper bounds are highly non-trivial. In addition, the kernel approximation in the framework of the (theoretical analysis of) BO is quite interesting and seems to be of independent interest for the general BO framework (not only with heavy-tailed noise).

The following are a few minor concerns about the paper.

- The truncation strategy in this paper maps a reward (or its counterpart for ATA-GP-UCB) larger than b_t to 0, not to b_t. As far as I can tell from scanning the appendix this causes no problem, but it is somewhat unintuitive since the truncated reward changes discontinuously. So it would be beneficial to have some explanation (from theoretical and/or experimental viewpoints).
- The dimension of the QFF embedding is unclear. As discussed around Lemma 1, m_t should not be too large. Nevertheless, Theorem 3 only specifies a lower bound on m for the regret bound. It seems from the appendix that the order of m must be equal to (and no larger than) the value specified in Theorem 3.
- Regarding the dimension of the kernel approximation, it would be interesting to compare the empirical results for different choices of m (or q) to evaluate whether the requirement that m not be too large is essential in practice or only for the theoretical analysis.
- In Line 338, I don't understand what "we simulate rewards as y(x) = f(x) + eta, where eta takes values in {-10, 10}, uniformly, for any single random point in X, and is zero everywhere else" means. Isn't a single point in a real space picked with probability zero?
- Remark 7: "Theorem 3 and 4" should be "Theorems 3 and 4".

------------

The feedback is convincing to me; please explicitly write these clarifications in the final version, since they are unclear from the current manuscript.
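For reference, the two truncation rules contrasted in the first minor comment can be written down directly (a hypothetical sketch with a symmetric threshold, not the paper's code):

```python
import numpy as np

def truncate_to_zero(y, b_t):
    # Rule described in the review: a reward exceeding the threshold in
    # magnitude is replaced by 0, so the truncated reward jumps
    # discontinuously at the threshold.
    return np.where(np.abs(y) <= b_t, y, 0.0)

def clip_to_threshold(y, b_t):
    # The arguably more intuitive alternative: clip to [-b_t, b_t],
    # which is continuous in y.
    return np.clip(y, -b_t, b_t)
```

For y = 3.0 and b_t = 2.0, the first rule returns 0.0 while the second returns 2.0; the jump at the threshold is exactly what the comment above finds unintuitive.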