{"title": "Truncated Variance Reduction: A Unified Approach to Bayesian Optimization and Level-Set Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1507, "page_last": 1515, "abstract": "We present a new algorithm, truncated variance reduction (TruVaR), that treats Bayesian optimization (BO) and level-set estimation (LSE) with Gaussian processes in a unified fashion. The algorithm greedily shrinks a sum of truncated variances within a set of potential maximizers (BO) or unclassified points (LSE), which is updated based on confidence bounds. TruVaR is effective in several important settings that are typically non-trivial to incorporate into myopic algorithms, including pointwise costs and heteroscedastic noise. We provide a general theoretical guarantee for TruVaR covering these aspects, and use it to recover and strengthen existing results on BO and LSE. Moreover, we provide a new result for a setting where one can select from a number of noise levels having associated costs. We demonstrate the effectiveness of the algorithm on both synthetic and real-world data sets.", "full_text": "Truncated Variance Reduction: A Uni\ufb01ed Approach\nto Bayesian Optimization and Level-Set Estimation\n\nIlija Bogunovic1, Jonathan Scarlett1, Andreas Krause2, Volkan Cevher1\n\n1 Laboratory for Information and Inference Systems (LIONS), EPFL\n\n2 Learning and Adaptive Systems Group, ETH Z\u00a8urich\n\n{ilija.bogunovic,jonathan.scarlett,volkan.cevher}@ep\ufb02.ch, krausea@ethz.ch\n\nAbstract\n\nWe present a new algorithm, truncated variance reduction (TRUVAR), that treats\nBayesian optimization (BO) and level-set estimation (LSE) with Gaussian pro-\ncesses in a uni\ufb01ed fashion. The algorithm greedily shrinks a sum of truncated\nvariances within a set of potential maximizers (BO) or unclassi\ufb01ed points (LSE),\nwhich is updated based on con\ufb01dence bounds. 
TRUVAR is effective in several important settings that are typically non-trivial to incorporate into myopic algorithms, including pointwise costs and heteroscedastic noise. We provide a general theoretical guarantee for TRUVAR covering these aspects, and use it to recover and strengthen existing results on BO and LSE. Moreover, we provide a new result for a setting where one can select from a number of noise levels having associated costs. We demonstrate the effectiveness of the algorithm on both synthetic and real-world data sets.

1 Introduction

Bayesian optimization (BO) [1] provides a powerful framework for automating design problems, and finds applications in robotics, environmental monitoring, and automated machine learning, just to name a few. One seeks to find the maximum of an unknown reward function that is expensive to evaluate, based on a sequence of suitably-chosen points and noisy observations. Numerous BO algorithms have been presented previously; see Section 1.1 for an overview.

Level-set estimation (LSE) [2] is closely related to BO, with the added twist that instead of seeking a maximizer, one seeks to classify the domain into points that lie above or below a certain threshold. This is of considerable interest in applications such as environmental monitoring and sensor networks, allowing one to find all “sufficiently good” points rather than the best point alone.

While BO and LSE are closely related, they are typically studied in isolation. In this paper, we provide a unified treatment of the two via a new algorithm, Truncated Variance Reduction (TRUVAR), which enjoys theoretical guarantees, good computational complexity, and the versatility to handle important settings such as pointwise costs, non-constant noise, and multi-task scenarios. 
The main result of this paper applies to the former two settings, and even in the fixed-noise and unit-cost case, we refine existing bounds via a significantly improved dependence on the noise level.

1.1 Previous Work

Three popular myopic techniques for Bayesian optimization are expected improvement (EI), probability of improvement (PI), and Gaussian process upper confidence bound (GP-UCB) [1, 3], each of which chooses the point maximizing an acquisition function depending directly on the current posterior mean and variance. In [4], the GP-UCB-PE algorithm was presented for BO, choosing the highest-variance point within a set of potential maximizers that is updated based on confidence bounds. Another relevant BO algorithm is BaMSOO [5], which also keeps track of potential maximizers, but instead chooses points based on a global optimization technique called simultaneous optimistic optimization (SOO). An algorithm for level-set estimation with GPs is given in [2], which keeps track of a set of unclassified points. These algorithms are computationally efficient and have various theoretical guarantees, but it is unclear how best to incorporate aspects such as pointwise costs and heteroscedastic noise [6]. The same is true for the Straddle heuristic for LSE [7].

Entropy search (ES) [8] and its predictive version [9] choose points to reduce the uncertainty of the location of the maximum, doing so via a one-step lookahead of the posterior rather than only the current posterior. While this is more computationally expensive, it also permits versatility with respect to costs [6], heteroscedastic noise [10], and multi-task scenarios [6]. A recent approach called minimum regret search (MRS) [11] also performs a look-ahead, but instead chooses points to minimize the regret. 
To our knowledge, no theoretical guarantees have been provided for these.\nThe multi-armed bandit (MAB) [12] literature has developed alongside the BO literature, with the\ntwo often bearing similar concepts. The MAB literature is far too extensive to cover here, but we\nbrie\ufb02y mention some variants relevant to this paper. Extensive attention has been paid to the best-\narm identi\ufb01cation problem [13], and cost constraints have been incorporated in a variety of forms\n[14]. Moreover, the concept of \u201czooming in\u201d to the optimal point has been explored [15]. In general,\nthe assumptions and analysis techniques in the MAB and BO literature are quite different.\n\n1.2 Contributions\n\nWe present a uni\ufb01ed analysis of Bayesian optimization and level-set estimation via a new algorithm\nTruncated Variance Reduction (TRUVAR). The algorithm works by keeping track of a set of poten-\ntial maximizers (BO) or unclassi\ufb01ed points (LSE), selecting points that shrink the uncertainty within\nthat set up to a truncation threshold, and updating the set using con\ufb01dence bounds. Similarly to ES\nand MRS, the algorithm performs a one-step lookahead that is highly bene\ufb01cial in terms of versa-\ntility. However, unlike these previous works, our lookahead avoids the computationally expensive\ntask of averaging over the posterior distribution and the observations.\nAlso in contrast with ES and MRS, we provide theoretical bounds for TRUVAR characterizing the\ncost required to achieve a certain accuracy in \ufb01nding a near-optimal point (BO) or in classifying\neach point in the domain (LSE). By applying this to the standard BO setting, we not only recover\nexisting results [2, 4], but we also strengthen them via a signi\ufb01cantly improved dependence on the\nnoise level, with better asymptotics in the small noise limit. 
Moreover, we provide a novel result for a setting in which the algorithm can choose among several noise levels, each having an associated cost. Finally, we compare our algorithm to previous works on several synthetic and real-world data sets, observing it to perform favorably in a variety of settings.

2 Problem Setup and Proposed Algorithm

Setup: We seek to sequentially optimize an unknown reward function f(x) over a finite domain D.¹ At time t, we query a single point x_t ∈ D and observe a noisy sample y_t = f(x_t) + z_t, where z_t ∼ N(0, σ²(x_t)) for some known noise function σ²(·) : D → R₊. Thus, in general, some points may be noisier than others, in which case we have heteroscedastic noise [10]. We associate with each point a cost according to some known cost function c : D → R₊. If both σ²(·) and c(·) are set to be constant, then we recover the standard homoscedastic and unit-cost setting.

We model f(x) as a Gaussian process (GP) [16] having mean zero and kernel function k(x, x'), normalized so that k(x, x) = 1 for all x ∈ D. The posterior distribution of f given the points and observations up to time t is again a GP, with the posterior mean and variance given by [10]

μ_t(x) = k_t(x)^T (K_t + Σ_t)⁻¹ y_t,    (1)
σ_t(x)² = k(x, x) − k_t(x)^T (K_t + Σ_t)⁻¹ k_t(x),    (2)

where k_t(x) = [k(x_i, x)]_{i=1}^t, K_t = [k(x_t, x_{t'})]_{t,t'}, and Σ_t = diag(σ²(x_1), . . . , σ²(x_t)). We also let σ²_{t−1|x}(x̄) denote the posterior variance of x̄ upon observing x along with x_1, · · · , x_{t−1}.

¹Extensions to continuous domains are discussed in the supplementary material.

[Figure 1 appears here; panels (a) t = 6, (b) t = 7, (c) t = 8, (d) t = 9, with a legend showing the confidence bounds, target confidence width, selected point, maximum lower bound, and potential maximizers.]

Figure 1: An illustration of the TRUVAR algorithm. 
In (a), (b), and (c), three points within the set of potential maximizers M_t are selected in order to bring the confidence bounds to within the target range, and M_t shrinks during this process. In (d), the target confidence width shrinks as a result of the last selected point bringing the confidence within M_t to within the previous target.

We consider both Bayesian optimization, which consists of finding a point whose function value is as high as possible, and level-set estimation, which consists of classifying the domain into points that lie above or below a given threshold h. The precise performance criteria for these settings are given in Definition 3.1 below. Essentially, after spending a certain cost we report a point (BO) or a classification (LSE), but there is no preference on the values of f(x_t) for the points x_t chosen before coming to such a decision (in contrast with other notions such as cumulative regret).

TRUVAR algorithm: Our algorithm is described in Algorithm 1, making use of the updates described in Algorithm 2. The algorithm keeps track of a sequence of unclassified points M_t, representing potential maximizers for BO or points close to h for LSE. This set is updated based on the confidence bounds depending on constants β_(i). The algorithm proceeds in epochs, where in the i-th epoch it seeks to bring the confidence β_(i)^{1/2} σ_t(x) of points within M_t below a target value η_(i). It does this by greedily minimizing the sum of truncated variances Σ_{x̄∈M_{t−1}} max{β_(i) σ²_{t−1|x}(x̄), η²_(i)} arising from choosing the point x, along with a normalization and division by c(x) to favor low-cost points. The truncation by η_(i) in this decision rule means that once the confidence of a point is below the current target value, there is no preference in making it any lower (until the target is decreased).
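In code, one step of this selection rule (the posterior variance from (1)–(2) plus the truncated-variance score of Algorithm 1) can be sketched as follows. This is a minimal NumPy illustration over a finite domain, not the authors' implementation; the inputs (a prior kernel matrix `K`, a list `obs` of observed indices, and per-point `noise_var` and `cost` arrays) are hypothetical:

```python
import numpy as np

def posterior_var(K, obs, noise_var):
    """Posterior variance at all points given observed indices `obs` (a list).
    K: (n, n) prior kernel matrix; noise_var[i]: noise variance when sampling i."""
    if len(obs) == 0:
        return np.diag(K).copy()
    Koo = K[np.ix_(obs, obs)] + np.diag(noise_var[obs])
    Kxo = K[:, obs]
    # sigma_t^2(x) = k(x, x) - k_t(x)^T (K_t + Sigma_t)^{-1} k_t(x)
    return np.diag(K) - np.einsum('ij,ij->i', Kxo @ np.linalg.inv(Koo), Kxo)

def truvar_select(K, obs, noise_var, cost, M, beta, eta):
    """One TruVaR step: pick the point maximizing the reduction of the sum of
    truncated variances within M, normalized by the point's cost."""
    before = np.maximum(beta * posterior_var(K, obs, noise_var)[M], eta ** 2).sum()
    scores = []
    for x in range(K.shape[0]):
        var_after = posterior_var(K, obs + [x], noise_var)[M]
        after = np.maximum(beta * var_after, eta ** 2).sum()
        scores.append((before - after) / cost[x])
    return int(np.argmax(scores))
```

For example, on a three-point domain where the middle point is correlated with both neighbors, the rule picks the middle point first, since one sample there shrinks the truncated variance of the whole set at the same unit cost.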
Once the confidence of every point in M_t is less than a factor 1 + δ above the target value, the target confidence is reduced according to a multiplication by r ∈ (0, 1). An illustration of the process is given in Figure 1, with details in the caption.

For level-set estimation, we also keep track of the sets H_t and L_t, containing points believed to have function values above and below h, respectively. The constraint x ∈ M_{t−1} in (5)–(7) ensures that {M_t} is non-increasing with respect to inclusion, and H_t and L_t are non-decreasing.

Algorithm 1 Truncated Variance Reduction (TRUVAR)
Input: Domain D, GP prior (μ₀, σ₀, k), confidence bound parameters δ > 0, r ∈ (0, 1), {β_(i)}_{i≥1}, η_(1) > 0, and for LSE, level-set threshold h
1: Initialize the epoch number i = 1 and potential maximizers M_(0) = D.
2: for t = 1, 2, . . . do
3:   Choose

x_t = arg max_{x∈D} [ Σ_{x̄∈M_{t−1}} max{β_(i) σ²_{t−1}(x̄), η²_(i)} − Σ_{x̄∈M_{t−1}} max{β_(i) σ²_{t−1|x}(x̄), η²_(i)} ] / c(x).    (3)

4:   Observe the noisy function sample y_t, and update according to Algorithm 2 to obtain M_t, μ_t, σ_t, l_t and u_t, as well as H_t and L_t in the case of LSE
5:   while max_{x∈M_t} β_(i)^{1/2} σ_t(x) ≤ (1 + δ)η_(i) do
6:     Increment i, set η_(i) = r × η_(i−1).

The choices of β_(i), δ, and r are discussed in Section 4. As with previous works, the kernel is assumed known in our theoretical results, whereas in practice it is typically learned from training data [3]. 
Characterizing the effect of model mismatch or online hyperparameter updates is beyond the scope of this paper, but is an interesting direction for future work.

Algorithm 2 Parameter Updates for TRUVAR
Input: Selected points and observations {x_{t'}}_{t'=1}^{t}; {y_{t'}}_{t'=1}^{t}, previous sets M_{t−1}, H_{t−1}, L_{t−1}, parameter β_(i)^{1/2}, and for LSE, level-set threshold h.
1: Update μ_t and σ_t according to (1)–(2), and form the upper and lower confidence bounds

u_t(x) = μ_t(x) + β_(i)^{1/2} σ_t(x),    l_t(x) = μ_t(x) − β_(i)^{1/2} σ_t(x).    (4)

2: For BO, set

M_t = { x ∈ M_{t−1} : u_t(x) ≥ max_{x̄∈M_{t−1}} l_t(x̄) },    (5)

or for LSE, set

M_t = { x ∈ M_{t−1} : u_t(x) ≥ h and l_t(x) ≤ h },    (6)
H_t = H_{t−1} ∪ { x ∈ M_{t−1} : l_t(x) > h },    L_t = L_{t−1} ∪ { x ∈ M_{t−1} : u_t(x) < h }.    (7)

Some variants of our algorithm and theory are discussed in the supplementary material due to lack of space, including pure variance reduction, non-Bayesian settings [3], continuous domains [3], the batch setting [4], and implicit thresholds for level-set estimation [2].

3 Theoretical Bounds

In order to state our results for BO and LSE in a unified fashion, we define a notion of ε-accuracy for the two settings. That is, we define this term differently in the two scenarios, but then we provide theorems that simultaneously apply to both. All proofs are given in the supplementary material.

Definition 3.1. 
After time step t of TRUVAR, we use the following terminology:
• For BO, the set M_t is ε-accurate if it contains all true maxima x* ∈ arg max_x f(x), and all of its points satisfy f(x*) − f(x) ≤ ε.
• For LSE, the triplet (M_t, H_t, L_t) is ε-accurate if all points in H_t satisfy f(x) > h, all points in L_t satisfy f(x) < h, and all points in M_t satisfy |f(x) − h| ≤ ε/2.
In both cases, the cumulative cost after time t is defined as C_t = Σ_{t'=1}^{t} c(x_{t'}).

We use ε/2 in the LSE setting instead of ε since this creates a region of size ε where the function value lies, which is consistent with the BO setting. Our performance criterion for level-set estimation is slightly different from that of [2], but the two are closely related.

3.1 General Result

Preliminary definitions: Suppose that the {β_(i)} are chosen to ensure valid confidence bounds, i.e., l_t(x) ≤ f(x) ≤ u_t(x) with high probability; see Theorem 3.1 and its proof below for such choices. In this case, we have after the i-th epoch that all points are either already discarded (BO) or classified (LSE), or are known up to the confidence level (1 + δ)η_(i). For the points with such confidence, we have u_t(x) − l_t(x) ≤ 2(1 + δ)η_(i), and hence

u_t(x) ≤ l_t(x) + 2(1 + δ)η_(i) ≤ f(x) + 2(1 + δ)η_(i),    (8)

and similarly l_t(x) ≥ f(x) − 2(1 + δ)η_(i). This means that all points other than those within a gap of width 4(1 + δ)η_(i) must have been discarded or classified:

M_t ⊆ { x : f(x) ≥ f(x*) − 4(1 + δ)η_(i) } =: M^(i)    (BO)    (9)
M_t ⊆ { x : |f(x) − h| ≤ 2(1 + δ)η_(i) } =: M^(i)    (LSE)    (10)

Since no points are discarded or classified initially, we define M^(0) = D.

For a collection of points S = (x'_1, . . . , x'_{|S|}), possibly containing duplicates, we write the total cost as c(S) = Σ_{i=1}^{|S|} c(x'_i). 
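The set-update rules (5)–(7) underlying this argument are simple to express in code; the sketch below uses boolean masks over a finite domain, with hypothetical array names (`mu` and `sd` hold the current posterior mean and standard deviation):

```python
import numpy as np

def update_sets_bo(mu, sd, M, beta_sqrt):
    """BO update (5): keep points whose UCB is at least the best LCB within M."""
    u = mu + beta_sqrt * sd
    l = mu - beta_sqrt * sd
    return M & (u >= l[M].max())

def update_sets_lse(mu, sd, M, H, L, h, beta_sqrt):
    """LSE updates (6)-(7): move confidently-high points to H, confidently-low
    points to L, and keep only still-ambiguous points in M."""
    u = mu + beta_sqrt * sd
    l = mu - beta_sqrt * sd
    H = H | (M & (l > h))
    L = L | (M & (u < h))
    M = M & (u >= h) & (l <= h)
    return M, H, L
```

Since the new sets are intersected with (or grown from) the previous ones, M is non-increasing and H, L are non-decreasing, exactly as required by the epoch-end argument above.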
Moreover, we denote the posterior variance upon observing the points up to time t − 1 and the additional points in S by σ_{t−1|S}(x). Therefore, c(x) = c({x}) and σ_{t−1|x}(x) = σ_{t−1|{x}}(x). The minimum cost (respectively, maximum cost) is denoted by c_min = min_{x∈D} c(x) (respectively, c_max = max_{x∈D} c(x)).

Finally, we introduce the quantity

C*(ξ, M) = min_S { c(S) : max_{x∈M} σ_{0|S}(x) ≤ ξ },    (11)

representing the minimum cost to achieve a posterior standard deviation of at most ξ within M.

Main result: In all of our results, we make the following assumption.

Assumption 3.1. The kernel k(x, x') is such that the variance reduction function

φ_{t,x}(S) = σ²_t(x) − σ²_{t|S}(x)    (12)

is submodular [17] for any time t, and any selected points (x_1, . . . , x_t) and query point x.

This assumption has been used in several previous works based on Gaussian processes, and sufficient conditions for its validity can be found in [18, Sec. 8]. We now state the following general guarantee.

Theorem 3.1. Fix ε > 0 and δ ∈ (0, 1), and suppose there exist values {C_(i)} and {β_(i)} such that

C_(i) ≥ C*( η_(i) / β_(i)^{1/2}, M^(i−1) ) log( |M^(i−1)| β_(i) / η²_(i) ) + c_max,    (13)
β_(i) ≥ 2 log( |D| (Σ_{i'≤i} C_(i'))² π² / (6 δ c²_min) ).    (14)

Then if TRUVAR is run with these choices of β_(i) until the cumulative cost reaches

C_ε = Σ_{i : 4(1+δ)η_(i−1) > ε} C_(i),    (15)

then with probability at least 1 − δ, we have ε-accuracy.

While this theorem is somewhat abstract, it captures the fact that the algorithm improves when points having a lower cost and/or lower noise are available, since both of these lead to a smaller value of C*(ξ, M); the former by directly incurring a smaller cost, and the latter by shrinking the variance more rapidly. 
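Submodularity of the variance reduction (Assumption 3.1) is also what makes the quantity C*(ξ, M) in (11) easy to approximate numerically: a cost-weighted greedy selection enjoys the usual guarantees for submodular cover. The following is a minimal sketch under these assumptions (all helper names are illustrative, not from the paper), and it assumes ξ is achievable given the noise floor so the loop terminates:

```python
import numpy as np

def post_sd(K, S, noise_var):
    """Posterior standard deviation at all points after observing indices S (a list)."""
    if not S:
        return np.sqrt(np.diag(K))
    Kss = K[np.ix_(S, S)] + np.diag(noise_var[S])
    Kxs = K[:, S]
    var = np.diag(K) - np.einsum('ij,ij->i', Kxs @ np.linalg.inv(Kss), Kxs)
    return np.sqrt(np.clip(var, 0.0, None))

def greedy_cover_cost(K, M, noise_var, cost, xi):
    """Greedily approximate C*(xi, M): repeatedly pick the point with the best
    variance-reduction-per-cost until max_{x in M} posterior sd <= xi."""
    S, total = [], 0.0
    while post_sd(K, S, noise_var)[M].max() > xi:
        base = post_sd(K, S, noise_var)[M] ** 2
        gains = [(base - post_sd(K, S + [x], noise_var)[M] ** 2).sum() / cost[x]
                 for x in range(K.shape[0])]
        x_best = int(np.argmax(gains))
        S.append(x_best)
        total += cost[x_best]
    return total, S
```

This mirrors the theorem's message: cheaper points lower the accumulated cost directly, while lower-noise points shrink the posterior variance in fewer selections.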
Below, we apply this result to some important cases.

3.2 Results for Specific Settings

Homoscedastic and unit-cost setting: Define the maximum mutual information [3]

γ_T = max_{x_1,...,x_T} (1/2) log det( I_T + σ⁻² K_T ),    (16)

and consider the case that σ²(x) = σ² and c(x) = 1. In the supplementary material, we provide a theorem with a condition for ε-accuracy of the form T ≥ Ω*( C₁ γ_T β_T / ε² + 1 ) with C₁ = 1/log(1 + σ⁻²), thus matching [2, 4] up to logarithmic factors. In the following, we present a refined version that has a significantly better dependence on the noise level, thus exemplifying that a more careful analysis of (13) can provide improvements over the standard bounding techniques.

Corollary 3.1. Fix ε > 0 and δ ∈ (0, 1), define β_T = 2 log( |D| T² π² / (6δ) ), and set η_(1) = 1 and r = 1/2. There exist choices of β_(i) (not depending on the time horizon T) such that we have ε-accuracy with probability at least 1 − δ once the following condition holds:

T ≥ ( 32(1+δ)² σ² γ_T β_T / ε² ) log( 16(1+δ)² |D| γ_T / ε² ) + C₁ γ_T β_T log( 96(1+δ)² / ε² ) + 2⌈ log₂( 6(1+δ)² / ε ) ⌉,    (17)

where C₁ = 1/log(1 + σ⁻²). This condition is of the form T ≥ Ω*( σ² γ_T β_T / ε² + C₁ γ_T β_T + 1 ).

The choices η_(1) = 1 and r = 1/2 are made for mathematical convenience, and a similar result follows for any other choices η_(1) > 0 and r ∈ (0, 1), possibly with different constant factors.

As σ² → ∞ (i.e., high noise), both of the above-mentioned bounds have noise dependence O*(σ²), since log(1 + α⁻¹) = O(α⁻¹) as α → ∞. On the other hand, as σ² → 
0 (i.e., low noise), C₁ is logarithmic, and Corollary 3.1 is significantly better provided that ε ≪ σ.

Choosing the noise and cost: Here we consider the setting that there is a domain of points D₀ that the reward function depends on, and alongside each point we can choose a noise variance σ²(k) (k = 1, . . . , K). Hence, D = D₀ × {1, · · · , K}. Lower noise variances incur a higher cost according to a cost function c(k).

Corollary 3.2. For each k = 1, · · · , K, let T*(k) denote the smallest value of T such that (17) holds with σ²(k) in place of σ², and with β_T = 2 log( |D| T² c²_max π² / (6 δ c²_min) ). Then, under the preceding setting, there exist choices of β_(i) (not depending on T) such that we have ε-accuracy with probability at least 1 − δ once the cumulative cost reaches min_k c(k) T*(k).

This result roughly states that we obtain a bound as good as that obtained by sticking to any fixed choice of noise level. In other words, every choice of noise (and corresponding cost) corresponds to a different version of a BO or LSE algorithm (e.g., [2, 4]), and our algorithm has a similar performance guarantee to the best among all of those. This is potentially useful in avoiding the need for running an algorithm once per noise level and then choosing the best-performing one. Moreover, we found numerically that beyond matching the best fixed noise strategy, we can strictly improve over it by mixing the noise levels; see Section 4.

4 Experimental Results

We evaluate our algorithm in both the level-set estimation and Bayesian optimization settings.

Parameter choices: As with previous GP-based algorithms that use confidence bounds, our theoretical choice of β_(i) in TRUVAR is typically overly conservative. 
Therefore, instead of using (14) directly, we use a more aggressive variant with similar dependence on the domain size and time: β_(i) = a log( |D| t²_(i) ), where t_(i) is the time at which the epoch starts, and a is a constant. Instead of the choice a = 2 dictated by (14), we set a = 0.5 for BO to avoid over-exploration. We found exploration to be slightly more beneficial for LSE, and hence set a = 1 for this setting. We found TRUVAR to be quite robust with respect to the choices of the remaining parameters, and simply set η_(1) = 1, r = 0.1, and δ = 0 in all experiments; while our theory assumes δ > 0, in practice there is negligible difference between choosing zero and a small positive value.

Level-set estimation: For the LSE experiments, we use a common classification rule in all algorithms, classifying the points according to the posterior mean as Ĥ_t = { x : μ_t(x) ≥ h } and L̂_t = { x : μ_t(x) < h }. The classification accuracy is measured by the F1-score (i.e., the harmonic mean of precision and recall) with respect to the true super- and sub-level sets.

We compare TRUVAR against the GP-based LSE algorithm [2], which we name via the authors' surnames as GCHK, as well as the state-of-the-art straddle (STR) heuristic [7] and the maximum variance rule (VAR) [2]. Descriptions can be found in the supplementary material. GCHK includes an exploration constant β_t, and we follow the recommendation in [2] of setting β_t^{1/2} = 3.

Lake data (unit cost): We begin with a data set from the domain of environmental monitoring of inland waters, consisting of 2024 in situ measurements of chlorophyll concentration within a vertical transect plane, collected by an autonomous surface vessel in Lake Zürich [19]. As in [2], our goal is to detect regions of high concentration. 
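The plug-in classification rule and its F1-score described above can be computed as follows (a short sketch with hypothetical array names; `mu` is the posterior mean and `f` the ground-truth function values on the grid):

```python
import numpy as np

def lse_f1(mu, f, h):
    """F1-score of the plug-in classification H_hat = {x : mu(x) >= h} against
    the true super-level set {x : f(x) >= h} (harmonic mean of precision/recall)."""
    pred, true = (mu >= h), (f >= h)
    tp = np.sum(pred & true)          # correctly classified super-level points
    if tp == 0:
        return 0.0
    precision = tp / pred.sum()
    recall = tp / true.sum()
    return 2 * precision * recall / (precision + recall)
```

For instance, if three of four points are predicted above the threshold but only two of those truly are, precision is 2/3, recall is 1, and the F1-score is 0.8.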
We evaluate each algorithm on a 50 × 50 grid of points, with the corresponding values coming from the GP posterior that was derived using the original data (see Figure 2d). We use the Matérn-5/2 ARD kernel, setting its hyperparameters by maximizing the likelihood on the second (smaller) available dataset. The level-set threshold h is set to 1.5.

In Figure 2a, we show the performance of the algorithms averaged over 100 different runs; here the randomness is only with respect to the starting point, as we are in the noiseless setting. We observe that in this unit-cost case, TRUVAR performs similarly to GCHK and STR. All three methods outperform VAR, which is good for global exploration but less suited to level-set estimation.

Figure 2: Experimental results for level-set estimation. [Panels: (a) lake data, unit-cost; (b) lake data, varying cost; (c) synthetic data, varying noise (F1-score versus time or cost, comparing TRUVAR, GCHK, STR, and VAR); (d) inferred concentration function; (e) points chosen by GCHK; (f) points chosen by TRUVAR.]

Figure 3: Experimental results for Bayesian optimization. [Panels: (a) synthetic, median regret; (b) synthetic, outlier-adjusted mean regret; (c) SVM data, validation error (comparing TRUVAR, EI, GP-UCB, ES, and MRS).]

Lake data (varying cost): Next, we modify the above setting by introducing pointwise costs that are a function of the previous sampled point x', namely, c_{x'}(x) = 0.25|x₁ − x'₁| + 4(|x₂| + 1), where x₁ is the vessel position and x₂ is the depth. Although we did not permit such a dependence on x' in our original setup, the algorithm itself remains unchanged. Our choice of cost penalizes the distance traveled |x₁ − x'₁|, as well as the depth of the measurement |x₂|. Since incorporating costs into existing algorithms is non-trivial, we only compare against the original version of GCHK that ignores costs.

In Figure 2b, we see that TRUVAR significantly outperforms GCHK, achieving a higher F1 score for a significantly smaller cost. The intuition behind this can be seen in Figures 2e and 2f, where we show the points sampled by TRUVAR and GCHK in one experiment run, connecting all pairs of consecutive points. GCHK is designed to pick few points, but since it ignores costs, the distance traveled is large. 
In contrast, by incorporating costs, TRUVAR tends to travel small distances, often even staying in the same x₁ location to take measurements at multiple depths x₂.

Synthetic data with multiple noise levels: In this experiment, we demonstrate Corollary 3.2 by considering the setting in which the algorithm can choose the sampling noise variance and incur the associated cost. We use a synthetic function sampled from a GP on a 50 × 50 grid with an isotropic squared exponential kernel having length scale l = 0.1 and unit variance, and set h = 2.25. We use three different noise levels, σ² ∈ {10⁻⁶, 10⁻³, 0.05}, with corresponding costs {15, 10, 2}.

We run GCHK separately for each of the three noise levels, while running TRUVAR as normal and allowing it to mix between the noise levels. The resulting F1-scores are shown in Figure 2c. The best-performing version of GCHK changes throughout the time horizon, while TRUVAR is consistently better than all three. A discussion on how TRUVAR mixes between the noise levels can be found in the supplementary material.

Bayesian optimization. We now provide the results of two experiments for the BO setting.

Synthetic data: We first conduct a similar experiment as that in [8, 11], generating 200 different test functions defined on [0, 1]². To generate a single test function, 200 points are chosen uniformly at random from [0, 1]², their function values are generated from a GP using an isotropic squared exponential kernel with length scale l = 0.1 and unit variance, and the resulting posterior mean forms the function on the whole domain [0, 1]². We subsequently assume that samples of this function are corrupted by Gaussian noise with σ² = 10⁻⁶. The extension of TRUVAR to continuous domains is straightforward, and is explained in the supplementary material. 
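The test-function construction just described can be sketched as follows; this is a NumPy-only illustration in which a finite evaluation `grid` stands in for the continuous domain, and the function and parameter names are ours rather than the paper's:

```python
import numpy as np

def se_kernel(X, Y, l=0.1):
    """Isotropic squared-exponential kernel with unit variance."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * l ** 2))

def random_test_function(grid, n_anchor=200, l=0.1, jitter=1e-6, seed=0):
    """Draw GP-prior values at random anchor points in [0, 1]^2, then extend to
    the whole grid via the posterior mean, as in the experiment description."""
    rng = np.random.default_rng(seed)
    anchors = rng.uniform(0.0, 1.0, size=(n_anchor, 2))
    Kaa = se_kernel(anchors, anchors, l) + jitter * np.eye(n_anchor)
    f_anchor = rng.multivariate_normal(np.zeros(n_anchor), Kaa)
    Kga = se_kernel(grid, anchors, l)
    return Kga @ np.linalg.solve(Kaa, f_anchor)  # posterior mean on the grid
```

The small `jitter` term plays the role of the near-noiseless observation variance and keeps the anchor covariance numerically well-conditioned.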
For all algorithms considered, we evaluate the performance according to the regret of a single reported point, namely, the one having the highest posterior mean.

We compare the performance of TRUVAR against expected improvement (EI), GP-upper confidence bound (GP-UCB), entropy search (ES) and minimum regret search (MRS), whose acquisition functions are outlined in the supplementary material. We use publicly available code for ES and MRS [20]. The exploration parameter β_t in GP-UCB is set according to the recommendation in [3] of dividing the theoretical value by five, and the parameters for ES and MRS are set according to the recommendations given in [11, Section 5.1].

Figure 3a plots the median of the regret, and Figure 3b plots the mean after removing outliers (i.e., the best and worst 5% of the runs). In the earlier rounds, ES and MRS provide the best performance, while TRUVAR improves slowly due to exploration. However, the regret of TRUVAR subsequently drops rapidly, giving the best performance in the later rounds after “zooming in” towards the maximum. GP-UCB generally performs well with the aggressive choice of β_t, despite previous works' experiments revealing it to perform poorly with the theoretical value.

Hyperparameter tuning data: In this experiment, we use the SVM on grid dataset, previously used in [21]. A 25 × 14 × 4 grid of hyperparameter configurations resulting in 1400 data points was pre-evaluated, forming the search space. The goal is to find a configuration with small validation error. We use a Matérn-5/2 ARD kernel, and re-learn its hyperparameters by maximizing the likelihood after sampling every 3 points. Since the hyperparameters are not fixed in advance, we replace M_{t−1} by D in (5) to avoid incorrectly ruling points out early on, allowing some removed points to be added again in later steps. 
Once the estimated hyperparameters stop varying significantly, the size of the set of potential maximizers decreases almost monotonically. Since we consider the noiseless setting here, we measure performance using the simple regret, i.e., the regret of the best point found so far.

We again average over 100 random starting points, and plot the resulting validation error in Figure 3c. Even in this noiseless and unit-cost setting that EI and GP-UCB are suited to, we find that TRUVAR performs slightly better, giving a better validation error with smaller error bars.

5 Conclusion

We highlight the following aspects in which TRUVAR is versatile:
• Unified optimization and level-set estimation: These are typically treated separately, whereas TRUVAR and its theoretical guarantees are essentially identical in both cases.
• Actions with costs: TRUVAR naturally favors cost-effective points, as this is directly incorporated into the acquisition function.
• Heteroscedastic noise: TRUVAR chooses points that effectively shrink the variance of other points, thus directly taking advantage of situations in which some points are noisier than others.
• Choosing the noise level: We provided novel theoretical guarantees for the case that the algorithm can choose both a point and a noise level, cf., Corollary 3.2.

Hence, TRUVAR directly handles several important aspects that are non-trivial to incorporate into myopic algorithms. Moreover, compared to other BO algorithms that perform a lookahead (e.g., ES and MRS), TRUVAR avoids the computationally expensive task of averaging over the posterior and/or measurements, and comes with rigorous theoretical guarantees.

Acknowledgment: This work was supported in part by the European Commission under Grant ERC Future Proof, SNF Sinergia project CRSII2-147633, SNF 200021-146750, and EPFL Fellows Horizon2020 grant 665667.

References
[1] B. Shahriari, K. Swersky, Z. Wang, R. P. 
Adams, and N. de Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proc. IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[2] A. Gotovos, N. Casati, G. Hitz, and A. Krause, “Active learning for level set estimation,” in Int. Joint Conf. Art. Intel., 2013.
[3] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Information-theoretic regret bounds for Gaussian process optimization in the bandit setting,” IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3250–3265, May 2012.
[4] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis, Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2013, ch. Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration, pp. 225–240.
[5] Z. Wang, B. Shakibi, L. Jin, and N. de Freitas, “Bayesian multi-scale optimistic optimization,” http://arxiv.org/abs/1402.7005.
[6] K. Swersky, J. Snoek, and R. P. Adams, “Multi-task Bayesian optimization,” in Adv. Neur. Inf. Proc. Sys. (NIPS), 2013, pp. 2004–2012.
[7] B. Bryan and J. G. Schneider, “Actively learning level-sets of composite functions,” in Int. Conf. Mach. Learn. (ICML), 2008.
[8] P. Hennig and C. J. Schuler, “Entropy search for information-efficient global optimization,” J. Mach. Learn. Research, vol. 13, no. 1, pp. 1809–1837, 2012.
[9] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani, “Predictive entropy search for efficient global optimization of black-box functions,” in Adv. Neur. Inf. Proc. Sys. (NIPS), 2014, pp. 918–926.
[10] P. W. Goldberg, C. K. Williams, and C. M. Bishop, “Regression with input-dependent noise: A Gaussian process treatment,” Adv. Neur. Inf. Proc. Sys. (NIPS), vol. 10, pp. 493–499, 1997.
[11] J. H. Metzen, “Minimum regret search for single- and multi-task optimization,” in Int.
Conf. Mach. Learn. (ICML), 2016.
[12] S. Bubeck and N. Cesa-Bianchi, Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems, ser. Found. Trend. Mach. Learn. Now Publishers, 2012.
[13] K. Jamieson and R. Nowak, “Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting,” in Ann. Conf. Inf. Sci. Sys. (CISS), 2014, pp. 1–6.
[14] O. Madani, D. J. Lizotte, and R. Greiner, “The budgeted multi-armed bandit problem,” in Learning Theory. Springer, 2004, pp. 643–645.
[15] R. Kleinberg, A. Slivkins, and E. Upfal, “Multi-armed bandits in metric spaces,” in Proc. ACM Symp. Theory Comp., 2008.
[16] C. E. Rasmussen, “Gaussian processes for machine learning.” MIT Press, 2006.
[17] A. Krause and D. Golovin, “Submodular function maximization,” Tractability: Practical Approaches to Hard Problems, vol. 3, 2012.
[18] A. Das and D. Kempe, “Algorithms for subset selection in linear regression,” in Proc. ACM Symp. Theory Comp. (STOC). ACM, 2008, pp. 45–54.
[19] G. Hitz, F. Pomerleau, M.-E. Garneau, E. Pradalier, T. Posch, J. Pernthaler, and R. Y. Siegwart, “Autonomous inland water monitoring: Design and application of a surface vessel,” IEEE Robot. Autom. Magazine, vol. 19, no. 1, pp. 62–72, 2012.
[20] http://github.com/jmetzen/bayesian_optimization (accessed 19/05/2016).
[21] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Adv. Neur. Inf. Proc. Sys., 2012.
[22] K. Swersky, J. Snoek, and R. P. Adams, “Freeze-thaw Bayesian optimization,” 2014, http://arxiv.org/abs/1406.3896.
[23] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, “Lipschitzian optimization without the Lipschitz constant,” J. Opt. Theory Apps., vol. 79, no. 1, pp. 157–181, 1993.
[24] A. Krause and C.
Guestrin, “A note on the budgeted maximization of submodular functions,” 2005, Technical Report.