{"title": "Bayesian Optimization with Exponential Convergence", "book": "Advances in Neural Information Processing Systems", "page_first": 2809, "page_last": 2817, "abstract": "This paper presents a Bayesian optimization method with exponential convergence without the need of auxiliary optimization and without the delta-cover sampling. Most Bayesian optimization methods require auxiliary optimization: an additional non-convex global optimization problem, which can be time-consuming and hard to implement in practice. Also, the existing Bayesian optimization method with exponential convergence requires access to the delta-cover sampling, which was considered to be impractical. Our approach eliminates both requirements and achieves an exponential convergence rate.", "full_text": "Bayesian Optimization with Exponential Convergence

Kenji Kawaguchi (MIT, Cambridge, MA 02139, kawaguch@mit.edu)
Leslie Pack Kaelbling (MIT, Cambridge, MA 02139, lpk@csail.mit.edu)
Tomás Lozano-Pérez (MIT, Cambridge, MA 02139, tlp@csail.mit.edu)

Abstract

This paper presents a Bayesian optimization method with exponential convergence without the need of auxiliary optimization and without the δ-cover sampling. Most Bayesian optimization methods require auxiliary optimization: an additional non-convex global optimization problem, which can be time-consuming and hard to implement in practice. Also, the existing Bayesian optimization method with exponential convergence [1] requires access to the δ-cover sampling, which was considered to be impractical [1, 2]. Our approach eliminates both requirements and achieves an exponential convergence rate.

1 Introduction

We consider a general global optimization problem: maximize f(x) subject to x ∈ Ω ⊂ R^D, where f : Ω → R is a non-convex black-box deterministic function.
Such a problem arises in many real-world applications, such as parameter tuning in machine learning [3], engineering design problems [4], and model parameter fitting in biology [5]. For this problem, one performance measure of an algorithm is the simple regret, r_n, which is given by r_n = sup_{x∈Ω} f(x) − f(x+), where x+ is the best input vector found by the algorithm. For brevity, we use the term "regret" to mean simple regret.

The general global optimization problem is known to be intractable if we make no further assumptions [6]. The simplest additional assumption to restore tractability is to assume the existence of a bound on the slope of f. A well-known variant of this assumption is Lipschitz continuity with a known Lipschitz constant, and many algorithms have been proposed in this setting [7, 8, 9]. These algorithms successfully guaranteed certain bounds on the regret. However appealing from a theoretical point of view, a practical concern was soon raised regarding the assumption that a tight Lipschitz constant is known. Some researchers relaxed this somewhat strong assumption by proposing procedures to estimate a Lipschitz constant during the optimization process [10, 11, 12].

Bayesian optimization is an efficient way to relax this assumption of complete knowledge of the Lipschitz constant, and has become a well-recognized method for solving global optimization problems with non-convex black-box functions. In the machine learning community, Bayesian optimization—especially by means of a Gaussian process (GP)—is an active research area [13, 14, 15]. With the requirement of access to the δ-cover sampling procedure (it samples the function uniformly such that the density of samples doubles in the feasible regions at each iteration), de Freitas et al. [1] recently proposed a theoretical procedure that maintains an exponential convergence rate (exponential regret).
However, as pointed out by Wang et al. [2], one remaining problem is to derive a GP-based optimization method with an exponential convergence rate without the δ-cover sampling procedure, which is computationally too demanding in many cases.

In this paper, we propose a novel GP-based global optimization algorithm, which maintains an exponential convergence rate and converges rapidly without the δ-cover sampling procedure.

2 Gaussian Process Optimization

In Gaussian process optimization, we estimate the distribution over the function f and use this information to decide which point of f should be evaluated next. In a parametric approach, we consider a parameterized function f(x; θ), with θ being distributed according to some prior. In contrast, the nonparametric GP approach directly puts the GP prior over f as f(·) ∼ GP(m(·), κ(·,·)), where m(·) is the mean function and κ(·,·) is the covariance function, or kernel. That is, m(x) = E[f(x)] and κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]. For a finite set of points, the GP model is simply a joint Gaussian: f(x_{1:N}) ∼ N(m(x_{1:N}), K), where K_{i,j} = κ(x_i, x_j) and N is the number of data points. To predict the value of f at a new data point, we first consider the joint distribution over f of the old data points and the new data point:

    [f(x_{1:N}); f(x_{N+1})] ∼ N([m(x_{1:N}); m(x_{N+1})], [K, k; kᵀ, κ(x_{N+1}, x_{N+1})])

where k = κ(x_{1:N}, x_{N+1}) ∈ R^{N×1}.
Then, after factorizing the joint distribution using the Schur complement for the joint Gaussian, we obtain the conditional distribution, conditioned on the observed entities D_N := {x_{1:N}, f(x_{1:N})} and x_{N+1}, as:

    f(x_{N+1}) | D_N, x_{N+1} ∼ N(μ(x_{N+1}|D_N), σ²(x_{N+1}|D_N))

where μ(x_{N+1}|D_N) = m(x_{N+1}) + kᵀK⁻¹(f(x_{1:N}) − m(x_{1:N})) and σ²(x_{N+1}|D_N) = κ(x_{N+1}, x_{N+1}) − kᵀK⁻¹k. One advantage of GP is that this closed-form solution simplifies both its analysis and implementation.

To use a GP, we must specify the mean function and the covariance function. The mean function is usually set to be zero. With this zero mean function, the conditional mean μ(x_{N+1}|D_N) can still be flexibly specified by the covariance function, as shown in the above equation for μ. For the covariance function, there are several common choices, including the Matérn kernel and the Gaussian kernel. For example, the Gaussian kernel is defined as κ(x, x′) = exp(−(1/2)(x − x′)ᵀΣ⁻¹(x − x′)), where Σ⁻¹ is the kernel parameter matrix. The kernel parameters, or hyperparameters, can be estimated by empirical Bayesian methods [16]; see [17] for more information about GPs.

The flexibility and simplicity of the GP prior make it a common choice for continuous objective functions in the Bayesian optimization literature. Bayesian optimization with GP selects the next query point that optimizes the acquisition function generated by the GP. Commonly used acquisition functions include the upper confidence bound (UCB) and expected improvement (EI). For brevity, we consider Bayesian optimization with UCB, which works as follows. At each iteration, the UCB function U is maintained as U(x|D_N) = μ(x|D_N) + ςσ(x|D_N), where ς ∈ R is a parameter of the algorithm.
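The closed-form posterior and the UCB above are straightforward to implement. The following minimal numpy sketch illustrates them with a zero prior mean; the jitter term and the value ς = 2 are illustrative choices for this sketch, not values taken from the paper:

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma_inv):
    """Gaussian kernel exp(-1/2 (x - x')^T Sigma^{-1} (x - x'))."""
    d = x - x_prime
    return np.exp(-0.5 * d @ sigma_inv @ d)

def gp_posterior(X, f_X, x_new, sigma_inv):
    """Closed-form GP posterior mean and variance with zero prior mean:
    mu = k^T K^{-1} f,  var = kappa(x, x) - k^T K^{-1} k."""
    N = X.shape[0]
    K = np.array([[gaussian_kernel(X[i], X[j], sigma_inv) for j in range(N)]
                  for i in range(N)])
    k = np.array([gaussian_kernel(X[i], x_new, sigma_inv) for i in range(N)])
    K_inv = np.linalg.inv(K + 1e-10 * np.eye(N))  # small jitter for stability
    mu = k @ K_inv @ f_X
    var = gaussian_kernel(x_new, x_new, sigma_inv) - k @ K_inv @ k
    return mu, max(var, 0.0)

def ucb(X, f_X, x_new, sigma_inv, varsigma=2.0):
    """U(x|D) = mu(x|D) + varsigma * sigma(x|D)."""
    mu, var = gp_posterior(X, f_X, x_new, sigma_inv)
    return mu + varsigma * np.sqrt(var)
```

At an already observed point the noise-free posterior interpolates the data (mean equals the observation, variance collapses to zero), which is the property the algorithm later exploits when it substitutes UCB values for unevaluated centers.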
To find the next query x_{N+1} for the objective function f, GP-UCB solves an additional non-convex optimization problem with U as x_{N+1} = arg max_x U(x|D_N). This is often carried out by other global optimization methods such as DIRECT and CMA-ES. The justification for introducing a new optimization problem lies in the assumption that the cost of evaluating the objective function f dominates that of solving the additional optimization problem.

For deterministic functions, de Freitas et al. [1] recently presented a theoretical procedure that maintains an exponential convergence rate. However, their own paper and the follow-up research [1, 2] point out that this result relies on an impractical sampling procedure, the δ-cover sampling. To overcome this issue, Wang et al. [2] combined GP-UCB with a hierarchical partitioning optimization method, the SOO algorithm [18], providing a regret bound with polynomial dependence on the number of function evaluations. They concluded that creating a GP-based algorithm with an exponential convergence rate without the impractical sampling procedure remained an open problem.

3 Infinite-Metric GP Optimization

3.1 Overview

The GP-UCB algorithm can be seen as a member of the class of bound-based search methods, which includes Lipschitz optimization, A* search, and PAC-MDP algorithms with optimism in the face of uncertainty. Bound-based search methods have a common property: the tightness of the bound determines their effectiveness. The tighter the bound is, the better the performance becomes.

However, it is often difficult to obtain a tight bound while maintaining correctness. For example, in A* search, admissible heuristics maintain the correctness of the bound, but the estimated bound with admissibility is often too loose in practice, resulting in a long period of global search.

The GP-UCB algorithm has the same problem.
The bound in GP-UCB is represented by the UCB, which has the following property: f(x) ≤ U(x|D) with some probability. We formalize this property in the analysis of our algorithm. The problem is essentially due to the difficulty of obtaining a tight bound U(x|D) such that f(x) ≤ U(x|D) and f(x) ≈ U(x|D) (with some probability). Our solution strategy is to first admit that the bound encoded in the GP prior may not be tight enough to be useful by itself. Instead of relying on a single bound given by the GP, we leverage the existence of an unknown bound encoded in the continuity at a global optimizer.

Assumption 1. (Unknown Bound) There exists a global optimizer x* and an unknown semi-metric ℓ such that for all x ∈ Ω, f(x*) ≤ f(x) + ℓ(x, x*) and ℓ(x, x*) < ∞.

In other words, we do not expect the known upper bound due to the GP to be tight, but instead expect that there exists some unknown bound that might be tighter. Notice that in the case where the bound by the GP is as tight as the unknown bound by the semi-metric ℓ in Assumption 1, our method still maintains an exponential convergence rate and an advantage over GP-UCB (no need for auxiliary optimization). Our method is expected to become relatively much better when the known bound due to the GP is less tight compared to the unknown bound by ℓ.

As the semi-metric ℓ is unknown, there are infinitely many possible candidates that we can think of for ℓ. Accordingly, we simultaneously conduct global and local searches based on all the candidates of the bounds. The bound estimated by the GP is used to reduce the number of candidates. Since the bound estimated by the GP is known, we can ignore the candidates of the bounds that are looser than the bound estimated by the GP.
The source code of the proposed algorithm is publicly available at http://lis.csail.mit.edu/code/imgpo.html.

3.2 Description of Algorithm

Figure 1 illustrates how the algorithm works with a simple 1-dimensional objective function. We employ hierarchical partitioning to maintain hyperintervals, as illustrated by the line segments in the figure. We consider a hyperrectangle as our hyperinterval, with its center being the evaluation point of f (blue points in each line segment in Figure 1). For each iteration t, the algorithm performs the following procedure for each interval size:

(i) Select the interval with the maximum center value among the intervals of the same size.
(ii) Keep the interval selected by (i) if it has a center value greater than that of any larger interval.
(iii) Keep the interval accepted by (ii) if it contains a UCB greater than the center value of any smaller interval.
(iv) If an interval is accepted by (iii), divide it along the longest coordinate into three new intervals.
(v) For each new interval, if the UCB of the evaluation point is less than the best function value found so far, skip the evaluation and use the UCB value as the center value until the interval is accepted in step (ii) in some future iteration; otherwise, evaluate the center value.
(vi) Repeat steps (i)-(v) until every interval size has been considered.

Then, at the end of each iteration, the algorithm updates the GP hyperparameters. Here, the purpose of steps (i)-(iii) is to select an interval that might contain the global optimizer. Steps (i) and (ii) select the possible intervals based on the unknown bound by ℓ, while step (iii) does so based on the bound by the GP.

We now explain the procedure using the example in Figure 1. Let n be the number of divisions of intervals and let N be the number of function evaluations; t is the number of iterations.
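The division rule in step (iv) — trisect a hyperrectangle along its longest coordinate — can be sketched as follows. This is a minimal illustration under the assumption of axis-aligned hyperrectangles, not the paper's released implementation:

```python
import numpy as np

def trisect(center, widths):
    """Divide a hyperrectangle (given by its center and per-coordinate side
    widths) along its longest coordinate into three equal children.
    Returns the three child centers and the children's widths."""
    center = np.asarray(center, dtype=float)
    widths = np.asarray(widths, dtype=float)
    d = int(np.argmax(widths))        # index of the longest coordinate
    new_widths = widths.copy()
    new_widths[d] /= 3.0              # each child is one third as wide along d
    offset = np.zeros_like(center)
    offset[d] = new_widths[d]         # shift by one child-width along d
    left, mid, right = center - offset, center.copy(), center + offset
    return [left, mid, right], new_widths
```

Note that the middle child keeps the parent's center, so its function value can be reused without a new evaluation, which is how the algorithm avoids re-querying f at already evaluated centers.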
Initially, there is only one interval (the center of the input region Ω ⊂ R) and thus this interval is divided, resulting in the first diagram of Figure 1. At the beginning of iteration t = 2, step (i) selects the third interval from the left side in the first diagram (t = 1, n = 2), as its center value is the maximum. Because there are no intervals of different size at this point, steps (ii) and (iii) are skipped. Step (iv) divides the third interval, and then the GP hyperparameters are updated, resulting in the second diagram (t = 2, n = 3).

Figure 1: An illustration of IMGPO: t is the number of iterations, n is the number of divisions (or splits), and N is the number of function evaluations.

At the beginning of iteration t = 3, the algorithm starts conducting steps (i)-(v) for the largest intervals. Step (i) selects the second interval from the left side and step (ii) is skipped. Step (iii) accepts the second interval, because the UCB within this interval is no less than the center value of the smaller intervals, resulting in the third diagram (t = 3, n = 4). Iteration t = 3 continues by conducting steps (i)-(v) for the smaller intervals. Step (i) selects the second interval from the left side, step (ii) accepts it, and step (iii) is skipped, resulting in the fourth diagram (t = 3, n = 4). The effect of step (v) can be seen in the diagrams for iteration t = 9. At n = 16, the far right interval is divided, but no function evaluation occurs. Instead, UCB values given by the GP are placed in the new intervals indicated by the red asterisks. One of the temporary dummy values is resolved at n = 17 when the interval is queried for division, as shown by the green asterisk. The effect of step (iii) for the rejection case is illustrated in the last diagram for iteration t = 10. At n = 18, t is increased to 10 from 9, meaning that the largest intervals are first considered for division.
However, the three largest intervals are all rejected in step (iii), resulting in the division of a very small interval near the global optimum at n = 18.

3.3 Technical Detail of Algorithm

We define h to be the depth of the hierarchical partitioning tree, and c_{h,i} to be the center point of the i-th hyperrectangle at depth h. N_gp is the number of the GP evaluations. Define depth(T) to be the largest integer h such that the set T_h is not empty. To compute the UCB U, we use ς_M = sqrt(2 log(π²M²/(12η))), where M is the number of calls made so far to U (i.e., each time we use U, we increment M by one). This particular form of ς_M is used to maintain the property f(x) ≤ U(x|D) during an execution of our algorithm with probability at least 1 − η. Here, η is a parameter of IMGPO. Ξ_max is another parameter, but it is only used to limit the possibly long computation of step (iii) (in the worst case, step (iii) computes UCBs 3^{Ξ_max} times, although this would rarely happen).

The pseudocode is shown in Algorithm 1. Lines 8 to 23 correspond to steps (i)-(iii). These lines compute the index i*_h of the candidate rectangle that may contain a global optimizer for each depth h. For each depth h, a non-null index i*_h at line 24 indicates the remaining candidate rectangle that we want to divide. Lines 24 to 33 correspond to steps (iv)-(v), where the remaining candidate rectangles for all h are divided. To provide a simple executable division scheme (line 29), we assume Ω to be a hyperrectangle (see the last paragraph of Section 4 for a general case). Lines 8 to 17 correspond to steps (i)-(ii). Specifically, line 10 implements step (i), where a single candidate is selected for each depth, and lines 11 to 12 conduct step (ii), where some candidates are screened out. Lines 13 to 17 resolve the temporary dummy values computed by the GP.
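The confidence-width schedule ς_M above can be computed directly; a small sketch:

```python
import math

def varsigma(M, eta):
    """Confidence width for the M-th UCB call:
    sqrt(2 log(pi^2 M^2 / (12 eta))).
    It grows slowly with M so that f(x) <= U(x|D) holds across all UCB
    calls jointly with probability at least 1 - eta."""
    return math.sqrt(2.0 * math.log(math.pi**2 * M**2 / (12.0 * eta)))
```

The width grows like sqrt(log M), so later UCB calls are only mildly more conservative than earlier ones.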
Lines 18 to 23 correspond to step (iii), where the candidates are further screened out. At line 21, T′_{h+ξ}(c_{h,i*_h}) indicates the set of all center points of a fully expanded tree until depth h + ξ within the region covered by the hyperrectangle centered at c_{h,i*_h}. In other words, T′_{h+ξ}(c_{h,i*_h}) contains the nodes of the fully expanded tree rooted at c_{h,i*_h} with depth ξ, and can be computed by dividing the current rectangle at c_{h,i*_h} and recursively dividing all the resulting new rectangles until depth ξ (i.e., depth ξ from c_{h,i*_h}, which is depth h + ξ in the whole tree).

Algorithm 1 Infinite-Metric GP Optimization (IMGPO)
Input: an objective function f, the search domain Ω, the GP kernel κ, Ξ_max ∈ N+ and η ∈ (0, 1)
1:  Initialize the set T_h = {∅} for all h ≥ 0
2:  Set c_{0,0} to be the center point of Ω and T_0 ← {c_{0,0}}
3:  Evaluate f at c_{0,0}: g(c_{0,0}) ← f(c_{0,0})
4:  f+ ← g(c_{0,0}), D ← {(c_{0,0}, g(c_{0,0}))}
5:  n, N ← 1, N_gp ← 0, Ξ ← 1
6:  for t = 1, 2, 3, ... do
7:    υ_max ← −∞
8:    for h = 0 to depth(T) do                      # for-loop for steps (i)-(ii)
9:      while true do
10:       i*_h ← arg max_{i: c_{h,i} ∈ T_h} g(c_{h,i})
11:       if g(c_{h,i*_h}) < υ_max then
12:         i*_h ← ∅, break
13:       else if g(c_{h,i*_h}) is not labeled as GP-based then
14:         υ_max ← g(c_{h,i*_h}), break
15:       else
16:         g(c_{h,i*_h}) ← f(c_{h,i*_h}) and remove the GP-based label from g(c_{h,i*_h})
17:         N ← N + 1, N_gp ← N_gp − 1, D ← {D, (c_{h,i*_h}, g(c_{h,i*_h}))}
18:   for h = 0 to depth(T) do                      # for-loop for step (iii)
19:     if i*_h ≠ ∅ then
20:       ξ ← the smallest positive integer s.t. i*_{h+ξ} ≠ ∅ and ξ ≤ min(Ξ, Ξ_max) if it exists, and 0 otherwise
21:       z(h, i*_h) = max_{k: c_{h+ξ,k} ∈ T′_{h+ξ}(c_{h,i*_h})} U(c_{h+ξ,k}|D)
22:       if ξ ≠ 0 and z(h, i*_h) < g(c_{h+ξ,i*_{h+ξ}}) then
23:         i*_h ← ∅
24:   υ_max ← −∞
25:   for h = 0 to depth(T) do                      # for-loop for steps (iv)-(v)
26:     if i*_h ≠ ∅ and g(c_{h,i*_h}) ≥ υ_max then
27:       n ← n + 1
28:       Divide the hyperrectangle centered at c_{h,i*_h} along the longest coordinate into three new hyperrectangles with the following centers:
29:         S = {c_{h+1,i(left)}, c_{h+1,i(center)}, c_{h+1,i(right)}}
30:       T_{h+1} ← {T_{h+1}, S}
31:       g(c_{h+1,i(center)}) ← g(c_{h,i*_h}), T_h ← T_h \ c_{h,i*_h}
32:       for i_new = {i(left), i(right)} do
33:         if U(c_{h+1,i_new}|D) ≥ f+ then
34:           g(c_{h+1,i_new}) ← f(c_{h+1,i_new}), D ← {D, (c_{h+1,i_new}, g(c_{h+1,i_new}))}
35:           N ← N + 1, f+ ← max(f+, g(c_{h+1,i_new})), υ_max ← max(υ_max, g(c_{h+1,i_new}))
36:         else
37:           g(c_{h+1,i_new}) ← U(c_{h+1,i_new}|D) and label g(c_{h+1,i_new}) as GP-based, N_gp ← N_gp + 1
38:   Update Ξ: if f+ was updated, Ξ ← Ξ + 2², and otherwise, Ξ ← max(Ξ − 2⁻¹, 1)
39:   Update GP hyperparameters by an empirical Bayesian method

3.4 Relationship to Previous Algorithms

The most closely related algorithm is the BaMSOO algorithm [2], which combines SOO with GP-UCB. However, it only achieves a polynomial regret bound, while IMGPO achieves an exponential regret bound.
IMGPO can achieve exponential regret because it utilizes the information encoded in the GP prior/posterior to reduce the degree of unknownness of the semi-metric ℓ.

The idea of considering a set of infinitely many bounds was first proposed by Jones et al. [19]. Their DIRECT algorithm has been successfully applied to real-world problems [4, 5], but it only maintains the consistency property (i.e., convergence in the limit) from a theoretical viewpoint. DIRECT takes an input parameter ε to balance the global and local search efforts. This idea was generalized to the case of an unknown semi-metric and strengthened with theoretical support (a finite regret bound) by Munos [18] in the SOO algorithm. By limiting the depth of the search tree with a parameter h_max, the SOO algorithm achieves a finite regret bound that depends on the near-optimality dimension.

4 Analysis

In this section, we prove an exponential convergence rate of IMGPO and theoretically discuss why the novel idea underlying IMGPO is beneficial. The proofs are provided in the supplementary material. To examine the effect of considering infinitely many possible candidates of the bounds, we introduce the following term.

Definition 1. (Infinite-metric exploration loss) The infinite-metric exploration loss ρ_t is the number of intervals to be divided during iteration t.

The infinite-metric exploration loss ρ_t can be computed as ρ_t = Σ_{h=1}^{depth(T)} 1(i*_h ≠ ∅) at line 25. It is the cost (in terms of the number of function evaluations) incurred by not committing to any particular upper bound. If we were to rely on a specific bound, ρ_t would be minimized to 1. For example, the DOO algorithm [18] has ρ_t = 1 for all t ≥ 1.
Even if we know a particular upper bound, relying on this knowledge and thus minimizing ρ_t is not a good option unless the known bound is tight enough compared to the unknown bound leveraged in our algorithm. This will be clarified in our analysis. Let ρ̄_t be the maximum of the averages of ρ_{1:t′} for t′ = 1, 2, ..., t (i.e., ρ̄_t ≡ max({(1/t′) Σ_{τ=1}^{t′} ρ_τ ; t′ = 1, 2, ..., t})).

Assumption 2. There exist L > 0, α > 0 and p ≥ 1 in R such that for all x, x′ ∈ Ω, ℓ(x′, x) ≤ L||x′ − x||_p^α.

In Theorem 1, we show that the exponential convergence rate O(λ^{N+N_gp}) with λ < 1 is achieved. We define Ξ_n ≤ Ξ_max to be the largest ξ used so far with n total node expansions. For simplicity, we assume that Ω is a square, which we satisfied in our experiments by scaling the original Ω.

Theorem 1. Assume Assumptions 1 and 2. Let β = sup_{x,x′∈Ω} (1/2)||x − x′||_∞. Let λ = 3^{−α/(2CD ρ̄_t)} < 1. Then, with probability at least 1 − η, the regret of IMGPO is bounded as

    r_N ≤ L(3βD^{1/p})^α exp(−α[(N + N_gp)/(2CD ρ̄_t) − Ξ_n − 2] ln 3) = O(λ^{N+N_gp}).

Importantly, our bound holds for the best values of the unknown L, α and p even though these values are not given. The closest result in previous work is that of BaMSOO [2], which obtained Õ(n^{−2α/(D(4−α))}) with probability 1 − η for α = {1, 2}. As can be seen, we have improved the regret bound. Additionally, in our analysis, we can see how L, p, and α affect the bound, allowing us to view the inherent difficulty of an objective function from a theoretical perspective. Here, C is a constant in N and is used in previous work [18, 2].
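Plugging numbers into the Theorem 1 bound makes its exponential decay in N + N_gp concrete; a sketch, in which all constant values (L, α, p, β, C) are hypothetical illustrations rather than quantities estimated from any real objective:

```python
import math

def imgpo_regret_bound(N, N_gp, D, rho_bar, Xi_n,
                       L=1.0, alpha=1.0, p=2.0, beta=0.5, C=1.0):
    """Evaluate the Theorem 1 bound
    r_N <= L (3 beta D^{1/p})^alpha
           * exp(-alpha [ (N + N_gp)/(2 C D rho_bar) - Xi_n - 2 ] ln 3).
    All keyword defaults are illustrative placeholders."""
    exponent = -alpha * ((N + N_gp) / (2.0 * C * D * rho_bar)
                         - Xi_n - 2.0) * math.log(3.0)
    return L * (3.0 * beta * D ** (1.0 / p)) ** alpha * math.exp(exponent)
```

Doubling N roughly squares the decay factor, while a larger exploration loss ρ̄_t slows the decay, which is exactly the trade-off discussed in the remarks below.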
For example, if we conduct 2D or 3^D − 1 function evaluations per node-expansion and if p = ∞, we have that C = 1.

We note that λ can get close to one as the input dimension D increases, which suggests that there is a remaining challenge in scalability for higher dimensionality. One strategy for addressing this problem would be to leverage additional assumptions such as those in [14, 20].

Remark 1. (The effect of the tightness of UCB by GP) If the UCB computed by GP is "useful" such that N/ρ̄_t = Ω(N), then our regret bound becomes O(exp(−((N + N_gp)/(2CD)) α ln 3)). If the bound due to UCB by GP is too loose (and thus useless), ρ̄_t can increase up to O(N/t) (due to ρ̄_t ≤ Σ_{i=1}^{t} i/t ≤ O(N/t)), resulting in the regret bound of O(exp(−(t(1 + N_gp/N)/(2CD)) α ln 3)), which can be bounded by O(exp(−((N + N_gp)/(2CD)) max(1/√N, t/N) α ln 3))¹. This is still better than the known results.

Remark 2. (The effect of GP) Without the use of GP, our regret bound would be as follows: r_N ≤ L(3βD^{1/p})^α exp(−α[N/(2CD ρ̃_t) − 2] ln 3), where ρ̄_t ≤ ρ̃_t is the infinite-metric exploration loss without GP. Therefore, the use of GP reduces the regret bound by increasing N_gp and decreasing ρ̄_t, but may potentially increase the bound by increasing Ξ_n ≤ Ξ.

¹This can be done by limiting the depth of the search tree as depth(T) = O(√N). Our proof works with this additional mechanism, but results in the regret bound with N being replaced by √N. Thus, if we assume to have at least "not useless" UCBs such that N/ρ̄_t = Ω(√N), this additional mechanism can be disadvantageous. Accordingly, we do not adopt it in our experiments.

Remark 3.
(The effect of infinite-metric optimization) To understand the effect of considering all the possible upper bounds, we consider the case without GP. If we consider all the possible bounds, we have the regret bound L(3βD^{1/p})^α exp(−α[N/(2CD ρ̃_t) − 2] ln 3) for the best unknown L, α and p. For standard optimization with an estimated bound, we have L′(3βD^{1/p′})^{α′} exp(−α′[N/(2C′D) − 2] ln 3) for an estimated L′, α′, and p′. By algebraic manipulation, considering all the possible bounds has a better regret when ρ̃_t⁻¹ ≥ (2CD/(N ln 3^α)) ((N/(2C′D) − 2) ln 3^{α′} + 2 ln 3^α − ln(L′(3βD^{1/p′})^{α′} / (L(3βD^{1/p})^α))). For an intuitive insight, we can simplify the above by assuming α′ = α and C′ = C as ρ̃_t⁻¹ ≥ 1 − (2CD/(N ln 3^α)) ln(L′D^{α/p′} / (LD^{α/p})). Because L and p are the ones that achieve the lowest bound, the logarithm on the right-hand side is always non-negative. Hence, ρ̃_t = 1 always satisfies the condition. When L′ and p′ are not tight enough, the logarithmic term increases in magnitude, allowing ρ̃_t to increase. For example, if the second term on the right-hand side has a magnitude greater than 0.5, then ρ̃_t = 2 satisfies the inequality. Therefore, even if we know the upper bound of the function, we can see that it may be better not to rely on it, but rather to take the infinitely many possibilities into account.

One may improve the algorithm with different division procedures than the one presented in Algorithm 1. Accordingly, in the supplementary material, we derive an abstract version of the regret bound for IMGPO with a family of division procedures that satisfy some assumptions.
This information could be used to design a new division procedure.

5 Experiments

In this section, we compare the IMGPO algorithm with the SOO, BaMSOO, GP-PI and GP-EI algorithms [18, 2, 3]. In previous work, BaMSOO and GP-UCB were tested with a pair of a hand-picked good kernel and hyperparameters for each function [2]. In our experiments, we assume that knowledge of a good kernel and hyperparameters is unavailable, which is usually the case in practice. Thus, for IMGPO, BaMSOO, GP-PI and GP-EI, we simply used one of the most popular kernels, the isotropic Matérn kernel with ν = 5/2. This is given by κ(x, x′) = g(√5||x − x′||₂/l), where g(z) = σ²(1 + z + z²/3) exp(−z). Then, we blindly initialized the hyperparameters to σ = 1 and l = 0.25 for all the experiments; these values were updated with an empirical Bayesian method after each iteration.

Figure 2: Performance Comparison: in order, the digits inside the brackets [ ] indicate the dimensionality of each function, and the variables ρ̄_t and Ξ_n at the end of computation for IMGPO. Panels: (a) Sin1: [1, 1.92, 2]; (b) Sin2: [2, 3.37, 3]; (c) Peaks: [2, 3.14, 4]; (d) Rosenbrock2: [2, 3.41, 4]; (e) Branin: [2, 4.44, 2]; (f) Hartmann3: [3, 4.11, 3]; (g) Hartmann6: [6, 4.39, 4]; (h) Shekel5: [4, 3.95, 4]; (i) Sin1000: [1000, 3.95, 4].

Table 1: Average CPU time (in seconds) for the experiment with each test function

Algorithm | Sin1  | Sin2   | Peaks | Rosenbrock2 | Branin  | Hartmann3 | Hartmann6 | Shekel5
GP-PI     | 29.66 | 115.90 | 47.90 | 921.82      | 1124.21 | 573.67    | 657.36    | 611.01
GP-EI     | 12.74 | 115.79 | 44.94 | 893.04      | 1153.49 | 562.08    | 604.93    | 558.58
SOO       | 0.19  | 0.19   | 0.24  | 0.744       | 0.33    | 0.30      | 0.25      | 0.29
BaMSOO    | 43.80 | 4.61   | 7.83  | 12.09       | 14.86   | 14.14     | 26.68     | 371.36
IMGPO     | 1.61  | 3.15   | 4.70  | 11.11       | 5.73    | 6.80      | 13.47     | 15.92
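The Matérn-5/2 kernel used in the experiments can be sketched directly from its definition; the default hyperparameter values below are the paper's blind initialization (σ = 1, l = 0.25):

```python
import math

def matern52(x, x_prime, sigma=1.0, ell=0.25):
    """Isotropic Matern kernel with nu = 5/2:
    kappa(x, x') = g(sqrt(5) * ||x - x'||_2 / ell), where
    g(z) = sigma^2 * (1 + z + z^2 / 3) * exp(-z)."""
    r = math.dist(x, x_prime)          # Euclidean distance ||x - x'||_2
    z = math.sqrt(5.0) * r / ell
    return sigma**2 * (1.0 + z + z * z / 3.0) * math.exp(-z)
```

Unlike the Gaussian kernel, the Matérn-5/2 kernel only enforces twice-differentiable sample paths, which is one reason it is a popular default in Bayesian optimization.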
To compute the UCB by GP, we used η = 0.05 for IMGPO and BaMSOO. For IMGPO, Ξ_max was fixed to be 2² (the effect of selecting different values is discussed later). For BaMSOO and SOO, the parameter h_max was set to √n, according to Corollary 4.3 in [18]. For GP-PI and GP-EI, we used the SOO algorithm and a local optimization method using gradients to solve the auxiliary optimization. For SOO, BaMSOO and IMGPO, we used the corresponding deterministic division procedure (given Ω, the initial point is fixed and no randomness exists). For GP-PI and GP-EI, we randomly initialized the first evaluation point and report the mean and one standard deviation over 50 runs.

The experimental results for eight different objective functions are shown in Figure 2. The vertical axis is log₁₀(f(x*) − f(x+)), where f(x*) is the global optimum and f(x+) is the best value found by the algorithm. Hence, the lower the plotted value on the vertical axis, the better the algorithm's performance. The last five functions are standard benchmarks for global optimization [21]. The first two were used in [18] to test SOO, and can be written as f_sin1(x) = (sin(13x) sin(27x) + 1)/2 for Sin1 and f_sin2(x) = f_sin1(x₁)f_sin1(x₂) for Sin2. The form of the third function is given in Equation (16) and Figure 2 in [22]. The last function is Sin2 embedded in 1000 dimensions in the same manner described in Section 4.1 of [14], which is used here to illustrate the possibility of using IMGPO as a main subroutine to scale up to higher dimensions with additional assumptions. For this function, we used REMBO [14] with IMGPO and BaMSOO as its Bayesian optimization subroutine. All of these functions are multimodal, except for Rosenbrock2, with dimensionality from 1 to 1000.

As we can see from Figure 2, IMGPO outperformed the other algorithms in general.
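The two SOO test functions can be sketched in a few lines; note the sin(27x) factor in Sin1 is an assumption restored from the form used in [18], since the extracted text dropped it:

```python
import math

def f_sin1(x):
    """Sin1 benchmark on [0, 1]: (sin(13x) * sin(27x) + 1) / 2.
    The sin(27x) factor follows the SOO test function of [18]."""
    return (math.sin(13.0 * x) * math.sin(27.0 * x) + 1.0) / 2.0

def f_sin2(x1, x2):
    """Sin2 benchmark: separable product of two Sin1 factors."""
    return f_sin1(x1) * f_sin1(x2)
```

Because the product sin(13x) sin(27x) lies in [−1, 1], both benchmarks take values in [0, 1], and Sin2 inherits Sin1's multimodality in each coordinate.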
SOO produced competitive results for Rosenbrock2 because our GP prior was misleading (i.e., it did not model the objective function well, and thus the property f(x) ≤ U(x|D) did not hold many times). As can be seen in Table 1, IMGPO is much faster than traditional GP optimization methods, although it is slower than SOO. For Sin1, Sin2, Branin and Hartmann3, increasing Ξ_max does not affect IMGPO because Ξ_n did not reach Ξ_max = 2² (Figure 2). For the rest of the test functions, we would be able to improve the performance of IMGPO by increasing Ξ_max at the cost of extra CPU time.

6 Conclusion

We have presented the first GP-based optimization method with an exponential convergence rate O(λ^{N+N_gp}) (λ < 1) without the need for auxiliary optimization or the δ-cover sampling. Perhaps more importantly from the viewpoint of the broader global optimization community, we have provided a practically oriented analysis framework, enabling us to see why not relying on a particular bound is advantageous, and how a non-tight bound can still be useful (Remarks 1, 2 and 3). Following the advent of the DIRECT algorithm, the literature diverged along two paths, one with a particular bound and one without. GP-UCB can be categorized into the former. Our approach illustrates the benefits of combining these two paths.

As stated in Section 3.1, our solution idea was to use a bound-based method but rely less on the estimated bound by considering all the possible bounds. It would be interesting to see whether a similar principle can be applied to other types of bound-based methods such as planning algorithms (e.g., A* search and the UCT or FSSS algorithm [23]) and learning algorithms (e.g., PAC-MDP algorithms [24]).

Acknowledgments

The authors would like to thank Dr. Rémi Munos for his thoughtful comments and suggestions.
We gratefully acknowledge support from NSF grant 1420927, from ONR grant N00014-14-1-0486, and from ARO grant W911NF1410433. Kenji Kawaguchi was supported in part by the Funai Overseas Scholarship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

References

[1] N. de Freitas, A. J. Smola, and M. Zoghi. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[2] Z. Wang, B. Shakibi, L. Jin, and N. de Freitas. Bayesian multi-scale optimistic optimization. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1005–1014, 2014.

[3] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 2951–2959, 2012.

[4] R. G. Carter, J. M. Gablonsky, A. Patrick, C. T. Kelley, and O. J. Eslinger. Algorithms for noisy problems in gas transmission pipeline optimization. Optimization and Engineering, 2(2):139–157, 2001.

[5] J. W. Zwolak, J. J. Tyson, and L. T. Watson. Globally optimised parameters for a model of mitotic control in frog egg extracts. IEEE Proceedings - Systems Biology, 152(2):81–92, 2005.

[6] L. C. W. Dixon. Global optima without convexity. Numerical Optimisation Centre, Hatfield Polytechnic, 1977.

[7] B. O. Shubert. A sequential method seeking the global maximum of a function. SIAM Journal on Numerical Analysis, 9(3):379–388, 1972.

[8] D. Q. Mayne and E. Polak. Outer approximation algorithm for nondifferentiable optimization problems. Journal of Optimization Theory and Applications, 42(1):19–30, 1984.

[9] R. H. Mladineo.
An algorithm for finding the global maximum of a multimodal, multivariate function. Mathematical Programming, 34(2):188–200, 1986.

[10] R. G. Strongin. Convergence of an algorithm for finding a global extremum. Engineering Cybernetics, 11(4):549–555, 1973.

[11] D. E. Kvasov, C. Pizzuti, and Y. D. Sergeyev. Local tuning and partition strategies for diagonal GO methods. Numerische Mathematik, 94(1):93–106, 2003.

[12] S. Bubeck, G. Stoltz, and J. Y. Yu. Lipschitz bandits without the Lipschitz constant. In Algorithmic Learning Theory, pages 144–158. Springer, 2011.

[13] J. Gardner, M. Kusner, K. Weinberger, and J. Cunningham. Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 937–945, 2014.

[14] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian optimization in high dimensions via random embeddings. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), pages 1778–1784. AAAI Press, 2013.

[15] N. Srinivas, A. Krause, M. Seeger, and S. M. Kakade. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 1015–1022, 2010.

[16] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, page 521, 2012.

[17] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[18] R. Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2011.

[19] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, 1993.

[20] K. Kandasamy, J. Schneider, and B.
Poczos. High dimensional Bayesian optimisation and bandits via additive models. arXiv preprint arXiv:1503.01673, 2015.

[21] S. Surjanovic and D. Bingham. Virtual library of simulation experiments: Test functions and datasets. Retrieved November 30, 2014, from http://www.sfu.ca/~ssurjano, 2014.

[22] D. B. McDonald, W. J. Grantham, W. L. Tabor, and M. J. Murphy. Global and local optimization using radial basis function response surface models. Applied Mathematical Modelling, 31(10):2095–2110, 2007.

[23] T. J. Walsh, S. Goschin, and M. L. Littman. Integrating sample-based planning and model-based reinforcement learning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI), 2010.

[24] A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research (JMLR), 10:2413–2444, 2009.