{"title": "Scalable Global Optimization via Local Bayesian Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 5496, "page_last": 5507, "abstract": "Bayesian optimization has recently emerged as a popular method for the sample-efficient optimization of expensive black-box functions. However, the application to high-dimensional problems with several thousand observations remains challenging, and on difficult problems Bayesian optimization is often not competitive with other paradigms. In this paper we take the view that this is due to the implicit homogeneity of the global probabilistic models and an overemphasized exploration that results from global acquisition. This motivates the design of a local probabilistic approach for global optimization of large-scale high-dimensional problems. We propose the TuRBO algorithm that fits a collection of local models and performs a principled global allocation of samples across these models via an implicit bandit approach. A comprehensive evaluation demonstrates that TuRBO outperforms state-of-the-art methods from machine learning and operations research on problems spanning reinforcement learning, robotics, and the natural sciences.", "full_text": "Scalable Global Optimization via\n\nLocal Bayesian Optimization\n\nDavid Eriksson\n\nUber AI\n\nMichael Pearce\n\nUniversity of Warwick\n\nJacob R Gardner\n\nUber AI\n\neriksson@uber.com\n\nm.a.l.pearce@warwick.ac.uk\n\njake.gardner@uber.com\n\nRyan Turner\n\nUber AI\n\nryan.turner@uber.com\n\nMatthias Poloczek\n\nUber AI\n\npoloczek@uber.com\n\nAbstract\n\nBayesian optimization has recently emerged as a popular method for the sample-\nef\ufb01cient optimization of expensive black-box functions. However, the application\nto high-dimensional problems with several thousand observations remains chal-\nlenging, and on dif\ufb01cult problems Bayesian optimization is often not competitive\nwith other paradigms. 
In this paper we take the view that this is due to the implicit\nhomogeneity of the global probabilistic models and an overemphasized exploration\nthat results from global acquisition. This motivates the design of a local probabilis-\ntic approach for global optimization of large-scale high-dimensional problems. We\npropose the TuRBO algorithm that \ufb01ts a collection of local models and performs a\nprincipled global allocation of samples across these models via an implicit bandit\napproach. A comprehensive evaluation demonstrates that TuRBO outperforms state-\nof-the-art methods from machine learning and operations research on problems\nspanning reinforcement learning, robotics, and the natural sciences.\n\n1\n\nIntroduction\n\nThe global optimization of high-dimensional black-box functions\u2014where closed form expressions\nand derivatives are unavailable\u2014is a ubiquitous task arising in hyperparameter tuning [36]; in\nreinforcement learning, when searching for an optimal parametrized policy [7]; in simulation, when\ncalibrating a simulator to real world data; and in chemical engineering and materials discovery, when\nselecting candidates for high-throughput screening [18]. While Bayesian optimization (BO) has\nemerged as a highly competitive tool for problems with a small number of tunable parameters (e.g.,\nsee [13, 35]), it often scales poorly to high dimensions and large sample budgets. Several methods\nhave been proposed for high-dimensional problems with small budgets of a few hundred samples (see\nthe literature review below). However, these methods make strong assumptions about the objective\nfunction such as low-dimensional subspace structure. The recent algorithms of Wang et al. [45]\nand Hern\u00e1ndez-Lobato et al. [18] are explicitly designed for a large sample budget and do not make\nthese assumptions. 
However, they do not compare favorably with state-of-the-art methods from stochastic optimization such as CMA-ES [17] in practice.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe optimization of high-dimensional problems is hard for several reasons. First, the search space grows exponentially with the dimension, and while local optima may become more plentiful, global optima become more difficult to find. Second, the function is often heterogeneous, making the task of fitting a global surrogate model challenging. For example, in reinforcement learning problems with sparse rewards, we expect the objective function to be nearly constant in large parts of the search space. For the latter, note that the commonly used global Gaussian process (GP) models [13, 46] implicitly suppose that characteristic lengthscales and signal variances of the function are constant in the search space. Previous work on non-stationary kernels does not make this assumption, but these approaches are too computationally expensive to be applicable in our large-scale setting [37, 40, 3]. Finally, the fact that search spaces grow considerably faster than sampling budgets due to the curse of dimensionality implies the inherent presence of regions with large posterior uncertainty. For common myopic acquisition functions, this results in an overemphasized exploration and a failure to exploit promising areas.\nTo overcome these challenges, we adopt a local strategy for BO. We introduce trust region BO (TuRBO), a technique for global optimization that maintains a collection of simultaneous local optimization runs based on independent probabilistic models. 
Each local surrogate model enjoys the typical benefits of Bayesian modeling\u2014robustness to noisy observations and rigorous uncertainty estimates. At the same time, these local surrogates allow for heterogeneous modeling of the objective function and do not suffer from over-exploration. To optimize globally, we leverage an implicit multi-armed bandit strategy at each iteration to allocate samples between these local areas and thus decide which local optimization runs to continue.\nWe provide a comprehensive experimental evaluation demonstrating that TuRBO outperforms the state-of-the-art from BO, evolutionary methods, simulation optimization, and stochastic optimization on a variety of benchmarks that span from reinforcement learning to robotics and natural sciences. An implementation of TuRBO is available at https://github.com/uber-research/TuRBO.\n\n1.1 Related work\n\nBO has recently become the premier technique for global optimization of expensive functions, with applications in hyperparameter tuning, aerospace design, chemical engineering, and materials discovery; see [13, 35] for an overview. However, most of BO\u2019s successes have been on low-dimensional problems and small sample budgets. This is not for a lack of trying; there have been many attempts to scale BO to more dimensions and observations. A common approach is to replace the GP model: Hutter et al. [19] use random forests, whereas Snoek et al. [38] apply Bayesian linear regression on features from neural networks. This neural network approach was refined by Springenberg et al. [39], whose BOHAMIANN algorithm uses a modified Hamiltonian Monte Carlo method, which is more robust and scalable than standard Bayesian neural networks. Hern\u00e1ndez-Lobato et al. [18] combine Bayesian neural networks with Thompson sampling (TS), which easily scales to large batch sizes. 
We will return to this acquisition function later.\nThere is a considerable body of work in high-dimensional BO [8, 21, 5, 44, 14, 45, 32, 26, 27, 6].\nMany methods exist that exploit potential additive structure in the objective function [21, 14, 45].\nThese methods typically rely on training a large number of GPs (corresponding to different additive\nstructures) and therefore do not scale to large evaluation budgets. Other methods exist that rely on a\nmapping between the high-dimensional space and an unknown low-dimensional subspace to scale to\nlarge numbers of observations [44, 27, 15]. The BOCK algorithm of Oh et al. [29] uses a cylindrical\ntransformation of the search space to achieve scalability to high dimensions. Ensemble Bayesian\noptimization (EBO) [45] uses an ensemble of additive GPs together with a batch acquisition function\nto scale BO to tens of thousands of observations and high-dimensional spaces. Recently, Nayebi\net al. [27] have proposed the general HeSBO framework that extends GP-based BO algorithms to\nhigh-dimensional problems using a novel subspace embedding that overcomes the limitations of the\nGaussian projections used in [44, 5, 6]. From this area of research, we compare to BOCK, BOHAMIANN,\nEBO, and HeSBO.\nTo acquire large numbers of observations, large-scale BO usually selects points in batches to be\nevaluated in parallel. While several batch acquisition functions have recently been proposed [9,\n34, 43, 47, 48, 24, 16], these approaches do not scale to large batch sizes in practice. TS [41] is\nparticularly lightweight and easy to implement as a batch acquisition function as the computational\ncost scales linearly with the batch size. Although originally developed for bandit problems [33], it\nhas recently shown its value in BO [18, 4, 22]. In practice, TS is usually implemented by drawing a\nrealization of the unknown objective function from the surrogate model\u2019s posterior on a discretized\nsearch space. 
Then, TS finds the optimum of the realization and evaluates the objective function at that location. This technique is easily extended to batches by drawing multiple realizations (see the supplementary material for details).\n\nEvolutionary algorithms are a popular approach for optimizing black-box functions when thousands of evaluations are available; see Jin et al. [20] for an overview in stochastic settings. We compare to the successful covariance matrix adaptation evolution strategy (CMA-ES) of Hansen [17]. CMA-ES performs a stochastic search and maintains a multivariate normal sampling distribution over the search space. The evolutionary techniques of recombination and mutation correspond to adaptations of the mean and covariance matrix of that distribution.\nHigh-dimensional problems with large sample budgets have also been studied extensively in operations research and simulation optimization; see [11] for a survey. Here the successful trust region (TR) methods are based on a local surrogate model in a region (often a sphere) around the best solution. The trust region is expanded or shrunk depending on the improvement in obtained solutions; see Yuan [49] for an overview. We compare to BOBYQA [31], a state-of-the-art TR method that uses a quadratic approximation of the objective function. We also include the Nelder-Mead (NM) algorithm [28]. For a d-dimensional space, NM maintains a simplex of d + 1 vertices that adaptively moves along the surface by reflecting the vertex with the worst function value through the centroid of the remaining vertices. Finally, we also consider the popular quasi-Newton method BFGS [50], where gradients are obtained using finite differences. For other work that uses local surrogate models, see e.g., [23, 42, 1, 2, 25].\n\n2 The trust region Bayesian optimization algorithm\n\nIn this section, we propose an algorithm for optimizing high-dimensional black-box functions. 
In particular, suppose that we wish to solve:\n\nFind x\u2217 \u2208 \u2126 such that f (x\u2217) \u2264 f (x), \u2200x \u2208 \u2126,\n\nwhere f : \u2126 \u2192 R and \u2126 = [0, 1]^d. We observe potentially noisy values y(x) = f (x) + \u03b5, where \u03b5 \u223c N (0, \u03c3^2). BO relies on the ability to construct a global model that is eventually accurate enough to uncover a global optimizer. As discussed previously, this is challenging due to the curse of dimensionality and the heterogeneity of the function. To address these challenges, we propose to abandon global surrogate modeling and instead maintain several independent local models, each involved in a separate local optimization run. To achieve global optimization in this framework, we allocate samples across these local runs via an implicit multi-armed bandit approach. This yields an efficient acquisition strategy that directs samples towards promising local optimization runs. We begin by detailing a single local optimization run, and then discuss how multiple runs are managed.\n\nLocal modeling. To achieve principled local optimization in the gradient-free setting, we draw inspiration from a class of TR methods from stochastic optimization [49]. These methods make suggestions using a (simple) surrogate model inside a TR. The region is often a sphere or a polytope centered at the best solution, within which the surrogate model is believed to accurately model the function. For example, the popular COBYLA [30] method approximates the objective function using a local linear model. Intuitively, while linear and quadratic surrogates are likely to be inadequate models globally, they can be accurate in a sufficiently small TR. However, there are two challenges with traditional TR methods. First, deterministic examples such as COBYLA are notorious for handling noisy observations poorly. 
Second, simple surrogate models might require overly small trust regions to provide accurate modeling behavior. Therefore, we will use GP surrogate models within a TR. This allows us to inherit the robustness to noise and rigorous reasoning about uncertainty that global BO enjoys.\n\nTrust regions. We choose our TR to be a hyperrectangle centered at the best solution found so far, denoted by x\u22c6. In the noise-free case, we set x\u22c6 to the location of the best observation so far. In the presence of noise, we use the observation with the smallest posterior mean under the surrogate model. At the beginning of a given local optimization run, we initialize the base side length of the TR to L \u2190 L_init. The actual side length for each dimension is obtained from this base side length by rescaling according to its lengthscale \u03bb_i in the GP model while maintaining a total volume of L^d. That is, L_i = \u03bb_i L / (\u220f_{j=1}^d \u03bb_j)^{1/d}. To perform a single local optimization run, we utilize an acquisition function at each iteration t to select a batch of q candidates {x_1^{(t)}, . . . , x_q^{(t)}}, restricted to be within the TR. If L was large enough for the TR to contain the whole space, this would be equivalent to running standard global BO. Therefore, the evolution of L is critical. On the one hand, a TR should be sufficiently large to contain good solutions. On the other hand, it should be small enough to ensure that the local model is accurate within the TR. The typical behavior is to expand a TR when the optimizer \u201cmakes progress\u201d, i.e., it finds better solutions in that region, and shrink it when the optimizer appears stuck. Therefore, following, e.g., Nelder and Mead [28], we will shrink a TR after too many consecutive \u201cfailures\u201d, and expand it after many consecutive \u201csuccesses\u201d. 
We define a \u201csuccess\u201d as a candidate that improves upon x\u22c6, and a \u201cfailure\u201d as a candidate that does not. After \u03c4_succ consecutive successes, we double the size of the TR, i.e., L \u2190 min{L_max, 2L}. After \u03c4_fail consecutive failures, we halve the size of the TR: L \u2190 L/2. We reset the success and failure counters to zero after we change the size of the TR. Whenever L falls below a given minimum threshold L_min, we discard the respective TR and initialize a new one with side length L_init. Additionally, we do not let the side length expand to be larger than a maximum threshold L_max. Note that \u03c4_succ, \u03c4_fail, L_min, L_max, and L_init are hyperparameters of TuRBO; see the supplementary material for the values used in the experimental evaluation.\n\nTrust region Bayesian optimization. So far, we have detailed a single local BO strategy using a TR method. Intuitively, we could make this algorithm (more) global by random restarts. However, from a probabilistic perspective, this is likely to utilize our evaluation budget inefficiently. Just as we reason about which candidates are most promising within a local optimization run, we can reason about which local optimization run is \u201cmost promising.\u201d\nTherefore, TuRBO maintains m trust regions simultaneously. Each trust region TR_\u2113 with \u2113 \u2208 {1, . . . , m} is a hyperrectangle of base side length L_\u2113 \u2264 L_max, and utilizes an independent local GP model. This gives rise to a classical exploitation-exploration trade-off that we model by a multi-armed bandit that treats each TR as a lever. Note that this provides an advantage over traditional TR algorithms in that TuRBO puts a stronger emphasis on promising regions.\nIn each iteration, we need to select a batch of q candidates drawn from the union of all trust regions, and update all local optimization problems for which candidates were drawn. 
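The trust-region bookkeeping described above (per-dimension rescaling by the GP lengthscales, doubling after \u03c4_succ consecutive successes, halving after \u03c4_fail consecutive failures, and restarting once L drops below L_min) can be sketched as follows. This is a minimal illustration with placeholder names and hyperparameter values, not the implementation or settings used in the paper:

```python
import numpy as np

class TrustRegion:
    """Minimal sketch of TuRBO-style trust-region bookkeeping.

    All defaults below are illustrative placeholders, not the
    hyperparameter values used in the paper.
    """

    def __init__(self, dim, L_init=0.8, L_min=0.5 ** 7, L_max=1.6,
                 tau_succ=3, tau_fail=10):
        self.dim = dim
        self.L_init, self.L_min, self.L_max = L_init, L_min, L_max
        self.tau_succ, self.tau_fail = tau_succ, tau_fail
        self.restart()

    def restart(self):
        # Discard the region and start fresh at the initial side length.
        self.L = self.L_init
        self.n_succ = self.n_fail = 0

    def side_lengths(self, lengthscales):
        # Rescale the base side length per dimension by the GP
        # lengthscales while keeping the total volume at L^d:
        #   L_i = lambda_i * L / (prod_j lambda_j)^(1/d)
        lam = np.asarray(lengthscales, dtype=float)
        return lam * self.L / np.prod(lam) ** (1.0 / self.dim)

    def update(self, improved):
        # Track consecutive successes/failures; double or halve L when
        # a counter reaches its threshold, then reset that counter.
        if improved:
            self.n_succ, self.n_fail = self.n_succ + 1, 0
        else:
            self.n_succ, self.n_fail = 0, self.n_fail + 1
        if self.n_succ == self.tau_succ:
            self.L = min(self.L_max, 2.0 * self.L)
            self.n_succ = 0
        elif self.n_fail == self.tau_fail:
            self.L /= 2.0
            self.n_fail = 0
        if self.L < self.L_min:  # region exhausted: restart it
            self.restart()
```

A full local run would combine this bookkeeping with a GP fit to the observations inside the region and an acquisition step restricted to the rescaled hyperrectangle around the incumbent.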
To solve this problem, we find that TS provides a principled solution to both the problem of selecting candidates within a single TR, and selecting candidates across the set of trust regions simultaneously. To select the i-th candidate from across the trust regions, we draw a realization of the posterior function from the local GP within each TR: f_\u2113^{(i)} \u223c GP_\u2113^{(t)}(\u00b5_\u2113(x), k_\u2113(x, x')), where GP_\u2113^{(t)} is the GP posterior for TR_\u2113 at iteration t. We then select the i-th candidate such that it minimizes the function value across all m samples and all trust regions:\n\nx_i^{(t)} \u2208 argmin_\u2113 argmin_{x \u2208 TR_\u2113} f_\u2113^{(i)}(x), where f_\u2113^{(i)} \u223c GP_\u2113^{(t)}(\u00b5_\u2113(x), k_\u2113(x, x')).\n\nThat is, we select the point with the smallest function value after concatenating a Thompson sample from each TR, for i = 1, . . . , q. We refer to the supplementary material for additional details.\n\n3 Numerical experiments\n\nIn this section, we evaluate TuRBO on a wide range of problems: a 14D robot pushing problem, a 60D rover trajectory planning problem, a 12D cosmological constant estimation problem, a 12D lunar landing reinforcement learning problem, and a 200D synthetic problem. All problems are multimodal and challenging for many global optimization algorithms. We consider a variety of batch sizes and evaluation budgets to fully examine the performance and robustness of TuRBO. The values of \u03c4_succ, \u03c4_fail, L_min, L_max, and L_init are given in the supplementary material.\nWe compare TuRBO to a comprehensive selection of state-of-the-art baselines: BFGS, BOCK, BOHAMIANN, CMA-ES, BOBYQA, EBO, GP-TS, HeSBO-TS, Nelder-Mead (NM), and random search (RS).\nHere, GP-TS refers to TS with a global GP model using the Mat\u00e9rn-5/2 kernel. 
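As a concrete illustration of this GP-TS baseline, the following sketch fits an exact GP and draws batch Thompson samples on a random discretization of the unit hypercube. All helper names are illustrative; for brevity it uses an isotropic Mat\u00e9rn-5/2 kernel with fixed hyperparameters, whereas a practical implementation would use ARD lengthscales fit by marginal likelihood:

```python
import numpy as np

def matern52(X1, X2, lengthscale=0.5, variance=1.0):
    # Isotropic Matern-5/2 kernel (fixed hyperparameters for brevity).
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    r = np.sqrt(5.0) * d / lengthscale
    return variance * (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

def gp_posterior(X, y, Xc, noise=1e-4):
    # Standard exact GP regression: posterior mean/covariance at Xc.
    K = matern52(X, X) + noise * np.eye(len(X))
    Ks = matern52(X, Xc)
    Kss = matern52(Xc, Xc)
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, cov

def thompson_batch(X, y, dim, q, n_cands=500, rng=None):
    # Batch Thompson sampling: draw q posterior realizations on a
    # discretized candidate set and return each realization's argmin.
    rng = np.random.default_rng(rng)
    Xc = rng.uniform(size=(n_cands, dim))
    mu, cov = gp_posterior(X, y, Xc)
    f = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(n_cands), size=q)
    return Xc[np.argmin(f, axis=1)]
```

Each point in the returned batch would then be evaluated (in parallel), appended to (X, y), and the procedure repeated until the budget is exhausted.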
Figure 1: Illustration of the TuRBO algorithm. (Left) The true contours of the Branin function. (Middle left) The contours of the GP model fitted to the observations depicted by black dots. The current TR is shown as a red square. The global optima are indicated by the green stars. (Middle right) During the execution of the algorithm, the TR has moved towards the global optimum and has reduced in size. The area around the optimum has been sampled more densely in effect. (Right) The local GP model almost exactly fits the underlying function in the TR, despite having a poor global fit.\n\nHeSBO-TS combines GP-TS with a subspace embedding and thus effectively optimizes in a low-dimensional space; this target dimension is set by the user. Therefore, a small sample budget may suffice, which allows us to run p invocations in parallel, following [44]. This may improve the performance, since each embedding may \"fail\" with some probability [27], i.e., it does not contain the active subspace even if it exists. Note that HeSBO-TS-p recommends the point with the best posterior mean among the p GP models; we use that point for the evaluation. The standard acquisition criterion EI used in BOCK and BOHAMIANN is replaced by (batch) TS, i.e., all methods use the same criterion, which allows for a direct comparison. Methods that attempt to learn an additive decomposition lack scalability and are thus omitted. BFGS approximates the gradient via finite differences and thus requires d + 1 evaluations for each step. Furthermore, NM, BFGS, and BOBYQA are inherently sequential and therefore have an edge by leveraging all gathered observations. 
However, since they cannot leverage parallel evaluations, they are considerably more time consuming in terms of wall-clock time when we work with large batches.\nWe supplement the optimization test problems with three additional experiments: i) one that shows that TuRBO achieves a linear speed-up from large batch sizes, ii) a comparison of local GPs and global GPs on a control problem, and iii) an analytical experiment demonstrating the locality of TuRBO. Performance plots show the mean performances with one standard error. Overall, we observe that TuRBO consistently finds excellent solutions, outperforming the other methods on most problems. Experimental results for a small budget experiment on four synthetic functions are shown in the supplement, where we also provide details on the experimental setup and runtimes for all algorithms.\n\n3.1 Robot pushing\n\nThe robot pushing problem is a noisy 14D control problem considered in Wang et al. [45]. TuRBO-m denotes the variant of TuRBO that maintains m local models in parallel. We run each method for a total of 10K evaluations with a batch size of q = 50. TuRBO-1 and all other methods are initialized with 100 points, except for TuRBO-20 where we use 50 initial points for each trust region. This is to avoid having TuRBO-20 consume its full evaluation budget on the initial points. We use HeSBO-TS-5 with target dimension 8. Fig. 2 shows the results: TuRBO-1 and TuRBO-20 outperform the alternatives. TuRBO-20 starts slower since it is initialized with 1K points, but eventually outperforms TuRBO-1. CMA-ES and BOBYQA outperform the remaining baselines. Note that Wang et al. [45] reported a median value of 8.3 for EBO after 30K evaluations, while TuRBO-1 achieves a mean and median reward of around 9.4 after only 2K samples.\n\n3.2 Rover trajectory planning\n\nHere the goal is to optimize the locations of 30 points in the 2D plane that determine the trajectory of a rover [45]. 
Every algorithm is run for 200 steps with a batch size of q = 100, thus collecting a total of 20K evaluations. We use 200 initial points for all methods except for TuRBO-20, where we use 100 initial points for each region. Fig. 2 summarizes the performance. We observe that TuRBO-1 and TuRBO-20 outperform all other algorithms after a few thousand evaluations. TuRBO-20 once again starts slowly because of the initial 2K random evaluations. Wang et al. [45] reported a mean value of 1.5 for EBO after 35K evaluations, while TuRBO-1 achieves a mean and median reward of about 2 after only 1K evaluations. We use a target dimension of 10 for HeSBO-TS-15 in this experiment.\n\nFigure 2: 14D Robot pushing (left): TuRBO-1 and TuRBO-20 perform well after a few thousand evaluations. 60D Rover trajectory planning (right): TuRBO-1 and TuRBO-20 achieve close to optimal objective values after 10K evaluations. In both experiments CMA-ES and BOBYQA are the runners up, and HeSBO-TS and EBO perform best among the other BO methods.\n\n3.3 Cosmological constant learning\n\nIn the \u201ccosmological constants\u201d problem, the task is to calibrate a physics simulator1 to observed data. The tunable parameters include various physical constants like the density of certain types of matter and Hubble\u2019s constant. In this paper, we use a more challenging version of the problem in [21] by tuning 12 parameters rather than 9, and by using substantially larger parameter bounds. We used 2K evaluations, a batch size of q = 50, and 50 initial points. TuRBO-5 uses 20 initial points for each local model and HeSBO-TS-4 uses a target dimension of 8. Fig. 3 (left) shows the results, with TuRBO-5 performing the best, followed by BOBYQA and TuRBO-1. 
TuRBO-1 sometimes converges to a bad local optimum, which deteriorates the mean performance and demonstrates the importance of allocating samples across multiple trust regions.\n\n3.4 Lunar landing reinforcement learning\n\nHere the goal is to learn a controller for a lunar lander implemented in the OpenAI gym2. The state space for the lunar lander consists of the position, angle, time derivatives, and whether or not either leg is in contact with the ground. There are four possible actions for each frame, each corresponding to firing a booster engine left, right, or up, or doing nothing. The objective is to maximize the average final reward over a fixed set of 50 randomly generated terrains, initial positions, and velocities. We observed that the simulation can be sensitive to even tiny perturbations. Fig. 3 shows the results for a total of 1500 function evaluations, batch size q = 50, and 50 initial points for all algorithms except for TuRBO-5, which uses 20 initial points for each local region. For this problem, we use HeSBO-TS-3 in an 8-dimensional subspace. TuRBO-5 and TuRBO-1 learn the best controllers and, in particular, achieve better rewards than the handcrafted controller provided by OpenAI, whose performance is depicted by the blue horizontal line.\n\n3.5 The 200-dimensional Ackley function\n\nWe examine performance on the 200-dimensional Ackley function in the domain [\u22125, 10]^200. We only consider TuRBO-1 because, with this large number of dimensions, there may not be a benefit from using multiple TRs. EBO is excluded from the plot since its computation time exceeded 30 days per replication. HeSBO-TS-5 uses a target dimension of 20. Fig. 4 shows the results for a total of 10K function evaluations, batch size q = 100, and 200 initial points for all algorithms.\n\n1https://lambda.gsfc.nasa.gov/toolbox/lrgdr/\n2https://gym.openai.com/envs/LunarLander-v2\n\nFigure 3: 12D Cosmological constant (left): TuRBO-5 provides an improvement over BOBYQA and TuRBO-1. BO methods are distanced, with TS performing best among them. 12D Lunar lander (right): TuRBO-5, TuRBO-1, EBO, and CMA-ES learn better controllers than the original OpenAI controller (solid blue horizontal line).\n\nFigure 4: 200D Ackley function: TuRBO-1 clearly outperforms the other baselines. BOBYQA makes good initial progress but consistently converges to sub-optimal local minima.\n\nHeSBO-TS-5 and BOBYQA perform well initially, but are eventually outperformed by TuRBO-1, which achieves the best solutions. The good performance of HeSBO-TS is particularly interesting, since this benchmark has no redundant dimensions and thus should be challenging for that embedding-based approach. This confirms similar findings in [27]. BO methods that use a global GP model over-emphasize exploration and make little progress.\n\n3.6 The advantage of local models over global models\n\nWe investigate the performance of local and global GP models on the 14D robot pushing problem from Sect. 3.1. We replicate the conditions from the optimization experiments as closely as possible for a regression experiment, including for example parameter bounds. We choose 20 uniformly distributed hypercubes of (base) side length 0.4, each containing 200 uniformly distributed training points. 
We train a global GP on all 4000 samples, as well as a separate local GP for each hypercube. For the sake of illustration, we used an isotropic kernel for these experiments. The local GPs have the advantage of being able to learn different hyperparameters in each region, while the global GP has the advantage of having access to all of the data. Fig. 5 shows the predictive performance (in log loss) on held-out data. We also show the distribution of fitted hyperparameters for both the local and global GPs. We see that the hyperparameters (especially the signal variance) vary substantially across regions. Furthermore, the local GPs perform better than the global GP in every repeated trial. The global model has an average log loss of 1.284 while the local model has an average log loss of 1.174 across 50 trials; the improvement is significant under a t-test at p < 10^{-4}.\n\nFigure 5: Local and global GPs on log loss (left): We show the improvement in test set log loss (nats/test point) of the local model over the global model by repeated trial. The local GP increases in performance in every trial. Trials are sorted in order of performance gain. This shows a substantial mean improvement of 0.110 nats. Learned hypers (right three figures): A histogram plot of the hyperparameters learned by the local (blue) and global (orange) GPs pooled across all repeated trials. The local GPs show a much wider range of hyperparameters that can specialize per region.\n\n
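A toy version of this local-vs-global comparison can be sketched as follows. This is not the paper's setup: the helper names are illustrative, an RBF kernel with a coarse grid search over hyperparameters stands in for the isotropic Mat\u00e9rn GP with proper hyperparameter optimization, and the synthetic data simply gives one region a much larger signal scale than the other:

```python
import numpy as np

def rbf(X1, X2, ls, var):
    # Squared-exponential kernel; ls = lengthscale, var = signal variance.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

def fit_and_score(Xtr, ytr, Xte, yte, noise=1e-2):
    """Grid-search GP hyperparameters by exact log marginal likelihood
    (up to an additive constant), then return the average negative
    predictive log density (log loss, nats/test point) on held-out data."""
    best, best_lml = None, -np.inf
    for ls in (0.03, 0.1, 0.3, 1.0):
        for var in (0.01, 0.1, 1.0, 10.0):
            K = rbf(Xtr, Xtr, ls, var) + noise * np.eye(len(Xtr))
            L = np.linalg.cholesky(K)
            a = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
            lml = -0.5 * ytr @ a - np.log(np.diag(L)).sum()
            if lml > best_lml:
                best, best_lml = (ls, var), lml
    ls, var = best
    K = rbf(Xtr, Xtr, ls, var) + noise * np.eye(len(Xtr))
    Ks = rbf(Xtr, Xte, ls, var)
    mu = Ks.T @ np.linalg.solve(K, ytr)
    # Per-point predictive variance: prior + noise minus explained part.
    v = var + noise - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
    return float(np.mean(0.5 * np.log(2 * np.pi * v)
                         + 0.5 * (yte - mu) ** 2 / v))
```

One would call `fit_and_score` once on the pooled data (the "global" model) and once per region (the "local" models), then compare the resulting held-out log losses, mirroring the comparison reported in Fig. 5.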
This experiment con\ufb01rms\nthat we improve the predictive power of the models and also reduce the computational overhead of\nthe GP by using the local approach. The learned local noise variance in Fig. 5 is bimodal, con\ufb01rming\nthe heteroscedasticity in the objective across regions. The global GP is required to learn the high\nnoise value to avoid a penalty for outliers.\n\n3.7 Why high-dimensional spaces are challenging\n\nIn this section, we illustrate why the restarting and banditing strategy of TuRBO is so effective. Each\nTR restart \ufb01nds distant solutions of varying quality, which highlights the multimodal nature of the\nproblem. This gives TuRBO-m a distinct advantage.\nWe ran TuRBO-1 (with a single trust region) for 50 restarts on the 60D rover trajectory planning\nproblem from Sect. 3.2 and logged the volume of the TR and its center after each iteration. Fig. 6\nshows the volume of the TR, the arclength of the TR center\u2019s trajectory, the \ufb01nal objective value, and\nthe distance each \ufb01nal solution has to its nearest neighbor. The left two plots con\ufb01rm that, within a\ntrust region, the optimization is indeed highly local. The volume of any given trust region decreases\nrapidly and is only a small fraction of the total search space. From the two plots on the right, we\nsee that the solutions found by TuRBO are far apart with varying quality, demonstrating the value of\nperforming multiple local search runs in parallel.\n\nFigure 6: Performance statistics for 50 restarts of TuRBO-1 on the 60D rover trajectory planning\nproblem. The domain is scaled to [0, 1]60. Trust region volume (left): We see that the volume of\nthe TR decreases with the iterations. Each TR is shown by a light blue line, and their average in\nsolid blue. Total center distance (middle left): The cumulative Euclidean distance that each TR\ncenter has moved (trajectory arc length). 
This confirms the balance between initial exploration and final exploitation. Best value found (middle right): The best function value found during each run of TuRBO-1. The solutions vary in quality, which explains why our bandit approach works well. Distance between final TR centers (right): Minimum distances between the final TR centers, showing that each restart leads to a different part of the space.

3.8 The efficiency of large batches

Recall that combining multiple samples into a single batch provides substantial speed-ups in terms of wall-clock time, but poses a risk of inefficiency, since sequential sampling has the advantage of leveraging more information. In this section, we investigate whether large batches are efficient for TuRBO. Note that Hernández-Lobato et al. [18] and Kandasamy et al. [22] have shown that the TS acquisition function is efficient for batch acquisition with a single global surrogate model. We study TuRBO-1 on the robot pushing problem from Sect. 3.1 with batch sizes q ∈ {1, 2, 4, ..., 64}. The algorithm takes max{200q, 6400} samples for each batch size, and we average the results over 30 replications. Fig. 7 (left) shows the reward for each batch size with respect to the number of batches: larger batch sizes obtain better results for the same number of iterations. Fig. 7 (right) shows the performance as a function of the number of evaluations; we see that the speed-up is essentially linear.

Figure 7: We evaluate TuRBO for different batch sizes.
On the left, we see that larger batches provide better solutions at the same number of steps. On the right, we see that this reduction in wall-clock time does not come at the expense of efficacy, with large batches providing a nearly linear speed-up.

4 Conclusions

The global optimization of computationally expensive black-box functions in high-dimensional spaces is an important and timely topic [13, 27]. We proposed the TuRBO algorithm, which takes a novel local approach to global optimization. Instead of fitting a global surrogate model and trading off exploration and exploitation over the whole search space, TuRBO maintains a collection of local probabilistic models. These models provide local search trajectories that are able to quickly discover excellent objective values. This local approach is complemented by a global bandit strategy that allocates samples across the trust regions, implicitly trading off exploration and exploitation. A comprehensive experimental evaluation demonstrates that TuRBO outperforms state-of-the-art Bayesian optimization and operations research methods on a variety of complex real-world tasks.

In the future, we plan to extend TuRBO to learn local low-dimensional structure to improve the accuracy of the local Gaussian process model. This extension is particularly interesting for high-dimensional optimization when derivative information is available [10, 12, 48]. This situation often arises in engineering, where objectives are often modeled by PDEs solved with adjoint methods, and in machine learning, where gradients are available via automatic differentiation. Ultimately, it is our hope that this work spurs interest in the merits of local Bayesian optimization, particularly in the high-dimensional setting.

References

[1] L. Acerbi and W. Ji.
Practical Bayesian optimization for model fitting with Bayesian adaptive direct search. In Advances in Neural Information Processing Systems, pages 1836–1846, 2017.

[2] R. Akrour, D. Sorokin, J. Peters, and G. Neumann. Local Bayesian optimization of motor skills. In Proceedings of the 34th International Conference on Machine Learning, pages 41–50. JMLR.org, 2017.

[3] J.-A. M. Assael, Z. Wang, B. Shahriari, and N. de Freitas. Heteroscedastic treed Bayesian optimisation. arXiv preprint arXiv:1410.7172, 2014.

[4] R. Baptista and M. Poloczek. Bayesian optimization of combinatorial structures. In International Conference on Machine Learning, pages 462–471, 2018.

[5] M. Binois, D. Ginsbourger, and O. Roustant. A warped kernel improving robustness in Bayesian optimization via random embeddings. In International Conference on Learning and Intelligent Optimization, pages 281–286. Springer, 2015.

[6] M. Binois, D. Ginsbourger, and O. Roustant. On the choice of the low-dimensional domain for global optimization via random embeddings. Journal of Global Optimization, 2019 (to appear).

[7] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1-2):5–23, 2016.

[8] B. Chen, R. M. Castro, and A. Krause. Joint optimization and variable selection of high-dimensional Gaussian processes. In Proceedings of the International Conference on Machine Learning, pages 1379–1386. Omnipress, 2012.

[9] C. Chevalier and D. Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection.
In International Conference on Learning and Intelligent Optimization, pages 59–69. Springer, 2013.

[10] P. G. Constantine. Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies, volume 2. SIAM, 2015.

[11] N. A. Dong, D. J. Eckman, X. Zhao, S. G. Henderson, and M. Poloczek. Empirically comparing the finite-time performance of simulation-optimization algorithms. In Winter Simulation Conference, pages 2206–2217. IEEE, 2017.

[12] D. Eriksson, K. Dong, E. Lee, D. Bindel, and A. G. Wilson. Scaling Gaussian process regression with derivatives. In Advances in Neural Information Processing Systems, pages 6867–6877, 2018.

[13] P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

[14] J. Gardner, C. Guo, K. Weinberger, R. Garnett, and R. Grosse. Discovering and exploiting additive structure for Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pages 1311–1319, 2017.

[15] R. Garnett, M. Osborne, and P. Hennig. Active learning of linear embeddings for Gaussian processes. In 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014), pages 230–239, 2014.

[16] J. González, Z. Dai, P. Hennig, and N. Lawrence. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pages 648–657, 2016.

[17] N. Hansen. The CMA evolution strategy: A comparing review. In Towards a New Evolutionary Computation, pages 75–102. Springer, 2006.

[18] J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the International Conference on Machine Learning, pages 1470–1479, 2017.

[19] F. Hutter, H. H. Hoos, and K. Leyton-Brown.
Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.

[20] Y. Jin, J. Branke, et al. Evolutionary optimization in uncertain environments: a survey. IEEE Transactions on Evolutionary Computation, 9(3):303–317, 2005.

[21] K. Kandasamy, J. Schneider, and B. Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295–304, 2015.

[22] K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Póczos. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133–142, 2018.

[23] T. Krityakierne and D. Ginsbourger. Global optimization with sparse and local Gaussian process models. In International Workshop on Machine Learning, Optimization and Big Data, pages 185–196. Springer, 2015.

[24] S. Marmin, C. Chevalier, and D. Ginsbourger. Differentiating the multipoint expected improvement for optimal batch design. In International Workshop on Machine Learning, Optimization and Big Data, pages 37–48. Springer, 2015.

[25] M. McLeod, S. Roberts, and M. A. Osborne. Optimization, fast and slow: Optimally switching between local and Bayesian optimization. In International Conference on Machine Learning, pages 3440–3449, 2018.

[26] M. Mutny and A. Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems, pages 9005–9016, 2018.

[27] A. Nayebi, A. Munteanu, and M. Poloczek. A framework for Bayesian optimization in embedded subspaces. In International Conference on Machine Learning, pages 4752–4761, 2019. The code is available at https://github.com/aminnayebi/HesBO.

[28] J. A. Nelder and R. Mead.
A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.

[29] C. Oh, E. Gavves, and M. Welling. BOCK: Bayesian optimization with cylindrical kernels. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3868–3877, 2018.

[30] M. J. Powell. A direct search optimization method that models the objective and constraint functions by linear interpolation. In Advances in Optimization and Numerical Analysis, pages 51–67. Springer, 1994.

[31] M. J. Powell. A view of algorithms for optimization without derivatives. Mathematics Today - Bulletin of the Institute of Mathematics and its Applications, 43(5):170–174, 2007.

[32] P. Rolland, J. Scarlett, I. Bogunovic, and V. Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Artificial Intelligence and Statistics, pages 298–307, 2018.

[33] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

[34] A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3330–3338, 2015.

[35] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

[36] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[37] J. Snoek, K. Swersky, R. Zemel, and R. Adams. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pages 1674–1682, 2014.

[38] J. Snoek, O. Rippel, K.
Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180, 2015.

[39] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134–4142, 2016.

[40] M. A. Taddy, H. K. Lee, G. A. Gray, and J. D. Griffin. Bayesian guided pattern search for robust local optimization. Technometrics, 51(4):389–401, 2009.

[41] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[42] K. P. Wabersich and M. Toussaint. Advancing Bayesian optimization: The mixed-global-local (MGL) kernel and length-scale cool down. arXiv preprint arXiv:1612.03117, 2016.

[43] J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016.

[44] Z. Wang, F. Hutter, M. Zoghi, D. Matheson, and N. de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.

[45] Z. Wang, C. Gehring, P. Kohli, and S. Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pages 745–754, 2018.

[46] C. K. Williams and C. E. Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, 2006.

[47] J. Wu and P. Frazier. The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems, pages 3126–3134, 2016.

[48] J. Wu, M. Poloczek, A. G. Wilson, and P. Frazier. Bayesian optimization with gradients.
In Advances in Neural Information Processing Systems, pages 5267–5278, 2017.

[49] Y. Yuan. A review of trust region algorithms for optimization. In International Council for Industrial and Applied Mathematics, volume 99, pages 271–282, 2000.

[50] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.