{"title": "Automating Bayesian optimization with Bayesian optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 5984, "page_last": 5994, "abstract": "Bayesian optimization is a powerful tool for global optimization of expensive functions. One of its key components is the underlying probabilistic model used for the objective function f. In practice, however, it is often unclear how one should appropriately choose a model, especially when gathering data is expensive. In this work, we introduce a novel automated Bayesian optimization approach that dynamically selects promising models for explaining the observed data using Bayesian Optimization in the model space. Crucially, we account for the uncertainty in the choice of model; our method is capable of using multiple models to represent its current belief about f and subsequently using this information for decision making. We argue, and demonstrate empirically, that our approach automatically finds suitable models for the objective function, which ultimately results in more-efficient optimization.", "full_text": "Automating Bayesian optimization\nwith Bayesian optimization\n\nGustavo Malkomes, Roman Garnett\n\nDepartment of Computer Science and Engineering\n\nWashington University in St. Louis\n\nSt. Louis, MO 63130\n\n{luizgustavo, garnett}@wustl.edu\n\nAbstract\n\nBayesian optimization is a powerful tool for global optimization of expensive\nfunctions. One of its key components is the underlying probabilistic model used for\nthe objective function f. In practice, however, it is often unclear how one should\nappropriately choose a model, especially when gathering data is expensive. We in-\ntroduce a novel automated Bayesian optimization approach that dynamically selects\npromising models for explaining the observed data using Bayesian optimization\nin model space. 
Crucially, we account for the uncertainty in the choice of model;\nour method is capable of using multiple models to represent its current belief about\nf and subsequently using this information for decision making. We argue, and\ndemonstrate empirically, that our approach automatically \ufb01nds suitable models for\nthe objective function, which ultimately results in more-ef\ufb01cient optimization.\n\n1\n\nIntroduction\n\nGlobal optimization of expensive, potentially gradient-free functions has long been a critical compo-\nnent of many complex problems in science and engineering. As an example, imagine that we want to\ntune the hyperparameters of a deep neural network in a self-driving car. That is, we want to maximize\nthe generalization performance of the machine learning algorithm, but the functional form of the\nobjective function f is unknown and even a single function evaluation is costly \u2014 it might take hours\n(or even days!) to train the network. These features render the optimization particularly dif\ufb01cult.\nBayesian optimization has nonetheless shown remarkable success on optimizing expensive gradient-\nfree functions [8, 1, 18]. Bayesian optimization works by maintaining a probabilistic belief about\nthe objective function and designing a so-called acquisition function that intelligently indicates the\nmost-promising locations to evaluate f next. Although the design of acquisition functions has been\nthe subject of a great deal of research, how to appropriately model f has received comparatively less\nattention [17], despite being a decisive factor for performance. In fact, this was considered the most\nimportant problem in Bayesian optimization by Mo\u02c7ckus [12], in a seminal work in the \ufb01eld:\n\n\u201cThe development of some system of a priori distributions suitable for different\nclasses of the function f is probably the most important problem in the application\nof [the] Bayesian approach to ... 
global optimization\u201d (Mo\u02c7ckus 1974, p. 404).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Importance of model selection in Bayesian optimization. Top left: one model represents the belief about the objective. Top right: a mixture of models selected by our approach represents the belief about f. Bottom: the acquisition function value (expected improvement) computed using the respective beliefs about the objective. ABO places the next observation at the optimum.\n\nIn this work, we develop a search mechanism for appropriate surrogate models (prior distributions) for the objective function f. Inspired by Malkomes et al. [11], our model-search procedure operates via Bayesian optimization in model space. Our method does not prematurely commit to a single model; instead, it uses several models to form a belief about the objective function and to plan where the next evaluation should be. Our adaptive model-averaging approach accounts for model uncertainty, which more realistically copes with the limited information available in practical Bayesian optimization applications. In Figure 1, we show two instances of Bayesian optimization where our goal is to maximize the red objective function f. Both instances use expected improvement as the acquisition function. The difference between them is the belief about f: using a single model (left) or combining several models using our automated Bayesian optimization (ABO) approach (right). A single model does not capture the nuances of the true function. In contrast, ABO captures the increasing linear trend of the true function and produces a credible interval which successfully captures the function\u2019s behavior. 
Consequently, ABO finds the optimum in the next iteration.\nFinally, we demonstrate empirically that our approach is consistently competitive with or outperforms other strong baselines across several domains: benchmark functions for global optimization, hyperparameter tuning of machine learning algorithms, reinforcement learning for robotics, and determining cosmological parameters of a physical model of the Universe.\n\n2 Bayesian optimization with multiple models\nSuppose we want to optimize an expensive, perhaps black-box function f : X \u2192 R on some compact set X \u2286 Rd. We may query f at any point x and observe a possibly noisy value y = f (x) + \u03b5. Our ultimate goal is to find the global optimum:\n\nxOPT = arg min_{x\u2208X} f (x),    (1)\n\nthrough a sequence of evaluations of the objective function f. This problem becomes particularly challenging when we may only make a limited number of function evaluations, representing a real-world budget B limiting the total cost of evaluating f. Throughout this text, we denote by D a set of gathered observations D = (X, y), where X is a matrix aggregating the input variables xi \u2208 X, and y is the respective vector of observed values yi = f (xi) + \u03b5.\nModeling the objective function. Assume we are given a prior distribution over the objective function p(f ) and, after observing new information, we have means of updating our belief about f using Bayes\u2019 rule:\n\np(f | D) = p(D | f ) p(f ) / p(D).    (2)\n\nThe posterior distribution above is then used for decision making, i.e., selecting the x we should query next. When dealing with a single model, the posterior distribution (2) suffices. Here, however, we want to make our model of f more flexible, accounting for potential misspecification. 
Suppose we are given a collection of probabilistic models {Mi} that offer plausible explanations for the data. Each model M is a set of probability distributions indexed by a parameter \u03b8 from the corresponding model\u2019s parameter space \u0398M. With multiple models, we need a means of aggregating their beliefs. We take a fully Bayesian approach and use the model evidence (or marginal likelihood), the probability of generating the observed data given a model M,\n\np(y | X, M) = \u222b p(y | X, \u03b8, M) p(\u03b8 | M) d\u03b8,    (3)\n\nas the key quantity for measuring the fit of each model to the data. The evidence integrates over the parameters \u03b8 to compute the probability of the model generating the observed data under a hyperprior distribution p(\u03b8 | M). Given (3), one can easily compute the model posterior,\n\np(M | D) = p(y | X, M) p(M) / p(y | X) = p(y | X, M) p(M) / \u2211i p(y | X, Mi) p(Mi),    (4)\n\nwhere p(M) represents a prior probability distribution over the models. The model posterior gives us a principled way of combining the beliefs of all models. Our model of f can now be summarized with the following model-marginalized posterior distribution:\n\np(f | D) = \u2211i p(Mi | D) \u222b p(f | D, \u03b8, Mi) p(\u03b8 | D, Mi) d\u03b8,    (5)\n\nwhere each inner integral is precisely p(f | D, Mi). Note that (5) takes into consideration all plausible models {Mi} and the integral p(f | D, Mi) accounts for the uncertainty in each model\u2019s hyperparameters \u03b8 \u2208 \u0398Mi. Unfortunately, the latter is often intractable, and we will discuss means of approximating it in Section 4.2. Next, we describe how to use the model-marginalized posterior to intelligently optimize the objective function.\nSelecting where to evaluate next. Given our belief about f, we want to use this information to select which point x we want to evaluate next. 
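To make (4) concrete, the model posterior can be computed stably in log space from the per-model log evidences. The following is a minimal numerical sketch; the helper name `model_posterior` and the example numbers are ours, not from the paper's code:

```python
import math

def model_posterior(log_evidences, log_priors=None):
    """Eq. (4): p(M_i | D) proportional to p(y | X, M_i) p(M_i),
    computed in log space with the log-sum-exp trick for stability."""
    n = len(log_evidences)
    if log_priors is None:
        # uniform prior over models, p(M_i) = 1/n
        log_priors = [math.log(1.0 / n)] * n
    scores = [le + lp for le, lp in zip(log_evidences, log_priors)]
    m = max(scores)
    # log of the normalizer p(y | X) = sum_i p(y | X, M_i) p(M_i)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]

# three hypothetical models with very different evidences
weights = model_posterior([-10.2, -3.1, -35.0])
```

With a uniform model prior, the weights reduce to a softmax over the log evidences; these are exactly the weights used to average the models' beliefs in (5).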
This is typically done by maximizing an acquisition function \u03b1 : X \u2192 R. Instead of solving (1) directly, we optimize the proxy (and simpler) problem\n\nx\u2217 = arg max_{x\u2208X} \u03b1(x; D).    (6)\n\nWe use expected improvement (EI) [12] as our acquisition function. Suppose that f\u2032 is the minimal value observed so far.1 EI selects the point x that, in expectation, improves upon f\u2032 the most:\n\n\u03b1EI(x; D, M) = Ey[max(f\u2032 \u2212 y, 0) | x, D, M].    (7)\n\nNote that if p(y | x, D, M) is a Gaussian distribution (or can be approximated as one), the expected improvement can be computed in closed form. Usually, acquisition functions are evaluated for a given model choice M. As before, we want to incorporate multiple models in this framework. For EI, we can easily take all models into account as follows:\n\n\u03b1EI(x; D) = Ey,M[max(f\u2032 \u2212 y, 0) | x, D] = EM[\u03b1EI(x; D, M)].    (8)\n\nWe could also derive similar results for other acquisition functions such as probability of improvement [9] and the GP upper confidence bound (GP-UCB) [19].\n\n3 Automated model selection for fixed-size datasets\n\nBefore introducing our automated method for Bayesian optimization, we need to review a previously proposed method for automated model selection on fixed-size datasets. We begin with a brief introduction to Gaussian processes and a description of the model space we adopted in this paper.\nGaussian process models. We take a standard nonparametric approach and place a Gaussian process (GP) prior distribution on f, p(f ) = GP(f ; \u00b5, K), where \u00b5 : X \u2192 R is a mean function and K : X \u00d7 X \u2192 R is a positive-semidefinite covariance function or kernel. Both \u00b5 and K may have hyperparameters, which we conveniently concatenate into a single vector \u03b8. 
1 We make the simplifying assumption that the noise level is small, thus f\u2032 \u2248 mini \u00b5f|D(xi) and y(x) \u2248 f (x).\n\nTo connect to our framework, a GP model M comprises \u00b5, K, and a prior over its associated hyperparameters p(\u03b8). Thanks to the elegant marginalization properties of the Gaussian distribution, computing the posterior distribution p(f | \u03b8, D) can be done in closed form if we assume a standard Gaussian observation model, \u03b5 \u223c N(0, \u03c32). For a more detailed introduction to GPs, see [16].\nGaussian processes are extremely powerful modeling tools. Their success, however, heavily depends on an appropriate choice of the mean function \u00b5 and covariance function K. In some cases, a domain expert might have an informative opinion about which GP model could be more fruitful. Here, however, we want to avoid human intervention and propose an automatic approach.\nSpace of models. First, we need a space of GP models that is general enough to explain virtually any dataset. We adopt the generative kernel grammar of [2] due to its ability to create arbitrarily complex models. We start with a set of so-called base (one-dimensional) kernels, such as the common squared exponential (SE) and rational quadratic (RQ) kernels. Then, we create new and potentially more complex kernels by summation and multiplication, over individual dimensions, of the base kernels. This lets us create kernels over multidimensional inputs. As a result, we have a space of kernels that allows one to search for appropriate structures (different kernel choices) as well as relevant features (subsets of the input). Now, we need an efficient method for searching this space for promising models. Fortunately, this was accomplished by the work of [11], which we summarize next.\nBayesian optimization for model search. Suppose we are given a space of probabilistic models M such as the above-cited generative kernel grammar. 
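As a rough illustration of such a grammar, new kernels can be generated by randomly composing base kernels over input dimensions, with a geometric number of operations. This is a toy string-based sketch under our own naming, not the authors' implementation:

```python
import random

# Illustrative base kernels of the compositional grammar; "SE_0" denotes a
# squared exponential kernel acting on input dimension 0, and so on.
BASE_KERNELS = ["SE", "RQ"]

def random_kernel(n_dims, p_stop=1/3, rng=random):
    """Random walk in the kernel grammar: start from a base kernel and
    repeatedly apply a sum or product with another base kernel, stopping
    with probability p_stop after each operation (geometric distribution)."""
    k = "{}_{}".format(rng.choice(BASE_KERNELS), rng.randrange(n_dims))
    while rng.random() > p_stop:  # continue with probability 1 - p_stop
        op = rng.choice(["+", "*"])
        base = "{}_{}".format(rng.choice(BASE_KERNELS), rng.randrange(n_dims))
        k = "({} {} {})".format(k, op, base)
    return k
```

The geometric termination probability of 1/3 here mirrors the random initialization of candidate models described in Section 4.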
As mentioned before, the key quantity for model comparison in a Bayesian framework is the model evidence (3). Previous work has shown that we can search for promising models M \u2208 M by viewing the evidence as a function g : M \u2192 R to be optimized [11]. Their method consists of a Bayesian optimization approach to model selection (BOMS), in which we try to find the optimal model\n\nMOPT = arg max_{M\u2208M} g(M; D),    (9)\n\nwhere g(M; D) is the (log) model evidence: g(M; D) = log p(y | X, M). Two key aspects of their method deserve special attention: their unusual GP prior, p(g) = GP(g; \u00b5g, Kg), where the mean and covariance functions are appropriately defined over the model space M; and their heuristic for traversing M by maintaining a set of candidate models C. The precise mechanism for traversing the space of models is not particularly relevant for our exposition, but the fact that C changes as we search for better models is. Due to limited space, we refer the reader to the original work for more information. Nevertheless, it is important to note that their approach was shown to be more efficient than previous methods.\n\n4 Automating Bayesian optimization with Bayesian optimization\n\nHere, we present our automated Bayesian optimization (ABO) algorithm. ABO is a two-level Bayesian optimization procedure. The \u201couter level\u201d solves the standard Bayesian optimization problem, where we want to search for the optimum of the objective function f. Inside the Bayesian optimization loop, we use a second, \u201cinner\u201d Bayesian optimization, whose goal is to search for appropriate models {Mi} for the objective function f. The inner optimization seeks models maximizing the model evidence, as in BOMS (Section 3). The motivation is to refine the set of models {Mi} before choosing where we want to query the (expensive) objective function f next. Given a set of models, we can use
Given a set of models, we can use\nthe methodology presented in Section 2 to perform Bayesian optimization with multiple models.\nIn the next subsection, we will describe the inner Bayesian optimization method which we refer to\nas active BOMS (ABOMS). Before going to the second Bayesian optimization level, we summarize\nABO in Algorithm 1. First, we initialize our set of promising models {Mi} with random models\nchosen from the grammar of kernel, same used in [2]. To select these models, we perform random\nwalks from the the empty kernel and repeatedly apply a random number of grammatical operations.\nThe number of operations is sampled from a geometric distribution with termination probability of\n3. Then, at each iteration: we update all models with current data, computing the corresponding\n1\nmodel evidence of each model; use ABOMS (the inner model-search optimization) to include more\npromising candidate models in {Mi}; exclude all models that are unlikely to explain the current data,\nthose with p(M | D) < 10\u22124; sample the function at location x\u2217 using (8) and all models {Mi};\n\ufb01nally, we evaluate y\u2217 = f (x\u2217) + \u03b5 and include this new observation in our dataset.\n\n4\n\n\fAlgorithm 1 Automated Bayesian Optimization\n\nInput: function f, budget B, initial data D\n{Mi} \u2190 Initial set of promising models\nrepeat\n{Mi} \u2190 update models ({Mi},D)\n{Mi} \u2190 ABOMS({Mi},D)\np(M | D) \u2190 compute model posterior\ndiscard irrelevant models p(Mi | D) < 10\u22124\nx\u2217 \u2190 arg maxx\u2208X \u03b1EI(x ;D).\ny\u2217 \u2190 f (x\u2217) + \u03b5\nD \u2190 D \u222a {(x\u2217, y\u2217)}\n\nuntil budget B is depleted\n\n4.1 Active Bayesian optimization for model search\n\nThe critical component of ABO is the inner optimization procedure that searches for suitable models\nto the objective function: the active Bayesian optimization for model search (ABOMS). 
Notice that the main challenge is that ABOMS is nested in a Bayesian optimization loop, meaning that both the data and the models change as we perform more outer Bayesian optimization iterations.\nSuppose we have already gathered some observations D of the objective function f, and that we use the previously proposed BOMS (Section 3) as the inner model-search procedure. Inside BOMS, we try different models, gathering observations of the (log) model evidence, g(M; D) = log p(y | X, M). We denote by Dg = {Mj, g(Mj; D)} the observations of the inner Bayesian optimization. After one loop of the outer Bayesian optimization, we obtain new data D\u2032 = D \u222a {(x\u2217, y\u2217)}. Now, the model evidence of all previously evaluated models in Dg changes, since g(Mj; D) \u2260 g(Mj; D\u2032) for all j. As a result, we would have to retrain all models in Dg to correctly compare them. Recall that there are good models in Dg for explaining the objective function f. These models will be passed to the outer Bayesian optimization, where they will be updated; ultimately, we want to provide outstanding suggestions x\u2217 for where to query f next, so they need to be retrained. A large portion of the tested models in Dg, however, are not appropriate for modeling f; in fact, they can be totally ignored by the outer optimization. Yet these \u201cbad\u201d models can help guide the search toward more-promising regions of model space. How can we retain information from previously evaluated models without resorting to exhaustive retraining?\nOur answer is to modify BOMS in two ways. First, we place a GP on the normalized model evidence, g(M; D) = log p(y | X, M)/|D|, which lets us compare models across iterations. Second, we assume that each evidence evaluation is corrupted by noise, the variance of which depends on the number of data points used to compute it: the more data we use, the more accurate our estimate, and the lower the noise. 
More specifically, we use the same GP prior as [11], p(g) = GP(g; \u00b5g, Kg), where \u00b5g : M \u2192 R is a constant mean function and Kg : M2 \u2192 R is the \u201ckernel kernel,\u201d defined as a squared exponential kernel that uses the (averaged) Hellinger distance between the inputs as opposed to the standard \u21132 norm (see the original paper for more details). Our observation model, however, assumes that the observations of the normalized model evidence are corrupted by heteroscedastic noise:\n\nyg(M; Dn) = g(M; Dn)/n + \u03b5n.\n\nTo choose the amount of noise, we observe that, using the chain rule, the marginal likelihood can be written as log p(y | X, M) = \u2211i log p(yi | xi, {(xj, yj) | j < i}, M), which is the sum of the marginal predictive log likelihoods for the points in D. When we divide log p(y | X, M) by |D| = n, we can interpret the result as an estimate of the average predictive log marginal likelihood.2 Therefore, if\n\nlog p(y | X, M)/n \u2248 E[log p(y\u2217 | x\u2217, D, M) | M],    (10)\n\nthen the variance of this estimate with n measurements is\n\nVar[log p(yi | xi, {(xj, yj) | j < i}, M)]/n,\n\nwhich shrinks like \u03c32g/n for a small constant \u03c3g (e.g., 0.5), and for large n it goes to 0. This mechanism gracefully allows us to condition on the history of all previously proposed models during the search. By modeling earlier evidence computations as noisier, we avoid recomputing the model evidence of previous models every round, but we still make the search for good models better informed.\n\n2Note that the training data are not independent, since we are choosing the locations x, and we are not assuming that n \u2192 \u221e.\n\n4.2 Implementation\n\nIn practice, several distributions presented above are often intractable for GPs. 
Now, we discuss how to efficiently approximate these quantities. First, instead of using just a delta approximation to the hyperparameter posterior p(\u03b8 | D, M), e.g., MLE/MAP, we use a Laplace approximation; i.e., we make a second-order Taylor expansion around its mode \u02c6\u03b8 = arg max\u03b8 log p(\u03b8 | D, M). This results in a multivariate Gaussian approximation:\n\np(\u03b8 | D, M) \u2248 N(\u03b8; \u02c6\u03b8, \u03a3), where \u03a3\u22121 = \u2212\u22072 log p(\u03b8 | D, M)|\u03b8=\u02c6\u03b8.\n\nConveniently, the Laplace approximation also gives us a means of approximating the model evidence:\n\nlog p(y | X, M) \u2248 log p(y | X, \u02c6\u03b8, M) + log p(\u02c6\u03b8 | M) \u2212 (1/2) log det \u03a3\u22121 + (d/2) log 2\u03c0,\n\nwhere d is the dimension of \u03b8. The above approximation can be interpreted as rewarding explaining the data well while penalizing model complexity [13, 15].\nNext, consider the posterior distribution p(f | D, M), which is an integral over the model\u2019s hyperparameters. This distribution is intractable, even with our Gaussian approximation to the hyperparameter posterior p(\u03b8 | D, M) \u2248 N(\u03b8; \u02c6\u03b8, \u03a3). We use a general approximation technique originally proposed by [14] (Section 4) in the context of Bayesian quadrature. This approach assumes that the posterior mean of p(f\u2217 | x\u2217, D, \u03b8, M) is affine in \u03b8 around \u02c6\u03b8 and that the GP covariance is constant. Let\n\n\u00b5\u2217(\u03b8) = E[f\u2217 | x\u2217, D, \u03b8, M] and \u03bd\u2217(\u03b8) = Var[f\u2217 | x\u2217, D, \u03b8, M]\n\nbe the posterior predictive mean and variance of f\u2217. The result of this approximation is that the posterior distribution of f\u2217 is approximated by\n\np(f\u2217 | x\u2217, D, M) \u2248 N(f\u2217; \u00b5\u2217(\u02c6\u03b8), \u03bd\u2217(\u02c6\u03b8) + \u03c32AFFINE), where \u03c32AFFINE = [\u2207\u00b5\u2217(\u02c6\u03b8)]\u22a4 \u03a3 [\u2207\u00b5\u2217(\u02c6\u03b8)].    (11)\n\nThis approach was shown to be a good alternative for propagating the uncertainty in the hyperparameters [14]. 
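For intuition, a one-dimensional version of the Laplace approximation to a log-integral of this kind can be sketched with a finite-difference second derivative. This is an illustrative helper under our own naming, not the paper's code:

```python
import math

def laplace_log_integral(log_f, theta_hat, h=1e-4):
    """1-D Laplace approximation:
    log integral exp(log_f(t)) dt ~ log_f(t_hat) + 0.5 log 2*pi
                                    - 0.5 log(-log_f''(t_hat)),
    where t_hat is assumed to be the mode of log_f and the second
    derivative is estimated by central finite differences."""
    d2 = (log_f(theta_hat + h) - 2.0 * log_f(theta_hat)
          + log_f(theta_hat - h)) / (h * h)
    return (log_f(theta_hat) + 0.5 * math.log(2 * math.pi)
            - 0.5 * math.log(-d2))

# Gaussian sanity check: integral of exp(-t^2/8) dt = 2*sqrt(2*pi);
# for a quadratic log-integrand the Laplace approximation is exact.
approx = laplace_log_integral(lambda t: -t * t / 8.0, theta_hat=0.0)
```

For a GP model the log-integrand would be the unnormalized log hyperparameter posterior, and the Hessian is taken over the full vector \u03b8 rather than a scalar.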
Finally, given the Gaussian approximations above (11), we use standard techniques to analytically approximate the predictive distribution:\n\np(y\u2217 | x\u2217, D, M) = \u222b p(y\u2217 | f\u2217) p(f\u2217 | x\u2217, D, M) df\u2217.\n\nOur code and data will be available online: https://github.com/gustavomalkomes/abo.\n\n5 Related Work\n\nOur approach is inspired by some recent developments in the field of automated model selection [11, 2, 6]. Here, we take these ideas one step further and consider automated model selection while actively acquiring new data.\nGardner et al. [3] also tackled the problem of model selection in an active learning context, but with a different goal. Given a fixed set of candidate models, the authors proposed a method for gathering data to quickly identify which model best explains the data. Here, our ultimate goal is to perform global optimization (1) when we can dynamically change our set of models. In future work, it would be interesting to examine whether it may be possible to combine our ideas with their proposed method to actively learn in model space.\nMore recently, Gardner et al. [4] developed an automated model search for Bayesian optimization similar to our method. Their approach, however, uses an MCMC strategy for sampling new promising models, whereas we adapt the Bayesian optimization search proposed by Malkomes et al. [11]. We will discuss further differences between our approach and their MCMC method in the next section.\n\nFigure 2: (a) SVM, (b) LDA, (c) Logistic Regression, (d) Neural Network Boston, (e) Robot pushing 3D, (f) Cosmological constants. Averaged minimum observed function value and standard error of all methods for several objective functions. 
For better visualization, we omit the first 10 function evaluations, since they are usually much higher than the final observations.\n\n6 Empirical Results\n\nWe validate our approach against several optimization alternatives and across several domains. Our first baseline is a random strategy that selects twice as many locations as the other methods; we refer to this strategy as RANDOM 2\u00d7 [10]. We also consider a competitive Bayesian optimization implementation which uses a single non-isotropic squared exponential kernel (SE), expected improvement as the acquisition function, and all the approximations described in Section 4.2. We then considered two more baselines that represent the uncertainty about the unknown function through a combination of multiple models. One baseline uses the same collection of predefined models throughout its execution; we refer to this approach as the bag of models (BOM). The other is an adaptation of the method proposed in [4], here referred to as MCMC, which, similar to ABO, is allowed to dynamically select more models every iteration. Instead of using the additive class of models proposed in the original work, we adapted their Metropolis\u2013Hastings algorithm to the more-general compositional grammar proposed by Duvenaud et al. [2], which is also used by our method. This choice lets us compare which adaptive strategy performs better in practice. Specifically, given an initial model M, the MCMC proposal distribution randomly selects a neighboring model M\u2032 from the grammar. Then we compute the acceptance probability as in [4].\nAll multiple-model strategies (BOM, MCMC, and ABO) start with the same selection of models (see Section 4) and aim to maximize the model-marginalized expected improvement (8). 
Both adaptive algorithms (ABO and MCMC) are allowed to perform five model-evidence computations before each function evaluation; ABO queries five new models and MCMC performs five new proposals. In our experiments, we limited the number of models to 50, always keeping those with the highest model evidence. Model choice and acquisition functions apart, we kept all configurations the same. All methods used L-BFGS to optimize each model\u2019s hyperparameters. To avoid bad local minima, we perform two restarts, each beginning from a sample of p(\u03b8 | M). All the approximations described in Section 4.2 were also used. We maximized the acquisition functions by densely sampling 1000d^2 points from a d-dimensional low-discrepancy Sobol sequence, and starting MATLAB fmincon (a\n\nTable 1: Results for the average gap performance across 20 repetitions for different test functions and methods. RANDOM 2\u00d7 (R 2\u00d7) results are averaged across 1000 experiments. 
Numbers that are not significantly different from the highest average gap for each function are bolded (one-sided paired Wilcoxon signed rank test, 5% significance level).\n\nfunction (d) | R 2\u00d7 | SE | BOM | MCMC | ABO\nSynthetic objectives:\nAckley 2d (2) | 0.422 | 0.717 | 0.984 | 0.988 | 0.980\nBeale (2) | 0.725 | 0.541 | 0.644 | 0.596 | 0.688\nBranin (2) | 0.743 | 1.000 | 0.950 | 0.996 | 0.998\nEggholder (2) | 0.461 | 0.516 | 0.529 | 0.546 | 0.579\nSix-Hump Camel (2) | 0.673 | 0.723 | 0.988 | 0.992 | 0.998\nDrop-Wave (2) | 0.458 | 0.496 | 0.421 | 0.447 | 0.481\nGriewank 2d (2) | 0.669 | 0.924 | 0.954 | 0.941 | 0.964\nRastrigin (2) | 0.538 | 0.410 | 0.832 | 0.827 | 0.850\nRosenbrock (2) | 0.787 | 1.000 | 0.999 | 0.993 | 0.999\nShubert (2) | 0.337 | 0.384 | 0.374 | 0.332 | 0.481\nHartmann (3) | 0.682 | 1.000 | 0.970 | 0.999 | 1.000\nLevy (3) | 0.669 | 0.774 | 0.913 | 0.942 | 0.971\nRastrigin 4d (4) | 0.414 | 0.261 | 0.823 | 0.715 | 0.821\nAckley 5d (5) | 0.299 | 0.736 | 0.409 | 0.886 | 0.809\nGriewank 5d (5) | 0.605 | 0.971 | 0.756 | 0.974 | 0.968\nmean gap | 0.566 | 0.697 | 0.770 | 0.812 | 0.839\nmedian gap | 0.605 | 0.723 | 0.832 | 0.941 | 0.964\nReal-world objectives:\nSVM (3) | 0.903 | 0.912 | 0.840 | 0.938 | 0.956\nLDA (3) | 0.939 | 0.950 | 0.925 | 0.950 | 0.950\nLogistic regression (4) | 0.928 | 0.774 | 0.899 | 0.936 | 0.994\nRobot pushing 3d (3) | 0.815 | 0.927 | 0.878 | 0.967 | 0.935\nRobot pushing 4d (4) | 0.824 | 0.748 | 0.619 | 0.668 | 0.715\nNeural network Boston (4) | 0.491 | 0.594 | 0.703 | 0.640 | 0.757\nNeural network cancer (4) | 0.845 | 0.645 | 0.682 | 0.773 | 0.749\nCosmological constants (9) | 0.739 | 0.848 | 0.859 | 0.984 | 0.999\nmean gap | 0.810 | 0.800 | 0.801 | 0.857 | 0.882\nmedian gap | 0.834 | 0.811 | 0.850 | 0.937 | 0.943\n\nlocal optimizer) from the sampled point with the highest value. Each experiment was repeated 20 times with five random initial examples, which were the same for all Bayesian optimization methods. RANDOM 2\u00d7 results were averaged across 1000 repetitions.\nBenchmark functions for global optimization. 
Our first set of experiments uses test functions commonly used as benchmarks for optimization [20]. We adopted a setup similar to previous work [5] but included more test functions. The goal is to find the global minimum of each test function given a limited number of function evaluations. We provide more information about the chosen functions in the supplementary material. The maximum number of function evaluations was limited to 10 times the dimensionality of the function domain being optimized. We report the gap measure [7], defined as\n\ngap = (f (xfirst) \u2212 f (xbest)) / (f (xfirst) \u2212 f (xOPT)),\n\nwhere f (xfirst) is the minimum function value among the first initial random points, f (xbest) is the best value found by the method, and f (xOPT) is the optimum.\nTable 1 (top) shows the results for different functions and methods. For each test function, we perform a one-sided Wilcoxon signed rank test at the 5% significance level, comparing each method with the one that had the highest average performance. All results that are not significantly different from the highest are marked in bold. First, note that RANDOM 2\u00d7 performs poorly on these synthetically constructed \u201chard\u201d functions. Then, observe that the overall performance of all multi-model methods is higher than that of the single-GP baseline, with ABO leading these algorithms with respect to the mean and median gap performance over all functions. In fact, ABO\u2019s performance is comparable to the best method for 11 out of 15 functions.\n\nFigure 3: Average gap across the eight real-world objective functions vs. fraction of the total number of function evaluations. Here, we display the performance of random search (R 1\u00d7) for reference.\n\nReal-world optimization functions. To further investigate the importance of model search, we consider a second set of functions used in recent publications. 
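For reference, the gap measure defined above is straightforward to compute; the following sketch uses our own helper name:

```python
def gap(f_first, f_best, f_opt):
    """Gap measure [7] for minimization: 1.0 means the optimum was found,
    0.0 means no improvement over the best initial random point."""
    if f_first == f_opt:  # an initial sample already hit the optimum
        return 1.0
    return (f_first - f_best) / (f_first - f_opt)
```

For example, starting from a best initial value of 10, reaching 4 when the optimum is 2 gives a gap of 0.75.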
Our goal was to select a diverse and challenging set of functions that might better demonstrate the performance of these algorithms in real applications. More information about these functions is given in the supplementary material.

We show the gap measure for the second set of experiments in Table 1 (bottom) and perform the same statistical test as before. For computing the gap measure in these experiments, when the true global minimum is unknown, we used the minimum observed value across all experiments as a proxy for the optimal value. In Figure 3, we show the average gap measure across all eight test functions as a function of the total number of function evaluations allowed. In Figure 2, we show the average minimum observed function value and standard error of all methods for 6 of the 8 functions (see the supplementary material for the other two).

With more practical objective functions, the importance of model search becomes clearer. ABO either outperforms the other methods (4 out of the 8 datasets) or achieves the lowest objective function value. Figure 3 also shows that ABO quickly advances in the search for the global minimum: on average, the gap measure exceeds 0.8 after half of the budget. Interestingly, RANDOM 2x also performs well on 2 of these 8 datasets; these are the problems on which all methods perform similarly, suggesting that these functions are easier to optimize than the others.

Naturally, training more models and performing an extra search to dynamically select models requires more computation than running standard Bayesian optimization with a single model. In our implementation, which is not optimized for speed, the median wall-clock time across all test functions for updating and searching the five new models was 65 and 41 seconds, respectively, for MCMC and ABO. Note that the model update is what dominates this procedure for both methods, with MCMC tending to select more complex models than ABO.
In practice, one could perform this step in parallel with the expensive objective function evaluation, requiring no additional overhead beyond the cost of optimizing the model-marginal acquisition function, which can also be adjusted by the user.

7 Conclusion

We introduced a novel automated Bayesian optimization approach that uses multiple models to represent its belief about an objective function and subsequently decides where to query next. Our method automatically and efficiently searches for better models as more data are gathered. Empirical results show that the proposed algorithm often outperforms the baselines for several different objective functions across multiple applications. We hope that this work represents a step towards a fully automated system for Bayesian optimization that can be used by nonexperts on arbitrary objectives.

Acknowledgments

GM and RG were supported by the National Science Foundation (NSF) under award number IIA-1355406. GM was also supported by the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES).

References

[1] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Conference on Neural Information Processing Systems (NIPS), 2011.

[2] David Duvenaud, James R. Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning (ICML), 2013.

[3] Jacob R. Gardner, Gustavo Malkomes, Roman Garnett, Kilian Q. Weinberger, Dennis Barbour, and John P. Cunningham. Bayesian active model selection with an application to automated audiometry. In Conference on Neural Information Processing Systems (NIPS), 2015.

[4] Jacob R.
Gardner, Chuan Guo, Kilian Q. Weinberger, Roman Garnett, and Roger Grosse. Discovering and exploiting additive structure for Bayesian optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[5] Javier González, Michael A. Osborne, and Neil D. Lawrence. GLASSES: relieving the myopia of Bayesian optimisation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[6] Roger Grosse, Ruslan Salakhutdinov, William Freeman, and Joshua Tenenbaum. Exploiting compositionality to explore a large space of model structures. In Conference on Uncertainty in Artificial Intelligence (UAI), 2012.

[7] Deng Huang, Theodore T. Allen, William I. Notz, and Ning Zeng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34:441-466, 2006.

[8] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455-492, 1998.

[9] Harold J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86:97-106, 1964.

[10] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1-52, 2018. URL http://jmlr.org/papers/v18/16-558.html.

[11] Gustavo Malkomes, Charles Schaff, and Roman Garnett. Bayesian optimization for automated model selection. In Conference on Neural Information Processing Systems (NIPS), 2016.

[12] Jonas Močkus. On Bayesian methods for seeking the extremum, pages 400-404. Springer, 1974.

[13] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[14] Michael A.
Osborne, David Duvenaud, Roman Garnett, Carl E. Rasmussen, Stephen J. Roberts, and Zoubin Ghahramani. Active learning of model evidence using Bayesian quadrature. In Conference on Neural Information Processing Systems (NIPS), 2012.

[15] Adrian E. Raftery. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika, 83(2):251-266, 1996.

[16] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[17] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148-175, 2016.

[18] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Conference on Neural Information Processing Systems (NIPS), 2012.

[19] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.

[20] Sonja Surjanovic and Derek Bingham. Optimization test functions and datasets, 2017. URL http://www.sfu.ca/~ssurjano/optimization.html.