{"title": "Maximizing acquisition functions for Bayesian optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9884, "page_last": 9895, "abstract": "Bayesian optimization is a sample-efficient approach to global optimization that relies on theoretically motivated value heuristics (acquisition functions) to guide its search process. Fully maximizing acquisition functions produces the Bayes' decision rule, but this ideal is difficult to achieve since these functions are frequently non-trivial to optimize. This statement is especially true when evaluating queries in parallel, where acquisition functions are routinely non-convex, high-dimensional, and intractable. We first show that acquisition functions estimated via Monte Carlo integration are consistently amenable to gradient-based optimization. Subsequently, we identify a common family of acquisition functions, including EI and UCB, whose characteristics not only facilitate but justify use of greedy approaches for their maximization.", "full_text": "Maximizing acquisition functions\n\nfor Bayesian optimization\n\nJames T. Wilson\u21e4\n\nImperial College London\n\nFrank Hutter\n\nUniversity of Freiburg\n\nMarc Peter Deisenroth\nImperial College London\n\nPROWLER.io\n\nAbstract\n\nBayesian optimization is a sample-ef\ufb01cient approach to global optimization that\nrelies on theoretically motivated value heuristics (acquisition functions) to guide\nits search process. Fully maximizing acquisition functions produces the Bayes\u2019\ndecision rule, but this ideal is dif\ufb01cult to achieve since these functions are fre-\nquently non-trivial to optimize. This statement is especially true when evaluating\nqueries in parallel, where acquisition functions are routinely non-convex, high-\ndimensional, and intractable. We \ufb01rst show that acquisition functions estimated\nvia Monte Carlo integration are consistently amenable to gradient-based optimiza-\ntion. 
Subsequently, we identify a common family of acquisition functions, including EI and UCB, whose properties not only facilitate but justify use of greedy approaches for their maximization.

1 Introduction

Bayesian optimization (BO) is a powerful framework for tackling complicated global optimization problems [32, 40, 44]. Given a black-box function f : 𝒳 → 𝒴, BO seeks to identify a maximizer x* ∈ argmax_{x ∈ 𝒳} f(x) while simultaneously minimizing incurred costs. Recently, these strategies have demonstrated state-of-the-art results on many important, real-world problems ranging from material sciences [17, 57], to robotics [3, 7], to algorithm tuning and configuration [16, 29, 53, 56].

From a high-level perspective, BO can be understood as the application of Bayesian decision theory to optimization problems [11, 14, 45]. One first specifies a belief over possible explanations for f using a probabilistic surrogate model and then combines this belief with an acquisition function L to convey the expected utility for evaluating a set of queries X. In theory, X is chosen according to Bayes' decision rule as L's maximizer by solving an inner optimization problem [19, 42, 59]. In practice, challenges associated with maximizing L greatly impede our ability to live up to this standard. Nevertheless, this inner optimization problem is often treated as a black box unto itself. Failing to address this challenge leads to a systematic departure from BO's premise and, consequently, to consistent deterioration in achieved performance.

To help reconcile theory and practice, we present two modern perspectives for addressing BO's inner optimization problem that exploit key aspects of acquisition functions and their estimators. First, we clarify how sample path derivatives can be used to optimize a wide range of acquisition functions estimated via Monte Carlo (MC) integration.
Second, we identify a common family of submodular acquisition functions and show that its constituents can generally be expressed in a more computer-friendly form. These acquisition functions' properties enable greedy approaches to efficiently maximize them with guaranteed near-optimal results. Finally, we demonstrate through comprehensive experiments that these theoretical contributions directly translate to reliable and, often, substantial performance gains.

*Correspondence to j.wilson17@imperial.ac.uk

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Algorithm 1 BO outer-loop (joint parallelism)
 1: Given model M, acquisition L, and data D
 2: for t = 1, . . . , T do
 3:   Fit model M to current data D
 4:   Set q = min(qmax, T − t)
 5:   Find X ∈ argmax_{X′ ∈ 𝒳^q} L(X′)        ◁ inner optimization problem
 6:   Evaluate y ← f(X)
 7:   Update D ← D ∪ {(x_k, y_k)}_{k=1}^{q}
 8: end for

Algorithm 2 BO outer-loop (greedy parallelism)
 1: Given model M, acquisition L, and data D
 2: for t = 1, . . . , T do
 3:   Fit model M to current data D
 4:   Set X ← ∅
 5:   for j = 1, . . . , min(qmax, T − t) do
 6:     Find x_j ∈ argmax_{x ∈ 𝒳} L(X ∪ {x})
 7:     X ← X ∪ {x_j}
 8:   end for
 9:   Evaluate y ← f(X)
10:   Update D ← D ∪ {(x_i, y_i)}_{i=1}^{q}
11: end for

[Figure 1 panels: GP posterior belief and expected utility over x ∈ R; CPU seconds for the inner optimization vs. number of prior observations at parallelism q ∈ {2, 4, 8, 16, 32}.]

Figure 1: (a) Pseudo-code for standard BO's "outer-loop" with parallelism q; the inner optimization problem is boxed in red. (b–c) GP-based belief and expected utility (EI), given four initial observations '•'. The aim of the inner optimization problem is to find the optimal query. (d) Time to compute 2^14 evaluations of MC q-EI using a GP surrogate for varied observation counts and degrees of parallelism. Runtimes fall off at the final step because q decreases to accommodate evaluation budget T = 1,024.

2 Background

Bayesian optimization relies on both a surrogate model M and an acquisition function L to define a strategy for efficiently maximizing a black-box function f. At each "outer-loop" iteration (Figure 1a), this strategy is used to choose a set of queries X whose evaluation advances the search process.
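As a concrete illustration of the two outer-loop variants, the sketch below contrasts Algorithm 1's joint batch selection with Algorithm 2's greedy construction. The acquisition here is a toy monotone submodular coverage function; the candidate names and coverage sets are invented for the example, and exhaustive search stands in for a real joint maximizer.

```python
import itertools

# Toy monotone submodular acquisition: each candidate query "covers" a set of
# posterior samples on which it would improve; a batch's value is the size of
# the union of its members' sets. Candidates and sets are invented.
coverage = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 5},
}

def acq(batch):
    """Coverage value L(X) of a batch of candidate names."""
    return len(set().union(*[coverage[x] for x in batch])) if batch else 0

def greedy_batch(candidates, acq, q):
    """Algorithm 2's inner loop: grow X one greedy maximizer at a time."""
    X = []
    for _ in range(q):
        X.append(max(candidates, key=lambda x: acq(X + [x])))
    return X

def joint_batch(candidates, acq, q):
    """Algorithm 1's inner problem, solved exhaustively (only viable for toys)."""
    return max(itertools.combinations(candidates, q), key=lambda X: acq(list(X)))

greedy = greedy_batch(list(coverage), acq, 2)
joint = list(joint_batch(list(coverage), acq, 2))
# Nemhauser-style bound for monotone submodular L: greedy is (1 - 1/e)-optimal.
assert acq(greedy) >= (1 - 1 / 2.718281828) * acq(joint)
```

For monotone submodular acquisitions this greedy construction carries the (1 − 1/e) near-optimality guarantee, which is the structural fact the paper's greedy approach leans on.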
This section reviews related concepts and closes with a discussion of the associated inner optimization problem. For an in-depth review of BO, we defer to the recent survey [52].

Without loss of generality, we assume BO strategies evaluate q designs X ∈ R^{q×d} in parallel so that setting q = 1 recovers purely sequential decision-making. We denote available information regarding f as D = {(x_i, y_i)} and, for notational convenience, assume noiseless observations y = f(X). Additionally, we refer to L's parameters (such as an improvement threshold) as ψ and to M's parameters as ζ. Henceforth, direct reference to these terms will be omitted where possible.

Surrogate models  A surrogate model M provides a probabilistic interpretation of f whereby possible explanations for the function are seen as draws f^k ∼ p(f | D). In some cases, this belief is expressed as an explicit ensemble of sample functions [28, 54, 60]. More commonly however, M dictates the parameters θ of a (joint) distribution over the function's behavior at a finite set of points X. By first tuning the model's (hyper)parameters ζ to explain D, a belief is formed as p(y | X, D) = p(y; θ) with θ ← M(X; ζ). Throughout, θ ← M(X; ζ) is used to denote that belief p's parameters θ are specified by model M evaluated at X. A member of this latter category, the Gaussian process prior (GP) is the most widely used surrogate and induces a multivariate normal belief θ ≜ (µ, Σ) ← M(X; ζ) such that p(y; θ) = N(y; µ, Σ) for any finite set X (see Figure 1b).

Acquisition functions  With few exceptions, acquisition functions amount to integrals defined in terms of a belief p over the unknown outcomes y = {y_1, . . . , y_q} revealed when evaluating a black-box function f at corresponding input locations X = {x_1, . . . , x_q}.
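To make the surrogate step θ = (µ, Σ) ← M(X; ζ) concrete for the GP case, here is a minimal numpy sketch: condition a squared-exponential prior on noiseless data D and read off the multivariate normal belief at query points. The kernel and its hyperparameters ζ = (scale, lengthscale) are illustrative choices, not the paper's.

```python
import numpy as np

# Minimal sketch of "theta = (mu, Sigma) <- M(X; zeta)" for a GP surrogate:
# noiseless GP regression with a squared-exponential kernel (illustrative).

def rbf(A, B, scale=1.0, lengthscale=0.5):
    """Squared-exponential kernel matrix between row-stacked inputs A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return scale * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_query, jitter=1e-8):
    """Return (mu, Sigma) of p(y | X_query, D) under noiseless GP regression."""
    K = rbf(X_train, X_train) + jitter * np.eye(len(X_train))
    K_s = rbf(X_query, X_train)
    mu = K_s @ np.linalg.solve(K, y_train)
    Sigma = rbf(X_query, X_query) - K_s @ np.linalg.solve(K, K_s.T)
    return mu, Sigma

X_train = np.array([[0.0], [0.5], [1.0]])
y_train = np.sin(3.0 * X_train[:, 0])
mu, Sigma = gp_posterior(X_train, y_train, X_train)
# Conditioning on noiseless data reproduces it with ~zero posterior variance.
assert np.allclose(mu, y_train, atol=1e-5)
assert np.all(np.diag(Sigma) < 1e-5)
```

Feeding such (µ, Σ) into a reparameterized sampler y = µ + Lz is exactly the building block used by the MC acquisition estimators discussed next.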
This formulation naturally occurs as part of a Bayesian approach whereby the value of querying X is determined by accounting for the utility provided by possible outcomes y^k ∼ p(y | X, D). Denoting the chosen utility function as ℓ, this paradigm leads to acquisition functions defined as expectations

    L(X; D, ψ) = E_y[ℓ(y; ψ)] = ∫ ℓ(y; ψ) p(y | X, D) dy .    (1)

A seeming exception to this rule, non-myopic acquisition functions assign value by further considering how different realizations of D_k ← D ∪ {(x_i, y_i^k)}_{i=1}^{q} impact our broader understanding of f and usually correspond to more complex, nested integrals. Figure 1c portrays a prototypical acquisition surface and Table 1 exemplifies popular, myopic and non-myopic instances of (1).

Inner optimization problem  Maximizing acquisition functions plays a crucial role in BO as the process through which abstract machinery (e.g. model M and acquisition function L) yields concrete actions (e.g. decisions regarding sets of queries X). Despite its importance however, this inner optimization problem is often neglected.
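The expectation in (1) is straightforward to estimate by Monte Carlo. As an illustration, the sketch below estimates parallel EI, whose utility is ℓ(y) = max_j ReLU(y_j − α), from reparameterized posterior samples; the belief parameters (µ, Σ) and threshold α are made-up stand-ins for a surrogate's output at some batch X of q = 2 queries.

```python
import numpy as np

# MC estimate of Eq. (1) for q-EI, l(y) = max_j ReLU(y_j - alpha), using
# reparameterized samples y^k = mu + L z^k. (mu, Sigma, alpha are invented.)

def mc_qei(mu, Sigma, alpha, m=100_000, seed=0):
    """Unbiased Monte Carlo estimator of q-EI."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    z = rng.standard_normal((m, len(mu)))   # z^k ~ N(0, I)
    y = mu + z @ L.T                        # y^k ~ N(mu, Sigma)
    return np.maximum(y - alpha, 0.0).max(axis=1).mean()

mu = np.array([0.2, 0.0])
Sigma = np.array([[0.30, 0.05], [0.05, 0.40]])
alpha = 0.1
qei = mc_qei(mu, Sigma, alpha)
ei_first = mc_qei(mu[:1], Sigma[:1, :1], alpha)
# Evaluating both queries is worth at least as much as either one alone.
assert qei > ei_first > 0.0
```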
Table 1: Examples of reparameterizable acquisition functions; the final column indicates whether they belong to the MM family (Section 3.2).

Abbr. | Acquisition Function L                              | Reparameterization                                       | MM
EI    | E_y[max(ReLU(y − α))]                               | E_z[max(ReLU(µ + Lz − α))]                               | Y
PI    | E_y[max(1(y − α))]                                  | E_z[max(σ((µ + Lz − α)/τ))]                              | Y
SR    | E_y[max(y)]                                         | E_z[max(µ + Lz)]                                         | Y
UCB   | E_y[max(µ + √(βπ/2)|γ|)]                            | E_z[max(µ + √(βπ/2)|Lz|)]                                | Y
ES    | E_{y_a}[H(E_{y_b|y_a}[1^+(y_b − max(y_b))])]        | E_{z_a}[H(E_{z_b}[softmax((µ_{b|a} + L_{b|a} z_b)/τ)])]  | N
KG    | E_{y_a}[max(µ_b + Σ_{b,a} Σ_{a,a}^{-1}(y_a − µ_a))] | E_{z_a}[max(µ_b + Σ_{b,a} Σ_{a,a}^{-1} L_a z_a)]         | N

Glossary: 1^+ / 1^− denote the right-/left-continuous Heaviside step functions; ReLU and σ the rectified linear and sigmoid nonlinearities, respectively; H the Shannon entropy; α an improvement threshold; τ a temperature parameter; L the Cholesky factor such that LL⊤ = Σ; and γ = y − µ ∼ N(0, Σ) the residuals. Lastly, non-myopic acquisition functions (ES and KG) are assumed to be defined using a discretization. Terms associated with the query set and discretization are respectively denoted via subscripts a and b.

This lack of emphasis is largely attributable to a greater focus on creating new and improved machinery as well as on applying BO to new types of problems. Moreover, elementary examples of BO facilitate L's maximization. For example, optimizing a single query x ∈ R^d is usually straightforward when x is low-dimensional and L is myopic. Outside these textbook examples, however, BO's inner optimization problem becomes qualitatively more difficult to solve. In virtually all cases, acquisition functions are non-convex (frequently due to the non-convexity of plausible explanations for f).
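As a quick numerical sanity check of Table 1's UCB row in the q = 1 case (all values below are arbitrary), the reparameterized form E_z[µ + √(βπ/2)|Lz|] recovers the familiar closed form µ + √β σ, since E|z| = √(2/π) for z ∼ N(0, 1):

```python
import numpy as np

# Check of the UCB reparameterization from Table 1 for q = 1: with L = sigma,
# E_z[mu + sqrt(beta*pi/2)*|sigma*z|] = mu + sqrt(beta)*sigma (E|z|=sqrt(2/pi)).

rng = np.random.default_rng(0)
mu, sigma, beta = 0.3, 0.7, 2.0
z = rng.standard_normal(1_000_000)
mc_ucb = np.mean(mu + np.sqrt(beta * np.pi / 2) * np.abs(sigma * z))
closed_form = mu + np.sqrt(beta) * sigma
assert abs(mc_ucb - closed_form) < 5e-3
```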
Accordingly, increases in input dimensionality d can be prohibitive to efficient query optimization. In the generalized setting with parallelism q ≥ 1, this issue is exacerbated by the additional scaling in q. While this combination of non-convexity and (acquisition) dimensionality is problematic, the routine intractability of both non-myopic and parallel acquisition functions poses a commensurate challenge.

As is generally true of integrals, the majority of acquisition functions are intractable. Even Gaussian integrals, which are often preferred because they lead to analytic solutions for certain instances of (1), are only tractable in a handful of special cases [13, 18, 20]. To circumvent the lack of closed-form solutions, researchers have proposed a wealth of diverse methods. Approximation strategies [13, 15, 60], which replace a quantity of interest with a more readily computable one, work well in practice but may not converge to the true value. In contrast, bespoke solutions [10, 20, 22] provide (near-)analytic expressions but typically do not scale well with dimensionality.² Lastly, MC methods [27, 47, 53] are highly versatile and generally unbiased, but are often perceived as non-differentiable and, therefore, inefficient for purposes of maximizing L.

Regardless of the method however, the (often drastic) increase in cost when evaluating L's proxy acts as a barrier to efficient query optimization, and these costs increase over time as shown in Figure 1d.
In an effort to address these problems, we now go inside the outer-loop and focus on efficient methods for maximizing acquisition functions.

3 Maximizing acquisition functions

This section presents the technical contributions of this paper, which can be broken down into two complementary topics: 1) gradient-based optimization of acquisition functions that are estimated via Monte Carlo integration, and 2) greedy maximization of "myopic maximal" acquisition functions. Below, we separately discuss each contribution along with its related literature.

3.1 Differentiating Monte Carlo acquisitions

Gradients are one of the most valuable sources of information for optimizing functions. In this section, we detail both the reasons and conditions whereby MC acquisition functions are differentiable and further show that most well-known examples readily satisfy these criteria (see Table 1).

²By near-analytic, we refer to cases where an expression contains terms that cannot be computed exactly but for which high-quality solvers exist (e.g. low-dimensional multivariate normal CDF estimators [20, 21]).

We assume that L is an expectation over a multivariate normal belief p(y | X, D) = N(y; µ, Σ) specified by a GP surrogate such that (µ, Σ) ← M(X). More generally, we assume that samples can be generated as y^k ∼ p(y | X, D) to form an unbiased MC estimator of an acquisition function L(X) ≈ L_m(X) ≜ (1/m) Σ_{k=1}^{m} ℓ(y^k). Given such an estimator, we are interested in verifying whether

    ∇L(X) ≈ ∇L_m(X) ≜ (1/m) Σ_{k=1}^{m} ∇ℓ(y^k) ,    (2)

where ∇ℓ denotes the gradient of utility function ℓ taken with respect to X. The validity of MC gradient estimator (2) is obscured by the fact that y^k depends on X through generative distribution p and that ∇L_m is the expectation of ℓ's derivative rather than the derivative of ℓ's expectation. Originally referred to as infinitesimal perturbation analysis [8, 24], the reparameterization trick [37, 50] is the process of differentiating through an MC estimate to its generative distribution p's parameters and consists of two components: i) reparameterizing samples from p as draws from a simpler base distribution p̂, and ii) interchanging differentiation and integration by taking the expectation over sample path derivatives.

Reparameterization  Reparameterization is a way of interpreting samples that makes their differentiability w.r.t. a generative distribution's parameters transparent. Often, samples y^k ∼ p(y; θ) can be re-expressed as a deterministic mapping φ : Z × Θ → Y of simpler random variates z^k ∼ p̂(z) [37, 50]. This change of variables helps clarify that, if ℓ is a differentiable function of y = φ(z; θ), then dℓ/dθ = (dℓ/dφ)(dφ/dθ) by the chain rule of (functional) derivatives.

If generative distribution p is multivariate normal with parameters θ = (µ, Σ), the corresponding mapping is then φ(z; θ) ≜ µ + Lz, where z ∼ N(0, I) and L is Σ's Cholesky factor such that LL⊤ = Σ. Rewriting (1) as a Gaussian integral and reparameterizing, we have

    L(X) = ∫_a^b ℓ(y) N(y; µ, Σ) dy = ∫_{a′}^{b′} ℓ(µ + Lz) N(z; 0, I) dz ,    (3)

where each of the q terms c′_i in both a′ and b′ is transformed as c′_i = (c_i − µ_i − Σ_{j<i} L_{i,j} z_j) / L_{i,i}.
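To illustrate the sample-path gradient estimator (2) on fixed draws, the sketch below differentiates reparameterized MC q-EI with respect to µ by hand (for fixed z^k the estimator is piecewise linear in µ, so the gradient's j-th entry is the fraction of samples whose best coordinate is j and improves on α) and checks it against central finite differences. All numbers are invented for the example; differentiating µ(X) and L(X) would supply the remaining chain-rule factors with respect to X.

```python
import numpy as np

# Sample-path derivative of reparameterized MC q-EI w.r.t. mu, as in Eq. (2),
# checked against central finite differences on the SAME fixed draws z^k.

rng = np.random.default_rng(0)
q, m, alpha = 3, 2000, 0.1
mu = np.array([0.2, 0.0, -0.1])
L = np.array([[0.6, 0.0, 0.0],
              [0.1, 0.5, 0.0],
              [0.2, 0.1, 0.4]])
Z = rng.standard_normal((m, q))             # fixed base draws z^k ~ N(0, I)

def qei(mu):
    """MC q-EI on the fixed sample paths y^k = mu + L z^k."""
    y = mu + Z @ L.T
    return np.maximum(y - alpha, 0.0).max(axis=1).mean()

def qei_grad(mu):
    """Hand-derived gradient of qei w.r.t. mu (Eq. (2) on fixed paths)."""
    y = mu + Z @ L.T
    best, improved = y.argmax(axis=1), y.max(axis=1) > alpha
    grad = np.zeros(q)
    np.add.at(grad, best[improved], 1.0 / m)  # accumulate 1/m per winning sample
    return grad

h = 1e-6
fd = np.array([(qei(mu + h * np.eye(q)[j]) - qei(mu - h * np.eye(q)[j])) / (2 * h)
               for j in range(q)])
assert np.allclose(qei_grad(mu), fd, atol=1e-3)
```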