{"title": "Batch Bayesian Optimization via Simulation Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 109, "page_last": 117, "abstract": "Bayesian optimization methods are often used to optimize unknown functions that are costly to evaluate. Typically, these methods sequentially select inputs to be evaluated one at a time based on a posterior over the unknown function that is updated after each evaluation. There are a number of effective sequential policies for selecting the individual inputs. In many applications, however, it is desirable to perform multiple evaluations in parallel, which requires selecting batches of multiple inputs to evaluate at once. In this paper, we propose a novel approach to batch Bayesian optimization, providing a policy for selecting batches of inputs with the goal of optimizing the function as efficiently as possible. The key idea is to exploit the availability of high-quality and efficient sequential policies, by using Monte-Carlo simulation to select input batches that closely match their expected behavior. To the best of our knowledge, this is the first batch selection policy for Bayesian optimization. Our experimental results on six benchmarks show that the proposed approach significantly outperforms two baselines and can lead to large advantages over a top sequential approach in terms of performance per unit time.", "full_text": "

Batch Bayesian Optimization via Simulation Matching

Javad Azimi, Alan Fern, Xiaoli Z. Fern
School of EECS, Oregon State University
{azimi, afern, xfern}@eecs.oregonstate.edu

Abstract

Bayesian optimization methods are often used to optimize unknown functions that are costly to evaluate. Typically, these methods sequentially select inputs to be evaluated one at a time based on a posterior over the unknown function that is updated after each evaluation.
In many applications, however, it is desirable to perform multiple evaluations in parallel, which requires selecting batches of multiple inputs to evaluate at once. In this paper, we propose a novel approach to batch Bayesian optimization, providing a policy for selecting batches of inputs with the goal of optimizing the function as efficiently as possible. The key idea is to exploit the availability of high-quality and efficient sequential policies, by using Monte-Carlo simulation to select input batches that closely match their expected behavior. Our experimental results on six benchmarks show that the proposed approach significantly outperforms two baselines and can lead to large advantages over a top sequential approach in terms of performance per unit time.

1 Introduction

We consider the problem of maximizing an unknown function f(x) when each evaluation of the function has a high cost. In such cases, standard optimization techniques such as empirical gradient methods are not practical due to the high number of function evaluations that they demand. Rather, Bayesian optimization (BO) methods [12, 4] have demonstrated significant promise in their ability to effectively optimize a function given only a small number of evaluations. BO gains this efficiency by leveraging Bayesian models that take into account all previously observed evaluations in order to better inform future evaluation choices. In particular, typical BO methods continually maintain a posterior over f(x) that is used to select the next input to evaluate. The result of the evaluation is then used to update the posterior and the process repeats. There are a number of well established policies for selecting the next input to evaluate given the current posterior.
We will refer to such policies as sequential policies to stress the fact that they select one input at a time.

In many applications it is possible and desirable to run multiple function evaluations in parallel. This is the case, for example, when the underlying function corresponds to a controlled laboratory experiment where multiple experimental setups are examined simultaneously, or when the underlying function is the result of a costly computer simulation and multiple simulations can be run across different processors in parallel. In such cases, existing sequential policies are not sufficient. Rather, batch mode BO is more appropriate, where policies select a batch of multiple inputs to be evaluated at once. To the best of our knowledge and as noted in [4], there is no established work on BO that considers the batch selection problem, except for a brief treatment in [21]. The main contribution of this work is to propose an approach to batch BO and to demonstrate its effectiveness.

The key motivation behind our approach comes from the fact that the sequential mode of BO has a fundamental advantage over BO in batch mode. This is because in sequential mode, each function evaluation is immediately used to obtain a more accurate posterior of f(x), which in turn will allow a selection policy to make more informed choices about the next input. Given an effective sequential selection policy, our goal is then to design a batch policy that approximates its behavior.

In particular, our batch policy attempts to select a batch that "matches" the expected behavior of a sequential policy as closely as possible. The approach generates Monte-Carlo simulations of a sequential policy given the current posterior, and then derives an optimization problem over possible batches aimed at minimizing the loss between the sequential policy and the batch.
We consider two variants of this optimization problem that yield a continuous weighted k-means problem and a combinatorial weighted k-medoid problem. We solve the k-means variant via k-means clustering and show that the k-medoid variant corresponds to minimizing a non-increasing supermodular function, for which there is an efficient approximation algorithm [9].

We evaluate our approach on a collection of six functions and compare it to random selection and another baseline batch policy based on submodular maximization. The results show that our approach significantly outperforms these baselines and can lead to large advantages over a top sequential approach in terms of performance per unit time.

2 Problem Setup

Let X ⊆ R^n be an n-dimensional input space, where we will often refer to elements of X as experiments and assume that each dimension i is bounded in [A_i, B_i]. We assume an unknown real-valued function f : X → R, which represents the expected value of the dependent variable after running an experiment. For example, f(x) might correspond to the result of a wet-lab experiment or a computer simulation with input parameters x. Conducting an experiment x produces a noisy outcome y = f(x) + ε, where ε is a noise term that might be 0 in some applications.

Our objective is to find an experiment x ∈ X that approximately maximizes f by requesting a limited number of experiments and observing their outcomes. Furthermore, we are interested in applications where (1) running experiments is costly (e.g. in terms of laboratory or simulation time); and (2) it is desirable to run k > 1 experiments in parallel. This motivates the problem of selecting a sequence of batches, each containing k experiments, where the choice of a batch can depend on the results observed from all previous experiments. We will refer to the rule for selecting a batch based on previous experiments as the batch policy.
The main goal of this paper is to develop a batch policy that optimizes the unknown function as efficiently as possible.

Due to the high cost of experiments, traditional optimization techniques such as empirical gradient ascent are not practical for our setting, given their high demands on the number of experiments. Rather, we build on Bayesian optimization (BO) [10, 12, 4], which leverages Bayesian modeling in an attempt to achieve more efficient optimization. In particular, BO maintains a posterior over the unknown function based on previously observed experiments, e.g. represented via a Gaussian Process (GP) [19]. This posterior is used to select the next experiment to be run in a way that attempts to trade off exploring new parts of the experimental space and exploiting parts that look promising. While the BO literature has provided a number of effective policies, they are all sequential policies, where only a single experiment is selected and run at a time. Thus, the main novelty of our work is in defining a batch policy in the context of BO, which is described in the next section.

3 Simulation Matching for Batch Selection

Given a data set D of previously observed experiments, which induces a posterior distribution over the unknown function, we now consider how to select the next batch of k experiments. A key issue in making this choice is to manage the trade-off between exploration and exploitation. The policy must attempt to explore by requesting experiments from unexplored parts of the input space, while at the same time attempting to optimize the unknown function via experiments that look promising given the current data. While, under most measures, optimizing this trade-off is computationally intractable, there are a number of heuristic sequential policies from the BO literature that are computationally efficient and perform very well in practice.
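For concreteness, a GP posterior of the kind used here can be sketched as follows. This is a minimal sketch rather than the authors' implementation; the squared-exponential kernel and small noise variance anticipate the choices described in Section 4, while the function name and default parameter values are ours:

```python
import numpy as np

def gp_posterior(X, y, Xstar, w=0.1, sigma_f=1.0, noise=0.01):
    """Posterior mean/variance of a zero-mean GP with Gaussian kernel
    K(x, x') = sigma_f * exp(-||x - x'||^2 / (2 w)) and noise variance `noise`."""
    def kern(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sigma_f * np.exp(-d2 / (2 * w))

    K = kern(X, X) + noise * np.eye(len(X))   # training covariance plus noise
    Ks = kern(X, Xstar)                       # train/test cross-covariance
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha                         # posterior mean at Xstar
    v = np.linalg.solve(L, Ks)
    var = sigma_f - (v ** 2).sum(0)           # posterior variance at Xstar
    return mu, var
```

With a small noise variance the posterior mean nearly interpolates the observed outcomes, and the posterior variance shrinks toward zero at observed inputs, which is the behavior the selection policies below rely on.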
For example, one such policy selects the next experiment to be the one that has the "maximum expected improvement" according to the current posterior [14, 10]. The main idea behind our approach is to leverage such sequential policies by selecting a batch of k > 1 experiments that "closely matches" the sequential policy's expected behavior.

More formally, let π be a sequential policy. Given a data set D of prior experimental results, π returns the next experiment x ∈ X to be selected. As is standard in BO, we assume we have a posterior density P(f | D) over the unknown function f, such as a Gaussian Process. Given this density we can define a density over the outcomes of executing policy π for k steps, each outcome consisting of a set of k selected experiments. Let S^k_π be the random variable denoting the set of k experiments resulting from such k-step executions, which has a well defined density over all possible sets given the posterior of f. Importantly, it is generally straightforward to use Monte Carlo simulation to sample values of S^k_π.¹ Our batch policy is based on generating a number of samples of S^k_π, which are used to define an objective for optimizing a batch of k experiments. Below we describe this objective and a variant, followed by a description of how we optimize the proposed objectives.

3.1 Batch Objective Function

Our goal is to select a batch B of k experiments that best "matches the expected behavior" of a base sequential policy π conditioned on the observed data D. More precisely, we consider a batch B to be a good match for a policy execution if B contains an experiment that is close to the best of the k experiments selected by the policy. To specify this objective we first introduce some notation.
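Before introducing that notation, note that the k-step Monte-Carlo simulation of π just described can be written generically. This is an illustrative sketch under our own naming: `policy` and `posterior_sample` are placeholder callables standing in for the sequential policy and for simulating an outcome under P(f | D):

```python
def rollout(policy, posterior_sample, D, k):
    """Simulate k steps of a sequential policy to draw one sample of S^k_pi.

    policy(D)              -> next input x chosen by the sequential policy
    posterior_sample(x, D) -> simulated outcome y for x under P(f | D)
    D                      -> list of (x, y) observations; not mutated
    """
    D = list(D)                      # work on a copy of the data set
    S = []
    for _ in range(k):
        x = policy(D)                # select the next experiment given data so far
        y = posterior_sample(x, D)   # simulate its outcome from the posterior
        D.append((x, y))             # condition subsequent selections on it
        S.append(x)
    return S
```

Repeating this rollout n times yields the samples S_1, ..., S_n used in the objective below; each rollout conditions later selections on earlier simulated outcomes, mirroring sequential execution.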
Given a function f and a set of experiments S, we define x*(f, S) = argmax_{x∈S} f(x) to be the maximizer of f in S. Also, for any experiment x and set B we define nn(x, B) = argmin_{x′∈B} ‖x − x′‖ to be the nearest neighbor of x in set B. Our objective can now be written as selecting a batch B that minimizes

OBJ(B) = E_{S^k_π}[ E_{f | S^k_π}[ ‖x*(f, S^k_π) − nn(x*(f, S^k_π), B)‖² | D ] | D ].

Note that this nested expectation is the result of decomposing the joint posterior over S^k_π and f as P(f, S^k_π | D) = P(f | S^k_π, D) · P(S^k_π | D). If we assume that the unknown function f(x) is Lipschitz continuous then minimizing this objective can be viewed as minimizing an upper bound on the expected performance difference between the sequential policy and the selected batch. Here the performance of a policy or a batch is equal to the output value of the best selected experiment.

We will approximate this objective by replacing the outer expectation over S^k_π with a sample average over n samples {S_1, . . . , S_n} of S^k_π as follows, recalling that each S_i is a set of k experiments:

OBJ(B) ≈ (1/n) Σ_i E_{f|S_i}[ ‖x*(f, S_i) − nn(x*(f, S_i), B)‖² | D ]
       = (1/n) Σ_i Σ_{x∈S_i} Pr(x = x*(f, S_i) | D, S_i) · ‖x − nn(x, B)‖²
       = (1/n) Σ_i Σ_{x∈S_i} α_{i,x} · ‖x − nn(x, B)‖²                          (1)

The second step follows by noting that x*(f, S_i) must be one of the k experiments in S_i. We now define our objective as minimizing (1) over batch B.
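Once simulations and weights are in hand, the sample-average objective (1) is cheap to evaluate. A minimal sketch under our own naming, where `sims[i]` holds the points of S_i and `weights[i]` the corresponding weights α_{i,x}:

```python
import numpy as np

def batch_objective(sims, weights, B):
    """Approximate OBJ(B) from Eq. (1): average, over simulations, of the
    weighted squared distance from each simulated point to its nearest
    neighbor in the candidate batch B."""
    B = np.asarray(B, dtype=float)
    total = 0.0
    for S_i, w_i in zip(sims, weights):
        for x, a in zip(S_i, w_i):
            d2 = ((B - np.asarray(x, dtype=float)) ** 2).sum(axis=1)
            total += a * d2.min()    # alpha_{i,x} * ||x - nn(x, B)||^2
    return total / len(sims)
```

A batch that places a point on top of every highly weighted simulated experiment drives this value toward zero, which is exactly the matching behavior the objective rewards.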
The objective corresponds to a weighted k-means clustering problem, where we must select B to minimize the weighted distortion between the simulated points and their closest points in B. The weight on each simulated experiment α_{i,x} corresponds to the probability that the experiment x ∈ S_i achieves the maximum value of the unknown f among the experiments in S_i, conditioned on D and the fact that S^k_π = S_i. We refer to this objective as the k-means objective.

We also consider a variant of this objective where the goal is to find a B that minimizes (1) under the constraint that B is restricted to experiments in the simulations, i.e. B ⊆ ∪_i S_i s.t. |B| = k. This objective corresponds to the weighted k-medoid clustering problem, which is often considered to improve robustness to outliers in clustering. Accordingly we will refer to this objective as the k-medoid objective and note that given a fixed set of simulations this corresponds to a discrete optimization problem.

3.2 Optimization Approach

The above k-means and k-medoid objectives involve the weights α_{i,x} = P(x = x*(f, S_i) | D, S^k_π = S_i), for each x ∈ S_i. In general these weights will be difficult to compute exactly, particularly due to the conditioning on the set S_i.

¹ For example, this can be done by starting with D and selecting the first experiment x_1 using π and then using P(f | D) to simulate the result y_1 of experiment x_1. This simulated experiment is added to D and the process repeats for k − 1 additional experiments.

Algorithm 1 Greedy Weighted k-Medoid Algorithm
Input: S = {(x_1, w_1), . . . , (x_m, w_m)}, k
Output: B
  B ← {x_1, . . . , x_m}   // initialize batch to all data points
  while |B| > k do
    x ← argmin_{x∈B} Σ_{j=1}^m w_j · ‖x_j − nn(x_j, B \ x)‖   // point that influences the objective the least
    B ← B \ x
  end while
  return B
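Algorithm 1 can be sketched directly in code. This is an illustrative implementation under our own naming, using the unsquared distances of the pseudo-code; it costs O(m²) distance evaluations per removal and makes no attempt at incremental updates:

```python
import numpy as np

def greedy_k_medoid(points, weights, k):
    """Greedy descent for the weighted k-medoid objective (Algorithm 1):
    start with all points in B and repeatedly drop the point whose removal
    increases the weighted nearest-neighbor distortion the least."""
    X = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    B = list(range(len(X)))                  # indices currently in the batch
    while len(B) > k:
        best_x, best_cost = None, None
        for x in B:
            rest = [b for b in B if b != x]
            # weighted distortion of all points if x were removed from B
            d = np.linalg.norm(X[:, None, :] - X[rest][None, :, :], axis=2)
            cost = (w * d.min(axis=1)).sum()
            if best_cost is None or cost < best_cost:
                best_x, best_cost = x, cost
        B.remove(best_x)
    return X[B]
```

Note the direction of the greed: elements are removed from a full batch rather than added to an empty one, matching the supermodular-minimization analysis that follows.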
In this work, we approximate those weights by dropping the conditioning on S_i, for which it is then possible to derive a closed form when the posterior over f is represented as a Gaussian Process (GP). We have found that this approach leads to good empirical performance. In particular, instead of using the weights α_{i,x} we use the weights α̂_{i,x} = P(x = x*(f, S_i) | D). When the posterior over f is represented as a GP, as in our experiments, the joint distribution over experimental outcomes in S_i = {x_{i,1}, . . . , x_{i,k}} is normally distributed. That is, the random vector ⟨f(x_{i,1}), . . . , f(x_{i,k})⟩ ∼ N(μ, Σ), where the mean μ and covariance Σ have standard closed forms given by the GP conditioned on D. From this, it is clear that for a GP the computation of α̂_{i,x} is equivalent to computing the probability that the i-th component of a normally distributed vector is larger than the other components. A closed form solution for this probability is given by the following proposition.

Proposition 1. If (y_1, y_2, . . . , y_k) ∼ N(μ_y, Σ_y) then for any i ∈ {1, . . . , k},

P(y_i ≥ y_1, y_i ≥ y_2, . . . , y_i ≥ y_k) = ∏_{j=1}^{k−1} (1 − Φ(−μ_j))        (2)

where Φ(·) is the standard normal cdf and μ = (μ_1, μ_2, . . . , μ_{k−1}) = (A Σ_y A′)^{−1/2} A μ_y, with A ∈ R^{(k−1)×k} a sparse matrix such that A_{j,i} = 1 for every j = 1, 2, . . . , k − 1, A_{p,p} = −1 for every 1 ≤ p < i, and A_{p−1,p} = −1 for every i < p ≤ k.

Using this approach to compute the weights we can now consider optimizing the k-means and k-medoid objectives from (1), both of which are known to be NP-hard problems.
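Proposition 1 translates into a short numerical routine. This is a sketch of the stated formula under our own naming; the comparison matrix A is built exactly as in the proposition, and the inverse matrix square root is taken via an eigendecomposition:

```python
import numpy as np
from math import erf, sqrt

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_component_is_max(i, mu_y, Sigma_y):
    """P(y_i >= y_j for all j) for y ~ N(mu_y, Sigma_y), per Proposition 1."""
    k = len(mu_y)
    A = np.zeros((k - 1, k))
    A[:, i] = 1.0                     # each row compares y_i against one other y_p
    for p in range(k):
        if p < i:
            A[p, p] = -1.0
        elif p > i:
            A[p - 1, p] = -1.0
    C = A @ Sigma_y @ A.T             # covariance of the difference vector A y
    vals, vecs = np.linalg.eigh(C)    # C is symmetric positive definite here
    C_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    mu = C_inv_sqrt @ (A @ mu_y)
    return float(np.prod([1.0 - phi_cdf(-m) for m in mu]))
```

For k = 2 the formula reduces to the exact probability Φ((μ_1 − μ_2)/√Var(y_1 − y_2)), which gives a quick sanity check.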
For the k-means objective we solve for the set B by simply applying the k-means clustering algorithm [13] to the weighted data set ∪_i ∪_{x∈S_i} {(x, α̂_{i,x})}. The k cluster centers are returned as our batch B.

The k-medoid objective is well known [22] and the weighted k-medoid clustering algorithm [11] has been shown to perform well and be robust to outliers in the data. While we have experimented with this algorithm and obtained good results, we have achieved results that are as good or better using an alternative greedy algorithm that provides certain approximation guarantees. Pseudo-code for this algorithm is shown in Algorithm 1. The input to the algorithm is the set of weighted experiments and the batch size k. The algorithm initializes the batch B to include all of the input experiments, which achieves the minimum objective value of zero. The algorithm then iteratively removes one experiment from B at a time until |B| = k, each time removing the element whose removal results in the smallest increase in the k-medoid objective.

This greedy algorithm is motivated by theoretical results on the minimization of non-increasing, supermodular set functions.

Definition 1. Suppose S is a finite set. A set function f : 2^S → R⁺ is supermodular if for all B_1 ⊆ B_2 ⊆ S and x ∈ S \ B_2, it holds that f(B_1) − f(B_1 ∪ {x}) ≥ f(B_2) − f(B_2 ∪ {x}).

Thus, a set function is supermodular if adding an element to a smaller set provides no less improvement than adding the element to a larger set. Also, a set function is non-increasing if for any set S and element x, f(S) ≥ f(S ∪ {x}). It can be shown that our k-medoid objective function of (1) is both a non-increasing and supermodular function of B and achieves a minimum value of zero for B = ∪_i S_i. It follows that we can obtain an approximation guarantee for the described greedy algorithm from [9].

Theorem 1. [9] Let f be a monotonic non-increasing supermodular function over subsets of the finite set S, with |S| = m and f(S) = 0. Let B be the set of elements returned by the greedy Algorithm 1 such that |B| = k, let q = m − k, and let B* = argmin_{B′⊆S, |B′|=k} f(B′). Then

f(B) ≤ (1/t) [ ((q + t)/q)^q − 1 ] f(B*) ≤ ((e^t − 1)/t) f(B*)        (3)

where t is the steepness parameter [9] of the function f.

Notice that the approximation bound involves the steepness parameter t of f, which characterizes the rate of decrease of f. This is unavoidable since it is known that achieving a constant factor approximation guarantee is not possible unless P=NP [17]. Further, this bound has been shown to be tight for any t [9]. Note that this is in contrast to guarantees for greedy maximization of submodular functions [7], for which there are constant factor guarantees. Also note that the greedy algorithm we use is qualitatively different from the one used for submodular maximization, since it greedily removes elements from B rather than greedily adding elements to B.

4 Implementation Details and Baselines

GP Posterior. Our batch selection approach described above requires that we maintain a posterior over the unknown function f. For this purpose we use a zero-mean GP prior with a zero-mean Gaussian noise model with variance equal to 0.01. The GP covariance is specified by a Gaussian kernel K(x, x′) = σ exp(−‖x − x′‖² / (2w)), with signal variance σ = y_max², where y_max is the maximum value of the unknown function. In all of our experiments we used a simple rule of thumb to set the kernel width w to 0.01 Σ_{i=1}^d l_i, where l_i is the input space length in dimension i. We have found this rule to work well for a variety of problems. An alternative would be to use a validation-based approach for selecting the kernel parameters. In the BO setting, however, we have found this to be unreliable since the number of data points is relatively small.

Base Sequential Policy. Our batch selection approach also requires a base sequential policy π to be used for simulation matching. This policy must be able to select the next experiment given any set of prior experimental observations D. In our experiments, we use a policy based on the Maximum Expected Improvement (MEI) heuristic [14, 10], which is a very successful sequential policy for BO and has been shown to converge in the limit to the global optimum. Given data D, the MEI policy simply selects the next experiment to be the one that maximizes the expected improvement over the current set of experiments with respect to maximizing the unknown function. More formally, let y* be the value of the best/largest experimental outcome observed so far in D. The MEI value of an experiment x is given by MEI(x) = E_f[max{f(x) − y*, 0} | D]. For our GP posterior over f this has the closed form

MEI(x) = σ(x) [ −u Φ(−u) + φ(u) ],   u = (y* − μ(x)) / σ(x)

where Φ and φ are the standard normal cumulative distribution and density functions and μ(x) and σ(x) are the mean and standard deviation of f(x) according to the GP given D, which have simple closed forms. Note that we have also evaluated our simulation-matching approach with an alternative sequential policy known as Maximum Probability of Improvement [16, 10].
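The MEI closed form is a one-liner once the posterior mean and standard deviation are available. A minimal sketch (names are ours), using math.erf for Φ:

```python
from math import erf, exp, pi, sqrt

def mei(mu, sigma, y_best):
    """Maximum Expected Improvement E[max{f(x) - y*, 0}] for a Gaussian
    posterior f(x) ~ N(mu, sigma^2): sigma * (-u * Phi(-u) + phi(u))
    with u = (y* - mu) / sigma."""
    if sigma <= 0.0:
        return max(mu - y_best, 0.0)           # degenerate (noise-free known) case
    u = (y_best - mu) / sigma
    Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))    # standard normal cdf
    phi = lambda z: exp(-0.5 * z * z) / sqrt(2.0 * pi)  # standard normal pdf
    return sigma * (-u * Phi(-u) + phi(u))
```

As expected, MEI grows with the posterior mean at fixed uncertainty, and is strictly positive whenever sigma > 0, which is what drives the policy's exploration.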
The results (not shown in this paper) are similar to those obtained from MEI, showing that our general approach works well for different base policies.

The computation of the MEI policy requires maximizing MEI(x) over the input space X. In general, this function does not have a unique local maximum, and various strategies have been tried for maximizing it. In our experiments, we (approximately) maximize the MEI function using the DIRECT black-box optimization procedure, which has shown good optimization performance as well as computational efficiency in practice.

Baseline Batch Policies. To the best of our knowledge there is no well-known batch policy for Bayesian optimization. However, in our experiments we will compare against two baselines. The first baseline is random selection, where a batch of k random experiments is returned at each step. Interestingly, in the case of batch active learning for classification, the random batch selection strategy has been surprisingly effective and is often difficult to outperform with more sophisticated strategies [8]. However, as our experiments will show, our approach dominates random.

Our second, more sophisticated, baseline is based on selecting a batch of experiments whose expected maximum output is the largest. More formally, we consider selecting a size-k batch B that maximizes the objective E_f[max_{x∈B} f(x) | D], which we will refer to as the EMAX objective. For our GP prior, each set B = {x_1, . . . , x_k} can be viewed as defining a normally distributed vector ⟨f(x_1), . . . , f(x_k)⟩ ∼ N(μ, Σ). Even in this case, finding the optimal set B is known to be NP-hard. However, for the case where f is assumed to be non-negative, the EMAX objective is a non-negative, submodular, non-decreasing function of B. Together these properties imply that a simple greedy algorithm can achieve an approximation ratio of 1 − e⁻¹ [7]. The algorithm starts with an empty B and greedily adds experiments to B, each time selecting the one that improves the EMAX objective the most. Unfortunately, in general there is no closed form solution for evaluating the EMAX objective, even in our case of normally distributed vectors [20]. Therefore, to implement the greedy algorithm, which requires many evaluations of the EMAX objective, we use Monte Carlo sampling, where for a given set B we sample the corresponding normally distributed vector and average the maximum values across the samples.

5 Experimental Results

In this section we evaluate our proposed batch BO approach and the baseline approaches on six different benchmarks.

5.1 Benchmark Functions

We consider three well-known synthetic benchmark functions: Cosines and Rosenbrock [1, 5], which are over [0, 1]², and Michalewicz [15], which is over [0, π]⁵. Table 1 gives the formulas for each of these functions.

Table 1: Benchmark Functions.
Function      Mathematical representation
Cosines       1 − (u² + v² − 0.3 cos(3πu) − 0.3 cos(3πv)),  u = 1.6x − 0.5, v = 1.6y − 0.5
Rosenbrock    10 − 100(y − x²)² − (1 − x)²
Michalewicz   −Σ_{i=1}^5 sin(x_i) · sin(i x_i² / π)^20

Two additional benchmark functions, Hydrogen and FuelCell, which range over [0, 1]², are derived from real-world experimental data sets. In both cases, the benchmark function was created by fitting regression models to data sets resulting from real experiments. The Hydrogen data set is the result of data collected as part of a study on biosolar hydrogen production [6], where the goal was to maximize the hydrogen production of a particular bacterium by optimizing the pH and nitrogen levels of the growth medium.
The FuelCell data set was collected as part of a study investigating the influence of anodes' nano-structure on the power output of microbial fuel cells [3]. The experimental inputs include the average area and average circularity of the nano-particles [18]. Contour plots of the four 2-d functions are shown in Figure 1.

The last benchmark function is derived from the Cart-Pole [2] problem, which is a commonly used reinforcement learning problem. The goal is to optimize the parameters of a controller for a wheeled cart with the objective of balancing a pole. The controller is parameterized by four parameters, giving a 4-d space of experiments in [−1, 1]⁴. Given a setting for these parameters, the benchmark function is implemented by using the standard Cart-Pole simulator to return the reward received for the controller.

5.2 Results

Figures 2 and 3 show the performance of our methods on all six benchmark functions for batch sizes 5 and 10 respectively. Each graph contains 5 curves, each corresponding to a different BO approach (see below). Each curve is the result of taking an average of 100 independent runs. The x-axis of each graph represents the total number of experiments and the y-axis represents the regret values, where the regret of a policy at a particular point is the difference between the best possible output value (or an upper bound if the value is not known) and the best value found by the policy. Hence the regret is always positive and smaller values are preferred.
Each run of a policy initializes the data set to contain 5 randomly selected experiments for the 2-d functions and 20 random initial experiments for the higher dimensional functions.

Figure 1: Contour plots for the four 2-dimensional test functions (Fuel Cell, Hydrogen, Cosines, Rosenbrock).

Figure 2: Performance evaluation with batch size 5. Panels: Fuel Cell, Hydrogen, Cosines, Rosenbrock, Cart-Pole, Michalewicz; each plots regret versus number of experiments for Sequential, k-medoid, k-means, EMAX, and Random.

Figure 3: Performance evaluation with batch size 10 (same panels and curves as Figure 2).

Each graph gives curves for four batch approaches, including our baselines Random and EMAX, along with our proposed approaches based on the k-means and k-medoid objectives, which are optimized by weighted k-means clustering and the greedy Algorithm 1, respectively. In addition, for reference we plot the performance of the base Sequential MEI BO policy (k = 1) on each graph. Note that since the batch approaches request either 5 or 10 experiments at a time, their curves only contain data points at those intervals. For example, for the batch size 5 results, the first point on a batch curve corresponds to 10 experiments, including the initial 5 experiments and the first requested batch. The next point on the batch curve is for 15 experiments, which includes the next requested batch, and so on. In contrast, the Sequential policy has a point at every step since it requests experiments one at a time. It is important to realize that we generally expect a good sequential policy to do better, or no worse, than a batch policy with respect to performance per number of experiments. Thus, the Sequential curve can typically be viewed as an upper performance bound and provides an indication of how much loss is incurred when moving to a batch setting in terms of efficiency per experiment.

Comparison to Baselines. The major observation from our results is that for all benchmarks and for both batch sizes the proposed k-means and k-medoid approaches significantly outperform the baselines. This provides strong validation for our proposed simulation-matching approach to batch selection.

k-means vs. k-medoid.
In most cases, the k-means and k-medoid approaches perform similarly. However, for both batch sizes k-medoid often shows a small improvement over k-means, and it appears to have a significant advantage on FuelCell. The only exception is Hydrogen, where k-means shows a small advantage over k-medoid for small numbers of experiments. Overall, both approaches appear to be effective, and in these domains k-medoid has a slight edge.

Batch vs. Sequential. The advantage of Sequential over our batch approaches varies with the benchmark. However, in most cases, our proposed batch approaches catch up to Sequential in a relatively small number of experiments, and in some cases the batch policies are similar to Sequential from the start. The main exception is Cart-Pole for batch size 10, where the batch policies appear to be significantly less efficient in terms of performance versus number of experiments. Generally, we see that the difference between our batch policies and Sequential is larger for batch size 10 than batch size 5, which is expected, since larger batch sizes imply that less information per experiment is used in making decisions.

It is clear, however, that if we evaluate the performance of our batch policies in terms of experimental time, then there is a very significant advantage over Sequential. In particular, the amount of experimental time for a policy is approximately equal to the number of requested batches, assuming that the batch size is selected to allow for all selected experiments to be run in parallel. This means, for example, that for the batch size 5 results, 5 time steps for the batch approaches correspond to 30 total experiments (5 initial + 5 batches of 5). We can compare this point to the point on the Sequential curve that also corresponds to 5 time steps (5 experiments beyond the initial 5).
In all cases, the batch policies yield a very large improvement in regret reduction per unit time, which is the primary motivation for batch selection.

6 Summary and Future Work

In this paper we introduced a novel approach to batch BO based on the idea of simulation matching. The key idea of our approach is to design batches of experiments that approximately match the expected performance of high-quality sequential policies for BO. We considered two variants of the matching problem and showed that both approaches significantly outperform two baselines, including random batch selection, on six benchmark functions. For future work, we plan to consider the general idea of simulation matching for other problems, such as active learning, where there are also good sequential policies and batch selection is often warranted. In addition, we plan to consider less myopic approaches for selecting each batch, as well as the problem of batch size selection, where the choice of batch size must take into account the current data and experimental budget.

Acknowledgments

The authors acknowledge the support of the NSF under grant IIS-0905678.

References
[1] B. S. Anderson, A. W. Moore, and D. Cohn. A nonparametric approach to noisy and costly optimization. In ICML, 2000.
[2] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835–846, 1983.
[3] D. Bond and D. Lovley. Electricity production by Geobacter sulfurreducens attached to electrodes. Applied and Environmental Microbiology, 69:1548–1555, 2003.
[4] E. Brochu, M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, Department of Computer Science, University of British Columbia, 2009.
[5] M. Brunato, R. Battiti, and S. 
Pasupuleti. A memory-based rash optimizer. In AAAI-06 Workshop on Heuristic Search, Memory-Based Heuristics and Their Applications, 2006.
[6] E. H. Burrows, W.-K. Wong, X. Fern, F. W. Chaplen, and R. L. Ely. Optimization of pH and nitrogen for enhanced hydrogen production by Synechocystis sp. PCC 6803 via statistical and machine learning methods. Biotechnology Progress, 25:1009–1017, 2009.
[7] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14:265–294, 1978.
[8] Y. Guo and D. Schuurmans. Discriminative batch mode active learning. In Advances in Neural Information Processing Systems (NIPS), 2007.
[9] V. P. Il'ev. An approximation guarantee of the greedy descent algorithm for minimizing a supermodular set function. Discrete Applied Mathematics, 114(1-3):131–146, 2001.
[10] D. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345–383, 2001.
[11] L. Kaufman and P. J. Rousseeuw. Clustering by means of medoids. In Statistical Data Analysis Based on the L1 Norm, pages 405–416, 1987.
[12] D. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, 2008.
[13] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[14] M. Locatelli. Bayesian algorithms for one-dimensional global optimization. Journal of Global Optimization, 10(1):57–76, 1997.
[15] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs (2nd, extended ed.). Springer-Verlag, New York, NY, USA, 1994.
[16] A. Moore and J. Schneider. Memory-based stochastic optimization. In NIPS, 1995.
[17] G. Nemhauser and L. Wolsey. Integer and Combinatorial Optimization. Wiley, New York, 1999.
[18] D. Park and J. Zeikus. 
Improved fuel cell and electrode designs for producing electricity from microbial degradation. Biotechnology and Bioengineering, 81(3):348–355, 2003.
[19] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[20] A. M. Ross. Computing bounds on the expected maximum of correlated normal variables. Methodology and Computing in Applied Probability, 2008.
[21] M. Schonlau. Computer Experiments and Global Optimization. PhD thesis, University of Waterloo, 1997.
[22] H. D. Vinod. Integer programming and the theory of grouping. Journal of the American Statistical Association, 64(326):506–519, 1969.