{"title": "Predictive Entropy Search for Efficient Global Optimization of Black-box Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 918, "page_last": 926, "abstract": "We propose a novel information-theoretic approach for Bayesian optimization called Predictive Entropy Search (PES). At each iteration, PES selects the next evaluation point that maximizes the expected information gained with respect to the global maximum. PES codifies this intractable acquisition function in terms of the expected reduction in the differential entropy of the predictive distribution. This reformulation allows PES to obtain approximations that are both more accurate and efficient than other alternatives such as Entropy Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment of the model hyperparameters while ES cannot. We evaluate PES in both synthetic and real-world applications, including optimization problems in machine learning, finance, biotechnology, and robotics. We show that the increased accuracy of PES leads to significant gains in optimization performance.", "full_text": "Predictive Entropy Search for Ef\ufb01cient Global\n\nOptimization of Black-box Functions\n\nJos\u00b4e Miguel Hern\u00b4andez-Lobato\n\njmh233@cam.ac.uk\nUniversity of Cambridge\n\nMatthew W. Hoffman\nmwh30@cam.ac.uk\nUniversity of Cambridge\n\nZoubin Ghahramani\n\nzoubin@eng.cam.ac.uk\n\nUniversity of Cambridge\n\nAbstract\n\nWe propose a novel information-theoretic approach for Bayesian optimization\ncalled Predictive Entropy Search (PES). At each iteration, PES selects the next\nevaluation point that maximizes the expected information gained with respect to\nthe global maximum. PES codi\ufb01es this intractable acquisition function in terms\nof the expected reduction in the differential entropy of the predictive distribu-\ntion. 
This reformulation allows PES to obtain approximations that are both more accurate and efficient than other alternatives such as Entropy Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment of the model hyperparameters while ES cannot. We evaluate PES in both synthetic and real-world applications, including optimization problems in machine learning, finance, biotechnology, and robotics. We show that the increased accuracy of PES leads to significant gains in optimization performance.

1 Introduction

Bayesian optimization techniques form a successful approach for optimizing black-box functions [5]. The goal of these methods is to find the global maximizer of a nonlinear and generally non-convex function f whose derivatives are unavailable. Furthermore, the evaluations of f are usually corrupted by noise, and the process that queries f can be computationally or economically very expensive. To address these challenges, Bayesian optimization devotes additional effort to modeling the unknown function f and its behavior. These additional computations aim to minimize the number of evaluations that are needed to find the global optimum.
Optimization problems are widespread in science and engineering, and as a result so are Bayesian approaches to this problem. Bayesian optimization has successfully been used in robotics to adjust the parameters of a robot's controller to maximize gait speed and smoothness [16], as well as for parameter tuning in computer graphics [6]. Another example application, in drug discovery, is to find the chemical derivative of a particular molecule that best treats a given disease [20].
Finally, Bayesian optimization can also be used to find optimal hyper-parameter values for statistical [29] and machine learning techniques [24].
As described above, we are interested in finding the global maximizer x⋆ = arg max_{x∈X} f(x) of a function f over some bounded domain, typically X ⊂ R^d. We assume that f(x) can only be evaluated via queries to a black-box that provides noisy outputs of the form yi ∼ N(f(xi), σ²). We note, however, that our framework can be extended to other non-Gaussian likelihoods. In this setting, we describe a sequential search algorithm that, after n iterations, proposes to evaluate f at some location xn+1. To make this decision the algorithm conditions on all previous observations Dn = {(x1, y1), ..., (xn, yn)}. After N iterations the algorithm makes a final recommendation x̃N for the global maximizer of the latent function f.

Algorithm 1 Generic Bayesian optimization
Input: a black-box with unknown mean f
1: for n = 1, ..., N do
2:   select xn = arg max_{x∈X} αn−1(x)
3:   query the black-box at xn to obtain yn
4:   augment data Dn = Dn−1 ∪ {(xn, yn)}
5: end for
6: return x̃N = arg max_{x∈X} μN(x)

Algorithm 2 PES acquisition function
Input: a candidate x; data Dn
1: sample M hyperparameter values {ψ(i)}
2: for i = 1, ..., M do
3:   sample f(i) ∼ p(f|Dn, ψ(i))
4:   set x⋆(i) ← arg max_{x∈X} f(i)(x)
5:   compute m0(i), V0(i) and m̃(i), ṽ(i)
6:   compute vn(i)(x) and vn(i)(x|x⋆(i))
7: end for
8: return αn(x) as in (10)
(The quantities in lines 3–5 do not depend on the candidate x and can be precomputed.)

We take a Bayesian approach to the problem described above and use a probabilistic model for the latent function f to guide the search and to select x̃N. In this work we use a zero-mean Gaussian process (GP) prior for f [22].
This prior is specified by a positive-definite kernel function k(x, x′). Given any finite collection of points {x1, ..., xn}, the values of f at these points are jointly zero-mean Gaussian with covariance matrix Kn, where [Kn]ij = k(xi, xj). For the Gaussian likelihood described above, the vector of concatenated observations yn is also jointly Gaussian with zero mean. Therefore, at any location x, the latent function f(x) conditioned on past observations Dn is Gaussian with marginal mean μn(x) and variance vn(x) given by

μn(x) = kn(x)ᵀ(Kn + σ²I)⁻¹yn ,    vn(x) = k(x, x) − kn(x)ᵀ(Kn + σ²I)⁻¹kn(x) ,    (1)

where kn(x) is a vector of cross-covariance terms between x and {x1, ..., xn}.
Bayesian optimization techniques use the above predictive distribution p(f(x)|Dn) to guide the search for the global maximizer x⋆. In particular, p(f(x)|Dn) is used during the computation of an acquisition function αn(x) that is optimized at each iteration to determine the next evaluation location xn+1. This process is shown in Algorithm 1. Intuitively, the acquisition function αn(x) should be high in areas where the maximum is most likely to lie given the current data. However, αn(x) should also encourage exploration of the search space to guarantee that the recommendation x̃N is a global optimum of f, not just a global optimum of the posterior mean. Several acquisition functions have been proposed in the literature. Some examples are the probability of improvement (PI) [14], the expected improvement (EI) [19, 13] and upper confidence bounds (UCB) [26].
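For concreteness, the predictive equations in (1) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the squared-exponential kernel and the hyperparameter values are placeholders.

```python
import numpy as np

def sq_exp_kernel(A, B, gamma2=1.0, ell=0.1):
    # Squared-exponential kernel k(x, x') = gamma^2 exp(-0.5 ||x - x'||^2 / ell^2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return gamma2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xstar, sigma2=1e-6, **kw):
    # Equation (1): mu_n(x) = k_n(x)^T (K_n + sigma^2 I)^{-1} y_n
    #               v_n(x)  = k(x, x) - k_n(x)^T (K_n + sigma^2 I)^{-1} k_n(x)
    K = sq_exp_kernel(X, X, **kw) + sigma2 * np.eye(len(X))
    Ks = sq_exp_kernel(Xstar, X, **kw)
    mu = Ks @ np.linalg.solve(K, y)
    v = np.diag(sq_exp_kernel(Xstar, Xstar, **kw)) \
        - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, v
```

At an observed location the predictive mean approaches the (near-noiseless) observation and the variance collapses; far away, the variance reverts to the prior amplitude.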
Alternatively, one can combine several of these acquisition functions [10].
The acquisition functions described above are based on probabilistic measures of improvement (PI and EI) or on optimistic estimates of the latent function (UCB), which implicitly trade off between exploiting the posterior mean and exploring based on the uncertainty. An alternative approach, introduced by [28], proposes maximizing the expected posterior information gain about the global maximizer x⋆, evaluated over a grid in the input space. A similar strategy was later employed by [9], which, although it requires no such grid, instead relies on a difficult-to-evaluate approximation. In Section 2 we derive a rearrangement of this information-based acquisition function which leads to a more straightforward approximation that we call Predictive Entropy Search (PES). In Section 3 we show empirically that our approximation is more accurate than that of [9]. We evaluate this claim on both synthetic and real-world problems and further show that this leads to real gains in performance.

2 Predictive entropy search

We propose to follow the information-theoretic method for active data collection described in [17]. We are interested in maximizing information about the location x⋆ of the global maximum, whose posterior distribution is p(x⋆|Dn). Our current information about x⋆ can be measured in terms of the negative differential entropy of p(x⋆|Dn). Therefore, our strategy is to select the xn+1 which maximizes the expected reduction in this quantity. The corresponding acquisition function is

αn(x) = H[p(x⋆|Dn)] − E_{p(y|Dn,x)}[H[p(x⋆|Dn ∪ {(x, y)})]] ,    (2)

where H[p(x)] = −∫ p(x) log p(x) dx represents the differential entropy of its argument and the expectation above is taken with respect to the posterior predictive distribution of y given x. The exact evaluation of (2) is infeasible in practice.
The main difficulties are that i) p(x⋆|Dn ∪ {(x, y)}) must be computed for many different values of x and y during the optimization of (2), and ii) the entropy computations themselves are not analytical. In practice, a direct evaluation of (2) is only possible after performing many approximations [9]. To avoid this, we follow the approach described in [11] by noting that (2) can be equivalently written as the mutual information between x⋆ and y given Dn. Since the mutual information is a symmetric function, αn(x) can be rewritten as

αn(x) = H[p(y|Dn, x)] − E_{p(x⋆|Dn)}[H[p(y|Dn, x, x⋆)]] ,    (3)

where p(y|Dn, x, x⋆) is the posterior predictive distribution for y given the observed data Dn and the location of the global maximizer of f. Intuitively, conditioning on the location x⋆ pushes the posterior predictions up in locations around x⋆ and down in regions away from x⋆. Note that, unlike the previous formulation, this objective is based on the entropies of predictive distributions, which are analytic or can be easily approximated, rather than on the entropies of distributions on x⋆, whose approximation is more challenging.
The first term in (3) can be computed analytically using the posterior marginals for f(x) in (1), that is, H[p(y|Dn, x)] = 0.5 log[2πe (vn(x) + σ²)], where we add σ² to vn(x) because y is obtained by adding Gaussian noise with variance σ² to f(x). The second term, on the other hand, must be approximated. We first approximate the expectation in (3) by averaging over samples x⋆(i) drawn approximately from p(x⋆|Dn). For each of these samples, we then approximate the corresponding entropy H[p(y|Dn, x, x⋆(i))] using expectation propagation [18].
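The first, analytic term of (3) is just the differential entropy of a one-dimensional Gaussian; a trivial helper (the function name is ours) makes it explicit, since the same expression reappears in (10):

```python
import math

def gaussian_predictive_entropy(v_n, sigma2):
    # H[p(y|D_n, x)] = 0.5 * log(2*pi*e*(v_n(x) + sigma^2)),
    # the differential entropy of a Gaussian with variance v_n(x) + sigma^2.
    return 0.5 * math.log(2 * math.pi * math.e * (v_n + sigma2))
```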
The code for all these operations is publicly available at http://jmhl.org.

2.1 Sampling from the posterior over global maxima

In this section we show how to approximately sample from the conditional distribution of the global maximizer x⋆ given the observed data Dn, that is,

p(x⋆|Dn) = p(f(x⋆) = max_{x∈X} f(x) | Dn) .    (4)

If the domain X is restricted to some finite set of m points, the latent function f takes the form of an m-dimensional vector f. The probability that the ith element of f is optimal can then be written as ∫ p(f|Dn) ∏_{j≤m} I[fi ≥ fj] df. This suggests the following generative process: i) draw a sample from the posterior distribution p(f|Dn) and ii) return the index of the maximum element in the sampled vector. This process is known as Thompson sampling or probability matching when used as an arm-selection strategy in multi-armed bandits [8]. This same approach could be used for sampling the maximizer over a continuous domain X. At first glance this would require constructing an infinite-dimensional object representing the function f. To avoid this, one could sequentially construct f while it is being optimized. However, evaluating such an f would ultimately have cost O(m³), where m is the number of function evaluations necessary to find the optimum. Instead, we propose to sample and optimize an analytic approximation to f. We briefly derive this approximation below; more detail is given in Appendix A.
Given a shift-invariant kernel k, Bochner's theorem [4] asserts the existence of its Fourier dual s(w), which is equal to the spectral density of k. Letting p(w) = s(w)/α be the associated normalized density, we can write the kernel as the expectation

k(x, x′) = α E_{p(w)}[e^{−iwᵀ(x−x′)}] = 2α E_{p(w,b)}[cos(wᵀx + b) cos(wᵀx′ + b)] ,    (5)

where b ∼ U[0, 2π]. Let φ(x) = √(2α/m) cos(Wx + b) denote an m-dimensional feature mapping, where W and b consist of m stacked samples from p(w, b). The kernel k can then be approximated by the inner product of these features, k(x, x′) ≈ φ(x)ᵀφ(x′). This approach was used by [21] as an approximation method in the context of kernel methods. The feature mapping φ(x) allows us to approximate the Gaussian process prior for f with a linear model f(x) = φ(x)ᵀθ, where θ ∼ N(0, I) is a standard Gaussian. By conditioning on Dn, the posterior for θ is also multivariate Gaussian, θ|Dn ∼ N(A⁻¹Φᵀyn, σ²A⁻¹), where A = ΦᵀΦ + σ²I and Φᵀ = [φ(x1) ... φ(xn)]. Let φ(i) and θ(i) be a random set of features and the corresponding posterior weights, both sampled according to the generative process given above. They can then be used to construct the function f(i)(x) = φ(i)(x)ᵀθ(i), which is an approximate posterior sample of f, albeit one with a finite parameterization. We can then maximize this function to obtain x⋆(i) = arg max_{x∈X} f(i)(x), which is approximately distributed according to p(x⋆|Dn). Note that for early iterations, when n < m, we can efficiently sample θ(i) with cost O(n²m) using the method described in Appendix B.2 of [23]. This allows us to use a large number of features in φ(i)(x).

2.2 Approximating the predictive entropy

We now show how to approximate H[p(y|Dn, x, x⋆)] in (3).
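Before doing so, it is worth making the random-feature sampler of Section 2.1 concrete. The sketch below assumes the squared-exponential kernel used later in the paper (whose spectral density is Gaussian); the function name and the grid-based maximization are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior_function(X, y, m=500, gamma2=1.0, ell=0.1, sigma2=1e-6):
    """Draw an analytic approximate sample from p(f|D_n) via random features.

    For the squared-exponential kernel, w ~ N(0, ell^{-2} I) and b ~ U[0, 2*pi],
    with phi(x) = sqrt(2*gamma2/m) * cos(Wx + b) and alpha = gamma2.
    """
    d = X.shape[1]
    W = rng.normal(scale=1.0 / ell, size=(m, d))
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    phi = lambda Z: np.sqrt(2 * gamma2 / m) * np.cos(Z @ W.T + b)
    Phi = phi(X)                                  # n x m design matrix
    A = Phi.T @ Phi + sigma2 * np.eye(m)          # A = Phi^T Phi + sigma^2 I
    mean = np.linalg.solve(A, Phi.T @ y)          # posterior mean of theta
    cov = sigma2 * np.linalg.inv(A)               # posterior covariance of theta
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(m))
    theta = mean + L @ rng.normal(size=m)         # theta ~ N(A^{-1} Phi^T y, sigma^2 A^{-1})
    return lambda Z: phi(Z) @ theta               # f^{(i)}(x) = phi(x)^T theta^{(i)}
```

Maximizing the returned function (for example over a dense grid, or with a gradient-based optimizer, since it is analytic) yields one approximate draw x⋆(i) from p(x⋆|Dn).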
Note that we can write the argument to H in this expression as p(y|Dn, x, x⋆) = ∫ p(y|f(x)) p(f(x)|Dn, x⋆) df(x), where p(f(x)|Dn, x⋆) is the posterior distribution on f(x) given Dn and the location x⋆ of the global maximizer of f. When the likelihood p(y|f(x)) is Gaussian, p(f(x)|Dn) is analytically tractable since it is the predictive distribution of a Gaussian process. However, by further conditioning on the location x⋆ of the global maximizer we are introducing additional constraints, namely that f(z) ≤ f(x⋆) for all z ∈ X. These constraints make p(f(x)|Dn, x⋆) intractable. To circumvent this difficulty, we instead use the following simplified constraints:

C1. x⋆ is a local maximum. This is achieved by letting ∇f(x⋆) = 0 and ensuring that ∇²f(x⋆) is negative definite. We further assume that the non-diagonal elements of ∇²f(x⋆), denoted by upper[∇²f(x⋆)], are known; for example, they could all be zero. This simplifies the negative-definite constraint. We denote by C1.1 the constraint given by ∇f(x⋆) = 0 and upper[∇²f(x⋆)] = 0. We denote by C1.2 the constraint that forces the elements of diag[∇²f(x⋆)] to be negative.

C2. f(x⋆) is larger than past observations. We also assume that f(x⋆) ≥ f(xi) for all i ≤ n. However, we only observe f(xi) noisily via yi. To avoid making inference on these latent function values, we approximate the above hard constraints with the soft constraint f(x⋆) > ymax + ε, where ε ∼ N(0, σ²) and ymax is the largest yi seen so far.

C3. f(x) is smaller than f(x⋆). This simplified constraint only conditions on the given x, rather than requiring f(z) ≤ f(x⋆) for all z ∈ X.

We incorporate these simplified constraints into p(f(x)|Dn) to approximate p(f(x)|Dn, x⋆).
This is achieved by multiplying p(f(x)|Dn) with specific factors that encode the above constraints. In what follows we briefly show how to construct these factors; more detail is given in Appendix B.
Consider the latent variable z = [f(x⋆); diag[∇²f(x⋆)]]. To incorporate constraint C1.1 we can condition on the data and on the "observations" given by the constraints ∇f(x⋆) = 0 and upper[∇²f(x⋆)] = 0. Since f is distributed according to a GP, the joint distribution between z and these observations is multivariate Gaussian. The covariance between the noisy observations yn and the extra noise-free derivative observations can be easily computed [25]. The resulting conditional distribution is also multivariate Gaussian, with mean m0 and covariance V0. These computations are similar to those performed in (1). Constraints C1.2 and C2 can then be incorporated by writing

p(z|Dn, C1, C2) ∝ Φ_{σ²}(f(x⋆) − ymax) [∏_{i=1}^{d} I([∇²f(x⋆)]ii ≤ 0)] N(z|m0, V0) ,    (6)

where Φ_{σ²} is the cdf of a zero-mean Gaussian distribution with variance σ². The first new factor in this expression guarantees that f(x⋆) > ymax + ε, where we have marginalized ε out, and the second set of factors guarantees that the entries of diag[∇²f(x⋆)] are negative.
Later integrals that make use of p(z|Dn, C1, C2), however, will not admit a closed-form expression. As a result, we compute a Gaussian approximation q(z) to this distribution using Expectation Propagation (EP) [18]. The resulting algorithm is similar to the implementation of EP for binary classification with Gaussian processes [22]. EP approximates each non-Gaussian factor in (6) with a Gaussian factor whose mean and variance are m̃i and ṽi, respectively.
The EP approximation can then be written as q(z) ∝ [∏_{i=1}^{d+1} N(zi|m̃i, ṽi)] N(z|m0, V0). Note that these computations have so far not depended on x, so we can compute {m0, V0, m̃, ṽ} once and store them for later use, where m̃ = (m̃1, ..., m̃_{d+1}) and ṽ = (ṽ1, ..., ṽ_{d+1}).
We now describe how to compute the predictive variance of some latent function value f(x) given these constraints. Let f = [f(x); f(x⋆)] be the vector given by the concatenation of the values of the latent function at x and x⋆. The joint distribution between f, z, the evaluations yn collected so far, and the derivative "observations" ∇f(x⋆) = 0 and upper[∇²f(x⋆)] = 0 is multivariate Gaussian. Using q(z), we then obtain the following approximation:

p(f|Dn, C1, C2) ≈ ∫ p(f|z, Dn, C1.1) q(z) dz = N(f|mf, Vf) .    (7)

Implicitly we are assuming above that f depends on our observations and constraint C1.1, but is independent of C1.2 and C2 given z. The computations necessary to obtain mf and Vf are similar to those used above and in (1). The required quantities are similar to the ones used by EP to make predictions in the Gaussian process binary classifier [22]. We can then incorporate C3 by multiplying N(f|mf, Vf) with a factor that guarantees f(x) < f(x⋆). The predictive distribution for f(x) given Dn and all the constraints can then be approximated as

p(f(x)|Dn, C1, C2, C3) ≈ Z⁻¹ ∫ I(f1 < f2) N(f|mf, Vf) df2 ,    (8)

where Z is a normalization constant. The variance of the right-hand side of (8) is given by

vn(x|x⋆) = [Vf]1,1 − v⁻¹β(β + α){[Vf]1,1 − [Vf]1,2}² ,    (9)

where v = [−1, 1]ᵀVf[−1, 1], α = m/√v, m = [−1, 1]ᵀmf, β = φ(α)/Φ(α), and φ(·) and Φ(·) are the standard Gaussian density function and cdf, respectively. By further approximating (8) by a Gaussian distribution with the same mean and variance we can write the entropy as H[p(y|Dn, x, x⋆)] ≈ 0.5 log[2πe(vn(x|x⋆) + σ²)].
The computation of (9) can be numerically unstable when v is very close to zero. This occurs when [Vf]1,1 is very similar to [Vf]1,2. To avoid these numerical problems, we multiply [Vf]1,2 by the largest 0 ≤ κ ≤ 1 that guarantees that v > 10⁻¹⁰. This can be understood as slightly reducing the amount of dependence between f(x) and f(x⋆) when x is very close to x⋆. Finally, fixing upper[∇²f(x⋆)] to be zero can also produce poor predictions when the actual f does not satisfy this constraint. To avoid this, we instead fix this quantity to upper[∇²f(i)(x⋆)], where f(i) is the ith sample function optimized in Section 2.1 to sample x⋆(i).

2.3 Hyperparameter learning and the PES acquisition function

We now show how the previous approximations are integrated to compute the acquisition function used by predictive entropy search (PES). This acquisition function performs a formal treatment of the hyperparameters. Let ψ denote a vector of hyperparameters, which includes any kernel parameters as well as the noise variance σ². Let p(ψ|Dn) ∝ p(ψ) p(Dn|ψ) denote the posterior distribution over these parameters, where p(ψ) is a hyperprior and p(Dn|ψ) is the GP marginal likelihood.
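Returning briefly to the variance formula (9): given mf and Vf it is a direct computation, sketched below (helper names are ours; the κ-based stabilization described above is omitted for brevity).

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditioned_variance(Vf, mf):
    # Equation (9): variance of f(x) under the soft constraint f(x) < f(x_*),
    # where f = [f(x); f(x_*)] ~ N(mf, Vf).
    v = Vf[0][0] - 2.0 * Vf[0][1] + Vf[1][1]      # v = [-1, 1]^T Vf [-1, 1]
    m = mf[1] - mf[0]                             # m = [-1, 1]^T mf
    alpha = m / math.sqrt(v)
    beta = norm_pdf(alpha) / norm_cdf(alpha)
    return Vf[0][0] - (beta * (beta + alpha) / v) * (Vf[0][0] - Vf[0][1]) ** 2
```

Conditioning on f(x) < f(x⋆) can only remove uncertainty, so the returned value never exceeds [Vf]1,1.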
For a fully Bayesian treatment of ψ we must marginalize the acquisition function (3) with respect to this posterior. The corresponding integral has no analytic expression and must be approximated using Monte Carlo. This approach is also taken in [24].
We draw M samples {ψ(i)} from p(ψ|Dn) using slice sampling [27]. Let x⋆(i) denote a sampled global maximizer drawn from p(x⋆|Dn, ψ(i)) as described in Section 2.1. Furthermore, let vn(i)(x) and vn(i)(x|x⋆(i)) denote the predictive variances computed as described in Section 2.2 when the model hyperparameters are fixed to ψ(i). We then write the marginalized acquisition function as

αn(x) = (1/M) Σ_{i=1}^{M} { 0.5 log[vn(i)(x) + σ²] − 0.5 log[vn(i)(x|x⋆(i)) + σ²] } .    (10)

Note that PES is effectively marginalizing the original acquisition function (2) over p(ψ|Dn). This is a significant advantage with respect to other methods that optimize the same information-theoretic acquisition function but do not marginalize over the hyper-parameters. For example, the approach of [9] approximates (2) only for fixed ψ. The resulting approximation is computationally very expensive, and recomputing it to average over multiple samples from p(ψ|Dn) is infeasible in practice.
Algorithm 2 shows pseudo-code for computing the PES acquisition function. Note that most of the computations necessary for evaluating (10) can be done independently of the input x, as noted in the pseudo-code. This initial cost is dominated by a matrix inversion necessary to pre-compute V for each hyperparameter sample. The resulting complexity is O[M(n + d + d(d−1)/2)³]. This cost can be reduced to O[M(n + d)³] by ignoring the derivative observations imposed on upper[∇²f(x⋆)] by constraint C1.1.
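Once the per-sample variances are available, (10) itself is a cheap average; a minimal sketch (the function name is ours, and σ² is shared across samples here for simplicity, whereas in the paper it is part of each ψ(i)):

```python
import math

def pes_acquisition(vn_list, vn_cond_list, sigma2):
    # Equation (10): alpha_n(x) = (1/M) * sum_i { 0.5*log[v_n^(i)(x) + sigma^2]
    #                                           - 0.5*log[v_n^(i)(x | x_*^(i)) + sigma^2] }
    M = len(vn_list)
    return sum(0.5 * math.log(v + sigma2) - 0.5 * math.log(vc + sigma2)
               for v, vc in zip(vn_list, vn_cond_list)) / M
```

Since conditioning on x⋆ can only shrink the predictive variance, each summand, and hence the acquisition value, is nonnegative.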
Nevertheless, in the problems that we consider d is very small (less than 20). After these precomputations are done, the evaluation of (10) is O[M(n + d + d(d−1)/2)].

3 Experiments

In our experiments, we use Gaussian process priors for f with squared-exponential kernels k(x, x′) = γ² exp{−0.5 Σ_i (xi − x′i)²/ℓi²}. The corresponding spectral density is zero-mean Gaussian with covariance given by diag([ℓ1⁻², ..., ℓd⁻²]) and normalizing constant α = γ². The model hyperparameters are {γ, ℓ1, ..., ℓd, σ²}. We use broad, uninformative Gamma hyperpriors.

Figure 1: Comparison of different estimates of the objective function αn(x) given by (2). Left, ground truth obtained by the rejection sampling method RS. Middle, approximation produced by the ES method. Right, approximation produced by the proposed PES method. These plots show that the PES objective is much more similar to the RS ground truth than the ES objective.

First, we analyze the accuracy of PES in the task of approximating the differential entropy (2). We compare the PES approximation (10) with the approximation used by the entropy search (ES) method [9]. We also compare with the ground truth for (2) obtained using a rejection sampling (RS) algorithm based on (3). For this experiment we generate the data Dn using an objective function f sampled from the Gaussian process prior as in [9]. The domain X of f is fixed to be [0, 1]² and data are generated using γ² = 1, σ² = 10⁻⁶, and ℓi² = 0.1. To compute (10) we avoid sampling the hyperparameters and use the known values directly. We further fix M = 200 and m = 1000.
The ground truth rejection sampling scheme works as follows. First, X is discretized using a uniform grid.
The expectation with respect to p(x⋆|Dn) in (3) is then approximated using sampling. For this, we sample x⋆ by evaluating a random sample from p(f|Dn) on each grid cell and then selecting the cell with the highest value. Given x⋆, we then approximate H[p(y|Dn, x, x⋆)] by rejection sampling. We draw samples from p(f|Dn) and reject those whose highest-valued grid cell is not x⋆. Finally, we approximate H[p(y|Dn, x, x⋆)] by first adding zero-mean Gaussian noise with variance σ² to the evaluations at x of the functions not rejected during the previous step, and then estimating the differential entropy of the resulting samples using kernels [1].
Figure 1 shows the objective functions produced by RS, ES and PES for a particular Dn with 10 measurements whose locations are selected uniformly at random in [0, 1]². The locations of the collected measurements are displayed with an "x" in the plots. The particular objective function used to generate the measurements in Dn is displayed in the left part of Figure 2. The plots in Figure 1 show that the PES approximation to (2) is more similar to the ground truth given by RS than the approximation produced by ES. In this figure we also see a discrepancy between RS and PES at locations near x = (0.572, 0.687). This difference is an artifact of the discretization used in RS. By zooming in and drawing many more samples we would see the same behavior in both plots.
We now evaluate the performance of PES in the task of finding the optimum of synthetic black-box objective functions. For this, we reproduce the within-model comparison experiment described in [9]. In this experiment we optimize objective functions defined in the 2-dimensional unit domain X = [0, 1]².
Each objective function is generated by first sampling 1024 function values from the GP prior assumed by PES, using the same γ², ℓi and σ² as in the previous experiment. The objective function is then given by the resulting GP posterior mean. We generated a total of 1000 objective functions by following this procedure. The left plot in Figure 2 shows an example function.
In these experiments we compared the performance of PES with that of ES [9] and expected improvement (EI) [13], a widely used acquisition function in the Bayesian optimization literature. We again assume that the optimal hyper-parameter values are known to all methods. Predictive performance is then measured in terms of the immediate regret (IR) |f(x̃n) − f(x⋆)|, where x⋆ is the known location of the global maximum and x̃n is the recommendation of each algorithm had we stopped at step n; for all methods this is given by the maximizer of the posterior mean. The right plot in Figure 2 shows the decimal logarithm of the median of the IR obtained by each method across the 1000 different objective functions. Confidence bands equal to one standard deviation are obtained using the bootstrap method. Note that while averaging these results is also interesting, corresponding to the expected performance averaged over the prior, here we report the median IR

Figure 2: Left, example of objective functions f.
Right, median of the immediate regret (IR) for the methods PES, ES and EI in the experiments with synthetic objective functions.

Figure 3: Median of the immediate regret (IR) for the methods EI, ES, PES and PES-NB in the experiments with well-known synthetic benchmark functions.

because the empirical distribution of IR values is very heavy-tailed. In this case, the median is more representative of the exact location of the bulk of the data. These results indicate that the best method in this setting is PES, which significantly outperforms ES and EI. The plot also shows that in this case ES is significantly better than EI.
We perform another series of experiments in which we optimize well-known synthetic benchmark functions, including a mixture of cosines [2] and Branin-Hoo (both functions defined in [0, 1]²) as well as Hartmann-6 (defined in [0, 1]⁶) [15]. In all instances, we fix the measurement noise to σ² = 10⁻³. For both PES and EI we marginalize the hyperparameters ψ using the approach described in Section 2.3. ES, by contrast, cannot average its approximation of (2) over the posterior on ψ. Instead, ES works by fixing ψ to an estimate of its posterior mean (obtained using slice sampling [27]). To evaluate the gains produced by the fully Bayesian treatment of ψ in PES, we also compare with a version of PES (PES-NB) which performs the same non-Bayesian (NB) treatment of ψ as ES. In PES-NB we use a single fixed hyperparameter value, as in previous sections, given by the posterior mean of ψ. All the methods are initialized with three random measurements collected using latin hypercube sampling [5].
The plots in Figure 3 show the median IR obtained by each method on each function across 250 random initializations. Overall, PES is better than PES-NB and ES. Furthermore, PES-NB is also significantly better than ES in most of the cases.
These results show that the fully Bayesian treatment of ψ in PES is advantageous and that PES can produce better approximations than ES. Note that PES performs better than EI on the Branin and cosines functions, while EI is significantly better on the Hartmann problem. This appears to be due to the fact that entropy-based strategies explore more aggressively, which in higher-dimensional spaces takes more iterations. The Hartmann problem, however, is a relatively simple problem, and as a result the comparatively more greedy behavior of EI does not have significant adverse consequences. Note that the synthetic functions optimized in the previous experiment were much more multimodal than the ones considered here.

3.1 Experiments with real-world functions

We finally optimize different real-world cost functions. The first one (NNet) returns the predictive accuracy of a neural network on a random train/test partition of the Boston Housing dataset [3].
\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u22125.5\u22124.5\u22123.5\u22122.5\u22121.5\u22120.501020304050Number of Function EvaluationsLog10 Median IRMethods\u25cf\u25cf\u25cfEIESPESResults on Synthetic Cost Functions\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u22123.9\u22122.9\u22121.9\u22120.90102030Number of Function EvaluationsLog10 Median IRMethods\u25cf\u25cf\u25cf\u25cfEIESPESPES\u2212NBResults on Branin Cost 
Function\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u22124.6\u22123.6\u22122.6\u22121.6\u22120.60102030Number of Function EvaluationsResults on Cosines Cost Function\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf
\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u25cf\u22122.7\u22121.7\u22120.701020304050Number of Function EvaluationsResults on Hartmann Cost Function\fFigure 4: Median of the immediate regret (IR) for the methods PES, PES-NB, ES and EI in the experiments\nwith non-analytic real-world cost functions.\n\nThe variables to optimize are the weight-decay parameter and the number of training iterations for\nthe neural network. The second function (Hydrogen) returns the amount of hydrogen production of\na particular bacteria in terms of the PH and Nitrogen levels of the growth medium [7]. The third\none (Portfolio) returns the ratio of the mean and the standard deviation (the Sharpe ratio) of the\n1-year ahead returns generated by simulations from a multivariate time-series model that is adjusted\nto the daily returns of stocks AXP, BA and HD. The time-series model is formed by univariate\nGARCH models connected with a Student\u2019s t copula [12]. These three functions (NNet, Hydrogen\nand Portfolio) have as domain [0, 1]2. Furthermore, in these examples, the ground truth function\nthat we want to optimize is unknown and is only available through noisy measurements. To obtain\na ground truth, we approximate each cost function as the predictive distribution of a GP that is\nadjusted to data sampled from the original function (1000 uniform samples for NNet and Portfolio\nand all the available data for Hydrogen [7]). Finally, we also consider another real-world function\nthat returns the walking speed of a bipedal robot [30]. This function is de\ufb01ned in [0, 1]8 and its\ninputs are the parameters of the robot\u2019s controller. In this case the ground truth function is noiseless\nand can be exactly evaluated through expensive numerical simulation. 
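Such a noiseless simulator can be turned into a noisy black-box objective by adding zero-mean Gaussian measurement noise, as sketched below. Here `walker_speed` is a hypothetical stand-in for the expensive walker simulation (its body is placeholder arithmetic, not the real dynamics); the two noise levels follow the Walker setups described next:

```python
import numpy as np

# Hypothetical stand-in for the expensive bipedal-walker simulation: maps an
# 8-dimensional controller parameter vector in [0, 1]^8 to a walking speed.
# Placeholder dynamics only, for illustration.
def walker_speed(params):
    params = np.asarray(params, dtype=float)
    assert params.shape == (8,)
    return float(np.sum(np.sin(np.pi * params)))

def make_noisy_objective(f, sigma, rng):
    """Wrap a noiseless simulator f as a noisy black-box objective by
    adding zero-mean Gaussian measurement noise of std. dev. sigma."""
    def objective(x):
        return f(x) + sigma * rng.standard_normal()
    return objective

rng = np.random.default_rng(0)
walker_a = make_noisy_objective(walker_speed, sigma=0.01, rng=rng)
walker_b = make_noisy_objective(walker_speed, sigma=0.1, rng=rng)

x = np.full(8, 0.5)           # one candidate controller setting
ya, yb = walker_a(x), walker_b(x)  # noisy evaluations of the same point
```

The optimizer only ever sees `walker_a(x)` or `walker_b(x)`, never the underlying noiseless value.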
We consider two versions of this problem: (Walker A) with zero-mean additive noise of σ = 0.01 and (Walker B) with σ = 0.1.
Figure 4 shows the median IR values obtained by each method on each function across 250 random initializations, except for Hydrogen, where we used 500 due to its higher level of noise. Overall, PES, ES and PES-NB perform similarly on NNet, Hydrogen and Portfolio. EI performs rather poorly on these first three functions: it seems to make excessively greedy decisions and fails to explore the search space enough. This greedy strategy is advantageous on Walker A, however, where EI obtains the best results. By contrast, PES, ES and PES-NB tend to explore more on this latter problem, which leads to worse results than those of EI. Nevertheless, PES is significantly better than PES-NB and ES on both Walker datasets and better than EI on the noisier Walker B. In this case, the fully Bayesian treatment of hyperparameters performed by PES produces improvements in performance.

4 Conclusions

We have proposed a novel information-theoretic approach for Bayesian optimization. Our method, predictive entropy search (PES), greedily maximizes the amount of one-step information on the location x⋆ of the global maximum using its posterior differential entropy. Since this objective function is intractable, PES approximates the original objective using a reparameterization that measures entropy in the posterior predictive distribution of the function evaluations. PES produces more accurate approximations than Entropy Search (ES), a method based on the original, non-transformed acquisition function. Furthermore, PES can easily marginalize its approximation with respect to the posterior distribution of its hyperparameters, while ES cannot. Experiments with synthetic and real-world functions show that PES often outperforms ES in terms of immediate regret.
In these experiments, we also observe that PES often produces better results than expected improvement (EI), a popular heuristic for Bayesian optimization. EI often seems to make excessively greedy decisions, while PES tends to explore more. As a result, EI tends to perform better on simple objective functions, while it often gets stuck on noisier objectives or on functions with many modes.

Acknowledgements J.M.H.L. acknowledges support from the Rafael del Pino Foundation.

[Plot data omitted: Figure 4 curves for the NNet, Hydrogen, Portfolio, Walker A and Walker B cost functions; x-axis: Function Evaluations; y-axis: Log10 Median IR; methods EI, ES, PES and PES-NB.]

References
[1] I. Ahmad and P.-E. Lin. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Transactions on Information Theory, 22(3):372-375, 1976.
[2] B. S. Anderson, A. W. Moore, and D. Cohn. A nonparametric approach to noisy and costly optimization. In ICML, pages 17-24, 2000.
[3] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[4] S. Bochner.
Lectures on Fourier Integrals. Princeton University Press, 1959.
[5] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report UBC TR-2009-23 and arXiv:1012.2599v1, Dept. of Computer Science, University of British Columbia, 2009.
[6] E. Brochu, N. de Freitas, and A. Ghosh. Active preference learning with discrete choice data. In NIPS, pages 409-416, 2007.
[7] E. H. Burrows, W.-K. Wong, X. Fern, F. W. R. Chaplen, and R. L. Ely. Optimization of pH and nitrogen for enhanced hydrogen production by Synechocystis sp. PCC 6803 via statistical and machine learning methods. Biotechnology Progress, 25(4):1009-1017, 2009.
[8] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249-2257, 2011.
[9] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13, 2012.
[10] M. W. Hoffman, E. Brochu, and N. de Freitas. Portfolio allocation for Bayesian optimization. In UAI, pages 327-336, 2011.
[11] N. Houlsby, J. M. Hernández-Lobato, F. Huszár, and Z. Ghahramani. Collaborative Gaussian processes for preference learning. In NIPS, pages 2096-2104, 2012.
[12] E. Jondeau and M. Rockinger. The copula-GARCH model of conditional dependencies: An international stock market application. Journal of International Money and Finance, 25(5):827-853, 2006.
[13] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.
[14] H. Kushner. A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 1964.
[15] D. Lizotte. Practical Bayesian Optimization.
PhD thesis, University of Alberta, Canada, 2008.
[16] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic gait optimization with Gaussian process regression. In IJCAI, pages 944-949, 2007.
[17] D. J. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.
[18] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[19] J. Močkus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. In L. Dixon and G. Szego, editors, Toward Global Optimization, volume 2. Elsevier, 1978.
[20] D. M. Negoescu, P. I. Frazier, and W. B. Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346-363, 2011.
[21] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177-1184, 2007.
[22] C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[23] M. W. Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759-813, 2008.
[24] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2960-2968, 2012.
[25] E. Solak, R. Murray-Smith, W. E. Leithead, D. J. Leith, and C. E. Rasmussen. Derivative observations in Gaussian process models of dynamic systems. In NIPS, pages 1057-1064, 2003.
[26] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, pages 1015-1022, 2010.
[27] J. Vanhatalo, J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari.
Bayesian modeling with Gaussian processes using the MATLAB toolbox GPstuff (v3.3). CoRR, abs/1206.5754, 2012.
[28] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509-534, 2009.
[29] Z. Wang, S. Mohamed, and N. de Freitas. Adaptive Hamiltonian and Riemann Monte Carlo samplers. In ICML, 2013.
[30] E. Westervelt and J. Grizzle. Feedback Control of Dynamic Bipedal Robot Locomotion. Control and Automation Series. CRC Press, 2007.