{"title": "Multi-Information Source Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 4288, "page_last": 4298, "abstract": "We consider Bayesian methods for multi-information source optimization (MISO), in which we seek to optimize an expensive-to-evaluate black-box objective function while also accessing cheaper but biased and noisy approximations (\"information sources\"). We present a novel algorithm that outperforms the state of the art for this problem by using a Gaussian process covariance kernel better suited to MISO than those used by previous approaches, and an acquisition function based on a one-step optimality analysis supported by efficient parallelization. We also provide a novel technique to guarantee the asymptotic quality of the solution provided by this algorithm. Experimental evaluations demonstrate that this algorithm consistently finds designs of higher value at less cost than previous approaches.", "full_text": "Multi-Information Source Optimization\n\nDepartment of Systems and Industrial Engineering\n\nMatthias Poloczek\n\nUniversity of Arizona\n\nTucson, AZ 85721\n\npoloczek@email.arizona.edu\n\nJialei Wang\n\nChief Analytics Of\ufb01ce\n\nIBM\n\nArmonk, NY 10504\njw865@cornell.edu\n\nSchool of Operations Research and Information Engineering\n\nPeter I. Frazier\n\nCornell University\nIthaca, NY 14853\n\npf98@cornell.edu\n\nAbstract\n\nWe consider Bayesian methods for multi-information source optimization (MISO),\nin which we seek to optimize an expensive-to-evaluate black-box objective function\nwhile also accessing cheaper but biased and noisy approximations (\u201cinformation\nsources\u201d). We present a novel algorithm that outperforms the state of the art for this\nproblem by using a Gaussian process covariance kernel better suited to MISO than\nthose used by previous approaches, and an acquisition function based on a one-step\noptimality analysis supported by ef\ufb01cient parallelization. 
We also provide a novel technique to guarantee the asymptotic quality of the solution provided by this algorithm. Experimental evaluations demonstrate that this algorithm consistently finds designs of higher value at less cost than previous approaches.\n\n1 Introduction\n\nWe consider Bayesian multi-information source optimization (MISO), in which we optimize an expensive-to-evaluate black-box objective function while optionally accessing cheaper biased noisy approximations, often referred to as “information sources (IS)”. This arises when tuning machine learning algorithms: instead of using the whole dataset for the hyperparameter optimization, one may use a small subset or even a smaller related dataset [34, 15, 17]. We also face this problem in robotics: we can evaluate a parameterized robot control policy in simulation, in a laboratory, or in a field test [15]. Cheap approximations promise a route to tractability, but bias and noise complicate their use. An unknown bias arises whenever a computational model incompletely models a real-world phenomenon, and is pervasive in applications.\nWe present a novel algorithm for this problem, misoKG, that is tolerant to both noise and bias and improves substantially over the state of the art. Specifically, our contributions are:\n\n• The algorithm uses a novel acquisition function that maximizes the incremental gain per unit cost. This acquisition function generalizes and parallelizes previously proposed knowledge-gradient methods for single-IS Bayesian optimization [7, 8, 28, 26, 37] to MISO.\n• We prove that this algorithm provides an asymptotically near-optimal solution.
If the search domain is finite, this result establishes the consistency of misoKG. We present a novel proof technique that yields an elegant, short argument and is thus of independent interest.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nRelated Work: To our knowledge, MISO was first considered by Swersky, Snoek, and Adams [34], under the name multi-task Bayesian optimization. This name was used to suggest problems in which the auxiliary tasks could meaningfully be solved on their own, while we use the term MISO to indicate that the IS may be useful only in support of the primary task. Swersky et al. [34] showed that hyperparameter tuning in classification can be accelerated through evaluation on subsets of the validation data. They proposed a GP model to jointly model such “auxiliary tasks” and the primary task, building on previous work on GP regression for multiple tasks in [3, 10, 35]. They chose points to sample via cost-sensitive entropy search [11, 39], sampling in each iteration a point that maximally reduces uncertainty in the optimum’s location, normalized by the query cost.\nWe demonstrate in experiments that our approach improves over the method of Swersky et al. [34], and we believe this improvement results from two factors: first, our statistical model is more flexible in its ability to model bias that varies across the domain; second, our acquisition function directly and maximally reduces simple regret in one step, unlike predictive entropy search, which maximally reduces the maximizer’s entropy in one step and hence only indirectly reduces regret.\nLam, Allaire, and Willcox [18] also consider MISO, under the name non-hierarchical multi-fidelity optimization. They propose a statistical model that maintains a separate GP for each IS, and fuse them via the method of Winkler [40].
They apply a modified expected improvement acquisition function on these surrogates to first decide what design x* to evaluate, and then select the IS to query; the latter is decided by a heuristic that aims to balance information gain and query cost. We demonstrate in experiments that our approach improves over the method of Lam et al. [18], and we believe this improvement results from two factors: first, their statistical approach assumes an independent prior on each IS, despite their being linked through modeling a common objective; and second, their acquisition function selects the point to sample and the IS to query separately via a heuristic rather than jointly using an optimality analysis.\nBeyond these two works, the most closely related work is in the related problem of multi-fidelity optimization. In this problem, IS are supposed to form a strict hierarchy [16, 14, 6, 24, 20, 19, 15]. In addition, most of these models limit the information that can be obtained from sources of lower fidelity [16, 14, 6, 20, 19]: given the observation of x at some IS ℓ, one cannot learn more about the value of x at IS with higher fidelity by querying IS ℓ anywhere else (see Sect. C for details and a proof). Picheny et al. [24] propose a quantile-based criterion for optimization of stochastic simulators, supposing that all simulators provide unbiased approximations of the true objective. From this body of work, we compare against Kandasamy et al. [15], who present an approach for minimizing both simple and cumulative regret, under the assumption that the maximum bias of an information source strictly decreases with its fidelity.\nAn interesting special case of MISO is warm-starting Bayesian optimization. Here information sources correspond to samples that were taken previously on objective functions related to the current objective.
For example, this scenario occurs when we re-optimize whenever parameters of the objective change or whenever new data becomes available. Poloczek et al. [25] demonstrated that a variant of the algorithm proposed in this article can reduce the optimization costs significantly by warm-starting Bayesian optimization, as does the algorithm of Swersky et al. [34].\nOutside of both the MISO and multi-fidelity settings, Klein et al. [17] considered hyperparameter optimization of machine learning algorithms over a large dataset D. Supposing access to subsets of D of arbitrary sizes, they show how to exploit regularity of performance across dataset sizes to significantly speed up the optimization process for support vector machines and neural networks.\nOur acquisition function is a generalization of the knowledge-gradient policy of Frazier, Powell, and Dayanik [8] to the MISO setting. This generalization requires extending the one-step optimality analysis used to derive the KG policy in the single-IS setting to account for the impact of sampling a cheap approximation on the marginal GP posterior on the primary task.
From this literature, we leverage an idea for computing the expectation of the maximum of a collection of linear functions of a normal random variable, and propose a parallel algorithm to identify and compute the required components.\nThe class of GP covariance kernels we propose is a subset of the class of linear models of coregionalization kernels [10, 2], with a restricted form derived from a generative model particular to MISO. Focusing on a restricted class of kernels designed for our application supports accurate inference with less data, which is important when optimizing expensive-to-evaluate functions.\nOur work also extends the knowledge-gradient acquisition function to the variable cost setting. Similar extensions of expected improvement to the variable cost setting can be found in Snoek et al. [32] (the EI per second criterion) and in Le Gratiet and Cannamela [19].\nWe now formalize the problem we consider in Sect. 2, describe our statistical analysis in Sect. 3.1, specify our acquisition function and parallel computation method in Sects. 3.2 and 3.3, provide a theoretical guarantee in Sect. 3.4, present numerical experiments in Sect. 4, and conclude in Sect. 5.\n\n2 Problem Formulation\n\nGiven a continuous objective function g : D → R on a compact set D ⊂ R^d of feasible designs, our goal is to find a design with objective value close to max_{x∈D} g(x). We have access to M possibly biased and/or noisy IS indexed by ℓ ∈ [M]0. (Here, for any a ∈ Z+ we use [a] as a shorthand for the set {1, 2, . . . , a}, and further define [a]0 = {0, 1, 2, . . . , a}.) Observing IS ℓ at design x provides observations that are independent conditional on f(ℓ, x), and normally distributed with mean f(ℓ, x) and finite variance λ_ℓ(x). In [34], IS ℓ ∈ [M]0 are called “auxiliary tasks” and g the primary task.
These sources are thought of as approximating g, with variable bias. We suppose that g can be observed directly without bias (but possibly with noise) and set f(0, x) = g(x). The bias f(ℓ, x) − g(x) is also referred to as “model discrepancy” in the engineering and simulation literature [1, 4]. Each IS ℓ is also associated with a query cost function c_ℓ : D → R+. We assume that the cost function c_ℓ(x) and the variance function λ_ℓ(x) are both known and continuously differentiable (over D). In practice, these functions may either be provided by domain experts or may be estimated along with other model parameters from data (see Sects. 4 and B.2, and [27]).\n\n3 The misoKG Algorithm\n\nWe now present the misoKG algorithm and describe its two components: a MISO-focused statistical model in Sect. 3.1, and its acquisition function and parallel computation in Sect. 3.2. Sect. 3.3 summarizes the algorithm and Sect. 3.4 provides a theoretical performance guarantee. Extensions of the algorithm are discussed in Sect. D.\n\n3.1 Statistical Model\n\nWe now describe a generative model for f that results in a Gaussian process prior on f with a parameterized class of mean functions μ : [M]0 × D → R and covariance kernels Σ : ([M]0 × D)^2 → R. This allows us to use standard tools for Gaussian process inference — first finding the MLE estimate of the parameters indexing this class, and then performing Gaussian process regression using the selected mean function and covariance kernel — while also providing better estimates for MISO than would a generic multi-output GP regression kernel that does not consider the MISO application.\nWe construct our generative model as follows.
For each ℓ > 0, suppose that a function δ_ℓ : D → R was drawn from a separate independent GP, δ_ℓ ∼ GP(μ_ℓ, Σ_ℓ), and let δ_0 be identically 0. In our generative model δ_ℓ will be the bias f(ℓ, x) − g(x) for IS ℓ. We additionally set μ_ℓ(x) = 0 to encode a lack of a strong belief on the direction of the bias. (If one had a strong belief that an IS is consistently biased in one direction, one may instead set μ_ℓ to a constant estimated using maximum a posteriori estimation.) Next, within our generative model, we suppose that g : D → R was drawn from its own independent GP, g ∼ GP(μ_0, Σ_0), for some given μ_0 and Σ_0, and suppose f(ℓ, x) = f(0, x) + δ_ℓ(x) for each ℓ. We assume that μ_0 and Σ_ℓ with ℓ ≥ 0 belong to one of the standard parameterized classes of mean functions and covariance kernels, e.g., constant μ_0 and Matérn Σ_ℓ.\nWith this construction, f is a GP: given any finite collection of points ℓ_i ∈ [M]0, x_i ∈ D with i = 1, . . . , I, (f(ℓ_i, x_i) : i = 1, . . . , I) is a sum of independent multivariate normal random vectors, and thus is itself multivariate normal. Moreover, we can compute the mean function and covariance kernel of f: for ℓ, m ∈ [M]0 and x, x' ∈ D,\n\nμ(ℓ, x) = E[f(ℓ, x)] = E[g(x)] + E[δ_ℓ(x)] = μ_0(x),\n\nΣ((ℓ, x), (m, x')) = Cov(g(x) + δ_ℓ(x), g(x') + δ_m(x')) = Σ_0(x, x') + 1_{ℓ,m} · Σ_ℓ(x, x'),\n\nwhere 1_{ℓ,m} denotes Kronecker’s delta (for ℓ = m = 0 the second term vanishes, since δ_0 ≡ 0), and where we have used the independence of δ_ℓ, δ_m, and g.
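To make this kernel concrete, here is a minimal sketch of the covariance Σ((ℓ, x), (m, x')) = Σ_0(x, x') + 1_{ℓ,m} · Σ_ℓ(x, x'). This is our own illustration, not the authors' implementation: the squared-exponential choice for Σ_0 and the Σ_ℓ, and all hyperparameter values, are placeholder assumptions.

```python
import numpy as np

def sq_exp(x, xp, length_scale, signal_var):
    """Squared-exponential kernel (placeholder choice for Sigma_0 and Sigma_ell)."""
    diff = np.asarray(x, float) - np.asarray(xp, float)
    return signal_var * np.exp(-0.5 * np.dot(diff, diff) / length_scale ** 2)

def miso_cov(l, x, m, xp, hyp):
    """Covariance of f(l, x) and f(m, xp):
    Sigma_0(x, xp) + (1 if l == m and l > 0 else 0) * Sigma_l(x, xp).
    The discrepancy term vanishes for IS 0 because delta_0 is identically 0.
    hyp maps each index to (length_scale, signal_var); index 0 is the kernel of g."""
    k = sq_exp(x, xp, *hyp[0])
    if l == m and l > 0:  # Kronecker delta; delta_0 = 0 contributes nothing
        k += sq_exp(x, xp, *hyp[l])
    return k
```

Under this construction, observations of a cheap IS ℓ are informative about g through the shared Σ_0 term, while the Σ_ℓ term inflates the covariance within IS ℓ by its model discrepancy.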
We refer the reader to https://github.com/misoKG/ for an illustration of the model.\n\nThis generative model draws the model discrepancies δ_ℓ independently across IS. This is appropriate when IS are different in kind and share no relationship except that they model a common objective. In the supplement (Sect. B) we generalize this generative model to model correlation between model discrepancies, which is appropriate when IS can be partitioned into groups, such that IS within one group tend to agree more amongst themselves than they do with IS in other groups. Sect. B also discusses the estimation of the hyperparameters in μ_0 and Σ_ℓ.\n\n3.2 Acquisition Function\n\nOur optimization algorithm proceeds in rounds, selecting a design x ∈ D and an information source ℓ ∈ [M]0 in each round. The value of the information obtained by sampling IS ℓ at x is the expected gain in the quality of the best design that can be selected using the available information. That is, this value is the difference in the expected quality of the estimated optimum before and after the sample. We then normalize this expected gain by the cost c_ℓ(x) associated with the respective query, and sample the IS and design with the largest normalized gain. Without normalization we would always query the true objective, since no other IS provides more information about g than g itself.\nWe formalize this idea. Suppose that we have already sampled n points X_n and made the observations Y_n. Denote by E_n the expected value according to the posterior distribution given X_n, Y_n, and let μ^(n)(ℓ, x) := E_n[f(ℓ, x)]. The best expected objective value across the designs, as estimated by our statistical model, is max_{x'∈D} μ^(n)(0, x').
Similarly, if we take an additional sample of IS ℓ^(n+1) at design x^(n+1) and compute our new posterior mean, the new best expected objective value across the designs is max_{x'∈D} μ^(n+1)(0, x'), whose distribution depends on which IS we sample and where we sample it. Thus, the expected value of sampling at (ℓ, x), normalized by the cost, is\n\nMKG_n(ℓ, x) = E_n[ ( max_{x'∈D} μ^(n+1)(0, x') − max_{x'∈D} μ^(n)(0, x') ) / c_ℓ(x) | ℓ^(n+1) = ℓ, x^(n+1) = x ],   (1)\n\nwhich we refer to as the misoKG factor of the pair (ℓ, x). The misoKG policy then samples at the pair (ℓ, x) that maximizes MKG_n(ℓ, x), i.e., (ℓ^(n+1), x^(n+1)) ∈ argmax_{ℓ∈[M]0, x∈D} MKG_n(ℓ, x), which is a nested optimization problem.\nTo make this nested optimization problem tractable, we first replace the search domain D in Eq. (1) by a discrete set A ⊂ D of points, for example selected by a Latin Hypercube design. We may then compute MKG_n(ℓ, x) exactly. Toward that end, note that\n\nE_n[ max_{x'∈A} μ^(n+1)(0, x') | ℓ^(n+1) = ℓ, x^(n+1) = x ] = E_n[ max_{x'∈A} { μ^(n)(0, x') + σ̄^n_{x'}(ℓ, x) · Z } | ℓ^(n+1) = ℓ, x^(n+1) = x ],   (2)\n\nwhere Z ∼ N(0, 1) and σ̄^n_{x'}(ℓ, x) = Σ_n((0, x'), (ℓ, x)) / [λ_ℓ(x) + Σ_n((ℓ, x), (ℓ, x))]^{1/2}. Here Σ_n is the posterior covariance matrix of f given X_n, Y_n; MKG_n(ℓ, x) then follows from Eq. (2) by subtracting the constant max_{x'∈A} μ^(n)(0, x') and dividing by c_ℓ(x).\nWe parallelize the computation of MKG_n(ℓ, x) for fixed (ℓ, x), enabling it to utilize multiple cores. Then (ℓ^(n+1), x^(n+1)) is obtained by computing MKG_n(ℓ, x) for all (ℓ, x) ∈ [M]0 × A, a task that can be parallelized over multiple machines in a cluster. We begin by sorting the points in A in parallel by increasing value of σ̄^n_{x'}(ℓ, x) (for fixed ℓ, x). Thereby we remove some points easily identified as dominated. A point x_j is dominated if max_i μ^(n)(0, x_i) + σ̄^n_{x_i}(ℓ, x) · Z is unchanged for all Z when the maximum is taken excluding x_j. Note that a point x_j is dominated by x_k if σ̄^n_{x_j}(ℓ, x) = σ̄^n_{x_k}(ℓ, x) and μ^(n)(0, x_j) ≤ μ^(n)(0, x_k), since x_k has a higher expected value than x_j for any realization of Z. Let S be the sorted sequence without such dominated points. We will remove more dominated points later.\nSince c_ℓ(x) is a constant for fixed (ℓ, x), we may express the conditional expectation in Eq. (1) as E_n[max_i {a_i + b_i Z} − max_i a_i] / c_ℓ(x), where a_i = μ^(n)(0, x_i) and b_i = σ̄^n_{x_i}(ℓ, x) for x_i ∈ S. We split S into consecutive sequences S_1, S_2, . . . , S_C, where C is the number of cores used for computing MKG_n(ℓ, x), and S_j, S_{j+1} overlap in one element: that is, for S_j = {x_{j1}, . . . , x_{jk}}, x_{(j−1)k} = x_{j1} and x_{jk} = x_{(j+1)1} hold. Each x_{ji} ∈ S_j specifies a linear function a_{ji} + b_{ji} Z (ordered by increasing slopes in S). We are interested in the realizations of Z for which a_{ji} + b_{ji} Z ≥ a_{i'} + b_{i'} Z for all i', and hence compute the intersections of these functions. The functions for x_{ji} and x_{j,i+1} intersect at d_{ji} = (a_{ji} − a_{j,i+1}) / (b_{j,i+1} − b_{ji}). Observe that if d_{ji} ≤ d_{j,i−1}, then a_{ji} + b_{ji} Z ≤ max{a_{j,i−1} + b_{j,i−1} Z, a_{j,i+1} + b_{j,i+1} Z} for all Z: x_{ji} is dominated and hence dropped from S_j. In this case we compute the intersection of the affine functions associated with x_{j,i−1} and x_{j,i+1} and iterate the process.\nPoints in S_j may be dominated by the rightmost (non-dominated) point in S_{j−1}. Thus, we compute the intersection of the rightmost point of S_{j−1} and the leftmost point of S_j, iteratively dropping all dominated points of S_j. If all points of S_j are dominated, we continue the scan with S_{j+1} and so on. Observe that we may stop this scan once there is a point that is not dominated, since the points in any sequence S_j have non-decreasing d-values. If some of the remaining points in S_j are dominated by a point in S_{j'} with j' < j − 1, then this will be determined when the scan initiated by S_{j'} reaches S_j. Subsequently, we check the other direction, i.e., whether x_{j1} dominates elements of S_{j−1}, starting with the rightmost element of S_{j−1}. These checks for dominance are performed in parallel for neighboring sequences.\nFrazier, Powell, and Dayanik [8] showed how to compute sequentially the expected maximum of a collection of affine functions. In particular, their Eq. (14) [8, p. 605] gives E_n[max_i {a_i + b_i Z} − max_i a_i] = Σ_{j=1}^{C} Σ_{h=1}^{k−1} (b_{j,h+1} − b_{j,h}) u(−|d_{j,h}|), where u is defined as u(z) = z Φ(z) + φ(z) for the CDF Φ and PDF φ of the standard normal distribution. We compute the inner sums simultaneously; the computation of the outer sum could be parallelized by recursively adding pairs of inner sums, although we do not do so to avoid communication overhead.
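As a sketch of the computation just described, the following single-core version builds the upper envelope of the affine functions a_i + b_i Z and applies the u(−|d|) formula. This is our own illustration (variable names follow the text; the Monte Carlo cross-check in the usage below is ours), not the authors' parallel implementation.

```python
import math

def u(z):
    """u(z) = z * Phi(z) + phi(z), with Phi/phi the standard normal CDF/PDF."""
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return z * Phi + phi

def expected_max_gain(a, b):
    """Return E[max_i (a_i + b_i Z)] - max_i a_i for Z ~ N(0, 1),
    via the upper envelope of the affine functions a_i + b_i Z."""
    lines = sorted(zip(b, a))  # ascending slope b_i
    # For equal slopes, only the largest intercept can matter.
    dedup = []
    for bi, ai in lines:
        if dedup and dedup[-1][0] == bi:
            dedup[-1] = (bi, max(dedup[-1][1], ai))
        else:
            dedup.append((bi, ai))
    keep, breaks = [], []  # envelope lines and their intersection points d
    for bi, ai in dedup:
        while True:
            if not keep:
                keep.append((bi, ai))
                break
            bk, ak = keep[-1]
            d = (ak - ai) / (bi - bk)  # intersection with current top line
            if breaks and d <= breaks[-1]:
                keep.pop()   # top line never attains the maximum: dominated
                breaks.pop()
            else:
                keep.append((bi, ai))
                breaks.append(d)
                break
    # Sum (b_{h+1} - b_h) * u(-|d_h|) over the surviving envelope lines.
    return sum((keep[i + 1][0] - keep[i][0]) * u(-abs(breaks[i]))
               for i in range(len(keep) - 1))
```

For example, `expected_max_gain([0.0, 0.0], [0.0, 1.0])` computes E[max(0, Z)] = φ(0), and the general output can be cross-checked against a Monte Carlo average of max_i (a_i + b_i Z).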
We summarize the parallel algorithm below.\n\nThe Parallel Algorithm to compute (ℓ^(n+1), x^(n+1)):\n\n1. Scatter the pairs (ℓ, x) ∈ [M]0 × A among the machines.\n2. Each machine computes MKG_n(ℓ, x) for its pairs. To compute MKG_n(ℓ, x) in parallel:\na. Sort the points in A by ascending σ̄^n_{x'}(ℓ, x) in parallel, thereby removing dominated points. Let S be the sorted sequence.\nb. Split S into sequences S_1, . . . , S_C, where C is the number of cores used to compute MKG_n(ℓ, x). Each core computes Σ_{x_i∈S_j} (b_{i+1} − b_i) u(−|d_i|) in parallel; then the partial sums are added to obtain E_n[max_i {a_i + b_i Z} − max_i a_i].\n3. Determine (ℓ^(n+1), x^(n+1)) ∈ argmax_{ℓ∈[M]0, x∈D} MKG_n(ℓ, x) in parallel.\n\n3.3 Summary of the misoKG Algorithm.\n\n1. Using samples from all information sources, estimate the hyperparameters of the Gaussian process prior as described in Sect. B.2. Then calculate the posterior on f based on the prior and samples.\n2. Until the budget for samples is exhausted:\nDetermine the information source ℓ ∈ [M]0 and the design x ∈ D that maximize the misoKG factor proposed in Eq. (1), and observe IS ℓ at x.\nUpdate the posterior distribution with the new observation.\n3. Return argmax_{x'∈A} μ^(n)(0, x').\n\n3.4 Provable Performance Guarantees.\n\nmisoKG chooses an IS ℓ and a design x such that the expected gain normalized by the query cost is maximized. Thus, misoKG is one-step Bayes optimal in this respect, by construction.\nWe establish an additive bound on the difference between misoKG’s solution and the unknown optimum, as the number of queries N → ∞. For this argument we suppose that μ(ℓ, x) = 0 for all ℓ, x, and that Σ_0 is either the squared exponential kernel or a four times differentiable Matérn kernel.
Moreover, let xOPT ∈ argmax_{x'∈D} f(0, x'), and d = max_{x'∈D} min_{x''∈A} dist(x', x'').\n\nTheorem 1. Let x*_N ∈ A be the point that misoKG recommends in iteration N. For each p ∈ [0, 1) there is a constant K_p such that with probability p,\n\nlim_{N→∞} f(0, x*_N) ≥ f(0, xOPT) − K_p · d.\n\nWe point out that Frazier, Powell, and Dayanik [8] showed in their seminal work an analogous result for the case of a single information source with uniform query cost (Theorem 4 in [8]). In Sect. A we prove the statement for the MISO setting, which allows multiple information sources that each have query costs c_ℓ(x) varying over the search domain D. This proof is simple and short. Also note that Theorem 1 establishes the consistency of misoKG for the special case that D is finite, since then d = 0. Interestingly, we can compute K_p given Σ and p. Therefore, we can control the additive error K_p · d by increasing the density of A, leveraging the scalability of our parallel algorithm.\n\n4 Numerical Experiments\n\nWe now compare misoKG to other state-of-the-art MISO algorithms. We implemented misoKG’s statistical model and acquisition function in Python 2.7 and C++, leveraging functionality from the Metrics Optimization Engine [23]. We used a gradient-based optimizer [28] that first finds an optimizer via multiple restarts for each IS ℓ separately and then picks (ℓ^(n+1), x^(n+1)) with maximum misoKG factor among these. An implementation of our method is available at https://github.com/misoKG/.\nWe compare to misoEI of Lam et al. [18] and to MTBO+, an improved version of Multi-Task Bayesian Optimization proposed by Swersky et al. [34]. Following a recommendation of Snoek (2016), our implementation of MTBO+ uses an improved formulation of the acquisition function given by Hernández-Lobato et al.
[12] and Snoek et al. [31], but otherwise is identical to MTBO; in particular, it uses the statistical model of [34]. Sect. E provides detailed descriptions of these algorithms.\n\nExperimental Setup. We conduct experiments on the following test problems: (1) the 2-dimensional Rosenbrock function, modified to fit the MISO setting by Lam et al. [18]; (2) a MISO benchmark proposed by Swersky et al. [34] in which we optimize the 4 hyperparameters of a machine learning algorithm, using a small, related set of smaller images as a cheap IS; (3) an assemble-to-order problem from Hong and Nelson [13] in which we optimize an 8-dimensional target stock vector to maximize the expected daily profit of a company as estimated by a simulator.\nIn MISO settings the amount of initial data that one can use to inform the methods about each information source is typically dictated by the application, in particular by resource constraints and the availability of the respective source. In our experiments all methods were given identical initial datasets for each information source in every replication; these sets were drawn randomly via Latin Hypercube designs. For the sake of simplicity, we provided the same number of points for each IS, set to 2.5 points per dimension of the design space D. Regarding the kernel and mean function, MTBO+ uses the settings provided in [31]. The other algorithms used the squared exponential kernel and a constant mean function set to the average of a random sample.\nWe report the “gain” over the best initial solution, that is, the true objective value of the respective design that a method would return at each iteration minus the best value in the initial data set. If the true objective value is not known for a given design, we report the value obtained from the information source of highest fidelity.
This gain is plotted as a function of the total cost, that is, the cumulative cost for invoking the information sources plus the fixed cost for the initial data; this metric naturally generalizes the number of function evaluations prevalent in Bayesian optimization. Note that the computational overhead of choosing the next information source and sample is omitted, as it is negligible compared to invoking an information source in real-world applications. Error bars are shown at the mean ± 2 standard errors, averaged over at least 100 runs of each algorithm. For deterministic sources a jitter of 10^{−6} is added to avoid numerical issues during matrix inversion.\n\n4.1 The Rosenbrock Benchmarks\n\nWe consider the design space D = [−2, 2]^2 and M = 2 information sources. IS 0 is the Rosenbrock function g(x) = (1 − x_1)^2 + 100 · (x_2 − x_1^2)^2 plus optional Gaussian noise u · ε. IS 1 returns g(x) + v · sin(10 · x_1 + 5 · x_2), where the additional oscillatory component serves as model discrepancy. We assume a cost of 1000 for each query to IS 0 and a cost of 1 for IS 1.\n\n[Figure 1: (l) The Rosenbrock benchmark with the parameter setting of [18]: misoKG offers an excellent gain-to-cost ratio and outperforms its competitors substantially. (r) The Rosenbrock benchmark with the alternative setup.]\n\nSince all methods converged to good solutions within a few queries, we investigate the ratio of gain to cost: Fig. 1 (l) displays the gain of each method over the best initial solution as a function of the total cost inflicted by querying information sources. The new method misoKG offers a significantly better gain per unit cost and finds an almost optimal solution typically within 5−10 samples. Interestingly, misoKG relies only on cheap samples, proving its ability to successfully handle uncertainty.
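The two Rosenbrock information sources above can be sketched as follows. This is a minimal illustration of the benchmark, not the authors' code: the function and variable names are ours, the cost constants follow the text, and the default noise scales u and v are taken from the second setup described below.

```python
import math
import random

COST = {0: 1000.0, 1: 1.0}  # query costs for IS 0 and IS 1 from the text

def rosenbrock(x1, x2):
    """The objective g underlying both sources."""
    return (1.0 - x1) ** 2 + 100.0 * (x2 - x1 ** 2) ** 2

def query_is(ell, x1, x2, u=0.0, v=2.0, rng=random):
    """Evaluate information source ell at (x1, x2).
    IS 0: g plus optional Gaussian noise of scale u (unbiased).
    IS 1: g plus an oscillatory model discrepancy of amplitude v (biased)."""
    if ell == 0:
        return rosenbrock(x1, x2) + u * rng.gauss(0.0, 1.0)
    return rosenbrock(x1, x2) + v * math.sin(10.0 * x1 + 5.0 * x2)
```

The cheap IS 1 is deterministic but biased; its discrepancy v · sin(10x1 + 5x2) is exactly the kind of smoothly varying bias the δ_ℓ term in the statistical model is meant to absorb.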
MTBO+, on the other hand, struggles initially but then eventually obtains a near-optimal solution, too. To this end, it usually makes one or two queries to the expensive truth source after about 40 steps. misoEI shows an odd behavior: it takes several queries, one of them to IS 0, before it improves over the best initial design for the first time. Then it jumps to a very good solution and subsequently samples only the cheap IS.\nFor the second setup, we set u = 1, v = 2, and suppose for IS 0 a uniform noise variance λ_0(x) = 1 and query cost c_0(x) = 50. Now the difference in costs is much smaller, while the variance is considerably bigger. The results are displayed in Fig. 1 (r): as for the first configuration, misoKG outperforms the other methods from the start. Interestingly, misoEI’s performance is drastically decreased compared to the first setup, since it only queries the expensive truth. Looking closer, we see that misoKG initially queries only the cheap information source IS 1 until it comes close to an optimal value after about five samples. It starts to query IS 0 occasionally later.\n\n4.2 The Image Classification Benchmark\n\nThis classification problem was introduced by Swersky et al. [34] to demonstrate that MTBO can reduce the cost of hyperparameter optimization by leveraging a small dataset as an information source. The goal is to optimize four hyperparameters of the logistic regression algorithm [36] using a stochastic gradient method with mini-batches (the learning rate, the L2-regularization parameter, the batch size, and the number of epochs) to minimize the classification error on the MNIST dataset [21]. This dataset contains 70,000 images of handwritten digits; each image has 784 pixels. IS 1 uses the USPS dataset [38] of about 9000 images with 256 pixels each. The query costs are 4.5 for IS 1 and 43.69 for IS 0.
A closer examination shows that IS 1 is subject to considerable bias with respect to IS 0, making it a challenge for MISO algorithms.\nFig. 2 (l) summarizes performance: initially, misoKG and MTBO+ are on par. Both clearly outperform misoEI, which therefore was stopped after 50 iterations. misoKG and MTBO+ continued for 150 steps (with a lower number of replications). misoKG usually achieves an optimal test error of about 7.1% on the MNIST test set after about 80 queries, matching the classification performance of the best setting reported by Swersky et al. [34]. Moreover, misoKG achieves better solutions than MTBO+ at the same costs. Note that the results in [34] show that MTBO+ will also converge to the optimum eventually.\n\n[Figure 2: (l) The performance on the image classification benchmark of [34]. misoKG achieves better test errors after about 80 steps and converges to the global optimum. (r) misoKG outperforms the other algorithms on the assemble-to-order benchmark that has significant model discrepancy.]\n\n4.3 The Assemble-To-Order Benchmark\n\nThe assemble-to-order (ATO) benchmark is a reinforcement learning problem from a business application where the goal is to optimize an 8-dimensional target level vector over [0, 20]^8 (see Sect. G for details). We set up three information sources: IS 0 and IS 2 use the discrete event simulator of Xie et al. [42], whereas the cheapest source IS 1 invokes the implementation of Hong and Nelson [13]. IS 0 models the truth.\nThe two simulators differ subtly in the model of the inventory system.
However, the effect on the estimated objective value is significant: on average, the outputs of the two simulators at the same target vector differ by about 5% of the score of the global optimum, which is about 120, and the largest bias observed over 1000 random samples was 31.8. Thus, we are witnessing a significant model discrepancy.

Fig. 2 (r) summarizes the performances. misoKG outperforms the other algorithms from the start: misoKG achieves an average gain of 26.1 while incurring an average query cost of only 54.6 across the information sources. This is only 6.3% of the query cost that misoEI requires to achieve a comparable score. Interestingly, misoKG and MTBO+ mostly utilize the cheap biased IS, and therefore obtain significantly better gain-to-cost ratios than misoEI. misoKG typically makes its first call to IS 2 after about 60-80 steps. In total, misoKG queries IS 2 about ten times within the first 150 steps; in some replications misoKG makes one late call to IS 0 when it has already converged. Our interpretation is that misoKG exploits the cheap, biased IS 1 to zoom in on the global optimum and then switches to the unbiased but noisy IS 2 to identify the optimal solution exactly. This is the expected (and desired) behavior for misoKG when the uncertainty of f(0, x*) is not expected to be reduced sufficiently by queries to IS 1. MTBO+ trades off gain versus cost differently: it queries IS 0 once or twice after 100 steps and directs all other queries to IS 1, which might explain its lower observed performance. misoEI, which employs a two-step heuristic for trading off predicted gain and query cost, almost always chose to evaluate the most expensive IS.

5 Conclusion

We have presented a novel algorithm for MISO that uses a mean function and covariance kernel motivated by a MISO-specific generative model.
We have proposed a novel acquisition function that extends the knowledge gradient to the MISO setting and comes with a fast parallel method for computing it. Moreover, we have provided a theoretical guarantee on the solution quality delivered by this algorithm, and demonstrated through numerical experiments that it improves significantly over the state of the art.

Acknowledgments

This work was partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16-1-0046.

References

[1] D. Allaire and K. Willcox. A mathematical and computational framework for multifidelity design and analysis with computer models. International Journal for Uncertainty Quantification, 4(1), 2014.

[2] M. A. Alvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

[3] E. V. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pages 153–160, 2007.

[4] J. Brynjarsdottir and A. O'Hagan. Learning about physical parameters: the importance of model discrepancy. Inverse Problems, 30(11), 2014.

[5] E. Çınlar. Probability and Stochastics, volume 261 of Graduate Texts in Mathematics. Springer, 2011.

[6] A. I. Forrester, A. Sóbester, and A. J. Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 463(2088):3251–3269, 2007.

[7] P. I. Frazier, W. B. Powell, and S. Dayanik. A knowledge-gradient policy for sequential information collection.
SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.

[8] P. I. Frazier, W. B. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.

[9] S. Ghosal and A. Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, 34(5):2413–2429, 2006.

[10] P. Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University Press, 1997.

[11] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. The Journal of Machine Learning Research, 13(1):1809–1837, 2012.

[12] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.

[13] L. J. Hong and B. L. Nelson. Discrete optimization via simulation using COMPASS. Operations Research, 54(1):115–129, 2006.

[14] D. Huang, T. Allen, W. Notz, and R. Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369–382, 2006.

[15] K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Poczos. Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, 2016. The code is available at https://github.com/kirthevasank/mf-gp-ucb. Last Accessed on 04/22/2017.

[16] M. C. Kennedy and A. O'Hagan. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1–13, 2000.

[17] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. CoRR, abs/1605.07079, 2016.

[18] R. Lam, D. Allaire, and K. Willcox.
Multifidelity optimization using statistical surrogate modeling for non-hierarchical information sources. In 56th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, 2015.

[19] L. Le Gratiet and C. Cannamela. Cokriging-based sequential design strategies using fast cross-validation techniques for multi-fidelity computer codes. Technometrics, 57(3):418–427, 2015.

[20] L. Le Gratiet and J. Garnier. Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. International Journal for Uncertainty Quantification, 4(5), 2014.

[21] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 2017. http://yann.lecun.com/exdb/mnist/. Last Accessed on 05/15/2017.

[22] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.

[23] MOE. Metrics optimization engine. http://yelp.github.io/MOE/, 2016. Last Accessed on 05/15/2017.

[24] V. Picheny, D. Ginsbourger, Y. Richet, and G. Caplin. Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2–13, 2013.

[25] M. Poloczek, J. Wang, and P. I. Frazier. Warm starting Bayesian optimization. In Winter Simulation Conference (WSC), pages 770–781. IEEE, 2016. Also available at https://arxiv.org/abs/1608.03585.

[26] H. Qu, I. O. Ryzhov, M. C. Fu, and Z. Ding. Sequential selection with unknown correlation structures. Operations Research, 63(4):931–948, 2015.

[27] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. ISBN 0-262-18253-X.

[28] W. R. Scott, P. I. Frazier, and W. B. Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.

[29] A. Shah and Z. Ghahramani.
Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3330–3338, 2015.

[30] J. Snoek. Personal communication, 2016.

[31] J. Snoek et al. Spearmint. http://github.com/HIPS/Spearmint, 2017. Last Accessed on 05/15/2017.

[32] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[33] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[34] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.

[35] Y.-W. Teh, M. Seeger, and M. Jordan. Semiparametric latent factor models. In Artificial Intelligence and Statistics 10, 2005.

[36] Theano. Theano: Logistic regression, 2017. http://deeplearning.net/tutorial/code/logistic_sgd.py. Last Accessed on 05/16/2017.

[37] S. Toscano-Palmerin and P. I. Frazier. Stratified Bayesian optimization. In Proceedings of the 12th International Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, 2016. Accepted for publication. Also available at https://arxiv.org/abs/1602.02338.

[38] USPS. USPS dataset, 2017. http://mldata.org/repository/data/viewslug/usps/. Last Accessed on 05/16/2017.

[39] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509–534, 2009.

[40] R. L. Winkler. Combining probability distributions from dependent information sources. Management Science, 27(4):479–488, 1981.

[41] J. Wu, M. Poloczek, A. G. Wilson, and P. I. Frazier. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, 2017. Accepted for publication. Also available at https://arxiv.org/abs/1703.04389.

[42] J. Xie, P. I. Frazier, and S. Chick. Assemble to order simulator. http://simopt.org/wiki/index.php?title=Assemble_to_Order&oldid=447, 2012. Last Accessed on 05/16/2017.