{"title": "Offline Contextual Bayesian Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 4627, "page_last": 4638, "abstract": "In black-box optimization, an agent repeatedly chooses a configuration to test, so as to find an optimal configuration.\r\nIn many practical problems of interest, one would like to optimize several systems, or ``tasks'', simultaneously; however, in most of these scenarios the current task is determined by nature. In this work, we explore the ``offline'' case in which one is able to bypass nature and choose the next task to evaluate (e.g. via a simulator). Because some tasks may be easier to optimize and others may be more critical, it is crucial to leverage algorithms that not only consider which configurations to try next, but also which tasks to make evaluations for. In this work, we describe a theoretically grounded Bayesian optimization method to tackle this problem. We also demonstrate that if the model of the reward structure does a poor job of capturing variation in difficulty between tasks, then algorithms that actively pick tasks for evaluation may end up doing more harm than good. Following this, we show how our approach can be used for real world applications in science and engineering, including optimizing tokamak controls for nuclear fusion.", "full_text": "Of\ufb02ine Contextual Bayesian Optimization\n\nIan Char1, Youngseog Chung1, Willie Neiswanger1, Kirthevasan Kandasamy2, Andrew\n\nOakleigh Nelson3, Mark D Boyer3, Egemen Kolemen3, and Jeff Schneider1\n\n1Department of Machine Learning, Carnegie Mellon University\n\n{ichar, youngsec, willie, schneide}@cs.cmu.edu\n\n2Department of EECS, University of California Berkeley\n\nkandasamy@eecs.berkeley.edu\n\n3Princeton Plasma Physics Laboratory\n{anelson, mboyer, ekolemen}@pppl.gov\n\nAbstract\n\nIn black-box optimization, an agent repeatedly chooses a con\ufb01guration to test, so\nas to \ufb01nd an optimal con\ufb01guration. 
In many practical problems of interest, one\nwould like to optimize several systems, or \u201ctasks\u201d, simultaneously; however, in\nmost of these scenarios the current task is determined by nature. In this work, we\nexplore the \u201cof\ufb02ine\u201d case in which one is able to bypass nature and choose the\nnext task to evaluate (e.g. via a simulator). Because some tasks may be easier\nto optimize and others may be more critical, it is crucial to leverage algorithms\nthat not only consider which con\ufb01gurations to try next, but also which tasks to\nmake evaluations for. In this work, we describe a theoretically grounded Bayesian\noptimization method to tackle this problem. We also demonstrate that if the model\nof the reward structure does a poor job of capturing variation in dif\ufb01culty between\ntasks, then algorithms that actively pick tasks for evaluation may end up doing\nmore harm than good. Following this, we show how our approach can be used for\nreal world applications in science and engineering, including optimizing tokamak\ncontrols for nuclear fusion.\n\n1\n\nIntroduction\n\nBlack-box optimization is the problem in which one tries to \ufb01nd the maximum of an unknown function\nsolely using evaluations for speci\ufb01ed inputs. In many interesting scenarios, there is a collection of\nunknown, possibly correlated functions (or tasks) that need to be simultaneously optimized. This\nproblem set up often occurs in applications where one wants to design an agent that makes an action\nbased on some contextual information from the environment. However, we would prefer that the agent\nnot run potentially costly or poor performing experimental actions online. Also, because the agent\nmay have to make these decisions at a rapid pace, we often do not have time to compute an expensive\nexperimentation policy. 
We consider applications that provide the ability to run offline experiments where nature can be bypassed and the contextual information can be manually set (e.g. on a surrogate system or on a simulation). These experiments are used to discover a good action policy which is then encoded into a fast cache, such as a look-up table. Even though the experiments are done offline, they are still expensive and we must search the design space efficiently. The following are examples of this problem:

• Nuclear Fusion A tokamak is a device used to magnetically confine plasma and is the most commonly pursued means of generating power from controlled nuclear fusion. A current obstacle in realizing sustained nuclear fusion is the difficulty in maintaining the plasma's stability at the required temperatures and pressures for a prolonged period of time. We consider the stability of the plasma as an output to optimize, where the input is the controls for the tokamak. The optimal action depends on the current state of the plasma, so each plasma state can be regarded as its own task to optimize. We cannot search for a good control policy during live experiments because of cost, limited time available on the device, and the need to provide a real-time controller that operates in a millisecond-scale control loop. However, we do have a simulation (forward model) that may be used with Bayesian optimization offline to discover a good controller. Importantly, the simulator allows one to manually set the current state of the plasma, and thus prudently selecting states to optimize over becomes an important part of the problem.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

• Database Tuning Consider the problem of tuning the configuration of a database so as to minimize the latency, the CPU/memory footprint, or any other desired criteria.
The\nperformance of a con\ufb01guration depends critically on the underlying hardware and the\nworkload [Van Aken et al., 2017]. Since these variables can change when databases are\ndeployed in production, we need to simultaneously optimize for these different tasks.\n\nIn each of the above settings, dif\ufb01culty of the tasks may vary drastically. For example, in the nuclear\nfusion application, if the current state of the plasma is already stable, the stability may be less sensitive\nto controls, leading to an easy optimization landscape. On the other hand, when the plasma is in\nan unstable state, it may be that only a small set of controls will lead to improved stabilization and\n\ufb01nding them may require many more experiments.\nIn this paper, we propose a Thompson sampling approach for adaptively picking the next task and input\nfor evaluation. Unlike other Bayesian Optimization (BO) algorithms, evaluations are picked in order\nto ef\ufb01ciently estimate the optimal action for each task, where these optimal actions are most likely\ndistinct. This algorithm comes with theoretic guarantees, and we show that it often enjoys a signi\ufb01cant\nboost in performance when compared to uniformly distributing resources across tasks and other state\nof the art methods. Another contribution of this paper is showing the signi\ufb01cance of model choice in\nthis setting. We argue that when using a single Gaussian process (GP) to jointly model correlated\ntasks, the choice of kernel is crucial for estimating the dif\ufb01culty of each task. We believe that model\nselection here is even more important than in single-task BO because incorrect estimates can lead\nto poorly managed resource allocation for tasks. We give an example where inaccurately modeling\nreward structure between tasks via a stationary kernel severely hurts our algorithm. 
Following this,\nwe suggest a kernel with a lengthscale that varies with tasks and show that this more intelligent kernel\nagain allows our algorithm to enjoy a performance boost. An implementation of our algorithm and\nsynthetic experiments can be found at https://github.com/fusion-ml/OCBO.\nWe end this paper by showing an application of our method to the nuclear fusion problem. In particular,\nwe optimize tokamak controls for a set of different plasma states using a tokamak simulator. We\nobserve that our method is able to identify where best to devote resources, leading to ef\ufb01cient\noptimization.\n\n2 Related Work\n\nOur algorithm falls under the general umbrella of Bayesian optimization [Shahriari et al., 2015,\nFrazier, 2018]. As is common in BO, we use a GP prior to guide us in selecting next evaluations to\nmake. Previously, in the context of active learning and active sensing, techniques have been made that\nuse GPs to select the most informative points for evaluation [Pasolli and Melgani, 2011, Seo et al.,\n2000, Guestrin et al., 2005]. In contrast, our goal is optimization which is more in line with bandit\nmethods. Under the bandits setting, Srinivas et al. [2009] use an upper con\ufb01dence bound approach\nwith GPs and show that such a strategy results in sublinear cumulative regret. As an alternative to the\nupper con\ufb01dence bound approach, Russo and Van Roy [2014] show that one can achieve sublinear\ncumulative regret using a posterior sampling (or Thompson sampling) approach. The method we\npresent here is also a posterior sampling method, and it falls into the general framework of myopic\nposterior sampling described by Kandasamy et al. [2019a].\nOur setting is related to online contextual bandits [Krause and Ong, 2011, Agrawal and Goyal, 2013,\nAuer, 2002], where each task can be viewed as a different context. In these earlier works, the agent\nchooses an action online for a context that is chosen by the environment. 
In our setting, we wish to find the optimal action offline in advance and can choose the contexts we invest our experimentation effort on. The models in the works of Krause and Ong [2011] and Swersky et al. [2013] are of particular interest. Both works use a GP to jointly model correlated contexts and propose a similar structure for the joint GP's kernel. We adopt a similar strategy; however, our model has the advantage that lengthscales can vary between contexts.
A similar contextual optimization problem shows up in reinforcement learning (RL). While the common RL setup has contexts delivered solely by the environment, there is some work on actively choosing contexts [Fabisch and Metzen, 2014, Fabisch et al., 2015]. This work proposes methods for approximating the expected improvement (EI) in the overall objective. Similarly, the objective can be written in terms of entropy and experiments may be chosen in terms of their expected improvement [Metzen, 2015, Swersky et al., 2013]. In our empirical study, we compare to expected improvement for task and action selection.
Unlike many other problems under the BO setting, our algorithm searches for an optimal action for each task rather than a single optimal action. This serves as a contrasting feature from other problems in multi-task BO [Swersky et al., 2013, Toscano-Palmerin and Frazier, 2018], in which a single action that performs optimally across all objectives simultaneously is sought. The works most similar to ours present algorithms based around EI or knowledge gradients [Frazier et al., 2009]. In particular, Ginsbourger et al. [2014] and Pearce and Branke [2018] consider the same problem setting, but focus on the case where the set of tasks is continuous. Although our algorithm can be adapted to this case, we focus on the finite task setting and show that our posterior sampling approach provides a theoretically-grounded, competitive alternative.
We also note that previous works have used RBF kernels for their synthetic experiments, and while this adequately models the reward landscape for their relatively smooth functions, we claim that when there is a large variation in task difficulty this may cause these algorithms to do more harm than good.

3 Thompson Sampling for Multi-Task Optimization

3.1 Preliminaries
For the following, let X be the collection of tasks and let A be the compact set of possible actions. Throughout this work, we assume that the same set of actions is available for each task. Let f : X × A → R be the bounded reward function, where f(x, a) is the reward for performing action a in task x. Let ĥ : X → A be our estimated mapping from task to action. Our goal is then to find such an ĥ which maximizes the following objective:

Σ_{x∈X} f(x, ĥ(x)) ω(x)    (1)

where ω(x) ≥ 0 is some weighting on x that may depend on the probability of seeing x at evaluation time or the importance of x. We usually assume that X is finite; however, we also consider the case when X is continuous in Appendix D, in which case the sum in (1) becomes an integral. At round t of optimization, we pick a task x_t and an action a_t to perform a query (x_t, a_t) and observe a noisy estimate of the function y_t = f(x_t, a_t) + ε_t, where the ε_t ∼ N(0, σ_ε²) are iid. Let D_t be the sequence of queried tasks, actions, and rewards up to time t, i.e. D_t = {(x_1, a_1, y_1), ..., (x_t, a_t, y_t)}. Additionally, define ŷ_t(x) to be the best reward observed for task x up to time t, â_t(x) to be the action made to see this corresponding reward, and A_t(x) to be the set of all actions made for task x up to time t.
In this work, we assume that f is drawn from a Gaussian process (GP) prior.
A GP is characterized by its mean function µ(·) and kernel (or covariance) function σ(·, ·). Then for any finite set of variables z_1, ..., z_n ∈ X × A, [f(z_1), ..., f(z_n)]^T ∼ N(m, Σ), where m ∈ R^n, Σ ∈ R^{n×n}, m_i = µ(z_i), and Σ_{i,j} = σ(z_i, z_j). It is important to note that by selecting different kernel functions we make implicit assumptions about the smoothness of f. A valuable property of the GP is that its posterior is simple to compute. We denote by µ_t and σ_t the posterior mean and posterior kernel functions after seeing t evaluations. For more information about GPs see Rasmussen and Williams [2005].

3.2 Multi-Task Thompson Sampling

We now describe our proposed algorithm, called Multi-Task Thompson Sampling (MTS), which is presented in Algorithm 1 for the case in which X is a finite set of correlated tasks. The algorithm is an extension of Thompson sampling [Thompson, 1933] to the multi-task setting. Simply put, MTS acts optimally with respect to samples drawn from the posterior. That is, at every round a sample for the reward function is drawn, and this sample is used as if it were ground truth to identify the task in which the most improvement can be made. After doing this for T iterations, we return the estimated mapping ĥ such that ĥ(x) = â_T(x) if an evaluation was made for task x; otherwise, ĥ(x) maps to an a ∈ A drawn uniformly at random.
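As a concrete illustration (ours, not the authors' implementation), one round of this selection rule on a finite task-action grid might look as follows; `mts_round` and its arguments are hypothetical names, and the GP posterior is summarized by a mean vector and covariance matrix over the flattened grid:

```python
import numpy as np

def mts_round(mu, cov, queried, weights, rng):
    """One round of Multi-Task Thompson Sampling on a finite grid.

    mu:      posterior mean, shape (n_tasks, n_actions)
    cov:     posterior covariance over the flattened (task, action) grid
    queried: dict mapping task index -> set of action indices tried so far, A_t(x)
    weights: omega(x) >= 0 for each task
    """
    n_tasks, n_actions = mu.shape
    # Draw one joint posterior sample f~ over all (task, action) pairs.
    f_sample = rng.multivariate_normal(mu.ravel(), cov).reshape(n_tasks, n_actions)
    # Pick the task maximizing the weighted potential improvement under the sample:
    # omega(x) * [max_{a in A} f~(x, a) - max_{a in A_t(x)} f~(x, a)].
    best_possible = f_sample.max(axis=1)
    best_seen = np.array([
        f_sample[x, sorted(queried[x])].max() if queried[x] else -np.inf
        for x in range(n_tasks)
    ])
    improvements = weights * (best_possible - best_seen)
    x_t = int(np.argmax(improvements))
    # Within that task, play the sample's best action.
    a_t = int(np.argmax(f_sample[x_t]))
    return x_t, a_t
```

A task with no evaluations yet gets infinite improvement under positive weights, so it is queried first, loosely mirroring the round-robin initialization of Algorithm 1.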
Note that when tasks are assumed to be independent, Algorithm 1 can be modified by instead using a separate GP prior for each task and drawing samples from each at every iteration.

Algorithm 1 Multi-Task Thompson Sampling (MTS)

Input: capital T, initial capital t_init, mean function µ, kernel function σ.
Do random search on tasks in round-robin fashion until t_init evaluations are expended.
for t = t_init + 1 to T do
    Draw f̃ ∼ GP(µ, σ) | D_{t−1}.
    Set x_t = argmax_{x∈X} ω(x) [max_{a∈A} f̃(x, a) − max_{a∈A_t(x)} f̃(x, a)].
    Set a_t = argmax_{a∈A} f̃(x_t, a).
    Observe y_t = f(x_t, a_t).
    Update D_t = D_{t−1} ∪ {(x_t, a_t, y_t)}.
end for
Output: ĥ

One benefit of this algorithm is that it comes with theoretical guarantees. For the following, define a*_t(x) to be the past action played for task x that yields the largest expected reward. That is,

a*_t(x) := argmax_{a∈A_t(x)} f(x, a)  if A_t(x) ≠ ∅,  and  a*_t(x) := argmin_{a∈A} f(x, a)  otherwise.

Note that a*_t(x) has an implicit dependence on f.
Theorem 1. Define the maximum information gain to be γ_T := max_{D_T} I(D_T; f), where I(·; ·) is the Shannon mutual information. Assume that X and A are finite. Then if Algorithm 1 is played for T rounds with t_init = 0,

E[R_{T,f}] ≤ |X| (1/T + √(|X||A| γ_T / (2T)))

where the expectation is with respect to the data sequence collected and f, and where R_{T,f} is defined to be

R_{T,f} := [Σ_{x∈X} ω(x) (max_{a∈A} f(x, a) − f(x, a*_T(x)))] / [Σ_{x∈X} ω(x) (max_{a∈A} f(x, a) − min_{a∈A} f(x, a))]

when the denominator is not 0.
Otherwise, R_{T,f} takes the value 0.

The proof of this theorem (see Appendix A) uses ideas from Kandasamy et al. [2019a]. This result gives a bound on the expected normalized total simple regret. Here, simple regret is the difference between the best reward and the best reward for a played action (i.e. max_{a∈A} f(x, a) − f(x, a*_T(x))), and total simple regret refers to the simple regret summed across all tasks. The √(|X||A|) factor in the theorem accounts for the number of actions that can be taken at every step, and the γ_T factor characterizes the complexity of the prior over the tasks. We suspect that our proof technique may have led to a somewhat loose bound because there is an extra dependence on |X|; that being said, we still get that the rate of decrease is dominated by √(γ_T / T), which is the same as the single-task regret rate [Russo and Van Roy, 2014].
An important implication of this result is that there is no task in which we will have especially bad results, and when γ_T = o(T), the normalized simple regret converges to 0 in expectation for every task. We note that Srinivas et al. [2009] give bounds on the maximum information gain for a single GP in a few standard cases. For example, when dealing with a GP over a d-dimensional compact set using an RBF kernel, γ_T^(RBF) = O(log(T)^{d+1}). Finally, we note that these types of results can usually be generalized to infinite action spaces via known techniques [Russo and Van Roy, 2016, Bubeck et al., 2011].
Continuous task setting. Often one is confronted by a set of tasks that are correlated and continuous. The problem of finding a policy in this setting is inherently different because the tasks seen offline will not be the exact same as the tasks encountered when the policy is deployed.
Nevertheless, MTS can be adapted to this setting by leveraging the posterior mean instead of A_t(x) (details are in Appendix D). Even though greedily picking evaluations to increase improvement within a single task is likely not optimal here, we found that our algorithm performs competitively with other more expensive state-of-the-art methods, especially in higher dimensional settings.

3.3 Synthetic Experiments

Figure 1: Synthetic experiments for MTS. Each of the experiments was averaged over 10 trials; plots show the mean value and standard error. The plots for independent tasks in the top row are as follows: (a) total simple regret when the tasks are Branin-Hoo and four other identical 2D paraboloids, (b) the corresponding proportion of capital spent on each task in (a), (c) total simple regret for 30 random 4D functions, (d) total simple regret for 30 random 6D functions. The second row shows total simple regret for Branin 1-1 (e), Hartmann 2-2 (f), Hartmann 3-1 (g), and Hartmann 4-2 (h). In many of the cases, we must estimate the true optimal value and therefore cannot plot the true regret (see Appendix F).

Independent Tasks. For this setting we compare MTS against a suite of baselines which distribute resources evenly amongst tasks. The first of these selects task-action pairs uniformly at random. Additionally, we compare against the procedure of selecting a task uniformly at random and applying standard Thompson sampling (TS) or expected improvement (EI) at every iteration. This is essentially equivalent to iteratively optimizing for each task using standard BO methods, using the same amount of computation for each task. Additionally, we compare against the procedure described by Swersky et al. [2013] in Section 3.2 of that work. This algorithm, which we call Multi-task Expected Improvement (MEI), picks the task with the greatest expected improvement at every iteration.
However, we do not impute missing data across tasks using the posterior mean (in the independent case it is impossible to do so). Although the setting this algorithm was designed for is slightly different (it is assumed there is one optimal action over all tasks), the approach is still applicable to our setting. The following experiments are averaged over 10 trials. We start by evaluating each task with 5 points drawn uniformly at random. Each task is modeled by a GP with an RBF kernel, and hyperparameters are tuned for a GP every time an observation is seen for its corresponding task. For two-dimensional functions, hyperparameters are tuned according to marginal likelihood, but for greater dimensions, tuning is done using a blend of marginal likelihood and posterior sampling. This method was found to be more robust by Kandasamy et al. [2019b]. Here, and throughout this section, we leverage the Dragonfly library for our experiments [Kandasamy et al., 2019b]. Lastly, in every experiment we let ω(x) = 1 for all x ∈ X and give noiseless feedback to the algorithms.
For the first synthetic problem, we wish to optimize over 5 functions: four of which are concave paraboloids (with a range of [0, 1]) and the other being the Branin-Hoo function [Branin, 1972]. Not only does the Branin-Hoo function have a greater scale, but it is also much more complex. Thus, one might imagine that virtually all resources should be invested in optimizing this function, which is the behavior displayed by MEI.
However, we see that MTS performs best by distributing resources more liberally amongst tasks (see Figure 1 (a) and (b)). We also test these methods on 30 randomly generated functions in four and six dimensions (see Appendix B for details), and we found MTS to be the strongest performer.
Correlated Finite Tasks. To evaluate our method in the correlated finite task setting, we take multi-dimensional functions, treat the first few dimensions as task space, and select equispaced tasks in this space to focus on. In particular, we use the Branin-Hoo, Hartmann 4, and Hartmann 6 [Picheny et al., 2013] functions to create Branin 1-1, Hartmann 2-2, Hartmann 3-1, and Hartmann 4-2, where the first number is the task dimension and the second is the action dimension. We consider 10, 9, 8, and 16 tasks for each of these functions, respectively. The setup is identical to before except that a single GP is used to jointly model tasks, and the GP is tuned by maximizing the marginal likelihood for all experiments. In addition to the previous baselines, we also compare against the REVI algorithm introduced by Pearce and Branke [2018]. This algorithm, based on knowledge gradients [Frazier et al., 2009], picks task-action pairs for evaluation by estimating which will increase the GP mean the most across all tasks. That is, it myopically tries to optimize (1) at each round by using the GP mean as a proxy. In the risk-averse setting in which the policy returned maps each task to the best action seen throughout training (i.e. the setting we have considered throughout this paper), the authors recommend running EI in a round-robin fashion at the end of training. As such, we end the optimization with one round of EI for REVI.
The results are shown in the second row of Figure 1. We also compare in the risk-neutral setting (i.e. when the policy is derived from the posterior mean) in Appendix C. For the majority of the cases, MTS and MEI are the best performers.
The exception to this is the experiment done on the Hartmann\n4-2 function. Here, MEI does signi\ufb01cantly worse than standard EI and TS methods, while MTS has\nabout the same performance. We found that MEI focuses almost all of its capital on just three tasks,\nwhich most likely causes the poor performance. In all cases, MTS and MEI outperform REVI, even\nwhen a round of EI is performed at the end of execution. We believe that REVI does not perform as\nwell in these experiments since the tasks considered are spread out in task space, and REVI focuses\nless on tasks at the boundary of the space (see Appendix E for visualizations). Indeed, if we consider\ncontinuous correlated tasks instead (see Appendix D), REVI becomes a strong performer. With that\nbeing said, we argue that the formulation of these experiments is natural for real life applications, and\nthe set up for our fusion experiments in Section 5 is similar to this.\n\n4 Modeling Variation in Dif\ufb01culty\n\nThe selection of hyperparameters for the kernel function of a GP is often key to whether the landscape\ncan be modeled well. Usually these hyperparameters include lengthscale, which determines how\ncorrelated points are based on their distance to each other, and scale, which determines the magnitude\nof correlation. Intuitively, these values provide some indication of the optimization landscape\u2019s\ndif\ufb01culty. For example, larger lengthscales imply more smooth functions, which are often easier to\noptimize for. From a more theoretical standpoint, the hyperparameters have a direct effect on the\nmaximum information gain and therefore impact regret bounds shown by Theorem 1 and Russo and\nVan Roy [2014].\nIntuitively, hyperparameters should vary between tasks in order to adequately model any difference in\ndif\ufb01culty between them. One method for achieving this when jointly modelling tasks is via a locally\nstationary kernel, i.e. 
hyperparameters vary with respect to tasks but not with actions. Although there may be many ways to achieve this, a straightforward approach is to use the Gibbs kernel [Gibbs, 1998]. The Gibbs kernel is a non-stationary variant of the RBF kernel that allows the lengthscale and scale to vary over the space. Where z, z′ ∈ X × A, P_X = dim(X) and P_A = dim(A),

σ(z, z′) = ∏_{p=1}^{P_X + P_A} √( 2 ℓ_p(z) ℓ_p(z′) / (ℓ_p(z)² + ℓ_p(z′)²) ) exp( −(z_p − z′_p)² / (ℓ_p(z)² + ℓ_p(z′)²) )    (2)

Here, ℓ_p is the non-negative lengthscale function that characterizes the hyperparameters for the pth dimension. The above can be separated into the product of a kernel over the task space and a kernel over the action space, i.e. σ(z, z′) = σ_X(x, x′) σ_A(z, z′) where z = (x, a) and z′ = (x′, a′). To suit our needs, we make all lengthscale functions for σ_X constant functions so that ℓ_i(z) = ℓ_i where ℓ_i ∈ R and ℓ_i > 0 for all i = 1, ..., P_X. As for σ_A, we limit the lengthscale functions to only depend on the task component of z. Altogether,

σ(z, z′) = σ_X(x, x′) σ_A(z, z′)    (3)
         = ( ∏_{i=1}^{P_X} exp( −(x_i − x′_i)² / (2ℓ_i²) ) ) ( ∏_{j=1}^{P_A} √( 2 ℓ_j(x) ℓ_j(x′) / (ℓ_j(x)² + ℓ_j(x′)²) ) exp( −(a_j − a′_j)² / (ℓ_j(x)² + ℓ_j(x′)²) ) )    (4)

Note that with this modification σ_X reduces to the RBF kernel. Furthermore, for any fixed task x ∈ X (i.e.
we only consider z = (x, a), z′ = (x, a′)) the entire kernel reduces to the RBF kernel. As such, we are left with a locally stationary kernel, where the hyperparameters only vary as the task varies. In the following section, we leverage this model with our posterior sampling methods.
Synthetic Example. For the correlated task experiments in Section 3.3, tasks are generally quite similar, so MTS and MEI can do well when the GP uses an RBF kernel. However, we now wish to optimize 10 correlated tasks of variable difficulty. To create the tasks, we take slices from the function visualized in Figure 2 (see Appendix B for details). Like many real-world tasks, this function has areas that make for an interesting optimization problem and others that are quite boring. In order to optimize well, we use the kernel presented in (3), where the lengthscale function of each action dimension is the softplus of a quadratic polynomial, and the coefficients of each polynomial are treated as hyperparameters. We form a hierarchical probabilistic model by placing Normal priors over each hyperparameter. Then, for every iteration of our algorithm, we make decisions according to a posterior sample drawn from this hierarchical model. For our implementation of these models, we use probabilistic programming and BO frameworks [Carpenter et al., 2017, Neiswanger et al., 2019].
In practice, this does a superior job of modeling each task. To show this, each of the ten tasks was evaluated at five points. Then, both our suggested model and a stationary model using an RBF kernel were fit to the data. The difference becomes especially clear when looking at tasks that are relatively flat functions, since the stationary GP falsely estimates large peaks. This can be especially damaging in our case, where we select tasks based on possible performance improvements.

Figure 2: Tasks of varying difficulty.
(a): Average best seen rewards summed across all tasks of varying difficulty. Each curve is averaged over 12 trials, and the shaded region shows the standard error. (b): Surface of the function used to generate correlated tasks. (c) shows our proposed model on an easy task, and (d) shows a stationary model on the same easy task. Here, the red line shows the true function, the black line shows the posterior mean, the blue points show evaluations made for the corresponding task, and the shaded area shows high confidence regions.

We run optimization using MTS and standard Thompson sampling where tasks are picked uniformly at random. Moreover, we run these algorithms using the model described above, using a single GP that jointly models tasks with an RBF kernel, and using several GPs, each corresponding to a task and using an RBF kernel (i.e. assuming that tasks have no correlation). In all cases, we use posterior sampling to select hyperparameters. For simplicity, we attach prefixes to these methods, where "I" stands for independent GPs, "S" stands for stationary GP, and "NS" stands for non-stationary GP. The results in Figure 2 show that one can be negatively affected by picking tasks given an ill-suited model. Although S-MTS ultimately ends up performing well, it initially struggles when compared to S-TS and I-MTS. Thus, depending on resources available, it may be better to either forego shared information to better model the function or distribute resources to tasks uniformly. That being said, disregarding both shared information and picking tasks intelligently, as in I-TS, results in the worst performance (not pictured here).
Notice that when tasks can be modeled appropriately, distributing resources according to our algorithm is again beneficial, as shown by NS-MTS.

5 Application to Nuclear Fusion

Nuclear fusion is regarded as the energy of the future, since it presents the possibility of unlimited clean energy. The most widespread method of realizing fusion reactions requires heating isotopes of hydrogen to temperatures of hundreds of millions of degrees in a magnetic confinement device called a tokamak. In this state, the nuclei of two nearby atoms may overcome the electrostatic repulsion between them and fuse into a single nucleus, releasing energy. One obstacle to utilizing fusion as a feasible energy source, however, is the stability of the reaction. Once the plasma has reached a reaction state, it is uncertain how the tokamak controls should be modified to address the varying state of the plasma in order to sustain the fusion reaction. We tackle this problem by attempting to learn optimal controls offline via a simulator. In particular, we apply our algorithm to determine a mapping from plasma state to tokamak neutral-beam controls.
Experiment Setup. We consider a collection of 7 tasks that represent different plasma states. An evaluation of an action on a task corresponds to setting the tokamak beam controls and conducting a simulation on the selected state of the plasma. These simulations are run in the predictive mode of TRANSP, a tokamak simulation code. Both the action space and the task space are two-dimensional, and the reward is a weighted sum of plasma stability and fusion reaction efficiency.
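To make the comparisons that follow concrete, here is a simplified sketch of the kind of task/action selection MTS performs at each round: given a single function drawn from the model posterior, it queries the task whose sample suggests the most room for improvement. The helper `mts_select` and its signature are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mts_select(sample_f, tasks, actions, best_seen):
    """One round of multi-task Thompson-sampling style selection (sketch).

    sample_f:  one function drawn from the model posterior, mapping
               (task, action) -> sampled reward.
    best_seen: dict mapping each task to its best observed reward so far.

    Chooses the task whose posterior sample suggests the largest improvement
    over that task's best observed reward, along with the action achieving it.
    Tasks with no observations yet get infinite "gain" and are tried first.
    """
    best_task, best_action, best_gain = None, None, -np.inf
    for t in tasks:
        vals = np.array([sample_f(t, a) for a in actions])
        a_star = actions[int(np.argmax(vals))]
        gain = vals.max() - best_seen.get(t, -np.inf)
        if gain > best_gain:
            best_task, best_action, best_gain = t, a_star, gain
    return best_task, best_action
```

By contrast, a standard Thompson sampling baseline would pick the task uniformly at random and use the posterior sample only to choose the action.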
Appendix G.1 provides more details of this experiment.
We compare the performance of 4 algorithms: MTS and standard Thompson sampling with a joint GP model across both states (tasks) and actions (denoted J-MTS and J-TS, respectively), and MTS and standard Thompson sampling with independent GP models across actions for each state (denoted I-MTS and I-TS). Because we had no reason to believe that there would be a drastic difference in difficulty between tasks, we did not use a non-stationary kernel for these initial experiments. Moreover, the experimental settings were identical to the two-dimensional synthetic experiments in Section 3.3, except for 2 differences: 5 trials of each algorithm were run, with each trial consisting of 200 evaluations, and in each trial we allow up to 10 evaluations to be run in parallel. We rely on parallel optimization here since each query has high simulation overhead (> 1 hour per simulation experiment). For more details regarding the setup for the fusion simulation experiments, see Appendix G.1.
Over 200 evaluations, we observe that J-MTS (the blue curve in Figure 3 (a)) outperforms J-TS, which shows the merit of focusing on states that are deemed more "difficult", rather than selecting a state uniformly. This behavior can also be seen in the per-task performance and query plots in Figure 3 (b). Once the reward has levelled off for a certain task (e.g., plasma states 3 and 4), J-MTS stops querying that task and queries other tasks that are predicted to provide improvement, while J-TS continues to query it, since it chooses tasks randomly. J-MTS also outperforms the MTS and TS variants with independent models for each state, I-MTS and I-TS, which shows the merit of jointly learning the state-action space and sharing information across the correlated states. With independent GPs for each state, I-TS outperformed I-MTS. We believe this may be because of occasional erratic outputs from the simulator.
This occurred more frequently in some states than in others, and when the limits of the simulator were tested with extreme controls (e.g., very high power for all neutral beams). In such cases, I-MTS will estimate the reward landscape to be non-smooth and focus on the particular state. This behavior is shown by the high proportion of queries made by I-MTS in state 3 (Figure 3 (c)) despite little further improvement in reward (Figure 3 (b)). It is worthwhile to note, however, that this behavior is not evident in J-MTS, and we believe that using all queries of state-action pairs to learn a single joint state-action model is more robust to extreme observations from a particular state. An experiment setting where the controls are highly constrained, and hence less straining to the simulator, is presented in Appendix G.2 and in Chung et al. [2020].

Figure 3: Fusion Simulation Experiments. Each of the above shows average values and standard error from 5 trials. (a) shows the total regret summed across all tasks, (b) the regret achieved in each task, and (c) the proportion of capital spent in each task. Note that the curves in (b) differ in length since different amounts of resources were allocated for each task.

Discussion of Physical Results. These results are promising, not only from an algorithmic perspective, but also from a physics perspective. While there have been applications of machine learning techniques in nuclear fusion, they primarily focus on detecting disruptions and plasma instabilities [Cannas et al., 2013, Tang et al., 2016, Montes et al., 2019, Kates-Harbeck et al., 2019]. The work done by Baltz et al. [2017] is the closest to our application; however, since they were running costly experiments online, their optimization leveraged human operators to ultimately decide which evaluation to perform next.
To the best of our knowledge, our application is one of the first attempts at conducting offline optimization for tokamak control.
With these initial results established, we hope to continue progress on this problem by forming a closed-loop controller for a tokamak. This requires expanding the number of tasks so that they cover the plasma's state space, and adapting to the fact that the state space is continuous (i.e., either via interpolation or using the continuous variant of MTS). Furthermore, we wish to develop more sophisticated plasma state representations, actions that can be applied, and reward functions in order to discover more interesting results. Lastly, readers may note that a controller derived from this method may not be optimal. Here, we have been seeking actions that myopically maximize reward; however, the real goal is to find an optimal sequence of actions that maximizes long-term reward. We started with this approach since simulations are expensive, and we hope that this approximation still leads to a good controller. That being said, in the future we would like to extend the ideas of our algorithm to the reinforcement learning setting in order to derive sample-efficient methods.

6 Conclusion

In this paper, we have proposed methods for settings in which many optimization problems need to be solved simultaneously. We introduced a posterior sampling approach that has theoretical guarantees and often has dominant performance when compared to methods that do not distribute resources intelligently.
This Thompson sampling method pairs nicely with our proposed locally stationary model, and we demonstrated that more sophisticated models are key when functions vary in difficulty. Finally, we used our algorithm to derive real results for nuclear fusion, which we hope to build upon in future work.

7 Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant Nos. DGE1252522 and DGE1745016. Willie Neiswanger is also supported by NSF grants CCF1629559 and IIS1563887. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Youngseog Chung is supported by the Kwanjeong Educational Foundation.
The authors would also like to thank the reviewers for their helpful feedback.

References

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

EA Baltz, E Trask, M Binderbauer, M Dikovsky, H Gota, R Mendoza, JC Platt, and PF Riley. Achievement of sustained net plasma heating in a fusion experiment with the optometrist algorithm. Scientific Reports, 7(1):6425, 2017.

Franklin H Branin. Widely convergent method for finding multiple solutions of simultaneous nonlinear equations.
IBM Journal of Research and Development, 16(5):504–522, 1972.

Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.

Barbara Cannas, Alessandra Fanni, A Murari, Alessandro Pau, Giuliana Sias, and JET EFDA Contributors. Automatic disruption classification based on manifold learning for real-time applications on JET. Nuclear Fusion, 53(9):093023, 2013.

Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Youngseog Chung, Ian Char, Willie Neiswanger, Kirthevasan Kandasamy, Andrew Oakleigh Nelson, Mark D Boyer, Egemen Kolemen, and Jeff Schneider. Offline contextual Bayesian optimization for nuclear fusion. arXiv preprint arXiv:2001.01793, 2020.

Alexander Fabisch and Jan Hendrik Metzen. Active contextual policy search. The Journal of Machine Learning Research, 15(1):3371–3399, 2014.

Alexander Fabisch, Jan Hendrik Metzen, Mario Michael Krell, and Frank Kirchner. Accounting for task-difficulty in active multi-task robot control learning. KI-Künstliche Intelligenz, 29(4):369–377, 2015.

Peter Frazier, Warren Powell, and Savas Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.

Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

Mark N Gibbs. Bayesian Gaussian processes for regression and classification. PhD thesis, University of Cambridge, 1998.

David Ginsbourger, Jean Baccou, Clément Chevalier, Frédéric Perales, Nicolas Garland, and Yann Monerie. Bayesian adaptive reconstruction of profile optima and optimizers.
SIAM/ASA Journal on Uncertainty Quantification, 2(1):490–510, 2014.

Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

BA Grierson, X Yuan, M Gorelenkova, S Kaye, NC Logan, O Meneghini, SR Haskey, J Buchanan, M Fitzgerald, SP Smith, et al. Orchestrating TRANSP simulations for interpretative and predictive tokamak modeling with OMFIT. Fusion Science and Technology, 74(1-2):101–115, 2018.

Carlos Guestrin, Andreas Krause, and Ajit Paul Singh. Near-optimal sensor placements in Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pages 265–272. ACM, 2005.

Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133–142, 2018.

Kirthevasan Kandasamy, Willie Neiswanger, Reed Zhang, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. Myopic posterior sampling for adaptive goal oriented design of experiments. In Proceedings of the 36th International Conference on Machine Learning, 2019a.

Kirthevasan Kandasamy, Karun Raju Vysyaraju, Willie Neiswanger, Biswajit Paria, Christopher R Collins, Jeff Schneider, Barnabas Poczos, and Eric P Xing. Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly. arXiv preprint arXiv:1903.06694, 2019b.

Julian Kates-Harbeck, Alexey Svyatkovskiy, and William Tang. Predicting disruptive instabilities in controlled fusion plasmas through deep learning. Nature, 2019.

Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization.
In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.

Jan Hendrik Metzen. Active contextual entropy search. arXiv preprint arXiv:1511.04211, 2015.

Kevin Joseph Montes, Cristina Rea, Robert Granetz, Roy Alexander Tinguely, Nicholas W Eidietis, O Meneghini, Dalong Chen, Biao Shen, Bingjia Xiao, Keith Erickson, et al. Machine learning for disruption warning on Alcator C-Mod, DIII-D, and EAST. Nuclear Fusion, 2019.

Willie Neiswanger, Kirthevasan Kandasamy, Barnabas Poczos, Jeff Schneider, and Eric Xing. ProBO: A framework for using probabilistic programming in Bayesian optimization. arXiv preprint arXiv:1901.11515, 2019.

Edoardo Pasolli and Farid Melgani. Gaussian process regression within an active learning scheme. In 2011 IEEE International Geoscience and Remote Sensing Symposium, pages 3574–3577. IEEE, 2011.

Michael Pearce and Juergen Branke. Continuous multi-task Bayesian optimisation with correlation. European Journal of Operational Research, 270(3):1074–1085, 2018.

Victor Picheny, Tobias Wagner, and David Ginsbourger. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3):607–626, 2013.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning series). The MIT Press, 2005. ISBN 9780262182539.

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.

Sambu Seo, Marko Wallat, Thore Graepel, and Klaus Obermayer. Gaussian process regression: Active data selection and test point rejection. In Mustererkennung 2000, pages 27–34.
Springer, 2000.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.

Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.

William Tang, Matthew Parsons, Eliot Feibush, A Murari, J Vega, A Pereira, and J Choi. Big data machine learning for disruption predictions. In 26th IAEA Fusion Energy Conference (IAEA CN-234), Paper Number EX/P6-47, 2016.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Saul Toscano-Palmerin and Peter I Frazier. Bayesian optimization with expensive integrands. arXiv preprint arXiv:1803.08661, 2018.

Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1009–1024.
ACM, 2017.