{"title": "Contextual Gaussian Process Bandit Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2447, "page_last": 2455, "abstract": "How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context.", "full_text": "Contextual Gaussian Process Bandit Optimization\n\nAndreas Krause\n\nCheng Soon Ong\n\nDepartment of Computer Science, ETH Zurich,\n\n8092 Zurich, Switzerland\n\nkrausea@ethz.ch\n\nchengsoon.ong@inf.ethz.ch\n\nAbstract\n\nHow should we design experiments to maximize performance of a complex\nsystem,\ntaking into account uncontrollable environmental conditions? How\nshould we select relevant documents (ads) to display, given information about the\nuser? 
These tasks can be formalized as contextual bandit problems, where at each\nround, we receive context (about the experimental conditions, the query), and\nhave to choose an action (parameters, documents). The key challenge is to trade\noff exploration by gathering data for estimating the mean payoff function over the\ncontext-action space, and to exploit by choosing an action deemed optimal based\non the gathered data. We model the payoff function as a sample from a Gaussian\nprocess de\ufb01ned over the joint context-action space, and develop CGP-UCB, an\nintuitive upper-con\ufb01dence style algorithm. We show that by mixing and matching\nkernels for contexts and actions, CGP-UCB can handle a variety of practical ap-\nplications. We further provide generic tools for deriving regret bounds when using\nsuch composite kernel functions. Lastly, we evaluate our algorithm on two case\nstudies, in the context of automated vaccine design and sensor management. We\nshow that context-sensitive optimization outperforms no or naive use of context.\n\n1\n\nIntroduction\n\nConsider the problem of learning to optimize a complex system subject to varying environmental\nconditions. Or learning to retrieve relevant documents (ads), given context about the user. Or learn-\ning to solve a sequence of related optimization and search tasks, by taking into account experience\nwith tasks solved previously. All these problems can be phrased as a contextual bandit problem (c.f.,\n[1, 2], we review related work in Section 7), where in each round, we receive context (about the\nexperimental conditions, the query, or the task), and have to choose an action (system parameters,\ndocument to retrieve). We then receive noisy feedback about the obtained payoff. 
The key challenge\nis to trade off exploration by gathering data for estimating the mean payoff function over the context-\naction space, and to exploit by choosing an action deemed optimal based on the gathered data.\nWithout making any assumptions about the class of payoff functions under consideration, we\ncannot expect to do well. A natural approach is to choose a regularizer, encoding assumptions\nabout smoothness of the payoff function.\nIn this paper, we take a nonparametric approach, and\nmodel the payoff function as a sample from a Gaussian process de\ufb01ned over the joint context-action\nspace (or having low norm in the associated RKHS). This approach allows us to estimate the\npredictive uncertainty in the payoff function estimated from previous experiments, guiding the\ntradeoff between exploration and exploitation.\nIn the context-free case, this problem is studied\nby [3], who analyze GP-UCB, an upper-con\ufb01dence bound-based sampling algorithm that makes\nuse of the predictive uncertainty to trade exploration and exploitation. In this paper, we develop\nCGP-UCB, a natural generalization of GP-UCB, which takes context information into account.\nBy constructing a composite kernel function for the regularizer from kernels de\ufb01ned over the action\nand context spaces (e.g., a linear kernel on the actions, and Gaussian kernel on the contexts), we can\ncapture several natural contextual bandit problem formulations. We prove that CGP-UCB incurs\n\n1\n\n\fsublinear contextual regret (i.e., prove that it competes with the optimal mapping from context\nto actions) for a large class of composite kernel functions constructed in this manner. Lastly, we\nevaluate our algorithm on two real-world case studies in the context of automated vaccine design,\nand management of sensor networks. 
We show that in both these problems, properly taking into account contextual information outperforms ignoring or naively using context.
In summary, as our main contributions we

• develop an efficient algorithm, CGP-UCB, for the contextual GP bandit problem;
• show that by flexibly combining kernels over contexts and actions, CGP-UCB can be applied to a variety of applications;
• provide a generic approach for deriving regret bounds for composite kernel functions;
• evaluate CGP-UCB on two case studies, related to automated vaccine design and sensor management.

2 Modeling Contextual Bandits with Gaussian Processes
We consider playing a game for a sequence of T (not necessarily known a priori) rounds. In each round, we receive a context zt ∈ Z from a (not necessarily finite) set Z of contexts, and have to choose an action st ∈ S from a (not necessarily finite) set S of actions. We then receive a payoff yt = f(st, zt) + εt, where f : S × Z → R is an (unknown) function, and εt is zero-mean random noise (independent across the rounds). The addition of (externally chosen) contextual information captures a critical component in many applications, and generalizes the k-armed bandit setting. Since f is unknown, we will not generally be able to choose the optimal action, and thus incur regret rt = sup_{s′∈S} f(s′, zt) − f(st, zt). After T rounds, our cumulative regret is RT = Σ_{t=1}^T rt. The context-specific best action is a more demanding benchmark than the best action used in the (context-free) definition of regret. Our goal will be to develop an algorithm which achieves sublinear contextual regret, i.e., RT/T → 0 for T → ∞. 
Note that achieving sublinear contextual regret\nrequires learning (and competing with) the optimal mapping from contexts to actions.\nRegularity assumptions are required, since without any there could be a single action s\u2217 \u2208 S that\nobtains payoff of 1, and all other actions obtain payoff 0. With in\ufb01nite action sets, no algorithm will\nbe able to identify s\u2217 in \ufb01nite time. In this paper, we assume that the function f : S \u00d7 Z \u2192 R\nis a sample from a known Gaussian process (GP) distribution1. A Gaussian process is a collection\nof dependent random variables, one for each x \u2208 X, such that every \ufb01nite marginal distribution\nis a multivariate Gaussian (while ensuring overall consistency) [4]. Here we use X = S \u00d7 Z\nto refer to the set of all action-context pairs. A GP (\u00b5, k) is fully speci\ufb01ed by its mean function\n\u00b5 : X \u2192 R, \u00b5(x) = E[f (x)] and covariance (or kernel) function k : X \u00d7 X \u2192 R, k(x, x(cid:48)) =\nE[(f (x)\u2212 \u00b5(x))(f (x(cid:48))\u2212 \u00b5(x(cid:48)))]. Without loss of generality [4], we assume that \u00b5 \u2261 0. We further\nassume bounded variance by restricting k(x, x) \u2264 1, for all x \u2208 X. The covariance function k\nencodes smoothness properties of sample functions f drawn from the GP. Since the random variables\nare action-context pairs, often there is a natural decomposition of the covariance function k into the\ncorresponding covariance functions on actions and contexts (Section 5).\nA major computational bene\ufb01t of working with GPs is the fact that posterior inference can be\nperformed in closed form. Suppose we have collected observations yT = [y1 . . . yT ]T at inputs\nAT = {x1, . . . , xT}, yt = f (xt) + \u0001t with i.i.d. 
Gaussian noise εt ∼ N(0, σ²), the posterior distribution over f is a GP with mean µT(x), covariance kT(x, x′) and variance σ²T(x), with parameters estimated as

µT(x) = kT(x)^T (KT + σ²I)^{−1} yT,
kT(x, x′) = k(x, x′) − kT(x)^T (KT + σ²I)^{−1} kT(x′),
σ²T(x) = kT(x, x),

where kT(x) = [k(x1, x) . . . k(xT, x)]^T and KT = [k(x, x′)]_{x,x′∈AT} is the (positive semi-definite) kernel matrix. The choice of the kernel function turns out to be crucial in regularizing the function class to achieve sublinear regret (Section 4).

¹We will also consider the case where f has low norm in the RKHS associated with the covariance k.

3 The Contextual Upper Confidence Bound Algorithm
In the context-free case Z = ∅, the problem of trading off exploration and exploitation with payoff functions sampled from a Gaussian process is studied by [3]. They show that a simple upper confidence bound algorithm, GP-UCB (Equation 1), achieves sublinear regret. At round t, GP-UCB picks action st = xt such that

st = argmax_{s∈S} µt−1(s) + βt^{1/2} σt−1(s),   (1)

where βt are appropriate constants. Here µt−1(·) and σt−1(·) are the posterior mean and standard deviation conditioned on the observations (s1, y1), . . . , (st−1, yt−1). 
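As a concrete illustration, the closed-form posterior update can be sketched in a few lines of NumPy (a minimal sketch, not the authors' code; the squared exponential kernel and noise level are illustrative assumptions):

```python
import numpy as np

def sqexp_kernel(X1, X2, lengthscale=1.0):
    """Squared exponential kernel on rows of X1, X2 (here, context-action pairs)."""
    sqdist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sqdist / lengthscale**2)

def gp_posterior(X_T, y_T, X_star, noise_var=0.01):
    """Posterior mean and variance at query points X_star (Section 2 formulas)."""
    G = sqexp_kernel(X_T, X_T) + noise_var * np.eye(len(X_T))   # K_T + sigma^2 I
    k_star = sqexp_kernel(X_T, X_star)                          # k_T(x) per query x
    mu = k_star.T @ np.linalg.solve(G, y_T)
    post_var = 1.0 - (k_star * np.linalg.solve(G, k_star)).sum(axis=0)  # k(x,x)=1
    return mu, np.maximum(post_var, 0.0)
```

The observed inputs X_T stack the chosen context-action pairs row-wise; the posterior variance shrinks at points that are close, under the kernel, to previous observations.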
This GP-UCB objective naturally trades off exploration (picking actions with uncertain outcomes, i.e., large σt−1(s)) and exploitation (picking actions expected to do well, i.e., having large µt−1(s)).
We propose a natural generalization of GP-UCB, which incorporates contextual information:

st = argmax_{s∈S} µt−1(s, zt) + βt^{1/2} σt−1(s, zt),   (2)

where µt−1(·) and σt−1(·) are the posterior mean and standard deviation of the GP over the joint set X = S × Z conditioned on the observations (s1, z1, y1), . . . , (st−1, zt−1, yt−1). Thus, when presented with context zt, this algorithm uses posterior inference to predict mean and variance for each possible decision s, conditioned on all past observations (involving the chosen actions, the observed contexts, and the noisy payoffs). We call the greedy algorithm implementing rule 2 the contextual Gaussian process UCB algorithm (CGP-UCB). As we will show in Section 5, this algorithm allows us to incorporate various assumptions about the dependencies of the payoff function on the chosen actions and observed contexts. It also allows us to generalize several approaches proposed in the literature [3, 5, 6]. In the following, we will prove that in many practical applications, CGP-UCB attains sublinear contextual regret (i.e., is able to compete with the optimal mapping from contexts to actions).

4 Bounds on the Contextual Regret
Bounding the contextual regret of CGP-UCB is a challenging problem, since the regret is measured with respect to the best action for each context. Intuitively, the amount of regret we incur should depend on how quickly we can gather information about the payoff function, which now jointly depends on context and actions. 
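One round of rule (2) over a finite action set can be sketched as follows (a self-contained, hypothetical sketch: the kernel, βt, and noise level are illustrative choices, and scalar actions/contexts are used for brevity):

```python
import numpy as np

def kernel(X1, X2, ell=1.0):
    """Illustrative squared exponential kernel on (action, context) rows."""
    sqdist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sqdist / ell**2)

def cgp_ucb_step(actions, z_t, X_obs, y_obs, beta_t, noise_var=0.01):
    """Rule (2): argmax_s mu_{t-1}(s, z_t) + beta_t^{1/2} sigma_{t-1}(s, z_t)."""
    X_star = np.column_stack([actions, np.full(len(actions), z_t)])
    G = kernel(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    k_star = kernel(X_obs, X_star)
    mu = k_star.T @ np.linalg.solve(G, y_obs)
    sigma2 = np.maximum(1.0 - (k_star * np.linalg.solve(G, k_star)).sum(axis=0), 0.0)
    return int(np.argmax(mu + np.sqrt(beta_t * sigma2)))
```

With βt = 0 the rule is purely greedy; a large βt favors unexplored, high-variance actions for the presented context.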
In the following, we show that the contextual regret of CGP-UCB is bounded by an intuitive information-theoretic quantity, which quantifies the mutual information between the observed context-action pairs and the estimated payoff function f.
We start by reviewing the special case of [3] where no context information is provided. It is shown that in this context-free case, the regret RT of the GP-UCB algorithm can be bounded as O*(√(T γT)), where γT is defined as

γT := max_{A⊂S : |A|=T} I(yA; f),

where I(yA; f) = H(yA) − H(yA | f) quantifies the reduction in uncertainty (measured in terms of differential Shannon entropy [7]) about f achieved by revealing yA. In the multivariate Gaussian case, the entropy can be computed in closed form: H(N(µ, Σ)) = ½ log|2πeΣ|, so that I(yA; f) = ½ log|I + σ^{−2}KA|, where KA = [k(s, s′)]_{s,s′∈A} is the Gram matrix of k evaluated on the set A ⊆ S.
For the contextual case, our regret bound also comes in terms of the quantity γT, redefined so that the information gain I(yA; f) now depends on the observations yA = [y(x)]_{x∈A} of the joint context-action pairs x = (s, z), and f : S × Z → R is the payoff function over the context-action space. Consequently, the kernel matrix KA = [k(x, x′)]_{x,x′∈A} is defined over context-action pairs. Using this notion of information gain γT, we lift the results of [3] to the much more general contextual bandit setting, shedding further light on the connection between bandit optimization and information gain. In Section 5, we show how to bound γT for composite kernels, combining possibly different assumptions about the regularity of f in the action space S and the context space Z.
We consider the same three settings as analyzed in [3]. Note that none of the results subsume the others, and so all cases may be of use. 
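The information gain has a one-line implementation for Gaussian observations (a sketch with illustrative noise level; the function name is ours):

```python
import numpy as np

def information_gain(K_A, noise_var=1.0):
    """I(y_A; f) = 0.5 * log det(I + sigma^{-2} K_A) for Gaussian noise."""
    sign, logdet = np.linalg.slogdet(np.eye(len(K_A)) + K_A / noise_var)
    return 0.5 * logdet
```

For T independent unit-variance points (K_A = I) this gives (T/2) log(1 + σ⁻²), while strongly correlated points yield much less information — the intuition behind the γT bounds for smooth kernels.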
For the first two settings, we assume a known GP prior and consider (1) a finite X and (2) an infinite X with mild assumptions about k. A third (and perhaps more "agnostic") way to express assumptions about f is to require that f has low "complexity" as quantified in terms of the norm in the Reproducing Kernel Hilbert Space (RKHS, [8]) associated with kernel k.

Theorem 1 Let δ ∈ (0, 1). Suppose one of the following assumptions holds:

1. X is finite, f is sampled from a known GP prior with known noise variance σ², and βt = 2 log(|X| t² π² / (6δ)).

2. X ⊆ [0, r]^d is compact and convex, d ∈ N, r > 0. Suppose f is sampled from a known GP prior with known noise variance σ², and that k(x, x′) satisfies the following high probability bound on the derivatives of GP sample paths f: for some constants a, b > 0,

Pr{ sup_{x∈X} |∂f/∂xj| > L } ≤ a e^{−(L/b)²},  j = 1, . . . , d.

Choose βt = 2 log(t² 2π² / (3δ)) + 2d log( t² d b r √(log(4da/δ)) ).

3. X is arbitrary; ||f||_k ≤ B. The noise variables εt form an arbitrary martingale difference sequence (meaning that E[εt | ε1, . . . , εt−1] = 0 for all t ∈ N), uniformly bounded by σ. Further define βt = 2B² + 300 γt ln³(t/δ).

Then the contextual regret of CGP-UCB is bounded by O*(√(T γT βT)) w.h.p. Precisely,

Pr{ RT ≤ √(C1 T βT γT) + 2  ∀T ≥ 1 } ≥ 1 − δ,

where C1 = 8 / log(1 + σ^{−2}).

Theorem 1 (proof given in the supplemental material) shows that, in cases (1) and (2), with high probability over samples from the GP, the cumulative contextual regret is bounded in terms of the maximum information gain with respect to the GP defined over S × Z. 
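To make the constants concrete, the confidence schedule of case (1) and the leading term of the bound can be evaluated directly (a small numeric sketch; the function names are ours, not from the paper):

```python
import numpy as np

def beta_t_finite(t, X_size, delta):
    """Case (1) of Theorem 1: beta_t = 2 log(|X| t^2 pi^2 / (6 delta))."""
    return 2.0 * np.log(X_size * t**2 * np.pi**2 / (6.0 * delta))

def regret_leading_term(T, beta_T, gamma_T, noise_var):
    """Leading term sqrt(C1 T beta_T gamma_T), with C1 = 8 / log(1 + sigma^{-2})."""
    C1 = 8.0 / np.log(1.0 + 1.0 / noise_var)
    return np.sqrt(C1 * T * beta_T * gamma_T)
```

Since βt grows only logarithmically in t, any kernel with βT γT = o(T) (e.g., polylogarithmic information gain) yields RT/T → 0, i.e., sublinear contextual regret.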
In case of assumption (3), a regret bound is obtained in a more agnostic setting, where no prior on f is assumed, and much weaker assumptions are made about the noise process. Note that case (3) requires a bound B on ||f||_k. If no such bound is available, standard guess-and-doubling arguments can be used.

5 Applications of CGP-UCB
By choosing different kernel functions k : X × X → R, the CGP-UCB algorithm can be applied to a variety of different applications. A natural approach is to start with kernel functions kZ : Z × Z → R and kS : S × S → R on the spaces of contexts and actions, and use them to derive the kernel on the product space.

5.1 Constructing Composite Kernels
One possibility is to consider a product kernel k = kS ⊗ kZ, by setting (kS ⊗ kZ)((s, z), (s′, z′)) = kZ(z, z′) kS(s, s′). The intuition behind this product kernel is a conjunction of the notions of similarity induced by the kernels over the context and action spaces: two context-action pairs are similar (large correlation) if the contexts are similar and the actions are similar (Figure 1(a)). Note that many kernel functions used in practice are already in product form. For example, if kZ and kS are squared exponential kernels (or Matérn kernels with smoothness parameter ν), then the product k = kZ ⊗ kS is a squared exponential kernel (or a Matérn kernel with smoothness parameter ν). Similarly, if kS and kZ have finite rank mS and mZ (i.e., all kernel matrices over finite sets have rank at most mS and mZ, respectively), then kS ⊗ kZ has finite rank mS mZ. However, other kernel functions can be naturally combined as well.

Figure 1: Illustrations of composite kernel functions that can be incorporated into CGP-UCB. (a) Product of squared exponential kernel and linear kernel; (b) additive combination of a payoff function that smoothly depends on context, and exhibits clusters of actions. In general, context and action spaces are higher dimensional.

An alternative is to consider the additive combination (kS ⊕ kZ)((s, z), (s′, z′)) = kZ(z, z′) + kS(s, s′), which is positive definite as well. The intuition behind this construction is that a GP with additive kernel can be understood as a generative model, which first samples a function fS(s, z) that is constant along z and varies along s with regularity as expressed by kS; it then samples a function fZ(s, z), which varies along z and is constant along s; then f = fS + fZ. Thus, the fZ component models overall trends according to the context (e.g., encoding assumptions about similarity within clusters of contexts), and the fS component models action-specific deviations from this trend (Figure 1(b)). In Section 5.3, we provide examples of applications that can be captured in this framework.

5.2 Bounding the Information Gain for Composite Kernels
Since the key quantity governing the regret is the information gain γT, we would like to find a convenient way of bounding γT for composite kernels (kS ⊗ kZ and kS ⊕ kZ), plugging in different regularity assumptions for the contexts (via kZ) and actions (via kS). 
More formally, let us define

γ(T; k; V) = max_{A⊆V, |A|≤T} ½ log| I + σ^{−2} [k(v, v′)]_{v,v′∈A} |,

which quantifies the maximum possible information gain achievable by sampling T points in a GP defined over the set V with kernel function k. In [3, Theorem 5], bounds on γ(T; k; V) were derived for common kernel functions, including the linear kernel (γ(T; k; V) = O(d log T) in d dimensions), the squared exponential kernel (γ(T; k; V) = O((log T)^{d+1})), and Matérn kernels (γ(T; k; V) = O(T^{d(d+1)/(2ν+d(d+1))} log T) for smoothness parameter ν).
In the following, we show how γ(T; k; V) can be bounded for composite kernels of the form kS ⊗ kZ and kS ⊕ kZ, depending on γ(T; kS; S) and γ(T; kZ; Z).

Theorem 2 Let kZ be a kernel function on Z with rank at most d (i.e., all Gram matrices over arbitrary finite sets of points A ⊆ Z have rank at most d). Then

γ(T; kS ⊗ kZ; X) ≤ d γ(T; kS; S) + d log T.

The assumptions of Theorem 2 are satisfied, for example, if |Z| < ∞ and rk KZ = d, or if kZ is a d-dimensional linear kernel on Z ⊆ R^d. Theorem 2 also holds with the roles of kZ and kS reversed.

Theorem 3 Let kS and kZ be kernel functions on S and Z respectively. Then for the additive combination k = kS ⊕ kZ defined on X it holds that

γ(T; kS ⊕ kZ; X) ≤ γ(T; kS; S) + γ(T; kZ; Z) + 2 log T.

Proofs of Theorems 2 and 3 are given in the supplemental material. 
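A numerical sketch of the two compositions (the specific kernels are illustrative; the positive semi-definiteness and rank checks are just sanity checks, not part of the theorems):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(30, 1))        # actions
Z = rng.normal(size=(30, 2))        # contexts, paired with the actions row-wise

def k_S(A, B):                      # squared exponential on actions
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def k_Z(A, B):                      # linear kernel on contexts
    return A @ B.T

K_prod = k_S(S, S) * k_Z(Z, Z)      # (k_S ⊗ k_Z): similar action AND similar context
K_add = k_S(S, S) + k_Z(Z, Z)       # (k_S ⊕ k_Z): f = f_S + f_Z

# Both compositions remain positive semi-definite (up to numerical tolerance).
for K in (K_prod, K_add):
    assert np.linalg.eigvalsh(K).min() > -1e-8
```

Here k_Z is a rank-2 linear kernel, so Theorem 2 applies with d = 2; for the additive combination, Theorem 3 simply adds the two information gains (plus 2 log T).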
By combining the results above with the information gain bounds of [3], we immediately obtain that, e.g., γT for the product of a d1-dimensional linear kernel and a d2-dimensional Gaussian kernel is O(d1 (log T)^{d2+1}).

5.3 Example Applications
We now illustrate the generality of the CGP-UCB approach by fleshing out four possible applications. In Section 6, we experimentally evaluate CGP-UCB on two of these applications.
Online advertising and news recommendation. Suppose an online service would like to display query-specific ads. This is the textbook contextual bandit problem [9]. There are |S| = m different ads to select from, and each round we receive, for each ad s ∈ S, a feature vector zs. Thus, the complete context is z = [z1, . . . , zm]. [9] model the expected payoff for each action as an (unknown) linear function µ(s, z) = zs^T θ*_s. Hereby, θ*_s models the dependence of action s on the context z. Besides online advertising, a similar model has been proposed and experimentally studied by [6] for the problem of contextual news recommendation (see Section 7 for a discussion). Both these problems are addressed by CGP-UCB by choosing KS = I as the m × m identity matrix, and KZ as the linear kernel on the features².

Figure 2: CGP-UCB applied to the average (a) and maximum regret over all molecules (b) for three methods on the MHC benchmark. (c) Context similarity using inter-task predictions.

In this application, additive kernel combinations may be useful to model temporal dependencies of the overall click probabilities (e.g., during the evening, users may or may not be more likely to click on an ad than during business hours).
Learning to control complex systems. Suppose we have a complex system and would like to achieve some desired behavior, for example robot walking [10]. 
In such a setting, we may wish to estimate a controller in a data-driven manner; however, we would also like to maximize the performance of the estimated controller, resulting in an exploration–exploitation tradeoff. In addition to the controller parameters s ∈ S ⊆ R^{dS}, the system may be exposed to changing (in an uncontrollable manner) environmental conditions, which are provided as context z ∈ Z ⊆ R^{dZ}. The goal is thus to learn which control parameters to apply in which conditions to maximize system performance. In this case, we may consider using a linear kernel kZ(z, z′) = z^T z′ to model the dependence of the performance on environmental features, and a squared exponential kernel kS(s, s′) to model the smooth but nonlinear response of the system to the chosen control parameters. Theorems 1 and 2 bound RT = O*(√(T dZ (log T)^{dS+1})). Additive kernel combinations may allow modeling the fact that control in some contexts (environments) is inherently more difficult (or noisy).
Multi-task experimental design. Suppose we would like to perform a sequence of related experiments. In particular, in Section 6.1 we consider the case of vaccine design. The aim is to discover peptide sequences which bind to major histocompatibility complex (MHC) molecules. MHC molecules present fragments of proteins from within the cell to T cells, so that healthy cells are left alone, while cells containing foreign proteins are attacked by the immune system. Here, each experiment is associated with a set of features (encoding the MHC alleles), which are provided as context z. The goal in each experiment is to choose a stimulus (the vaccine) s ∈ S that maximizes an observed response (binding affinity). 
In this case, we may consider using a finite inter-task covariance kernel KZ with rank mZ to model the similarity of different experiments, and a Gaussian kernel kS(s, s′) to model the smooth but nonlinear dependency of the stimulus response on the experimental parameters. Theorems 1 and 2 bound RT = O*(√(T mZ (log T)^{dS+1})).
Spatiotemporal monitoring with sensor networks. Suppose we have deployed a network of sensors, which we wish to use to monitor the maximum temperature in a building. Due to battery limitations, we would like, at each timestep, to activate only a few sensors. We can cast this problem in the contextual bandit setting, where the time of day is considered as the context z ∈ Z, and each action s ∈ S corresponds to picking a sensor. Since the sun moves relative to the building, the hottest point in the building changes depending on the time of day, and we would like to learn which sensors to activate at which time of day. In this problem, we would estimate a joint spatio-temporal covariance function (e.g., using the Matérn kernel), and use it for inference. We show experimental results for this problem in Section 6.2.

6 Experiments
In our two experimental case studies, we aim to study how much context information can help. We compare three methods: ignoring (correlation between) contexts by running a separate instance of GP-UCB for every context (i.e., ignoring measurements from all but the current molecule or time);

²[6] also propose a more complex hybrid model that uses features shared between the actions. 
This model is also captured in our framework by adding a second kernel function, which composes a low-rank matrix (instead of I) with the linear kernel.

Figure 3: CGP-UCB applied to temperature data from a network of 46 sensors at Intel Research Berkeley. (a) Using minimum; (b) using average; (c) test data.

running a single instance of GP-UCB, merging together the context information (i.e., ignoring the molecule or time information); and running CGP-UCB, conditioning on measurements made at different contexts (MHC molecules considered / times of day) using the product kernel.

6.1 Multi-task Bayesian Optimization of MHC Class-I Binding Affinity
We perform experiments on the multi-task vaccine design problem introduced in Section 5.3. In our experiments, we focus on a subset of MHC class I molecules that have affinity binding scores available. Each experimental design task corresponds to searching for maximally binding peptides, which is a vital step in the design of peptide-based vaccines. We use the data from [11], which is part of a benchmark set of MHC class I molecules [12]. The data contains binding affinities (IC50 values), as well as features extracted from the peptides. Peptides with IC50 values greater than 500 nM were considered non-binders; all others, binders. We convert the IC50 values to negative log scale and normalize them so that 500 nM corresponds to zero, i.e., −log10(IC50) + log10(500). In total, we consider identifying peptides for seven different MHC molecules (i.e., seven related tasks = contexts). 
The context similarity was obtained using the Hamming distance between amino acids in the binding pocket [11] (see Figure 2(c)), and we used the Gaussian kernel on the extracted features. We used a random subset of 1000 examples to estimate hyperparameters, and then considered each MHC allele in the order shown in Figure 2(c). For each MHC molecule, we ran CGP-UCB for 50 trials.
From Figure 2(a) we see that for the first three molecules (up to trial 150), which are strongly correlated, merging contexts and CGP-UCB perform similarly, and both perform better than ignoring observations from other MHC molecules previously considered. However, the fourth molecule (A 0201) has little correlation with the earlier ones, and hence simply merging contexts performs poorly. We also wish to study how long it takes, in the worst case over all seven molecules, to identify a peptide with a binding affinity of the desired strength. Therefore, in Figure 2(b), we plot, for each t from 1 to 50, the largest (across the seven tasks) discrepancy between the maximum achievable affinity and the best affinity score observed in the first t trials. We find that by exploiting correlation among contexts, CGP-UCB outperforms the two baseline approaches.

6.2 Learning to Monitor Sensor Networks
We also apply CGP-UCB to the spatiotemporal monitoring problem described in Section 5. We use data from 46 sensors deployed at Intel Research, Berkeley. The data set contains 4 days of data, sampled at 5 minute intervals. We take the first 24 hours to fit (by maximizing the marginal likelihood) the parameters of a spatio-temporal covariance function (we choose the Matérn kernel with ν = 2.5). 
On the remaining 3 days of data (see Figure 3(c)), we then proceed by, at each time step, sequentially activating 5 sensors and reporting the regret of the average and maximum temperature measured (hereby the regret is the error in estimating the actual maximum temperature reported by any of the 46 sensors).
Figure 3(a) (using the maximum temperature among the 5 readings at each time step) and 3(b) (using the average temperature) show the results of this experiment. Notice that ignoring contexts performs poorly. Merging contexts (a single instance of context-free GP-UCB) performs best for the first few timesteps (since temperature is very similar, and the highest temperature sensor does not change). However, after running CGP-UCB on more than one day of data (i.e., until context reoccurs), it outperforms the other methods, since it is able to learn to query the maximum temperature sensors as a function of the time of day.

7 Related Work
The use of upper confidence bounds to trade off exploration and exploitation has been introduced by [13], and studied thereafter [1, 14, 15, 16]. The approach for the classical k-armed bandit setting [17] has been generalized to more complex settings, such as infinite action sets and linear payoff functions [14, 18], Lipschitz continuous payoff functions [15] and locally-Lipschitz functions [19]. However, there is a strong tradeoff between the strength of the assumptions and the achievable regret bounds. For example, while O(d√(T log T)) can be achieved in the linear setting [14], if only Lipschitz continuity is assumed, regret bounds scale as Ω(T^{(d+1)/(d+2)}) [15]. 
Srinivas et al. [3] analyze the case where the payoff function is sampled from a GP, which encodes configurable assumptions. The present work builds on and strictly generalizes their approach. In fact, in the context-free case, CGP-UCB is precisely the GP-UCB algorithm of [3]. The ability to incorporate contextual information, however, significantly expands the class of applications of GP-UCB. Besides handling context and bounding the stronger notion of contextual regret, in this paper we provide generic techniques for obtaining regret bounds for composite kernels. An alternative rule (in the context-free setting) is the Expected Improvement algorithm [20], for which no bounds on the cumulative regret are known.
For contextual bandit problems, work has focused on the case of finitely many actions, where the goal is to obtain sublinear contextual regret against classes of functions mapping contexts to actions [1]. This setting resembles (multi-class) classification problems, and regret bounds can be given in terms of the VC dimension of the hypothesis space [2]. [6] present an approach, LinUCB, which assumes that payoffs for each action are linear combinations (with unknown coefficients) of context features. In [5], it is proven that a modified variant of LinUCB achieves sublinear contextual regret. Theirs is a special case of our setting (assuming a linear kernel for the contexts and a diagonal kernel for the actions). Another related approach is taken by Slivkins [21], who presents several algorithms with sublinear contextual regret for the case of infinite actions and contexts, assuming Lipschitz continuity of the payoff function in the context-action space.
In [22], this approach is generalized to select sets of actions, and applied to a problem of diverse retrieval in large document collections. However, in contrast to CGP-UCB, this approach does not enable stronger guarantees for smoother or more structured payoff functions.
The construction of composite kernels is common in the context of multitask learning with GPs [23, 24, 25]. Instead of considering a scalar GP with joint feature space f : S × Z → R, they consider a multioutput GP f_vec : S → R^Z, and introduce output correlations as linear combinations of latent channels or convolutions of GPs [25]. Our results are complementary to this line of work, as we can make use of such kernel functions for “multi-task Bayesian optimization”. Theorems 2 and 3 provide convenient ways of deriving regret bounds for such problems. There has been a significant amount of work on GP optimization and response surface methods [26]. For example, [27] consider sharing information across multiple sessions in a problem of parameter identification in animation design. We are not aware of theoretical convergence results in the case of context information, and our Theorem 1 provides the first general approach to obtain such rates.
8 Conclusions
We have described an algorithm, CGP-UCB, which addresses the exploration–exploitation tradeoff in a large class of contextual bandit problems, where the regularity of the payoff function defined over the action–context space is expressed in terms of a GP prior. As we discuss in Section 5, by considering various kernel functions on actions and contexts, this approach allows us to handle a variety of applications. We show that, similarly to the context-free case studied by [3], the key quantity governing the regret is the mutual information between the experiments performed by CGP-UCB and the GP prior (Theorem 1).
In contrast to prior work, however, our approach bounds the much stronger notion of contextual regret (competing with the optimal mapping from contexts to actions). We prove that in many practical settings, as discussed in Section 5, the contextual regret is sublinear. In addition, Theorems 2 and 3 provide tools to construct bounds on this information-theoretic quantity given corresponding bounds on the contexts and actions. We also demonstrate the effectiveness of CGP-UCB on two applications: computational vaccine design and sensor network management. In both applications, we show that utilizing context information in the joint covariance function reduces regret in comparison to ignoring or naively using the context.
Acknowledgments The authors wish to thank Christian Widmer for providing the MHC data, as well as Daniel Golovin and Aleksandrs Slivkins for helpful discussions. This research was partially supported by ONR grant N00014-09-1-1044, NSF grants CNS-0932392, IIS-0953413, DARPA MSEE grant FA8650-11-1-7156 and SNF grant 200021 137971.
References
[1] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3, 2002.
[2] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2008.
[3] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.
[4] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[5] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, 2011.
[6] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Interscience, 1991.
[8] G. Wahba.
Spline Models for Observational Data. SIAM, 1990.
[9] Naoki Abe, Alan W. Biermann, and Philip M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
[10] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic gait optimization with Gaussian process regression. In IJCAI, pages 944–949, 2007.
[11] C. Widmer, N. Toussaint, Y. Altun, and G. Rätsch. Inferring latent task structure for multitask learning by multiple kernel learning. BMC Bioinformatics, 11(Suppl 8):S5, 2010.
[12] B. Peters et al. A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Computational Biology, 2(6):e65, 2006.
[13] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6:4, 1985.
[14] V. Dani, T. P. Hayes, and S. Kakade. The price of bandit information for online optimization. In NIPS, 2007.
[15] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, pages 681–690, 2008.
[16] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In ECML, 2006.
[17] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, 2002.
[18] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
[19] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. Online optimization in X-armed bandits. In NIPS, 2008.
[20] S. Grünewälder, J-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In AISTATS, 2010.
[21] Aleksandrs Slivkins. Contextual bandits with similarity information. Technical Report 0907.3986, arXiv, 2009.
[22] Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. Learning optimally diverse rankings over large document collections.
In ICML, 2010.
[23] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML, 2005.
[24] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In NIPS, 2008.
[25] Mauricio A. Álvarez, David Luengo, Michalis K. Titsias, and Neil D. Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, 2010.
[26] E. Brochu, M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, UBC, 2009.
[27] Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In Eurographics, 2010.