{"title": "Batched Gaussian Process Bandit Optimization via Determinantal Point Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 4206, "page_last": 4214, "abstract": "Gaussian Process bandit optimization has emerged as a powerful tool for optimizing noisy black box functions. One example in machine learning is hyper-parameter optimization where each evaluation of the target function may require training a model which may involve days or even weeks of computation. Most methods for this so-called \u201cBayesian optimization\u201d only allow sequential exploration of the parameter space. However, it is often desirable to propose batches or sets of parameter values to explore simultaneously, especially when there are large parallel processing facilities at our disposal. Batch methods require modeling the interaction between the different evaluations in the batch, which can be expensive in complex scenarios. In this paper, we propose a new approach for parallelizing Bayesian optimization by modeling the diversity of a batch via Determinantal point processes (DPPs) whose kernels are learned automatically. This allows us to generalize a previous result as well as prove better regret bounds based on DPP sampling. Our experiments on a variety of synthetic and real-world robotics and hyper-parameter optimization tasks indicate that our DPP-based methods, especially those based on DPP sampling, outperform state-of-the-art methods.", "full_text": "Batched Gaussian Process Bandit Optimization via\n\nDeterminantal Point Processes\n\nTarun Kathuria, Amit Deshpande, Pushmeet Kohli\n\nMicrosoft Research\n\nt-takat@microsoft.com, amitdesh@microsoft.com, pkohli@microsoft.com\n\nAbstract\n\nGaussian Process bandit optimization has emerged as a powerful tool for optimizing\nnoisy black box functions. 
One example in machine learning is hyper-parameter\noptimization where each evaluation of the target function may require training\na model which may involve days or even weeks of computation. Most methods\nfor this so-called \u201cBayesian optimization\u201d only allow sequential exploration of\nthe parameter space. However, it is often desirable to propose batches or sets\nof parameter values to explore simultaneously, especially when there are large\nparallel processing facilities at our disposal. Batch methods require modeling the\ninteraction between the different evaluations in the batch, which can be expensive\nin complex scenarios. In this paper, we propose a new approach for parallelizing\nBayesian optimization by modeling the diversity of a batch via Determinantal\npoint processes (DPPs) whose kernels are learned automatically. This allows us\nto generalize a previous result as well as prove better regret bounds based on\nDPP sampling. Our experiments on a variety of synthetic and real-world robotics\nand hyper-parameter optimization tasks indicate that our DPP-based methods,\nespecially those based on DPP sampling, outperform state-of-the-art methods.\n\nIntroduction\n\n1\nThe optimization of an unknown function based on noisy observations is a fundamental problem\nin various real world domains, e.g., engineering design [33], \ufb01nance [36] and hyper-parameter\noptimization [29]. In recent years, an increasingly popular direction has been to model smoothness\nassumptions about the function via a Gaussian Process (GP), which provides an easy way to compute\nthe posterior distribution of the unknown function, and thereby uncertainty estimates that help to\ndecide where to evaluate the function next, in search of an optima. 
This Bayesian optimization (BO)\nframework has received considerable attention in tuning of hyper-parameters for complex models\nand algorithms in Machine Learning, Robotics and Computer Vision [16, 31, 29, 12].\nApart from a few notable exceptions [9, 8, 11], most methods for Bayesian optimization work by\nexploring one parameter value at a time. However, in many applications, it may be possible and,\nmoreover, desirable to run multiple function evaluations in parallel. A case in point is when the\nunderlying function corresponds to a laboratory experiment where multiple experimental setups are\navailable or when the underlying function is the result of a costly computer simulation and multiple\nsimulations can be run across different processors in parallel. By parallelizing the experiments,\nsubstantially more information can be gathered in the same time-frame; however, future actions must\nbe chosen without the bene\ufb01t of intermediate results. One might conceptualize these problems as\nchoosing \u201cbatches\u201d of experiments to run simultaneously. The key challenge is to assemble batches\n(out of a combinatorially large set of batches) of experiments that both explore the function and\nexploit by focusing on regions with high estimated value.\nOur Contributions Given that functions sampled from GPs usually have some degree of smoothness,\nin the so-called batch Bayesian optimization (BBO) methods, it is desirable to choose batches which\nare diverse. Indeed, this is the motivation behind many popular BBO methods like the BUCB [9],\nUCB-PE [8] and Local Penalization [11]. Motivated by this long line of work in BBO, we propose\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fa new approach that employs Determinantal Point Processes (DPPs) to select diverse batches of\nevaluations. 
DPPs are probability measures over subsets of a ground set that promote diversity, have\napplications in statistical physics and random matrix theory [28, 21], and have ef\ufb01cient sampling\nalgorithms [17, 18]. The two main ways for \ufb01xed cardinality subset selection via DPPs are that of\nchoosing the subset which maximizes the determinant [DPP-MAX, Theorem 3.3] and sampling a\nsubset according to the determinantal probability measure [DPP-SAMPLE, Theorem 3.4]. Following\nUCB-PE [8], our methods also choose the \ufb01rst point via an acquisition function, and then the rest\nof the points are selected from a relevance region using a DPP. Since DPPs crucially depend on the\nchoice of the DPP kernel, it is important to choose the right kernel. Our method allows the kernel\nto change across iterations and automatically compute it based on the observed data. This kernel\nis intimately linked to the GP kernel used to model the function; it is in fact exactly the posterior\nkernel function of the GP. The acquisition functions we consider are EST [34], a recently proposed\nsequential MAP-estimate based Bayesian optimization algorithm with regret bounds independent of\nthe size of the domain, and UCB [30]. In fact, we show that UCB-PE can be cast into our framework\nas just being DPP-MAX where the maximization is done via a greedy selection rule.\nGiven that DPP-MAX is too greedy, it may be desirable to allow for uncertainty in the observations.\nThus, we de\ufb01ne DPP-SAMPLE which selects the batches via sampling subsets from DPPs, and show\nthat the expected regret is smaller than that of DPP-MAX. To provide a fair comparison with an\nexisting method, BUCB, we also derive regret bounds for B-EST [Theorem 3.2]. Finally, for all\nmethods with known regret bounds, the key quantity is the information gain. 
In the appendix, we also provide a simpler proof of the information gain bound for the widely-used RBF kernel, which also improves the bound from $O((\log T)^{d+1})$ [26, 30] to $O((\log T)^d)$. We conclude with experiments on synthetic and real-world robotics and hyper-parameter optimization for extreme multi-label classification tasks, which demonstrate that our DPP-based methods, especially the sampling-based ones, are superior or competitive compared to the existing baselines.
Related Work One of the key tasks in black-box optimization is choosing actions that both explore the function and exploit our knowledge about likely high-reward regions in the function's domain. This exploration-exploitation trade-off becomes especially important when the function is expensive to evaluate, and it naturally leads to modeling the problem in the multi-armed bandit paradigm [25], where the goal is to maximize cumulative reward by optimally balancing the trade-off. Srinivas et al. [30] analyzed the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm, a simple and intuitive Bayesian method [3], to achieve the first sub-linear regret bounds for Gaussian process bandit optimization. These bounds, however, grow logarithmically in the size of the (finite) search space.
Recent work by Wang et al. [34] considered an intuitive MAP-estimate based strategy (EST), which involves estimating the maximum value of a function and choosing the point which has the maximum probability of achieving this maximum value. They derive regret bounds for this strategy and show that the bounds are actually independent of the size of the search space. The problem setting for both UCB and EST is that of optimizing a particular acquisition function. Other popular acquisition functions include expected improvement (EI) and probability of improvement over a certain threshold (PI). 
Along with these, there is also work on Entropy Search (ES) [13] and its variant, Predictive Entropy Search (PES) [14], which instead aim at minimizing the uncertainty about the location of the optimum of the function. All the aforementioned methods, though, are inherently sequential in nature.
BUCB and UCB-PE both depend on the crucial observation that the variance of the posterior distribution does not depend on the actual values of the function at the selected points. They exploit this fact by "hallucinating" the function values to be as predicted by the posterior mean. The BUCB algorithm chooses the batch by sequentially selecting the points with the maximum UCB score, keeping the mean function the same and only updating the variance. The problem with this naive approach is that it is too "overconfident" about the observations, which causes the confidence bounds on the function values to shrink very quickly as we go deeper into the batch. This is fixed by a careful initialization and by expanding the confidence bounds, which leads to regret bounds that are worse than those of UCB by some multiplicative factor (independent of $T$ and $B$). The UCB-PE algorithm chooses the first point of the batch via the UCB score, then defines a "relevance region" and selects the remaining points from this region greedily to maximize the information gain, in order to focus on pure exploration (PE). This algorithm does not require any initialization, unlike BUCB, and in fact achieves better regret bounds than BUCB.
Both BUCB and UCB-PE, however, are too greedy in their selection of batches, which may be far from optimal due to this immediate overconfidence in the hallucinated values. 
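As a minimal illustration of the hallucination rule just described, the following sketch (our own illustration over a discrete domain, not the authors' code) repeatedly maximizes the UCB score while updating only the posterior covariance:

```python
import numpy as np

def bucb_batch(mu, K, beta, B, sigma2=0.01):
    """Select a batch of B indices by the 'hallucination' rule sketched
    above: repeatedly take the arg-max of the UCB score, keep the
    posterior mean fixed, and update only the posterior covariance as if
    the chosen point had already been observed (illustrative sketch)."""
    chosen = []
    K = K.astype(float).copy()
    for _ in range(B):
        var = np.clip(np.diag(K), 0.0, None)
        ucb = mu + np.sqrt(beta * var)
        ucb[chosen] = -np.inf              # never pick the same point twice
        b = int(np.argmax(ucb))
        chosen.append(b)
        # Rank-one conditioning on a hallucinated observation at b:
        # k'(x, x') = k(x, x') - k(x, b) k(b, x') / (k(b, b) + sigma2)
        kb = K[:, b].copy()
        K -= np.outer(kb, kb) / (kb[b] + sigma2)
    return chosen
```

Note how deep-in-batch variances shrink even though no new function value has actually been seen, which is exactly the overconfidence discussed above.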
Indeed, this is the criticism leveled at these two methods by a recently proposed BBO strategy, PPES [27], which parallelizes predictive entropy search based methods and shows considerable improvements over the BUCB and UCB-PE methods. Another recently proposed method is Local Penalization (LP) [11], which assumes that the function is Lipschitz continuous and tries to estimate the Lipschitz constant. Since assumptions of Lipschitz continuity naturally allow one to place bounds on how far the optimum of $f$ is from a certain location, LP smoothly reduces the value of the acquisition function in a neighborhood of any point, reflecting the belief about the distance of this point to the maximum. However, Lipschitz assumptions are too coarse-grained, and it is unclear how the method used to estimate the Lipschitz constant and the modelling of local penalization affect the performance from a theoretical standpoint. Our algorithms, in contrast, are general and do not assume anything about the function other than it being drawn from a Gaussian Process.
2 Preliminaries
Gaussian Process Bandit Optimization We address the problem of finding, in the lowest possible number of iterations, the maximum $m$ of an unknown function $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$, i.e.,
$$m = f(x^*) = \max_{x \in \mathcal{X}} f(x).$$
We consider the domain to be discrete, as it is well known how to obtain regret bounds for continuous, compact domains via suitable discretizations [30]. At each iteration $t$, we choose a batch $\{x_{t,b}\}_{1 \le b \le B}$ of $B$ points and then simultaneously observe the noisy values taken by $f$ at these points, $y_{t,b} = f(x_{t,b}) + \epsilon_{t,b}$, where $\epsilon_{t,b}$ is i.i.d. Gaussian noise $\mathcal{N}(0, \sigma^2)$. The function is assumed to be drawn from a Gaussian process (GP), i.e., $f \sim GP(0, k)$, where $k : \mathcal{X}^2 \to \mathbb{R}_+$ is the kernel function. 
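As a concrete illustration of this observation model, the following sketch draws a function from a GP prior on a discrete domain, observes one noisy batch, and computes the posterior mean and variance via the standard GP regression formulas (illustrative code with made-up inputs and a hypothetical lengthscale, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, ell=0.2):
    """Squared-exponential kernel k(x, x') on 1-D inputs (hypothetical lengthscale)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

# Discrete domain X and a function drawn from GP(0, k) on X.
X = np.linspace(0.0, 1.0, 50)
K = rbf(X, X)
f = rng.multivariate_normal(np.zeros(len(X)), K)

# One batch of B = 3 noisy observations: y_{t,b} = f(x_{t,b}) + eps_{t,b}.
sigma2 = 0.01
obs = [5, 20, 40]
y = f[obs] + rng.normal(0.0, np.sqrt(sigma2), len(obs))

# Posterior: mu(x)    = k_t(x)^T (K_t + sigma^2 I)^{-1} y_t
#            k(x, x') = k(x, x') - k_t(x)^T (K_t + sigma^2 I)^{-1} k_t(x')
Kt = K[np.ix_(obs, obs)]
kt = K[:, obs]
A = np.linalg.solve(Kt + sigma2 * np.eye(len(obs)), np.eye(len(obs)))
mu = kt @ A @ y
var = np.diag(K - kt @ A @ kt.T)
```

The posterior variance collapses near observed points and stays at the prior level far away, which is what the confidence bounds below exploit.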
Given the observations $\mathcal{D}_t = \{(x_\tau, y_\tau)\}_{\tau=1}^t$ up to time $t$, we obtain the posterior mean and covariance functions [24] via the kernel matrix $K_t = [k(x_i, x_j)]_{x_i, x_j \in \mathcal{D}_t}$ and $k_t(x) = [k(x_i, x)]_{x_i \in \mathcal{D}_t}$: $\mu_t(x) = k_t(x)^\top (K_t + \sigma^2 I)^{-1} y_t$ and $k_t(x, x') = k(x, x') - k_t(x)^\top (K_t + \sigma^2 I)^{-1} k_t(x')$. The posterior variance is given by $\sigma_t^2(x) = k_t(x, x)$. Define the Upper Confidence Bound (UCB) $f^+$ and Lower Confidence Bound (LCB) $f^-$ as
$$f_t^+(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x), \qquad f_t^-(x) = \mu_{t-1}(x) - \beta_t^{1/2} \sigma_{t-1}(x).$$
A crucial observation made in BUCB [9] and UCB-PE [8] is that the posterior covariance and variance functions do not depend on the actual function values at the set of points. The EST algorithm in [34] chooses, at each timestep $t$, the point which has the maximum posterior probability of attaining the maximum value $m$, i.e., $\arg\max_{x \in \mathcal{X}} \Pr(M_x \mid m, \mathcal{D}_t)$, where $M_x$ is the event that point $x$ achieves the maximum value. This turns out to be equal to $\arg\min_{x \in \mathcal{X}} \big[(m - \mu_t(x))/\sigma_t(x)\big]$. Note that this actually depends on the value of $m$ which, in most cases, is unknown. [34] get around this by using an approximation $\hat{m}$ which, under certain conditions specified in their paper, is an upper bound on $m$. They provide two ways to obtain the estimate $\hat{m}$, namely ESTa and ESTn. We refer the reader to [34] for details of the two estimates and refer to ESTa as EST.
Assuming that the horizon $T$ is unknown, a strategy has to be good at any iteration. Let $r_{t,b}$ denote the simple regret, the difference between the value of the maximum and the value at the queried point $x_{t,b}$, i.e., $r_{t,b} = \max_{x \in \mathcal{X}} f(x) - f(x_{t,b})$. While UCB-PE aims at minimizing a batched cumulative regret, in this paper we will focus on the standard full cumulative regret, defined as $R_{TB} = \sum_{t=1}^T \sum_{b=1}^B r_{t,b}$. This models the case where all the queries in a batch should have low regret. The key quantity controlling the regret bounds of all known BO algorithms is the maximum mutual information that can be gained about $f$ from $T$ measurements: $\gamma_T = \max_{A \subseteq \mathcal{X}, |A| \le T} I(y_A; f_A) = \max_{A \subseteq \mathcal{X}, |A| \le T} \frac{1}{2} \log\det(I + \sigma^{-2} K_A)$, where $K_A$ is the (square) submatrix of $K$ formed by picking the row and column indices corresponding to the set $A$. The regret for both the UCB and the EST algorithms is presented in the following theorem, which is a combination of Theorem 1 in [30] and Theorem 3.1 in [34].
Theorem 2.1. Let $C = 2/\log(1 + \sigma^{-2})$ and fix $\delta > 0$. For UCB, choose $\beta_t = 2\log(|\mathcal{X}|t^2\pi^2/6\delta)$, and for EST, choose $\beta_t = \big(\min_{x \in \mathcal{X}} \frac{\hat{m} - \mu_{t-1}(x)}{\sigma_{t-1}(x)}\big)^2$ and $\zeta_t = 2\log(\pi^2 t^2/\delta)$. With probability $1 - \delta$, the cumulative regret $R_T = \sum_{t=1}^T r_t$ up to any time step $T$ can be bounded as
$$R_T \le \begin{cases} \sqrt{C T \beta_T \gamma_T} & \text{for UCB} \\ \sqrt{C T \gamma_T}\,\big(\beta_{t^*}^{1/2} + \zeta_T^{1/2}\big) & \text{for EST} \end{cases}$$
where $t^* = \arg\max_t \beta_t$.
Determinantal Point Processes Given a DPP kernel $K \in \mathbb{R}^{m \times m}$ over a ground set of $m$ elements {1, . . .
, m}, the $k$-DPP distribution is defined over the size-$k$ subsets of the ground set: it picks $B$, a $k$-subset of $[m]$, with probability proportional to $\det(K_B)$. Formally,
$$\Pr(B) = \frac{\det(K_B)}{\sum_{|S|=k} \det(K_S)}.$$
The problems of picking a set of size $k$ which maximizes the determinant and of sampling a set according to the $k$-DPP distribution have received considerable attention [22, 7, 6, 10, 1, 17]. The maximization problem is NP-hard in general and, furthermore, has a hardness-of-approximation lower bound of $1/c^k$ for some $c > 1$. The best known approximation algorithm is by [22], with a factor of $1/e^k$, which almost matches the lower bound. Their algorithm, however, is a complicated and expensive convex program. A simple greedy algorithm, on the other hand, gives a $1/2^{k \log k}$-approximation. For sampling from $k$-DPPs, an exact sampling algorithm exists due to [10]. This, however, does not scale to large datasets. A recently proposed alternative is an MCMC-based method by [1], which is much faster.
Algorithm 1 GP-BUCB/B-EST Algorithm
Input: Decision set $\mathcal{X}$, GP prior $\mu_0$, $\sigma_0$, kernel function $k(\cdot,\cdot)$, feedback mapping $fb[\cdot]$
for $t = 1$ to $TB$ do
  Choose $\beta_t^{1/2} = C'\big[2\log(|\mathcal{X}|\pi^2 t^2/6\delta)\big]^{1/2}$ for BUCB, and $\beta_t^{1/2} = C'\big[\min_{x \in \mathcal{X}} (\hat{m} - \mu_{fb[t]}(x))/\sigma_{t-1}(x)\big]$ for B-EST
  Choose $x_t = \arg\max_{x \in \mathcal{X}} \big[\mu_{fb[t]}(x) + \beta_t^{1/2}\sigma_{t-1}(x)\big]$ and compute $\sigma_t(\cdot)$
  if $fb[t] < fb[t+1]$ then
    Obtain $y_{t'} = f(x_{t'}) + \epsilon_{t'}$ for $t' \in \{fb[t]+1, \ldots, fb[t+1]\}$ and compute $\mu_{fb[t+1]}(\cdot)$
  end if
end for
return $\arg\max_{t=1,\ldots,TB} y_t$
3 Main Results
In this section, we present our DPP-based algorithms. For a fair comparison of the various methods, we first prove the regret bounds of the EST version of BUCB, i.e., B-EST. 
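The greedy heuristic mentioned above (used later in the paper as the procedure kDPPMaxGreedy) has a compact implementation via rank-one Schur updates; the following is our sketch of the generic greedy rule, not the approximation algorithm of [22]:

```python
import numpy as np

def kdpp_max_greedy(K, k):
    """Greedily pick k indices approximately maximizing det(K_S).
    Each step adds the item with the largest conditional variance
    (its Schur complement given the current picks), which is exactly
    the factor by which the running determinant is multiplied.
    Illustrative sketch of the greedy rule discussed in the text."""
    K = K.astype(float).copy()
    S = []
    for _ in range(k):
        cond = np.diag(K).copy()   # conditional variances given current picks
        cond[S] = -np.inf
        j = int(np.argmax(cond))
        S.append(j)
        # Condition the kernel on item j (rank-one Schur update).
        kj = K[:, j].copy()
        K -= np.outer(kj, kj) / kj[j]
    return S
```

On a kernel built from two nearly identical points and one distant point, the greedy rule picks the two dissimilar items, illustrating the diversity-promoting behavior of determinant maximization.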
We then show the equivalence between UCB-PE and UCB-DPP maximization, along with regret bounds for the EST version of PE/DPP-MAX. We then present the DPP sampling (DPP-SAMPLE) based methods for UCB and EST and provide regret bounds. In Appendix 4, borrowing ideas from [26], we provide a simpler proof with improved bounds on the maximum information gain for the RBF kernel.
3.1 The Batched-EST algorithm
BUCB has a feedback mapping $fb$ which indicates, at any given time $t$ (in this case only, we mean a total of $TB$ timesteps), the iteration up to which the actual function values are available. In the batched setting, this is just $\lfloor (t-1)/B \rfloor B$. BUCB and its EST variant, B-EST, are presented in Algorithm 1. The algorithm mainly comes from the observation made in [34] that the point chosen by EST is the same as that of a variant of UCB. This is presented in the following lemma.
Lemma 3.1. (Lemma 2.1 in [34]) At any timestep $t$, the point selected by EST is the same as the point selected by a variant of UCB with $\beta_t^{1/2} = \min_{x \in \mathcal{X}} (\hat{m} - \mu_{t-1}(x))/\sigma_{t-1}(x)$.
This suffices to obtain B-EST as well, by simply running BUCB with the $\beta_t$ defined in Lemma 3.1; this is also provided in Algorithm 1. In the algorithm, $C'$ is chosen to be $\exp(2C)$, where $C$ is an upper bound on the maximum conditional mutual information $I(f(x); y_{fb[t]+1:t-1} \mid y_{1:fb[t]})$ (refer to [9] for details). The problem with naively using this algorithm is that the value of $C'$, and correspondingly the regret bounds, usually has at least linear growth in $B$. This is corrected in [9] by the two-stage BUCB, which first chooses an initial batch of size $T^{init}$ by greedily choosing points based on the (updated) posterior variances. The values are then obtained, and the posterior GP is calculated and used as the prior GP in Algorithm 1. The $C'$ value can then be chosen independent of $B$. We refer the reader to Table 1 in [9] for values of $C'$ and $T^{init}$ for common kernels. Finally, the regret bounds of B-EST are presented in the next theorem.
Theorem 3.2. Choose $\alpha_t = \big(\min_{x \in \mathcal{X}} \frac{\hat{m} - \mu_{fb[t]}(x)}{\sigma_{t-1}(x)}\big)^2$ and $\beta_t = (C')^2\alpha_t$, let $B \ge 2$, $\delta > 0$, and choose the $C'$ and $T^{init}$ values according to Table 1 in [9]. At any timestep $T$, let $R_T$ be the cumulative regret of the two-stage initialized B-EST algorithm. Then
$$\Pr\big\{R_T \le C' R_T^{seq} + 2\|f\|_\infty T^{init}, \; \forall T \ge 1\big\} \ge 1 - \delta.$$
Proof. The proof is presented in Appendix 1.
Algorithm 2 GP-(UCB/EST)-DPP-(MAX/SAMPLE) Algorithm
Input: Decision set $\mathcal{X}$, GP prior $\mu_0$, $\sigma_0$, kernel function $k(\cdot,\cdot)$
for $t = 1$ to $T$ do
  Compute $\mu_{t-1}$ and $\sigma_{t-1}$ according to Bayesian inference
  Choose $\beta_t^{1/2} = \big[2\log(|\mathcal{X}|\pi^2 t^2/6\delta)\big]^{1/2}$ for UCB, and $\beta_t^{1/2} = \min_{x \in \mathcal{X}} (\hat{m} - \mu_{t-1}(x))/\sigma_{t-1}(x)$ for EST
  $x_{t,1} \leftarrow \arg\max_{x \in \mathcal{X}} \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)$
  Compute $R_t^+$ and construct the DPP kernel $K_{t,1}$
  $\{x_{t,b}\}_{b=2}^B \leftarrow$ kDPPMaxGreedy($K_{t,1}$, $B-1$) for DPP-MAX, or kDPPSample($K_{t,1}$, $B-1$) for DPP-SAMPLE
  Obtain $y_{t,b} = f(x_{t,b}) + \epsilon_{t,b}$ for $b = 1, \ldots, B$
end for
3.2 Equivalence of Pure Exploration (PE) and DPP Maximization
We now present the equivalence between Pure Exploration and a procedure which involves DPP maximization based on the greedy algorithm. 
For the next two sections, by an iteration we mean all $B$ points selected in that iteration; thus $\mu_{t-1}$ and $k_{t-1}$ are computed using the $(t-1)B$ observations available to us. We first describe a generic framework for BBO inspired by UCB-PE: at any iteration, the first point is chosen by maximizing UCB or EST, which can be seen as a variant of UCB as per Lemma 3.1. A relevance region $R_t^+$ is then defined which contains $\arg\max_{x \in \mathcal{X}} f_{t+1}^+(x)$ with high probability. Let $y_t^\bullet = f_t^-(x_t^\bullet)$, where $x_t^\bullet = \arg\max_{x \in \mathcal{X}} f_t^-(x)$. The relevance region is formally defined as $R_t^+ = \{x \in \mathcal{X} \mid \mu_{t-1}(x) + 2\sqrt{\beta_{t+1}}\,\sigma_{t-1}(x) \ge y_t^\bullet\}$. The intuition for considering this region is that using $R_t^+$ guarantees that the queries at iteration $t$ will leave an impact on the future choices at iteration $t+1$. The next $B-1$ points for the batch are then chosen from $R_t^+$ according to some rule. In the special case of UCB-PE, the $B-1$ points are selected greedily from $R_t^+$ by maximizing the (updated) posterior variance, while keeping the mean function the same. Now, at the $t$-th iteration, consider the posterior kernel function after $x_{t,1}$ has been chosen (say $k_{t,1}$), and consider the kernel matrix $K_{t,1} = I + \sigma^{-2}[k_{t,1}(p_i, p_j)]_{i,j}$ over the points $p_i \in R_t^+$. We will consider this as our DPP kernel at iteration $t$. Two possible ways of choosing $B-1$ points via this DPP kernel are to either choose the subset of size $B-1$ of maximum determinant (DPP-MAX) or to sample a set from a $(B-1)$-DPP with this kernel (DPP-SAMPLE). In this subsection, we focus on the maximization problem. The proof of the regret bounds of UCB-PE goes through a few steps, but in one of the intermediate steps (Lemma 5 of [8]) it is shown that the sum of regrets over a batch at iteration $t$ is upper bounded, up to the confidence-width factors, by $\sum_{b=1}^B (\sigma_{t,b}(x_{t,b}))^2$, which satisfies
$$\sum_{b=1}^B (\sigma_{t,b}(x_{t,b}))^2 \le \sum_{b=1}^B C_2\sigma^2 \log\big(1 + \sigma^{-2}\sigma_{t,b}^2(x_{t,b})\big) = C_2\sigma^2 \log\Big[\prod_{b=1}^B \big(1 + \sigma^{-2}\sigma_{t,b}^2(x_{t,b})\big)\Big] = C_2\sigma^2 \Big[\log\big(1 + \sigma^{-2}\sigma_{t,1}^2(x_{t,1})\big) + \log\det\big((K_{t,1})_S\big)\Big],$$
where $C_2 = \sigma^{-2}/\log(1 + \sigma^{-2})$. From the final log-product term, it can be seen (from Schur's determinant identity [5] and the definition of $\sigma_{t,b}(x_{t,b})$) that the product of the last $B-1$ terms is exactly the $(B-1)$-principal minor of $K_{t,1}$ formed by the indices corresponding to $S = \{x_{t,b}\}_{b=2}^B$. Thus, it is straightforward to see that the UCB-PE algorithm is really just $(B-1)$-DPP maximization via the greedy algorithm. This connection will also be useful in the next subsection for DPP-SAMPLE. Finally, for EST-PE, the proof proceeds as in the B-EST case, by realising that EST is just UCB with an adaptive $\beta_t$. The final algorithm (along with its sampling counterpart; details in the next subsection) is presented in Algorithm 2. The procedure kDPPMaxGreedy($K$, $k$) picks a principal submatrix of $K$ of size $k$ via the greedy algorithm. Finally, we have the theorem for the regret bounds of (UCB/EST)-DPP-MAX.
Theorem 3.3. At iteration $t$, let $\beta_t = 2\log(|\mathcal{X}|\pi^2 t^2/6\delta)$ for UCB, let $\beta_t = \big(\min_{x \in \mathcal{X}} \frac{\hat{m} - \mu_{t-1}(x)}{\sigma_{t-1}(x)}\big)^2$ and $\zeta_t = 2\log(\pi^2 t^2/3\delta)$ for EST, let $C_1 = 36/\log(1 + \sigma^{-2})$, and fix $\delta > 0$. Then, with probability $\ge 1 - \delta$, the full cumulative regret $R_{TB}$ incurred by UCB-DPP-MAX satisfies $R_{TB} \le \sqrt{C_1 T B \beta_T \gamma_{TB}}$, and that of EST-DPP-MAX satisfies $R_{TB} \le \sqrt{C_1 T B \gamma_{TB}}\,\big(\beta_{t^*}^{1/2} + \zeta_T^{1/2}\big)$.
Proof. The proof is provided in Appendix 2. 
It should be noted that the term inside the logarithm in $\zeta_t$ has been multiplied by 2 compared to the sequential EST, which has a union bound over just one point, $x_t$. This happens because we need a union bound not just over $x_{t,b}$ but also over $x_t^\bullet$.
Figure 1: Immediate regret of the algorithms on two synthetic functions with B = 5 and 10
3.3 Batch Bayesian Optimization via DPP Sampling
In the previous subsection, we looked at the regret bounds achieved by DPP maximization. One natural question to ask is whether the other subset selection method via DPPs, namely DPP sampling, gives us equivalent or better regret bounds. Note that in this case, the regret has to be defined as expected regret. The belief that it might is well-founded: sampling from $k$-DPPs indeed gives better results, in both theory and practice, for low-rank matrix approximation [10] and for exemplar selection in Nystrom methods [19]. Keeping in line with the framework described in the previous subsection, the subset to be selected has to be of size $B-1$, and the kernel should be $K_{t,1}$ at any iteration $t$. Instead of maximizing, we can choose to sample from a $(B-1)$-DPP. The algorithm is described in Algorithm 2. The procedure kDPPSample($K$, $k$) denotes sampling a set from the $k$-DPP distribution with kernel $K$. The question, then, is what the expected regret of this procedure is. In this subsection, we show that the expected regret bounds of DPP-SAMPLE are smaller than the regret bounds of DPP-MAX, and we give a quantitative bound on this regret based on the entropy of DPPs. By the entropy of a $k$-DPP with kernel $K$, $H(k\text{-DPP}(K))$, we simply mean the standard definition of entropy for a discrete distribution. Note that the entropy is always non-negative in this case. Please see Appendix 3 for details. 
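For intuition, on a tiny ground set this entropy can be computed by brute-force enumeration of all $k$-subsets (a toy sketch of the definition only; the enumeration is exponential in the ground-set size):

```python
import itertools
import numpy as np

def kdpp_entropy(K, k):
    """Entropy of the k-DPP with kernel K, computed by enumerating all
    k-subsets S: Pr(S) is proportional to det(K_S), and
    H = -sum_S Pr(S) log Pr(S). Brute force, for tiny ground sets only."""
    n = K.shape[0]
    dets = np.array([np.linalg.det(K[np.ix_(S, S)])
                     for S in itertools.combinations(range(n), k)])
    p = dets / dets.sum()
    return float(-(p * np.log(p)).sum())
```

For the identity kernel every $k$-subset is equally likely, so the entropy is the log of the number of subsets, its maximum possible value.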
For brevity, since we always choose $B-1$ elements from the DPP, we write $H(DPP(K))$ for the entropy of the $(B-1)$-DPP with kernel $K$.
Theorem 3.4. The regret bounds of DPP-SAMPLE are smaller than those of DPP-MAX. Furthermore, at iteration $t$, let $\beta_t = 2\log(|\mathcal{X}|\pi^2 t^2/6\delta)$ for UCB, let $\beta_t = \big(\min_{x \in \mathcal{X}} \frac{\hat{m} - \mu_{t-1}(x)}{\sigma_{t-1}(x)}\big)^2$ and $\zeta_t = 2\log(\pi^2 t^2/3\delta)$ for EST, let $C_1 = 36/\log(1 + \sigma^{-2})$, and fix $\delta > 0$. Then the expected full cumulative regret of UCB-DPP-SAMPLE satisfies
$$R_{TB}^2 \le 2 T B C_1 \beta_T \Big[\gamma_{TB} - \sum_{t=1}^T H(DPP(K_{t,1})) + B \log|\mathcal{X}|\Big]$$
and that of EST-DPP-SAMPLE satisfies
$$R_{TB}^2 \le 2 T B C_1 \big(\beta_{t^*}^{1/2} + \zeta_T^{1/2}\big)^2 \Big[\gamma_{TB} - \sum_{t=1}^T H(DPP(K_{t,1})) + B \log|\mathcal{X}|\Big].$$
Proof. The proof is provided in Appendix 3.
Note that the regret bounds for both DPP-MAX and DPP-SAMPLE are better than those of BUCB/B-EST, since the latter have both an additional factor of $B$ in the log term and a regret multiplier constant $C'$. In fact, for the RBF kernel, $C'$ grows like $e^{d^d}$, which is quite large even for moderate values of $d$.
4 Experiments
In this section, we study the performance of the DPP-based algorithms, especially DPP-SAMPLE, against existing baselines. In particular, the methods we consider are BUCB [9], B-EST, UCB-PE/UCB-DPP-MAX [8], EST-PE/EST-DPP-MAX, UCB-DPP-SAMPLE, EST-DPP-SAMPLE and UCB with Local Penalization (LP-UCB) [11]. We used the publicly available code for BUCB and PE1. The code was modified to include the EST counterparts using the code for EST 2. For LP-UCB, we use the publicly available GPyOpt codebase 3, and we implemented the MCMC algorithm of [1] for $k$-DPP sampling with $\epsilon = 0.01$ as the variation distance error. 
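The swap-chain MCMC idea behind such samplers can be condensed as follows (our simplification in the spirit of [1]; the step count is an illustrative parameter, and no mixing-time guarantee is enforced):

```python
import numpy as np

def kdpp_sample_mcmc(K, k, steps=2000, rng=None):
    """Approximately sample from a k-DPP via a Metropolis swap chain:
    propose exchanging one chosen item for one unchosen item and accept
    with probability min(1, det(K_new) / det(K_old)). Sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    n = K.shape[0]
    S = list(rng.choice(n, size=k, replace=False))
    det_S = np.linalg.det(K[np.ix_(S, S)])
    for _ in range(steps):
        i = int(rng.integers(k))                         # position to swap out
        j = int(rng.choice([x for x in range(n) if x not in S]))
        T = S.copy()
        T[i] = j
        det_T = np.linalg.det(K[np.ix_(T, T)])
        # Accept diversity-increasing swaps always; others proportionally.
        if det_S <= 0 or rng.random() < min(1.0, det_T / det_S):
            S, det_S = T, det_T
    return sorted(S)
```

In practice one would bound the number of steps using the variation-distance analysis of the chain; here the budget is simply fixed up front.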
We were unable to compare against PPES as its code was not publicly available. Furthermore, as shown in the experiments in [27], PPES is very slow and does not scale beyond batch sizes of 4-5. Since UCB-PE almost always performs better than the simulation matching algorithm of [4] in all experiments that we could find in previous papers [27, 8], we forgo a comparison against simulation matching as well, to avoid clutter in the graphs. The performance is measured after $t$ batch evaluations using the immediate regret, $r_t = |f(\tilde{x}_t) - f(x^*)|$, where $x^*$ is a known optimizer of $f$ and $\tilde{x}_t$ is the recommendation of an algorithm after $t$ batch evaluations. We perform 50 experiments for each objective function and report the median of the immediate regret obtained for each algorithm. To maintain consistency, the first point of all methods is chosen to be the same (random) one. The mean function of the prior GP was the zero function, while the kernel function was the squared-exponential kernel of the form $k(x, y) = \gamma^2 \exp\big[-0.5 \sum_d (x_d - y_d)^2 / l_d^2\big]$. The hyper-parameter $\lambda$ was picked from a broad Gaussian hyperprior, and the other hyper-parameters were chosen from uninformative Gamma priors.
Our first set of experiments is on a set of synthetic benchmark objective functions, including Branin-Hoo [20], a mixture of cosines [2] and the Hartmann-6 function [20]. We choose batches of size 5 and 10. Due to lack of space, the results for the mixture of cosines are provided in Appendix 5, while the results for the other two are shown in Figure 1. The results suggest that the DPP-SAMPLE based methods are superior to the other methods: they do much better than their DPP-MAX and Batched counterparts. The trends displayed with regard to LP are more interesting. 
For Branin-Hoo, LP-UCB starts out worse than the DPP-based algorithms but overtakes DPP-MAX relatively quickly and approaches the performance of DPP-SAMPLE when the batch size is 5. When the batch size is 10, the performance of LP-UCB does not improve much, but both DPP-MAX and DPP-SAMPLE perform better. For Hartmann, LP-UCB outperforms both DPP-MAX algorithms by a considerable margin, while the DPP-SAMPLE based methods perform better than LP-UCB; the gap, however, is larger for the batch size of 10. Again, the performance of LP-UCB changes much less than that of the DPP-based algorithms improves. This is likely because the batches chosen by the DPP-based methods are more "globally diverse" for larger batch sizes. The superior performance of the sampling-based methods can be attributed to allowing for uncertainty in the observations by sampling, as opposed to greedily maximizing the information gain.
We now consider maximization of real-world objective functions. The first function we consider, robot, returns the walking speed of a bipedal robot [35]. The function's input parameters, which live in $[0, 1]^8$, are the robot's controller. We add Gaussian noise with $\sigma = 0.1$ to the noiseless function. The second function, Abalone4, is a test function used in [8]. The challenge of the dataset is to predict the age of a species of sea snails from physical measurements. Similar to [8], we use it as a maximization problem. Our final experiment is on hyper-parameter tuning for extreme multi-label learning. In extreme classification, one needs to deal with multi-class and multi-label problems involving a very large number of categories. Due to the prohibitively large number of categories, running traditional machine learning algorithms is not feasible. A recent popular approach for extreme classification is the FastXML algorithm [23]. 
The main advantage of FastXML is that it maintains high accuracy while training in a fraction of the time compared to the previous state-of-the-art. The FastXML algorithm has 5 hyper-parameters, and its performance depends on them to a reasonable degree. Our task is to optimize these 5 hyper-parameters with the aim of maximizing Precision@k for k = 1, which is the metric used in [23] to evaluate the performance of FastXML against other algorithms. While the authors of [23] run extensive tests on a variety of datasets, we focus on two small datasets: Bibtex [15] and Delicious [32]. As before, we use batch sizes of 5 and 10. The results for Abalone and the FastXML experiment on Delicious are provided in the appendix. The results for Prec@1 for FastXML on the Bibtex dataset and for the robot experiment are provided in Figure 2. The blue horizontal line for the FastXML results indicates the maximum Prec@k value found using grid search.

1 http://econtal.perso.math.cnrs.fr/software/
2 https://github.com/zi-w/EST
3 http://sheffieldml.github.io/GPyOpt/
4 The Abalone dataset is provided by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Abalone

Figure 2: Immediate regret of the algorithms for Prec@1 for FastXML on Bibtex and Robot with B = 5 and 10

The results for robot indicate that while the DPP-MAX methods do better than their Batched counterparts, the difference in performance between DPP-MAX and DPP-SAMPLE is much less pronounced for a small batch size of 5 but is considerable for a batch size of 10. This is in line with our intuition that sampling is more beneficial for larger batch sizes. The performance of LP-UCB is quite close to, and slightly better than, UCB-DPP-SAMPLE.
This might be because the underlying function is well-behaved (Lipschitz continuous), so the estimate of the Lipschitz constant is likely more accurate, which helps LP-UCB obtain better results. This improvement is more pronounced for a batch size of 10 as well. For Abalone (see Appendix 5), LP does better than DPP-MAX, but there is a reasonable gap between DPP-SAMPLE and LP, which is more pronounced for B = 10.
The results for Prec@1 on the Bibtex dataset for FastXML are more interesting. Both DPP-based methods are much better than their Batched counterparts. For B = 5, DPP-SAMPLE is only slightly better than DPP-MAX. LP-UCB starts out worse than DPP-MAX but becomes comparable to it after a few iterations. For B = 10, the gap between DPP-MAX and DPP-SAMPLE does not grow by much. LP-UCB, however, quickly overtakes UCB-DPP-MAX and comes quite close to the performance of DPP-SAMPLE after a few iterations. For the Delicious dataset (see Appendix 5), we see a similar trend: the improvement from sampling is larger for larger batch sizes. LP-UCB displays an interesting trend in this experiment, doing much better than UCB-DPP-MAX for B = 5 and in fact coming quite close to the performance of DPP-SAMPLE. For B = 10, however, its performance is much closer to UCB-DPP-MAX. DPP-SAMPLE loses to LP-UCB only on the robot dataset and does better on all the other datasets. Furthermore, this improvement seems more pronounced for larger batch sizes. We leave experiments with other kernels and a more thorough experimental evaluation with respect to batch sizes for future work.

5 Conclusion

We have proposed a new method for batched Gaussian Process bandit (batch Bayesian) optimization based on DPPs, which are desirable in this setting as they promote diversity within batches. The DPP kernel is determined automatically on the fly, which allows us to prove regret bounds for both DPP maximization and DPP sampling based methods for this problem.
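To make the contrast between the two batch-selection strategies concrete, the following sketch selects a batch S of k points from a candidate kernel matrix K either by greedy determinant maximization or by sampling S with probability proportional to det(K_S). This is an illustration only: the function names are ours, K must be a positive semi-definite kernel matrix over the candidate points, and the brute-force sampler (which enumerates all size-k subsets) is feasible only for tiny candidate sets, unlike the efficient samplers used in practice.

```python
import itertools
import numpy as np

def greedy_dpp_max(K, k):
    """Greedily build a batch by repeatedly adding the candidate that most
    increases det(K_S), i.e. approximate DPP (determinant) maximization."""
    n = K.shape[0]
    S = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(n):
            if i in S:
                continue
            idx = S + [i]
            d = np.linalg.det(K[np.ix_(idx, idx)])
            if d > best_det:
                best, best_det = i, d
        S.append(best)
    return sorted(S)

def kdpp_sample(K, k, rng):
    """Draw a batch from a k-DPP, P(S) proportional to det(K_S), by brute-force
    enumeration of all size-k subsets (illustration only; exponential in n)."""
    subsets = list(itertools.combinations(range(K.shape[0]), k))
    weights = np.array([np.linalg.det(K[np.ix_(s, s)]) for s in subsets])
    probs = weights / weights.sum()
    return list(subsets[rng.choice(len(subsets), p=probs)])
```

Greedy maximization always commits to the single highest-determinant batch, whereas sampling spreads probability mass over many diverse batches, which is the source of the extra exploration discussed above.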
We show that this framework exactly recovers a popular algorithm for batch Bayesian optimization, namely UCB-PE, when we consider DPP maximization using the greedy algorithm. We show that the regret of the sampling-based method is always less than that of the maximization-based method. We also derive EST counterparts of both methods and provide a simpler proof of the information gain bound for RBF kernels, which leads to a slight improvement over the best known bound. Our experiments on a variety of synthetic and real-world tasks validate our theoretical claims that sampling performs better than maximization and other methods.

References

[1] N. Anari, S.O. Gharan, and A. Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. COLT, 2016.

[2] B.S. Anderson, A.W. Moore, and D. Cohn. A nonparametric approach to noisy and costly optimization. ICML, 2000.

[3] P. Auer. Using confidence bounds for exploration-exploitation trade-offs. JMLR, 3:397–422, 2002.

[4] J. Azimi, A. Fern, and X. Fern. Batch bayesian optimization via simulation matching. NIPS, 2010.

[5] R. Brualdi and H. Schneider. Determinantal identities: Gauss, Schur, Cauchy, Sylvester, Kronecker, Jacobi, Binet, Laplace, Muir, and Cayley. Linear Algebra and its Applications, 1983.

[6] A. Çivril and M. Magdon-Ismail. On selecting a maximum volume sub-matrix of a matrix and related problems. Theor. Comput. Sci., 410(47-49):4801–4811, 2009.

[7] A. Çivril and M. Magdon-Ismail. Exponential inapproximability of selecting a maximum volume sub-matrix. Algorithmica, 65(1):159–176, 2013.

[8] E. Contal, D. Buffoni, D. Robicquet, and N. Vayatis. Parallel gaussian process optimization with upper confidence bound and pure exploration. ECML, 2013.

[9] T. Desautels, A. Krause, and J.W. Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.
JMLR, 15:4053–4103, 2014.

[10] A. Deshpande and L. Rademacher. Efficient volume sampling for row/column subset selection. FOCS, 2010.

[11] J. González, Z. Dai, P. Hennig, and N. Lawrence. Batch bayesian optimization via local penalization. AISTATS, 2016.

[12] J. González, M. A. Osborne, and N. D. Lawrence. GLASSES: relieving the myopia of bayesian optimisation. AISTATS, 2016.

[13] P. Hennig and C. Schuler. Entropy search for information-efficient global optimization. JMLR, 13, 2012.

[14] J.M. Hernández-Lobato, M.W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. NIPS, 2014.

[15] I. Katakis, G. Tsoumakas, and I. Vlahavas. Multilabel text classification for automated tag suggestion. ECML/PKDD Discovery Challenge, 2008.

[16] A. Krause and C. S. Ong. Contextual gaussian process bandit optimization. NIPS, 2011.

[17] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. ICML, 2011.

[18] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Found. Trends Mach. Learn., 5(2-3):123–286, 2012.

[19] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. ICML, 2016.

[20] D. Lizotte. Practical bayesian optimization. PhD thesis, University of Alberta, 2008.

[21] R. Lyons. Determinantal probability measures. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 98(1):167–212, 2003.

[22] A. Nikolov. Randomized rounding for the largest simplex problem. STOC, pages 861–870, 2015.

[23] Y. Prabhu and M. Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. KDD, 2014.

[24] C. Rasmussen and C. Williams. Gaussian processes for machine learning. MIT Press, 2008.

[25] H. Robbins. Some aspects of the sequential design of experiments.
Bull. Am. Math. Soc., 1952.

[26] M. W. Seeger, S. M. Kakade, and D. P. Foster. Information consistency of nonparametric gaussian process methods. IEEE Transactions on Information Theory, 54(5):2376–2382, 2008.

[27] A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. NIPS, 2015.

[28] T. Shirai and Y. Takahashi. Random point fields associated with certain Fredholm determinants I: fermion, Poisson and boson point processes. Journal of Functional Analysis, 205(2):414–463, 2003.

[29] J. Snoek, H. Larochelle, and R.P. Adams. Practical bayesian optimization of machine learning algorithms. NIPS, 2012.

[30] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.

[31] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: combined selection and hyper-parameter optimization of classification algorithms. KDD, 2013.

[32] G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. ECML/PKDD 2008 Workshop on Mining Multidimensional Data, 2008.

[33] G. Wang and S. Shan. Review of metamodeling techniques in support of engineering design optimization. Journal of Mechanical Design, 129:370–380, 2007.

[34] Z. Wang, B. Zhou, and S. Jegelka. Optimization as estimation with gaussian processes in bandit settings. AISTATS, 2016.

[35] E. Westervelt and J. Grizzle. Feedback control of dynamic bipedal robot locomotion. Control and Automation Series, 2007.

[36] W. Ziemba and R. Vickson. Stochastic optimization models in finance.
World Scientific, Singapore, 2006.