{"title": "High-Dimensional Gaussian Process Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1025, "page_last": 1033, "abstract": "Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducible Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration\u2013exploitation tradeoff by allocating sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios.", "full_text": "High-Dimensional Gaussian Process Bandits\n\nJosip Djolonga\nETH Z\u00a8urich\n\njosipd@ethz.ch\n\nAndreas Krause\n\nETH Z\u00a8urich\n\nkrausea@ethz.ch\n\nVolkan Cevher\n\nEPFL\n\nvolkan.cevher@epfl.ch\n\nAbstract\n\nMany applications in machine learning require optimizing unknown functions\nde\ufb01ned over a high-dimensional space from noisy samples that are expensive to\nobtain. We address this notoriously hard challenge, under the assumptions that\nthe function varies only along some low-dimensional subspace and is smooth\n(i.e., it has a low norm in a Reproducible Kernel Hilbert Space). 
In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios.

1 Introduction

The optimization of non-linear functions whose evaluation may be noisy and expensive is a challenge that has important applications in sciences and engineering. One approach to this notoriously hard problem takes a Bayesian perspective, which uses the predictive uncertainty in order to trade exploration (gathering data for reducing model uncertainty) and exploitation (focusing sampling near likely optima), and is often called Bayesian Optimization (BO). Modern BO algorithms are quite successful, surpassing even human experts in learning tasks: e.g., gait control for the SONY AIBO, convolutional neural networks, structural SVMs, and Latent Dirichlet Allocation [1, 2, 3].
Unfortunately, the theoretical efficiency of these methods depends exponentially on the—often high—dimension of the domain over which the function is defined. A way to circumvent this “curse of dimensionality” is to make the assumption that only a small number of the dimensions actually matter. For example, the cost function of neural networks effectively varies only along a few dimensions [2]. 
This idea has also been at the root of nonparametric regression approaches [4, 5, 6, 7].
To this end, we propose an algorithm that learns a low-dimensional, not necessarily axis-aligned, subspace and then applies Bayesian optimization on this estimated subspace. In particular, our SI-BO approach combines low-rank matrix recovery with Gaussian Process Upper Confidence Bound sampling in a carefully calibrated manner. We theoretically analyze its performance, and prove bounds on its cumulative regret. To the best of our knowledge, we prove the first subexponential bounds for Bayesian optimization in high dimensions under noisy observations. In contrast to existing approaches, which have an exponential dependence on the ambient dimension, our bounds have in fact polynomial dependence on the dimension. Moreover, our performance guarantees depend explicitly on what we could have achieved if we had known the subspace in advance.
Previous work. Exploration–exploitation tradeoffs were originally studied in the context of finite multi-armed bandits [8]. Since then, results have been obtained for continuous domains, starting with the linear [9] and Lipschitz-continuous cases [10, 11]. A more recent algorithm that enjoys theoretical bounds for functions sampled from a Gaussian Process (GP), or belonging to some Reproducing Kernel Hilbert Space (RKHS), is GP-UCB [12]. 
The use of GPs to negotiate exploration–exploitation tradeoffs originated in the areas of response surface and Bayesian optimization, for which there are a number of approaches (cf. [13]), perhaps most notably the Expected Improvement approach [14], which has recently received theoretical justification [15], albeit only in the noise-free setting.
Bandit algorithms that exploit low-dimensional structure of the function appeared first for the linear setting, where under sparsity assumptions one can obtain bounds that depend only weakly on the ambient dimension [16, 17]. In [18] the more general case of functions sampled from a GP under the same sparsity assumptions was considered. The idea of applying random projections in BO was recently introduced [19]. They provide bounds on the simple regret under noiseless observations, while we also analyze the cumulative regret and allow noisy observations. Also, unless the low-dimensional space is of dimension 1, our bounds on the simple regret improve on theirs. In [7] the authors approximate functions that live on low-dimensional subspaces using low-rank recovery and analysis techniques. While providing uniform approximation guarantees, their approach is not tailored towards exploration–exploitation tradeoffs, and does not achieve sublinear cumulative regret. In [20] the stochastic and adversarial cases for axis-aligned Hölder continuous functions are considered.
Our specific contributions in this paper can be summarized as follows:

• We introduce the SI-BO algorithm for Bayesian bandit optimization in high dimensions, admitting a large family of kernel functions. Our algorithm is a natural but non-trivial fusion of modern low-rank subspace approximation tools with GP optimization methods.
• We derive theoretical guarantees on SI-BO’s cumulative and simple regret in high dimensions with noise. 
To the best of our knowledge, these are the first theoretical results on the sample complexity and regret rates that are subexponential in the ambient dimension.

• We provide experimental results on synthetic data and classical benchmarks.

2 Background and Problem Statement

Goal. In plain words, we wish to sequentially optimize a bounded function over a compact, convex subset D ⊂ Rd. Without loss of generality, we denote the function by f : D → [0, 1] and let x∗ be a maximizer. The algorithm proceeds in a total of T rounds. In each round t, it asks an oracle for the function value at some point xt and it receives back the value f(xt), possibly corrupted by noise. Our goal is to choose points such that their values are close to the optimum f(x∗).
As performance metric, we consider the regret, which tells us how much better we could have done in round t had we known x∗, or formally rt = f(x∗) − f(xt). In many applications, such as recommender systems, robotic control, etc., we care about the quality of the points chosen at every time step t. Hence, a natural quantity to consider is the cumulative regret, defined as RT = Σ_{t=1}^T rt. One can also consider the simple regret, defined as ST = min_{t=1,...,T} rt, measuring the quality of the best solution found so far. We will give bounds on the more challenging notion of cumulative regret, which also bounds the simple regret via ST ≤ RT / T.
Low-dimensional functions in high dimensions. Unfortunately, our problem cannot be tractably solved without further assumptions on the properties of the function f. What is worse is that the usual compact support and smoothness assumptions cannot achieve much: the minimax lower bound on the sample complexity is exponential in d [21, 6, 7]. 
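Before adding structure, it helps to make the two regret notions above concrete. The following minimal sketch (hypothetical evaluation values, not from the paper) computes rt, RT and ST for one run and checks the relation ST ≤ RT / T:

```python
# Minimal illustration of cumulative vs. simple regret for a hypothetical run.
# f_star plays the role of f(x*); values[t] is the value f(x_t) of the point
# chosen in round t.
f_star = 1.0
values = [0.2, 0.5, 0.9, 0.7, 0.95]

r = [f_star - v for v in values]   # instantaneous regret r_t = f(x*) - f(x_t)
R_T = sum(r)                       # cumulative regret R_T
S_T = min(r)                       # simple regret S_T: quality of best point so far

assert S_T <= R_T / len(r)         # S_T <= R_T / T holds for any run
```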
We hence assume that the function effectively varies only along a small number of true active dimensions: i.e., the function lives on a k ≪ d-dimensional subspace. Typically, k or an upper bound on k is assumed known [4, 5, 7, 6].
Formally, we suppose that there exists some function g : Rk → [0, 1] and a matrix A ∈ Rk×d with orthogonal rows so that f(x) = g(Ax). We will additionally assume that g ∈ C2, which is necessary to bound the errors from the linear approximation that we will make. Further, w.l.o.g., we assume that D = Bd(1 + ε̄) for some ε̄ > 0, where we define Bd(r) to be the closed ball around 0 of radius r in Rd.1 To be able to recover the subspace we also need the condition that g has Lipschitz continuous second-order derivatives and a full-rank Hessian at 0, which is satisfied for many functions [7].
Smooth, low-complexity functions. In addition to the low-dimensional subspace assumption, we also assume that g is smooth. 
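The model f(x) = g(Ax) can be sanity-checked with a small sketch (a hypothetical g and A, not from the paper): since f depends on x only through Ax, it is constant along every direction in the null space of A, i.e., it genuinely varies only on the k-dimensional subspace.

```python
import math

# Hypothetical instance of the model f(x) = g(Ax): ambient dimension d = 3,
# active dimension k = 1, and A has a single unit-norm (orthonormal) row.
A = [[3 / 5, 4 / 5, 0.0]]

def g(z):                      # a smooth low-dimensional function
    return math.sin(z[0])

def f(x):                      # f(x) = g(Ax) varies only along span(A^T)
    return g([sum(a * xi for a, xi in zip(row, x)) for row in A])

x = [0.3, -0.2, 0.7]
v = [-4 / 5, 3 / 5, 0.0]       # v lies in the null space of A (A v = 0)
x_shift = [xi + vi for xi, vi in zip(x, v)]

# Moving along null(A) leaves f unchanged.
assert abs(f(x) - f(x_shift)) < 1e-12
```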
One way to encode our prior is to assume that the function g resides in a Reproducing Kernel Hilbert Space (RKHS; cf. [23]), which allows us to quantify g’s complexity via its norm ‖g‖Hκ. The RKHS for some positive semidefinite kernel κ(·,·) can be constructed by completing the set of functions Σ_{i=1}^n αi κ(xi, ·) under a suitable inner product. In this work, we use isotropic kernels, i.e., those that depend only on the distance between points, since the problem is rotation invariant and we can only recover A up to some rotation.

1Our method can be extended to any convex compact set; see Section 5.2 in [22].

Algorithm 1 The SI-BO algorithm
Require: mX, mΦ, λ, ε, k, oracle for f, kernel κ
  C ← mX sampling centers drawn uniformly from Sd−1
  for i ← 1 to mX do
    Φi ← mΦ directions drawn uniformly from {±1/√mΦ}d
  y ← compute using Equation 1
  X̂DS ← Dantzig Selector applied to y (see Equation 2); compute the SVD X̂DS = Û Σ̂ V̂T
  Â ← Û(k) // principal k left singular vectors of Û
  D ← all (Âx, y) pairs queried so far; use GP inference to obtain µ1(·), σ1(·)
  for t ← 1 to T − mX(mΦ + 1) do
    zt ← arg maxz µt(z) + βt^{1/2} σt(z), yt ← f(ÂT zt) + noise, D.add(zt, yt)

Here is a final summary of our problem and its underlying assumptions:

1. We wish to maximize f : Bd(1 + ε̄) → [0, 1], where f(x) = g(Ax) for some matrix A ∈ Rk×d with orthogonal rows and g belongs to some RKHS Hκ.
2. 
The kernel κ is isotropic, κ(x, x′) = κ′(x − x′) = κ″(‖x − x′‖2), and κ′ is continuous, integrable and with a Fourier transform Fκ′ that is isotropic and radially non-increasing.2
3. The function g has Lipschitz continuous 2nd-order derivatives and a full-rank Hessian at 0.
4. The function g is C2 on a compact support and max_{|β|≤2} ‖Dβg‖∞ ≤ C2 for some C2 > 0.
5. The oracle noise is Gaussian with zero mean and a known variance σ2.

3 The SI-BO Algorithm

The SI-BO algorithm performs two separate exploration and exploitation stages: (1) subspace identification (SI), i.e., estimating the subspace on which the function is supported, and then (2) Bayesian optimization (BO), in order to optimize the function on the learned subspace. A key challenge here is to carefully allocate samples between these phases.
We first give a detailed outline for SI-BO in Alg. 1, deferring its theoretical analysis to Section 4. Given the (noisy) oracle for f, we first evaluate the function at several suitably chosen points and then use a low-rank recovery algorithm to compute a matrix Â that spans a subspace well aligned with the one generated by the true matrix A. Once we have computed Â, similarly to [22, 7], we define the function which we optimize as ĝ(z) = f(ÂT z) = g(AÂT z). Thus, we effectively work with an approximation f̂ to f given by f̂(x) = ĝ(Âx) = g(AÂT Âx). With the approximation at hand, we apply BO, in particular the GP-UCB algorithm, on ĝ for the remaining steps.
Subspace Learning. We learn A using the approach from [7], which reduces the learning problem to that of low-rank matrix recovery. 
We construct a set of mX points C = [ξ1, ··· , ξmX], which we call sampling centers, and consider the matrix X of gradients at those points, X = [∇f(ξ1), ··· , ∇f(ξmX)]. Using the chain rule, we have X = AT [∇g(Aξ1), ··· , ∇g(AξmX)]. Because A is a matrix of size k × d, it follows that the rank of X is at most k. This suggests that, using low-rank approximation techniques, one may be able to (up to rotation) infer A from X.
Given that we have no access to the gradients of f directly, we approximate X using a linearization of f. Consider a fixed sampling center ξ. If we make a linear approximation with step size ε to the directional derivative at center ξ in direction φ then, by Taylor’s theorem, for a suitable ζ(x, ε, φ):

⟨φ, AT ∇g(Aξ)⟩ = (1/ε)(f(ξ + εφ) − f(ξ)) − (ε/2) φT ∇2f(ζ) φ,

where the last term is the curvature error E(ξ, ε, φ).

2This is the same assumption as in [15]. Radially non-increasing means that if ‖w‖ ≤ ‖w′‖ then Fκ′(w) ≥ Fκ′(w′). Note that this is satisfied by the RBF and Matérn kernels.

Thus, sampling the finite difference f(ξ + εφ) − f(ξ) provides (up to the curvature error E(ξ, ε, φ), and sampling noise) information about the one-dimensional subspace spanned by AT ∇g(Aξ). To estimate it accurately, we must observe multiple directions φ. Further, to infer the full k-dimensional subspace A, we need to consider at least mX ≥ k centers. 
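A minimal sketch of this finite-difference estimate (with a hypothetical g, A, center and step size, not the paper's code): the quantity (f(ξ + εφ) − f(ξ))/ε approximates ⟨φ, AT ∇g(Aξ)⟩ up to the O(ε) curvature error from the Taylor remainder.

```python
import math

# Hypothetical model f(x) = g(Ax) with k = 1, d = 2 and g(z) = sin(z).
A = [1.0, 0.0]                     # single orthonormal row of A
g, g_prime = math.sin, math.cos

def f(x):
    return g(A[0] * x[0] + A[1] * x[1])

xi = [0.4, -0.1]                   # a sampling center
phi = [0.6, 0.8]                   # a unit direction
eps = 1e-4                         # finite-difference step size

# Finite-difference estimate of the directional derivative <phi, grad f(xi)>.
fd = (f([xi[0] + eps * phi[0], xi[1] + eps * phi[1]]) - f(xi)) / eps

# Exact value: <phi, A^T g'(A xi)> = g'(A xi) * <phi, A>.
exact = g_prime(A[0] * xi[0] + A[1] * xi[1]) * (phi[0] * A[0] + phi[1] * A[1])

assert abs(fd - exact) < 1e-3      # error is O(eps), the curvature term
```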
Consequently, for each center ξi, we define a set of mΦ directions and arrange them in a total of mΦ matrices Φi = [φi,1, φi,2, ··· , φi,mX] ∈ Rd×mX. We can now define the following linear system:

y = A(X) + e + z,   yi = (1/ε) Σ_{j=1}^{mX} (f(ξj + εφi,j) − f(ξj)),   (1)

where the linear operator A is defined as A(X)i = tr(ΦiT X), the curvature errors have been accumulated in e, and the noise has been put in the vector z, which is distributed as zi ∼ N(0, 2mX σ2/ε2).
Given the structure of the problem, we can make use of several low-rank recovery algorithms. For concreteness, we choose the Dantzig Selector (DS, [24]), which recovers low-rank matrices via

minimize_M ‖M‖∗   subject to ‖A∗(y − A(M))‖ ≤ λ,   (2)

where ‖·‖∗ is the nuclear norm, ‖·‖ is the spectral norm, and y − A(M) is the residual. The DS will successfully recover a matrix X̂ close to the true solution in the Frobenius norm, and moreover this distance decreases linearly with λ. As shown in [7], choosing the centers C uniformly at random from the unit sphere Sd−1, choosing each direction vector uniformly at random from {±1/√mΦ}d, and—in the case of noisy observations—resampling f repeatedly, suffices to obtain an accurate X̂ w.h.p., as long as mΦ and mX are sufficiently large. The precise choices of these quantities are analyzed in Section 4. Finally, we extract the matrix Â from the SVD of X̂, by taking its top k left singular vectors. Because the DS will find a matrix X̂ close to X, due to a result by Wedin [25] we know that the learned subspace will be close to the true one.
Optimizing ĝ. 
Once we have an approximate Â, we optimize the function ĝ(z) = f(ÂT z) on the low-dimensional domain Z = Bk(1 + ε̄). Concretely, we use GP-UCB [12], because it exhibits state-of-the-art empirical performance and enjoys strong theoretical bounds on the cumulative regret. It requires that ĝ belongs to the RKHS and that the noise, when conditioned on the history, is zero-mean and almost surely bounded by some σ̂. Section 4 shows that this is indeed true with high probability.
In order to trade exploration and exploitation, the GP-UCB algorithm computes, for each point z, a score that combines the predictive mean that we have inferred for that point with its variance, which quantifies the uncertainty in our estimate. They are combined linearly with a time-dependent weighting factor βt in the following surrogate function

ucb(z) = µt(z) + βt^{1/2} σt(z)   (3)

for a suitably chosen βt = 2B + 300γt log3(t/δ). Here, B is an upper bound on the squared RKHS norm of the function that we optimize, δ is an upper bound on the failure probability, and γt depends on the kernel [12] (cf. Section 4).3 The algorithm then greedily maximizes the ucb score above.
Note that finding the maximum of this non-convex and in general multi-modal function, while considered to be cheaper than evaluating f at a new point, is by itself a hard problem, and it is usually approached by either sampling on a grid in the domain, or using some global Lipschitz optimizer [13]. Hence, by reducing the dimension of the domain Z over which we have to optimize, our algorithm has the additional benefit that this process can be performed more efficiently.
Handling the noise. The last ingredient that we need is theory on how to pick σ̂ so that it bounds the noise during the execution of GP-UCB w.h.p., and how to select λ in (2) so that the true matrix X is feasible in the DS. 
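The acquisition step in (3) then reduces to an argmax over candidate points; a minimal sketch with hypothetical posterior values (grid-based maximization as described above, not the paper's code):

```python
import math

# Hypothetical GP posterior on a grid of candidate points: mu[i] is the
# posterior mean at the i-th point, sd[i] its posterior standard deviation.
mu = [0.10, 0.40, 0.35, 0.20]
sd = [0.50, 0.05, 0.30, 0.60]
beta_t = 4.0                       # assumed exploration weight beta_t

ucb = [m + math.sqrt(beta_t) * s for m, s in zip(mu, sd)]
best = max(range(len(ucb)), key=lambda i: ucb[i])

# A high-variance point can win despite a lower mean: index 3 is chosen
# (0.20 + 2 * 0.60 = 1.40), not the highest-mean point at index 1.
assert best == 3
```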
Due to the fast decay of the tails of the Gaussian distribution we can pick σ̂ = (2 log(1/δ) + 2 log T + log(1/(2π)))^{1/2} σ, where T is the number of GP-UCB iterations and σ2 is the variance of the noise. Then the noise will be trapped in [−σ̂, σ̂] with probability at least 1 − δ.

3If the bound B is not known beforehand then one can use a doubling trick.

Figure 1: A 2-dimensional function f(x, y) varying along a 1-dimensional subspace and its projections on different subspaces. The numbers are the respective cosine distances.

The analysis on λ comes from [7]. They bound ‖A∗(e + z)‖ using the assumption that the second-order derivatives are bounded and, as shown in [24], because z has a Gaussian distribution,

‖A∗(e + z)‖ ≤ 1.2 (C2 ε d mX k2 / (2√mΦ) + 5 √(mX mΦ) σ / ε).   (4)

If there is no noise it still holds by setting σ = 0. This bound, intuitively, relates the approximation quality λ of the subspace to the quantities mΦ, mX as well as the step size ε.

4 Theoretical Analysis

Overview. A crucial choice in our algorithm is how to allocate samples (by choosing mΦ and mX appropriately) to the tasks of subspace learning and function optimization. We now analyze both phases, and determine how to split the queries in order to optimize the cumulative regret bounds.
Let us first consider the regret incurred in the second phase, in the ideal (but unrealistic) case that the subspace is estimated exactly (i.e., Â = A). This question was answered recently in [12], where it is proven that it is bounded by O∗(√T (B√γT + γT)).4 Hereby, the quantity γT is defined as

γT = max_{S⊆D, |S|=T} H(yS) − H(yS | f),

where yS are the values of f at the points in S, corrupted by Gaussian noise, and H(·) is the entropy. It quantifies the maximal gain in information that we can obtain about f by picking a set of T points. In [12] sublinear bounds for γT have been computed for several popular kernels. For example, for the RBF kernel in k dimensions, γT = O((log T)^{k+1}). Further, B is a bound on the squared norm ‖g‖2Hκ of g w.r.t. kernel κ. Note that generally γT grows exponentially with k, rendering the application of GP-UCB directly to the high-dimensional problem intractable.
What happens if the subspace Â is estimated incorrectly? Fortunately, w.h.p. the estimated function ĝ still remains in the RKHS associated with kernel κ. However, the norm ‖ĝ‖Hκ may increase, and consequently so may the regret. Moreover, the considered f̂ disagrees with the true f, and consequently additional regret per sample may be incurred by η = ‖f̂ − f‖∞. As an illustration of the effect of misestimated subspaces see Figure 1. We can observe that subspaces far from the true one stretch the function more, thus increasing its RKHS norm.
We now state a general result that formalizes these insights by bounding the cumulative regret in terms of the samples allocated to subspace learning, and the subspace approximation quality.

Lemma 1 Assume that we spend 0 < n ≤ T samples to learn the subspace such that ‖f − f̂‖∞ ≤ η, ‖ĝ‖ ≤ B and the error is bounded by σ̂, each w.p. at least 1 − δ/4. If we run the GP-UCB algorithm for the remaining T − n steps with the suggested σ̂ and δ/4, then the following bound on the cumulative regret holds w.p. at least 1 − δ:

RT ≤ n + ηT + O∗(√T (B√γT + γT)),

where the first two terms constitute the approximation error and the last term is RUCB(T, ĝ, κ), the regret of GP-UCB when run for T steps using ĝ and kernel κ.5

4We have used the notation O∗(f) = O(f log f) to suppress the log factors. Ω∗(·) is analogously defined.

Figure 2: Approximations ĝ resulting from differently aligned subspaces (panels with cos Θ = [1.00, 1.00], [0.04, 0.00], [0.99, 0.04], [0.97, 0.95]). Note that inaccurate estimation (the middle two cases) can wildly distort the objective.

Lemma 1 breaks down the regret in terms of the approximation error incurred by subspace misestimation, and the optimization error incurred by the resulting increased complexity ‖ĝ‖2Hκ ≤ B. We now analyze these effects, and then prove our main regret bounds.

Effects of Subspace Alignment. A notion that will prove to be very helpful for analyzing both the approximation precision η and the norm of ĝ is the set of angles between the subspaces that are defined by A and Â. The following definition [26] makes this notion precise.

Definition 2 Let A, Â ∈ Rk×d be two matrices with orthogonal rows so that AAT = ÂÂT = I. We define the vector of cosines between the spanned subspaces, cos Θ(A, Â), to be equal to the singular values of AÂT. Analogously, sin Θ(A, Â)i = (1 − cos Θ(A, Â)i2)^{1/2}.

Let us see how Â affects ĝ. Because ĝ(z) = g(AÂT z), the matrix M = AÂT, which converts any point from its coordinates determined by Â to the coordinates defined by A, will be of crucial importance. First, note that its singular values are cosines and lie between 0 and 1. This means that it can only shrink the vectors that we apply it to (possibly by different amounts in different directions). The effect on ĝ is that it might only “see” a small part of the whole space, and its shape might be distorted, which in turn will increase its RKHS complexity (see Figure 2 for an illustration).

Lemma 3 If g ∈ Hκ for a kernel that is isotropic with a radially non-increasing Fourier transform and ĝ(x) = g(AÂT x) for some A, Â with orthogonal rows, then for C = C2 √(2k) (1 + ε̄),

‖f − f̂‖∞ ≤ C ‖sin Θ(A, Â)‖2   and   ‖ĝ‖2Hκ ≤ |prod cos Θ(A, Â)|−1 ‖g‖2Hκ.   (5)

Here, we use the notation prod x = Π_{i=1}^d xi to denote the product of the elements of a vector. By decreasing the angles we tackle both issues: the approximation error η = ‖f − f̂‖∞ is reduced and the norm of ĝ gets closer to the one of g. There is one nice interpretation of the product of the cosines: it is equal to the determinant of the matrix M. Hence, ĝ will not be in the RKHS only if M is rank-deficient, as dimensions are then collapsed.

Regret Bounds. We now present our main bounds on the cumulative regret. In order to achieve sublinear regret, we need a way of controlling η and ‖ĝ‖Hκ. In the following, we show how this goal can be achieved. As it turns out, subspace learning is substantially harder in the case of noisy observations. Therefore, we focus on the easier, noise-free setting first.
Noiseless Observations. We should note that the theory behind GP-UCB still holds in the deterministic case, as it only requires the noise to be bounded a.s. by σ̂. 
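The subspace-alignment quantities above (Definition 2, Lemma 3) can be illustrated in the simplest case k = 1, d = 2 (a hypothetical example, not from the paper): for unit rows A = [cos α, sin α] and Â = [cos β, sin β], the matrix AÂT is the scalar cos(α − β), so the cosine between the subspaces is exactly the cosine of the angle between the two lines.

```python
import math

# Hypothetical 1-dimensional subspaces of R^2, spanned by unit vectors at
# angles alpha (true A) and beta (estimate A_hat).
alpha, beta = 0.3, 0.5
A = [math.cos(alpha), math.sin(alpha)]
A_hat = [math.cos(beta), math.sin(beta)]

# For k = 1 the matrix M = A A_hat^T is a scalar; its singular value is
# cos Theta(A, A_hat) from Definition 2.
M = A[0] * A_hat[0] + A[1] * A_hat[1]
cos_theta = abs(M)
sin_theta = math.sqrt(1.0 - cos_theta ** 2)

assert abs(cos_theta - math.cos(alpha - beta)) < 1e-12
# A perfect estimate (beta = alpha) would give cos Theta = 1 and
# sin Theta = 0, making the error bound of Lemma 3 vanish.
```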
The following theorem guarantees that in this setting, for non-linear kernels, we have a regret dominated by the GP-UCB term, which is of order Ω∗(√(T γT)), as it is usually exponential in k.

Theorem 4 If the observations are noiseless we can pick mX = O(kd log(1/δ)), ε = 1/(k^{2.25} d^{3/2} T^{1/2}) and mΦ = O(k2 d log(1/δ)) so that with probability at least 1 − δ we have the following:

RT ≤ O(k3 d2 log2(1/δ)) + 2 RUCB(T, g, κ).

5Because the noise parameter σ̂ depends on T, we have to slightly change the bounds from [12], as we have a term of order O((log T + log(1/δ))^{1/2}); cf. supplementary material.

Noisy Observations. Equation 4 hints that the noise can have a dramatic effect on learning efficiency. As already mentioned, the DS gets better results as we decrease λ. In the noiseless case, it suffices to increase the number of directions mΦ and decrease the step size ε in estimating the finite differences. However, the second term in λ can only be reduced by decreasing the variance σ2. As a result, each point that we evaluate is sampled n times and we take as its value the average. Moreover, note that because the standard deviation decreases as 1/√n, we have to resample at least ε−2 times and this significantly increases the number of samples that we need. Nevertheless, we are able to obtain cumulative regret bounds (and thereby the first convergence guarantees and rates) for this setting, which only polynomially depend on d. 
Unfortunately, the dependence on T is now\nweaker than those in the noiseless setting (Theorem 4), and the regret due to the subspace learning\nmight dominate that of GP-UCB.\nTheorem 5 If the observations are noisy, we can pick \u03b5 =\nk2.25d1.5T 1/5 and all other parameters\nas in the previous theorem. Moreover, we have to resample each point O(\u03c32k2dT 2/5m\u03a6/\u03b52) times.\nThen, with probability at least 1 \u2212 \u03b4\n\n1\n\n\u03c32k11.5d7T 4/5 log3(1/\u03b4)\n\n+ 2 RUCB(T, g, \u03ba).\n\nRT \u2264 O(cid:16)\n\n(cid:17)\n\nMismatch on the effective dimension k. All models are imperfect in some sense and the structure\nof a general f is impossible to identify unless we have further scienti\ufb01c evidence beyond the data.\nIn our case, the assumption f (x) = g(Ax) for some k more or less takes the weakest form for\nindicating our hope that BO can succeed from a sub-exponential sample size. In general, we must\ntune k in a degree to re\ufb02ect the anticipated complexity in the learning problem. Fortunately, all\nthe guarantees are preserved if we assume a k > ktrue, for some true synthetic model, where\nf (x) = g(Ax) holds. Under\ufb01tting k leads to additional errors that are well-controlled in low-rank\nsubspace estimation [24]. The impact of under \ufb01tting in our setting is left for future work.\n\n5 Experiments\n\nThe main intent of our experiments is to provide a proof of concept, con\ufb01rming that SI-BO not just\nin theory provides the \ufb01rst subexponential regret bounds, but also empirically obtains low average\nregret for Bayesian optimization in high dimensions.\nBaselines. We compare SI-BO against the following baseline approaches:\n\n\u2022 RandomS-UCB, which runs GP-UCB on a random subspace.\n\u2022 RandomH-UCB, which runs GP-UCB on the high-dimensional space. 
At each iteration\n\u2022 Exact-UCB, which runs GP-UCB on the exact (but in practice unknown) subspace.\n\nwe pick 1000 points at random and choose the one with highest UCB score.\n\nThe \u03b2t parameter in the GP-UCB score was set as recommended in [12] for \ufb01nite sets. To optimize\nthe UCB score we sampled on a grid on the low dimensional subspace. For all of the measurements\nwe have added Gaussian zero-mean noise with \u03c3 = 0.01.\nData sets. We carry out experiments in the following settings:\n\nwith smoothness parameter \u03bd = 5/2, length scale (cid:96) = 1/2 and signal variance \u03c32\nThe samples are \u201chidden\u201d in a random 2-dimensional subspace in 100 dimensions.\n\n\u2022 GP Samples. We generate random 2-dimensional samples from a GP with Mat`ern kernel\nf = 1.\n\u2022 Gab`or Filters. The second data set is inspired by experimental design in neuroscience\n[27]. The goal is to determine visual stimuli that maximally excite some neuron, which\nreacts to edges in the images. We consider the function f (x) = exp(\u2212(\u03b8T x \u2212 1)2), where\n\u03b8 is a Gab\u00b4or \ufb01lter of size 17 \u00d7 17 and the set of admissible signals is [0, 1]d.\n\nIn the appendix we also include results for the Branin function, a classical optimization benchmark.\nResults. The results are presented in Figure 3. We show the averages of 20 runs (10 runs for\nGP-Posterior) and the shaded areas represent the standard error around the mean. We show both\nthe average regret and simple regret (i.e., suboptimality of the best solution found so far). We \ufb01nd\nthat although SI-BO spends a total of mX (m\u03a6 + 1) samples to learn the subspace and thus incurs\n\n7\n\n\f(a) GP-Posterior\n\n(b) GP-Posterior, Different k\n\n(c) Gab\u00b4or\n\n(d) GP-Posterior\n\n(e) GP-Posterior, Different k\n\n(f) Gab\u00b4or\n\nFigure 3: Performance comparison on different datasets. 
Our SI-BO approach outperforms the natural benchmarks in terms of cumulative regret, and competes well with the unrealistic Exact-UCB approach that knows the true subspace A.

much regret during this phase, learning the subspace pays off, both for average and simple regret, and SI-BO ultimately outperforms the baseline methods on both data sets. This demonstrates the value of accurate subspace estimation for Bayesian optimization in high dimensions.

Mis-specified k. What happens if we do not know the dimensionality k of the low-dimensional subspace? To test this, we studied the stability of SI-BO with respect to k: we sampled 2-dimensional GP-Posterior functions and ran SI-BO with k set to 1, 2 and 3. From Figure 3 we can see that in this scenario SI-BO is relatively stable to this parameter mis-specification.

6 Conclusion

We have addressed the problem of optimizing high-dimensional functions from noisy and expensive samples. We presented the SI-BO algorithm, which tackles this challenge under the assumption that the objective varies only along a low-dimensional subspace and has low norm in a suitable RKHS. By fusing modern techniques for low-rank matrix recovery and Bayesian bandit optimization in a carefully calibrated manner, it addresses the exploration–exploitation dilemma and enjoys cumulative regret bounds that depend only polynomially on the ambient dimension. Our results hold for a wide family of RKHSs, including the popular RBF and Matérn kernels. Our experiments on different data sets demonstrate that our approach outperforms natural benchmarks.

Acknowledgments. A. Krause acknowledges SNF 200021-137971, DARPA MSEE FA8650-11-1-7156, ERC StG 307036 and a Microsoft Faculty Fellowship. V. Cevher acknowledges MIRG-268398, ERC Future Proof, SNF 200021-132548, SNF 200021-146750, and SNF CRSII2-147633.

References

[1] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans.
Automatic gait optimization with Gaussian process regression. In Proc. of IJCAI, pages 944–949, 2007.

[2] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13:281–305, 2012.

[3] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

[4] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.

[5] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

[6] S. Mukherjee, Q. Wu, and D. Zhou. Learning gradients on manifolds. Bernoulli, 16(1):181–207, 2010.

[7] H. Tyagi and V. Cevher. Learning non-parametric basis independent models from point queries via low-rank methods. Applied and Computational Harmonic Analysis, 2014.

[8] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

[9] P. Auer. Using confidence bounds for exploitation-exploration trade-offs.
The Journal of Machine Learning Research, 3:397–422, 2003.

[10] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, pages 681–690, 2008.

[11] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. Online optimization in X-armed bandits. In NIPS, 2008.

[12] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, May 2012.

[13] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[14] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, Novosibirsk, July 1–7, 1974, pages 400–404. Springer, 1975.

[15] A. D. Bull. Convergence rates of efficient global optimization algorithms. The Journal of Machine Learning Research, 12:2879–2904, 2011.

[16] A. Carpentier and R. Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. Journal of Machine Learning Research - Proceedings Track, 22:190–198, 2012.

[17] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[18] B. Chen, R. Castro, and A. Krause. Joint optimization and variable selection of high-dimensional Gaussian processes. In Proc. International Conference on Machine Learning (ICML), 2012.

[19] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian optimization in high dimensions via random embeddings. In Proc. IJCAI, 2013.

[20] H. Tyagi and B. Gärtner.
Continuum armed bandit problem of few variables in high dimensions. CoRR, abs/1304.5793, 2013.

[21] R. A. DeVore and G. G. Lorentz. Constructive approximation, volume 303. Springer Verlag, 1993.

[22] M. Fornasier, K. Schnass, and J. Vybíral. Learning functions of few arbitrary linear parameters in high dimensions. Foundations of Computational Mathematics, pages 1–34, 2012.

[23] B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2001.

[24] E. J. Candès and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.

[25] P. A. Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.

[26] G. W. Stewart and J. Sun. Matrix Perturbation Theory, volume 175. Academic Press, New York, 1990.

[27] J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A: Optics and Image Science, 2:1160–1169, 1985.