{"title": "Scaling Gaussian Process Regression with Derivatives", "book": "Advances in Neural Information Processing Systems", "page_first": 6867, "page_last": 6877, "abstract": "Gaussian processes (GPs) with derivatives are useful in many applications, including Bayesian optimization, implicit surface reconstruction, and terrain reconstruction. Fitting a GP to function values and derivatives at $n$ points in $d$ dimensions requires linear solves and log determinants with an ${n(d+1) \\times n(d+1)}$ positive definite matrix, leading to prohibitive $\\mathcal{O}(n^3d^3)$ computations for standard direct methods. We propose iterative solvers using fast $\\mathcal{O}(nd)$ matrix-vector multiplications (MVMs), together with pivoted Cholesky preconditioning that cuts the iterations to convergence by several orders of magnitude, allowing for fast kernel learning and prediction. Our approaches, together with dimensionality reduction, allow us to scale Bayesian optimization with derivatives to high-dimensional problems and large evaluation budgets.", "full_text": "Scaling Gaussian Process Regression with Derivatives\n\nKun Dong\nCenter for Applied Mathematics\nCornell University\nIthaca, NY 14853\nkd383@cornell.edu\n\nDavid Eriksson\nCenter for Applied Mathematics\nCornell University\nIthaca, NY 14853\ndme65@cornell.edu\n\nEric Hans Lee\nDepartment of Computer Science\nCornell University\nIthaca, NY 14853\nehl59@cornell.edu\n\nDavid Bindel\nDepartment of Computer Science\nCornell University\nIthaca, NY 14853\nbindel@cornell.edu\n\nAndrew Gordon Wilson\nSchool of Operations Research\nand Information Engineering\nCornell University\nIthaca, NY 14853\nandrew@cornell.edu\n\nAbstract\n\nGaussian processes (GPs) with derivatives are useful in many applications, includ-\ning Bayesian optimization, implicit surface reconstruction, and terrain reconstruc-\ntion. 
Fitting a GP to function values and derivatives at n points in d dimensions\nrequires linear solves and log determinants with an n(d + 1) \u00d7 n(d + 1) positive\nde\ufb01nite matrix \u2013 leading to prohibitive O(n^3 d^3) computations for standard direct\nmethods. We propose iterative solvers using fast O(nd) matrix-vector multipli-\ncations (MVMs), together with pivoted Cholesky preconditioning that cuts the\niterations to convergence by several orders of magnitude, allowing for fast kernel\nlearning and prediction. Our approaches, together with dimensionality reduc-\ntion, enable Bayesian optimization with derivatives to scale to high-dimensional\nproblems and large evaluation budgets.\n\n1 Introduction\n\nGaussian processes (GPs) provide a powerful probabilistic learning framework, including a marginal\nlikelihood which represents the probability of data given only kernel hyperparameters. The marginal\nlikelihood automatically balances model \ufb01t and complexity terms to favor the simplest models that\nexplain the data [22, 21, 27]. Computing the model \ufb01t term, as well as the predictive moments\nof the GP, requires solving linear systems with the kernel matrix, while the complexity term, or\nOccam\u2019s factor [18], is the log determinant of the kernel matrix. For n training points, the exact kernel\nlearning cost of O(n^3) \ufb02ops and the prediction cost of O(n) \ufb02ops per test point are computationally\ninfeasible for datasets with more than a few thousand points. 
The situation becomes more challenging\nif we consider GPs with both function value and derivative information, in which case training and\nprediction become O(n^3 d^3) and O(nd) respectively [21, \u00a79.4], for d input dimensions.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nDerivative information is important in many applications, including Bayesian Optimization (BO)\n[29], implicit surface reconstruction [17], and terrain reconstruction. For many simulation models,\nderivatives may be computed at little extra cost via \ufb01nite differences, complex step approximation, an\nadjoint method, or algorithmic differentiation [7]. But while many scalable approximation methods\nfor Gaussian process regression have been proposed, scalable methods incorporating derivatives\nhave received little attention. In this paper, we propose scalable methods for GPs with derivative\ninformation built on the structured kernel interpolation (SKI) framework [28], which uses local\ninterpolation to map scattered data onto a large grid of inducing points, enabling fast MVMs using\nFFTs. As the uniform grids in SKI scale poorly to high-dimensional spaces, we also extend the\nstructured kernel interpolation for products (SKIP) method, which approximates a high-dimensional\nproduct kernel as a Hadamard product of low rank Lanczos decompositions [8]. Both SKI and SKIP\nprovide fast approximate kernel MVMs, which are a building block to solve linear systems with the\nkernel matrix and to approximate log determinants [6].\nThe speci\ufb01c contributions of this paper are:\n\n\u2022 We extend SKI to incorporate derivative information, enabling O(nd) complexity learning\nand O(1) prediction per test point, relying only on fast MVM with the kernel matrix.\n\u2022 We also extend SKIP, which enables scalable Gaussian process regression with derivatives\nin high-dimensional spaces without grids. 
Our approach allows for O(nd) MVMs.\n\u2022 We illustrate that preconditioning is critical for fast convergence of iterations for kernel ma-\ntrices with derivatives. A pivoted Cholesky preconditioner cuts the iterations to convergence\nby several orders of magnitude when applied to both SKI and SKIP with derivatives.\n\u2022 We illustrate the scalability of our approach on several examples including implicit surface\n\ufb01tting of the Stanford bunny, rough terrain reconstruction, and Bayesian optimization.\n\u2022 We show how our methods, together with active subspace techniques, can be used to extend\nBayesian optimization to high-dimensional problems with large evaluation budgets.\n\u2022 Code, experiments, and \ufb01gures may be reproduced at:\nhttps://github.com/ericlee0803/GP_Derivatives.\n\nWe start in \u00a72 by introducing GPs with derivatives and kernel approximations. In \u00a73, we extend\nSKI and SKIP to handle derivative information. In \u00a74, we show representative experiments; and we\nconclude in \u00a75. The supplementary materials provide several additional experiments and details.\n\n2 Background and Challenges\n\nA Gaussian process (GP) is a collection of random variables, any \ufb01nite number of which are jointly\nGaussian [21]; it also de\ufb01nes a distribution over functions on R^d, f \u223c GP(\u00b5, k), where \u00b5 : R^d \u2192 R\nis a mean \ufb01eld and k : R^d \u00d7 R^d \u2192 R is a symmetric and positive (semi)-de\ufb01nite covariance kernel.\nFor any set of locations X = {x1, . . . 
, xn} \u2282 R^d, fX \u223c N(\u00b5X, KXX) where fX and \u00b5X represent\nthe vectors of function values for f and \u00b5 evaluated at each of the xi \u2208 X, and (KXX)ij = k(xi, xj).\nWe assume the observed function value vector yX \u2208 R^n is contaminated by independent Gaussian\nnoise with variance \u03c3^2. We denote any kernel hyperparameters by the vector \u03b8. To be concise, we\nsuppress the dependence of k and associated matrices on \u03b8 in our notation. Under a Gaussian process\nprior depending on the covariance hyperparameters \u03b8, the log marginal likelihood is given by\n\n$$\\mathcal{L}(y_X \\mid \\theta) = -\\frac{1}{2}\\left[ (y_X - \\mu_X)^T \\alpha + \\log|\\tilde{K}_{XX}| + n \\log 2\\pi \\right] \\qquad (1)$$\n\nwhere $\\alpha = \\tilde{K}_{XX}^{-1}(y_X - \\mu_X)$ and $\\tilde{K}_{XX} = K_{XX} + \\sigma^2 I$. The standard direct method to evaluate (1)\nand its derivatives with respect to the hyperparameters uses the Cholesky factorization of $\\tilde{K}_{XX}$,\nleading to O(n^3) kernel learning that does not scale beyond a few thousand points.\n\nFigure 1: An example where gradient information pays off; the true function is on the left. Compare\nthe regular GP without derivatives (middle) to the GP with derivatives (right). Unlike the former, the\nlatter is able to accurately capture critical points of the function.\n\nA popular approach to scalable GPs is to approximate the exact kernel with a structured kernel\nthat enables fast MVMs [20]. Several methods approximate the kernel via inducing points\n$U = \\{u_j\\}_{j=1}^m \\subset \\mathbb{R}^d$; see, e.g. [20, 16, 13]. Common examples are the subset of regressors (SoR), which\nexploits low-rank structure, and fully independent training conditional (FITC), which introduces an\nadditional diagonal correction [23]. 
For most inducing point methods, the cost of kernel learning\nwith n data points and m inducing points scales as O(m^2 n + m^3), which becomes expensive as m\ngrows. As an alternative, Wilson and Nickisch [28] proposed the structured kernel interpolation (SKI)\napproximation,\n\n$$K_{XX} \\approx W K_{UU} W^T \\qquad (2)$$\n\nwhere U is a uniform grid of inducing points and W is an n-by-m matrix of interpolation weights;\nthe authors of [28] use local cubic interpolation so that W is sparse. If the original kernel is stationary,\neach MVM with the SKI kernel may be computed in O(n + m log(m)) time via FFTs, leading\nto substantial performance gains over FITC and SoR. A limitation of SKI when used in combination\nwith Kronecker inference is that the number of grid points increases exponentially with the dimen-\nsion. This exponential scaling has been addressed by structured kernel interpolation for products\n(SKIP) [8], which decomposes the kernel matrix for a product kernel in d-dimensions as a Hadamard\n(elementwise) product of one-dimensional kernel matrices.\nWe use fast MVMs to solve linear systems involving $\\tilde{K}_{XX}$ by the method of conjugate gradients.\nTo estimate $\\log|\\tilde{K}_{XX}| = \\mathrm{tr}(\\log(\\tilde{K}_{XX}))$, we apply stochastic trace estimators that require only\nproducts of $\\log(\\tilde{K}_{XX})$ with random probe vectors. 
Given a probe vector z, several ideas have been\nexplored to compute $\\log(\\tilde{K}_{XX})z$ via MVMs with $\\tilde{K}_{XX}$, such as using a polynomial approximation\nof log or using the connection between the Gaussian quadrature rule and the Lanczos method [11, 25].\nIt was shown in [6] that using Lanczos is superior to the polynomial approximations and that only a\nfew probe vectors are necessary even for large kernel matrices.\nDifferentiation is a linear operator, and (assuming a twice-differentiable kernel) we may de\ufb01ne a\nmulti-output GP for the function and (scaled) gradient values with mean and kernel functions\n\n$$\\mu^{\\nabla}(x) = \\begin{bmatrix} \\mu(x) \\\\ \\partial_x \\mu(x) \\end{bmatrix}, \\qquad k^{\\nabla}(x, x') = \\begin{bmatrix} k(x, x') & (\\partial_{x'} k(x, x'))^T \\\\ \\partial_x k(x, x') & \\partial^2 k(x, x') \\end{bmatrix},$$\n\nwhere $\\partial_x k(x, x')$ and $\\partial^2 k(x, x')$ represent the column vector of (scaled) partial derivatives in x and\nthe matrix of (scaled) second partials in x and x', respectively. Scaling derivatives by a natural length\nscale gives the multi-output GP consistent units, and lets us understand approximation error without\nweighted norms. As in the scalar GP case, we model measurements of the function as contaminated\nby independent Gaussian noise.\nBecause the kernel matrix for the GP on function values alone is a submatrix of the kernel matrix\nfor function values and derivatives together, the predictive variance in the presence of derivative\ninformation will be strictly less than the predictive variance without derivatives. Hence, convergence\nof regression with derivatives is always superior to convergence of regression without, which is well-\nstudied in, e.g. [21, Chapter 7]. Figure 1 illustrates the value of derivative information; \ufb01tting with\nderivatives is evidently much more accurate than \ufb01tting function values alone. 
In higher-dimensional\nproblems, derivative information is even more valuable, but it comes at a cost: the kernel matrix\n$K^{\\nabla}_{XX}$ is of size n(d + 1)-by-n(d + 1). Scalable approximate solvers are therefore vital in order to\nuse GPs for large datasets with derivative data, particularly in high-dimensional spaces.\n\n[Figure 1 panel titles: Branin; SE no gradient; SE with gradients]\n\n3 Methods\n\nOne standard approach to scaling GPs substitutes the exact kernel with an approximate kernel. When\nthe GP \ufb01ts values and gradients, one may attempt to separately approximate the kernel and the kernel\nderivatives. Unfortunately, this may lead to inde\ufb01niteness, as the resulting approximation is no longer\na valid kernel. Instead, we differentiate the approximate kernel, which preserves positive de\ufb01niteness.\nWe do this for the SKI and SKIP kernels below, but our general approach applies to any differentiable\napproximate MVM.\n\n3.1 D-SKI\n\nD-SKI (SKI with derivatives) is the standard kernel matrix for GPs with derivatives, but applied to\nthe SKI kernel. Equivalently, we differentiate the interpolation scheme:\n\n$$k(x, x') \\approx \\sum_i w_i(x) k(x_i, x') \\quad \\rightarrow \\quad \\nabla k(x, x') \\approx \\sum_i \\nabla w_i(x) k(x_i, x').$$\n\nOne can use cubic convolutional interpolation [14], but higher order methods lead to greater accuracy,\nand we therefore use quintic interpolation [19]. 
The resulting D-SKI kernel matrix has the form\n\n$$\\begin{bmatrix} K & (\\partial K)^T \\\\ \\partial K & \\partial^2 K \\end{bmatrix} \\approx \\begin{bmatrix} W \\\\ \\partial W \\end{bmatrix} K_{UU} \\begin{bmatrix} W \\\\ \\partial W \\end{bmatrix}^T = \\begin{bmatrix} W K_{UU} W^T & W K_{UU} (\\partial W)^T \\\\ (\\partial W) K_{UU} W^T & (\\partial W) K_{UU} (\\partial W)^T \\end{bmatrix},$$\n\nwhere the elements of sparse matrices W and \u2202W are determined by wi(x) and \u2207wi(x); assuming\nquintic interpolation, W and \u2202W will each have 6^d elements per row. As with SKI, we use FFTs\nto obtain O(m log m) MVMs with KUU. 
Because W and \u2202W have O(n6^d) and O(nd6^d) nonzero\nelements, respectively, our MVM complexity is O(nd6^d + m log m).\n\n3.2 D-SKIP\n\nSeveral common kernels are separable, i.e., they can be expressed as products of one-dimensional\nkernels. Assuming a compatible approximation scheme, this structure is inherited by the SKI\napproximation for the kernel matrix without derivatives,\n\n$$K \\approx (W_1 K_1 W_1^T) \\odot (W_2 K_2 W_2^T) \\odot \\cdots \\odot (W_d K_d W_d^T),$$\n\nwhere A \u2299 B denotes the Hadamard product of matrices A and B with the same dimensions, and Wj\nand Kj denote the SKI interpolation and inducing point grid matrices in the jth coordinate direction.\nThe same Hadamard product structure applies to the kernel matrix with derivatives; for example, for\nd = 2,\n\n$$K^{\\nabla} \\approx \\begin{bmatrix} W_1 K_1 W_1^T & W_1 K_1 \\partial W_1^T & W_1 K_1 W_1^T \\\\ \\partial W_1 K_1 W_1^T & \\partial W_1 K_1 \\partial W_1^T & \\partial W_1 K_1 W_1^T \\\\ W_1 K_1 W_1^T & W_1 K_1 \\partial W_1^T & W_1 K_1 W_1^T \\end{bmatrix} \\odot \\begin{bmatrix} W_2 K_2 W_2^T & W_2 K_2 W_2^T & W_2 K_2 \\partial W_2^T \\\\ W_2 K_2 W_2^T & W_2 K_2 W_2^T & W_2 K_2 \\partial W_2^T \\\\ \\partial W_2 K_2 W_2^T & \\partial W_2 K_2 W_2^T & \\partial W_2 K_2 \\partial W_2^T \\end{bmatrix}. \\qquad (3)$$\n\nEquation 3 expresses K\u2207 as a Hadamard product of one dimensional kernel matrices. Following this\napproximation, we apply the SKIP reduction [8] and use Lanczos to further approximate equation\n3 as $(Q_1 T_1 Q_1^T) \\odot (Q_2 T_2 Q_2^T)$. This can be used for fast MVMs with the kernel matrix. Applied to\nkernel matrices with derivatives, we call this approach D-SKIP.\nConstructing the D-SKIP kernel costs O(d^2(n + m log m + r^3 n log d)), and each MVM costs O(d r^2 n)\n\ufb02ops where r is the effective rank of the kernel at each step (rank of the Lanczos decomposition). 
We\nachieve high accuracy with r \u226a n.\n\n3.3 Preconditioning\n\nRecent work has explored several preconditioners for exact kernel matrices without derivatives [5].\nWe have had success with preconditioners of the form $M = \\sigma^2 I + F F^T$ where $K^{\\nabla} \\approx F F^T$ with\n$F \\in \\mathbb{R}^{n \\times p}$. Solving with the Sherman-Morrison-Woodbury formula (a.k.a. the matrix inversion\nlemma) is inaccurate for small \u03c3; we use the more stable formula $M^{-1} b = \\sigma^{-2}(b - Q_1(Q_1^T b))$\nwhere Q1 is computed in O(p^2 n) time by the economy QR factorization\n\n$$\\begin{bmatrix} F \\\\ \\sigma I \\end{bmatrix} = \\begin{bmatrix} Q_1 \\\\ Q_2 \\end{bmatrix} R.$$\n\nIn our experiments with solvers for D-SKI and D-SKIP, we have found that a truncated pivoted\nCholesky factorization, $K^{\\nabla} \\approx (\\Pi L)(\\Pi L)^T$, works well for the low-rank factorization. Computing\nthe pivoted Cholesky factorization is cheaper than MVM-based preconditioners such as Lanczos\nor truncated eigendecompositions as it only requires the diagonal and the ability to form the rows\nwhere pivots are selected. Pivoted Cholesky is a natural choice when inducing point methods are\napplied as the pivoting can itself be viewed as an inducing point method where the most important\ninformation is selected to construct a low-rank preconditioner [12]. The D-SKI diagonal can be\nformed in O(nd6^d) \ufb02ops while rows cost O(nd6^d + m) \ufb02ops; for D-SKIP both the diagonal and the\nrows can be formed in O(nd) \ufb02ops.\n\n3.4 Dimensionality reduction\n\nIn many high-dimensional function approximation problems, only a few directions are relevant. 
That\nis, if f : R^d \u2192 R is a function to be approximated, there is often a matrix P with \u02dcd < d orthonormal\ncolumns spanning an active subspace of R^d such that f(x) \u2248 f(P P^T x) for all x in some domain \u2126\nof interest [4]. The optimal subspace is given by the dominant eigenvectors of the covariance matrix\n$C = \\int_{\\Omega} \\nabla f(x) \\nabla f(x)^T \\, dx$, generally estimated by Monte Carlo integration. Once the subspace is\ndetermined, the function can be approximated through a GP on the reduced space, i.e., we replace the\noriginal kernel k(x, x') with a new kernel $\\check{k}(x, x') = k(P^T x, P^T x')$. Because we assume gradient\ninformation, dimensionality reduction based on active subspaces is a natural pre-processing phase\nbefore applying D-SKI and D-SKIP.\n\n4 Experiments\n\nFigure 2: (Left two images) log10 error in SKI approximation and comparison to the exact spectrum.\n(Right two images) log10 error in SKIP approximation and comparison to the exact spectrum.\n\nOur experiments use the squared exponential (SE) kernel, which has product structure and can be\nused with D-SKIP; and the spline kernel, to which D-SKIP does not directly apply. We use these\nkernels in tandem with D-SKI and D-SKIP to achieve the fast MVMs derived in \u00a73. We write D-SE\nto denote the exact SE kernel with derivatives.\nD-SKI and D-SKIP with the SE kernel approximate the original kernel well, both in terms of MVM\naccuracy and spectral pro\ufb01le. Comparing D-SKI and D-SKIP to their exact counterparts in Figure 2,\nwe see their matrix entries are very close (leading to MVM accuracy near 10^{-5}), and their spectral\npro\ufb01les are indistinguishable. The same is true with the spline kernel. Additionally, scaling tests in\nFigure 3 verify the predicted complexity of D-SKI and D-SKIP. 
We show the relative \ufb01tting accuracy\nof SE, SKI, D-SE, and D-SKI on some standard test functions in Table 1.\n\n4.1 Dimensionality reduction\n\nWe apply active subspace pre-processing to the 20 dimensional Welsh test function in [2]. 
The top six\neigenvalues of its gradient covariance matrix are well separated from the rest as seen in Figure 4(a).\nHowever, the function is far from smooth when projected onto the leading 1D or 2D active subspace,\nas Figure 4(b)-4(d) indicates, where the color shows the function value.\n\n[Figure 2 legends: True spectrum, SKI spectrum; True spectrum, SKIP spectrum]\n\nFigure 3: Scaling tests for D-SKI in two dimensions and D-SKIP in 11 dimensions. D-SKIP uses\nfewer data points for identical matrix sizes.\n\nSE\nSKI\nD-SE\nD-SKI\n\nBranin\n6.02e-3\n3.97e-3\n1.83e-3\n1.03e-3\n\nFranke\n8.73e-3\n5.51e-3\n1.59e-3\n4.06e-4\n\nSine Norm Sixhump\n6.44e-3\n5.11e-3\n1.05e-3\n5.66e-4\n\n8.64e-3\n5.37e-3\n3.33e-3\n1.32e-3\n\nStyTang\n4.49e-3\n2.25e-3\n1.00e-3\n5.22e-4\n\nHart3\n1.30e-2\n8.59e-3\n3.17e-3\n1.67e-3\n\nTable 1: Relative RMSE error on 10000 testing points for test functions from [24], including \ufb01ve 2D\nfunctions (Branin, Franke, Sine Norm, Sixhump, and Styblinski-Tang) and the 3D Hartman function.\nWe train the SE kernel on 4000 points, the D-SE kernel on 4000/(d + 1) points, and SKI and D-SKI\nwith SE kernel on 10000 points to achieve comparable runtimes between methods.\n\n(a) Log Directional Variation\n\n(b) First Active Direction\n\n(c) Second Active Direction\n\n(d) Leading 2D Active Subspace\n\nFigure 4: 4(a) shows the top 10 eigenvalues of the gradient covariance. Welsh is projected onto the\n\ufb01rst and second active direction in 4(b) and 4(c). After joining them together, we see in 4(d) that\npoints of different color are highly mixed, indicating a very spiky surface.\n\nWe therefore apply D-SKI and D-SKIP on the 3D and 6D active subspace, respectively, using 5000\ntraining points, and compare the prediction error against D-SE with 190 training points because\nof our scaling advantage. 
Table 2 reveals that while the 3D active subspace fails to capture all the\nvariation of the function, the 6D active subspace is able to do so. These properties are demonstrated\nby the poor prediction of D-SKI in 3D and the excellent prediction of D-SKIP in 6D.\n\n        D-SE        D-SKI (3D)   D-SKIP (6D)\nRMSE    4.900e-02   2.267e-01    3.366e-03\nSMAE    4.624e-02   2.073e-01    2.590e-03\n\nTable 2: Relative RMSE and SMAE prediction error for Welsh. The D-SE kernel is trained on\n4000/(d + 1) points, with D-SKI and D-SKIP trained on 5000 points. The 6D active subspace is\nsuf\ufb01cient to capture the variation of the test function.\n\n4.2 Rough terrain reconstruction\n\nRough terrain reconstruction is a key application in robotics [9, 15], autonomous navigation [10],\nand geostatistics. Through a set of terrain measurements, the problem is to predict the underlying\ntopography of some region. In the following experiment, we consider roughly 23 million non-\nuniformly sampled elevation measurements of Mount St. Helens obtained via LiDAR [3]. We bin\nthe measurements into a 970 \u00d7 950 grid, and downsample to a 120 \u00d7 117 grid. Derivatives are\napproximated using a \ufb01nite difference scheme.\n\n[Figure 3 plot: A Comparison of MVM Scalings; axes Matrix Size and MVM Time; legend O(n^2), O(n), O(n), SE Exact, SE SKI (2D), SE SKIP (11D)]\n\nFigure 5: On the left is the true elevation map of Mount St. Helens. In the middle is the elevation\nmap calculated with the SKI. On the right is the elevation map calculated with D-SKI.\n\nWe randomly select 90% of the grid for training and the remainder for testing. We do not include\nresults for D-SE, as its kernel matrix has dimension roughly 4 \u00b7 10^4. We plot contour maps predicted\nby SKI and D-SKI in Figure 5; the latter looks far closer to the ground truth than the former. 
This\nis quanti\ufb01ed in the following table:\n\n        \u2113        s         \u03c31       \u03c32      Testing SMAE   Overall SMAE   Time[s]\nSKI     35.196   207.689   12.865   n.a.    0.0357         0.0308         37.67\nD-SKI   12.630   317.825   6.446    2.799   0.0254         0.0165         131.70\n\nTable 3: The hyperparameters of SKI and D-SKI are listed. Note that there are two different noise\nparameters \u03c31 and \u03c32 in D-SKI, for the value and gradient respectively.\n\n4.3 Implicit surface reconstruction\n\nReconstructing surfaces from point cloud data and surface normals is a standard problem in computer\nvision and graphics. One popular approach is to \ufb01t an implicit function that is zero on the surface\nwith gradients equal to the surface normal. Local Hermite RBF interpolation has been considered\nin prior work [17], but this approach is sensitive to noise. In our experiments, using a GP instead\nof splining reproduces implicit surfaces with very high accuracy. In this case, a GP with derivative\ninformation is required, as the function values are all zero.\n\nFigure 6: (Left) Original surface (Middle) Noisy surface (Right) SKI reconstruction from noisy\nsurface (s = 0.4, \u03c3 = 0.12)\n\nIn Figure 6, we \ufb01t the Stanford bunny using 25000 points and associated normals, leading to a K\u2207\nmatrix of dimension 10^5, clearly far too large for exact training. We therefore use SKI with the\nthin-plate spline kernel, with a total of 30 grid points in each dimension. The left image is a ground\ntruth mesh of the underlying point cloud and normals. The middle image shows the same mesh, but\nwith heavily noised points and normals. 
Using this noisy data, we \ufb01t a GP and reconstruct a surface\nshown in the right image, which looks very close to the original.\n\n4.4 Bayesian optimization with derivatives\n\nPrior work examines Bayesian optimization (BO) with derivative information in low-dimensional\nspaces to optimize model hyperparameters [29]. Wang et al. 
consider high-dimensional BO (without\ngradients) with random projections uncovering low-dimensional structure [26]. We propose BO with\nderivatives and dimensionality reduction via active subspaces, detailed in Algorithm 1.\n\nAlgorithm 1: BO with derivatives and active subspace learning\n\n1: while Budget not exhausted do\n2:   Calculate active subspace projection $P \\in \\mathbb{R}^{d \\times \\tilde{d}}$ using sampled gradients\n3:   Optimize acquisition function, $u_{n+1} = \\arg\\max A(u)$ with $x_{n+1} = P u_{n+1}$\n4:   Sample point $x_{n+1}$, value $f_{n+1}$, and gradient $\\nabla f_{n+1}$\n5:   Update data $D_{i+1} = D_i \\cup \\{x_{n+1}, f_{n+1}, \\nabla f_{n+1}\\}$\n6:   Update hyperparameters of GP with gradient de\ufb01ned by kernel $k(P^T x, P^T x')$\n7: end\n\nAlgorithm 1 estimates the active subspace and \ufb01ts a GP with derivatives in the reduced space. Kernel\nlearning, \ufb01tting, and optimization of the acquisition function all occur in this low-dimensional\nsubspace. In our tests, we use the expected improvement (EI) acquisition function, which involves\nboth the mean and predictive variance. We consider two approaches to rapidly evaluate the predictive\nvariance $v(x) = k(x, x) - K_{xX} \\tilde{K}^{-1} K_{Xx}$ at a test point x. In the \ufb01rst approach, which provides a\nbiased estimate of the predictive variance, we replace $\\tilde{K}^{-1}$ with the preconditioner solve computed\nby pivoted Cholesky; using the stable QR-based evaluation algorithm, we have\n\n$$v(x) \\approx \\hat{v}(x) \\equiv k(x, x) - \\sigma^{-2}\\left(\\|K_{Xx}\\|^2 - \\|Q_1^T K_{Xx}\\|^2\\right).$$\n\nWe note that the approximation $\\hat{v}(x)$ is always a (small) overestimate of the true predictive variance\nv(x). 
In the second approach, we use a randomized estimator as in [1] to compute the predictive\nvariance at many points X' simultaneously, and use the pivoted Cholesky approximation as a control\nvariate to reduce the estimator variance:\n\n$$v_{X'} = \\mathrm{diag}(K_{X'X'}) - \\mathbb{E}_z\\left[ z \\odot (K_{X'X} \\tilde{K}^{-1} K_{XX'} z - K_{X'X} M^{-1} K_{XX'} z) \\right] - \\hat{v}_{X'}.$$\n\nThe latter approach is unbiased, but gives very noisy estimates unless many probe vectors z are used.\nBoth the pivoted Cholesky approximation to the predictive variance and the randomized estimator\nresulted in similar optimizer performance in our experiments.\nTo test Algorithm 1, we mimic the experimental setup in [26]: we minimize the 5D Ackley and 5D\nRastrigin test functions [24], randomly embedded respectively in [\u221210, 15]^50 and [\u22124, 5]^50. We \ufb01x\n\u02dcd = 2, and at each iteration pick two directions in the estimated active subspace at random to be\nour active subspace projection P . We use D-SKI as the kernel and EI as the acquisition function.\nThe results of these experiments are shown in Figure 7(a) and Figure 7(b), in which we compare\nAlgorithm 1 to three other baseline methods: BO with EI and no gradients in the original space;\nmulti-start BFGS with full gradients; and random search. In both experiments, the BO variants\nperform better than the alternatives, and our method outperforms standard BO.\n\n(a) BO on Ackley\n\n(b) BO on Rastrigin\n\nFigure 7: In the following experiments, 5D Ackley and 5D Rastrigin are embedded into a 50-\ndimensional space. We run Algorithm 1, comparing it with BO exact, multi-start BFGS, and random\nsampling. 
D-SKI with active subspace learning clearly outperforms the other methods.\n\n5 Discussion\n\nWhen gradients are available, they are a valuable source of information for Gaussian process regres-\nsion; but inclusion of d extra pieces of information per point naturally leads to new scaling issues.\nWe introduce two methods to deal with these scaling issues: D-SKI and D-SKIP. Both are structured\ninterpolation methods, and the latter also uses kernel product structure. We have also discussed\npractical details: preconditioning is necessary to guarantee convergence of iterative methods and\nactive subspace calculation reveals low-dimensional structure when gradients are available. We\npresent several experiments with kernel learning, dimensionality reduction, terrain reconstruction,\nimplicit surface \ufb01tting, and scalable Bayesian optimization with gradients. For simplicity, these\nexamples all possessed full gradient information; however, our methods trivially extend if only partial\ngradient information is available.\nThere are several possible avenues for future work. D-SKIP shows promising scalability, but it also\nhas large overheads, and is expensive for Bayesian optimization as it must be recomputed from scratch\nwith each new data point. We believe kernel function approximation via Chebyshev interpolation\nand tensor approximation will likely provide similar accuracy with greater ef\ufb01ciency. Extracting\nlow-dimensional structure is highly effective in our experiments and deserves an independent, more\nthorough treatment. Finally, our work in scalable Bayesian optimization with gradients represents\na step towards the uni\ufb01ed view of global optimization methods (e.g. Bayesian optimization) and\ngradient-based local optimization methods (e.g. BFGS).\n\nAcknowledgements. 
We thank NSF DMS-1620038, NSF IIS-1563887, and Facebook Research\nfor support.\n\nReferences\n[1] Costas Bekas, E\ufb01 Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix.\n\nApplied Numerical Mathematics, 57(11-12):1214\u20131229, November 2007.\n\n[2] Einat Neumann Ben-Ari and David M Steinberg. 
Modeling data from computer experiments:\nan empirical comparison of kriging with MARS and projection pursuit regression. Quality\nEngineering, 19(4):327\u2013338, 2007.\n\n[3] Puget Sound LiDAR Consortium. Mount Saint Helens LiDAR data. University of Washington,\n\n2002.\n\n[4] Paul G. Constantine. Active subspaces: Emerging ideas for dimension reduction in parameter\n\nstudies. SIAM, 2015.\n\n[5] Kurt Cutajar, Michael Osborne, John Cunningham, and Maurizio Filippone. Preconditioning\nkernel matrices. In Proceedings of the International Conference on Machine Learning (ICML),\npages 2529\u20132538, 2016.\n\n[6] Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and Andrew G. Wilson. Scalable\nlog determinants for Gaussian process kernel learning. In Advances in Neural Information\nProcessing Systems (NIPS), pages 6330\u20136340, 2017.\n\n[7] Alexander Forrester, Andy Keane, et al. Engineering design via surrogate modelling: a practical\n\nguide. John Wiley & Sons, 2008.\n\n[8] Jacob R Gardner, Geoff Pleiss, Ruihan Wu, Kilian Q Weinberger, and Andrew Gordon Wilson.\nProduct kernel interpolation for scalable Gaussian processes. In Arti\ufb01cial Intelligence and\nStatistics (AISTATS), 2018.\n\n[9] David Gingras, Tom Lamarche, Jean-Luc Bedwani, and \u00c9rick Dupuis. Rough terrain recon-\nstruction for rover motion planning. In Proceedings of the Canadian Conference on Computer\nand Robot Vision (CRV), pages 191\u2013198. IEEE, 2010.\n\n[10] Raia Hadsell, J. Andrew Bagnell, Daniel F. Huber, and Martial Hebert. Space-carving kernels\nfor accurate rough terrain estimation. International Journal of Robotics Research, 29:981\u2013996,\nJuly 2010.\n\n[11] Insu Han, Dmitry Malioutov, and Jinwoo Shin. Large-scale log-determinant computation\nthrough stochastic Chebyshev expansions. 
In Proceedings of the International Conference on\nMachine Learning (ICML), pages 908\u2013917, 2015.\n\n[12] Helmut Harbrecht, Michael Peters, and Reinhold Schneider. On the low-rank approximation by\nthe pivoted Cholesky decomposition. Applied Numerical Mathematics, 62(4):428\u2013440, 2012.\n\n[13] James Hensman, Nicol\u00f3 Fusi, and Neil D. Lawrence. Gaussian processes for big data. In\nProceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2013.\n\n[14] Robert Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions\non Acoustics, Speech, and Signal Processing, 29(6):1153\u20131160, 1981.\n\n[15] Kurt Konolige, Motilal Agrawal, and Joan Sola. Large-scale visual odometry for rough terrain.\nIn Robotics Research, pages 201\u2013212. Springer, 2010.\n\n[16] Quoc Le, Tamas Sarlos, and Alexander Smola. Fastfood \u2013 computing Hilbert space expansions\nin loglinear time. In Proceedings of the 30th International Conference on Machine Learning,\npages 244\u2013252, 2013.\n\n[17] Ives Macedo, Joao Paulo Gois, and Luiz Velho. Hermite radial basis functions implicits.\nComputer Graphics Forum, 30(1):27\u201342, 2011.\n\n[18] David J. C. MacKay. Information theory, inference and learning algorithms. Cambridge\nUniversity Press, 2003.\n\n[19] Erik H. W. Meijering, Karel J. Zuiderveld, and Max A. Viergever. Image reconstruction by\nconvolution with symmetrical piecewise nth-order polynomial kernels. IEEE Transactions on\nImage Processing, 8(2):192\u2013201, 1999.\n\n[20] Joaquin Qui\u00f1onero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate\nGaussian process regression. Journal of Machine Learning Research, 6(Dec):1939\u20131959,\n2005.\n\n[21] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. The MIT\nPress, 2006.\n\n[22] Carl Edward Rasmussen and Zoubin Ghahramani. Occam\u2019s razor. 
In Advances in Neural\nInformation Processing Systems (NIPS), pages 294\u2013300, 2001.\n\n[23] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In\nAdvances in Neural Information Processing Systems (NIPS), pages 1257\u20131264, 2005.\n\n[24] S. Surjanovic and D. Bingham. Virtual library of simulation experiments: Test functions and\ndatasets. http://www.sfu.ca/~ssurjano, 2018.\n\n[25] Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(F (A)) via stochastic\nLanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075\u20131099,\n2017.\n\n[26] Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, Nando De Freitas, et al. Bayesian\noptimization in high dimensions via random embeddings. In Proceedings of the International\nJoint Conferences on Artificial Intelligence, pages 1778\u20131784, 2013.\n\n[27] Andrew G Wilson, Christoph Dann, Chris Lucas, and Eric P Xing. The human kernel. In\nAdvances in neural information processing systems, pages 2854\u20132862, 2015.\n\n[28] Andrew G. Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian\nprocesses (KISS-GP). Proceedings of the International Conference on Machine Learning\n(ICML), pages 1775\u20131784, 2015.\n\n[29] Jian Wu, Matthias Poloczek, Andrew G Wilson, and Peter Frazier. Bayesian optimization with\ngradients. In Advances in Neural Information Processing Systems (NIPS), pages 5273\u20135284,\n2017.\n", "award": [], "sourceid": 3437, "authors": [{"given_name": "David", "family_name": "Eriksson", "institution": "Cornell University"}, {"given_name": "Kun", "family_name": "Dong", "institution": "Cornell University"}, {"given_name": "Eric", "family_name": "Lee", "institution": "Cornell University"}, {"given_name": "David", "family_name": "Bindel", "institution": "Cornell University"}, {"given_name": "Andrew", "family_name": "Wilson", "institution": "Cornell University"}]}