{"title": "Exact Gaussian Processes on a Million Data Points", "book": "Advances in Neural Information Processing Systems", "page_first": 14648, "page_last": 14659, "abstract": "Gaussian processes (GPs) are flexible non-parametric models, with a capacity that grows with the available data.\nHowever, computational constraints with standard inference procedures have limited exact GPs to problems with fewer than about ten thousand training points, necessitating approximations for larger datasets. In this paper, we develop a scalable approach for exact GPs that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication. By partitioning and distributing kernel matrix multiplies, we demonstrate that an exact GP can be trained on over a million points, a task previously thought to be impossible with current computing hardware. Moreover, our approach is generally applicable, without constraints to grid data or specific kernel classes. Enabled by this scalability, we perform the first-ever comparison of exact GPs against scalable GP approximations on datasets with $10^4 \\!-\\! 10^6$ data points, showing dramatic performance improvements.", "full_text": "Exact Gaussian Processes on a Million Data Points\n\nKe Alexander Wang1\u2217 Geoff Pleiss1\u2217 Jacob R. Gardner2\n\nStephen Tyree3 Kilian Q. Weinberger1 Andrew Gordon Wilson1,4\n1Cornell University, 2Uber AI Labs, 3NVIDIA, 4New York University\n\nAbstract\n\nGaussian processes (GPs) are \ufb02exible non-parametric models, with a capacity that\ngrows with the available data. However, computational constraints with standard\ninference procedures have limited exact GPs to problems with fewer than about\nten thousand training points, necessitating approximations for larger datasets. 
In this paper, we develop a scalable approach for exact GPs that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication. By partitioning and distributing kernel matrix multiplies, we demonstrate that an exact GP can be trained on over a million points in less than 2 hours, a task previously thought to be impossible with current computing hardware. Moreover, our approach is generally applicable, without constraints to grid data or specific kernel classes. Enabled by this scalability, we perform the first-ever comparison of exact GPs against scalable GP approximations on datasets with $10^4$-$10^6$ data points, showing dramatic performance improvements.

1 Introduction

Gaussian processes (GPs) have seen great success in many machine learning settings, such as black-box optimization [38], reinforcement learning [6, 8], and time-series forecasting [33]. These models offer several advantages: principled uncertainty representations, model priors that require little expert intervention, and the ability to adapt to any dataset size [31, 32]. GPs are not only ideal for problems with few observations; they also have great promise to exploit the available information in increasingly large datasets, especially when combined with expressive kernels [41] or hierarchical structure [5, 35, 43, 45].
In practice, however, exact GP inference can be intractable for large datasets, as it naïvely requires $O(n^3)$ computations and $O(n^2)$ storage for $n$ training points [32]. Many approximate methods have been introduced to improve scalability, relying on mixture-of-experts models [7], inducing points [12, 37, 39, 42], random feature expansions [20, 30, 47], or stochastic variational optimization [2, 16, 17, 36, 46].
However, due to the historical intractability of training exact GPs on large datasets, it has been an open question how approximate methods compare to an exact approach when much more data is available.
In this paper, we develop a methodology to scale exact GP inference well beyond what has previously been achieved: we train a Gaussian process on over a million data points, performing predictions without approximations. Such a result would be intractable with standard implementations, which rely on the Cholesky decomposition. The scalability we demonstrate is made feasible by the recent Blackbox Matrix-Matrix multiplication (BBMM) inference procedure of Gardner et al. [11], which uses conjugate gradients and related methods to reduce GP inference to iterations of matrix multiplication. Gardner et al. [11] show that this procedure (1) achieves exponential convergence using a pivoted Cholesky preconditioner under certain conditions, (2) requires a relatively small number of conjugate gradient steps to converge for typical datasets, and (3) can more accurately solve linear systems than Cholesky-based approaches. Using BBMM, our approach is generally applicable without constraints to input grids or specific kernel families.

∗ Equal contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

By partitioning and distributing kernel matrix multiplications across GPUs, we reduce the memory requirement for GP training to $O(n)$ on an individual GPU, permitting scaling beyond $n \approx 10^4$ samples. Additionally, we introduce a number of practical heuristics to accelerate training and maximally utilize parallelization. With 8 GPUs, GPs can be trained in seconds for $n \approx 10^4$, hours for $n \approx 10^5$, and a few days for $n \approx 10^6$. In addition, we show that the training time can be further reduced by using better hyperparameter initializations.
Nevertheless, all trained models can make exact GP predictions in less than 1 second on 1 GPU using a simple caching strategy.
We benchmark on regression datasets from the UCI repository [1]. We find exact GPs offer notably better performance on these datasets, often exceeding a two-fold reduction in root-mean-squared error. The results show how non-parametric representations continue to significantly benefit from the addition of new training points, a valuable conceptual finding in favor of non-parametric approaches. These results clarify the relative performance of popular GP approximations against exact GPs in an unexplored data size regime and enable future comparisons against other GP approximations. Further, they serve as a signpost to practitioners using GPs on big data and set the stage for theorists to propose better approximations to address this gap in performance.

2 Background

Gaussian processes (GPs) are non-parametric machine learning models that place a distribution over functions $f \sim \mathcal{GP}$. The function distribution is defined by a prior mean function $\mu : \mathbb{R}^d \to \mathbb{R}$, a prior covariance function or kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, and observed (training) data $(X, y)$. The choice of $\mu$ and $k$ encodes prior information about the data. $\mu$ is typically chosen to be a constant function. Popular kernels include the RBF kernel and the Matérn kernels [32].
Notation: Throughout this paper we will use the following notation: given training inputs $X \in \mathbb{R}^{n \times d}$, $K_{XX}$ is the $n \times n$ kernel matrix containing covariance terms for all pairs of entries. The vector $k_{Xx^*}$ is a vector formed by evaluating the kernel between a test point $x^*$ and all training points. $\hat{K}_{XX}$ is a kernel matrix with added Gaussian observational noise (i.e.
$\hat{K}_{XX} = K_{XX} + \sigma^2 I$).
Training: Most kernels include hyperparameters $\theta$, such as the lengthscale, which must be fit to the training data. In regression, $\theta$ are typically learned by maximizing the GP's log marginal likelihood with gradient descent:

$$\mathcal{L} = \log p(y \mid X, \theta) \propto -y^\top \hat{K}_{XX}^{-1} y - \log |\hat{K}_{XX}|, \qquad (1)$$

$$\frac{\partial \mathcal{L}}{\partial \theta} \propto y^\top \hat{K}_{XX}^{-1} \frac{\partial \hat{K}_{XX}}{\partial \theta} \hat{K}_{XX}^{-1} y - \operatorname{tr}\left\{ \hat{K}_{XX}^{-1} \frac{\partial \hat{K}_{XX}}{\partial \theta} \right\}. \qquad (2)$$

A typical GP has very few hyperparameters to optimize and therefore requires fewer iterations of training than most parametric models.
Predictions: For a test point $x^*$, the GP predictive posterior distribution $p(f(x^*) \mid X, y)$ with a Gaussian likelihood is Gaussian with moments:

$$\mathbb{E}[f(x^*) \mid X, y] = \mu(x^*) + k_{Xx^*}^\top \hat{K}_{XX}^{-1} y, \qquad (3)$$

$$\operatorname{Var}[f(x^*) \mid X, y] = k(x^*, x^*) - k_{Xx^*}^\top \hat{K}_{XX}^{-1} k_{Xx^*}. \qquad (4)$$

Portions of these equations can be precomputed as part of training to reduce the test-time computation. In particular, (3) is reduced to an $O(n)$ matrix-vector multiplication once $\hat{K}_{XX}^{-1} y$ is computed and cached. Similar caching techniques can reduce the asymptotic time complexity of (4) as well [28].
The Cholesky decomposition is used in many GP implementations to compute $\hat{K}_{XX}^{-1} y$, $\log|\hat{K}_{XX}|$, and $\operatorname{tr}\{\hat{K}_{XX}^{-1}(\partial \hat{K}_{XX}/\partial \theta)\}$ in (1) and (2). The positive definite kernel matrix $\hat{K}_{XX}$ can be factorized into $LL^\top$, where $L$ is lower triangular. Computing $L$ requires $O(n^3)$ time and $O(n^2)$ memory. After computing this factorization, matrix solves and log determinants take $O(n^2)$ and $O(n)$ time respectively. The columns of $L = [l^{(1)} \; \cdots \; l^{(k)}]$ are computed recursively [14]. Although concurrent work by [26] used the Cholesky decomposition for large scale GP inference through distributed computing, it requires quadratic communication costs and quadratic memory. Furthermore, its recursive nature makes the Cholesky algorithm less amenable to GPU acceleration, since GPUs are designed to parallelize matrix-vector multiplications.
Conjugate gradients (CG) is an alternative method for computing $\hat{K}_{XX}^{-1} y$. CG frames $\hat{K}_{XX}^{-1} y$ as the solution to an optimization problem: $v^* = \arg\min_v \frac{1}{2} v^\top \hat{K}_{XX} v - v^\top y$, which is convex by the positive-definiteness of $\hat{K}_{XX}$. The optimization is performed iteratively, with each step requiring a matrix-vector multiplication with $\hat{K}_{XX}$. For a specified tolerance $\epsilon$ of the relative residual norm $\|\hat{K}_{XX} v^* - y\| / \|y\|$, the solution can be found in $t_\epsilon$ iterations. The exact number of iterations depends on the conditioning and eigenvalue distribution of $\hat{K}_{XX}$, but $t_\epsilon \ll n$ for reasonable values of $\epsilon$. A preconditioner is commonly used to accelerate convergence [14]. In this paper, we refer to preconditioned CG as PCG. Gardner et al. [11] demonstrate that a modified version of PCG can be used to compute all terms in (1) and (2) simultaneously. This results in an algorithm for training and predicting with GPs that requires only a routine for performing matrix-vector products with the kernel matrix.

3 Method

To perform exact Gaussian process inference on large datasets, we must overcome the time and space requirements of solving linear systems.
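As a point of reference, the Cholesky-based pipeline from Section 2 can be sketched in a few lines. This is a minimal NumPy illustration only; the RBF kernel, lengthscale, and noise values below are our own illustrative assumptions, not the paper's Matérn setup or implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2)); illustrative choice only
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def cholesky_gp(X, y, X_test, noise=1e-4):
    """O(n^3) time / O(n^2) memory exact GP inference via a Cholesky factorization."""
    n = X.shape[0]
    K_hat = rbf_kernel(X, X) + noise * np.eye(n)          # K_hat = K_XX + sigma^2 I
    L = np.linalg.cholesky(K_hat)                         # K_hat = L L^T: the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_hat^{-1} y via two triangular solves
    log_det = 2.0 * np.log(np.diag(L)).sum()              # log |K_hat| from diag(L)
    log_marginal = -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))  # cf. Eq. (1)
    K_star = rbf_kernel(X_test, X)
    mean = K_star @ alpha                                 # Eq. (3), with a zero prior mean
    v = np.linalg.solve(L, K_star.T)
    var = np.diag(rbf_kernel(X_test, X_test)) - (v**2).sum(axis=0)  # Eq. (4)
    return mean, var, log_marginal

X = np.linspace(0, 1, 20)[:, None]
y = np.sin(4 * X[:, 0])
mean, var, lml = cholesky_gp(X, y, X, noise=1e-4)  # posterior nearly interpolates y
```

With small noise, the posterior mean nearly interpolates the training targets; the $n \times n$ factor `L` is exactly the object whose storage and recursive computation the remainder of this section seeks to avoid.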
Most GP implementations use the Cholesky decomposition to solve linear systems required for inference [32]. The $O(n^3)$ time complexity of the decomposition makes it difficult to perform exact GP inference on datasets with $n > 10^4$ data points without distributed computing and its associated communication overhead. In addition to this limitation, the Cholesky decomposition requires $O(n^2)$ memory to store the lower-triangular factor $L$ in addition to the kernel matrix itself. At $n$ = 500,000, the decomposition requires a full terabyte of memory and a prohibitively large amount of computational resources. Concurrent work by Nguyen et al. [26] was limited to exact GPs with $n \le$ 120,000 due to these drawbacks.
To address the above challenges, we build on Gardner et al. [11] and use preconditioned conjugate gradients (PCG) to solve linear systems. We overcome the memory limitations by partitioning the kernel matrix to perform all matrix-vector multiplications (MVMs) without ever forming the kernel matrix explicitly, reducing the memory requirement to $O(n)$. In addition, we parallelize partitioned MVMs across multiple GPUs to further accelerate the computations, making training possible and timely even for datasets with $n > 10^6$.
O(n) memory MVM-based inference. The primary input to the modified PCG algorithm of Gardner et al. [11] is mvm_$\hat{K}_{XX}$, a black-box function that performs MVMs using the kernel matrix $\hat{K}_{XX}$. Besides the storage cost associated with mvm_$\hat{K}_{XX}$, each iteration of PCG updates four vectors: $u$ (the current solution), $r$ (the current error), $p$ (the "search" direction for the next solution), and $z$ (a preconditioned error term). Storing these vectors requires exactly $4n$ space. The quadratic space cost associated with PCG-based GP inference only comes from computing mvm_$\hat{K}_{XX}$.
Typically in the full GP setting, mvm_$\hat{K}_{XX}$ is implemented by first computing the full $n \times n$ kernel matrix $\hat{K}_{XX}$, then computing the matrix-vector product with the full matrix. However, this would have the same $O(n^2)$ memory requirement as Cholesky-based GP inference. Although forming $\hat{K}_{XX}$ requires $O(n^2)$ memory, the result of the MVM $\hat{K}_{XX} v$ requires only $O(n)$ memory. Therefore, we reduce the memory requirement to $O(n)$ by computing $\hat{K}_{XX} v$ in separate constant-sized pieces.
Partitioned kernel MVMs. To compute $\hat{K}_{XX} v$ in pieces, we partition the kernel matrix $\hat{K}_{XX}$ such that we only store a constant number of rows at any given time. With the $4n$ memory requirement of storing the PCG vectors, our approach requires only $O(n)$ memory.
We first partition the data matrix with $n$ points in $d$ dimensions, $X \in \mathbb{R}^{n \times d}$, into $p$ partitions, each of which contains roughly $n/p$ data points:

$$X = [X^{(1)}; \; \cdots; \; X^{(p)}],$$

where we use ";" to denote row-wise concatenation. For each $X^{(l)}$, we can compute $\hat{K}_{X^{(l)}X}$, which is a roughly $(n/p) \times n$ kernel matrix between the partition $X^{(l)}$ and the full data $X$. By partitioning the kernel matrix this way, we rewrite it as a concatenation of the $p$ partitions:

$$\hat{K}_{XX} = [\hat{K}_{X^{(1)}X}; \; \cdots; \; \hat{K}_{X^{(p)}X}].$$

Computing each partition requires access to the full training set $X$, which we assume fits in memory. However, each partition $\hat{K}_{X^{(l)}X}$ contains only $1/p$ of the entries of the full kernel matrix. Rewriting the matrix-vector product $\hat{K}_{XX} v$ in terms of these partitions,

$$\hat{K}_{XX} v = [\hat{K}_{X^{(1)}X} v; \; \cdots; \; \hat{K}_{X^{(p)}X} v],$$

we see that this matrix-vector product can be computed in smaller components by separately computing each $\hat{K}_{X^{(l)}X} v$ and concatenating the results. We discard each kernel partition $\hat{K}_{X^{(l)}X}$ once its MVM has been computed. This partitioning requires access to the training data $X$ and vector $v$ already in memory and only allocates new memory to temporarily store the output vector $z$ and a $(n/p) \times n$ kernel matrix partition $\hat{K}_{X^{(l)}X}$. This algorithm allows us to reduce memory usage in exchange for sequential but easily parallelizable computations. If $p = 1$ then we have the naïve $O(n^2)$ memory MVM procedure. As $p \to n$, PCG will only require $O(n)$ memory. In practice, we set a constant number of rows per partition according to the amount of memory available, rather than the number of partitions $p$. By keeping a partition in memory only until its component of the MVM has been computed, we can train GPs with an $O(n)$ memory requirement.
Distributed MVMs in Parallel. MVM-based inference can easily take advantage of multiple GPUs or distributed computational resources, because each MVM $\hat{K}_{X^{(l)}X} v$ can be performed on a different device. Thus we can compute multiple such MVMs in parallel to attain wall-clock speedups proportional to the number of devices available on large data sets where computation time exceeds the distributed computing overhead. Although $O(n)$ memory is achievable by setting $p = O(n)$, in practice one may prefer $O(n^2/p)$ memory to more effectively accelerate MVMs on parallel hardware with the necessary memory resources.
Additionally, we note that distributed parallel MVMs require only $O(n)$ communication. Each partitioned matrix multiplication only has to supply each device with a new right-hand-side vector $v$. Finally, if $w$ devices are used, the output from each device will be a vector of length $n/pw$. Thus only $O(n)$ data are copied to or from the devices. In contrast, distributing the Cholesky decomposition across multiple devices would require $O(n^2)$ communication [15, 26].
Distributed computations have been utilized for approximate GP inference through mixture-of-experts GP models [7]. Concurrent with the Cholesky-based approach by Nguyen et al. [26], our method is the first to parallelize exact Gaussian process inference through distributed computing.
Predictions. At inference time, we must compute the predictive mean and variance given in (3) and (4). Although the predictive mean contains a linear solve $\hat{K}_{XX}^{-1} y$, this solve depends only on the training data. The result of this solve can be stored in a linear vector $a$ and used for subsequent predictions. Therefore, computing the predictive mean requires computing $\hat{K}_{x^*X}\, a$. Because this equation involves no linear solves, it can be computed efficiently on a single GPU.
Similarly, a training-data-dependent cache can be computed for the predictive variance using the method of Pleiss et al. [28] with a satisfactorily tight convergence tolerance. On exact GPs, this approach affords $O(n)$ predictive variance computations,² removing the need to perform a linear solve at test time.
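A minimal single-process sketch of these ideas (partitioned MVMs driving a plain CG solve, whose result is then cached for solve-free predictive means) follows. The RBF kernel, partition size, and unpreconditioned CG here are our own illustrative simplifications of the paper's multi-GPU PCG, not its implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def mvm_K_hat(X, v, noise=0.1, rows_per_partition=64):
    """Compute (K_XX + sigma^2 I) v one (n/p) x n slab at a time: O(n) extra memory.

    Each slab K_{X^(l) X} is discarded as soon as its piece of the product is
    written, and the loop iterations are independent, so in a distributed
    setting each slab could be sent to a different device.
    """
    n = X.shape[0]
    out = np.empty(n)
    for start in range(0, n, rows_per_partition):
        stop = min(start + rows_per_partition, n)
        slab = rbf_kernel(X[start:stop], X)            # partition K_{X^(l) X}
        out[start:stop] = slab @ v + noise * v[start:stop]
    return out

def cg_solve(mvm, b, tol=1e-10, max_iter=1000):
    """Plain conjugate gradients on K_hat v = b, touching K_hat only through `mvm`."""
    v = np.zeros_like(b)
    r = b.copy()          # residual b - K_hat v
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(max_iter):
        Kp = mvm(p)
        alpha = rs / (p @ Kp)
        v += alpha * p
        r -= alpha * Kp
        rs_new = r @ r
        if np.sqrt(rs_new) / np.linalg.norm(b) < tol:  # relative residual criterion
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 3))
y = np.sin(X).sum(-1)

# One-time precomputation: a = K_hat^{-1} y via partitioned-MVM CG.
a = cg_solve(lambda v: mvm_K_hat(X, v), y)

# Every later predictive mean is a single solve-free MVM, cf. Eq. (3):
X_test = rng.uniform(size=(5, 3))
mean = rbf_kernel(X_test, X) @ a
```

The kernel matrix never exists in full: only one slab of at most `rows_per_partition` rows is alive at any moment, which is the memory/compute trade controlled by the partition size discussed above.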
In practice, we observe that both the predictive mean and variance can be computed in less than a second on a single GPU, even if the full model required days to train on multiple GPUs. Because predictions are fast after these precomputations, we can afford to use more stringent criteria for CG convergence for these one-time precomputations.
Preconditioning. To accelerate the convergence of CG, Gardner et al. [11] introduced a preconditioner for $\hat{K}_{XX}$ derived from its partial pivoted Cholesky decomposition. Preconditioning works by modifying CG to solve the related linear system $P^{-1} K_{XX} v = P^{-1} y$ instead of solving the original system $K_{XX} v = y$. These linear systems have the same solution $v^*$. However, the number of CG iterations required depends on the eigenvalue distribution of $P^{-1} K_{XX}$ rather than that of $K_{XX}$. Computing a rank $k$ pivoted Cholesky preconditioner requires only $k$ kernel matrix rows: an already $O(n)$ space dependence. While each iteration of CG requires computing each kernel matrix partition from scratch, the preconditioner is computed once before any iterations of CG are performed. Therefore, it can be efficient to increase the size of the preconditioner to an extent if it reduces the number of CG iterations. While in Gardner et al. [11] the preconditioner size is typically limited to under 20 by default, in our use case we found that preconditioners of up to size $k = 100$ provide a noticeable improvement to wall-clock speed for large datasets.

² We do not achieve constant-time predictions as described in [28]. Reducing the $O(n)$ prediction time to $O(1)$ requires using structured kernel interpolation [42] to approximate the kernel matrix.

PCG Convergence Criteria. Importantly, conjugate gradients is not an approximate method for performing linear solves. Rather, it is a method that consumes time to perform solves to a specified tolerance. If this tolerance is low, the solve is exact.
Thus, it is analogous to using gradient descent to solve a convex optimization problem (and that is in fact largely what is happening). This gives us the ability to investigate how the performance of exact GPs changes with different degrees of convergence. At test time, we find that an accurate solve of $\hat{K}_{XX}^{-1} y$ (with tolerance $\epsilon \le 0.01$) is critical for good predictive performance; we therefore find that GP predictions require exact solves. For hyperparameter training, however, we find that, interestingly, less strict convergence criteria suffice, and even a looser convergence criterion of up to $\epsilon = 1$ has little impact on final model performance. Given that predictions using our approach are highly efficient (see Table 2), it may be interesting to investigate alternative approximate methods for finding good hyperparameters, and then using the techniques in this paper for exact inference and predictions.

4 Related Work

MVM-based GP inference. Conjugate gradients and related MVM-based algorithms [9, 13, 40] have been used in certain settings throughout the GP literature. However, these methods have typically been used when the kernel matrix is structured and affords fast matrix-vector multiplications. Cunningham et al. [3] note that CG reduces asymptotic complexity when the data lie on a regularly-spaced grid, because the kernel matrix is structured and affords $O(n \log n)$ MVMs. This idea was extended to multi-dimensional grids by Saatçi [34]. Wilson and Nickisch [42] introduce a general-purpose GP approximation specifically designed for CG. They combine a structured inducing point matrix with sparse interpolation for approximate kernel matrices with nearly-linear MVMs.
More recently, there has been a push to use MVM-based methods on exact GPs. Cutajar et al. [4] use conjugate gradients to train exact GPs on datasets with up to 50,000 points. The authors
The authors\ninvestigate using off-the-shelf preconditioners and develop new ones based on inducing-point kernel\napproximations.\nApproximate GP methods. There are several approximations to GP inference that require \u2264 O(n2)\nmemory and scale to large datasets. Perhaps the most common class of approaches are inducing\npoint methods [29, 37], which introduce a set of m (cid:28) n data points Z to form a low-rank kernel\napproximation:\n\nKXX \u2248 KXZK\n\n\u22121\nZZ KZX .\n\nTraining and predictions with this approximation take O(nm2) time and O(nm) space. Here we\nhighlight some notable variants of the basic approach, though it is by no means an exhaustive list\n\u2013 see [22] for a more thorough review. Sparse Gaussian process regression (SGPR) [39] selects\nthe inducing points Z through a regularized objective. Structured kernel interpolation (SKI) [42]\nand its variants [12] place the inducing points on a grid, in combination with sparse interpolation,\nfor O(n + g(m)) computations and memory, where g(m) \u2248 m. Stochastic variational Gaussian\nprocesses (SVGP) [16] introduce a set of variational parameters that can be optimized using minibatch\ntraining. Recent work has investigated how to scale up the number of inducing points using tensor\ndecompositions [10, 18].\n\n5 Results\n\nWe compare the performance of exact Gaussian processes against widely-used scalable GP ap-\nproximation methods on a range of large-scale datasets from the UCI dataset repository [1]. Our\nexperiments demonstrate that exact GPs: (1) outperform popular approximate GPs methods on nearly\nall benchmarking datasets in our study; (2) compute thousands of test-point predictions in less than\na second, even when n > 106; (3) utilize all available data when making predictions, even when\nn > 105; and (4) achieve linear training speedups on large datasets by adding additional GPU devices.\n\n5\n\n\fBaselines. 
We compare against two scalable GP approximations: Sparse Gaussian Process Regression (SGPR) [23, 39] and Stochastic Variational Gaussian Processes (SVGP) [16]. We choose these methods due to their popularity and general applicability, enabling a comparison over a wide range of datasets. SGPR is an inducing point method where the inducing points are learned through a variational objective. We use $m = 512$ for SGPR and $m = 1{,}024$ for SVGP, which are common values used for these methods [24]. We later experiment with varying the number of inducing points.
Experiment details. We extend the GPyTorch library [11] to perform all experiments. Each dataset is randomly split into 4/9 training, 2/9 validation, and 3/9 testing sets. We use the validation set for tuning parameters like the CG training tolerance. The data is whitened to be mean 0 and standard deviation 1 as measured by the training set. We use a constant prior mean and a Matérn 3/2 kernel. We benchmark GPs with shared lengthscales across the input dimension in Table 1, as well as GPs with independent lengthscales in the appendix.

Figure 1: A comparison of exact GPs trained using our initialization procedure against exact GPs trained for 100 iterations using Adam. Better initialization allows exact GPs to achieve similar RMSEs while requiring drastically less training time on large datasets.

We learn model hyperparameters and variational parameters by optimizing the log marginal likelihood. For SGPR, we perform 100 iterations of Adam with a learning rate of 0.1. For SVGP, we perform 100 epochs of Adam with a minibatch size of 1,024 and a learning rate of 0.01, which we found to perform better than 0.1. For exact GPs, the number of optimization steps has the greatest effect on the training time for large datasets.
To reduce the training time for exact GPs, we first randomly subsample 10,000 training points from the full training set to fit an exact GP whose hyperparameters will be used as initialization. We pretrain on this subset with 10 steps of L-BFGS [21] and 10 steps of Adam [19] with 0.1 step size before using the learned hyperparameters to take 3 steps of Adam on the full training dataset. Figure 1 shows that this initialization plus fine-tuning procedure achieves comparable test performance to running Adam for the full 100 iterations without pretraining. We do not pretrain the SGPR and SVGP models because we found that they required a significant number of fine-tuning steps after pretraining due to their increased number of model parameters. We show additional training statistics for exact GPs trained with 100 steps of Adam in the appendix.
For all experiments, we use a rank-100 partial pivoted-Cholesky preconditioner and run PCG with a tolerance of $\epsilon = 1$ during training. We constrain the learned noise to be at least 0.1 to regularize the poorly conditioned kernel matrix for the houseelectric dataset. We perform all training on a single machine with 8 NVIDIA Tesla V100-SXM2-32GB-LS GPUs. Code to reproduce the experiments is available at https://gpytorch.ai.
Accuracy. Table 1 displays the accuracy and negative log likelihoods of exact GPs and approximate methods on several large-scale datasets using a single lengthscale across dimensions. The results for independent-lengthscale GPs can be found in the appendix. We find that exact GPs achieve lower error than approximate methods on nearly every dataset. Notably, on certain datasets like Kin40K and CTslice, exact GPs achieve a half or even a quarter of the error of some approximate methods, and consistently outperform the approximate methods even on datasets with over 1M data points. Although Nguyen et al.
[26] show results for exact GPs on $n <$ 120,000, this is the first set of results comparing exact GPs to approximate GPs on $n \gg 10^5$.

Table 1: Root-mean-square error (RMSE) and negative log-likelihood (NLL) of exact GPs and approximate GPs on UCI regression datasets using a constant prior mean and a Matérn 3/2 kernel with a shared lengthscale across all dimensions. All results were averaged over 3 trials with different splits. $n$ and $d$ are the size and dimensionality of the training dataset, respectively. The number of GPUs used and the number of kernel partitions are reported in Table 2.
We were unable to scale SGPR to HouseElectric due to its memory requirements when m = 512.

Dataset         n          d     RMSE, Exact GP    RMSE, SGPR       RMSE, SVGP       NLL, Exact GP     NLL, SGPR         NLL, SVGP
                                 (BBMM)            (m = 512)        (m = 1,024)      (BBMM)            (m = 512)         (m = 1,024)
-----------------------------------------------------------------------------------------------------------------------------------
PoleTele        9,600      26    0.151 ± 0.012     0.217 ± 0.002    0.215 ± 0.002    −0.180 ± 0.036    −0.094 ± 0.008    −0.001 ± 0.008
Elevators       10,623     18    0.394 ± 0.006     0.437 ± 0.018    0.399 ± 0.009    0.580 ± 0.060     0.519 ± 0.022     0.619 ± 0.054
Bike            11,122     17    0.220 ± 0.002     0.362 ± 0.004    0.303 ± 0.004    0.272 ± 0.018     0.291 ± 0.032     0.119 ± 0.044
Kin40K          25,600     8     0.099 ± 0.001     0.273 ± 0.025    0.268 ± 0.022    −0.258 ± 0.084    0.087 ± 0.067     0.236 ± 0.077
Protein         29,267     9     0.536 ± 0.012     0.656 ± 0.010    0.668 ± 0.005    1.035 ± 0.006     1.018 ± 0.056     0.970 ± 0.010
KeggDirected    31,248     20    0.086 ± 0.005     0.104 ± 0.003    0.096 ± 0.001    −0.199 ± 0.381    −1.123 ± 0.016    −0.940 ± 0.020
CTslice         34,240     385   0.262 ± 0.448     0.218 ± 0.011    1.003 ± 0.005    −0.894 ± 0.188    −0.073 ± 0.097    1.422 ± 0.005
KEGGU           40,708     27    0.118 ± 0.000     0.130 ± 0.001    0.124 ± 0.002    −0.419 ± 0.027    −0.984 ± 0.012    −0.666 ± 0.007
3DRoad          278,319    3     0.101 ± 0.007     0.661 ± 0.010    0.481 ± 0.002    0.909 ± 0.001     0.943 ± 0.002     0.697 ± 0.002
Song            329,820    90    0.807 ± 0.024     0.803 ± 0.002    0.998 ± 0.000    1.417 ± 0.000     1.206 ± 0.024     1.213 ± 0.003
Buzz            373,280    77    0.288 ± 0.018     0.300 ± 0.004    0.304 ± 0.012    0.106 ± 0.008     0.267 ± 0.028     0.224 ± 0.050
HouseElectric   1,311,539  9     0.055 ± 0.000     ——               0.084 ± 0.005    −0.152 ± 0.001    ——                −1.010 ± 0.039

Table 2: Timing results for training and prediction for exact GPs and approximate GPs. Training times were recorded using the same hardware and other experimental details as in Table 1. Except for †, all results were averaged over 3 trials with different splits. $p$ is the number of kernel partitions used to train the exact GP. Prediction times were measured by computing 1,000 predictive means and variances on 1 NVIDIA RTX 2080 Ti GPU. An asterisk (*) indicates the one-time pre-computed cache was calculated using 8 V100 GPUs. Lower is better.

Dataset         Train, Exact GP    Train, SGPR        Train, SVGP         Precompute,   Predict,   Predict,   Predict,   #GPUs   p
                (BBMM)             (m = 512)          (m = 1,024)         Exact GP      Exact GP   SGPR       SVGP
-----------------------------------------------------------------------------------------------------------------------------------
PoleTele        41.5s ± 1.1        69.5s ± 20.5       68.7s ± 4.1         5.14 s        6 ms       6 ms       273 ms     1       1
Elevators       41.0s ± 0.7        69.7s ± 22.5       76.5s ± 5.5         0.95 s        7 ms       7 ms       212 ms     1       1
Bike            41.2s ± 0.9        70.0s ± 22.9       77.1s ± 5.6         0.38 s        7 ms       9 ms       182 ms     1       1
Kin40K          42.7s ± 2.7        97.3s ± 57.9       195.4s ± 14.0       12.3 s        11 ms      12 ms      220 ms     1       1
Protein         47.9s ± 10.1       136.5s ± 53.8      198.3s ± 15.9       7.53 s        14 ms      9 ms       146 ms     1       1
KeggDirected    51.0s ± 6.3        132.0s ± 65.6      228.2s ± 22.9       8.06 s        15 ms      16 ms      143 ms     1       1
CTslice         199.0s ± 299.9     129.6s ± 59.2      232.1s ± 20.5       7.57 s        22 ms      14 ms      133 ms     1       1
KEGGU           47.4s ± 8.6        133.4s ± 62.7      287.0s ± 24.1       18.9 s        18 ms      13 ms      211 ms     8       1
3DRoad          947.8s ± 443.8     720.5s ± 330.4     2045.1s ± 191.4     118 m*        119 ms     68 ms      130 ms     8       16
Song            253.4s ± 221.7     473.3s ± 187.5     2373.3s ± 184.9     22.2 m*       123 ms     99 ms      134 ms     8       16
Buzz            4283.6s ± 1407.2   1754.8s ± 1099.6   2780.8s ± 175.6     42.6 m*       131 ms     114 ms     142 ms     8       19
HouseElectric   4317.3s ± 147.2    ——                 22062.6s ± 282.0    3.40 hr*      958 ms     ——         166 ms     8       218

Interestingly, the size or the dimensionality of the dataset does not seem to influence the relative performance of the approximate methods. For example, though Protein and Kin40K are similar in size and have almost the same dimensionality, the approximate methods perform worse on Kin40K (relative to the RMSE of exact GPs). Moreover, we also see that the choice of approximate method affects performance, with neither approximate method consistently outperforming the other.
Training time. Table 2 displays the training time for exact and approximate GPs. On datasets with $n <$ 35,000, an exact GP fits on a single 32GB GPU without any partitioning and can be trained in less than a minute. For $n \ge$ 100,000, we must use kernel partitioning as discussed above, which significantly increases the training time for exact GPs. However, if the necessary computational resources are available, these experiments show that it may be preferable to train an exact GP to make more accurate predictions in exchange for longer training times.
Training acceleration with multiple GPUs. Because we use matrix-multiplication-based approaches to train exact GPs, the computations can be easily parallelized and distributed. Moreover, matrix multiplication is one of the most commonly distributed routines, so parallelized GP implementations can be built using readily-available routines in libraries like PyTorch [27]. Figure 2 plots the speedup as more GPUs are used for training on the KEGGU, 3DRoad, Song, and Buzz datasets. Each of these datasets achieves a nearly linear speedup when adding up to 4 GPUs. The speedup is more pronounced for the two large datasets (3DRoad and Song) that require kernel partitioning.
The training time can be further improved by using more GPUs to reduce the number of kernel partitions.

Figure 2: Training speedup from using additional GPUs at training time. Since our exact GPs use matrix-multiplication based inference, they achieve a near linear speedup with more computing resources on large datasets.

Figure 4: Test root-mean-square error (RMSE) as a function of subsampled dataset size (lower is better). Subsampled exact GPs outperform approximate GPs even with a quarter of the training set. Exact GP error continues to decrease as data is added until the full dataset is used.

Prediction time. Although exact GPs take longer to train, we find that their speed is comparable to approximate methods at test time. After training various GPs, we follow the common practice of precomputing the work required for GP predictions [28]. Table 2 displays the time to compute 1,000 predictive means and variances at test time before and after precomputation. All predictions are made on one NVIDIA RTX 2080 Ti GPU. We see exact GPs take less than a second for all predictions across all dataset sizes used.

5.1 Ablation Studies

With our method, we can better understand how exact GPs and approximate GPs scale to datasets with n ≫ 10^4. Here, we demonstrate how the amount of data affects exact GP performance, and how the number of inducing points affects the performance of approximate GPs.
Do GPs need the entire dataset? As a non-parametric model, Gaussian processes naturally adapt to the amount of training data available [44]. Figure 4 shows an increase in accuracy as we increase the amount of training data on the KEGGU, 3DRoad, and Song datasets. For each dataset, we subsample a fraction of the training data and plot the resulting root-mean-square error on the test set as a function of subsampled training set size. We use the same 1/3 holdout of the full dataset to test in each case.
As expected, the test RMSE decreases monotonically as we increase the subsample size. Figure 4 also shows the performance of exact GP, SGPR, and SVGP trained on the entire training set. Strikingly, in all three cases, an exact GP with less than a quarter of the training data outperformed approximate GPs trained on the entire training set. Furthermore, test error continues to decrease with each addition of training data.

Figure 3: Error of SVGP and SGPR methods as a function of the number of inducing points (m). Both methods scale cubically with m. We were unable to run SGPR with more than 1,024 inducing points on a single GPU. Exact GPs have lower error than both methods.

Would more inducing points help? In Table 2, exact GPs uniformly take substantially longer to train on the largest datasets than the approximate methods. This naturally raises the question: "can approximate models with more inducing points recover the performance of exact methods?" We plot test RMSE on two datasets, Bike and Protein, as a function of the number of inducing points in Figure 3. The test RMSE of both inducing point methods saturates on both datasets well above the test RMSE of an exact GP.
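This saturation is intrinsic to the rank-m structure of inducing-point approximations. As a minimal illustration (a Nyström-style low-rank reconstruction of the kernel matrix, not the actual SGPR or SVGP training objective; the data and kernel settings are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))            # n = 500 inputs

def rbf(A, B):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

K = rbf(X, X)                                    # exact n x n kernel matrix
errs = []
for m in (8, 32, 128):
    Z = X[rng.choice(len(X), size=m, replace=False)]    # m inducing inputs
    Knm, Kmm = rbf(X, Z), rbf(Z, Z) + 1e-8 * np.eye(m)  # jitter for stability
    K_hat = Knm @ np.linalg.solve(Kmm, Knm.T)           # rank-m Nystrom surrogate
    errs.append(np.linalg.norm(K - K_hat) / np.linalg.norm(K))
    print(f"m = {m:>3}: relative error = {errs[-1]:.4f}")
```

Growing m shrinks the reconstruction error, but each increment costs more compute, which is the trade-off discussed next.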
Furthermore, we note that using m inducing points introduces an m × m matrix and O(nm^2 + m^3) time complexity [16, 17], which makes it difficult to train SGPR with m ≫ 1,024 inducing points on one GPU. It is possible to combine kernel partitioning with inducing-point methods to utilize even larger values of m. However, as Figure 3 and Table 1 show, it may be preferable to use the extra computational resources to train an exact GP on more data rather than to train an approximate GP with more inducing points.

6 Discussion

Historically, for Gaussian processes, "a large dataset is one that contains over a few thousand data points" [16], and bigger datasets have traditionally necessitated scalable approximations. In this paper, we have extended the applicability of exact GPs far beyond what has been thought possible, to datasets with over a million training examples, through MVM-based GP inference. Our approach uses easily parallelizable routines that fully exploit modern parallel hardware and distributed computing. In our comparisons, we find that exact GPs are more widely applicable than previously thought, performing significantly better than approximate methods on large datasets, while requiring fewer design choices.
Is CG still exact? In the GP literature, exact GP inference typically refers to using the Cholesky decomposition with exact kernels [32].
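The MVM-based alternative replaces that factorization with iterative linear solves. A minimal sketch of conjugate gradients run to a pre-specified residual tolerance (a generic textbook CG on a toy kernel system, not GPyTorch's implementation):

```python
import numpy as np

def cg_solve(mvm, b, tol=1e-10, max_iters=1000):
    """Solve A x = b for symmetric positive definite A, accessing A only
    through the matrix-vector product `mvm` (A is never factorized)."""
    x = np.zeros_like(b)
    r = b - mvm(x)                     # residual
    p = r.copy()                       # search direction
    rs = r @ r
    for _ in range(max_iters):
        Ap = mvm(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # stop once residual meets tolerance
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# A small SPD system standing in for (K + sigma^2 I) v = y:
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # RBF kernel
A = K + 0.1 * np.eye(50)
y = rng.standard_normal(50)

v = cg_solve(lambda u: A @ u, y)
print(np.linalg.norm(A @ v - y) < 1e-8)   # True: solve meets the requested tolerance
```

Tightening `tol` drives the iterative solution arbitrarily close to the direct solve, which is the sense of "exact" at issue below.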
A natural question to ask is whether we can consider our approach "exact," given that CG performs solves only up to a pre-specified error tolerance. However, unlike the approximate methods presented in this paper, the difference between a CG-based model and a theoretical model with "perfect" solves can be precisely controlled by this error tolerance. We therefore consider CG exact in a sense commonly used in mathematical optimization: namely, that it computes solutions up to arbitrary precision. In fact, CG-based methods can often be more precise than Cholesky-based approaches in floating-point arithmetic due to fewer round-off errors [11].
When to approximate? There are many approximate methods for scalable Gaussian processes, with varying statistical properties, advantages, and application regimes. We chose to compare exact GPs to the approximate methods SVGP and SGPR because of their popularity and available GPU implementations. There may be regimes where other approximate methods, or combinations of methods, outperform these two approximations. Our objective is not to perform an exhaustive study of approximate methods and their relative strengths, but to highlight that such comparisons are now possible with modern hardware.
Indeed, there are cases where an approximate GP method might still be preferable. Examples include training on large datasets with limited computational resources. In certain regimes, such as low-dimensional spaces, there are approximations designed to achieve high degrees of accuracy in less time than exact GPs. Additionally, GP inference with non-Gaussian likelihoods (such as for classification) requires an approximate inference strategy.
Some approximate inference methods, such as Laplace and MCMC [25, 32], may be amenable to the parallelization approaches discussed here for approximate inference with exact kernels.
Nonetheless, with efficient utilization of modern hardware, exact Gaussian processes are now an appealing option on substantially larger datasets than previously thought possible. Exact GPs are powerful yet simple, achieving remarkable accuracy without requiring much expert intervention. We expect exact GPs to become ever more scalable and accessible with continued advances in hardware design.

Acknowledgments

KAW and AGW are supported by NSF IIS-1910266, NSF IIS-1563887, Facebook Research, NSF I-DISRE 1934714, and an Amazon Research Award. GP and KQW are supported in part by the III-1618134, III-1526012, IIS-1149882, IIS-1724282, and TRIPODS-1740822 grants from the National Science Foundation. In addition, they are supported by the Bill and Melinda Gates Foundation, the Office of Naval Research, and SAP America Inc.

References

[1] A. Asuncion and D. Newman. UCI machine learning repository. https://archive.ics.uci.edu/ml/, 2007. Last accessed: 2018-02-05.

[2] C.-A. Cheng and B. Boots. Variational inference for gaussian process models with linear complexity. In NeurIPS, pages 5184–5194, 2017.

[3] J. P. Cunningham, K. V. Shenoy, and M. Sahani. Fast gaussian process methods for point process intensity estimation. In ICML, pages 192–199. ACM, 2008.

[4] K. Cutajar, M. Osborne, J. Cunningham, and M. Filippone. Preconditioning kernel matrices. In ICML, 2016.

[5] A. Damianou and N. Lawrence. Deep gaussian processes. In AISTATS, pages 207–215, 2013.

[6] M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In ICML, pages 465–472, 2011.

[7] M. P. Deisenroth and J. W. Ng. Distributed gaussian processes. In ICML, pages 1481–1490, 2015.

[8] M.
P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.

[9] K. Dong, D. Eriksson, H. Nickisch, D. Bindel, and A. G. Wilson. Scalable log determinants for gaussian process kernel learning. In NeurIPS, 2017.

[10] T. Evans and P. Nair. Scalable gaussian processes with grid-structured eigenfunctions (GP-GRIEF). In ICML, pages 1416–1425, 2018.

[11] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix gaussian process inference with GPU acceleration. In NeurIPS, pages 7587–7597, 2018.

[12] J. R. Gardner, G. Pleiss, R. Wu, K. Q. Weinberger, and A. G. Wilson. Product kernel interpolation for scalable gaussian processes. In AISTATS, 2018.

[13] M. Gibbs and D. J. MacKay. Efficient implementation of gaussian processes. Technical report, 1997.

[14] G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.

[15] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms for cholesky factorization on distributed memory using packed storage. In International Workshop on Applied Parallel Computing, pages 550–559. Springer, 2006.

[16] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In UAI, 2013.

[17] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational gaussian process classification. In AISTATS, 2015.

[18] P. Izmailov, A. Novikov, and D. Kropotov. Scalable gaussian processes with billions of inducing inputs via tensor train decomposition. In ICML, pages 726–735, 2018.

[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[20] Q. Le, T. Sarlós, and A. Smola. Fastfood: approximating kernel expansions in loglinear time. In ICML, 2013.

[21] D. C. Liu and J. Nocedal.
On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, Aug. 1989. ISSN 0025-5610. doi: 10.1007/BF01589116. URL https://doi.org/10.1007/BF01589116.

[22] H. Liu, Y.-S. Ong, X. Shen, and J. Cai. When gaussian process meets big data: A review of scalable gps. arXiv preprint arXiv:1807.01065, 2018.

[23] A. G. d. G. Matthews. Scalable Gaussian process inference using variational methods. PhD thesis, University of Cambridge, 2016.

[24] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A gaussian process library using tensorflow. Journal of Machine Learning Research, 18(40):1–6, 2017.

[25] I. Murray, R. Prescott Adams, and D. J. MacKay. Elliptical slice sampling. In AISTATS, 2010.

[26] D.-T. Nguyen, M. Filippone, and P. Michiardi. Exact gaussian process regression with distributed computations. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pages 1286–1295. ACM, 2019.

[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

[28] G. Pleiss, J. R. Gardner, K. Q. Weinberger, and A. G. Wilson. Constant-time predictive distributions for gaussian processes. In ICML, 2018.

[29] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

[30] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NeurIPS, pages 1177–1184, 2008.

[31] C. E. Rasmussen and Z. Ghahramani. Occam's razor. In NeurIPS, pages 294–300, 2001.

[32] C. E. Rasmussen and C. K. Williams. Gaussian processes for machine learning, volume 1. MIT Press, Cambridge, 2006.

[33] S. Roberts, M. Osborne, M. Ebden, S. Reece, N.
Gibson, and S. Aigrain. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984), 2013.

[34] Y. Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012.

[35] H. Salimbeni and M. Deisenroth. Doubly stochastic variational inference for deep gaussian processes. In NeurIPS, pages 4588–4599, 2017.

[36] H. Salimbeni, C.-A. Cheng, B. Boots, and M. Deisenroth. Orthogonally decoupled variational gaussian processes. In NeurIPS, pages 8725–8734, 2018.

[37] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NeurIPS, pages 1257–1264, 2006.

[38] J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In NeurIPS, pages 2951–2959, 2012.

[39] M. K. Titsias. Variational learning of inducing variables in sparse gaussian processes. In AISTATS, pages 567–574, 2009.

[40] S. Ubaru, J. Chen, and Y. Saad. Fast estimation of tr(f(A)) via stochastic lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.

[41] A. Wilson and R. Adams. Gaussian process kernels for pattern discovery and extrapolation. In ICML, pages 1067–1075, 2013.

[42] A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured gaussian processes (KISS-GP). In ICML, pages 1775–1784, 2015.

[43] A. G. Wilson, D. A. Knowles, and Z. Ghahramani. Gaussian process regression networks. In ICML, 2012.

[44] A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In NeurIPS, 2014.

[45] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing.
Deep kernel learning. In AISTATS, pages 370–378, 2016.

[46] A. G. Wilson, Z. Hu, R. R. Salakhutdinov, and E. P. Xing. Stochastic variational deep kernel learning. In NeurIPS, pages 2586–2594, 2016.

[47] Z. Yang, A. Wilson, A. Smola, and L. Song. A la carte – learning fast kernels. In AISTATS, pages 1098–1106, 2015.