{"title": "Orthogonally Decoupled Variational Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 8711, "page_last": 8720, "abstract": "Gaussian processes (GPs) provide a powerful non-parametric framework for reasoning over functions. Despite appealing theory, their superlinear computational and memory complexities have presented a long-standing challenge. State-of-the-art sparse variational inference methods trade modeling accuracy against complexity. However, the complexities of these methods still scale superlinearly in the number of basis functions, implying that sparse GP methods are able to learn from large datasets only when a small model is used. Recently, a decoupled approach was proposed that removes the unnecessary coupling between the complexities of modeling the mean and the covariance functions of a GP. It achieves a linear complexity in the number of mean parameters, so an expressive posterior mean function can be modeled. While promising, this approach suffers from optimization difficulties due to ill-conditioning and non-convexity. In this work, we propose an alternative decoupled parametrization. It adopts an orthogonal basis in the mean function to model the residues that cannot be learned by the standard coupled approach. Therefore, our method extends, rather than replaces, the coupled approach to achieve strictly better performance. This construction admits a straightforward natural gradient update rule, so the structure of the information manifold that is lost during decoupling can be leveraged to speed up learning. 
Empirically, our algorithm demonstrates significantly faster convergence in multiple experiments.", "full_text": "Orthogonally Decoupled Variational Gaussian Processes\n\nHugh Salimbeni\u2217\nImperial College London\nhrs13@ic.ac.uk\n\nChing-An Cheng\u2217\nGeorgia Institute of Technology\ncacheng@gatech.edu\n\nByron Boots\nGeorgia Institute of Technology\nbboots@gatech.edu\n\nMarc Deisenroth\nImperial College London\nmpd37@ic.ac.uk\n\nAbstract\n\nGaussian processes (GPs) provide a powerful non-parametric framework for reasoning over functions. Despite appealing theory, their superlinear computational and memory complexities have presented a long-standing challenge. State-of-the-art sparse variational inference methods trade modeling accuracy against complexity. However, the complexities of these methods still scale superlinearly in the number of basis functions, implying that sparse GP methods are able to learn from large datasets only when a small model is used. Recently, a decoupled approach was proposed that removes the unnecessary coupling between the complexities of modeling the mean and the covariance functions of a GP. It achieves a linear complexity in the number of mean parameters, so an expressive posterior mean function can be modeled. While promising, this approach suffers from optimization difficulties due to ill-conditioning and non-convexity. In this work, we propose an alternative decoupled parametrization. It adopts an orthogonal basis in the mean function to model the residues that cannot be learned by the standard coupled approach. Therefore, our method extends, rather than replaces, the coupled approach to achieve strictly better performance. This construction admits a straightforward natural gradient update rule, so the structure of the information manifold that is lost during decoupling can be leveraged to speed up learning. 
Empirically, our algorithm demonstrates significantly faster convergence in multiple experiments.\n\n1 Introduction\n\nGaussian processes (GPs) are flexible Bayesian non-parametric models that have achieved state-of-the-art performance in a range of applications [8, 31]. A key advantage of GP models is that they have large representational capacity, while being robust to overfitting [27]. This property is especially important for robotic applications, where there may be an abundance of data in some parts of the space but a scarcity in others [7]. Unfortunately, exact inference in GPs scales cubically in computation and quadratically in memory with the size of the training set, and is only available in closed form for Gaussian likelihoods.\nTo learn from large datasets, variational inference provides a principled way to find tractable approximations to the true posterior. A common approach to approximate GP inference is to form a sparse variational posterior, which is designed by conditioning the prior process at a small set of inducing points [32]. The sparse variational framework trades accuracy against computation,\n\n\u2217Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nbut its complexities still scale superlinearly in the number of inducing points. Consequently, the representation power of the approximate distribution is greatly limited.\nVarious attempts have been made to reduce the complexities in order to scale up GP models for better approximation. Most of them, however, rely on certain assumptions on the kernel structure and input dimension. 
In the extreme, Hartikainen and S\u00e4rkk\u00e4 [10] show that, for 1D-input problems, exact GP inference can be solved in linear time for kernels with finitely many non-zero derivatives. For low-dimensional inputs and stationary kernels, variational inference with structured kernel approximation [34] or Fourier features [13] has been proposed. Both approaches, nevertheless, scale exponentially with the input dimension, except for the special case of sum-and-product kernels [9]. Approximate kernels have also been proposed as GP priors with low-rank structure [30, 25] or a sparse spectrum [19]. Another family of methods partitions the input space into subsets and performs prediction aggregation [33, 26, 24, 6]; a Bayesian aggregation of local experts with attractive theoretical properties was recently proposed by Rulli\u00e8re et al. [28].\nA recent decoupled framework [5] takes a different direction to address the complexity issue of GP inference. In contrast to the above approaches, this decoupled framework is agnostic to the problem setup (e.g. likelihoods, kernels, and input dimensions) and extends the original sparse variational formulation [32]. The key idea is to represent the variational distribution in the reproducing kernel Hilbert space (RKHS) induced by the covariance function of the GP. The sparse variational posterior of Titsias [32] turns out to be equivalent to a particular parameterization in the RKHS, where the mean and covariance both share the same basis. Cheng and Boots [5] suggest relaxing the requirement of basis sharing. Since the computation only scales linearly in the number of mean parameters, many more basis functions can be used for modeling the mean function to achieve higher accuracy in prediction. However, the original decoupled basis [5] turns out to have optimization difficulties [11]. 
In particular, the non-convexity of the optimization problem means that a suboptimal solution may be found, leading to performance that is potentially worse than the standard coupled case. While Havasi et al. [11] suggest using a preconditioner to mitigate the problem, their algorithm incurs an additional cubic computational cost; therefore, its applicability is limited to small, simple models.\nInspired by the success of natural gradients in variational inference [15, 29], we propose a novel RKHS parameterization of decoupled GPs that admits efficient natural gradient computation. We decompose the mean parametrization into a part that shares the basis with the covariance, and an orthogonal part that models the residues that the standard coupled approach fails to capture. We show that, with this particular choice, the natural gradient update rules further decouple into natural gradient descent for the coupled part and functional gradient descent for the residual part. Based on these insights, we propose an efficient optimization algorithm that preserves the desired properties of decoupled GPs and converges faster than the original formulation [5].\nWe demonstrate that our basis is more effective than the original decoupled formulation on a range of classification and regression tasks. We show that the natural gradient updates improve convergence considerably and can lead to much better performance in practice. Crucially, we also show that our basis is more effective than the standard coupled basis for a fixed computational budget.\n\n2 Background\n\nWe consider the inference problem of GP models. Given a dataset D = {(x_n, y_n)}_{n=1}^N and a GP prior on a latent function f, the goal is to infer the (approximate) posterior of f(x\u2217) for any query input x\u2217. In this work, we adopt the recent decoupled RKHS reformulation of variational inference [5], and, without loss of generality, we will assume f is a scalar function. 
For notation, we use boldface to distinguish finite-dimensional vectors and matrices that are used in computation from scalar and abstract mathematical objects.\n\n2.1 Gaussian Processes and their RKHS Representation\n\nWe first review the primal and dual representations of GPs, which form the foundation of the RKHS reformulation. Let X \u2286 R^d be the domain of the latent function. A GP is a distribution of functions, which is described by a mean function m : X \u2192 R and a covariance function k : X \u00d7 X \u2192 R. We say a latent function f is distributed according to GP(m, k) if, for any x, x' \u2208 X, E[f(x)] = m(x) and C[f(x), f(x')] = k(x, x'), and any finite subset {f(x_n) : x_n \u2208 X}_{n=1}^N is jointly Gaussian distributed.\nWe call the above definition, in terms of the function values m(x) and k(x, x'), the primal representation of GPs. Alternatively, one can adopt a dual representation of GPs by treating the functions m and k as RKHS objects [4]. This is based on observing that the covariance function k satisfies the definition of positive semi-definite functions, so k can also be viewed as a reproducing kernel [3]. Specifically, given GP(m, k), without loss of generality, we can find an RKHS H such that\n\nm(x) = \u03c6(x)^\u22a4\u00b5,   k(x, x') = \u03c6(x)^\u22a4\u03a3\u03c6(x')   (1)\n\nfor some \u00b5 \u2208 H, bounded positive semidefinite self-adjoint operator \u03a3 : H \u2192 H, and feature map \u03c6 : X \u2192 H. Here we use \u22a4 to denote the inner product in H, even when dim H is infinite. For notational clarity, we use the symbols m and k (or s) to denote the mean and covariance functions, and the symbols \u00b5 and \u03a3 to denote the RKHS objects; we use s to distinguish the (approximate) posterior covariance function from the prior covariance function k. 
If f \u223c GP(m, k) satisfying (1), we also write f \u223c GP_H(\u00b5, \u03a3).\u2217\nTo concretely illustrate the primal-dual connection, we consider the GP regression problem. Suppose a priori f \u223c GP(0, k) and y_n = f(x_n) + \u03b5_n, where \u03b5_n \u223c N(\u03b5_n|0, \u03c3^2). Let X = {x_n}_{n=1}^N and y = (y_n)_{n=1}^N \u2208 R^N, where the notation (\u00b7)_{n=\u00b7} denotes stacking the elements. Then, with D observed, it can be shown that f \u223c GP(m, s), where\n\nm(x) = k_{x,X} (K_X + \u03c3^2 I)^{-1} y,   s(x, x') = k_{x,x'} \u2212 k_{x,X} (K_X + \u03c3^2 I)^{-1} k_{X,x'},   (2)\n\nwhere k_{\u00b7,\u00b7} and K_{\u00b7,\u00b7} denote the covariances between the subscripted sets.\u2020 We can also equivalently write the posterior GP in (2) in its dual RKHS representation: suppose the feature map \u03c6 is selected such that k(x, x') = \u03c6(x)^\u22a4\u03c6(x'); then a priori f \u223c GP_H(0, I) and a posteriori f \u223c GP_H(\u00b5, \u03a3),\n\n\u00b5 = \u03a6_X (K_X + \u03c3^2 I)^{-1} y,   \u03a3 = I \u2212 \u03a6_X (K_X + \u03c3^2 I)^{-1} \u03a6_X^\u22a4,   (3)\n\nwhere \u03a6_X = [\u03c6(x_1), . . . , \u03c6(x_N)].\n\n2.2 Variational Inference Problem\n\nInference in GP models is challenging because the closed-form expressions in (2) have computational complexity that is cubic in the size of the training dataset, and are only applicable for Gaussian likelihoods. For non-Gaussian likelihoods (e.g. classification) or for large datasets (i.e. more than 10,000 data points), we must adopt approximate inference.\nVariational inference provides a principled approach to search for an approximate but tractable posterior. It seeks a variational posterior q that is close to the true posterior p(f|D) in terms of KL divergence, i.e. it solves min_q KL(q(f)||p(f|D)). For GP models, the variational posterior must be defined over the entire function, so a natural choice is to use another GP. 
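Before moving to approximations, the closed-form posterior (2) can be grounded in a short NumPy sketch. The squared-exponential kernel and all names below are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    # Illustrative squared-exponential kernel:
    # k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def exact_gp_posterior(X, y, Xstar, noise_var=0.1):
    # Posterior mean and covariance of (2):
    #   m(x)    = k_{x,X} (K_X + sigma^2 I)^{-1} y
    #   s(x,x') = k_{x,x'} - k_{x,X} (K_X + sigma^2 I)^{-1} k_{X,x'}
    Ky = rbf(X, X) + noise_var * np.eye(len(X))
    Ks = rbf(Xstar, X)
    mean = Ks @ np.linalg.solve(Ky, y)
    cov = rbf(Xstar, Xstar) - Ks @ np.linalg.solve(Ky, Ks.T)
    return mean, cov
```

The N-by-N solve makes the cubic cost explicit, and the derivation holds only for a Gaussian likelihood, which is what motivates the variational treatment that follows.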
This choice is also motivated by the fact that the exact posterior is a GP in the case of a Gaussian likelihood, as shown in (2). Using the results from Section 2.1, we can represent this posterior process via a mean and a covariance function or, equivalently, through their associated RKHS objects.\nWe denote these RKHS objects as \u00b5 and \u03a3, which uniquely determine the GP posterior GP_H(\u00b5, \u03a3). In the following, without loss of generality, we shall assume that the prior GP is zero-mean and the RKHS is selected such that f \u223c GP_H(0, I) a priori.\nThe variational inference problem in GP models leads to the optimization problem\n\nmin_{q=GP_H(\u00b5,\u03a3)} L(q),   L(q) = \u2212\u2211_{n=1}^N E_{q(f(x_n))}[log p(y_n|f(x_n))] + KL(q(f)||p(f)),   (4)\n\nwhere p(f) = GP_H(0, I) and KL(q(f)||p(f)) = \u222b log (q(f)/p(f)) dq(f) is the KL divergence between the approximate posterior GP q(f) and the prior GP p(f). It can be shown that L(q) = KL(q(f)||p(f|D)) up to an additive constant [5].\n\n\u2217This notation only denotes that m and k can be represented as RKHS objects, not that the sampled functions of GP(m, k) necessarily reside in H (which only holds for the special case when \u03a3 has finite trace).\n\u2020If the two sets are the same, only one is listed.\n\n2.3 Decoupled Gaussian Processes\n\nDirectly optimizing the possibly infinite-dimensional RKHS objects \u00b5 and \u03a3 is computationally intractable, except for the special case of a Gaussian likelihood and a small training set size N. Therefore, in practice, we need to impose a certain sparse structure on \u00b5 and \u03a3. 
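To make (4) concrete, the following sketch evaluates both of its terms for the Gaussian-likelihood case, where the expected log-likelihood is available in closed form. The helper names, and the use of a finite-dimensional KL between Gaussians over a finite set of function values, are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def expected_loglik_gaussian(y, mean, var, noise_var):
    # Closed form of E_{q(f(x_n))}[log N(y_n | f(x_n), noise_var)]
    # when the marginal is q(f(x_n)) = N(mean_n, var_n).
    return (-0.5 * np.log(2 * np.pi * noise_var)
            - 0.5 * ((y - mean) ** 2 + var) / noise_var)

def kl_gaussians(m, S, K):
    # KL( N(m, S) || N(0, K) ): the KL term reduced to a finite basis.
    M = len(m)
    trace_term = np.trace(np.linalg.solve(K, S))
    mahal = m @ np.linalg.solve(K, m)
    logdet = np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1]
    return 0.5 * (trace_term + mahal - M + logdet)

def negative_elbo(y, q_mean, q_var, m, S, K, noise_var):
    # L(q) in (4): negative expected log-likelihood plus the KL regularizer.
    return -expected_loglik_gaussian(y, q_mean, q_var, noise_var).sum() + kl_gaussians(m, S, K)
```

When the variational distribution matches the prior (m = 0, S = K), the KL term vanishes, so L(q) reduces to the negative expected log-likelihood alone.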
Inspired by the functional form of the exact solution in (3), Cheng and Boots [5] propose to model the approximate posterior GP in the decoupled subspace parametrization (which we will refer to as the decoupled basis for short) with\n\n\u00b5 = \u03a8_\u03b1 a,   \u03a3 = I + \u03a8_\u03b2 A \u03a8_\u03b2^\u22a4,   (5)\n\nwhere \u03b1 and \u03b2 are the sets of inducing points that specify the bases \u03a8_\u03b1 and \u03a8_\u03b2 in the RKHS, and a \u2208 R^{|\u03b1|} and A \u2208 R^{|\u03b2|\u00d7|\u03b2|} are the coefficients such that \u03a3 \u2ab0 0. With only finite perturbations from the prior, the construction in (5) ensures the KL divergence KL(q(f)||p(f)) is finite [23, 5] (see Appendix A). Importantly, this parameterization decouples the variational parameters (a, \u03b1) for the mean \u00b5 and the variational parameters (A, \u03b2) for the covariance \u03a3. As a result, the computational complexities related to the two parts become independent, and a large set of parameters can be adopted for the mean to model complicated functions, as discussed below.\n\nCoupled Basis The form in (5) covers the sparse variational posterior [32]. Let Z = {z_n \u2208 X}_{n=1}^M be some fictitious inputs and let f_Z = (f(z_n))_{n=1}^M be the vector of function values. Based on the primal viewpoint of GPs, Titsias [32] constructs the variational posterior as the posterior GP conditioned on Z with marginal q(f_Z) = N(f_Z|m, S), where m \u2208 R^M and S \u2ab0 0 \u2208 R^{M\u00d7M}. The elements in Z, along with m and S, are the variational parameters to optimize. The mean and covariance functions of this process GP(m, s) are\n\nm(x) = k_{x,Z} K_Z^{-1} m,   s(x, x') = k_{x,x'} + k_{x,Z} K_Z^{-1} (S \u2212 K_Z) K_Z^{-1} k_{Z,x'},   (6)\n\nwhich is reminiscent of the exact result in (2). Equivalently, it has the dual representation\n\n\u00b5 = \u03a8_Z K_Z^{-1} m,   \u03a3 = I + \u03a8_Z K_Z^{-1} (S \u2212 K_Z) K_Z^{-1} \u03a8_Z^\u22a4,   (7)\n\nwhich conforms with the form in (5). The computational complexity of using the coupled basis reduces from O(N^3) to O(M^3 + M^2 N). Therefore, when M \u226a N is selected, the GP can be applied to learning from large datasets [32].\n\nInversely Parametrized Decoupled Basis Directly parameterizing the dual representation in (5) admits more flexibility than the primal function-valued perspective. To ensure that the covariance of the posterior strictly decreases compared with the prior, Cheng and Boots [5] propose a decoupled basis with an inversely parametrized covariance operator\n\n\u00b5 = \u03a8_\u03b1 a,   \u03a3 = (I + \u03a8_\u03b2 B \u03a8_\u03b2^\u22a4)^{-1},   (8)\n\nwhere B \u2ab0 0 \u2208 R^{|\u03b2|\u00d7|\u03b2|} and is further parametrized by its Cholesky factors in implementation. It can be shown that the choice in (8) is equivalent to setting A = \u2212K_\u03b2^{-1} + (K_\u03b2 B K_\u03b2 + K_\u03b2)^{-1} in (5). In this parameterization, because the bases for the mean and the covariance are decoupled, the computational complexity of solving (4) with the decoupled basis in (8) becomes O(|\u03b1| + |\u03b2|^3), as opposed to O(M^3) for (7). Therefore, while it is usually assumed that |\u03b2| is in the order of M, with a decoupled basis we can freely choose |\u03b1| \u226b |\u03b2| for modeling complex mean functions accurately.\n\n3 Orthogonally Decoupled Variational Gaussian Processes\n\nWhile the particular decoupled basis in (8) is more expressive, its optimization problem is ill-conditioned and non-convex, and empirically slow convergence has been observed [11]. To improve the speed of learning decoupled models, we consider the use of natural gradient descent [2]. 
In particular, we are interested in the update rule for natural parameters, which has empirically demonstrated impressive convergence performance over other choices of parametrizations [29].\nHowever, it is unclear what the natural parameters for the general decoupled basis in (5) are, and whether finite-dimensional natural parameters even exist for such a model. In this paper, we show that when a decoupled basis is appropriately structured, then natural parameters do exist. Moreover, they admit a very efficient (approximate) natural gradient update rule, as detailed in Section 3.4. As a result, large-scale decoupled models can be quickly learned, joining the fast convergence property of the coupled approach [12] and the flexibility of the decoupled approach [5].\n\n3.1 Alternative Decoupled Bases\n\nTo motivate the proposed approach, let us first introduce some alternative decoupled bases for improving the optimization properties of (8) and discuss their limitations. The inversely parameterized decoupled basis (8) is likely to have different optimization properties from the standard coupled basis (7), due to the inversion in its covariance parameterization. To avoid these potential difficulties, we reparameterize the covariance of (8) as the one in (7) and consider instead the basis\n\n\u00b5 = \u03a8_\u03b1 a,   \u03a3 = (I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4) + \u03a8_\u03b2 K_\u03b2^{-1} S K_\u03b2^{-1} \u03a8_\u03b2^\u22a4.   (9)\n\nThe basis (9) can be viewed as a decoupled version of (7): it can be readily identified that setting \u03b1 = \u03b2 = Z and a = K_Z^{-1} m recovers (7). Note that we do not want to define a basis in terms of K_\u03b1^{-1}, as that incurs the cubic complexity that we intend to avoid. This basis gives a posterior process with\n\nm(x) = k_{x,\u03b1} a,   s(x, x') = k_{x,x'} + k_{x,\u03b2} K_\u03b2^{-1} (S \u2212 K_\u03b2) K_\u03b2^{-1} k_{\u03b2,x'}.   (10)\n\nThe alternate choice (9) addresses the difficulty in optimizing the covariance operator, but it still suffers from one serious drawback: while using more inducing points, (9) is not necessarily more expressive than the standard basis (7), for example, when \u03b1 is selected badly. To eliminate the worst-case setup, we can explicitly consider \u03b2 to be part of \u03b1 and use\n\n\u00b5 = \u03a8_\u03b3 a_\u03b3 + \u03a8_\u03b2 K_\u03b2^{-1} m_\u03b2,   \u03a3 = (I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4) + \u03a8_\u03b2 K_\u03b2^{-1} S K_\u03b2^{-1} \u03a8_\u03b2^\u22a4,   (11)\n\nwhere \u03b3 = \u03b1 \\ \u03b2. This is exactly the hybrid basis suggested in the appendix of Cheng and Boots [5], which is strictly more expressive than (7) and yet has the same complexity as (8). Also, the explicit inclusion of \u03b2 inside \u03b1 is pivotal to defining proper finite-dimensional natural parameters, which we will discuss later. This basis gives a posterior process with the same covariance as (10), and mean m(x) = k_{x,\u03b3} a_\u03b3 + k_{x,\u03b2} K_\u03b2^{-1} m_\u03b2.\n\n3.2 Orthogonally Decoupled Representation\n\nBut is (11) the best possible decoupled basis? 
Upon closer inspection, we find that there is redundancy in the parameterization of this basis: as \u03a8_\u03b3 is not orthogonal to \u03a8_\u03b2 in general, optimizing a_\u03b3 and m_\u03b2 jointly would create coupling and make the optimization landscape more ill-conditioned.\nTo address this issue, under the partition \u03b1 = {\u03b2, \u03b3}, we propose a new decoupled basis as\n\n\u00b5 = (I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4) \u03a8_\u03b3 a_\u03b3 + \u03a8_\u03b2 a_\u03b2,   \u03a3 = (I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4) + \u03a8_\u03b2 K_\u03b2^{-1} S K_\u03b2^{-1} \u03a8_\u03b2^\u22a4,   (12)\n\nwhere a_\u03b3 \u2208 R^{|\u03b3|}, a_\u03b2 \u2208 R^{|\u03b2|}, and S = LL^\u22a4 is parametrized by its Cholesky factor. We call (a_\u03b3, a_\u03b2, S) the model parameters and refer to (12) as the orthogonally decoupled basis, because (I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4) is orthogonal to \u03a8_\u03b2 (i.e. (I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4)^\u22a4 \u03a8_\u03b2 = 0). By substituting Z = \u03b2 and a_\u03b2 = K_Z^{-1} m, we can compare (12) to (7): (12) has an additional part parameterized by a_\u03b3 to model the mean function residues that cannot be captured by using the inducing points \u03b2 alone. In prediction, our basis has time complexity in O(|\u03b3| + |\u03b2|^3) because K_\u03b2^{-1} K_{\u03b2,\u03b3} a_\u03b3 can be precomputed. The orthogonally decoupled basis results in a posterior process with\n\nm(x) = (k_{x,\u03b3} \u2212 k_{x,\u03b2} K_\u03b2^{-1} K_{\u03b2,\u03b3}) a_\u03b3 + k_{x,\u03b2} a_\u03b2,   s(x, x') = k_{x,x'} + k_{x,\u03b2} K_\u03b2^{-1} (S \u2212 K_\u03b2) K_\u03b2^{-1} k_{\u03b2,x'}.\n\nThis decoupled basis can also be derived from the perspective of Titsias [32] by conditioning the prior on a finite set of inducing points\u2217. Details of this construction are in Appendix D.\nCompared with the original decoupled basis in (8), our choice in (12) has attractive properties:\n\n1. The explicit inclusion of \u03b2 as a subset of \u03b1 leads to the existence of natural parameters.\n2. If the likelihood is strictly log-concave (e.g. Gaussian and Bernoulli likelihoods), then the variational inference problem in (4) is strictly convex in (a_\u03b3, a_\u03b2, L) (see Appendix B).\n\nOur setup in (12) introduces a projection operator before \u03a8_\u03b3 a_\u03b3 in the basis (11), and therefore it can be viewed as the unique hybrid parametrization which confines the function modeled by \u03b3 to be orthogonal to the span of the \u03b2 basis. Consequently, there is no correlation between optimizing a_\u03b3 and a_\u03b2, making the problem more well-conditioned.\n\n\u2217We thank an anonymous reviewer for highlighting this connection.\n\n3.3 Natural Parameters and Expectation Parameters\n\nTo identify the natural parameters of GPs structured as (12), we revisit the definition of natural parameters in exponential families. A distribution p(x) belongs to an exponential family if we can write p(x) = h(x) exp(t(x)^\u22a4\u03b7 \u2212 A(\u03b7)), where t(x) is the sufficient statistics, \u03b7 is the natural parameter, A is the log-partition function, and h(x) is the carrier measure.\nBased on this definition, we can see that the choice of natural parameters is not unique. Suppose \u03b7 = H\u02dc\u03b7 + b for some constant matrix H and vector b. Then \u02dc\u03b7 is also an admissible natural parameter, because we can write p(x) = \u02dch(x) exp(\u02dct(x)^\u22a4\u02dc\u03b7 \u2212 \u02dcA(\u02dc\u03b7)), where \u02dct(x) = H^\u22a4 t(x), \u02dch(x) = h(x) exp(t(x)^\u22a4 b), and \u02dcA(\u02dc\u03b7) = A(H\u02dc\u03b7 + b). 
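To make the prediction cost of the orthogonal basis concrete, its posterior mean m(x) = (k_{x,\u03b3} \u2212 k_{x,\u03b2} K_\u03b2^{-1} K_{\u03b2,\u03b3}) a_\u03b3 + k_{x,\u03b2} a_\u03b2 can be sketched as follows; the unit-lengthscale RBF kernel and the function names are illustrative assumptions:

```python
import numpy as np

def rbf(A, B):
    # Illustrative unit-lengthscale squared-exponential kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def orth_mean(Xstar, beta, gamma, a_beta, a_gamma, jitter=1e-8):
    # Posterior mean of the orthogonally decoupled basis (12):
    #   m(x) = (k_{x,gamma} - k_{x,beta} K_beta^{-1} K_{beta,gamma}) a_gamma
    #          + k_{x,beta} a_beta
    # The |beta|-vector K_beta^{-1} K_{beta,gamma} a_gamma can be computed once,
    # so each prediction touches gamma only through k_{x,gamma} a_gamma.
    K_beta = rbf(beta, beta) + jitter * np.eye(len(beta))
    proj = np.linalg.solve(K_beta, rbf(beta, gamma) @ a_gamma)
    return rbf(Xstar, gamma) @ a_gamma + rbf(Xstar, beta) @ (a_beta - proj)
```

A quick numerical check of the orthogonality claim: with a_beta = 0, the residual part evaluates to (numerically) zero at the inducing points beta, since k_{beta,gamma} a_gamma - K_beta K_beta^{-1} K_{beta,gamma} a_gamma = 0.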
In other words, the natural parameter is only unique up to affine transformations. If the natural parameter is transformed, the corresponding expectation parameter \u03b8 = E_p[t(x)] also transforms accordingly to \u02dc\u03b8 = H^\u22a4\u03b8. It can be shown that the Legendre primal-dual relationship between \u03b7 and \u03b8 is also preserved: \u02dcA is also convex, and it satisfies \u02dc\u03b8 = \u2207\u02dcA(\u02dc\u03b7) and \u02dc\u03b7 = \u2207\u02dcA\u2217(\u02dc\u03b8), where \u2217 denotes the Legendre dual function (see Appendix C).\nWe use this trick to identify the natural and expectation parameters of (12).\u2217 The relationships between natural, expectation, and model parameters are summarized in Figure 1.\n\nNatural Parameters Recall that for Gaussian distributions the natural parameters are conventionally defined as (\u03a3^{-1}\u00b5, (1/2)\u03a3^{-1}). Therefore, to find the natural parameters of (12), it suffices to show that (\u03a3^{-1}\u00b5, (1/2)\u03a3^{-1}) of (12) can be written as an affine transformation of some finite-dimensional parameters. The matrix inversion lemma and the orthogonality of (I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4) and \u03a8_\u03b2 yield\n\n\u03a3^{-1}\u00b5 = (I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4) \u03a8_\u03b3 j_\u03b3 + \u03a8_\u03b2 j_\u03b2,   (1/2)\u03a3^{-1} = (1/2)(I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4) + \u03a8_\u03b2 \u0398 \u03a8_\u03b2^\u22a4,   (13)\n\nwhere j_\u03b3 = a_\u03b3, j_\u03b2 = S^{-1} K_\u03b2 a_\u03b2, and \u0398 = (1/2) S^{-1}. Therefore, we call (j_\u03b3, j_\u03b2, \u0398) the natural parameters of (12). 
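The finite-dimensional maps in (13) are plain matrix operations. A sketch (hypothetical helper, assuming S and K_\u03b2 are given as dense positive-definite matrices):

```python
import numpy as np

def natural_from_model(a_gamma, a_beta, S, K_beta):
    # Natural parameters of (13):
    #   j_gamma = a_gamma,  j_beta = S^{-1} K_beta a_beta,  Theta = S^{-1} / 2.
    j_gamma = a_gamma
    j_beta = np.linalg.solve(S, K_beta @ a_beta)
    Theta = 0.5 * np.linalg.inv(S)
    return j_gamma, j_beta, Theta
```

Consistency with the expectation side can be checked directly: S j_beta recovers K_beta a_beta, i.e. the beta-block behaves exactly like a standard Gaussian natural/expectation parameter pair.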
This choice is unique in the sense that \u03a3^{-1}\u00b5 is orthogonally parametrized.\u2020 The explicit inclusion of \u03b2 as part of \u03b1 is important; otherwise there would be a constraint on j_\u03b1 and j_\u03b2, because \u00b5 can only be parametrized by the \u03b1-basis (see Appendix C).\n\nExpectation Parameters Once the new natural parameters are selected, we can also derive the corresponding expectation parameters. Recall that for the natural parameters (\u03a3^{-1}\u00b5, (1/2)\u03a3^{-1}), the associated expectation parameters are (\u00b5, \u2212(\u03a3 + \u00b5\u00b5^\u22a4)). Using the relationship between transformed natural and expectation parameters, we find the expectation parameters of (12) using the adjoint operators: [(I \u2212 \u03a8_\u03b2 K_\u03b2^{-1} \u03a8_\u03b2^\u22a4)\u03a8_\u03b3, \u03a8_\u03b2]^\u22a4 \u00b5 = [m_{\u03b3\u22a5\u03b2}, m_\u03b2]^\u22a4 and \u2212\u03a8_\u03b2^\u22a4 (\u03a3 + \u00b5\u00b5^\u22a4) \u03a8_\u03b2 = \u039b, where we have\n\nm_{\u03b3\u22a5\u03b2} = (K_\u03b3 \u2212 K_{\u03b3,\u03b2} K_\u03b2^{-1} K_{\u03b2,\u03b3}) j_\u03b3,   m_\u03b2 = S j_\u03b2,   \u039b = \u2212S \u2212 m_\u03b2 m_\u03b2^\u22a4.   (14)\n\nNote that the equations for \u03b2 in (13) and (14) have exactly the same relationship between the natural and expectation parameters as in the standard Gaussian case, i.e. (\u03a3^{-1}\u00b5, (1/2)\u03a3^{-1}) \u2194 (\u00b5, \u2212(\u03a3 + \u00b5\u00b5^\u22a4)).\n\n3.4 Natural Gradient Descent\n\nNatural gradient descent updates parameters according to the information geometry induced by the KL divergence [2]. It is invariant to reparametrization and can normalize the problem to be well conditioned [20]. Let F(\u03b7) = \u2207^2 KL(q||p_\u03b7)|_{q=p_\u03b7} be the Fisher information matrix, where p_\u03b7 denotes a distribution with natural parameter \u03b7. 
Natural gradient descent for natural parameters performs the update \u03b7 \u2190 \u03b7 \u2212 \u03c4 F(\u03b7)^{-1} \u2207_\u03b7 L, where \u03c4 > 0 is the step size. Because directly computing the inverse F(\u03b7)^{-1} is computationally expensive, we use the duality between natural and expectation parameters in exponential families and adopt the equivalent update \u03b7 \u2190 \u03b7 \u2212 \u03c4 \u2207_\u03b8 L [15, 29].\n\n\u2217While GPs do not admit a density function, the property of transforming natural parameters described above still applies. An alternate proof can be derived using KL divergence.\n\u2020The hybrid parameterization (11) in [5, Appendix], which also considers \u03b2 explicitly in \u00b5, admits natural parameters as well. However, their relationship and the natural gradient update rule turn out to be more convoluted; we provide a thorough discussion in Appendix C.\n\nNatural | Model | Expectation\nj_\u03b3 = a_\u03b3 | a_\u03b3 | m_{\u03b3\u22a5\u03b2} = (K_\u03b3 \u2212 K_{\u03b3,\u03b2} K_\u03b2^{-1} K_{\u03b2,\u03b3}) a_\u03b3\nj_\u03b2 = S^{-1} K_\u03b2 a_\u03b2 | a_\u03b2 | m_\u03b2 = K_\u03b2 a_\u03b2\n\u0398 = (1/2) S^{-1} | S | \u039b = \u2212S \u2212 m_\u03b2 m_\u03b2^\u22a4\n\nFigure 1: The relationship between the three parameterizations of the orthogonally decoupled basis. The box highlights the parameters in common with the standard coupled basis, which are decoupled from the additional a_\u03b3 parameter. This is a unique property of our orthogonal basis.\n\nExact Update Rules For our basis in (12), the natural gradient descent step can be written as\n\nj \u2190 j \u2212 \u03c4 \u2207_m L,   \u0398 \u2190 \u0398 \u2212 \u03c4 \u2207_\u039b L,   (15)\n\nwhere we recall L is the negative variational lower bound in (4), j = [j_\u03b3, j_\u03b2] in (13), and m = [m_{\u03b3\u22a5\u03b2}, m_\u03b2] in (14). As L is defined in terms of (a_\u03b3, a_\u03b2, S), to compute these derivatives we use the chain rule (provided by the relationships in Figure 1) and obtain\n\nj_\u03b3 \u2190 j_\u03b3 \u2212 \u03c4 (K_\u03b3 \u2212 K_{\u03b3,\u03b2} K_\u03b2^{-1} K_{\u03b2,\u03b3})^{-1} \u2207_{a_\u03b3} L,   (16a)\nj_\u03b2 \u2190 j_\u03b2 \u2212 \u03c4 (K_\u03b2^{-1} \u2207_{a_\u03b2} L \u2212 2 \u2207_S L m_\u03b2),   (16b)\n\u0398 \u2190 \u0398 + \u03c4 \u2207_S L.   (16c)\n\nDue to the orthogonal choice of natural parameter definition, the updates for the \u03b3 and the \u03b2 parts are independent. Furthermore, one can show that the update for j_\u03b2 and \u0398 is exactly the same as the natural gradient descent rule for the standard coupled basis [12], and that the update for the residue part j_\u03b3 is equivalent to functional gradient descent [17] in the subspace orthogonal to the span of \u03a8_\u03b2.\n\nApproximate Update Rule We described the natural gradient descent update for the orthogonally decoupled GPs in (12). However, in the regime where |\u03b3| \u226b |\u03b2|, computing (16a) becomes infeasible. Here we propose an approximation of (16a) by approximating K_\u03b3 with a diagonal-plus-low-rank structure. Because the inducing points \u03b2 are selected to globally approximate the function landscape, one sensible choice is to approximate K_\u03b3 with a Nystr\u00f6m approximation based on \u03b2 and a diagonal correction term: K_\u03b3 \u2248 D_{\u03b3|\u03b2} + K_{\u03b3|\u03b2}, where D_{\u03b3|\u03b2} = diag(K_\u03b3 \u2212 K_{\u03b3|\u03b2}), K_{\u03b3|\u03b2} = K_{\u03b3,\u03b2} K_\u03b2^{-1} K_{\u03b2,\u03b3}, and diag denotes extracting the diagonal part of a matrix. FITC [30] uses a similar idea to approximate the prior distribution [25], whereas here it is used to derive an approximate update rule without changing the problem. 
This leads to a simple update rule

    j_γ ← j_γ − τ (D_{γ|β} + εI)⁻¹ ∇_{a_γ} L,    (17)

where a jitter ε > 0 is added to ensure stability. This update rule uses a diagonal scaling (D_{γ|β} + εI)⁻¹, which is independent of j_β and Θ. Therefore, while one could directly use the update (17), in implementation we propose to replace (17) with an adaptive coordinate-wise gradient descent algorithm (e.g., Adam [16]) to update the γ part. Due to the orthogonal structure, the overall computational complexity is O(|γ||β| + |β|³). While this is more than the O(|γ| + |β|³) complexity of the original decoupled approach [5], the experimental results suggest the additional computation is worth the large performance improvement.

4 Results

We empirically assess the performance of our algorithm in multiple regression and classification tasks. We show that a) given fixed wall-clock time, the proposed orthogonally decoupled basis outperforms existing approaches; b) given the same number of inducing points for the covariance, our method almost always improves on the coupled approach (which is in contrast to the previous decoupled bases); c) using natural gradients can dramatically improve performance, especially in regression.

We compare updating our orthogonally decoupled basis with adaptive gradient descent using the Adam optimizer [16] (ORTH), and using the approximate natural gradient descent rule described in Section 3.4 (ORTHNAT). As baselines, we consider the original decoupled approach of Cheng and Boots
[5] (DECOUPLED) and the hybrid approach suggested in their Appendix (HYBRID). We also compare to the standard coupled basis with and without natural gradients (COUPLEDNAT and COUPLED, respectively). We make generic choices for hyperparameters, inducing-point initializations, and data processing, which are detailed in Appendix F. Our code∗ and datasets† are publicly available.

[Figure 2 plots omitted in this extraction; panels (a) Test RMSE, (b) Test log-likelihood, and (c) ELBO are plotted against iterations.]
Figure 2: Training curves for our models in three different settings. Panel (c) has |γ| = 3500, |β| = 1500 for the decoupled bases and |β| = 2000 for the coupled bases. Panels (a) and (b) have |γ| = |β| = 500 and fixed hyperparameters and full batches to highlight the convergence properties of the approaches. Panels (a) and (c) use a Gaussian likelihood. Panel (b) uses a Bernoulli likelihood.

Illustration  Figures 2a and 2b show a simplified setting to illustrate the difference between the methods. In this example, we fixed the inducing inputs and hyperparameters, and optimized the rest of the variational parameters. All the decoupled methods then have the same global optimum, so we can easily assess their convergence properties. With a Gaussian likelihood we also computed the optimal solution analytically as an optimal baseline, although this requires inverting an |α|-sized matrix and is therefore not useful as a practical method. We include the expressions for the optimal solution in Appendix E. We set |γ| = |β| = 500 for all bases and conducted experiments on the 3droad dataset (N = 434874, D = 3) for regression with a Gaussian likelihood and on the ringnorm dataset (N = 7400, D = 21) for classification with a Bernoulli likelihood. Overall, the natural gradient methods converge much faster than their ordinary gradient counterparts.
DECOUPLED fails to converge to the optimum after 20K iterations, even in the Gaussian case. We emphasize that, unlike our proposed approaches, DECOUPLED leads to a non-convex optimization problem.

Wall-clock comparison  To investigate large-scale performance, we used 3droad with a large number of inducing points. We used a computer with a Tesla K40 GPU and found that, in wall-clock time, the orthogonally decoupled basis with |γ| = 3500, |β| = 1500 was equivalent to a coupled model with |β| = 2000 (about 0.7 seconds per iteration) in our TensorFlow [1] implementation. Under this setting, we show the ELBO in Figure 2c and the test log-likelihood and accuracy in Figure 3 of Appendix G. ORTHNAT performs best, both in terms of log-likelihood and accuracy. The highest test log-likelihood of ORTHNAT is −3.25, followed by ORTH (−3.26). COUPLED (−3.37) and COUPLEDNAT (−3.33) both outperform HYBRID (−3.39) and DECOUPLED (−3.66).

Regression benchmarks  We applied our models to 12 regression datasets ranging from 15K to 2M points. To enable feasible computation on multiple datasets, we downscale (but keep the same ratio) the number of inducing points to |γ| = 700, |β| = 300 for the decoupled models, and |β| = 400 for the coupled models. We also compare to the coupled model with |β| = 300 to establish whether the extra computation always improves the performance of the decoupled basis. The test mean absolute error (MAE) results are summarized in Table 1, and the full results for both test log-likelihood and MAE are given in Appendix G. ORTHNAT is overall the most competitive basis, and, by all measures, the orthogonal bases outperform their coupled counterparts with the same β, unlike HYBRID and DECOUPLED.

Classification benchmarks  We compare our method with state-of-the-art fully connected neural networks with Selu activations [18].
We adopted the experimental setup from [18], using the largest 19 datasets (4898 to 130000 data points). For the binary datasets we used the Bernoulli likelihood, and for the multiclass datasets we used the robust-max likelihood [14]. The same basis settings as for the regression benchmarks were used here. ORTH performs the best in terms of median, and ORTHNAT is best ranked. The neural network wins in terms of mean, because it substantially outperforms all the GP models on one particular dataset (chess-krvk), which skews the mean performance over the 19 datasets. We see that our orthogonal bases on average improve on the coupled bases with equivalent wall-clock time, although for some datasets the coupled bases are superior. Unlike in the regression case, it is not always true that using natural gradients improves performance, although on average it does.

∗ https://github.com/hughsalimbeni/orth_decoupled_var_gps
† https://github.com/hughsalimbeni/bayesian_benchmarks

Table 1: Regression summary for normalized test MAE on 12 regression datasets, with standard errors for the average ranks. The coupled bases had |β| = 400 (|β| = 300 for the † bases), and all the decoupled bases had |γ| = 700, |β| = 300. See Appendix G for the full results.

              COUPLED†     COUPLEDNAT†  COUPLED      COUPLEDNAT   ORTH         ORTHNAT      HYBRID       DECOUPLED
    Mean      0.298        0.295        0.291        0.290        0.284        0.282        0.298        0.361
    Median    0.221        0.219        0.215        0.213        0.211        0.210        0.225        0.299
    Avg Rank  6.083(0.19)  5.00(0.33)   3.750(0.26)  2.417(0.31)  2.500(0.47)  1.833(0.35)  6.417(0.23)  8(0.00)
This holds for both the coupled and decoupled bases.

Table 2: Classification test accuracy (%) results for our models, showing also the results from [18], with standard errors for the average ranks. See Table 5 in the Appendix for the complete results.

                  Selu        COUPLED     COUPLEDNAT  ORTH        ORTHNAT     HYBRID      DECOUPLED
    Mean          91.6        90.4        90.2        90.6        90.3        89.9        89.0
    Median        93.1        94.8        93.6        95.6        93.6        93.4        92.0
    Average rank  4.16(0.67)  3.89(0.42)  3.53(0.45)  3.68(0.35)  3.42(0.31)  3.89(0.38)  5.42(0.51)

Overall, the empirical results demonstrate that the orthogonally decoupled basis is superior to the coupled basis with the same wall-clock time, averaged over datasets. It is important to note that for the same β, adding extra γ increases performance for the orthogonally decoupled basis in almost all cases, but not for HYBRID or DECOUPLED. While this does add additional computation, the ratio between the extra computation for additional γ and that for additional β decreases to zero as β increases. That is, eventually the cubic scaling in |β| will dominate the linear scaling in |γ|.

5 Conclusion

We present a novel orthogonally decoupled basis for variational inference in GP models. Our basis is constructed by extending the standard coupled basis with an additional component to model the mean residues. Therefore, it extends the standard coupled basis [32, 12] and achieves better performance. We show how the natural parameters of our decoupled basis can be identified and propose an approximate natural gradient update rule, which significantly improves the optimization performance over the original decoupled approach [5]. Empirically, our method demonstrates strong performance in multiple regression and classification tasks.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al.
TensorFlow: A system for large-scale machine learning. In Symposium on Operating Systems Design and Implementation, 2016.
[2] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.
[3] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950.
[4] C.-A. Cheng and B. Boots. Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems, 2016.
[5] C.-A. Cheng and B. Boots. Variational inference for Gaussian process models with linear complexity. In Advances in Neural Information Processing Systems, 2017.
[6] M. P. Deisenroth and J. W. Ng. Distributed Gaussian processes. In International Conference on Machine Learning, 2015.
[7] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.
[8] P. J. Diggle and P. J. Ribeiro. Model-based Geostatistics. Springer, 2007.
[9] J. R. Gardner, G. Pleiss, R. Wu, K. Q. Weinberger, and A. G. W. Wilson. Product kernel interpolation for scalable Gaussian processes. In International Conference on Artificial Intelligence and Statistics, 2018.
[10] J. Hartikainen and S. Särkkä. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In IEEE International Workshop on Machine Learning for Signal Processing, 2010.
[11] M. Havasi, J. M. Hernández-Lobato, and J. J. Murillo-Fuentes. Deep Gaussian processes with decoupled inducing inputs. arXiv:1801.02939, 2018.
[12] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.
[13] J. Hensman, N. Durrande, and A. Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18:1–52, 2018.
[14] D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont.
Robust multi-class Gaussian process classification. In Advances in Neural Information Processing Systems, 2011.
[15] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013.
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[17] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 2004.
[18] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, 2017.
[19] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 2010.
[20] J. Martens. New insights and perspectives on the natural gradient method. arXiv:1412.1193, 2014.
[21] A. G. Matthews, M. Van Der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 2017.
[22] A. G. d. G. Matthews. Scalable Gaussian process inference using variational methods. PhD thesis, University of Cambridge, 2017.
[23] A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In International Conference on Artificial Intelligence and Statistics, 2016.
[24] T. Nguyen and E. Bonilla. Fast allocation of Gaussian process experts. In International Conference on Machine Learning, 2014.
[25] J. Quiñonero-Candela, C. E. Rasmussen, and R. Herbrich. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 2005.
[26] C. E. Rasmussen and Z. Ghahramani.
Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems, 2002.
[27] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[28] D. Rullière, N. Durrande, F. Bachoc, and C. Chevalier. Nested Kriging predictions for datasets with a large number of observations. Statistics and Computing, 2018.
[29] H. Salimbeni, S. Eleftheriadis, and J. Hensman. Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. In International Conference on Artificial Intelligence and Statistics, 2018.
[30] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2006.
[31] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
[32] M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, 2009.
[33] V. Tresp. A Bayesian committee machine. Neural Computation, 2000.
[34] A. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, 2015.