{"title": "Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 3337, "page_last": 3347, "abstract": "Gaussian Processes (GPs) are a generic modelling tool for supervised learning. While they have been successfully applied on large datasets, their use in safety-critical applications is hindered by the lack of good performance guarantees. To this end, we propose a method to learn GPs and their sparse approximations by directly optimizing a PAC-Bayesian bound on their generalization performance, instead of maximizing the marginal likelihood. Besides its theoretical appeal, we find in our evaluation that our learning method is robust and yields significantly better generalization guarantees than other common GP approaches on several regression benchmark datasets.", "full_text": "Learning Gaussian Processes by\n\nMinimizing PAC-Bayesian Generalization Bounds\n\nDavid Reeb\n\nAndreas Doerr\n\nSebastian Gerwinn\nBosch Center for Arti\ufb01cial Intelligence\u2217\n\nBarbara Rakitsch\n\n{david.reeb,andreas.doerr3,sebastian.gerwinn,barbara.rakitsch}@de.bosch.com\n\nRobert-Bosch-Campus 1\n\n71272 Renningen, Germany\n\nAbstract\n\nGaussian Processes (GPs) are a generic modelling tool for supervised learning.\nWhile they have been successfully applied on large datasets, their use in safety-\ncritical applications is hindered by the lack of good performance guarantees. To\nthis end, we propose a method to learn GPs and their sparse approximations by\ndirectly optimizing a PAC-Bayesian bound on their generalization performance,\ninstead of maximizing the marginal likelihood. 
Besides its theoretical appeal, we find in our evaluation that our learning method is robust and yields significantly better generalization guarantees than other common GP approaches on several regression benchmark datasets.

1 Introduction

Gaussian Processes (GPs) are a powerful modelling method due to their non-parametric nature [1]. Although GPs are probabilistic models and hence come equipped with an intrinsic measure of uncertainty, this uncertainty does not allow conclusions about their performance on previously unseen test data. For instance, one often observes overfitting if a large number of hyperparameters is adjusted using marginal likelihood optimization [2]. While a fully Bayesian approach, i.e. marginalizing out the hyperparameters, reduces this risk, it incurs a prohibitive runtime since the predictive distribution is no longer analytically tractable. Also, it does not entail out-of-the-box safety guarantees.
In this work, we propose a novel training objective for GP models, which enables us to give rigorous and quantitatively good performance guarantees on future predictions. Such rigorous guarantees are developed within Statistical Learning Theory (e.g. [3]). But as the classical uniform learning bounds are meaningless for expressive models like deep neural nets [4] (as e.g. the VC dimension exceeds the training size) and GPs or non-parametric methods in general, such guarantees cannot be employed for learning those models. Instead, common optimization schemes are (regularized) empirical risk minimization (ERM) [4, 3], maximum likelihood (MLE) [1], or variational inference (VI) [5, 6].
On the other hand, better non-uniform learning guarantees have been developed within the PAC-Bayesian framework [7, 8, 9] (Sect. 2).
They are specially adapted to probabilistic methods like GPs and can yield tight generalization bounds, as observed for GP classification [10], probabilistic SVMs [11, 12], linear classifiers [13], or stochastic NNs [14]. Most previous works used PAC-Bayesian bounds merely for the final evaluation of the generalization performance, whereas learning by optimizing a PAC-Bayesian bound has barely been explored [13, 14]. This work, for the first time, explores the use of PAC-Bayesian bounds (a) for GP training and (b) in the regression setting.
Specifically, we propose to learn full and sparse GP predictors Q directly by minimizing a PAC-Bayesian upper bound B(Q) from Eq. (5) on the true future risk R(Q) of the predictor, as a principled method to ensure good generalization (Sect. 3). Our general approach comes naturally for GPs because the KL divergence KL(Q‖P) in the PAC-Bayes theorem can be evaluated analytically for GPs P, Q sharing the same hyperparameters. As this applies to popular sparse GP variants such as DTC [16], FITC [15], and VFE [6], they all become amenable to our method of PAC-Bayes learning, combining the computational benefits of sparse GPs with theoretical guarantees. We carefully account for the different types of parameters (hyperparameters, inducing inputs, observation noise, free-form parameters), as only some of them contribute to the "penalty term" in the PAC-Bayes bound. Further, we base GP learning directly on the inverse binary KL divergence [10], and not on looser bounds used previously, such as from Pinsker's inequality (e.g., [14]).
We demonstrate our GP learning method on regression tasks, whereas PAC-Bayes bounds have so far mostly been used in a classification setting.

*https://www.bosch-ai.com

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
A PAC-Bayesian bound for regression with potentially unbounded loss function was developed in [17]; it requires a sub-Gaussian assumption w.r.t. the (unknown) data distribution, see also [18]. To remain distribution-free as in the usual PAC setting, we employ and investigate a generic bounded loss function for regression.
We evaluate our learning method on several datasets and compare its performance to state-of-the-art GP methods [1, 15, 6] in Sect. 4. Our learning objective exhibits robust optimization behaviour with the same scaling to large datasets as the other GP methods. We find that our method yields significantly better risk bounds, often by a factor of more than two, and that only for our approach the guarantee improves with the number of inducing points.

2 General PAC-Bayesian Framework

2.1 Risk functions

We consider the standard supervised learning setting [3] where a set S of N training examples (x_i, y_i) ∈ X × Y (i = 1, ..., N) is used to learn in a hypothesis space H ⊆ Y^X, a subset of the space of functions X → Y. We allow learning algorithms that output a distribution Q over hypotheses h ∈ H, rather than a single hypothesis h, which is the case for the GPs we consider later on. To quantify how well a hypothesis h performs, we assume a bounded loss function ℓ : Y × Y → [0, 1] to be given, w.l.o.g. scaled to the interval [0, 1]. ℓ(y_*, ŷ) measures how well the prediction ŷ = h(x_*) approximates the actual output y_* at an input x_*. As in the usual PAC framework, we assume an (unknown) underlying distribution μ = μ(x, y) on the set X × Y of examples, and define the (true) risk as R(h) := ∫ dμ(x, y) ℓ(y, h(x)). The empirical risk R_S(h) of a hypothesis is then defined as the average training loss R_S(h) := (1/N) Σ_{i=1}^N ℓ(y_i, h(x_i)). We will later assume that the training set S consists of N independent draws from μ and study how close R_S is to its mean R [3]. To quantify the performance of stochastic learning algorithms, that output a distribution Q over hypotheses, we define the empirical and true risks by a slight abuse of notation as [7]:

    R_S(Q) := E_{h∼Q}[R_S(h)] = (1/N) Σ_{i=1}^N E_{h∼Q}[ℓ(y_i, h(x_i))],   (1)
    R(Q) := E_{h∼Q}[R(h)] = E_{(x_*,y_*)∼μ} E_{h∼Q}[ℓ(y_*, h(x_*))].   (2)

These are the average losses, also termed Gibbs risks, on the training and true distributions, respectively, where the hypothesis h is sampled according to Q before prediction.
In the following, we focus on the regression case, where Y ⊆ R is the set of reals. An exemplary loss function in this case is ℓ(y_*, ŷ) := 1_{ŷ ∉ [r₋(y_*), r₊(y_*)]}, where the functions r± specify an interval outside of which a prediction ŷ is deemed insufficient; similar to ε-support vector regression [19], we use r±(y_*) := y_* ± ε, with a desired accuracy goal ε > 0 specified before learning (see Sect. 4). In any case, the expectations over h ∼ Q in (1)–(2) reduce to one-dimensional integrals as h(x_*) is a real-valued random variable at each x_*. See App. C, where we also explore other loss functions.
Instead of the stochastic predictor h(x_*) with h ∼ Q, one is often interested in the deterministic Bayes predictor E_{h∼Q}[h(x_*)] [10]; for GP regression, this simply equals the predictive mean m̂(x_*) at x_*. The corresponding Bayes risk is defined by R_Bay(Q) := E_{(x_*,y_*)∼μ}[ℓ(y_*, E_{h∼Q}[h(x_*)])]. While PAC-Bayesian theorems do not directly give a bound on R_Bay(Q) but only on R(Q), it is easy to see that R_Bay(Q) ≤ 2R(Q) if ℓ(y_*, ŷ) is quasi-convex in ŷ, as in the examples above, and the distribution of ŷ = h(x_*) is symmetric around its mean (e.g., Gaussian) [10]. An upper bound B(Q) on R(Q) below 1/2 thus implies a nontrivial bound R_Bay(Q) ≤ 2B(Q) < 1.

2.2 PAC-Bayesian generalization bounds

In this paper we aim to learn a GP Q by minimizing suitable risk bounds. Due to the probabilistic nature of GPs, we employ generalization bounds for stochastic predictors, which were previously observed to yield stronger guarantees than those for deterministic predictors [10, 11, 14]. The most important results in this direction are the so-called "PAC-Bayesian bounds", originating from [7, 8] and developed in various directions [10, 20, 9, 13, 21, 17].
The PAC-Bayesian theorem (Theorem 1) gives a probabilistic upper bound (generalization guarantee) on the true risk R(Q) of a stochastic predictor Q in terms of its empirical risk R_S(Q) on a training set S. It requires to fix a distribution P on the hypothesis space H before seeing the training set S, and applies to the true risk R(Q) of any distribution Q on H¹. The bound contains a term that can be interpreted as complexity of the hypothesis distribution Q, namely the Kullback-Leibler (KL) divergence KL(Q‖P) := ∫ dh Q(h) ln[Q(h)/P(h)], which takes values in [0, +∞]. The bound also contains the binary KL-divergence kl(q‖p) := q ln(q/p) + (1−q) ln[(1−q)/(1−p)], defined for q, p ∈ [0, 1], or more precisely its (upper) inverse kl⁻¹ w.r.t.
the second argument (for q ∈ [0, 1], ε ∈ [0, ∞]):

    kl⁻¹(q, ε) := max{ p ∈ [0, 1] : kl(q‖p) ≤ ε },   (3)

which equals the unique p ∈ [q, 1] satisfying kl(q‖p) = ε. While kl⁻¹ has no closed-form expression, we refer to App. A for an illustration and more details, including its derivatives for optimization.

Theorem 1 (PAC-Bayesian theorem [7, 10, 20]). For any [0, 1]-valued loss function ℓ, for any distribution μ, for any N ∈ N, for any distribution P on a hypothesis set H, and for any δ ∈ (0, 1], the following holds with probability at least 1 − δ over the training set S ∼ μ^N:

    ∀Q: R(Q) ≤ kl⁻¹( R_S(Q), [KL(Q‖P) + ln(2√N/δ)] / N ).   (4)

The RHS of (4) can be upper bounded by R_S(Q) + √( [KL(Q‖P) + ln(2√N/δ)] / (2N) ), which gives a useful intuition about the involved terms, but can exceed 1 and thereby yield a trivial statement. Note that the full PAC-Bayes theorem [20] gives a simultaneous lower bound on R(Q), which is however not relevant here as we are going to minimize the upper risk bound. Further refinements of the bound are possible (e.g., [20]), but as they improve over Theorem 1 only in small regimes [9, 13, 21], often despite adjustable parameters, we will stick with the parameter-free bound (4).
We want to consider a family of prior distributions P^θ parametrized by θ ∈ Θ, e.g. in GP hyperparameter training [1]. If this family is countable, one can generalize the above analysis by fixing some probability distribution p_θ on Θ and defining the mixture prior P := Σ_θ p_θ P^θ; when Θ is a finite set, the uniform distribution p_θ = 1/|Θ| is a canonical choice. Using the fact that KL(Q‖P) ≤ KL(Q‖P^θ) + ln(1/p_θ) holds for each θ ∈ Θ (App. B), Theorem 1 yields that, with probability at least 1 − δ over S ∼ μ^N,

    ∀θ ∈ Θ ∀Q: R(Q) ≤ kl⁻¹( R_S(Q), [KL(Q‖P^θ) + ln(1/p_θ) + ln(2√N/δ)] / N ) =: B(Q).   (5)

The bound (5) holds simultaneously for all P^θ and all Q². One can thus optimize over both θ and Q to obtain the best generalization guarantee, with confidence at least 1 − δ. We use B(Q) for our training method below, but we will also compare to training with the suboptimal upper bound B_Pin(Q) := R_S(Q) + √( [KL(Q‖P^θ) + ln(1/p_θ) + ln(2√N/δ)] / (2N) ) ≥ B(Q) as was done previously [14].

¹We follow common usage and call P and Q "prior" and "posterior" distributions in the PAC-Bayesian setting, although their meaning is somewhat different from priors and posteriors in Bayesian probability theory.
²The same result can be derived from (4) via a union bound argument (see Appendix B).

The PAC-Bayesian bound depends only weakly on the confidence parameter δ, which enters logarithmically and is suppressed by the sample size N. When the hyperparameter set Θ is not too large (i.e. ln(1/p_θ) is small compared to N), the main contribution to the penalty term in the second argument of kl⁻¹ comes from KL(Q‖P^θ)/N, which must be ≪ 1 for a good generalization statement (see Sect.
4).

3 PAC-Bayesian learning of GPs

3.1 Learning full GPs

GP modelling is usually presented as a Bayesian method [1], in which the prior P(f) = GP(f | m(x), K(x, x′)) is specified by a positive definite kernel K : X × X → R and a mean function m : X → R on the input set X. In ordinary GP regression, the learned distribution Q is then chosen as the Bayesian posterior coming from the assumption that the training outputs y_N := (y_i)_{i=1}^N ∈ R^N are noisy versions of f_N = (f(x_1), ..., f(x_N)) with i.i.d. Gaussian likelihood y_N | f_N ∼ N(y_N | f_N, σ_n²·1). Under this assumption, Q is again a GP [1]:

    Q(f) = GP( f | m(x) + k_N(x)(K_NN + σ_n²·1)⁻¹(y_N − m_N),
               K(x, x′) − k_N(x)(K_NN + σ_n²·1)⁻¹ k_N(x′)ᵀ ),   (6)

with K_NN = (K(x_i, x_j))_{i,j=1}^N, k_N(x) = (K(x, x_1), ..., K(x, x_N)), m_N = (m(x_1), ..., m(x_N)). Eq. (6) is employed to make (stochastic) predictions for f(x_*) on new inputs x_* ∈ X. In our approach below, we do not require any Bayesian rationale behind Q but merely use its form, parametrized by σ_n², as an optimization ansatz within the PAC-Bayesian theorem.
Importantly, for any full GP prior P and its corresponding posterior Q from (6), the KL-divergence KL(Q‖P) in Theorem 1 and Eq. (5) can be evaluated on finite (N-)dimensional matrices. This allows us to evaluate the PAC-Bayesian bound and in turn to learn GPs by optimizing it. More precisely, one can easily verify that P and Q have the same conditional distribution P(f | f_N) = Q(f | f_N)³, so that

    KL(Q‖P) = KL(Q(f_N)Q(f | f_N) ‖ P(f_N)P(f | f_N)) = KL(Q(f_N) ‖ P(f_N))   (7)
            = ½ ln det[K_NN + σ_n²·1] − (N/2) ln σ_n² − ½ tr[K_NN (K_NN + σ_n²·1)⁻¹]
              + ½ (y_N − m_N)ᵀ (K_NN + σ_n²·1)⁻¹ K_NN (K_NN + σ_n²·1)⁻¹ (y_N − m_N),   (8)

where in the last step we used the well-known formula [22] for the KL divergence between normal distributions P(f_N) = N(f_N | m_N, K_NN) and Q(f_N) = N(f_N | m_N + K_NN(K_NN + σ_n²·1)⁻¹(y_N − m_N), K_NN − K_NN(K_NN + σ_n²·1)⁻¹K_NN), and simplified a bit (see also App. D).
To learn a full GP means to select "good" values for the hyperparameters θ, which parametrize a family of GP priors P^θ = GP(f | m^θ(x), K^θ(x, x′)), and for the noise level σ_n [1]. Those values are afterwards used to make predictions with the corresponding posterior Q^{θ,σ_n} from (6). In our experiments (Sect. 4) we will use the squared exponential (SE) kernel on X = R^d, K^θ(x, x′) = σ_s² exp[−½ Σ_{i=1}^d (x_i − x_i′)²/l_i²], where σ_s² is the signal variance, the l_i are the lengthscales, and we set the mean function to zero. The hyperparameters are θ ≡ (l_1², ..., l_d², σ_s²) (SE-ARD kernel [1]), or θ ≡ (l², σ_s²) if we take all lengthscales l_1 = ... = l_d ≡ l to be equal (non-ARD).
The basic idea of our method, which we call "PAC-GP", is now to learn the parameters⁴ θ and σ_n by minimizing the upper bound B(Q^{θ,σ_n}) from Eq. (5), therefore selecting the GP predictor Q^{θ,σ_n} with the best generalization performance guarantee within the scope of the PAC-Bayesian bound.
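To make the quantities above concrete, the following is a minimal numpy/scipy sketch of the kl-PAC-GP objective: the posterior moments of Eq. (6) at the training inputs, the KL divergence of Eq. (8), the inverse binary KL of Eq. (3) by bisection, and the resulting bound B(Q) of Eq. (5) for the ε-band loss. This is an illustrative re-implementation, not the authors' GPflow code; the function names (`full_gp_terms`, `kl_inv`, `pac_bayes_bound`), the zero prior mean, and the non-ARD SE kernel are our simplifying assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import norm

def se_kernel(A, B, ls=1.0, sf2=1.0):
    """Squared-exponential kernel matrix (non-ARD)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sf2 * np.exp(-0.5 * d2 / ls**2)

def full_gp_terms(K, y, sigma_n2):
    """Posterior moments at the training inputs, Eq. (6), and KL(Q||P), Eq. (8),
    for a zero-mean full GP with noise variance sigma_n2."""
    N = len(y)
    A = cho_factor(K + sigma_n2 * np.eye(N), lower=True)   # A = K_NN + sigma_n^2 I
    alpha = cho_solve(A, y)                                # A^{-1} y_N
    mean = K @ alpha                                       # posterior mean
    Ainv_K = cho_solve(A, K)                               # A^{-1} K_NN
    var = np.clip(np.diag(K - K @ Ainv_K), 1e-12, None)    # posterior variances
    logdet_A = 2 * np.sum(np.log(np.diag(A[0])))
    kl = 0.5 * (logdet_A - N * np.log(sigma_n2)            # Eq. (8), zero mean
                - np.trace(Ainv_K) + alpha @ (K @ alpha))
    return mean, var, kl

def gibbs_risk(mean, var, y, eps):
    """Empirical Gibbs risk R_S(Q), Eq. (1), for the eps-band loss: the expectation
    over h ~ Q is a one-dimensional Gaussian integral with a closed form."""
    s = np.sqrt(var)
    p_inside = norm.cdf((y + eps - mean) / s) - norm.cdf((y - eps - mean) / s)
    return float(np.mean(1.0 - p_inside))

def kl_inv(q, c, tol=1e-9):
    """Upper inverse kl^{-1}(q, c) of the binary KL divergence, Eq. (3), by bisection;
    kl(q||p) is increasing in p on [q, 1]."""
    def kl_bin(p):
        qc = np.clip(q, 1e-12, 1 - 1e-12)
        pc = np.clip(p, 1e-12, 1 - 1e-12)
        return qc * np.log(qc / pc) + (1 - qc) * np.log((1 - qc) / (1 - pc))
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bin(mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_bound(K, y, sigma_n2, eps, delta=0.01, log_Theta=0.0):
    """Training objective B(Q) of Eq. (5); log_Theta = ln|Theta| is the penalty
    for the discretized hyperparameter grid."""
    N = len(y)
    mean, var, kl = full_gp_terms(K, y, sigma_n2)
    rs = gibbs_risk(mean, var, y, eps)
    c = (kl + log_Theta + np.log(2 * np.sqrt(N) / delta)) / N
    return kl_inv(rs, c)
```

Minimizing `pac_bayes_bound` over (l, σ_s², σ_n²) with any optimizer then mimics the PAC-GP training loop; the paper instead uses gradient-based optimization, for which the derivatives of kl⁻¹ are given in App. A.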
Note that all involved terms R_S(Q^{θ,σ_n}) (App. C) and KL(Q^{θ,σ_n}‖P^θ) from (8), as well as their derivatives (App. A), can be computed effectively, so we can use gradient-based optimization.

³In fact, direct computation [1] gives P(f | f_N) = GP( f | m(x) + k_N(x) K_NN⁻¹ (f_N − m_N), K(x, x′) − k_N(x) K_NN⁻¹ k_N(x′)ᵀ ) = Q(f | f_N). Remarkably, Q(f | f_N) does not depend on y_N nor on σ_n, even though Q(f) from (6) does. Intuitively this is because, for the above likelihood, f is independent of y_N given f_N.
⁴Contrary to the usual GP viewpoint [1], σ_n is not a hyperparameter in our method since the prior P^θ does not depend on σ_n. Thus, σ_n does also not contribute to the "penalty term" ln|Θ|. σ_n is merely a free parameter in the posterior distribution Q^{θ,σ_n}. By (8), KL(Q^{θ,σ_n}‖P^θ) → ∞ as σ_n → 0, so we need this parameter σ_n² > 0, because otherwise KL = ∞ and the bound as well as the optimization objective would become trivial. Although the parameter σ_n² is originally motivated by a Gaussian observation noise assumption, the aim here is merely to parameterize the posterior in some way while maintaining computational tractability; cf. also Sect. 3.2.

The only remaining issue is that the learned prior hyperparameters θ have to come from a discrete set Θ that must be specified before seeing the training set S (Sect. 2.2). To achieve this, we first minimize the RHS of Eq. (5) over θ and σ_n² in a gradient-based manner, and thereafter discretize each of the components of ln θ to the closest point in the equispaced (G+1)-element set {−L, −L + 2L/G, ..., +L}; thus, when T denotes the number of components of θ, the penalty term to be used in the optimization objective (5) is ln(1/p_θ) = ln|Θ| = T ln(G+1). The SE-ARD kernel has T = d + 1, while the standard SE kernel has T = 2 parameters. In our experiments we round each component of ln θ to two decimal digits in the range [−6, +6], i.e. L = 6, G = 1200. We found that this discretization has virtually no effect on the predictions of Q^{θ,σ_n}, and that coarser rounding (i.e. smaller |Θ|) does not significantly improve the bound (5) (via its smaller penalty term ln|Θ|) nor the optimization (via its higher sensitivity to Q); see App. F.

3.2 Learning sparse GPs

Despite the fact that, with confidence 1 − δ, the bound in (5) holds for any P^θ from the prior GP family and for any distribution Q, we optimized in Sect. 3.1 the upper bound merely over the parameters θ, σ_n after substituting P^θ and the corresponding Q^{θ,σ_n} from (6). We are limited by the need to compute KL(Q‖P) effectively, for which we relied on the property Q(f | f_N) = P(f | f_N) and the Gaussianity of P(f_N) and Q(f_N), cf. (7). Building on these two requirements, we now construct more general pairs P, Q of GPs with effectively computable KL(Q‖P), so that our learning method becomes more widely applicable, including sparse GP methods.
Instead of the points x_1, ..., x_N associated with the training set S as in Sect. 3.1, one may choose from the input space any number M of points Z = {z_1, ..., z_M} ⊆ X, often called inducing inputs, and any Gaussian distribution Q(f_M) = N(f_M | a_M, B_MM) on the function values f_M := (f(z_1), ..., f(z_M)), with any a_M ∈ R^M and positive semidefinite matrix B_MM ∈ R^{M×M}. The distribution Q on f_M can be extended to all function values at all inputs X using the conditional Q(f | f_M) = P(f | f_M) from the prior P (see Sect. 3.1).
This yields the following predictive GP:

    Q(f) = GP( f | m(x) + k_M(x) K_MM⁻¹ (a_M − m_M),
               K(x, x′) − k_M(x) K_MM⁻¹ [K_MM − B_MM] K_MM⁻¹ k_M(x′)ᵀ ),   (9)

where K_MM := (K(z_i, z_j))_{i,j=1}^M, k_M(x) := (K(x, z_1), ..., K(x, z_M)), and m_M := (m(z_1), ..., m(z_M)). This form of Q includes several approximate posteriors from Bayesian inference that have been used in the literature [1, 10, 6, 26], even for noise models other than the Gaussian one used to motivate the Q from Sect. 3.1. Analogous reasoning as in (7) now gives [10, 6, 23]:

    KL(Q‖P) = KL(Q(f_M) ‖ P(f_M)) = −½ ln det[B_MM K_MM⁻¹] + ½ tr[B_MM K_MM⁻¹] − M/2
              + ½ (a_M − m_M)ᵀ K_MM⁻¹ (a_M − m_M).   (10)

One can thus effectively optimize in (5) the prior P^θ and the posterior distribution Q^{θ,{z_i},a_M,B_MM} by varying the number M and locations z_1, ..., z_M of inducing inputs and the parameters a_M and B_MM, along with the hyperparameters θ. Optimization can in this framework be organized such that it consumes time O(NM² + M³) per gradient step and memory O(NM + M²), as opposed to O(N³) and O(N²) for the full GP of Sect. 3.1. This is a big saving when M ≪ N and justifies the name "sparse GP" [1, 24].
Some popular sparse-GP methods [24] are special cases of the above form, by prescribing certain a_M and B_MM depending on the training set S, so that only the inducing inputs z_1, ..., z_M and a few parameters such as σ_n² are left free:

    a_M = K_MM Q_MM⁻¹ K_MN (αΛ + σ_n²·1)⁻¹ y_N,    B_MM = K_MM Q_MM⁻¹ K_MM,   (11)

where Q_MM = K_MM + K_MN (αΛ + σ_n²·1)⁻¹ K_NM with K_MN := (K(z_i, x_j))_{i,j=1}^{M,N}, K_NM = K_MNᵀ, and Λ = diag(λ_1, ..., λ_N) is a diagonal N × N matrix with entries λ_i = K(x_i, x_i) − k_M(x_i) K_MM⁻¹ k_M(x_i)ᵀ. Setting α = 1 corresponds to the FITC approximation [15], whereas α = 0 is the VFE and DTC method [6, 16] (see App. D for their training objectives); one can also linearly interpolate between both choices with α ≥ 0 [25]. Another form of sparse GPs, where the latent function values f_M are fixed and not marginalized over, corresponds to B_MM = 0, which however gives diverging KL(Q‖P) = ∞ via (10) and therefore trivial bounds in (4)–(5).
Our learning method for sparse GPs ("PAC-SGP") now follows similar steps as in Sect. 3.1: One has to include a penalty ln(1/p_θ) = ln|Θ| for the prior hyperparameters θ, which are to be discretized into the set Θ after the optimization of (5). Note, θ contains the prior hyperparameters only and not the inducing points z_1, ..., z_M nor a_M, B_MM, σ_n, or α from (11); all these quantities can be optimized over simultaneously with θ, but do not need to be discretized. The number M of inducing inputs can also be varied, which determines the required computational effort, and all optimizations can be both discrete [16] or continuous [15, 6].

Figure 1: Predictive distributions. The predictive distributions (mean ±2σ as shaded area) of our kl-PAC-SGP (blue) are shown for various choices of ε together with the full-GP's prediction (red). (Note that by Eqs. (9), (11), kl-PAC-SGP's predictive variance does not include the additive σ_n², whereas full-GP's does [1].) The shaded green area visualizes an ε-band, centered around the kl-PAC-SGP's predictive mean; datapoints (black dots) inside this band do not contribute to the risk R_S(Q). Crosses above/below the plots indicate the inducing point positions (M = 15) before/after training.
When optimizing over positive B_MM, the parametrization B_MM = LLᵀ with a lower triangular matrix L ∈ R^{M×M} can be used [26]. For the experiments below we always employ the FITC parametrization (fixed α = 1) in our proposed PAC-SGP method, i.e. our optimization parameters are σ_n² and {z_i} besides the lengthscale hyperparameters θ.

4 Experiments⁵

We now illustrate our learning method and compare it with other GP methods on various regression tasks. In contrast to prior work [14], we found the gradient-based training with the objective (5) to be robust enough that no pretraining with conventional objectives (such as from App. D) is necessary. We set δ = 0.01 throughout [10, 14], cf. Sect. 2.2, and use (unless specified otherwise) the generic bounded loss function ℓ(y, ŷ) = 1_{ŷ ∉ [y−ε, y+ε]} for regression, with accuracy goal ε > 0 as specified below.
We evaluate the following methods: (a) PAC-GP: our proposed method (cf. Sect. 3.1) with the training objective B(Q) (5) (kl-PAC-GP) and, for comparison, with the looser training objective B_Pin (sqrt-PAC-GP) (see below (5), similar to e.g. [14]); (b) PAC-SGP: our sparse GP method (Sect. 3.2), again with objectives B(Q) (kl-PAC-SGP) and B_Pin(Q) (sqrt-PAC-SGP), respectively; (c) full-GP: the ordinary full GP for regression [1]; (d) VFE: Titsias' sparse GP [6]; (e) FITC: Snelson-Ghahramani's sparse GP [15]. Note that full-GP, VFE, and FITC as well as sqrt-PAC-GP and sqrt-PAC-SGP are trained on other objectives (see App. D), and we will evaluate the upper bound (5) on their generalization performance by evaluating KL(Q‖P) via (8) or (10). To obtain finite generalization bounds, we discretize θ for all methods at the end of training as in Sect. 3.1 and use the appropriate ln(1/p_θ) = ln|Θ| in (5).

(a) Predictive distribution.
[Figure 1 panels, left to right: kl-PAC-SGP (ε = 0.5σ_n = 0.14), kl-PAC-SGP (ε = 2σ_n), kl-PAC-SGP (ε = 5σ_n)]

⁵Python code (building on GPflow [27] and TensorFlow [28]) implementing our method is available at https://github.com/boschresearch/PAC_GP.

Figure 2: Dependence on the accuracy goal ε. For each ε, the plots from left to right show (means as bars, standard errors after ten iterations as grey ticks) the upper bound B(Q) from Eq. (5), the Gibbs training risk R_S(Q), the Gibbs test risk as a proxy for the true R(Q), MSE, and KL(Q‖P^θ)/N, after learning Q on the dataset boston housing by three different methods: our kl-PAC-GP method from Sect. 3.1, sqrt-PAC-GP, and the ordinary full-GP.

To get a first intuition, we illustrate in Fig. 1 the effect of varying ε in the loss function on the predictive distribution of our sparse PAC-SGP. The accuracy goal ε defines a band around the predictive mean within which data points do not contribute to the empirical risk R_S(Q). We thus chose the accuracy goal ε relative to the observation noise σ_n obtained from an ordinary full-GP. Results are presented on the 1D toy dataset⁶ from the original FITC [15] and VFE [6] publications (for a comparison to the predictive distributions of FITC and VFE see App. E, which also contains an illustration that our kl-PAC-SGP avoids FITC's known overfitting on pathological datasets). Here and below, we optimize the hyperparameters in each experiment anew.
We find that for large ε (right plot) the predictive distribution (blue) becomes smoother: due to the wider ε-band (green), the PAC-SGP does not need to adapt much to the data for the ε-band to contain many data points. Hence the predictive distribution can remain closer to the prior, which reduces the KL-term in the objective (5).
For the same reason, the inducing points need not adapt much compared to their initial positions for large ε. For smaller ε, the PAC-SGP adapts more to the data, whereas for very small ε (left plot), it is anyhow not possible to place many data points within the narrow ε-band, so the predictive distribution can again be closer to the prior (compare e.g. in the first and second plots the blue curves near the rightmost datapoints) for a smaller KL-term. In particular, the KL-divergence (divided by the number of training points) for the three settings in Fig. 1 is: 0.097 (left), 0.109 (middle), and 0.031 (right).

(b) Full-GP experiments – dependence on the accuracy goal ε. To explore the dependence on the desired accuracy ε further, we compare in Fig. 2 the ordinary full-GP to our PAC-GPs on the boston housing dataset⁷. As pre-processing we normalized all features and the output to mean zero and unit variance, then analysed the impact of the accuracy goal ε ∈ {0.2, 0.4, 0.6, 0.8, 1.0}. We used 80% of the dataset for training and 20% for testing, in ten repetitions of the experiment.
Our PAC-GP yields significantly better generalization guarantees for all accuracy goals ε compared to full-GP, since we are directly optimizing the bound (5). This effect is stronger for large ε, where the KL-term of PAC-GP can decrease as Q may again remain closer to P while keeping the training loss low. Although better bounds do not necessarily imply better Gibbs test risk, kl-PAC-GP performs only slightly below the ordinary full-GP in this regard. Moreover, our PAC-GPs exhibit less overfitting than the full-GP, for which the training risks are significantly smaller than the test risks (see Table 1 in App. G for numerical values).
On the other hand, the tighter objective (5) in the kl-PAC-GP allows learning a slightly more complex GP Q in terms of the KL-divergence compared to the sqrt-PAC-GP, which results in better test risks and at the same time better guarantees. This confirms that kl-PAC-GP is always preferable to sqrt-PAC-GP. However, as any prediction within the full ±ε-band around the ground truth incurs no risk for our PAC-GPs, their mean squared error (MSE) increases with ε.
The fact that our learned PAC-GPs exhibit higher training and test errors (Gibbs risk and esp. MSE) than full-GP can be explained by their underfitting in order to hedge against violating Eq. (5) (i.e. Theorem 1). This underfitting is evidenced by PAC-GP's significantly less complex learned posterior Q as measured by KL(Q‖P^θ)/N (Fig. 2), or similarly (via Eqs. (8), (10)), by its larger learned noise variance σ_n² compared to full-GP's (Table 1 in App. G). It is exactly this stronger regularization of PAC-GP in terms of the KL divergence that leads to its better generalization guarantees.
In the following, we will fix ε = 0.6 after pre-processing data as above, to illustrate PAC-GP further. Note however that in a concrete application, ε should be fixed to a desired accuracy goal using domain

⁶snelson: dimensions 200 × 1, available at www.gatsby.ucl.ac.uk/~snelson.
⁷boston: dimensions 506 × 13, available at http://lib.stat.cmu.edu/datasets/boston

Figure 3: Dependence on the number of inducing variables.
Shown is the average (± standard error over 10 repetitions) upper bound B, Gibbs training risk R_S, Gibbs test risk, MSE, and KL(Q‖P_θ)/N as a function of the number M of inducing inputs (from left to right). We compare our sparse kl-PAC-SGP (Sect. 3.2) with the two popular GP approximations VFE and FITC. Each row corresponds to one dataset: pol (top), sarcos (middle) and kin40k (bottom). kl-PAC-SGP has the best guarantee in all settings (left column), due to its lower model complexity (right column), but this comes at the price of slightly larger test errors.

knowledge, but before seeing the training set S. Alternatively, one can consider a set of ε-values ε_1, ..., ε_E chosen in advance, at the cost of a term ln E in addition to ln(1/p_θ) in the objective (5).

(c) Sparse-GP experiments – dependence on the number of inducing inputs M. We now examine our sparse PAC-SGP method (Sect. 3.2) on the three large data sets pol, sarcos, and kin40k8, again using 80%–20% train-test splits and ten repetitions. The results are shown in Fig. 3. Here, we vary the number of inducing points M ∈ {100, 200, 300, 400, 500}. For modelling pol and kin40k, we use the SE-ARD kernel due to its better performance on these datasets, whereas we model sarcos without ARD (cf. Table 2 in App. G for the comparison ARD vs. non-ARD). The corresponding penalty terms for the three plots are (1/N) ln|Θ| = 0.0160, 0.0003, 0.0020 and (1/N) ln(2√N/δ) = 0.0008, 0.0003, 0.0003; when compared to KL(Q‖P)/N from Fig. 3, their contribution is largest for the pol dataset.
Our kl-PAC-SGP achieves significantly better upper bounds than VFE and FITC, by more than a factor of 3 on sarcos, a factor of roughly 2 on pol, and a factor between 1.3 and 2 on kin40k (Fig. 3, cf. also Table 2 in App. G). Also, the PAC-Bayes upper bound is much tighter for kl-PAC-SGP than for VFE or FITC, i.e.
closer to the Gibbs risk, often by factors exceeding 3. Our kl-PAC-SGP also behaves more favorably in terms of the generalization guarantee when inducing points are added and more complex models are allowed: our upper bound improves substantially with M (kin40k) or at least does not degrade (pol and sarcos), as opposed to VFE and FITC, whose complexities KL/N grow substantially with M. Since very low training risks can already be achieved with a moderate number of inducing points for pol and sarcos, a growing KL with M deteriorates the upper bound. Regarding the upper bound, the increased flexibility from larger M thus only pays off for the kin40k dataset, whereas the MSE improves with increasing M for all models and datasets. As above, kl-PAC-SGP is always slightly preferable to sqrt-PAC-SGP, not only for the upper bound and Gibbs risks, as expected, but also for MSE (see Table 2 in App. G).

8pol: 15,000 × 26, kin40k: 40,000 × 8 (both from https://github.com/trungngv/fgp.git); sarcos: 48,933 × 21 (http://www.gaussianprocess.org/gpml/data)

Similarly to the boston dataset, the higher test errors of kl-PAC-SGP compared to VFE and FITC can be explained by underfitting due to the stronger regularization, again shown by the lower KL and the significantly larger learned σ_n^2 (by
factors of 4–28 compared to VFE), cf. Table 2 in App. G. In fact, although our implementation of PAC-SGP employs the FITC parametrization, the PAC-(S)GP optimization is not prone to FITC's well-known overfitting tendency [2], due to the regularization via the KL-divergence (see App. E, and in particular Supplementary Figure 7).
To investigate whether the higher test MSE of PAC-GP compared to VFE and FITC (and the full-GP above) is a consequence of the 0-1-loss ℓ(y, ŷ) = 1[ŷ ∉ [y−ε, y+ε]] used so far, we re-ran the PAC-SGP experiments for M = 500 inducing inputs with the more distance-sensitive loss function ℓ_exp(y, ŷ) = 1 − exp[−((y − ŷ)/ε)^2] (Eq. (20)), which is MSE-like for small deviations |y − ŷ| ≲ ε, i.e. ℓ_exp(y, ŷ) ≈ (y − ŷ)^2/ε^2 (Supplementary Figure 5). Our results are tabulated in Table 3 in App. G. The findings are inconclusive and range from an improvement w.r.t. MSE of 25% (pol) over little change (sarcos) to a decline of 12% (kin40k), showing that the effect of the loss function is smaller than might have been expected. Nevertheless, the generalization guarantees of PAC-SGP remain much better than those of the other methods. While the MSE of our PAC-GPs would improve when choosing a smaller ε (e.g., Fig. 2), this comes at the disadvantage of worse generalization bounds.
We further note that no method shows significant overfitting in Fig. 3, in the sense that the differences between test and training Gibbs risks are all rather small, despite the KL-complexity increasing with M for VFE and FITC. This is unlike for Boston housing above, and may be due to the much larger training sets here.
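For concreteness, the two bounded losses compared above can be sketched as follows; this is a minimal illustration with our own helper names, not code from the paper:

```python
import math

def loss_01(y, y_hat, eps):
    """0-1 loss: a prediction anywhere in the ±eps band around y incurs no loss."""
    return 0.0 if abs(y - y_hat) <= eps else 1.0

def loss_exp(y, y_hat, eps):
    """Bounded, distance-sensitive loss: 1 - exp[-((y - y_hat)/eps)^2]."""
    return 1.0 - math.exp(-((y - y_hat) / eps) ** 2)
```

For |y − ŷ| ≪ ε, a first-order expansion of the exponential gives ℓ_exp(y, ŷ) ≈ (y − ŷ)^2/ε^2, i.e. a rescaled squared error, while both losses remain bounded by 1, which is what the PAC-Bayesian guarantees require.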
When comparing VFE and FITC, we observe that VFE consistently outperforms FITC in terms of both MSE and generalization guarantee, as VFE's higher KL-complexity is offset by its much lower Gibbs risk. This corroborates the results in [2]. We lastly note that, since for our PAC-SGP the obtained guarantees B are much smaller than 1/2, we obtain strong guarantees even on the Bayes risk, R_Bay ≤ 2B < 1 (Sect. 2.1).

5 Conclusion

In this paper, we proposed and explored the use of PAC-Bayesian bounds as an optimization objective for GP training. Consequently, we were able to achieve significantly better guarantees on the out-of-sample performance compared to state-of-the-art GP methods, such as VFE or FITC, while maintaining computational scalability. We further found that using the tighter generalization bound B(Q) (5), based on the inverse binary kl-divergence, leads to an increase in performance on all metrics compared to the looser bound BPin employed in previous works (e.g. [14]).
Despite the much better generalization guarantees obtained by our method, it often yields worse test error, in particular test MSE, than standard GP regression methods; this largely persists even when using more distance-sensitive loss functions than the 0-1-loss. The underlying reason could be that all loss functions considered in this work are bounded, as necessitated by our desire to provide generalization guarantees irrespective of the true data distribution. While rigorous PAC-Bayesian bounds exist for MSE-like unbounded loss functions under special assumptions on the data distribution [17], it may nevertheless be worthwhile to investigate whether such training objectives lead to better test MSE in examples. A drawback is that those assumptions are usually impossible to verify, so the resulting generalization guarantees are not comparable.
Note that the design of a loss function depends on the application domain and there is no ubiquitous choice across all settings. In many safety-critical applications, small deviations are tolerable whereas larger deviations are all equally catastrophic, so a 0-1-loss like ours, together with a rigorous bound on it, can be more useful than the MSE test error.
While in this work we focussed on regression tasks, the same strategy of optimizing a generalization bound can also be applied to learn GPs for binary and categorical outputs. Note that the true KL-term in this setting has so far merely been upper bounded by its regression proxy [10], and it would be interesting to develop better bounds on the classification complexity term. Lastly, it may be worthwhile to use other or more general sparse GPs within our PAC-Bayesian learning method, such as free-form [26] or even more general GPs [29].

Acknowledgments

We would like to thank Duy Nguyen-Tuong, Martin Schiegg, and Michael Schober for helpful discussions and proofreading.

References

[1] C. E. Rasmussen, C. K. I. Williams, “Gaussian Processes for Machine Learning”, The MIT Press (2006).
[2] M. Bauer, M. v. d. Wilk, C. E. Rasmussen, “Understanding Probabilistic Sparse Gaussian Process Approximations”, In NIPS (2016).
[3] S. Shalev-Shwartz, S. Ben-David, “Understanding Machine Learning: From Theory to Algorithms”, Cambridge University Press (2014).
[4] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, “Understanding deep learning requires rethinking generalization”, In ICLR (2017).
[5] M. Jordan, Z. Ghahramani, T. Jaakkola, L. Saul, “Introduction to variational methods for graphical models”, Machine Learning 37, 183-233 (1999).
[6] M. Titsias, “Variational Learning of Inducing Variables in Sparse Gaussian Processes”, In AISTATS (2009).
[7] D.
McAllester, “PAC-Bayesian model averaging”, In COLT (1999).
[8] D. McAllester, “PAC-Bayesian Stochastic Model Selection”, Machine Learning 51, 5-21 (2003).
[9] O. Catoni, “PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning”, IMS Lecture Notes Monograph Series, Vol. 56 (2007).
[10] M. Seeger, “PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification”, Journal of Machine Learning Research 3, 233-269 (2002).
[11] A. Ambroladze, E. Parrado-Hernández, J. Shawe-Taylor, “Tighter PAC-Bayes bounds”, In NIPS (2007).
[12] J. Langford, J. Shawe-Taylor, “PAC-Bayes & margins”, In NIPS (2002).
[13] P. Germain, A. Lacasse, F. Laviolette, M. Marchand, “PAC-Bayesian Learning of Linear Classifiers”, In ICML (2009).
[14] G. K. Dziugaite, D. M. Roy, “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data”, In UAI (2017).
[15] E. Snelson, Z. Ghahramani, “Sparse Gaussian Processes using Pseudo-inputs”, In NIPS (2005).
[16] M. Seeger, C. K. I. Williams, N. Lawrence, “Fast Forward Selection to Speed Up Sparse Gaussian Process Regression”, In AISTATS (2003).
[17] P. Germain, F. Bach, A. Lacoste, S. Lacoste-Julien, “PAC-Bayesian Theory Meets Bayesian Inference”, In NIPS (2016).
[18] R. Sheth, R. Khardon, “Excess Risk Bounds for the Bayes Risk using Variational Inference in Latent Gaussian Models”, In NIPS (2017).
[19] V. Vapnik, “The Nature of Statistical Learning Theory”, Springer (1995).
[20] A. Maurer, “A Note on the PAC Bayesian Theorem”, arXiv:cs/0411099 (2004).
[21] L. Bégin, P. Germain, F. Laviolette, J.-F. Roy, “PAC-Bayesian Bounds based on the Rényi Divergence”, In AISTATS (2016).
[22] S. Kullback, R.
Leibler, “On information and sufficiency”, Annals of Mathematical Statistics 22, 79-86 (1951).
[23] A. G. de G. Matthews, J. Hensman, R. Turner, Z. Ghahramani, “On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes”, In AISTATS (2016).
[24] J. Quiñonero-Candela, C. E. Rasmussen, “A Unifying View of Sparse Approximate Gaussian Process Regression”, Journal of Machine Learning Research 6, 1939-1959 (2005).
[25] T. D. Bui, J. Yan, R. E. Turner, “A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation”, Journal of Machine Learning Research 18, 1-72 (2017).
[26] J. Hensman, A. Matthews, Z. Ghahramani, “Scalable Variational Gaussian Process Classification”, In AISTATS (2015).
[27] A. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, J. Hensman, “GPflow: A Gaussian process library using TensorFlow”, Journal of Machine Learning Research 18, 1-6 (2017).
[28] M. Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems”, https://www.tensorflow.org/ (2015).
[29] C.-A. Cheng, B. Boots, “Variational Inference for Gaussian Process Models with Linear Complexity”, In NIPS (2017).