{"title": "Understanding Probabilistic Sparse Gaussian Process Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 1533, "page_last": 1541, "abstract": "Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.", "full_text": "Understanding Probabilistic Sparse\nGaussian Process Approximations\n\nMatthias Bauer\u2020\u2021\n\nMark van der Wilk\u2020\n\nCarl Edward Rasmussen\u2020\n\n\u2020Department of Engineering, University of Cambridge, Cambridge, UK\n\n\u2021Max Planck Institute for Intelligent Systems, T\u00a8ubingen, Germany\n\n{msb55, mv310, cer54}@cam.ac.uk\n\nAbstract\n\nGood sparse approximations are essential for practical inference in Gaussian\nProcesses as the computational cost of exact methods is prohibitive for large\ndatasets. The Fully Independent Training Conditional (FITC) and the Variational\nFree Energy (VFE) approximations are two recent popular methods. Despite\nsuper\ufb01cial similarities, these approximations have surprisingly different theoretical\nproperties and behave differently in practice. We thoroughly investigate the two\nmethods for regression both analytically and through illustrative examples, and\ndraw conclusions to guide practical application.\n\n1\n\nIntroduction\n\nGaussian Processes (GPs) [1] are a \ufb02exible class of probabilistic models. 
Perhaps the most prominent practical limitation of GPs is that the computational requirement of an exact implementation scales as O(N³) time and O(N²) memory, where N is the number of training cases. Fortunately, recent progress has been made in developing sparse approximations, which retain the favourable properties of GPs but at a lower computational cost, typically O(NM²) time and O(NM) memory for some chosen M < N. All sparse approximations rely on focussing inference on a small number of quantities, which represent approximately the entire posterior over functions. These quantities can be chosen differently, e.g., function values at certain input locations, properties of the spectral representations [2], or more abstract representations [3]. Similar ideas are used in random feature expansions [4, 5].\n\nHere we focus on methods that represent the approximate posterior using the function values at a set of M inducing inputs (also known as pseudo-inputs). These methods include the Deterministic Training Conditional (DTC) [6] and the Fully Independent Training Conditional (FITC) [7], see [8] for a review, as well as the Variational Free Energy (VFE) approximation [9]. The methods differ both in terms of the theoretical approach in deriving the approximation, and in terms of how the inducing inputs are handled. Broadly speaking, inducing inputs can either be chosen from the training set (e.g. at random) or be optimised over. In this paper we consider the latter, as this will generally allow for the best trade-off between accuracy and computational requirements. Training the GP entails jointly optimising over inducing inputs and hyperparameters.\n\nIn this work, we aim to thoroughly investigate and characterise the difference in behaviour of the FITC and VFE approximations. We investigate the biases of the bounds when learning hyperparameters, where each method allocates its modelling capacity, and the optimisation behaviour. 
In Section 2 we briefly introduce inducing point methods and state the two algorithms using a unifying notation. In Section 3 we discuss properties of the two approaches, both theoretical and practical. Our aim is to understand the approximations in detail in order to know under which conditions each method is likely to succeed or fail in practice. We highlight issues that may arise in practical situations and how to diagnose and possibly avoid them. Some of the properties of the methods have been previously reported in the literature; our aim here is a more complete and comparative approach. We draw conclusions in Section 4.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n2 Sparse Gaussian Processes\n\nA Gaussian Process is a flexible distribution over functions, with many useful analytical properties. It is fully determined by its mean m(x) and covariance k(x, x') functions. We assume the mean to be zero, without loss of generality. The covariance function determines properties of the functions, like smoothness, amplitude, etc. A finite collection of function values at inputs {xi} follows a Gaussian distribution N(f; 0, Kff), where [Kff]ij = k(xi, xj).\n\nHere we revisit the GP model for regression [1]. We model the function of interest f(·) using a GP prior, and noisy observations at the input locations X = {xi}i are observed in the vector y:\n\np(f) = N(f; 0, Kff),    p(y|f) = ∏_{n=1}^{N} N(yn; fn, σn²).    (1)\n\nThroughout, we employ a squared exponential covariance function k(x, x') = sf² exp(−|x − x'|²/(2ℓ²)), but our results only rely on the decay of covariances with distance. The hyperparameter θ contains the signal variance sf², the lengthscale ℓ and the noise variance σn², and is suppressed in the notation.\n\nTo make predictions, we follow the common approach of first determining θ by optimising the marginal likelihood and then marginalising over the posterior of f:\n\nθ* = argmax_θ p(y|θ),    p(y*|y) = p(y*, y)/p(y) = ∫ p(y*|f*) p(f*|f) p(f|y) df df*.    (2)\n\nWhile the marginal likelihood, the posterior and the predictive distribution all have closed-form Gaussian expressions, the cost of evaluating them scales as O(N³) due to the inversion of Kff + σn²I, which is impractical for many datasets.\n\nOver the years, the two inducing point methods that have remained most influential are FITC [7] and VFE [9]. Unlike previously proposed methods (see [6, 10, 8]), both FITC and VFE provide an approximation to the marginal likelihood which allows both the hyperparameters and inducing inputs to be learned from the data through gradient based optimisation. Both methods rely on the low rank matrix Qff = Kfu Kuu⁻¹ Kuf instead of the full rank Kff to reduce the size of any matrix inversion to M. Note that for most covariance functions, the eigenvalues of Kuu are not bounded away from zero. Any practical implementation will have to address this to avoid numerical instability. We follow the common practice of adding a tiny diagonal jitter term εI to Kuu before inverting.\n\n2.1 Fully Independent Training Conditional (FITC)\n\nOver the years, FITC has been formulated in several different ways. A form of FITC first appeared in an online learning setting by Csató and Opper [11], derived from the viewpoint of approximating the full GP posterior. 
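As a concrete illustration of the exact GP regression equations above (Eq. (1), Eq. (2)) and of the O(N³) cost they incur, the following NumPy sketch implements the closed-form expressions. This is our own illustrative code under the paper's squared exponential kernel, not the authors' implementation; all function names are ours.

```python
import numpy as np

def se_kernel(X1, X2, s_f=1.0, ell=1.0):
    # Squared exponential covariance k(x, x') = s_f^2 exp(-|x - x'|^2 / (2 ell^2))
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return s_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_nlml(X, y, s_f, ell, s_n):
    # Exact negative log marginal likelihood -log p(y | theta); the Cholesky
    # factorisation of the N x N matrix Kff + s_n^2 I is the O(N^3) bottleneck.
    N = X.shape[0]
    K = se_kernel(X, X, s_f, ell) + s_n ** 2 * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * N * np.log(2 * np.pi)

def gp_predict(X, y, Xstar, s_f, ell, s_n):
    # Posterior predictive mean and variance of y* at test inputs (cf. Eq. (2)).
    K = se_kernel(X, X, s_f, ell) + s_n ** 2 * np.eye(X.shape[0])
    Ks = se_kernel(X, Xstar, s_f, ell)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = s_f ** 2 - np.sum(v ** 2, axis=0) + s_n ** 2
    return mean, var
```

Minimising `gp_nlml` over (s_f, ell, s_n) corresponds to the argmax step in Eq. (2); the sparse methods below replace the O(N³) Cholesky with an O(NM²) computation.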
Snelson and Ghahramani [7] introduced FITC as approximate inference in a model with a modified likelihood and proposed using its marginal likelihood to train the hyperparameters and inducing inputs jointly. An alternate interpretation where the prior is modified, but exact inference is performed, was presented in [8], unifying it with other techniques. The latest interesting development came with the connection that FITC can be obtained by approximating the GP posterior using Expectation Propagation (EP) [12, 13, 14].\n\nUsing the interpretation of modifying the prior to\n\np(f) = N(f; 0, Qff + diag[Kff − Qff])    (3)\n\nwe obtain the objective function in Eq. (5). We would like to stress, however, that this modification gives exactly the same procedure as approximating the full GP posterior with EP. Regardless of the fact that FITC can be seen as a completely different model, we aim to characterise it as an approximation to the full GP.\n\n2.2 Variational Free Energy (VFE)\n\nVariational inference can also be used to approximate the true posterior. We follow the derivation by Titsias [9] and bound the marginal likelihood by instantiating extra function values on the latent Gaussian process u at locations Z,1 followed by lower bounding the marginal likelihood. To ensure efficient calculation, q(u, f) is chosen to factorise as q(u) p(f|u). This removes terms with Kff⁻¹:\n\nlog p(y) ≥ ∫ q(u, f) log [p(y|f) p(f|u) p(u)] / [p(f|u) q(u)] du df,    (4)\n\nwhere the two p(f|u) factors cancel. The optimal q(u) can be found by variational calculus, resulting in the lower bound in Eq. (5).\n\n2.3 Common notation\n\nThe objective functions for both VFE and FITC look very similar. 
In the following discussion we will refer to a common notation of their negative log marginal likelihood (NLML) F, which will be minimised to train the methods:\n\nF = (N/2) log(2π) + ½ log|Qff + G| + ½ yᵀ(Qff + G)⁻¹y + (1/(2σn²)) tr(T),    (5)\n\nwhere the second term is the complexity penalty, the third is the data fit term, the fourth is the trace term, and\n\nG_FITC = diag[Kff − Qff] + σn²I,    T_FITC = 0,    (6)\nG_VFE = σn²I,    T_VFE = Kff − Qff.    (7)\n\nThe common objective function has three terms, of which the data fit and complexity penalty have direct analogues to the full GP. The data fit term penalises the data lying outside the covariance ellipse Qff + G. The complexity penalty is the integral of the data fit term over all possible observations y. It characterises the volume of possible datasets that are compatible with the data fit term. This can be seen as the mechanism of Occam's razor [16], by penalising the methods for being able to predict too many datasets. The trace term in VFE ensures that the objective function is a true lower bound to the marginal likelihood of the full GP. Without this term, VFE is identical to the earlier DTC approximation [6] which can grossly over-estimate the marginal likelihood. The trace term penalises the sum of the conditional variances at the training inputs, conditioned on the inducing inputs [17]. Intuitively, it ensures that VFE not only models this specific dataset y well, but also approximates the covariance structure of the full GP Kff.\n\n3 Comparative behaviour\n\nAs our main test case we use the one dimensional dataset2 considered in [7, 9] with 200 input-output pairs. 
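To make the common objective of Eqs. (5)–(7) concrete, here is a minimal NumPy sketch of F for both methods. This is our own illustrative code, not the authors' implementation; it solves only M × M systems via the Woodbury and matrix determinant lemmas, and adds the jitter εI to Kuu described in Section 2.

```python
import numpy as np

def se_kernel(X1, X2, s_f=1.0, ell=1.0):
    # k(x, x') = s_f^2 exp(-|x - x'|^2 / (2 ell^2))
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return s_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def sparse_nlml(X, y, Z, s_f, ell, s_n, method="VFE", jitter=1e-6):
    # Common objective F of Eq. (5); FITC and VFE differ only in G and T (Eqs. (6)-(7)).
    N, M = X.shape[0], Z.shape[0]
    Kff_diag = np.full(N, s_f ** 2)                    # k(x, x) for the SE kernel
    Kuf = se_kernel(Z, X, s_f, ell)
    Kuu = se_kernel(Z, Z, s_f, ell) + jitter * np.eye(M)
    A = np.linalg.solve(np.linalg.cholesky(Kuu), Kuf)  # so that Qff = A.T @ A
    Qff_diag = np.sum(A ** 2, axis=0)
    if method == "FITC":
        g = Kff_diag - Qff_diag + s_n ** 2             # diagonal of G_FITC
        trace_term = 0.0                               # T_FITC = 0
    else:                                              # VFE
        g = np.full(N, s_n ** 2)                       # diagonal of G_VFE
        trace_term = np.sum(Kff_diag - Qff_diag) / (2 * s_n ** 2)
    # log|Qff + G| and y^T (Qff + G)^{-1} y without forming any N x N matrix:
    B = np.eye(M) + (A / g) @ A.T
    LB = np.linalg.cholesky(B)
    logdet = np.sum(np.log(g)) + 2 * np.sum(np.log(np.diag(LB)))
    c = np.linalg.solve(LB, (A / g) @ y)
    quad = y @ (y / g) - c @ c
    return 0.5 * (N * np.log(2 * np.pi) + logdet + quad) + trace_term
```

With `method="VFE"` this F upper-bounds the exact NLML and never gets worse as inducing inputs are added (Remark 3 below); with `method="FITC"`, G picks up the heteroscedastic diagonal diag[Kff − Qff] discussed in Section 3.1.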
Of course, sparse methods are not necessary for this toy problem, but all of the issues we raise are illustrated nicely in this one dimensional task which can easily be plotted. In Sections 3.1 to 3.3 we illustrate issues relating to the objective functions. These properties are independent of how the method is optimised. However, whether they are encountered in practice can depend on optimiser dynamics, which we discuss in Sections 3.4 and 3.5.\n\n3.1 FITC can severely underestimate the noise variance, VFE overestimates it\n\nIn the full GP with Gaussian likelihood we assume a homoscedastic (input independent) noise model with noise variance parameter σn². It fully characterises the uncertainty left after completely learning the latent function. In this section we show how FITC can also use the diagonal term diag(Kff − Qff) in G_FITC as heteroscedastic (input dependent) noise [7] to account for these differences, thus invalidating the above interpretation of the noise variance parameter. In fact, the FITC objective function encourages underestimation of the noise variance, whereas the VFE bound encourages overestimation. The latter is in line with previously reported biases of variational methods [18].\n\nFig. 1 shows the configuration most preferred by the FITC objective for a subset of 100 data points of the Snelson dataset, found by an exhaustive manual search for a minimum over hyperparameters, inducing inputs and number of inducing points.\n\n1Matthews et al. [15] show that this procedure approximates the posterior over the entire process f correctly.\n2Obtained from http://www.gatsby.ucl.ac.uk/~snelson/\n\nThe noise variance is shrunk to practically zero, despite the mean prediction not going through every data point. Note how the mean still behaves well and how the training data lie well within the predictive variance. 
Only when considering predictive probabilities will this behaviour cause diminished performance. VFE, on the other hand, is able to approximate the posterior predictive distribution almost exactly.\n\nFigure 1: Behaviour of FITC and VFE on a subset of 100 data points of the Snelson dataset for 8 inducing inputs (red crosses indicate inducing inputs; red lines indicate mean and 2σ) compared to the prediction of the full GP in grey. Optimised values for the full GP: nlml = 34.15, σn = 0.274.\n\nFor both approximations, the complexity penalty decreases with decreased noise variance, by reducing the volume of datasets that can be explained. For a full GP and VFE this is accompanied by a data fit penalty for data points lying far away from the predictive mean. FITC, on the other hand, has an additional mechanism to avoid this penalty: its diagonal correction term diag(Kff − Qff). This term can be seen as an input dependent or heteroscedastic noise term (discussed as a modelling advantage by Snelson and Ghahramani [7]), which is zero exactly at an inducing input, and which grows to the prior variance away from an inducing input. By placing the inducing inputs near training data that happen to lie near the mean, the heteroscedastic noise term is locally shrunk, resulting in a reduced complexity penalty. Data points both far from the mean and far from inducing inputs do not incur a data fit penalty, as the heteroscedastic noise term has increased around these points. This mechanism removes the need for the homoscedastic noise to explain deviations from the mean, such that σn² can be turned down to reduce the complexity penalty further.\n\nThis explains the extreme pinching (severely reduced noise variance) observed in Fig. 1, also see, e.g., [9, Fig. 2]. In examples with more densely packed data, there may not be any places where a near-zero noise point can be placed without incurring a huge data-fit penalty. 
However, inducing inputs will be placed in places where the data happens to randomly cluster around the mean, which still results in a decreased noise estimate, albeit less extreme, see Figs. 2 and 3 where we use all 200 data points.\n\nRemark 1 FITC has an alternative mechanism to explain deviations from the learned function than the likelihood noise and will underestimate σn² as a consequence. In extreme cases, σn² can incorrectly be estimated to be almost zero.\n\nAs a consequence of this additional mechanism, σn² can no longer be interpreted in the same way as for VFE or the full GP. σn² is often interpreted as the amount of uncertainty in the dataset which can not be explained. Based on this interpretation, a low σn² is often used as an indication that the dataset is being fitted well. Active learning applications rely on a similar interpretation to differentiate between inherent noise, and uncertainty in the latent GP which can be reduced. FITC's different interpretation of σn² will cause efforts like these to fail.\n\nVFE, on the other hand, is biased towards over-estimating the noise variance, because of both the data fit and the trace term. Qff + σn²I has N − M eigenvectors with an eigenvalue of σn², since the rank of Qff is M. Any component of y in these directions will result in a larger data fit penalty than for Kff, which can only be reduced by increasing σn². The trace term can also be reduced by increasing σn².\n\nRemark 2 The VFE objective tends to over-estimate the noise variance compared to the full GP.\n\n3.2 VFE improves with additional inducing inputs, FITC may ignore them\n\nHere we investigate the behaviour of each method when more inducing inputs are added. For both methods, adding an extra inducing input gives it an extra basis function to model the data with. 
We discuss how and why VFE always improves, while FITC may deteriorate.\n\n[Figure 1 panel titles: FITC (nlml = 23.16, σn = 1.93·10⁻⁴); VFE (nlml = 38.86, σn = 0.286)]\n\nFigure 2: Top: Fits for FITC and VFE on 200 data points of the Snelson dataset for M = 7 optimised inducing inputs (black). Bottom: Change in objective function from adding an inducing input anywhere along the x-axis (no further hyperparameter optimisation performed). The overall change is decomposed into the change in the individual terms (see legend). Two particular additional inducing inputs and their effect on the predictive distribution shown in red and blue.\n\nFig. 2 shows an example of how the objective function changes when an inducing input is added anywhere in the input domain. While the change in objective function looks reasonably smooth overall, there are pronounced spikes for both FITC and VFE. These return the objective to the value without the additional inducing input and occur at the locations of existing inducing inputs. We discuss the general change first before explaining the spikes.\n\nMathematically, adding an inducing input corresponds to a rank 1 update of Qff, and can be shown to always improve VFE's bound3, see Supplement for a proof. VFE's complexity penalty increases due to an extra non-zero eigenvalue in Qff, but gains in data fit and trace.\n\nRemark 3 VFE's posterior and marginal likelihood approximation become more accurate (or remain unchanged) regardless of where a new inducing input is placed.\n\nFor FITC, the objective can change either way. Regardless of the change in objective, the heteroscedastic noise is decreased at all points (see Supplement for proof). For a squared exponential kernel, the decrease is strongest around the newly placed inducing input. This decrease has two effects. 
One, it reduces the complexity penalty since the diagonal component of Qff + G is reduced and replaced by a more strongly correlated Qff. Two, it worsens the data fit term as the heteroscedastic term is required to fit the data when the homoscedastic noise is underestimated. Fig. 2 shows reduced error bars with several data points now outside of the 95% prediction bars. Also shown is a case where an additional inducing input improves the objective, where the extra correlations outweigh the reduced heteroscedastic noise.\n\nBoth VFE and FITC exhibit pathological behaviour (spikes) when inducing inputs are clumped, that is, when they are placed exactly on top of each other. In this case, the objective function has the same value as when all duplicate inducing inputs were removed, see Supplement for a proof. In other words, for all practical purposes, a model with duplicate inducing inputs reduces to a model with fewer, individually placed inducing inputs.\n\nTheoretically, these pathologies only occur at single points, such that no gradients towards or away from them could exist and they would never be encountered. In practice, however, these peaks are widened by a finite jitter that is added to Kuu to ensure it remains well conditioned enough to be invertible. This finite width provides the gradients that allow an optimiser to detect these configurations.\n\nAs VFE always improves with additional inducing inputs, these configurations must correspond to maxima of the optimisation surface and clumping of inducing inputs does not occur for VFE. For\n\n3Matthews [19] independently proved this result by considering the KL divergence between processes. 
Titsias [9] proved this result for the special case when the new inducing input is selected from the training data.\n\nFITC, configurations with clumped inducing inputs can and often do correspond to minima of the optimisation surface. By placing them on top of each other, FITC can avoid the penalty of adding an extra inducing input and can gain the bonus from the heteroscedastic noise. Clumping, thus, constitutes a mechanism that allows FITC to effectively remove inducing inputs at no cost.\n\nWe illustrate this behaviour in Fig. 3 for 15 randomly initialised inducing inputs. FITC places some of them exactly on top of each other, whereas VFE spreads them out and recovers the full GP well.\n\nFigure 3: Fits for 15 inducing inputs for FITC and VFE (initial as black crosses, optimised red crosses). Even following joint optimisation of inducing inputs and hyperparameters, FITC avoids the penalty of added inducing inputs by clumping some of them on top of each other (shown as a single red cross). VFE spreads out the inducing inputs to get closer to the true full GP posterior.\n\nRemark 4 In FITC, having a good approximation Qff to Kff needs to be traded off with the gains coming from the heteroscedastic noise. FITC does not always favour a more accurate approximation to the GP.\n\nRemark 5 FITC avoids losing the gains of the heteroscedastic noise by placing inducing inputs on top of each other, effectively removing them.\n\n3.3 FITC does not recover the full GP posterior, VFE does\n\nIn the previous section we showed that FITC may not utilise additional resources to model the data. The clumping behaviour, thus, explains why the FITC objective may not recover the full GP, even when given enough resources.\n\nBoth VFE and FITC can recover the true posterior by placing an inducing input on every training input [9, 12]. 
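The equivalence of clumped configurations can be checked numerically. Since both objectives depend on the inducing inputs only through Qff, it suffices to show that, with a jitter of size ε on Kuu, the Qff computed with a duplicated inducing input matches the one computed without the duplicate up to O(ε). A small sketch (our own illustrative code, assuming the paper's SE kernel):

```python
import numpy as np

def se_kernel(X1, X2, s_f=1.0, ell=1.0):
    # k(x, x') = s_f^2 exp(-|x - x'|^2 / (2 ell^2))
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return s_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def qff(X, Z, jitter=1e-8):
    # Low-rank approximation Qff = Kfu (Kuu + jitter * I)^{-1} Kuf
    Kuf = se_kernel(Z, X)
    Kuu = se_kernel(Z, Z) + jitter * np.eye(Z.shape[0])
    return Kuf.T @ np.linalg.solve(Kuu, Kuf)

X = np.linspace(-3, 3, 50)[:, None]
Z = np.array([[-2.0], [0.0], [2.0]])
Z_clumped = np.vstack([Z, Z[:1]])  # duplicate the first inducing input
# Without jitter the clumped Kuu is singular; with jitter the two
# configurations give the same Qff up to terms of order jitter.
gap = np.max(np.abs(qff(X, Z) - qff(X, Z_clumped)))
```

Here `gap` is tiny, illustrating why a model with duplicated inducing inputs reduces to one with fewer, individually placed inducing inputs, and why only the finite jitter provides gradients around these configurations.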
For VFE, this is a global minimum, since the KL gap to the true marginal likelihood is zero. For FITC, however, this configuration is not stable and the objective can still be improved by clumping of inducing inputs, as Matthews [19] has shown empirically by aggressive optimisation. The derivative of the inducing inputs is zero for the initial configuration, but adding jitter subtly makes this behaviour more obvious by perturbing the gradients, similar to the widening of the peaks in Fig. 2. In Fig. 4 we reproduce the observations in [19, Sec 4.6.1 and Fig. 4.2] on a subset of 100 data points of the Snelson dataset: VFE remains at the minimum and, thus, recovers the full GP, whereas FITC improves its objective and clumps the inducing inputs considerably.\n\nMethod | nlml initial | nlml optimised\nFull GP | − | 33.8923\nVFE | 33.8923 | 33.8923\nFITC | 33.8923 | 28.3869\n\nFigure 4: Results of optimising VFE and FITC after initialising at the solution that gives the correct posterior and marginal likelihood as in [19, Sec 4.6.1]: FITC moves to a significantly different solution with better objective value (Table, left) and clumped inducing inputs (Figure, right).\n\nRemark 6 FITC generally does not recover the full GP, even when it has enough resources.\n\n3.4 FITC relies on local optima\n\nSo far, we have observed some cases where FITC fails to produce results in line with the full GP, and characterised why. However, in practice, FITC has performed well, and pathological behaviour is not always observed. In this section we discuss the optimiser dynamics and show that they help FITC behave reasonably.\n\nTo demonstrate this behaviour, we consider a 4d toy dataset: 1024 training and 1024 test samples drawn from a 4d Gaussian Process with isotropic squared exponential covariance function (ℓ = 1.5, sf = 1) and true noise variance σn² = 0.01. 
The data inputs were drawn from a Gaussian centred around the origin, but similar results were obtained for uniformly sampled inputs. We fit both FITC and VFE to this dataset with the number of inducing inputs ranging from 16 to 1024, and compare a representative run to the full GP in Fig. 5.\n\nFigure 5: Optimisation behaviour of VFE and FITC for varying number of inducing inputs compared to the full GP. We show the objective function (negative log marginal likelihood), the optimised noise σn, the negative log predictive probability and standardised mean squared error as defined in [1].\n\nVFE monotonically approaches the values of the full GP but initially overestimates the noise variance, as discussed in Section 3.1. Conversely, we can identify three regimes for the objective function of FITC: 1) monotonic improvement for few inducing inputs, 2) a region where FITC over-estimates the marginal likelihood, and 3) recovery towards the full GP for many inducing inputs. Predictive performance follows a similar trend, first improving, then declining while the bound is estimated to be too high, followed by a recovery. The recovery is counter to the usual intuition that over-fitting worsens when adding more parameters.\n\nWe explain the behaviour in these three regimes as follows: When the number of inducing inputs is severely limited (regime 1), FITC needs to place them such that Kff is well approximated. This correlates most points to some degree, and ensures a reasonable data fit term. The marginal likelihood is under-estimated due to lack of flexibility in Qff. This behaviour is consistent with the intuition that limiting model capacity prevents overfitting.\n\nAs the number of inducing inputs increases (regime 2), the marginal likelihood is over-estimated and the noise drastically under-estimated. Additionally, performance in terms of log predictive probability deteriorates. 
This is the regime closest to FITC's behaviour in Fig. 1. There are enough inducing inputs such that they can be placed such that a bonus can be gained from the heteroscedastic noise, without gaining a complexity penalty from losing long scale correlations.\n\nFinally, in regime 3, FITC starts to behave more like a regular GP in terms of marginal likelihood, predictive performance and noise variance parameter σn. FITC's ability to use heteroscedastic noise is reduced as the approximate covariance matrix Qff is closer to the true covariance matrix Kff when many (initial) inducing inputs are spread over the input space.\n\nIn the previous section we showed that after adding a new inducing input, a better minimum obtained without the extra inducing input could be recovered by clumping. So it is clear that the minimum that was found with fewer active inducing inputs still exists in the optimisation surface of many inducing inputs; the optimiser just does not find it.\n\nRemark 7 When running FITC with many inducing inputs its resemblance to the full GP solution relies on local optima, rather than the objective function changing.\n\n3.5 VFE is hindered by local optima\n\nSo far we have seen that the VFE objective function is a true lower bound on the marginal likelihood and does not share the same pathologies as FITC. Thus, when optimising, we really are interested in finding a global optimum. The VFE objective function is not completely trivial to optimise, and often tricks, such as initialising the inducing inputs with k-means and initially fixing the hyperparameters [20, 21], are required to find a good optimum. Others have commented that VFE has the tendency to underfit [3]. 
Here we investigate the underfitting claim and relate it to optimisation behaviour. As this behaviour is not observable in our 1D dataset, we illustrate it on the pumadyn32nm dataset4 (32 dimensions, 7168 training, 1024 test), see Table 1 for the results of a representative run with random initial conditions and M = 40 inducing inputs.\n\nMethod | NLML/N | σn | inv. lengthscales | RMSE\nGP (SoD) | −0.099 | 0.196 | ··· | 0.209\nFITC | −0.145 | 0.004 | ··· | 0.212\nVFE | 1.419 | 1 | ··· | 0.979\nVFE (frozen) | 0.151 | 0.278 | ··· | 0.276\nVFE (init FITC) | −0.096 | 0.213 | ··· | 0.212\n\nTable 1: Results for pumadyn32nm dataset. We show negative log marginal likelihood (NLML) divided by number of training points, the optimised noise variance σn², the ten most dominant inverse lengthscales and the RMSE on test data. Methods are full GP on 2048 training samples, FITC, VFE, VFE with initially frozen hyperparameters, VFE initialised with the solution obtained by FITC.\n\nUsing a squared exponential ARD kernel with separate lengthscales for every dimension, a full GP on a subset of data identified four lengthscales as important to model the data while scaling the other 28 lengthscales to large values (in Table 1 we plot the inverse lengthscales).\n\nFITC was consistently able to identify the same four lengthscales and performed similarly compared to the full GP but scaled down the noise variance σn² to almost zero. The latter is consistent with our earlier observations of strong pinching in a regime with low-density data as is the case here due to the high dimensionality. VFE, on the other hand, was unable to identify these relevant lengthscales when jointly optimising the hyperparameters and inducing inputs, and only identified some of them when initially freezing the hyperparameters. 
One might say that VFE "underfits" in this case. However, we can show that VFE still recognises a good solution: when we initialised VFE with the FITC solution, it consistently obtained a good fit to the model with correctly identified lengthscales and a noise variance that was close to the full GP.\n\nRemark 8 VFE has a tendency to find under-fitting solutions. However, this is an optimisation issue. The bound correctly identifies good solutions.\n\n4 Conclusion\n\nIn this work, we have thoroughly investigated and characterised the differences between FITC and VFE, both in terms of their objective function and their behaviour observed during practical optimisation. We highlight several instances of undesirable behaviour in the FITC objective: over-estimation of the marginal likelihood, sometimes severe under-estimation of the noise variance parameter, wasting of modelling resources and not recovering the true posterior. The common practice of using the noise variance parameter as a diagnostic for good model fitting is unreliable.\n\nIn contrast, VFE is a true bound to the marginal likelihood of the full GP and behaves predictably: it correctly identifies good solutions, always improves with extra resources and recovers the true posterior when possible. In practice, however, the pathologies of the FITC objective do not always show up, thanks to "good" local optima and (unintentional) early stopping. While VFE's objective recognises a good configuration, it is often more susceptible to local optima and harder to optimise than FITC.\n\nWhich of these pathologies show up in practice depends on the dataset in question. However, based on the superior properties of the VFE objective function, we recommend using VFE, while paying attention to optimisation difficulties. 
These can be mitigated by careful initialisation, random restarts, other optimisation tricks, and comparison to the FITC solution to guide the VFE optimisation.

Acknowledgements
We would like to thank Alexander Matthews, Thang Bui, and Richard Turner for useful discussions.

4 Obtained from http://www.cs.toronto.edu/~delve/data/datasets.html

References
[1] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning series). The MIT Press, 2005.
[2] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen and A. R. Figueiras-Vidal. 'Sparse spectrum Gaussian process regression'. In: The Journal of Machine Learning Research 11 (2010).
[3] M. Lázaro-Gredilla and A. Figueiras-Vidal. 'Inter-domain Gaussian processes for sparse inference using inducing features'. In: Advances in Neural Information Processing Systems. 2009.
[4] A. Rahimi and B. Recht. 'Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning'. In: Advances in Neural Information Processing Systems. 2009.
[5] Z. Yang, A. J. Smola, L. Song and A. G. Wilson. 'A la Carte - Learning Fast Kernels'. In: Artificial Intelligence and Statistics. 2015. eprint: 1412.6493.
[6] M. Seeger, C. K. I. Williams and N. D. Lawrence. 'Fast Forward Selection to Speed Up Sparse Gaussian Process Regression'. In: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. 2003.
[7] E. Snelson and Z. Ghahramani. 'Sparse Gaussian Processes using Pseudo-inputs'. In: Neural Information Processing Systems. Vol. 18. 2006.
[8] J. Quiñonero-Candela and C. E. Rasmussen. 'A unifying view of sparse approximate Gaussian process regression'. In: The Journal of Machine Learning Research 6 (2005).
[9] M. K. Titsias.
\u2018Variational learning of inducing variables in sparse Gaussian processes\u2019. In: Proceedings\n\nof the Twelfth International Conference on Arti\ufb01cial Intelligence and Statistics. 2009.\n\n[10] A. J. Smola and P. Bartlett. \u2018Sparse greedy Gaussian process regression\u2019. In: Advances in Neural\n\nInformation Processing Systems 13. 2001.\n\n[11] L. Csat\u00b4o and M. Opper. \u2018Sparse on-line Gaussian processes\u2019. In: Neural computation 14.3 (2002).\n[12] E. Snelson. \u2018Flexible and ef\ufb01cient Gaussian process models for machine learning\u2019. PhD thesis. University\n\nCollege London, 2007.\n\n[13] Y. Qi, A. H. Abdel-Gawad and T. P. Minka. \u2018Sparse-posterior Gaussian Processes for general likelihoods\u2019.\n\nIn: Proceedings of the Twenty-Sixth Conference on Uncertainty in Arti\ufb01cial Intelligence. 2010.\n\n[14] T. D. Bui, J. Yan and R. E. Turner. \u2018A Unifying Framework for Sparse Gaussian Process Approximation\n\nusing Power Expectation Propagation\u2019. In: (2016). eprint: 1605.07066.\n\n[15] A. Matthews, J. Hensman, R. E. Turner and Z. Ghahramani. \u2018On Sparse variational methods and\nthe Kullback-Leibler divergence between stochastic processes\u2019. In: Proceedings of the Nineteenth\nInternational Conference on Arti\ufb01cial Intelligence and Statistics. 2016. eprint: 1504.07027.\n\n[16] C. E. Rasmussen and Z. Ghahramani. \u2018Occam\u2019s Razor\u2019. In: Advances in Neural Information Processing\n\nSystems 13. 2001.\n\n[17] M. K. Titsias. Variational Model Selection for Sparse Gaussian Process Regression. Tech. rep. University\n\nof Manchester, 2009.\n\n[18] R. E. Turner and M. Sahani. \u2018Two problems with variational expectation maximisation for time-series\n\nmodels\u2019. In: Bayesian Time series models. Cambridge University Press, 2011. Chap. 5.\n\n[19] A. Matthews. \u2018Scalable Gaussian process inference using variational methods\u2019. PhD thesis. University of\n\n[20]\n\n[21]\n\nCambridge, 2016.\nJ. 
Hensman, A. Matthews and Z. Ghahramani. \u2018Scalable Variational Gaussian Process Classi\ufb01cation\u2019. In:\nProceedings of the Eighteenth International Conference on Arti\ufb01cial Intelligence and Statistics. 2015.\neprint: 1411.2005.\nJ. Hensman, N. Fusi and N. D. Lawrence. \u2018Gaussian Processes for Big Data\u2019. In: Conference on\nUncertainty in Arti\ufb01cial Intelligence. 2013. eprint: 1309.6835.\n\n9\n\n\f", "award": [], "sourceid": 832, "authors": [{"given_name": "Matthias", "family_name": "Bauer", "institution": "University of Cambridge"}, {"given_name": "Mark", "family_name": "van der Wilk", "institution": "University of Cambridge"}, {"given_name": "Carl Edward", "family_name": "Rasmussen", "institution": "University of Cambridge"}]}