{"title": "Streaming Sparse Gaussian Process Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 3299, "page_last": 3307, "abstract": "Sparse pseudo-point approximations for Gaussian process (GP) models provide a suite of methods that support deployment of GPs in the large data regime and enable analytic intractabilities to be sidestepped. However, the field lacks a principled method to handle streaming data in which both the posterior distribution over function values and the hyperparameter estimates are updated in an online fashion. The small number of existing approaches either use suboptimal hand-crafted heuristics for hyperparameter learning, or suffer from catastrophic forgetting or slow updating when new data arrive. This paper develops a new principled framework for deploying Gaussian process probabilistic models in the streaming setting, providing methods for learning hyperparameters and optimising pseudo-input locations. The proposed framework is assessed using synthetic and real-world datasets.", "full_text": "Streaming Sparse Gaussian Process Approximations\n\nThang D. Bui\u2217\n\nCuong V. Nguyen\u2217\n\nRichard E. Turner\n\nDepartment of Engineering, University of Cambridge, UK\n\n{tdb40,vcn22,ret26}@cam.ac.uk\n\nAbstract\n\nSparse pseudo-point approximations for Gaussian process (GP) models provide a\nsuite of methods that support deployment of GPs in the large data regime and en-\nable analytic intractabilities to be sidestepped. However, the \ufb01eld lacks a principled\nmethod to handle streaming data in which both the posterior distribution over func-\ntion values and the hyperparameter estimates are updated in an online fashion. The\nsmall number of existing approaches either use suboptimal hand-crafted heuristics\nfor hyperparameter learning, or suffer from catastrophic forgetting or slow updating\nwhen new data arrive. 
This paper develops a new principled framework for deploying Gaussian process probabilistic models in the streaming setting, providing methods for learning hyperparameters and optimising pseudo-input locations. The proposed framework is assessed using synthetic and real-world datasets.

1 Introduction

Probabilistic models employing Gaussian processes have become a standard approach to solving many machine learning tasks, thanks largely to the modelling flexibility, robustness to overfitting, and well-calibrated uncertainty estimates afforded by the approach [1]. One of the pillars of the modern Gaussian process probabilistic modelling approach is a set of sparse approximation schemes that allow the prohibitive computational cost of GP methods, typically O(N^3) for training and O(N^2) for prediction where N is the number of training points, to be substantially reduced whilst still retaining accuracy. Arguably the most important and influential approximations of this sort are pseudo-point approximation schemes that employ a set of M ≪ N pseudo-points to summarise the observational data, thereby reducing computational costs to O(NM^2) and O(M^2) for training and prediction, respectively [2, 3]. Stochastic optimisation methods that employ mini-batches of training data can be used to further reduce computational costs [4, 5, 6, 7], allowing GPs to be scaled to datasets comprising millions of data points.

The focus of this paper is to provide a comprehensive framework for deploying the Gaussian process probabilistic modelling approach to streaming data, that is, data that arrive sequentially in an online fashion, possibly in small batches, and whose number is not known a priori (and indeed may be infinite). 
The vast majority of previous work has focussed exclusively on the batch setting and there\nis not a satisfactory framework that supports learning and approximation in the streaming setting.\nA na\u00efve approach might simply incorporate each new datum as they arrived into an ever-growing\ndataset and retrain the GP model from scratch each time. With in\ufb01nite computational resources, this\napproach is optimal, but in the majority of practical settings, it is intractable. A feasible alternative\nwould train on just the most recent K training data points, but this completely ignores potentially\nlarge amounts of informative training data and it does not provide a method for incorporating the\nold model into the new one which would save computation (except perhaps through initialisation of\nthe hyperparameters). Existing, sparse approximation schemes could be applied in the same manner,\nbut they merely allow K to be increased, rather than allowing all previous data to be leveraged, and\nagain do not utilise intermediate approximate \ufb01ts.\n\n\u2217These authors contributed equally to this work.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fWhat is needed is a method for performing learning and sparse approximation that incrementally\nupdates the previously \ufb01t model using the new data. Such an approach would utilise all the previous\ntraining data (as they will have been incorporated into the previously \ufb01t model) and leverage as much\nof the previous computation as possible at each stage (since the algorithm only requires access to the\ndata at the current time point). Existing stochastic sparse approximation methods could potentially\nbe used by collecting the streamed data into mini-batches. 
However, the assumptions underpinning these methods are ill-suited to the streaming setting and they perform poorly (see sections 2 and 4).

This paper provides a new principled framework for deploying Gaussian process probabilistic models in the streaming setting. The framework subsumes Csató and Opper's two seminal approaches to online regression [8, 9], which were based upon the variational free energy (VFE) and expectation propagation (EP) approaches to approximate inference respectively. In the new framework, these algorithms are recovered as special cases. We also provide principled methods for learning hyperparameters (learning was not treated in the original work and the extension is non-trivial) and optimising pseudo-input locations (previously handled via hand-crafted heuristics). The approach also relates to the streaming variational Bayes framework [10]. We review background material in the next section and detail the technical contribution in section 3, followed by several experiments on synthetic and real-world data in section 4.

2 Background

Regression models that employ Gaussian processes are state of the art for many datasets [11]. In this paper we focus on the simplest GP regression model as a test case of the streaming framework for inference and learning. Given N input and real-valued output pairs {x_n, y_n}_{n=1}^N, a standard GP regression model assumes y_n = f(x_n) + ε_n, where f is an unknown function that is corrupted by Gaussian observation noise ε_n ∼ N(0, σ_y^2). Typically, f is assumed to be drawn from a zero-mean GP prior f ∼ GP(0, k(·,·|θ)) whose covariance function depends on hyperparameters θ. 
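To make the computational costs discussed below concrete, exact inference in this model can be sketched in a few lines. The following is our own illustrative numpy implementation (not the paper's code), assuming an RBF kernel with fixed hyperparameters:

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2)).
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def exact_gp_regression(x, y, x_test, noise_var=0.1):
    # O(N^3) Cholesky factorisation of Kff + sigma_y^2 I dominates training cost.
    N = x.shape[0]
    Kff = rbf(x, x) + noise_var * np.eye(N)
    L = np.linalg.cholesky(Kff)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # Log marginal likelihood log N(y; 0, Kff + sigma_y^2 I), used to learn theta.
    lml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
           - 0.5 * N * np.log(2 * np.pi))
    # Predictive mean and marginal variance at test inputs; O(N^2) per test point.
    Ksf = rbf(x_test, x)
    mean = Ksf @ alpha
    v = np.linalg.solve(L, Ksf.T)
    var = np.diag(rbf(x_test, x_test)) - np.sum(v * v, axis=0) + noise_var
    return mean, var, lml
```

The Cholesky factorisation of the N × N matrix is exactly the bottleneck that the sparse pseudo-point schemes discussed next are designed to remove.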
In this simple model, the posterior over f, p(f|y, θ), and the marginal likelihood p(y|θ) can be computed analytically (here we have collected the observations into a vector y = {y_n}_{n=1}^N).² However, these quantities present a computational challenge, resulting in an O(N^3) complexity for maximum likelihood training and O(N^2) per test point for prediction.

This prohibitive complexity of exact learning and inference in GP models has driven the development of many sparse approximation frameworks [12, 13]. In this paper, we focus on the variational free energy approximation scheme [3, 14] which lower bounds the marginal likelihood of the data using a variational distribution q(f) over the latent function:

log p(y|θ) = log ∫ df p(y, f|θ) ≥ ∫ df q(f) log [p(y, f|θ) / q(f)] = F_vfe(q, θ).   (1)

Since F_vfe(q, θ) = log p(y|θ) − KL[q(f)||p(f|y, θ)], where KL[·||·] denotes the Kullback–Leibler divergence, maximising this lower bound with respect to q(f) guarantees the approximate posterior gets closer to the exact posterior p(f|y, θ). Moreover, the variational bound F_vfe(q, θ) approximates the marginal likelihood and can be used for learning the hyperparameters θ.

In order to arrive at a computationally tractable method, the approximate posterior is parameterized via a set of M pseudo-points u that are a subset of the function values f = {f≠u, u} and which will summarise the data. Specifically, the approximate posterior is assumed to be q(f) = p(f≠u|u, θ)q(u), where q(u) is a variational distribution over u and p(f≠u|u, θ) is the prior distribution of the remaining latent function values. 
This assumption allows the following critical cancellation of the p(f≠u|u, θ) factors, which results in a computationally tractable lower bound:

F_vfe(q(u), θ) = ∫ df q(f) log [ p(y|f, θ) p(f≠u|u, θ) p(u|θ) / (p(f≠u|u, θ) q(u)) ]
= −KL[q(u)||p(u|θ)] + Σ_n ∫ du q(u) p(f_n|u, θ) log p(y_n|f_n, θ),

where f_n = f(x_n) is the latent function value at x_n. For the simple GP regression model considered here, closed-form expressions for the optimal variational approximation q_vfe(f) and the optimal variational bound F_vfe(θ) = max_{q(u)} F_vfe(q(u), θ) (also called the 'collapsed' bound) are available:

p(f|y, θ) ≈ q_vfe(f) ∝ p(f≠u|u, θ) p(u|θ) N(y; K_fu K_uu^{-1} u, σ_y^2 I),
log p(y|θ) ≈ F_vfe(θ) = log N(y; 0, K_fu K_uu^{-1} K_uf + σ_y^2 I) − (1/2σ_y^2) Σ_n (k_nn − K_nu K_uu^{-1} K_un),

where f is the latent function values at training points, and K_f1f2 is the covariance matrix between the latent function values f1 and f2. Critically, the approach leads to O(NM^2) complexity for approximate maximum likelihood learning and O(M^2) per test point for prediction. In order for this method to perform well, it is necessary to adapt the pseudo-point input locations, e.g. by optimising the variational free energy, so that the pseudo-data distribute themselves over the training data. Alternatively, stochastic optimisation may be applied directly to the original, uncollapsed version of the bound [4, 15]. In particular, an unbiased estimate of the variational lower bound can be obtained using a small number of training points randomly drawn from the training set:

F_vfe(q(u), θ) ≈ −KL[q(u)||p(u|θ)] + (N/|B|) Σ_{y_n ∈ B} ∫ du q(u) p(f_n|u, θ) log p(y_n|f_n, θ).

Since the optimal approximation is Gaussian as shown above, q(u) is often posited as a Gaussian distribution and its parameters are updated by following the (noisy) gradients of the stochastic estimate of the variational lower bound. By passing through the training set a sufficient number of times, the variational distribution converges to the optimal solution above, given appropriately decaying learning rates [4].

In principle, the stochastic uncollapsed approach is applicable to the streaming setting as it refines an approximate posterior based on mini-batches of data that can be considered to arrive sequentially (here N would be the number of data points seen so far). However, it is unsuited to this task since stochastic optimisation assumes that the data subsampling process is uniformly random, that the training set is revisited multiple times, and it typically makes a single gradient update per mini-batch. These assumptions are incompatible with the streaming setting: continuously arriving data are not typically drawn iid from the input distribution (consider an evolving time-series, for example); the data can only be touched once by the algorithm and not revisited due to computational constraints; each mini-batch needs to be processed intensively as it will not be revisited (multiple gradient steps would normally be required, for example, and this runs the risk of forgetting old data without delicately tuning the learning rates). 

²The dependence on the inputs {x_n}_{n=1}^N of the posterior, marginal likelihood, and other quantities is suppressed throughout to lighten the notation.
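Before turning to the streaming solution, it is useful to have a concrete picture of the batch quantities involved. The collapsed bound F_vfe(θ) from this section can be evaluated at the stated O(NM^2) cost without ever forming an N × N matrix, by applying the matrix-inversion lemma. The sketch below is our own illustrative numpy implementation (RBF kernel and fixed hyperparameters assumed), not the paper's code:

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2)).
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def collapsed_vfe_bound(x, y, z, noise_var=0.1, lengthscale=1.0, k_var=1.0):
    # Collapsed variational bound
    #   Fvfe(theta) = log N(y; 0, Qff + noise_var I) - 1/(2 noise_var) tr(Kff - Qff),
    # with Qff = Kfu Kuu^{-1} Kuf, evaluated in O(N M^2) via the
    # matrix-inversion (Woodbury) lemma so no N x N matrix is ever formed.
    N, M = x.shape[0], z.shape[0]
    Kuu = rbf(z, z, lengthscale, k_var) + 1e-8 * np.eye(M)  # jitter for stability
    Kuf = rbf(z, x, lengthscale, k_var)
    Luu = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(Luu, Kuf)                 # V^T V = Qff
    A = noise_var * np.eye(M) + V @ V.T
    La = np.linalg.cholesky(A)
    b = np.linalg.solve(La, V @ y)
    # log|Qff + noise_var I| and the quadratic form, via Woodbury
    logdet = 2 * np.log(np.diag(La)).sum() + (N - M) * np.log(noise_var)
    quad = (y @ y - b @ b) / noise_var
    log_gauss = -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
    # trace term penalises function variance not captured by the pseudo-points
    trace = k_var * N - (V * V).sum()
    return log_gauss - 0.5 * trace / noise_var
```

With the pseudo-inputs placed at every training input, the bound coincides with the exact log marginal likelihood; with fewer pseudo-points it remains a lower bound, which is what makes it usable for hyperparameter learning.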
In the following sections, we shall discuss how to tackle these\nchallenges through a novel online inference and learning procedure, and demonstrate the ef\ufb01cacy of\nthis method over the uncollapsed approach and na\u00efve online versions of the collapsed approach.\n\n3 Streaming sparse GP (SSGP) approximation using variational inference\n\nThe general situation assumed in this paper is that data arrive sequentially so that at each step new\ndata points ynew are added to the old dataset yold. The goal is to approximate the marginal likelihood\nand the posterior of the latent process at each step, which can be used for anytime prediction. The\nhyperparameters will also be adjusted online. Importantly, we assume that we can only access the\ncurrent data points ynew directly for computational reasons (it might be too expensive to hold yold\nand x1:Nold in memory, for example, or approximations made at the previous step must be reused\nto reduce computational overhead). So the effect of the old data on the current posterior must be\npropagated through the previous posterior. We will now develop a new sparse variational free energy\napproximation for this purpose, that compactly summarises the old data via pseudo-points. The\npseudo-inputs will also be adjusted online since this is critical as new parts of the input space will be\nrevealed over time. 
The framework is easily extensible to more complex non-linear models.

3.1 Online variational free energy inference and learning

Consider an approximation to the true posterior at the previous step, q_old(f), which must be updated to form the new approximation q_new(f),

q_old(f) ≈ p(f|y_old) = (1/Z_1(θ_old)) p(f|θ_old) p(y_old|f),   (2)
q_new(f) ≈ p(f|y_old, y_new) = (1/Z_2(θ_new)) p(f|θ_new) p(y_old|f) p(y_new|f).   (3)

Whilst the updated exact posterior p(f|y_old, y_new) balances the contribution of old and new data through their likelihoods, the new approximation cannot access p(y_old|f) directly. Instead, we can find an approximation of p(y_old|f) by inverting eq. (2), that is, p(y_old|f) ≈ Z_1(θ_old) q_old(f)/p(f|θ_old). Substituting this into eq. (3) yields,

p̂(f|y_old, y_new) = (Z_1(θ_old)/Z_2(θ_new)) p(f|θ_new) p(y_new|f) q_old(f)/p(f|θ_old).   (4)

Although it is tempting to use this as the new posterior, q_new(f) = p̂(f|y_old, y_new), this recovers exact GP regression with fixed hyperparameters (see section 3.3) and it is intractable. So, instead, we consider a variational update that projects the distribution back to a tractable form using pseudo-data. At this stage we allow the pseudo-data input locations in the new approximation to differ from those in the old one. This is required if new regions of input space are gradually revealed, as for example in typical time-series applications. Let a = f(z_old) and b = f(z_new) be the function values at the pseudo-inputs before and after seeing new data. Note that the numbers of pseudo-points, M_a = |a| and M_b = |b|, are not necessarily restricted to be the same. The form of the approximate posterior mirrors that in the batch case: the previous approximate posterior is q_old(f) = p(f≠a|a, θ_old) q_old(a), where we assume q_old(a) = N(a; m_a, S_a). 
The new posterior approximation takes the same form, but with the new pseudo-points and new hyperparameters: q_new(f) = p(f≠b|b, θ_new) q_new(b). Similar to the batch case, this approximate inference problem can be turned into an optimisation problem using variational inference. Specifically, consider

KL[q_new(f)||p̂(f|y_old, y_new)]
= ∫ df q_new(f) log [ p(f≠b|b, θ_new) q_new(b) / ( (Z_1(θ_old)/Z_2(θ_new)) p(f|θ_new) p(y_new|f) q_old(f)/p(f|θ_old) ) ]
= log [Z_2(θ_new)/Z_1(θ_old)] + ∫ df q_new(f) log [ p(a|θ_old) q_new(b) / (p(b|θ_new) q_old(a) p(y_new|f)) ].   (5)

Since the KL divergence is non-negative, the second term in the expression above is the negative approximate lower bound of the online log marginal likelihood (as Z_2/Z_1 ≈ p(y_new|y_old)), or the variational free energy F(q_new(f), θ_new). By setting the derivative of F w.r.t. q(b) equal to 0, the optimal approximate posterior can be obtained for the regression case,³

q_vfe(b) ∝ p(b) exp( ∫ da p(a|b) log [q_old(a)/p(a|θ_old)] + ∫ df p(f|b) log p(y_new|f) )   (6)
∝ p(b) N(ŷ; K_f̂b K_bb^{-1} b, Σ_ŷ,vfe),   (7)

where f is the latent function values at the new training points, and

ŷ = [ y_new ; D_a S_a^{-1} m_a ],   Σ_ŷ,vfe = [ σ_y^2 I , 0 ; 0 , D_a ],   K_f̂b = [ K_fb ; K_ab ],   D_a = (S_a^{-1} − K′_aa^{-1})^{-1}.

The negative variational free energy is also analytically available,

F(θ) = log N(ŷ; 0, K_f̂b K_bb^{-1} K_bf̂ + Σ_ŷ,vfe) − (1/2σ_y^2) tr(K_ff − K_fb K_bb^{-1} K_bf) + Δ_a,   (8)

where 2Δ_a = −log|S_a| + log|K′_aa| + log|D_a| + m_a^⊤(S_a^{-1} D_a S_a^{-1} − S_a^{-1}) m_a − tr[D_a^{-1} Q_a] + const.

Equations (7) and (8) provide the complete recipe for online posterior update and hyperparameter learning in the streaming setting. The computational complexity and memory overhead of the new method are of the same order as those of the uncollapsed stochastic variational inference approach. The procedure is demonstrated on a toy regression example as shown in fig. 1[Left].

3.2 Online α-divergence inference and learning

One obvious extension of the online approach discussed above replaces the KL divergence in eq. (5) with a more general α-divergence [16]. This does not affect tractability: the optimal form of the approximate posterior can be obtained analytically for the regression case,

q_pep(b) ∝ p(b) N(ŷ; K_f̂b K_bb^{-1} b, Σ_ŷ,pep),  where
Σ_ŷ,pep = [ σ_y^2 I + α diag(K_ff − K_fb K_bb^{-1} K_bf) , 0 ; 0 , D_a + α(K_aa − K_ab K_bb^{-1} K_ba) ].   (9)

³Note that we have dropped θ_new from p(b|θ_new), p(a|b, θ_new) and p(f|b, θ_new) to lighten the notation.

Figure 1: [Left] SSGP inference and learning on a toy time-series using the VFE approach. The black crosses are data points (past points are greyed out), the red circles are pseudo-points, and blue lines and shaded areas are the marginal predictive means and confidence intervals at test points. [Right] Log-likelihood of test data as training data arrive for different α values, for the pseudo periodic dataset (see section 4.2). We observed that α = 0.01 is virtually identical to VFE. Dark lines are means over 4 splits and shaded lines are results for each split. Best viewed in colour.

This reduces back to the variational case as α → 0 (compare to eq. (7)) since then the α-divergence is equivalent to the KL divergence. 
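As a concrete illustration of the variational (α → 0) case, the regression update of eq. (7) can be sketched in numpy as follows. This is our own schematic implementation, not the paper's code: hyperparameters (and hence K′_aa) are held fixed, an RBF kernel is assumed, and the pseudo-input and hyperparameter optimisation performed via eq. (8) in the paper is omitted:

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel; hyperparameters are held fixed in this sketch.
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def ssgp_update(m_a, S_a, z_old, x_new, y_new, z_new, noise_var=0.1):
    # One streaming VFE update in the spirit of eq. (7): the old posterior
    # q_old(a) = N(m_a, S_a) at pseudo-inputs z_old enters as pseudo-observations
    # alongside the new batch (x_new, y_new); returns q_new(b) = N(m_b, S_b)
    # at the new pseudo-inputs z_new.
    Ma, N = len(z_old), len(x_new)
    jitter = 1e-8
    Kaa = rbf(z_old, z_old) + jitter * np.eye(Ma)          # K'_aa (old prior)
    Kbb = rbf(z_new, z_new) + jitter * np.eye(len(z_new))
    Kab = rbf(z_old, z_new)
    Kfb = rbf(x_new, z_new)
    # D_a = (S_a^{-1} - K'_aa^{-1})^{-1} packages q_old(a) as pseudo-observations
    Sa_inv = np.linalg.inv(S_a)
    Da = np.linalg.inv(Sa_inv - np.linalg.inv(Kaa))
    y_hat = np.concatenate([y_new, Da @ (Sa_inv @ m_a)])   # \hat{y}
    K_hat = np.vstack([Kfb, Kab])                          # K_{\hat{f} b}
    Sigma = np.block([[noise_var * np.eye(N), np.zeros((N, Ma))],
                      [np.zeros((Ma, N)), Da]])            # block-diagonal noise
    # Conjugate Gaussian update for b under N(y_hat; K_hat Kbb^{-1} b, Sigma)
    Kbb_inv = np.linalg.inv(Kbb)
    P = Kbb_inv @ K_hat.T
    Sigma_inv = np.linalg.inv(Sigma)
    S_b = np.linalg.inv(Kbb_inv + P @ Sigma_inv @ P.T)
    m_b = S_b @ (P @ (Sigma_inv @ y_hat))
    return m_b, S_b
```

When pseudo-points are placed at all observed input locations, repeating this update reproduces exact GP regression with fixed hyperparameters, in line with the special case noted in section 3.3.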
The approximate online log marginal likelihood is also analytically\ntractable and recovers the variational case when \u03b1 \u2192 0. Full details are provided in the appendix.\n3.3 Connections to previous work and special cases\nThis section brie\ufb02y highlights connections between the new framework and existing approaches\nincluding Power Expectation Propagation (Power-EP), Expectation Propagation (EP), Assumed\nDensity Filtering (ADF), and streaming variational Bayes.\nRecent work has uni\ufb01ed a range of batch sparse GP approximations as special cases of the Power-EP\nalgorithm [13]. The online \u03b1-divergence approach to inference and learning described in the last\nsection is equivalent to running a forward \ufb01ltering pass of Power-EP. In other words, the current work\ngeneralizes the unifying framework to the streaming setting.\nWhen the hyperparameters and the pseudo-inputs are \ufb01xed, \u03b1-divergence inference for sparse GP\nregression recovers the batch solutions provided by Power-EP. In other words, only a single pass\nthrough the data is necessary for Power-EP to converge in sparse GP regression. For the case \u03b1 = 1,\nwhich is called Expectation Propagation, we recover the seminal work by Csat\u00f3 and Opper [8].\nFor the variational free energy case (equivalently where \u03b1 \u2192 0) we recover the seminal work by\nCsat\u00f3 [9]. The new framework can be seen to extend these methods to allow principled learning and\npseudo-input optimisation. 
Interestingly, in the setting where hyperparameters and the pseudo-inputs\nare \ufb01xed, if pseudo-points are added at each stage at the new data input locations, the method returns\nthe true posterior and marginal likelihood (see appendix).\nFor \ufb01xed hyperparameters and pseudo-points, the new VFE framework is equivalent to the application\nof streaming variational Bayes (VB) or online variational inference [10, 17, 18] to the GP setting in\nwhich the previous posterior plays a role of an effective prior for the new data. Similarly, the equivalent\nalgorithm when \u03b1 = 1 is called Assumed Density Filtering [19]. When the hyperparameters are\nupdated, the new method proposed here is different from streaming VB and standard application of\nADF, as the new method propagates approximations to just the old likelihood terms and not the prior.\nImportantly, we found vanilla application of the streaming VB framework performed catastrophically\nfor hyperparameter learning, so the modi\ufb01cation is critical.\n\n4 Experiments\n\nIn this section, the SSGP method is evaluated in terms of speed, memory usage, and accuracy (log-\nlikelihood and error). The method was implemented on GP\ufb02ow [20] and compared against GP\ufb02ow\u2019s\nversion of the following baselines: exact GP (GP), sparse GP using the collapsed bound (SGP), and\nstochastic variational inference using the uncollapsed bound (SVI). In all the experiments, the RBF\nkernel with ARD lengthscales is used, but this is not a limitation required by the new methods. An im-\nplementation of the proposed method can be found at http://github.com/thangbui/streaming_sparse_gp.\nFull experimental results and additional discussion points are included in the appendix.\n4.1 Synthetic data\nComparing \u03b1-divergences. 
We first consider the general online α-divergence inference and learning framework and compare the performance of different α values on a toy online regression dataset in fig. 1[Right]. Whilst the variational approach performs well, adapting pseudo-inputs to cover new regions of input space as they are revealed, algorithms using higher α values perform more poorly. Interestingly, this appears to be related to the tendency for EP, in batch settings, to clump pseudo-inputs on top of one another [21]. Here the effect is much more extreme as the clumps accumulate over time, leading to a shortage of pseudo-points if the input range of the data increases. Although heuristics could be introduced to break up the clumps, this result suggests that using small α values for online inference and learning might be more appropriate (this recommendation differs from the batch setting, where intermediate settings of α around 0.5 are best [13]). Due to these findings, for the rest of the paper, we focus on the variational case.

Hyperparameter learning. We generated multiple time-series from GPs with known hyperparameters and observation noises, and tracked the hyperparameters learnt by the proposed online variational free energy method and by exact GP regression. Overall, SSGP can track and learn good hyperparameters, and if there are sufficient pseudo-points, it performs comparably to a full GP on the entire dataset. 
Interestingly, all models including full GP regression tend to learn bigger noise\nvariances as any discrepancy in the true and learned function values is absorbed into this parameter.\n4.2 Speed versus accuracy\nIn this experiment, we compare SSGP to the baselines (GP, SGP, and SVI) in terms of a speed-\naccuracy trade-off where the mean marginal log-likelihood (MLL) and the root mean squared error\n(RMSE) are plotted against the accumulated running time of each method after each iteration. The\ncomparison is performed on two time-series datasets and a spatial dataset.\nTime-series data. We \ufb01rst consider modelling a segment of the pseudo periodic synthetic dataset\n[22], previously used for testing indexing schemes in time-series databases. The segment contains\n24,000 time-steps. Training and testing sets are chosen interleaved so that their sizes are both 12,000.\nThe second dataset is an audio signal prediction dataset, produced from the TIMIT database [23] and\npreviously used to evaluate GP approximations [24]. The signal was shifted down to the baseband\nand a segment of length 18,000 was used to produce interleaved training and testing sets containing\n9,000 time steps. For both datasets, we linearly scale the input time steps to the range [0, 10].\nAll algorithms are assessed in the mini-batch streaming setting with data ynew arriving in batches\nof size 300 and 500 taken in order from the time-series. The \ufb01rst 1,000 examples are used as an\ninitial training set to obtain a reasonable starting model for each algorithm. In this experiment, we\nuse memory-limited versions of GP and SGP that store the last 3,000 examples. This number was\nchosen so that the running times of these algorithms match those of SSGP or are slightly higher. 
For\nall sparse methods (SSGP, SGP, and SVI), we run the experiments with 100 and 200 pseudo-points.\nFor SVI, we allow the algorithm to make 100 stochastic gradient updates during each iteration and\nrun preliminary experiments to compare 3 learning rates r = 0.001, 0.01, and 0.1. The preliminary\nresults showed that the performance of SVI was not signi\ufb01cantly altered and so we only present the\nresults for r = 0.1.\nFigure 2 shows the plots of the accumulated running time (total training and testing time up until the\ncurrent iteration) against the MLL and RMSE for the considered algorithms. It is clear that SSGP\nsigni\ufb01cantly outperforms the other methods both in terms of the MLL and RMSE, once suf\ufb01cient\ntraining data have arrived. The performance of SSGP improves when the number of pseudo-points\nincreases, but the algorithm runs more slowly. In contrast, the performance of GP and SGP, even after\nseeing more data or using more pseudo-points, does not increase signi\ufb01cantly since they can only\nmodel a limited amount of data (the last 3,000 examples).\nSpatial data. The second set of experiments consider the OS Terrain 50 dataset that contains spot\nheights of landscapes in Great Britain computed on a grid.4 A block of 200 \u00d7 200 points was split\ninto 10,000 training examples and 30,000 interleaved testing examples. Mini-batches of data of size\n750 and 1,000 arrive in spatial order. The \ufb01rst 1,000 examples were used as an initial training set.\nFor this dataset, we allow GP and SGP to remember the last 7,500 examples and use 400 and 600\npseudo-points for the sparse models. Figure 3 shows the results for this dataset. 
SSGP performs better than the other baselines in terms of the RMSE, although it is worse than GP and SGP in terms of the MLL.

4The dataset is available at: https://data.gov.uk/dataset/os-terrain-50-dtm.

Figure 2: Results for time-series datasets (pseudo periodic and audio data) with batch sizes 300 and 500. Pluses and circles indicate the results for M = 100, 200 pseudo-points respectively. For each algorithm (except for GP), the solid and dashed lines are the efficient frontier curves for M = 100, 200 respectively.

4.3 Memory usage versus accuracy

Besides running time, memory usage is another important factor that should be considered. In this experiment, we compare the memory usage of SSGP against GP and SGP on the Terrain dataset above with batch size 750 and M = 600 pseudo-points. We allow GP and SGP to use the last 2,000 and 6,000 examples for training, respectively. These numbers were chosen so that the memory usage of the two baselines roughly matches that of SSGP. Figure 4 plots the maximum memory usage of the three methods against the MLL and RMSE. From the figure, SSGP requires a small memory footprint while achieving comparable or better MLL and RMSE than GP and SGP.

4.4 Binary classification

We show a preliminary result for GP models with non-Gaussian likelihoods, in particular, a binary classification model on the benchmark banana dataset. As the optimal form for the approximate posterior is not analytically tractable, the uncollapsed variational free energy is optimised numerically. The predictions made by SSGP in a non-iid streaming setting are shown in fig. 5. 
SSGP performs well and achieves the performance of the batch sparse variational method [5].

Figure 3: Results for spatial data (terrain data, batch sizes 750 and 1000; see fig. 2 for the legend). Pluses/solid lines and circles/dashed lines indicate the results for M = 400, 600 pseudo-points respectively.

Figure 4: Memory usage of SSGP (blue), GP (magenta) and SGP (red) against MLL and RMSE.

Figure 5: SSGP inference and learning on a binary classification task in a non-iid streaming setting. The right-most plot shows the prediction made by using sparse variational inference on the full training data [5] for comparison. Past observations are greyed out. The pseudo-points are shown as black dots and the black curves show the decision boundary.

5 Summary

We have introduced a novel online inference and learning framework for Gaussian process models. The framework unifies disparate methods in the literature and greatly extends them, allowing sequential updates of the approximate posterior and online hyperparameter optimisation in a principled manner. 
The proposed approach outperforms existing approaches on a wide range of regression datasets and shows promising results on a binary classification dataset. A more thorough investigation of models with non-Gaussian likelihoods is left as future work. We believe that this framework will be particularly useful for efficient deployment of GPs in sequential decision making problems such as active learning, Bayesian optimisation, and reinforcement learning.

Acknowledgements

The authors would like to thank Mark Rowland, John Bradshaw, and Yingzhen Li for insightful comments and discussion. Thang D. Bui is supported by the Google European Doctoral Fellowship. Cuong V. Nguyen is supported by EPSRC grant EP/M0269571. Richard E. Turner is supported by Google as well as EPSRC grants EP/M0269571 and EP/L000776/1.

References

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
[2] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing Systems (NIPS), 2006.
[3] M. K. Titsias, “Variational learning of inducing variables in sparse Gaussian processes,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
[4] J. Hensman, N. Fusi, and N. D.
Lawrence, "Gaussian processes for big data," in Conference on Uncertainty in Artificial Intelligence (UAI), 2013.

[5] J. Hensman, A. G. D. G. Matthews, and Z. Ghahramani, "Scalable variational Gaussian process classification," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

[6] A. Dezfouli and E. V. Bonilla, "Scalable inference for Gaussian process models with black-box likelihoods," in Advances in Neural Information Processing Systems (NIPS), 2015.

[7] D. Hernández-Lobato and J. M. Hernández-Lobato, "Scalable Gaussian process classification via expectation propagation," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[8] L. Csató and M. Opper, "Sparse online Gaussian processes," Neural Computation, 2002.

[9] L. Csató, Gaussian Processes – Iterative Sparse Approximations. PhD thesis, Aston University, 2002.

[10] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan, "Streaming variational Bayes," in Advances in Neural Information Processing Systems (NIPS), 2013.

[11] T. D. Bui, D. Hernández-Lobato, J. M. Hernández-Lobato, Y. Li, and R. E. Turner, "Deep Gaussian processes for regression using approximate expectation propagation," in International Conference on Machine Learning (ICML), 2016.

[12] J. Quiñonero-Candela and C. E. Rasmussen, "A unifying view of sparse approximate Gaussian process regression," The Journal of Machine Learning Research, 2005.

[13] T. D. Bui, J. Yan, and R. E. Turner, "A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation," Journal of Machine Learning Research, 2017.

[14] A. G. D. G. Matthews, J. Hensman, R. E. Turner, and Z.
Ghahramani, "On sparse variational methods and the Kullback-Leibler divergence between stochastic processes," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[15] C.-A. Cheng and B. Boots, "Incremental variational sparse Gaussian process regression," in Advances in Neural Information Processing Systems (NIPS), 2016.

[16] T. Minka, "Power EP," tech. rep., Microsoft Research, Cambridge, 2004.

[17] Z. Ghahramani and H. Attias, "Online variational Bayesian learning," in NIPS Workshop on Online Learning, 2000.

[18] M.-A. Sato, "Online model selection based on the variational Bayes," Neural Computation, 2001.

[19] M. Opper, "A Bayesian approach to online learning," in On-Line Learning in Neural Networks, 1999.

[20] A. G. D. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, "GPflow: A Gaussian process library using TensorFlow," Journal of Machine Learning Research, 2017.

[21] M. Bauer, M. van der Wilk, and C. E. Rasmussen, "Understanding probabilistic sparse Gaussian process approximations," in Advances in Neural Information Processing Systems (NIPS), 2016.

[22] E. J. Keogh and M. J. Pazzani, "An indexing scheme for fast similarity search in large time series databases," in International Conference on Scientific and Statistical Database Management, 1999.

[23] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus LDC93S1," Philadelphia: Linguistic Data Consortium, 1993.

[24] T. D. Bui and R. E.
Turner, "Tree-structured Gaussian process approximations," in Advances in Neural Information Processing Systems (NIPS), 2014.