{"title": "Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3257, "page_last": 3265, "abstract": "Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.", "full_text": "Distributed Variational Inference in Sparse Gaussian\n\nProcess Regression and Latent Variable Models\n\nYarin Gal\u2217\n\nMark van der Wilk\u2217\n\nCarl E. Rasmussen\n\nUniversity of Cambridge\n\n{yg279,mv310,cer54}@cam.ac.uk\n\nAbstract\n\nGaussian processes (GPs) are a powerful tool for probabilistic inference over func-\ntions. 
They have been applied to both regression and non-linear dimensionality\nreduction, and offer desirable properties such as uncertainty estimates, robustness\nto over-\ufb01tting, and principled ways for tuning hyper-parameters. However the\nscalability of these models to big datasets remains an active topic of research.\nWe introduce a novel re-parametrisation of variational inference for sparse GP\nregression and latent variable models that allows for an ef\ufb01cient distributed algo-\nrithm. This is done by exploiting the decoupling of the data given the inducing\npoints to re-formulate the evidence lower bound in a Map-Reduce setting.\nWe show that the inference scales well with data and computational resources,\nwhile preserving a balanced distribution of the load among the nodes. We further\ndemonstrate the utility in scaling Gaussian processes to big data. We show that\nGP performance improves with increasing amounts of data in regression (on \ufb02ight\ndata with 2 million records) and latent variable modelling (on MNIST). The results\nshow that GPs perform better than many common models often used for big data.\n\n1\n\nIntroduction\n\nGaussian processes have been shown to be \ufb02exible models that are able to capture complicated\nstructure, without succumbing to over-\ufb01tting. Sparse Gaussian process (GP) regression [Titsias,\n2009] and the Bayesian Gaussian process latent variable model (GPLVM, Titsias and Lawrence\n[2010]) have been applied in many tasks, such as regression, density estimation, data imputation,\nand dimensionality reduction. However, the use of these models with big datasets has been limited\nby the scalability of the inference. 
For example, the use of the GPLVM with big datasets such\nas the ones used in continuous-space natural language disambiguation is quite cumbersome and\nchallenging, and thus the model has largely been ignored in such communities.\nIt is desirable to scale the models up to be able to handle large amounts of data. One approach\nis to spread computation across many nodes in a distributed implementation. Brockwell [2006];\nWilkinson [2005]; Asuncion et al. [2008], among others, have reasoned about the requirements such\ndistributed algorithms should satisfy. The inference procedure should:\n\n1. distribute the computational load evenly across nodes,\n2. scale favourably with the number of nodes,\n3. and have low overhead in the global steps.\n\nIn this paper we scale sparse GP regression and latent variable modelling, presenting the \ufb01rst dis-\ntributed inference algorithm for the models able to process datasets with millions of points. We\nderive a re-parametrisation of the variational inference proposed by Titsias [2009] and Titsias and\nLawrence [2010], unifying the two, which allows us to perform inference using the original guaran-\ntees. This is achieved through the fact that conditioned on the inducing inputs, the data decouples and\nthe variational parameters can be updated independently on different nodes, with the only communi-\n\n\u2217Authors contributed equally to this work.\n\n1\n\n\fcation between nodes requiring constant time. This also allows the optimisation of the embeddings\nin the GPLVM to be done in parallel.\nWe experimentally study the properties of the suggested inference showing that the inference scales\nwell with data and computational resources, and showing that the inference running time scales\ninversely with computational power. 
We further demonstrate the practicality of the inference, in-\nspecting load distribution over the nodes and comparing run-times to sequential implementations.\nWe demonstrate the utility in scaling Gaussian processes to big data showing that GP performance\nimproves with increasing amounts of data. We run regression experiments on 2008 US \ufb02ight data\nwith 2 million records and perform classi\ufb01cation tests on MNIST using the latent variable model.\nWe show that GPs perform better than many common models which are often used for big data.\nThe proposed inference was implemented in Python using the Map-Reduce framework [Dean and\nGhemawat, 2008] to work on multi-core architectures, and is available as an open-source package1.\nThe full derivation of the inference is given in the supplementary material as well as additional ex-\nperimental results (such as robustness tests to node failure by dropping out nodes at random). The\nopen source software package contains an extensively documented implementation of the deriva-\ntions, with references to the equations presented in the supplementary material for explanation.\n\n2 Related Work\n\nRecent research carried out by Hensman et al. [2013] proposed stochastic variational inference (SVI,\nHoffman et al. [2013]) to scale up sparse Gaussian process regression. Their method trained a Gaus-\nsian process using mini-batches, which allowed them to successfully learn from a dataset containing\n700,000 points. Hensman et al. [2013] also note the applicability of SVI to GPLVMs and suggest\nthat SVI for GP regression can be carried out in parallel. However SVI also has some undesirable\nproperties. The variational marginal likelihood bound is less tight than the one proposed in Titsias\n[2009]. This is a consequence of representing the variational distribution over the inducing targets\nq(u) explicitly, instead of analytically deriving and marginalising the optimal form. 
Additionally\nSVI needs to explicitly optimise over q(u), which is not necessary when using the analytic optimal\nform. The noisy gradients produced by SVI also complicate optimisation; the inducing inputs need\nto be \ufb01xed in advance because of their strong correlation with the inducing targets, and additional\noptimiser-speci\ufb01c parameters, such as step-length, have to be introduced and \ufb01ne-tuned by hand.\nHeuristics do exist, but these points can make SVI rather hard to work with.\nOur approach results in the same lower bound as presented in Titsias [2009], which averts the dif\ufb01-\nculties with the approach above, and enables us to scale GPLVMs as well.\n\n3 The Gaussian Process Latent Variable Model and Sparse GP Regression\n\nWe now brie\ufb02y review the sparse Gaussian process regression model [Titsias, 2009] and the Gaus-\nsian process latent variable model (GPLVM) [Lawrence, 2005; Titsias and Lawrence, 2010], in terms\nof model structure and inference.\n3.1 Sparse Gaussian Process Regression\nWe consider the standard Gaussian process regression setting, where we aim to predict the output of\nsome unknown function at new input locations, given a training set of n inputs {X1, . . . , Xn} and\ncorresponding observations {Y1, . . . , Yn}. The observations consist of the latent function values\n{F1, . . . , Fn} corrupted by some i.i.d. Gaussian noise with precision \u03b2. This gives the following\ngenerative model2:\n\nF (Xi) \u223c GP(0, k(X, X)),\n\nYi \u223c N (Fi, \u03b2\u22121I)\n\nFor convenience, we collect the data in a matrix and denote single data points by subscripts.\n\nX \u2208 Rn\u00d7q,\n\nF \u2208 Rn\u00d7d,\n\nY \u2208 Rn\u00d7d\n\n1see http://github.com/markvdw/GParML\n2We follow the de\ufb01nition of matrix normal distribution [Arnold, 1981]. 
For a full treatment of Gaussian Processes, see Rasmussen and Williams [2006].\n\nWe can marginalise out the latent F analytically in order to find the predictive distribution and\nmarginal likelihood. However, this consists of an inversion of an n \u00d7 n matrix, thus requiring O(n^3)\ntime complexity, which is prohibitive for large datasets.\nTo address this problem, many approximations have been developed which aim to summarise the\nbehaviour of the regression function using a sparse set of m input-output pairs, instead of the entire\ndataset3. These input-output pairs are termed \u201cinducing points\u201d and are taken to be sufficient\nstatistics for any predictions. Given the inducing inputs Z \u2208 Rm\u00d7q and targets u \u2208 Rm\u00d7d, predictions\ncan be made in O(m^3) time complexity:\n\np(F\u2217|X\u2217, Y ) \u2248 \u222b N(F\u2217; k\u2217m Kmm\u22121 u, k\u2217\u2217 \u2212 k\u2217m Kmm\u22121 km\u2217) p(u|Y, X) du    (3.1)\n\nwhere Kmm is the covariance between the m inducing inputs, and likewise for the other subscripts.\nLearning the function corresponds to inferring the posterior distribution over the inducing targets u.\nPredictions are then made by marginalising u out of equation 3.1. Efficiently learning the posterior\nover u requires an additional assumption to be made about the relationship between the training\ndata and the inducing points, such as a deterministic link using only the conditional GP mean\nF = Knm Kmm\u22121 u. This results in an overall computational complexity of O(nm^2).\nQui\u00f1onero-Candela and Rasmussen [2005] view this procedure as changing the prior to make\ninference more tractable, with Z as hyperparameters which can be tuned using optimisation. However,\nmodifying the prior in response to training data has led to over-fitting. An alternative sparse\napproximation was introduced by Titsias [2009]. 
Here a variational distribution over u is introduced, with\nZ as variational parameters which tighten the corresponding evidence lower bound. This greatly\nreduces over-fitting, while retaining the improved computational complexity. It is this\napproximation which we further develop in this paper to give a distributed inference algorithm. A detailed\nderivation is given in section 3 of the supplementary material.\n3.2 Gaussian Process Latent Variable Models\nThe Gaussian process latent variable model (GPLVM) can be seen as an unsupervised version of the\nregression problem above. We aim to infer both the inputs, which are now latent, and the function\nmapping at the same time. This can be viewed as a non-linear generalisation of PCA [Lawrence,\n2005]. The model set-up is identical to the regression case, only with a prior over the latents X.\n\nXi \u223c N (Xi; 0, I),\n\nF (Xi) \u223c GP(0, k(X, X)),\n\nYi \u223c N (Fi, \u03b2\u22121I)\n\nA Variational Bayes approximation for this model has been developed by Titsias and Lawrence\n[2010] using similar techniques as for variational sparse GPs. In fact, the sparse GP can be seen as\na special case of the GPLVM where the inputs are given zero variance. The main task in deriving\napproximate inference revolves around finding a variational lower bound to:\n\np(Y ) = \u222b p(Y |F ) p(F |X) p(X) d(F, X),\n\nwhich leads to a Gaussian approximation to the posterior q(X) \u2248 p(X|Y ), explained in detail in\nsection 4 of the supplementary material. In the next section we derive a distributed inference scheme\nfor both models following a re-parametrisation of the derivations of Titsias [2009].\n\n4 Distributed Inference\n\nWe now exploit the conditional independence of the data given the inducing points to derive a\ndistributed inference scheme for both the sparse GP model and the GPLVM, which will allow us to\neasily scale these models to large datasets. 
The key equations are given below, with an in-depth\nexplanation given in sections 3 and 4 of the supplementary material. We present a unifying\nderivation of the inference procedures for both the regression case and the latent variable modelling\n(LVM) case, by identifying that the explicit inputs in the regression case are identical to the latent\ninputs in the LVM case when their mean is set to the observed inputs and used with variance 0 (i.e.\nthe latent inputs are fixed and not optimised).\nWe start with the general expression for the log marginal likelihood of the sparse GP regression\nmodel, after introducing the inducing points,\n\nlog p(Y |X) = log \u222b p(Y |F ) p(F |X, u) p(u) d(u, F ).\n\nThe LVM derivation encapsulates this expression by multiplying with the prior over X and then\nmarginalising over X:\n\nlog p(Y ) = log \u222b p(Y |F ) p(F |X, u) p(u) p(X) d(u, F, X).\n\nWe then introduce a free-form variational distribution q(u) over the inducing points, and another\nover X (where in the regression case, p(X)\u2019s and q(X)\u2019s variance is set to 0 and their mean set to\nX). Using Jensen\u2019s inequality we get the following lower bound:\n\nlog p(Y |X) \u2265 \u222b p(F |X, u) q(u) log ( p(Y |F ) p(u) / q(u) ) d(u, F ) = \u222b q(u) ( \u222b p(F |X, u) log p(Y |F ) d(F ) + log p(u)/q(u) ) d(u)    (4.1)\n\nwhere all distributions that involve u also depend on Z, which we have omitted for brevity. Next we\nintegrate p(Y ) over X to be able to use 4.1,\n\nlog p(Y ) = log \u222b q(X) ( p(Y |X) p(X) / q(X) ) d(X) \u2265 \u222b q(X) ( log p(Y |X) + log p(X)/q(X) ) d(X)    (4.2)\n\nand obtain a bound which can be used for both models. Up to here the derivation is identical to the\ntwo derivations given in [Titsias and Lawrence, 2010; Titsias, 2009]. \n\n3See Qui\u00f1onero-Candela and Rasmussen [2005] for a comprehensive review.
However, now we exploit the\nconditional independence given u to break the inference into small independent components.\n4.1 Decoupling the Data Conditioned on the Inducing Points\nThe introduction of the inducing points decouples the function values from each other in the\nfollowing sense. If we represent Y as the individual data points (Y1; Y2; ...; Yn) with Yi \u2208 R1\u00d7d and\nsimilarly for F , we can write the lower bound as a sum over the data points, since Yi are independent\nof Fj for j \u2260 i:\n\n\u222b p(F |X, u) log p(Y |F ) d(F ) = \u222b p(F |X, u) \u2211_{i=1}^n log p(Yi|Fi) d(F ) = \u2211_{i=1}^n \u222b p(Fi|Xi, u) log p(Yi|Fi) d(Fi)\n\nSimplifying this expression and integrating over X we get that each term is given by\n\n\u2212 (d/2) log(2\u03c0\u03b2\u22121) \u2212 (\u03b2/2) ( YiYi^T \u2212 2 \u27e8Fi\u27e9p(Fi|Xi,u)q(Xi) Yi^T + \u27e8FiFi^T\u27e9p(Fi|Xi,u)q(Xi) )\n\nwhere we use triangular brackets \u27e8F\u27e9p(F) to denote the expectation of F with respect to the\ndistribution p(F ).\nNow, using calculus of variations we can find the optimal q(u) analytically. Plugging the optimal\ndistribution into eq. 4.1 and using further algebraic manipulations we obtain the following lower bound:\n\nlog p(Y ) \u2265 \u2212 (nd/2) log 2\u03c0 + (nd/2) log \u03b2 + (d/2) log |Kmm| \u2212 (d/2) log |Kmm + \u03b2D|\n\u2212 (\u03b2/2) A \u2212 (\u03b2d/2) B + (\u03b2d/2) Tr(Kmm\u22121 D) + (\u03b2^2/2) Tr(C^T (Kmm + \u03b2D)\u22121 C) \u2212 KL    (4.3)\n\nwhere\n\nA = \u2211_{i=1}^n YiYi^T , B = \u2211_{i=1}^n \u27e8Kii\u27e9q(Xi) , C = \u2211_{i=1}^n \u27e8Kmi\u27e9q(Xi) Yi , D = \u2211_{i=1}^n \u27e8KmiKim\u27e9q(Xi)\n\nand\n\nKL = \u2211_{i=1}^n KL(q(Xi)||p(Xi))\n\nwhen the inputs are latent, or set to 0 when they are observed.\nNotice that the obtained unifying bound is identical to the ones derived in [Titsias, 2009] for the\nregression case and [Titsias and Lawrence, 2010] for the LVM case since \u27e8Kmi\u27e9q(Xi) = Kmi for\nq(Xi) with variance 0 and mean Xi. However, the terms are re-parametrised as independent sums\nover the input points \u2013 sums that can be computed on different nodes in a network without\ninter-communication. An in-depth explanation of the different transitions is given in the supplementary\nmaterial sections 3 and 4.\n4.2 Distributed Inference Algorithm\nA parallel inference algorithm can be easily derived based on this factorisation. Using the Map-Reduce\nframework [Dean and Ghemawat, 2008] we can maintain different subsets of the inputs\nand their corresponding outputs on each node in a parallel implementation and distribute the global\nparameters (such as the kernel hyper-parameters and the inducing inputs) to the nodes, collecting\nonly the partial terms calculated on each node.\nWe denote by G the set of global parameters over which we need to perform optimisation. 
These include Z (the inducing inputs), \u03b2 (the observation noise), and k (the set of kernel hyper-parameters).\nAdditionally we denote by Lk the set of local parameters on each node k that need to be optimised.\nThese include the mean and variance for each input point for the LVM model. First, we send to\nall end-point nodes the global parameters G for them to calculate the partial terms \u27e8Kmi\u27e9q(Xi) Yi,\n\u27e8KmiKim\u27e9q(Xi), \u27e8Kii\u27e9q(Xi), YiYi^T, and KL(q(Xi)||p(Xi)). The calculation of these terms is\nexplained in more detail in the supplementary material section 4. The end-point nodes return these\npartial terms to the central node (these are m \u00d7 m \u00d7 q matrices \u2013 constant space complexity for\nfixed m). The central node then sends the accumulated terms and partial derivatives back to the\nnodes and performs global optimisation over G. In the case of the GPLVM, the nodes then\nconcurrently perform local optimisation on Lk, the embedding posterior parameters. In total, we have two\nMap-Reduce steps between the central node and the end-point nodes to follow:\n\n1. The central node distributes G,\n2. Each end-point node k returns a partial sum of the terms A, B, C, D and KL based on Lk,\n3. The central node calculates F, \u2202F (m \u00d7 m \u00d7 q matrices) and distributes to the end-point nodes,\n4. 
The central node optimises G; at the same time the end-point nodes optimise Lk.\n\nWhen performing regression, the third step and the second part of the fourth step are not required.\nThe appendices of the supplementary material contain the derivations of all the partial derivatives\nrequired for optimisation.\nOptimisation of the global parameters can be done using any procedure that utilises the calculated\npartial derivative (such as scaled conjugate gradient [M\u00f8ller, 1993]), and the optimisation of the\nlocal variables can be carried out by parallelising SCG or using local gradient descent. We now\nexplore the developed inference empirically and evaluate its properties on a range of tasks.\n\n5 Experimental Evaluation\n\nWe now demonstrate that the proposed inference meets the criteria set out in the introduction. We\nassess the inference on its scalability with increased computational power for a \ufb01xed problem size\n(strong scaling) as well as with proportionally increasing data (weak scaling) and compare to exist-\ning inference. We further explore the distribution of the load over the different nodes, which is a\nmajor inhibitor in large scale distributed systems.\nIn the following experiments we used a squared exponential ARD kernel over the latent space to\nautomatically determine the dimensionality of the space, as in Titsias and Lawrence [2010]. We\ninitialise our latent points using PCA and our inducing inputs using k-means with added noise. We\noptimise using both L-BFGS and scaled conjugate gradient [M\u00f8ller, 1993].\n5.1 Scaling with Computation Power\nWe investigate how much inference on a given dataset can be sped up using the proposed algorithm\ngiven more computational resources. 
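To make the factorisation concrete, the map and reduce steps of section 4.2 can be sketched as follows. This is a minimal sketch, assuming a unit-variance RBF kernel and observed inputs (the regression case, where q(Xi) has zero variance and KL = 0); the function names are illustrative and not the GParML API.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel (unit variance) between rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def partial_terms(X_k, Y_k, Z):
    """Map step run on one node: the shard-local pieces of the global terms.
    With observed inputs the expectations <K_mi>, <K_mi K_im>, <K_ii>
    reduce to plain kernel evaluations."""
    Kmi = rbf(Z, X_k)                       # m x n_k
    A_k = np.sum(Y_k**2)                    # sum_i Y_i Y_i^T
    B_k = float(X_k.shape[0])               # sum_i K_ii (unit-variance RBF: K_ii = 1)
    C_k = Kmi @ Y_k                         # sum_i K_mi Y_i
    D_k = Kmi @ Kmi.T                       # sum_i K_mi K_im
    return A_k, B_k, C_k, D_k

def lower_bound(shards, Z, beta):
    """Reduce step: accumulate the partial terms and evaluate bound 4.3
    (with KL = 0, since the inputs are observed)."""
    m, d = Z.shape[0], shards[0][1].shape[1]
    n = sum(X.shape[0] for X, _ in shards)
    A = B = 0.0
    C, D = np.zeros((m, d)), np.zeros((m, m))
    for X_k, Y_k in shards:                 # each iteration = one node's map job
        A_k, B_k, C_k, D_k = partial_terms(X_k, Y_k, Z)
        A += A_k; B += B_k; C += C_k; D += D_k
    Kmm = rbf(Z, Z) + 1e-6 * np.eye(m)      # jitter for numerical stability
    KmmD = Kmm + beta * D
    logdet_Kmm = np.linalg.slogdet(Kmm)[1]
    logdet_KmmD = np.linalg.slogdet(KmmD)[1]
    return (-0.5 * n * d * np.log(2 * np.pi) + 0.5 * n * d * np.log(beta)
            + 0.5 * d * logdet_Kmm - 0.5 * d * logdet_KmmD
            - 0.5 * beta * A - 0.5 * beta * d * B
            + 0.5 * beta * d * np.trace(np.linalg.solve(Kmm, D))
            + 0.5 * beta**2 * np.trace(C.T @ np.linalg.solve(KmmD, C)))
```

Because A, B, C and D are plain sums over data points, each shard's contribution can be computed on a different node, and only the fixed-size accumulated terms need to be communicated to the central node.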
We assess the improvement of the running time of the\nalgorithm on a synthetic dataset of which large amounts of data could easily be generated. The dataset\nwas obtained by simulating a 1D latent space and transforming this non-linearly into 3D\nobservations. 100K points were generated and the algorithm was run using an increasing number of cores\nand a 2D latent space. We measured the total running time the algorithm spent in each iteration.\n\nFigure 1: Running time per iteration for 100K points synthetic dataset, as a function of available cores on log-scale.\n\nFigure 2: Time per iteration when scaling the computational resources proportionally to dataset size up to 50K points. Also shown is standard inference (GPy) for comparison.\n\nFigure 1 shows the improvement of run-time as a function of available cores. We obtain a relation\nvery close to the ideal t \u221d c \u00b7 (cores)\u22121. When doubling the number of cores from 5 to 10 we achieve\na factor 1.93 decrease in computation time \u2013 very close to ideal. In a higher range, a doubling from\n15 to 30 cores improves the running time by a factor of 1.90, so there is very little sign of diminishing\nreturns. It is interesting to note that we observed a minuscule overhead of about 0.05 seconds per\niteration in the global steps. This is due to the m \u00d7 m matrix inversion carried out in each global\nstep, which amounts to an additional time complexity of O(m^3) \u2013 constant for fixed m.\n5.2 Scaling with Data and Comparison to Standard Inference\nUsing the same setup, we assessed the scaling of the running time as we increased both the dataset\nsize and computational resources equally. For a doubling of data, we doubled the number of\navailable CPUs. In the ideal case of an algorithm with only distributable components, computation time\nshould be constant. Again, we measure the total running time of the algorithm per iteration. 
Figure 2 shows that we are able to effectively utilise the extra computational resources. Our total running\ntime takes 4.3% longer for a dataset scaled by 30 times.\nComparing the computation time to the standard inference scheme we see a significant improvement\nin performance in terms of running time. We compared to the sequential but highly optimised GPy\nimplementation (see figure 2). The suggested inference significantly outperforms GPy in terms of\nrunning time given more computational resources. Our parallel inference allows us to run sparse\nGPs and the GPLVM on datasets which would simply take too long to run with standard inference.\n\nFigure 3: Load distribution for each iteration. The maximum time spent in a node is the rate\nlimiting step. Shown are the minimum, mean and maximum execution times of all nodes when\nusing 5 (left) and 30 (right) cores.\n\nDataset       Mean   Linear  Ridge  RF     SVI 100  SVI 200  Dist GP 100\nFlight 7K     36.62  35.05   34.97  34.78  NA       NA       33.56\nFlight 70K    36.61  34.98   34.94  34.88  NA       NA       33.11\nFlight 700K   36.61  34.95   34.94  34.96  33.20    33.00    32.95\n\nTable 1: RMSE of flight delay (measured in minutes) for regression over flight data with 7K-700K\npoints by predicting mean, linear regression, ridge regression, random forest regression (RF),\nStochastic Variational Inference (SVI) GP regression with 100 and 200 inducing points, and the\nproposed inference with 100 inducing points (Dist GP 100).\n\n5.3 Distribution of the Load\nThe development of parallel inference procedures is an active field of research for Bayesian\nnon-parametric models [Lovell et al., 2012; Williamson et al., 2013]. However, it is important to study\nthe characteristics of the parallel algorithm, which are sometimes overlooked [Gal and Ghahramani,\n2014]. One of our stated requirements for a practical parallel inference algorithm is an\napproximately equal distribution of the load on the nodes. 
This is especially relevant in a Map-Reduce\nframework, where the reduce step can only happen after all map computations have \ufb01nished, so\nthe maximum execution time of one of the workers is the rate limiting step. Figure 3 shows the\nminimum, maximum and average execution time of all nodes. For 30 cores, there is on average a\n1.9% difference between the minimum and maximum run-time of the nodes, suggesting an even\ndistribution of the load.\n\n6 GP Regression and Latent Variable Modelling on Real-World Big Data\n\nNext we describe a series of experiments demonstrating the utility in scaling Gaussian processes to\nbig data. We show that GP performance improves with increasing amounts of data in regression\nand latent variable modelling tasks. We further show that GPs perform better than common models\noften used for big data.\nWe evaluate GP regression on the US \ufb02ight dataset [Hensman et al., 2013] with up to 2 million\npoints, and compare the results that we got to an array of baselines demonstrating the utility of\nusing GPs for large scale regression. We then present density modelling results over the MNIST\ndataset, performing imputation tests and digit classi\ufb01cation based on model comparison [Titsias and\nLawrence, 2010]. As far as we are aware, this is the \ufb01rst GP experiment to run on the full MNIST\ndataset.\n6.1 Regression on US Flight Data\nIn the regression test we predict \ufb02ight delays from various \ufb02ight-record characteristics such as \ufb02ight\ndate and time, \ufb02ight distance, and others. The US 2008 \ufb02ight dataset [Hensman et al., 2013] was\nused with different subset sizes of data: 7K, 70K, and 700K. We selected the \ufb01rst 800K points from\nthe dataset and then split the data randomly into a test set and a training set, using 100K points\nfor testing. We then used the \ufb01rst 7K and 70K points from the large training set to construct the\nsmaller training sets, using the same test set for comparison. 
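The split protocol just described can be sketched as follows (a hypothetical helper written for illustration; the paper does not give the exact code, and the delay-column layout is assumed):

```python
import numpy as np

def flight_splits(data, seed=0):
    """Take the first 800K records, hold out a random 100K-point test set,
    and nest the smaller training sets inside the large one so that all
    models are compared on the same test set."""
    rng = np.random.default_rng(seed)
    subset = data[:800_000]
    idx = rng.permutation(len(subset))
    test, train = subset[idx[:100_000]], subset[idx[100_000:]]
    return {"test": test,
            "train_7K": train[:7_000],
            "train_70K": train[:70_000],
            "train_700K": train}
```

Nesting the 7K and 70K sets inside the 700K training set is what makes the "more data helps" comparison meaningful: only the amount of training data changes between runs.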
This follows the experiment setup of\n[Hensman et al., 2013] and allows us to compare our results to the Stochastic Variational Inference\nsuggested for GP regression. In addition to that we constructed a 2M points dataset based on a\ndifferent split using 100K points for test. This test is not comparable to the other experiments due to\nthe non-stationary nature of the data, but it allows us to investigate the performance of the proposed\ninference compared to the baselines on even larger datasets.\nFor baselines we predicted the mean of the data, used linear regression, ridge regression with\nparameter 0.5, and MSE random forest regression at depth 2 with 100 estimators. We report the best\nresults we got for each model for different parameter settings with available resources. We trained\nour model with 100 inducing points for 500 iterations using LBFGS optimisation and compared the\nroot mean square error (RMSE) to the baselines as well as SVI with 100 and 200 inducing points\n(table 1). The results for 2M points are given in table 2.\n\nDataset     Mean   Linear  Ridge  RF     Dist GP 100\nFlight 2M   38.92  37.65   37.65  37.33  35.31\n\nTable 2: RMSE for flight data with 2M points by predicting mean, linear regression, ridge regression,\nrandom forest regression (RF), and the proposed inference with 100 inducing points (Dist GP).\n\nFigure 4: Log likelihood as a function of function evaluation for the 70K flight dataset using SCG and LBFGS optimisation.\n\nFigure 5: Digit from MNIST with missing data (left) and reconstruction using GPLVM (right).\n\nOur inference with 2M data points on\na 64 cores machine took \u223c 13.8 minutes per iteration. 
Even though the training of the baseline\nmodels took several minutes, the use of GPs for big data allows us to take advantage of their\ndesirable properties of uncertainty estimates, robustness to over-fitting, and principled ways for tuning\nhyper-parameters.\nOne unexpected result was observed while doing inference with SCG. When increasing the number\nof data points, the SCG optimiser converged to poor values. When using the final parameters of a\nmodel trained on a small dataset to initialise a model to be trained on a larger dataset, performance\nwas as expected. We concluded that SCG was not converging to the correct optimum, whereas\nL-BFGS performed better (figure 4). We suspect this happens because the modes in the optimisation\nsurface sharpen with more data. This is due to the increased weight of the likelihood terms.\n6.2 Latent Variable Modelling on MNIST\nWe also run the GP latent variable model on the full MNIST dataset, which contains 60K examples\nof 784 dimensions and is considered large in the Gaussian processes community. We trained one\nmodel for each digit and used it as a density model, using the predictive probabilities to perform\nclassification. We classify a test point to the model with the highest posterior predictive probability.\nWe follow the calculation in [Titsias and Lawrence, 2010], by taking the ratio of the exponentiated\nlog marginal likelihoods: p(y\u2217|Y ) = p(y\u2217, Y )/p(Y ) \u2248 exp(L_{y\u2217,Y} \u2212 L_Y). Due to the randomness in the\ninitialisation of the inducing inputs and latent point variances, we performed 10 random restarts on\neach model and chose the model with the largest marginal likelihood lower bound.\nWe observed that the models converged to a point where they performed similarly, occasionally\ngetting stuck in bad local optima. 
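The model-comparison classification rule above can be sketched as follows (a minimal sketch; in practice the per-model bounds come from the ten trained GPLVMs, and the dictionary layout here is illustrative, not the GParML API):

```python
import numpy as np

def classify(bounds):
    """Assign each test point to the digit model with the highest posterior
    predictive probability, p(y*|Y) = p(y*, Y)/p(Y) ~ exp(L_{y*,Y} - L_Y).
    `bounds` maps digit -> (per-test-point joint bounds L_{y*,Y},
    training-set bound L_Y) under that digit's model."""
    digits = sorted(bounds)
    # log p(y*|Y) per model; the argmax is unchanged by exponentiation
    scores = np.stack([bounds[c][0] - bounds[c][1] for c in digits])
    return np.array(digits)[np.argmax(scores, axis=0)]
```

Working with the log-space difference L_{y*,Y} - L_Y avoids exponentiating large-magnitude marginal likelihood bounds.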
No pre-processing was performed on the training data as our\nmain aim here is to show the benefit of training GP models using larger amounts of data, rather than\nproving state-of-the-art performance.\nWe trained the models on a subset of the data containing 10K points as well as the entire dataset\nwith all 60K points, using additional 10K points for testing. We observed an improvement of 3.03\npercentage points in classification error, decreasing the error from 8.98% to 5.95%. Training on the\nfull MNIST dataset took 20 minutes for the longest running model, using 500 iterations of SCG. We\ndemonstrate the reconstruction abilities of the GPLVM in figure 5.\n\n7 Conclusions\n\nWe have scaled sparse GP regression and latent variable modelling, presenting the first distributed\ninference algorithm able to process datasets with millions of data points. An extensive set of\nexperiments demonstrated the utility in scaling Gaussian processes to big data showing that GP\nperformance improves with increasing amounts of data. We studied the properties of the suggested\ninference, showing that the inference scales well with data and computational resources, while\npreserving a balanced distribution of the load among the nodes. Finally, we showed that GPs perform\nbetter than many common models used for big data.\nThe algorithm was implemented in the Map-Reduce architecture and is available as an open-source\npackage, containing an extensively documented implementation of the derivations, with references\nto the equations presented in the supplementary material for explanation.\n\nReferences\n\nArnold, S. (1981). The theory of linear models and multivariate analysis. Wiley Series in Probability and Mathematical Statistics. Wiley.\n\nAsuncion, A. U., Smyth, P., and Welling, M. (2008). Asynchronous distributed learning of topic models. 
In Advances in Neural Information Processing Systems, pages 81\u201388.\n\nBrockwell, A. E. (2006). Parallel Markov chain Monte Carlo simulation by pre-fetching. Journal of Computational and Graphical Statistics, 15(1):246\u2013261.\n\nDean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107\u2013113.\n\nGal, Y. and Ghahramani, Z. (2014). Pitfalls in the use of parallel inference for the Dirichlet process. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).\n\nHensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Nicholson, A. and Smyth, P., editors, UAI. AUAI Press.\n\nHoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303\u20131347.\n\nLawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783\u20131816.\n\nLovell, D., Adams, R. P., and Mansingka, V. (2012). Parallel Markov chain Monte Carlo for Dirichlet process mixtures. In Workshop on Big Learning, NIPS.\n\nM\u00f8ller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525\u2013533.\n\nQui\u00f1onero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939\u20131959.\n\nRasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.\n\nTitsias, M. and Lawrence, N. (2010). Bayesian Gaussian process latent variable model.\n\nTitsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian processes. Technical report.\n\nWilkinson, D. J. (2005). Parallel Bayesian computation. In Kontoghiorghes, E. 
J., editor, Handbook of Parallel Computing and Statistics, volume 184, pages 477\u2013508. Chapman and Hall/CRC, Boca Raton, FL, USA.\n\nWilliamson, S., Dubey, A., and Xing, E. P. (2013). Parallel Markov chain Monte Carlo for non-parametric mixture models. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 98\u2013106.\n", "award": [], "sourceid": 1660, "authors": [{"given_name": "Yarin", "family_name": "Gal", "institution": "University of Cambridge"}, {"given_name": "Mark", "family_name": "van der Wilk", "institution": "University of Cambridge"}, {"given_name": "Carl Edward", "family_name": "Rasmussen", "institution": "University of Cambridge"}]}