{"title": "Inference in Deep Gaussian Processes using Stochastic Gradient Hamiltonian Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 7506, "page_last": 7516, "abstract": "Deep Gaussian Processes (DGPs) are hierarchical generalizations of Gaussian Processes that combine well calibrated uncertainty estimates with the high flexibility of multilayer models. One of the biggest challenges with these models is that exact inference is intractable. The current state-of-the-art inference method, Variational Inference (VI), employs a Gaussian approximation to the posterior distribution. This can be a potentially poor unimodal approximation of the generally multimodal posterior. In this work, we provide evidence for the non-Gaussian nature of the posterior and we apply the Stochastic Gradient Hamiltonian Monte Carlo method to generate samples. To efficiently optimize the hyperparameters, we introduce the Moving Window MCEM algorithm. This results in significantly better predictions at a lower computational cost than its VI counterpart. Thus our method establishes a new state-of-the-art for inference in DGPs.", "full_text": "Inference in Deep Gaussian Processes using\nStochastic Gradient Hamiltonian Monte Carlo\n\nMarton Havasi\n\nDepartment of Engineering\nUniversity of Cambridge\n\nmh740@cam.ac.uk\n\nJos\u00b4e Miguel Hern\u00b4andez-Lobato\n\nDepartment of Engineering\nUniversity of Cambridge,\n\nMicrosoft Research,\nAlan Turing Institute\njmh233@cam.ac.uk\n\nJuan Jos\u00b4e Murillo-Fuentes\n\nDepartment of Signal Theory and Communications\n\nUniversity of Sevilla\n\nmurillo@us.es\n\nAbstract\n\nDeep Gaussian Processes (DGPs) are hierarchical generalizations of Gaussian Pro-\ncesses that combine well calibrated uncertainty estimates with the high \ufb02exibility\nof multilayer models. One of the biggest challenges with these models is that exact\ninference is intractable. 
The current state-of-the-art inference method, Variational\nInference (VI), employs a Gaussian approximation to the posterior distribution.\nThis can be a potentially poor unimodal approximation of the generally multimodal\nposterior. In this work, we provide evidence for the non-Gaussian nature of the\nposterior and we apply the Stochastic Gradient Hamiltonian Monte Carlo method\nto generate samples. To ef\ufb01ciently optimize the hyperparameters, we introduce the\nMoving Window MCEM algorithm. This results in signi\ufb01cantly better predictions\nat a lower computational cost than its VI counterpart. Thus our method establishes\na new state-of-the-art for inference in DGPs.\n\n1\n\nIntroduction\n\nDeep Gaussian Processes (DGP) [Damianou and Lawrence, 2013] are multilayer predictive models\nthat are highly \ufb02exible and can accurately model uncertainty. In particular, they have been shown to\nperform well on a multitude of supervised regression tasks ranging from small (\u223c500 datapoints) to\nlarge datasets (\u223c500,000 datapoints) [Salimbeni and Deisenroth, 2017, Bui et al., 2016, Cutajar et al.,\n2016]. Their main bene\ufb01t over neural networks is that they are capable of capturing uncertainty in\ntheir predictions. This makes them good candidates for tasks where the prediction uncertainty plays a\ncrucial role, for example, black-box Bayesian Optimization problems and a variety of safety-critical\napplications such as autonomous vehicles and medical diagnostics.\nDeep Gaussian Processes introduce a multilayer hierarchy to Gaussian Processes (GP) [Williams and\nRasmussen, 1996]. A GP is a non-parametric model that assumes a jointly Gaussian distribution for\nany \ufb01nite set of inputs. 
The covariance of any pair of inputs is determined by the covariance function. GPs can be a robust choice because they are non-parametric and analytically tractable; however, choosing the covariance function often requires hand tuning and expert knowledge of the dataset, which is not available without prior experience with the problem at hand. In a multilayer hierarchy, the hidden layers overcome this limitation by stretching and warping the input space, resulting in a Bayesian 'self-tuning' covariance function that fits the data without any human input [Damianou, 2015].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: (Left): Deep Gaussian Process illustration1. (Middle): Histograms of a random selection of inducing outputs. The best-fit Gaussian distribution is denoted with a dashed line. Some of them exhibit a clear multimodal behaviour. (Right): P-values for 100 randomly selected inducing outputs per dataset. The null hypotheses are that their distributions are Gaussian.

The deep hierarchical generalization of GPs is done in a fully connected, feed-forward manner. The outputs of the previous layer serve as the input to the next. However, a significant difference from neural networks is that the layer outputs are probabilistic rather than exact values, so uncertainty is propagated through the network. The left part of Figure 1 illustrates the concept with a single hidden layer. The input to the hidden layer is the input data x, and the output of the hidden layer f1 serves as the input data to the output layer, which is itself formed by GPs.

Exact inference is infeasible in GPs for large datasets due to the high computational cost of working with the inverse covariance matrix. 
Instead, the posterior is approximated using a small set of pseudo datapoints (∼100), also referred to as inducing points [Snelson and Ghahramani, 2006, Titsias, 2009, Quiñonero-Candela and Rasmussen, 2005]. We assume this inducing point framework throughout the paper. Predictions are made using the inducing points to avoid computing the covariance matrix of the whole dataset. Both in GPs and DGPs, the inducing outputs are treated as latent variables that need to be marginalized.

The current state-of-the-art inference method in DGPs is Doubly Stochastic Variational Inference (DSVI) [Salimbeni and Deisenroth, 2017], which has been shown to outperform Expectation Propagation [Minka, 2001, Bui et al., 2016]. It also outperforms Bayesian Neural Networks with Probabilistic Backpropagation [Hernández-Lobato and Adams, 2015] and Bayesian Neural Networks with earlier inference methods such as Variational Inference [Graves, 2011], Stochastic Gradient Langevin Dynamics [Welling and Teh, 2011] and Hybrid Monte Carlo [Neal, 1993]. However, a drawback of DSVI is that it approximates the posterior distribution with a Gaussian. We show, with high confidence, that the posterior distribution is non-Gaussian for every dataset that we examine in this work. This finding motivates the use of inference methods with more flexible posterior approximations.

In this work, we apply an inference method new to DGPs, Stochastic Gradient Hamiltonian Monte Carlo (SGHMC), a sampling method that accurately and efficiently captures the posterior distribution. In order to apply a sampling-based inference method to DGPs, we have to tackle the problem of optimizing the large number of hyperparameters. To address this problem, we propose Moving Window Monte Carlo Expectation Maximization, a novel method for obtaining the Maximum Likelihood (ML) estimate of the hyperparameters. 
This method is fast, efficient and generally applicable to any probabilistic model and MCMC sampler.

One might expect a sampling method such as SGHMC to be more computationally intensive than a variational method such as DSVI. However, in DGPs, sampling from the posterior is inexpensive, since it does not require the recomputation of the inverse covariance matrix, which only depends on the hyperparameters. Furthermore, calculating the layerwise variance has a higher cost in the VI setting.

1Image source: Daniel Hernández-Lobato

Lastly, we conduct experiments on a variety of supervised regression and classification tasks. We show empirically that our work significantly improves predictions on medium-large datasets at a lower computational cost.

Our contributions can be summarized in three points.

1. Demonstrating the non-Gaussianity of the posterior. We provide evidence that every regression dataset that we examine in this work has a non-Gaussian posterior.

2. We use SGHMC to directly sample from the posterior distribution of a DGP. Experiments show that this new inference method outperforms preceding works.

3. We introduce Moving Window MCEM, a novel algorithm for efficiently optimizing the hyperparameters when using an MCMC sampler for inference.

2 Background and Related Work

This section provides the background on Gaussian Processes and Deep Gaussian Processes for regression and establishes the notation used in the paper.

2.1 Single Layer Gaussian Process

Gaussian processes define a posterior distribution over functions f : R^D → R given a set of input-output pairs x = {x1, . . . , xN} and y = {y1, . . . , yN} respectively. Under the GP model, it is assumed that the function values f = f(x), where f(x) denotes {f (x1), . . . 
, f (xN )}, are jointly Gaussian with a fixed covariance function k : R^D × R^D → R. The conditional distribution of y is obtained via the likelihood function p(y|f). A commonly used likelihood function is p(y|f) = N(y|f, Iσ²) (constant Gaussian noise).

The computational cost of exact inference is O(N³), rendering it computationally infeasible for large datasets. A common approach uses a set of pseudo datapoints Z = {z1, . . . , zM}, u = f(Z) [Snelson and Ghahramani, 2006, Titsias, 2009] and writes the joint probability density function as

p(y, f, u) = p(y|f) p(f|u) p(u) .

The distribution of f given the inducing outputs u can be expressed as p(f|u) = N(μ, Σ) with

μ = K_xZ K_ZZ^{-1} u ,
Σ = K_xx − K_xZ K_ZZ^{-1} K_xZ^T ,

where the notation K_AB refers to the covariance matrix between two sets of points A, B with entries [K_AB]_ij = k(A_i, B_j), where A_i and B_j are the i-th and j-th elements of A and B respectively.

In order to obtain the posterior of f, u must be marginalized, yielding the equation

p(f|y) = ∫ p(f|u) p(u|y) du .

Note that f is conditionally independent of y given u.

For single layer GPs, Variational Inference (VI) can be used for marginalization. VI approximates the joint posterior distribution p(f, u|y) with the variational posterior q(f, u) = p(f|u) q(u), where q(u) = N(u|m, S). This choice of q(u) allows for exact inference of the marginal q(f|m, S) = ∫ p(f|u) q(u) du = N(f|μ̃, Σ̃), where

μ̃ = K_xZ K_ZZ^{-1} m ,
Σ̃ = K_xx − K_xZ K_ZZ^{-1} (K_ZZ − S) K_ZZ^{-1} K_xZ^T .    (1)

The variational parameters m and S need to be optimized. 
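As a concrete illustration, Eq. 1 can be sketched in NumPy as below. The function name, the shared-lengthscale squared exponential covariance, and the values chosen for m and S are illustrative assumptions, not part of the method itself:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    # Squared exponential covariance with a single shared lengthscale.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sparse_gp_marginal(x, Z, m, S, jitter=1e-8):
    # Marginal q(f | m, S) = N(f | mu, Sigma) from Eq. 1.
    Kxx = rbf(x, x)
    KxZ = rbf(x, Z)
    KZZ = rbf(Z, Z) + jitter * np.eye(len(Z))
    A = KxZ @ np.linalg.inv(KZZ)            # K_xZ K_ZZ^{-1}
    mu = A @ m
    Sigma = Kxx - A @ (KZZ - S) @ A.T
    return mu, Sigma

# Illustrative (not learned) variational parameters.
x = np.linspace(-1, 1, 5)[:, None]   # 5 test inputs
Z = np.linspace(-1, 1, 3)[:, None]   # 3 inducing inputs
m = np.zeros(3)
S = 0.1 * np.eye(3)
mu, Sigma = sparse_gp_marginal(x, Z, m, S)
print(mu.shape, Sigma.shape)  # (5,) (5, 5)
```

The values of m and S above are placeholders; in the actual method they are optimized.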
This is done by minimizing the Kullback-Leibler divergence between the true and the approximate posteriors, which is equivalent to maximizing a lower bound to the marginal likelihood (Evidence Lower Bound or ELBO):

log p(y) ≥ E_{q(f,u)}[ log p(y, f, u) − log q(f, u) ] = E_{q(f|m,S)}[ log p(y|f) ] − KL[ q(u) || p(u) ] .

2.2 Deep Gaussian Process

In a DGP of depth L, each layer is a GP that models a function fl with input fl−1 and output fl for l = 1, . . . , L (f0 = x), as illustrated in the left part of Figure 1. The inducing inputs for the layers are denoted by Z1, . . . , ZL with associated inducing outputs u1 = f1(Z1), . . . , uL = fL(ZL). The joint probability density function can be written analogously to the GP model case:

p(y, {fl}_{l=1}^L, {ul}_{l=1}^L) = p(y|fL) ∏_{l=1}^L p(fl|ul) p(ul) .    (2)

2.3 Inference

The goal of inference is to marginalize the inducing outputs {ul}_{l=1}^L and layer outputs {fl}_{l=1}^L and approximate the marginal likelihood p(y). This section discusses prior works regarding inference.

Doubly Stochastic Variational Inference  DSVI is an extension of Variational Inference to DGPs [Salimbeni and Deisenroth, 2017] that approximates the posterior of the inducing outputs ul with independent multivariate Gaussians q(ul) = N(ul|ml, Sl). The layer outputs naturally follow the single layer model in Eq. 1:

q(fl|fl−1) = N(fl|μ̃l, Σ̃l) ,
q(fL) = ∫ ∏_{l=1}^L q(fl|fl−1) df1 . . . dfL−1 .

The first term in the resulting ELBO, L = E_{q(fL)}[ log p(y|fL) ] − ∑_{l=1}^L KL[ q(ul) || p(ul) ], is then estimated by sampling the layer outputs through minibatches to allow scaling to large datasets.

Sampling-based inference for Gaussian Processes  In a related work, Hensman et al. [2015] use Hybrid MC sampling in single layer GPs. 
They consider joint sampling of the GP hyperparameters and the inducing outputs. This work cannot straightforwardly be extended to DGPs because of the high cost of sampling the GP hyperparameters. Moreover, it uses a costly method, Bayesian Optimization, to tune the parameters of the sampler, which further limits its applicability to DGPs.

3 Analysis of the Deep Gaussian Process Posterior

Adopting a new inference method over variational inference is motivated by the restrictive form that VI assumes for the posterior distribution. The variational assumption is that p({ul}_{l=1}^L | y) takes the form of a multivariate Gaussian that assumes independence between the layers. While in a single layer model a Gaussian approximation to the posterior is provably correct [Williams and Rasmussen, 1996], this is not the case for DGPs.

First, we illustrate with a toy problem that the posterior distribution in DGPs can be multimodal. Following that, we provide evidence that every regression dataset that we consider in this work results in a non-Gaussian posterior distribution.

Multimodal toy problem  The multimodality of the posterior of a two layer DGP (L = 2) is demonstrated on a toy problem (Table 1). For the purpose of the demonstration, we made the simplifying assumption that σ² = 0, so the likelihood function has no noise. This toy problem has two Maximum-A-Posteriori (MAP) solutions (Mode A and Mode B). The table shows the variational posterior at each layer for DSVI. We can see that it fits one of the modes randomly (depending on the initialization) while completely ignoring the other. 
On the other hand, a sampling method such as SGHMC (as implemented in the following section) explores both of the modes and therefore provides a better approximation to the posterior.

Empirical evidence  To further support our claim regarding the multimodality of the posterior, we give empirical evidence that, for real-world datasets, the posterior is not Gaussian.

Table 1: The layer inputs and outputs of a two layer DGP. Under DSVI, we show the mean and the standard deviation of the variational distribution. In the case of SGHMC, samples from each layer are shown. The two MAP solutions are shown under Mode A and Mode B. (Columns: Toy Problem, DSVI, SGHMC, Mode A, Mode B; rows: Layer 1, Layer 2.)

Figure 2: The toy problem with 7 datapoints.

We conduct the following analysis. Consider the null hypothesis that the posterior under a dataset is a multivariate Gaussian distribution. This null hypothesis implies that the distribution of each inducing output is a Gaussian. We examine the approximate posterior samples generated by SGHMC for each inducing output, using the implementation of SGHMC for DGPs described in the next section. In order to derive p-values, we apply the kurtosis test for Gaussianity [Cramer, 1998]. This test is commonly used to identify multimodal distributions because these often have a significantly different kurtosis (also called the 4th moment).

For each dataset, we calculate the p-values of 100 randomly selected inducing outputs and compare the results against the probability threshold α = 10⁻⁵. The Bonferroni correction was applied to α to account for the high number of concurrent hypothesis tests. The results are displayed in the right part of Figure 1. 
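The per-output check can be sketched with SciPy's kurtosis test. The synthetic chains below are illustrative stand-ins for SGHMC posterior samples of a single inducing output:

```python
import numpy as np
from scipy.stats import kurtosistest

rng = np.random.default_rng(0)

# Stand-ins for posterior samples of one inducing output:
# a unimodal (Gaussian) chain and a clearly bimodal one.
gaussian_samples = rng.normal(0.0, 1.0, size=2000)
bimodal_samples = np.concatenate([rng.normal(-2.0, 0.3, 1000),
                                  rng.normal(2.0, 0.3, 1000)])

_, p_gauss = kurtosistest(gaussian_samples)
_, p_bimodal = kurtosistest(bimodal_samples)

# At alpha = 1e-5, the Gaussian chain should not be rejected,
# while the bimodal chain should be rejected decisively.
alpha = 1e-5
print(p_gauss > alpha, p_bimodal < alpha)
```

In the actual analysis this test is applied per inducing output and per dataset, with the Bonferroni-corrected threshold described above.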
Since every single dataset had p-values under the threshold, we can state with 99% certainty that all of these datasets have a non-Gaussian posterior.

4 Sampling-based Inference for Deep Gaussian Processes

Unlike with VI, when using sampling methods we do not have access to an approximate posterior distribution q(u) to generate predictions with. Instead, we have to rely on approximate samples generated from the posterior, which in turn can be used to make predictions [Dunlop et al., 2017, Hoffman, 2017].

In practice, we run a sampling process which has two phases. The burn-in phase is used to determine the hyperparameters of the model and the sampler. The hyperparameters of the sampler are selected using a heuristic auto-tuning approach, while the hyperparameters of the DGP are optimized using the novel Moving Window MCEM algorithm. In the sampling phase, the sampler is run using the fixed hyperparameters. Since consecutive samples are highly correlated, we save one sample every 50 iterations and generate 200 samples for prediction. Once the posterior samples are obtained, predictions can be made by combining the per-sample predictions to obtain a mixture distribution. Note that it is not more expensive to make predictions using this sampler than in DSVI, since DSVI also needs to sample the layer outputs to make predictions.

4.1 Stochastic Gradient Hamiltonian Monte Carlo

SGHMC [Chen et al., 2014] is a Markov Chain Monte Carlo sampling method [Neal, 1993] for producing samples from the intractable posterior distribution of the inducing outputs p(u|y) purely from stochastic gradient estimates. With the introduction of an auxiliary variable, r, the sampling procedure provides samples from the joint distribution p(u, r|y). 
The equations that describe the MCMC process can be related to Hamiltonian dynamics [Brooks et al., 2011, Neal, 1993]. The negative log-posterior U(u) acts as the potential energy and r plays the role of the kinetic energy:

p(u, r|y) ∝ exp( −U(u) − (1/2) r^T M^{-1} r ) ,    U(u) = −log p(u|y) .

Algorithm 1: Moving Window MCEM
initialize(θ);
initialize(u);
initialize(samples[1···m]);
for i ← 0 to maxiter do
    u′ ← randomElement(samples);
    stepSGD(∂p(y, u′|x, θ)/∂θ);
    u ∼ p(u|y, x, θ);
    push_front(samples, u);
    pop_back(samples);
end

Figure 3: (Left): Pseudocode for Moving Window MCEM. (Middle): Comparison of predictive performance of the Moving Window MCEM and MCEM algorithms. Vertical lines denote E-steps in the MCEM algorithm. Higher is better. (Right): Comparison of the convergence of the different inference methods. Higher is better.

In HMC, the exact description of motion requires the computation of the gradient ∇U(u) in each update step, which is impractical for large datasets because of the high cost of integrating out the layer outputs in Eq. 2. This integral can be approximated by a lower bound that can be evaluated by Monte Carlo sampling [Salimbeni and Deisenroth, 2017]:

log p(u, y) = log ∫ p(y, f, u) df ≥ ∫ p(f|u) log[ p(y, f, u) / p(f|u) ] df ≈ log[ p(y, f^i, u) / p(f^i|u) ] ,

where f^i is a Monte Carlo sample from the predictive distribution of the layer outputs: f^i ∼ p(f|u) = ∏_{l=1}^L p(fl|ul, fl−1). 
This leads to the estimate

log p(u, y) ≈ log[ p(y|f^i, u) p(u) ] = log p(y|f^i, u) + log p(u) ,

which we can use to approximate the gradient, since ∇U(u) = −∇ log p(u|y) = −∇ log p(u, y). Chen et al. [2014] show that approximate posterior sampling is still possible with stochastic gradient estimates (obtained by subsampling the data) if the following update equations are used:

Δu = ε M^{-1} r ,
Δr = −ε ∇U(u) − ε C M^{-1} r + N( 0, 2ε(C − B̂) ) ,

where C is the friction term, M is the mass matrix, B̂ is the Fisher information matrix and ε is the step-size.

One caveat of SGHMC is that it has multiple parameters (C, M, B̂, ε) that can be difficult to set without prior knowledge of the model and the data. We use the auto-tuning approach of Springenberg et al. [2016] to set these parameters, which has been shown to work well for Bayesian Neural Networks (BNNs). The similar nature of DGPs and BNNs strongly suggests that the same methodology is applicable to DGPs.

4.2 Moving Window Markov Chain Expectation Maximization

Optimizing the hyperparameters θ (parameters of the covariance function, inducing inputs and parameters of the likelihood function) proves difficult for MCMC methods [Turner and Sahani, 2011]. 
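As an illustration, the SGHMC updates above can be sketched in Python. The simplifications here are assumptions for the sketch only: M = I, B̂ = 0, a scalar friction C, an exact (non-stochastic) gradient, and a toy Gaussian target standing in for the DGP posterior over u:

```python
import numpy as np

def sghmc(grad_U, u0, eps=0.01, C=1.0, n_steps=50_000, rng=None):
    # SGHMC update equations (Chen et al., 2014) with M = I, B-hat = 0:
    #   du = eps * r
    #   dr = -eps * grad_U(u) - eps * C * r + N(0, 2 * eps * C)
    rng = rng or np.random.default_rng(0)
    u, r = u0, 0.0
    samples = []
    for _ in range(n_steps):
        u = u + eps * r
        noise = rng.normal(0.0, np.sqrt(2 * eps * C))
        r = r - eps * grad_U(u) - eps * C * r + noise
        samples.append(u)
    return np.array(samples)

# Toy 1-D target: p(u) = N(0, 1), so U(u) = u^2 / 2 and grad_U(u) = u.
samples = sghmc(grad_U=lambda u: u, u0=0.0)
# The empirical mean and std should be close to 0 and 1 respectively.
print(samples.mean(), samples.std())
```

These updates sample u only; the hyperparameters θ are held fixed within each step, and optimizing them during sampling is the subject of the next section.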
The naive approach of optimizing the hyperparameters as the sampler progresses fails because subsequent samples are highly correlated and, as a result, the hyperparameters simply fit this moving point-estimate of the posterior.

Monte Carlo Expectation Maximization (MCEM) [Wei and Tanner, 1990] is the natural extension of the Expectation Maximization algorithm that works with posterior samples to obtain the Maximum Likelihood estimate of the hyperparameters. MCEM alternates between two steps. The E-step samples from the posterior and the M-step maximizes the average log joint probability of the samples and the data:

E-step: u_{1...m} ∼ p(u|y, x, θ) ,
M-step: θ = argmax_θ Q(θ) ,

where Q(θ) = (1/m) ∑_{i=1}^m log p(y, u_i|x, θ).

However, there is a significant drawback to MCEM: if the number of samples m used in the M-step is too low, there is a risk of the hyperparameters overfitting to those samples. On the other hand, if m is too high, the M-step becomes too expensive to compute. Furthermore, in the M-step, θ is maximized via gradient ascent, which means that the computational cost increases linearly with m.

To address this, we introduce a novel extension of MCEM called Moving Window MCEM. Our method optimizes the hyperparameters at the same cost as the naive approach described above while avoiding its overfitting issues. The idea behind Moving Window MCEM is to intertwine the E and M steps. Instead of generating new samples and then maximizing Q(θ) until convergence, we maintain a set of samples and take small steps towards the maximum of Q(θ). In the E-step, we generate one new sample and add it to the set while discarding the oldest sample (hence Moving Window). 
This is followed by the\nM-step, in which we take a random sample from the set and use it to take an approximate gradient\nstep towards the maximum of Q(\u03b8). Algorithm 1 on the left side of Figure 3 presents the pseudocode\nfor Moving Window MCEM.\nThere are two advantages over MCEM. Firstly, the cost of each update of the hyperparameters is\nconstant and does not scale with m since it only requires a single sample. Secondly, Moving Window\nMCEM converges faster than MCEM. The middle plot of Figure 3 demonstrates this. MCEM\niteratively \ufb01ts the hyperparameters for a speci\ufb01c set of posterior samples. Since hyperparameters and\nposterior samples are highly coupled, this alternating update scheme converges slowly [Neath et al.,\n2013]. To mitigate this problem, Moving Window MCEM continuously updates its population of\nsamples by generating a new sample after each gradient step.\nTo produce the plot in the center of Figure 3, we plotted the predictive log-likelihood on the test set\nagainst the number of iterations of the algorithm to demonstrate the superior performance of Moving\nWindow MCEM over MCEM. For MCEM, we used a set size of m = 10 (larger m would slow down\nthe method) which we generated over 500 MCMC steps. For Moving Window MCEM, we used a\nwindow size of m = 300. The model used in this experiment is a DGP with one hidden layer trained\non the kin8nm dataset.\n\n5 Decoupled Deep Gaussian Processes\n\nThis section describes an extension to DGPs that enables using a large number of inducing points\nwithout signi\ufb01cantly impacting performance. This method is only applicable in the case of DSVI, so\nwe considered it as a baseline model in our experiments.\nUsing the dual formulation of a GP as a Gaussian measure, it has been shown that it does not\nnecessarily have to be the case that \u02dc\u00b5 and \u02dc\u03a3 (Eq. 1) are parameterized by the same set of inducing\npoints [Cheng and Boots, 2017, 2016]. 
In the case of DGPs, this means that one can use two separate sets of inducing points: one set to compute the layerwise mean and one set to compute the layerwise variance. In the variational inference setting, computing the layerwise variance has a significantly higher cost than computing the layerwise mean. A larger set of inducing points can therefore be used to compute the layerwise mean and a smaller set to compute the layerwise variance, improving the predictive performance without impacting the computational cost.

Figure 4: Log-likelihood and standard deviation for each method on the UCI datasets. Ranking Summary: average rank and standard deviation. Right is better. Best viewed in colour.

Unfortunately, the parameterization advocated by Cheng and Boots [2017] has poor convergence properties. The dependencies in the ELBO result in a highly non-convex optimization problem, which in turn leads to high-variance gradients. To combat this problem, we used a different parameterization that lifts the dependencies and achieves stable convergence. Further details on these issues can be found in the supplementary material.

6 Experiments

We conducted experiments2 on 9 UCI benchmark datasets ranging from small (∼500 datapoints) to large (∼500,000) for a fair comparison against the baseline. In each regression task, we measured the average test Log-Likelihood (MLL) and compared the results. Figure 4 shows the MLL values and their standard deviation over 10 repetitions.

Following Salimbeni and Deisenroth [2017], in all of the models we set the learning rate to the default 0.01, the minibatch size to 10,000 and the number of iterations to 20,000. One iteration involves drawing a sample from the window and updating the hyperparameters by gradient descent, as illustrated in Algorithm 1 on the left side of Figure 3. The depth varied from 0 hidden layers up to 4, with 10 nodes per layer. 
The covariance function was a standard squared exponential function with separate lengthscales per dimension. We used a random 0.8-0.2 train-test split. In the year dataset, we used a fixed train-test split to avoid the 'producer effect', making sure no song from a given artist ended up in both the train and test sets.

Baselines: The main baselines for our experiments were the Doubly Stochastic DGPs. For a faithful comparison, we used the same parameters as in the original paper. In terms of the number of inducing points (the inducing inputs are always shared across the latent dimensions), we tested two variants. First, the original, coupled version with M = 100 inducing points per layer (DGP). Secondly, a decoupled version (Dec DGP) with Ma = 300 points for the mean and Mb = 50 for the variance. These numbers were chosen so that the runtime of a single iteration is the same as for the coupled version. Further baselines were provided by coupled (SGP: M = 100) and decoupled (Dec SGP: Ma = 300, Mb = 50) single layer GPs.

2Our code is based on the TensorFlow [Abadi et al., 2015] computing library and it is publicly available at https://github.com/cambridge-mlg/sghmc_dgp.

The \ufb01nal baseline was a Robust Bayesian Neural Network\n(BNN) [Springenberg et al., 2016] with three hidden layers and 50 nodes per layer.\nSGHMC DGP (This work): The architecture of this model is the same as the baseline models.\nM = 100 inducing inputs were used to stay consistent with the baseline. The burn-in phase consisted\nof 20,000 iterations followed by the sampling phase during which 200 samples were drawn over the\ncourse of 10,000 iterations.\n\nMNIST classi\ufb01cation SGHMC is also effective on classi\ufb01cation problems. Using the Robust-Max\n[Hern\u00b4andez-Lobato et al., 2011] likelihood function, we applied the model to the MNIST dataset. The\nSGP and Dec SGP models achieved an accuracy of 96.8 % and 97.7 % respectively. Regarding the\ndeep models, the best performing model was Dec DGP 3 with 98.1 % followed by SGHMC DGP 3\nwith 98.0 % and DGP 3 with 97.8 %. [Salimbeni and Deisenroth, 2017] report slightly higher values\nof 98.11 % for DGP 3. This difference can be attributed to different initialization of the parameters.\n\nHarvard Clean Energy Project This regression dataset was produced for the Harvard Clean\nEnergy Project [Hachmann et al., 2011]. It measures the ef\ufb01ciency of organic photovoltaic molecules.\nIt is a high-dimensional dataset (60,000 datapoints and 512 binary features) that is known to bene\ufb01t\nfrom deep models. SGHMC DGP 5 established a new state-of-the-art predictive performance with\ntest MLL of \u22120.83. DGP 2-5 reached up-to \u22121.25. Other available results on this dataset are \u22120.99\nfor DGPs with Expectation Propagation and BNNs with \u22121.37 [Bui et al., 2016].\n\nRuntime To support our claim that SGHMC has a lower computational cost than DSVI, we plot\nthe test MLL at different stages during the training process on the protein dataset (the right plot in\nFigure 3). SGHMC converges faster and to a higher limit than DSVI. 
SGHMC reached the target\n20,000 iterations 1.6 times faster.\n\n7 Conclusions\n\nThis paper described and demonstrated an inference method new to DGPs, SGHMC, that samples\nfrom the posterior distribution in the usual inducing point framework. We described a novel Moving\nWindow MCEM algorithm that was demonstrably able to optimize hyperparameters in a fast and\nef\ufb01cient manner. This signi\ufb01cantly improved performance on medium-large datasets at a reduced\ncomputational cost and thus established a new state-of-the-art for inference in DGPs.\n\nAcknowledgements\n\nWe want to thank Adri`a Gariga-Alonso, John Bronskill, Robert Peharz and Siddharth Swaroop for\ntheir helpful comments and thank Intel and EPSRC for their generous support.\nJuan Jos\u00b4e Murillo-Fuentes acknowledges funding from the Spanish government (TEC2016- 78434-\nC3-R) and the European Union (MINECO/FEDER, UE).\n\nReferences\nM. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,\nM. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz,\nL. Kaiser, M. Kudlur, J. Levenberg, D. Man\u00b4e, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster,\nJ. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vi\u00b4egas,\nO. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-\nscale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/.\nSoftware available from tensor\ufb02ow.org.\n\nS. Brooks, A. Gelman, G. Jones, and X.-L. Meng. Handbook of Markov chain Monte Carlo. CRC\n\npress, 2011.\n\n9\n\n\fT. Bui, D. Hern\u00b4andez-Lobato, J. Hernandez-Lobato, Y. Li, and R. Turner. Deep Gaussian processes\nfor regression using approximate expectation propagation. In International Conference on Machine\nLearning, pages 1472\u20131481, 2016.\n\nT. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. 
In International Conference on Machine Learning, pages 1683–1691, 2014.

C.-A. Cheng and B. Boots. Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems, pages 4410–4418, 2016.

C.-A. Cheng and B. Boots. Variational inference for Gaussian process models with linear complexity. In Advances in Neural Information Processing Systems, pages 5190–5200, 2017.

D. Cramer. Fundamental Statistics for Social Research: Step-by-Step Calculations and Computer Techniques Using SPSS for Windows. Routledge, New York, NY, 1998. ISBN 0415172039.

K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Random feature expansions for deep Gaussian processes. arXiv preprint arXiv:1610.04386, 2016.

A. Damianou. Deep Gaussian processes and variational propagation of uncertainty. PhD thesis, University of Sheffield, 2015.

A. Damianou and N. Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.

M. M. Dunlop, M. Girolami, A. M. Stuart, and A. L. Teckentrup. How deep are deep Gaussian processes? arXiv preprint arXiv:1711.11280, 2017.

A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.

J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R. S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A. M. Brockway, and A. Aspuru-Guzik. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.

J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015.

D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont.
Robust multi-class Gaussian process classification. In Advances in Neural Information Processing Systems, pages 280–288, 2011.

J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.

M. D. Hoffman. Learning deep latent Gaussian models with Markov chain Monte Carlo. In International Conference on Machine Learning, pages 1510–1519, 2017.

T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.

R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.

R. C. Neath et al. On convergence properties of the Monte Carlo EM algorithm. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, pages 43–62. Institute of Mathematical Statistics, 2013.

J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

H. Salimbeni and M. Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4591–4602, 2017.

E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. MIT Press, 2006.

J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134–4142, 2016.

M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In D.
van Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574. PMLR, 2009.

R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, A. T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models. Cambridge University Press, 2011.

G. C. Wei and M. A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704, 1990.

M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

C. K. Williams and C. E. Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520, 1996.