{"title": "Gaussian Process Conditional Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 2385, "page_last": 2395, "abstract": "Conditional Density Estimation (CDE) models deal with estimating conditional distributions. The conditions imposed on the distribution are the inputs of the model. CDE is a challenging task as there is a fundamental trade-off between model complexity, representational capacity and overfitting. In this work, we propose to extend the model's input with latent variables and use Gaussian processes (GP) to map this augmented input onto samples from the conditional distribution. Our Bayesian approach allows for the modeling of small datasets, but we also provide the machinery for it to be applied to big data using stochastic variational inference. Our approach can be used to model densities even in sparse data regions, and allows for sharing learned structure between conditions. We illustrate the effectiveness and wide-reaching applicability of our model on a variety of real-world problems, such as spatio-temporal density estimation of taxi drop-offs, non-Gaussian noise modeling, and few-shot learning on omniglot images.", "full_text": "Gaussian Process Conditional Density Estimation

Vincent Dutordoir*1  Hugh Salimbeni*1,2  Marc Peter Deisenroth1,2  James Hensman1

1PROWLER.io, Cambridge, UK  2Imperial College London

{vincent, hugh, marc, james}@prowler.io

Abstract

Conditional Density Estimation (CDE) models deal with estimating conditional distributions. The conditions imposed on the distribution are the inputs of the model. CDE is a challenging task as there is a fundamental trade-off between model complexity, representational capacity and overfitting. In this work, we propose to extend the model's input with latent variables and use Gaussian processes (GPs) to map this augmented input onto samples from the conditional distribution.
Our Bayesian approach allows for the modeling of small datasets, but we also provide the machinery for it to be applied to big data using stochastic variational inference. Our approach can be used to model densities even in sparse data regions, and allows for sharing learned structure between conditions. We illustrate the effectiveness and wide-reaching applicability of our model on a variety of real-world problems, such as spatio-temporal density estimation of taxi drop-offs, non-Gaussian noise modeling, and few-shot learning on omniglot images.

1 Introduction

Conditional Density Estimation (CDE) is the very general task of inferring the probability distribution p(f(x) | x), where f(x) is a random variable for each x. Regression can be considered a CDE problem, although the emphasis is on modeling the mapping rather than the conditional density. The conditional density is commonly Gaussian with parameters that depend on x. This simple model for data may be inappropriate if the conditional density is multi-modal or has non-linear associations. Throughout this paper we consider an input x to be the condition, and the output y to be a sample from the conditional density imposed by x. For example, in the case of estimating the density of taxi drop-offs, the input or condition x could be the pick-up location and the output y would be the corresponding drop-off. In this context, we are more interested in learning the complete density over drop-offs rather than only a single point estimate, as we would expect the taxi drop-off distribution to be multi-modal because passengers need to go to different places (e.g., airport/city center/suburbs). We would also expect the drop-off location to depend on the starting point and time of day: therefore, we are interested in conditional densities.
In the experiment section we will return to this example.

In this work, we present a Gaussian process (GP) based model for estimating conditional densities, abbreviated as GP-CDE. While a vanilla GP used directly is unlikely to be a good model for conditional density estimation as the marginals are Gaussian, we extend the inputs to the model with latent variables to allow for modeling richer, non-Gaussian densities when marginalizing the latent variable. Fig. 1 shows a high-level overview of the model. The added latent variables are denoted by w. The latent variable w and the condition x are used as inputs to the GP. A recognition/encoder network is used to amortize the learning of the variational posterior of the latent variables. The matrices A and P act as probabilistic linear transforms on the input and the output of the GP, respectively.

The GP-CDE model is closely related to both supervised and unsupervised, non-Bayesian and Bayesian models. We first consider the relationship to parametric models, in particular Variational

*Equal contribution

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Diagram of the GP-Conditional Density Estimator. The GP-CDE consists of an encoder (blue) and a decoder (orange). The observed variables x and y are respectively the condition on the distribution and a sample from the conditional distribution. The encoder consists of a neural network, and uses the variables x and y as inputs to produce the parameters for the posterior of the latent variable w. The decoder part is built out of a GP and two linear transformation matrices, A and P. A is applied to the condition x to reduce the dimensionality of x before it is combined with the latent w and fed into the GP.
The matrix P can be used to correlate the outputs of the latent function f.

Autoencoders (VAEs) [17, 23] and their conditional counterparts (CVAE) [18, 24]. (C)VAEs use a deterministic, parametrized neural network as decoder, whereas we adopt a Bayesian non-parametric GP. Using a GP for this non-linear decoder mapping offers two advantages. First, it allows us to specify our prior beliefs about the mapping, resulting in a model that can gracefully accommodate sparse or missing data. Note that even large datasets can be sparse, e.g., the omniglot images or taxi drop-off locations at the fringes of a city. As the CVAE has no prior on the decoder mapping, the model can overfit the training data and the latent variable posterior becomes over-concentrated, leading to poor test log-likelihoods. Second, the GP allows tractable uncertainty propagation through the decoder mapping (for certain kernels). This allows us to calculate the variational objective deterministically, and we can further exploit natural gradients for fast inference. Neural network decoders do not admit such structure, and are typically optimized with general-purpose tools.

A second perspective can be seen through the connection to Gaussian process models. By dropping the latent variables w we recover standard multiple-output GP regression with sparse variational inference. If we drop the known inputs x and use only the latent variables, we obtain the Bayesian GP-LVM [26, 20]. Bayesian GP-LVMs are typically used for modeling complex distributions and non-linear mappings from a lower-dimensional latent variable into a high-dimensional space. By combining the GP-LVM framework with known inputs we create a model that outputs conditional samples in this high-dimensional space.

Our primary contribution is to show that our GP-CDE can be applied to a wide variety of settings, without the necessity for fine-tuning or regularization.
We show that GP-CDE outperforms GP regression on regression benchmarks; we study the importance of accurate density estimation in high-dimensional spaces; and we deal with learning the correlations between conditions in a large spatio-temporal dataset. We achieve this through three specific contributions. (i) We extend the model of Wang and Neal [29] with linear transformations for the inputs and outputs. This allows us to deal with high-dimensional conditions and enables a priori correlations in the output. (ii) We apply natural gradients to address the difficulty of mini-batched optimization in [26]. (iii) We derive a free-form optimal posterior distribution over the latent variables. This provides a tighter bound and reduces the number of variational parameters to optimize.

2 Background: Gaussian Processes and Latent Variable Models

Gaussian Processes  A Gaussian process (GP) is a Bayesian non-parametric model for functions. Bayesian models have two significant advantages that we exploit in this work: we can specify prior beliefs leading to greater data efficiency, and we can obtain uncertainty estimates for predictions. A GP is defined as a set of random variables {f(x_1), f(x_2), ...}, any finite subset of which follows a multivariate Gaussian distribution [22]. When a stochastic function f: R^D → R follows a GP it is fully specified by its mean m(·) and covariance function k(·, ·), and we write f ∼ GP(m(·), k(·, ·)). The most common use of a GP is the regression task of inferring an unknown function f, given a set of N observations y = [y_1, ..., y_N]^⊤ and corresponding inputs x_1, ..., x_N.
The likelihood p(y_i | f) is generally taken to depend on f(x_i) only, and the Gaussian likelihood N(y_n | f(x_n), σ²) is widely used as it results in analytical closed-form inference.

Conditional Deep Latent Variable Models  Conditional Deep Latent Variable Models (C-DLVMs) consist of two components: a prior distribution p(w_n) over the latent variables,² which is assumed to factorize over the data, and a generator or decoder function g_θ(x_n, w_n): R^{D_x + D_w} → R^{D_y}. The VAE and CVAE are examples where the generator function is a deep (convolutional) neural network with weights θ [17, 24]. The outputs of the generator function are the parameters of a likelihood, commonly the Gaussian for continuous data or the Bernoulli for binary data. The joint distribution for a single data point is p_θ(y_n, w_n | x_n) = p(w_n) p_θ(y_n | x_n, w_n). We assume the data to be i.i.d.; the marginal likelihood in the Gaussian case is then given by

log p_θ(Y | X) = ∑_n log p_θ(y_n | x_n) = ∑_n log ∫ p(w_n) N(y_n | g_θ(x_n, w_n), σ²I) dw_n ,   (1)

where X = {x_n}_{n=1}^N, and likewise for Y and W. As g_θ(x_n, w_n) is a complicated non-linear function of its inputs, this integral cannot be calculated in closed form. Kingma and Welling [17] and Rezende et al. [23] addressed this problem by using variational inference. Variational inference posits an approximate posterior distribution q_φ(W), and finds the closest q_φ to the true posterior, measured by KL divergence, i.e. arg min_{q_φ} KL[q_φ(W) ‖ p(W | X, Y)]. It can be shown that this optimization objective is equal to the Evidence Lower Bound (ELBO)

log p_θ(Y | X) ≥ L := ∑_n E_{q_φ(w_n)}[log p_θ(y_n | x_n, w_n)] − KL[q_φ(w_n) ‖ p(w_n)] .

A mean-field distribution is typically used for the latent variables W, with a multivariate Gaussian form for q_φ(w_n) = N(w_n | µ_{w_n}, Σ_{w_n}).
Rather than representing the Gaussian parameters µ_{w_n} and Σ_{w_n} for each data point directly, Kingma and Welling [17] and Rezende et al. [23] instead amortize these parameters into a set of global parameters φ, where φ parameterizes an auxiliary function h_φ: (x_n, y_n) ↦ (µ_{w_n}, Σ_{w_n}), referred to as the encoder/recognition network.

3 Conditional Density Estimation with Gaussian Processes

This section details our model and the inference scheme. Our contributions are threefold: (i) we derive an optimal free-form variational distribution q(W) (Section 3.2); (ii) we ease the burden of jointly optimizing q(f(·)) and q(W) by using natural gradients (Section 3.2.1) for the variational parameters of q(f(·)); (iii) we extend the model to allow for the modeling of high-dimensional inputs and impose correlation on the outputs using linear transformations (Section 3.3).

3.1 Model

The key idea of our model is to substitute the neural network decoder in the C-DLVM framework with a GP, see Fig. 1. Treating the decoder in a Bayesian manner leads to several advantages. In particular, in the small-data regime, a probabilistic decoder will be advantageous to leverage prior assumptions and avoid overfitting.

As we want to apply our model to both high-dimensional correlated outputs (e.g., images) and high-dimensional inputs (e.g., one-hot encodings of omniglot labels), we introduce two matrices A and P. They are used for probabilistic linear transformations of the inputs and the outputs, respectively. The likelihood of the GP-CDE model is then given by

p_θ(y_n | x_n, w_n, f(·), A, P) = N(y_n | P f([A x_n, w_n]), σ²I) ,

where [·, ·] denotes concatenation. We assume the GP f(·) consists of L independent GPs f_ℓ(·) for each output dimension ℓ.
The latent variables are a priori independent for each data point and have a standard-normal prior distribution. We discuss priors for A and P in Section 3.3.

3.2 Inference

In this section, we present our inference scheme, initially in the case without A and P to lighten the notation. We will return to these matrices in Section 3.3. We calculate an ELBO on the marginal likelihood, similarly to (1).

²It is common for "z" to denote the latent variables. However, as this letter collides with the notation of inducing inputs in GPs, we will use "w" for the latent variables throughout this paper.

Assuming a factorized posterior q(f(·), W) = q(f(·)) ∏_n q(w_n), where W = {w_n}_{n=1}^N, between the GP and the latent variables, we get the ELBO

L = ∑_n { E_{q(w_n)} E_{q(f(·))}[log p(y_n | f(·), x_n, w_n)] − KL[q(w_n) ‖ p(w_n)] } − KL[q(f(·)) ‖ p(f(·))] .   (2)

Since the ELBO is a sum over the data we can calculate unbiased estimates of the bound using mini-batches. We follow Hensman et al. [15] and choose independent sparse GPs over the output dimensions, q(f_ℓ(·)) = ∫ p(f_ℓ(·) | u_ℓ) q(u_ℓ) du_ℓ, where ℓ = 1, ..., L and u_ℓ ∈ R^M are inducing outputs corresponding to the inducing inputs z_m, m = 1, ..., M, so that u_{mℓ} = f_ℓ(z_m). We choose q(u_ℓ) = N(u_ℓ | m_ℓ, S_ℓ). Since p(f_ℓ(·) | u_ℓ) is conjugate to q(u_ℓ), the integral can be calculated in closed form. The result is a new sparse GP for each of the output dimensions, q(f_ℓ(·)) = GP(µ_ℓ(·), σ_ℓ(·, ·)), with closed-form mean and variance. See Appendix C for a detailed derivation. Using the results of Matthews et al.
[21], the KL-term over the multi-dimensional latent function f(·) simplifies to ∑_ℓ KL[q(u_ℓ) ‖ p(u_ℓ)], which is closed-form, since q(u_ℓ) and p(u_ℓ) are both Gaussian.

The inner expectation over the variational posterior q(f(·)) in (2) can be calculated in closed form (see Appendix C) as the likelihood is Gaussian. We define this analytically tractable quantity as

L_{w_n} = E_{q(f(·))}[log p(y_n | f(·), x_n, w_n)] .   (3)

Using this definition, and using the sparse variational posterior for q(f(·)) as described above, we write the bound in (2) as

L = ∑_n { E_{q(w_n)}[L_{w_n}] − KL[q(w_n) ‖ p(w_n)] } − ∑_ℓ KL[q(u_ℓ) ‖ p(u_ℓ)] .   (4)

We consider two options for q(w_n): (i) we can either make a further Gaussian assumption, and have a variational posterior of the form q(w_n) = N(w_n | µ_{w_n}, Σ_{w_n}), or (ii) we can find the analytically optimal value of the bound for a free-form q(w_n).

(i) Gaussian q(w_n)  First, a Gaussian q(w_n) implies that the KL over the latent variables is closed-form, as both the prior and posterior are Gaussian. Therefore, we are left with the calculation of the first term in (4), E_{q(w_n)}[L_{w_n}]. We follow the approach of C-DLVMs, explained in Section 2, and use Monte Carlo sampling to estimate the expectation. To enable differentiability, we use the re-parameterization trick and write w_n = µ_{w_n} + L_{w_n} ξ_n with p(ξ_n) = N(0, I), independent for each data point, and L_{w_n} L_{w_n}^⊤ = Σ_{w_n}. Note that this does not change the distribution of q(w_n), but now the expectation is over a parameterless distribution.
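The re-parameterized sampling just described can be sketched in a few lines of NumPy (a minimal sketch; the variational parameters below are illustrative placeholders, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterised_sample(mu_w, L_w):
    """Draw w = mu_w + L_w @ xi with xi ~ N(0, I).

    The sample is a deterministic function of the variational
    parameters (mu_w, L_w); all randomness lives in the parameterless
    noise xi, so gradients can pass through the sample.
    """
    xi = rng.standard_normal(mu_w.shape)
    return mu_w + L_w @ xi

# Illustrative per-data-point parameters: L_w is a Cholesky factor,
# so the implied covariance is Sigma_w = L_w @ L_w.T.
mu_w = np.array([0.5, -1.0])
L_w = np.array([[1.0, 0.0], [0.3, 0.8]])
w = reparameterised_sample(mu_w, L_w)
```

Averaging the log-likelihood over such samples gives the unbiased Monte Carlo estimate of E_{q(w_n)}[L_{w_n}] used in the bound.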
We can then take a differentiable unbiased estimate of the bound by sampling from ξ_n.

In practice, rather than represent the Gaussian parameters µ_{w_n} and L_{w_n} for each data point directly, we instead amortize these parameters into a set of global parameters φ, where φ parameterizes an auxiliary function h_φ (or 'recognition network') of the data: (µ_{w_n}, L_{w_n}) = h_φ(x_n, y_n). This is identical to the encoder component of C-DLVMs.

An alternative approach would be to use the kernel expectation results of Girard et al. [13] to evaluate E_{q(w_n)}[L_{w_n}]. Using these results we can evaluate the bound in closed form, rather than approximately using Monte Carlo. However, the computations involved in calculating the kernel expectations can be prohibitive, as they require evaluating an NM²D_y-sized tensor. Furthermore, closed-form solutions for the kernel expectations only exist for RBF and polynomial kernels, which makes this approach less favorable in practice.

(ii) Analytically optimal q(w_n)  So far, we assumed that the variational distribution q(w_n) is Gaussian. When q(·) is non-Gaussian, it is possible to integrate over w_n with quadrature, as we detail in the following. We first bound the conditional p(Y | X, W) and use the same sparse variational posterior for the GP as before, to obtain

log p(Y | X, W) ≥ ∑_n L_{w_n} − ∑_ℓ KL[q(u_ℓ) ‖ p(u_ℓ)] .

As shown in [15] and explained above, we can calculate L_{w_n} analytically. By expressing the marginal likelihood as log p(Y | X) = log ∫ p(Y | X, W) p(W) dW, we get

log p(Y | X) ≥ log ∫ exp( ∑_n L_{w_n} − ∑_ℓ KL[q(u_ℓ) ‖ p(u_ℓ)] ) p(W) dW
  = ∑_n log ∫ exp(L_{w_n}) p(w_n) dw_n − ∑_ℓ KL[q(u_ℓ) ‖ p(u_ℓ)] ,

where we exploited the monotonicity of the logarithm. We can compute this integral with quadrature when w_n is low-dimensional (the dimensionality of x_n does not matter here).
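For a one-dimensional latent variable with a standard-normal prior, the per-point integral ∫ exp(L_w) p(w) dw above can be evaluated with Gauss-Hermite quadrature. A minimal sketch (the linear test function and the closed-form check are purely illustrative):

```python
import numpy as np

def log_expected_exp(log_f, num_points=100):
    """Quadrature estimate of log ∫ exp(log_f(w)) N(w | 0, 1) dw.

    Gauss-Hermite nodes x_i and weights omega_i approximate integrals
    of the form ∫ exp(-x^2) g(x) dx; the substitution w = sqrt(2) x
    absorbs the standard-normal prior density.
    """
    x, omega = np.polynomial.hermite.hermgauss(num_points)
    log_terms = np.log(omega / np.sqrt(np.pi)) + log_f(np.sqrt(2.0) * x)
    m = log_terms.max()  # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_terms - m).sum())

# Sanity check against a closed form: for log_f(w) = a * w,
# ∫ exp(a w) N(w | 0, 1) dw = exp(a^2 / 2).
a = 0.7
val = log_expected_exp(lambda w: a * w)
```

In the model, log_f would be the analytically tractable L_{w_n} evaluated at the quadrature nodes.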
Assuming we have sufficient quadrature points, this gives the analytically optimal bound for q(w_n). The analytically optimal approach does not resort to the Gaussian approximation for q(w_n), so it is a tighter bound. See Appendix D for a proof that this bound is necessarily tighter than the bound in the Gaussian case.

3.2.1 Natural Gradient

Optimizing q(u_ℓ) together with q(W) can be challenging due to problems of local optima and the strong coupling between the inducing outputs u_ℓ and the latent variables W. One option is to analytically optimize the bound with respect to the variational parameters of u_ℓ, but this prohibits the use of mini-batches and reduces the applicability to large-scale problems. Recall that the variational parameters of u_ℓ are the mean and the covariance of the approximate posterior distribution q(u_ℓ) = N(m_ℓ, S_ℓ) over the inducing outputs. We can use the natural gradient [3] to update the variational parameters m_ℓ and S_ℓ.

This approach has the attractive property of recovering exactly the analytically optimal solution for q(u_ℓ) in the full-batch case if the natural gradient step size is taken to be 1. While natural gradients have been used before in GP models [15], they have not been used in combination with uncertain inputs. Due to the quadratic form of the log-likelihood as a function of the kernel inputs X, W and Z, we can calculate the expectation w.r.t. q(W), which will still be quadratic in the inducing outputs u_ℓ. Therefore, the expression is still conjugate, and the natural gradient step of size 1 recovers the analytic solution.

In practice, the natural gradient is used for the Gaussian variational parameters m_ℓ and S_ℓ, and ordinary gradients are used for the inducing inputs Z, the recognition network (if applicable) and other hyperparameters of the model (the kernel and likelihood parameters).
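The step-size-1 property can be illustrated on the simplest conjugate case, a Gaussian prior with a Gaussian observation per inducing point (a toy model chosen so the result can be checked against the exact posterior; it is not the full GP-CDE bound):

```python
import numpy as np

def natural_gradient_step(m, S, K, y, sigma2, rho):
    """One natural-gradient step of size rho on q(u) = N(m, S) for the
    conjugate toy model u ~ N(0, K), y | u ~ N(u, sigma2 * I).

    Works in the Gaussian natural parameters theta1 = S^{-1} m and
    theta2 = -0.5 * S^{-1}; for a conjugate likelihood the step
    interpolates towards the exact posterior's natural parameters,
    so a step of size 1 recovers the analytic optimum.
    """
    S_inv = np.linalg.inv(S)
    theta1, theta2 = S_inv @ m, -0.5 * S_inv
    # Natural parameters of the exact posterior.
    post_prec = np.linalg.inv(K) + np.eye(len(y)) / sigma2
    theta1_opt, theta2_opt = y / sigma2, -0.5 * post_prec
    theta1 = (1 - rho) * theta1 + rho * theta1_opt
    theta2 = (1 - rho) * theta2 + rho * theta2_opt
    S_new = np.linalg.inv(-2.0 * theta2)
    return S_new @ theta1, S_new

K = np.array([[1.0, 0.5], [0.5, 1.0]])
y = np.array([1.0, -1.0])
sigma2 = 0.1
# A single full-batch step of size 1 from any starting point lands on
# the exact posterior N(post_cov @ y / sigma2, post_cov).
m1, S1 = natural_gradient_step(np.zeros(2), np.eye(2), K, y, sigma2, rho=1.0)
```

With mini-batches, a smaller step size (e.g. the 0.05 used in the experiments) takes a damped step along the same direction.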
The variational parameters of q(A) and the parameters of P are also updated using the ordinary gradient.

3.3 Probabilistic Linear Transformations

Input  For high-dimensional inputs it may not be appropriate to define a GP directly in the augmented space [x_n, w_n] ∈ R^{D_x + D_w}. This might be the case if the input data is a one-hot encoding of many classes. We can extend our model with a linear projection to a lower-dimensional space before concatenating with the latent variables, [A x_n, w_n]. We denote this projection matrix by A, as shown in Fig. 1.

We use an isotropic Gaussian prior for the elements of A and a Gaussian variational posterior that factorizes between A and the other variables in the model: q(f(·), W, A) = q(f(·)) q(W) q(A). For Gaussian q(W) the bound is identical to that in (2), except that we include an additional −KL[q(A) ‖ p(A)] term and use the mean and variance of Ax as the input of the GP. A similar approach was used in the regression case by Titsias and Lázaro-Gredilla [27].

Output  We can move beyond the assumption of a priori independent outputs to a correlated model by using a linear transformation of the outputs of the GP. This model is equivalent to a 'multi-output' GP model with a linear model of covariance between tasks. In the multi-output GP framework [2], the D_y outputs are stacked to a single vector of length N D_y, and a single GP is used jointly with a structured covariance. In the simplest case, the covariance can be structured as R ⊗ K, where R can be any positive semi-definite matrix of size D_y × D_y, and K is an N × N matrix. By transforming the outputs with the matrix P we recover exactly this model with R = P^⊤P. Apart from the simplicity of implementation, another advantage is that we can handle degenerate cases (i.e., where the number of outputs is less than D_y) without having to deal with issues of ill-conditioning.
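The equivalence between mixing L independent GP outputs with P and the R ⊗ K covariance can be verified numerically (a small sketch; the shapes and the toy kernel are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, Dy = 4, 3, 5  # data points, independent GPs, observed output dimension

# Toy N x N kernel matrix (RBF on a 1-D grid) and L x Dy mixing matrix.
x = np.linspace(0.0, 1.0, N)[:, None]
K = np.exp(-0.5 * (x - x.T) ** 2 / 0.3 ** 2)
P = rng.standard_normal((L, Dy))

# Columns of F are i.i.d. GP(0, K) draws, so Cov(vec(F)) = I_L ⊗ K.
# For Y = F @ P, vec(Y) = (P^T ⊗ I_N) vec(F), hence
# Cov(vec(Y)) = (P^T ⊗ I_N)(I_L ⊗ K)(P ⊗ I_N) = (P^T P) ⊗ K.
cov_vec_Y = (np.kron(P.T, np.eye(N))
             @ np.kron(np.eye(L), K)
             @ np.kron(P, np.eye(N)))
assert np.allclose(cov_vec_Y, np.kron(P.T @ P, K))
```

The assertion checks that the between-output covariance is exactly R = P^⊤P, as stated above.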
It would be possible to use a Gaussian prior for P while retaining conjugacy, but in our experiments we use a non-probabilistic P and optimize it using MAP.

4 Related work

The GP-CDE model is closely related to both supervised and unsupervised Gaussian process based models. If we drop the latent variables W our approach recovers standard multiple-output Gaussian process regression with sparse variational inference. If we drop the known inputs X and use only the latent variables, we obtain a Bayesian GP-LVM [26]. Bayesian GP-LVMs are typically used for modeling complex distributions and non-linear mappings from a lower-dimensional variable into a high-dimensional space. By combining the GP-LVM framework with known inputs we create a model that outputs conditional samples in this high-dimensional space. Differently from Titsias and Lawrence [26], during inference we do not marginalize out the inducing variables u_ℓ but rather treat them as variational parameters of the model. This scales our model to arbitrarily large datasets through the use of Stochastic Variational Inference (SVI) [15, 16]. While the mini-batch extension to the Bayesian GP-LVM was suggested in [15], its notable absence from the literature may be due to the difficulty of the joint optimization of W and u_ℓ. We found that natural gradients were essential to alleviate this problem. A comparison demonstrating this is presented in the experiments.

Wang and Neal [29] proposed the Gaussian Process Latent Variable model (GP-LV), which is a special case of our model. The inference they employ is based on Metropolis sampling schemes and does not scale to large datasets or high dimensions. In this work, we extend their model using linear projection matrices on both input and output, and we present an alternative method of inference that scales to large datasets.
Damianou and Lawrence [11] also propose a special case of our model, though they use it for missing data imputation rather than to induce non-Gaussian densities. They also use sparse variational inference, but they analytically optimize q(u_ℓ) and so cannot use mini-batches. Depeweg et al. [12] propose a similar model, but use a Bayesian neural network instead of a GP.

The use of a recognition model, as in VAEs, was first proposed by Lawrence and Quiñonero-Candela [20] in the context of a GP-LVM, though it was motivated as a constraint on the latent variable locations rather than an amortization of the optimization cost. Recognition models were later used by Bui and Turner [7] and by Dai et al. [9] for deep GPs.

A GP model with latent variables and correlated multiple outputs was recently proposed in Dai et al. [10]. In this model, the latent variables determine the correlations between outputs via a Kronecker-structured covariance, whereas we have a fixed between-output covariance. That is, in our model the covariance of the stacked outputs is (PP^⊤) ⊗ (K_X K_W), whereas in Dai et al. [10] the covariance is K_W ⊗ K_X. These models are complementary and perform different functions. [6] proposed a model that is also similar to ours, but with categorical variables in the latent space. Other approaches to non-parametric density estimation include modeling the log density directly with a GP [1], and using an infinite generalization of the exponential family [25], which was recently extended to the conditional case [4].

5 Experiments

Large-scale spatio-temporal density estimation  We apply our model to a New York City taxi dataset to perform conditional spatial density estimation. The dataset holds records of more than 1.4 million taxi trips, which we filter to include trips that start and end within the Manhattan area.
Our objective is to predict spatial distributions of the drop-off location, based on the pick-up location, the day of the week, and the time of day. The two temporal features are encoded as sine and cosine with their natural periods, giving 6-dimensional inputs in total.³ Trippe and Turner [28] follow a similar setup to predict a distribution over the pick-up locations given the fare and the tip of the ride.

Table 1 compares the performance of 6 different models: unconditional and conditional Kernel Density Estimation (U-KDE, C-KDE), Mixture Density Networks (MDN-k, k = 1, 5, 10, 50) [5], our GP-CDE model, a simple GP model, and the unconditional GP-LVM [26]. We evaluate the models using the negative log predictive probability (NLPP) of the test set. The test set of 1000 points is constructed by sequentially adding the point with the greatest minimum distance to the points already selected; in this way we cover as much of the input space as possible. We vary the number of training points, using 1K, 5K and 1M randomly selected points, to establish the utility of the models in both sparse and dense data regimes.

³See https://github.com/hughsalimbeni/bayesian_benchmarks for the data.

Figure 2: Conditional densities (displayed as heat-maps: yellow means higher probability) of drop-off locations conditioned on the pick-up location (red cross).

Table 1: NLPP for Manhattan data (lower is better). The models are trained on different dataset sizes.

Unconditional KDE (U-KDE) ignores the input conditions. It directly models the drop-off locations using Gaussian kernel smoothing. The kernel width is selected with cross-validation. The Conditional KDE (C-KDE) model uses the 50 nearest neighbors in the training data, with kernel width taken from the unconditional model.
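The nearest-neighbour conditional KDE baseline just described can be sketched as follows (a minimal sketch; the isotropic Gaussian kernel and all parameter names are our illustrative choices):

```python
import numpy as np

def conditional_kde_logpdf(x_star, y_star, X, Y, k=50, h=0.1):
    """Log-density of output y_star under a conditional KDE at x_star.

    Takes the k training inputs nearest to the query condition x_star
    and places an isotropic Gaussian kernel of width h on each of the
    corresponding training outputs.
    """
    idx = np.argsort(np.linalg.norm(X - x_star, axis=1))[:k]
    neighbours = Y[idx]                                 # (k, Dy)
    Dy = Y.shape[1]
    sq_dist = ((neighbours - y_star) ** 2).sum(axis=1)
    log_kernels = -0.5 * sq_dist / h**2 - Dy * np.log(h * np.sqrt(2 * np.pi))
    m = log_kernels.max()                               # stable log-mean-exp
    return m + np.log(np.exp(log_kernels - m).mean())
```

For the taxi data, X would hold the 6-dimensional conditions and Y the 2-dimensional drop-off locations.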
The table shows that the conditional KDE model performs better than the unconditional KDE model for all dataset sizes. This suggests that conditioning on the pick-up location and time strongly affects the drop-off location. If the effect of conditioning were slight, the unconditional model should perform better, as it has access to all the data.

We also evaluate several MDN models with differing numbers of mixture components (MDN-k, where k is the number of components), using fully connected neural networks with 3 layers. The MDN models perform poorly except in the large data regime, where the model with the largest number of components is the best performing. An MDN with a large number of components can put mass at localized locations, which for this data is likely to be appropriate as the taxis are confined to streets.

We test three GP-based models: our GP-CDE model with 2-dimensional latent variables, and two special cases: one without conditioning (GP-LVM) and one without latent variables (GP). The GP-LVM [26] is our model without the conditioning, and does not perform well on this task as it has no access to the inputs and models all conditions identically. The GP model has no latent variables and independent Gaussian marginals, and so cannot model this data well, as the drop-off location is quite strongly non-Gaussian. We added predictive probabilities for all models in Appendix F to illustrate these findings.

The GP-CDE performs best on this dataset in the small data regimes. For the large data case the MDN model is superior. We attribute this to the high density of data when 1 million training points are used. We used a 2D latent space and Gaussian q(W) for the latent variables, with a recognition network amortizing the inference. We use the RBF kernel and Monte Carlo sampling to evaluate the bound, as described in Section 3.2.
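The sine/cosine encoding of the temporal inputs used for this experiment can be sketched as (a minimal sketch; the function and argument names are ours):

```python
import numpy as np

def encode_inputs(pickup_xy, hour_of_day, day_of_week):
    """Build the 6-dimensional input: the 2-D pick-up location plus
    sine/cosine encodings of time of day (period 24) and day of week
    (period 7), so that e.g. 23:59 and 00:01 map to nearby points.
    """
    return np.concatenate([
        np.asarray(pickup_xy, dtype=float),
        [np.sin(2 * np.pi * hour_of_day / 24),
         np.cos(2 * np.pi * hour_of_day / 24),
         np.sin(2 * np.pi * day_of_week / 7),
         np.cos(2 * np.pi * day_of_week / 7)],
    ])
```

Encoding each period on the unit circle avoids the artificial discontinuity a raw hour-of-day feature would introduce at midnight.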
For training we use the Adam optimizer with an exponentially decaying learning rate starting at 0.01 for the hyperparameters, the inducing inputs and the recognition network parameters. Natural gradient steps of size 0.05 are used for the GP's variational parameters. Fig. 2 shows the density of our GP-CDE model for two different conditions. Similar figures for the other methods are in Appendix F.

Table 1 (NLPP, lower is better):

         1K    5K    1M
GP-LVM   2.61  2.52  2.43
GP       2.68  2.67  2.67
GP-CDE   2.31  2.22  2.13
U-KDE    2.5   2.49  2.35
C-KDE    2.38  2.40  2.314
MDN-1    2.77  2.83  2.65
MDN-5    2.55  2.72  2.16
MDN-10   2.66  3.17  2.06
MDN-50   3.08  5.09  1.97

Few-shot learning  We demonstrate the GP-CDE model on the challenging task of few-shot conditional density estimation on the omniglot dataset. Our task is to obtain a density over the pixels given the class label. We use the training/test split from [19], using all the examples in the training classes and four samples from each of the test classes. The inputs are one-hot encoded (1623 classes) and the outputs are the pixel intensities, which we resize to 28 × 28. We apply a linear transformation on both the input and output (see Section 3.3). We use a 1623 × 30 matrix A with independent standard-normal priors on each component to project the labels onto a 30-dimensional space. To prevent the model from overfitting, it is important to treat the A transformation in a Bayesian way and marginalize it.

To correlate the outputs a priori we use a linear transformation P of the GP outputs, which is equivalent to considering multiple outputs jointly in a single GP with a Kronecker-structured covariance. See Section 3.3. We use 400 GP outputs, so P has shape 400 × 784.
To initialize P we use a model of local correlation: we apply the Matérn-5/2 kernel with unit lengthscale to the pixel coordinates and take the first 400 eigenvectors of the resulting covariance, scaled by the square-root eigenvalues. We then optimize the matrix P as a hyperparameter of the model. Learning the P matrix is a form of transfer learning: we update our prior in light of the training classes to make better inferences about the few-shot test classes.

We obtain a log-likelihood of 7.2 × 10⁻² nats/pixel, averaged over all the test images (659 classes with 16 images per class). We train for 25K iterations with the same training procedure as in the previous experiment. Samples from the posterior on a selection of test classes are shown in Fig. 3. For a larger selection, see Fig. 8 in Appendix F.

Figure 3: Sample images for 4-shot learning. The left column is a true (unseen) image; the remaining columns are samples from the posterior conditioned on the same label. See the supplementary material for further examples.

Heteroscedastic noise modeling  We use 10 UCI regression datasets to compare two variants of our CDE model with a standard sparse GP and a CVAE. Since we model a 1D target we consider w_n to be one-dimensional, allowing us to use the quadrature method (Section 3.2) to obtain the bound for an analytically optimal q(w_n). We also compare to an amortized Gaussian approximation to q(w_n), where we use a three-layer fully connected neural network with tanh activations for the recognition model. In all three models we use an RBF kernel and 100 inducing points, optimizing for 20K iterations using the Adam optimizer for the hyperparameters and a natural gradient optimizer with step size 0.1 for the Gaussian variational parameters. The quadrature model uses Gauss-Hermite quadrature with 100 points. For the CVAE we use, given the modest size of the UCI datasets, a relatively small encoder and decoder network architecture together with dropout.
See Appendix B for details.

Fig. 4 shows the test log-likelihoods using 20-fold cross validation with 10% test splits. We normalize the inputs to have zero mean and unit variance. We see that the quadrature CDE model outperforms the standard GP and CVAE on many of the datasets. The optimal GP-CDE model performs better than the GP-CDE with Gaussian q(w) on all datasets. This can be attributed to three reasons: we impose fewer restrictions on the variational posterior, there is no amortization gap (i.e., the recognition network might not find the optimal parameters [8]), and problems of local optima are likely to be less severe as there are fewer variational parameters to optimize.

Figure 4: Test log-likelihood of the GP, the optimal GP-CDE, the amortized GP-CDE, the CVAE, and a Linear model on 10 UCI datasets (Boston, N=506, D=13; Concrete, 1030, 8; Energy, 768, 8; Kin8nm, 8192, 8; Naval, 11934, 14; Power, 9568, 4; Protein, 45730, 9; Wine red, 1599, 11; Wine white, 4898, 11; Yacht, 308, 6). Higher is better.

Density estimation of image data  In this experiment we compare the test log-likelihood of the GP-CDE and the CVAE [24] on the MNIST dataset for the image generation task. We train the models with N = 2, 4, 8, ..., 512 images per class, to test their utility in different data regimes. The model's input is a one-hot encoding of the image label, which we concatenate with a 2-dimensional latent variable. We use all 10,000 test images to calculate the average test log-likelihood, which we estimate using Monte Carlo.

For the CVAE's encoder and decoder network architecture we follow Wu et al. [30] and regularize the network using dropout. Appendix B contains more details on the CVAE's setup.
The GP-CDE has the same setup as in the few-shot learning experiment, except that we set the shape of the output mixing matrix P to 50 × 784. We reduce the size of P, compared to the omniglot experiment, as the MNIST digits are easier to model. Since we are considering small datasets in this experiment, the role of the mixing matrix becomes more important: it enables the encoding of prior knowledge about the structure in images.

Wu et al. [30] point out that when evaluating test densities for generative models, the assumed noise variance σ² plays an important role, so for both models we compare two cases: one with the likelihood variance parameter fixed and one where it is optimized. Table 2 shows that in low-data regimes the highly parametrized CVAE severely overfits the data and underestimates the variance. The GP-CDE operates much more gracefully in these regimes: it estimates the variance correctly, even for N = 2 (where N is the number of training points per class), and the gap between train/test log-likelihood is considerably smaller.

Table 2: Log-likelihoods of the CVAE and GP-CDE models. N is the number of images per class. Higher test log-likelihood is better. See Appendix E for the complete table.

       CVAE: fixed σ²      CVAE: σ² optimized          GP-CDE: fixed σ²   GP-CDE: σ² optimized
N      Train     Test      Train    Test      σ² opt   Train   Test       Train   Test    σ² opt
2      180.97   -129.72    956.39  -1296.63   0.01378  242.2   161.9      130.4    74.01  0.0303
4      178.22    -60.03    956.26   -759.18   0.01364  254.2   195.2      160.3    86.59  0.0310
256     76.18     52.17    325.72    218.08   0.03272  545     606.2      105.4   108.1   0.0378
512     65.30     54.48    286.38    244.88   0.03407  512     606.7      120.7   124.2   0.0388

Necessity of natural gradients  Natural gradients are a vital component of our approach.
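The intuition can be seen in a tiny closed-form toy problem, which is a sketch and not our model: fitting a Gaussian q = N(m, s²) to a Gaussian target by descending the KL divergence. Preconditioning the gradient with the inverse Fisher information of q gives the natural gradient direction; at the same step size it converges far faster than the ordinary gradient. All quantities here (the target N(mu, sigma²), the initialization, the step size) are illustrative choices.

```python
def kl_grads(m, s, mu, sigma):
    """Gradients of KL(N(m, s^2) || N(mu, sigma^2)) with respect to (m, s)."""
    return (m - mu) / sigma ** 2, -1.0 / s + s / sigma ** 2

def fit(mu, sigma, natural, lr=0.1, iters=100):
    m, s = 0.0, 1.0                       # initial variational parameters
    for _ in range(iters):
        dm, ds = kl_grads(m, s, mu, sigma)
        if natural:
            # The Fisher information of N(m, s^2) in (m, s) is diag(1/s^2, 2/s^2),
            # so the natural gradient is the inverse Fisher times the gradient.
            dm, ds = s ** 2 * dm, 0.5 * s ** 2 * ds
        m, s = m - lr * dm, s - lr * ds
    return m, s

mu, sigma = 1.0, 2.0                      # toy target distribution N(1, 4)
m_nat, s_nat = fit(mu, sigma, natural=True)
m_ord, s_ord = fit(mu, sigma, natural=False)
err_nat = abs(m_nat - mu) + abs(s_nat - sigma)
err_ord = abs(m_ord - mu) + abs(s_ord - sigma)
```

After 100 steps the natural-gradient iterate has essentially reached the target, while the ordinary-gradient iterate is still visibly away from it, because the natural gradient rescales the update by the local curvature of the distribution space rather than of the parameterization.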
We demonstrate this with the simplest possible example: modeling a dataset of 100 '1' digits, using an unconditional model with no projection matrices, no mini-batches and no recognition model (i.e., exactly the GP-LVM of Titsias and Lawrence [26]). We compare our natural gradient approach with a step size of 0.1 against using the Adam optimizer (learning rate 0.001) directly for the variational parameters. We also compare to the analytic solution of Titsias and Lawrence [26], which is possible as we are not using mini-batches. We find that the analytic model and our natural gradient method both obtain test log-likelihoods (using all the '1's in the standard testing set) of 1.02, whereas the ordinary gradient approach attains a test log-likelihood of only -0.13. See Fig. 9 in Appendix F for samples from the latent space, and Fig. 10 for the training curves. We see that the ordinary gradient model cannot find a good solution, even after a large number of iterations, but the natural gradient model performs similarly to the analytic case.

6 Conclusion

We presented a model for conditional density estimation with Gaussian processes. Our approach extends prior work in three significant ways. We perform Bayesian linear transformations on both input and output spaces to allow for the modeling of high-dimensional inputs and strongly-coupled outputs. Our model is able to operate in low and high data regimes. Compared with other approaches, we have shown that our model does not over-concentrate its density, even with very few data points.

For inference, we derived an optimal posterior for the latent variable inputs and demonstrated the usefulness of natural gradients for mini-batched training of GPs with uncertain inputs. These improvements provide us with a more accurate variational approximation, and allow us to scale to larger datasets than were previously possible.
We applied the model in different settings across a wide range of dataset sizes and input/output domains, demonstrating its general utility.

References

[1] Ryan P. Adams, Iain Murray, and David J. C. MacKay. Nonparametric Bayesian Density Modeling with Gaussian Processes. arXiv:0912.4896, 2009.

[2] Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for Vector-Valued Functions: A Review. Foundations and Trends in Machine Learning, 2012.

[3] Shun-Ichi Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 1998.

[4] Michael Arbel and Arthur Gretton. Kernel Conditional Exponential Family. Artificial Intelligence and Statistics, 2018.

[5] Christopher M. Bishop. Mixture Density Networks. Technical report, Aston University, 1994.

[6] Erik Bodin, Neill D. Campbell, and Carl H. Ek. Latent Gaussian Process Regression. arXiv:1707.05534, 2017.

[7] Thang D. Bui and Richard E. Turner. Stochastic Variational Inference for Gaussian Process Latent Variable Models Using Back Constraints. Black Box Learning and Inference NIPS Workshop, 2015.

[8] Chris Cremer, Xuechen Li, and David Duvenaud. Inference Suboptimality in Variational Autoencoders. arXiv:1801.03558, 2018.

[9] Zhenwen Dai, Andreas Damianou, Javier González, and Neil Lawrence. Variational Auto-Encoded Deep Gaussian Processes. International Conference on Learning Representations, 2015.

[10] Zhenwen Dai, Mauricio A. Álvarez, and Neil Lawrence. Efficient Modeling of Latent Information in Supervised Learning Using Gaussian Processes. Advances in Neural Information Processing Systems, 2017.

[11] Andreas Damianou and Neil D. Lawrence. Semi-Described and Semi-Supervised Learning with Gaussian Processes.
Uncertainty in Artificial Intelligence, 2015.

[12] Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks. International Conference on Learning Representations, 2016.

[13] Agathe Girard, Carl E. Rasmussen, Joaquin Quiñonero-Candela, and Roderick Murray-Smith. Gaussian Process Priors with Uncertain Inputs: Application to Multiple-Step Ahead Time Series Forecasting. Advances in Neural Information Processing Systems, 2003.

[14] Xavier Glorot and Yoshua Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. Artificial Intelligence and Statistics, 2010.

[15] James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence, 2013.

[16] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. Journal of Machine Learning Research, 2013.

[17] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114, 2013.

[18] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-Supervised Learning with Deep Generative Models. Advances in Neural Information Processing Systems, 2014.

[19] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-Level Concept Learning through Probabilistic Program Induction. Science, 2015.

[20] Neil D. Lawrence and Joaquin Quiñonero-Candela. Local Distance Preservation in the GP-LVM through Back Constraints. International Conference on Machine Learning, 2006.

[21] Alexander Matthews, James Hensman, Richard Turner, and Zoubin Ghahramani. On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes. Artificial Intelligence and Statistics, 2016.

[22] Carl E. Rasmussen and Christopher K. I. Williams.
Gaussian Processes for Machine Learning. MIT Press, 2006.

[23] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. International Conference on Machine Learning, 2014.

[24] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning Structured Output Representation Using Deep Conditional Generative Models. Advances in Neural Information Processing Systems, 2015.

[25] Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density Estimation in Infinite Dimensional Exponential Families. Journal of Machine Learning Research, 18(57):1-59, 2017.

[26] Michalis Titsias and Neil D. Lawrence. Bayesian Gaussian Process Latent Variable Model. Artificial Intelligence and Statistics, 2010.

[27] Michalis Titsias and Miguel Lázaro-Gredilla. Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression. Advances in Neural Information Processing Systems, 2013.

[28] Brian L. Trippe and Richard E. Turner. Conditional Density Estimation with Bayesian Normalizing Flows. Bayesian Deep Learning Workshop, Advances in Neural Information Processing Systems, 2017.

[29] Chunyi Wang and Radford Neal. Gaussian Process Regression with Heteroscedastic or Non-Gaussian Residuals. arXiv:1212.6246, 2012.

[30] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the Quantitative Analysis of Decoder-Based Generative Models. arXiv:1611.04273, 2016.