{"title": "Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 279, "page_last": 287, "abstract": "We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering  datasets with high-dimensional inputs.", "full_text": "Variational Inference for Mahalanobis\n\nDistance Metrics in Gaussian Process Regression\n\nMichalis K. Titsias\n\nDepartment of Informatics\n\nAthens University of Economics and Business\n\nmtitsias@aueb.gr\n\nMiguel L\u00e1zaro-Gredilla\n\nDpt. Signal Processing & Communications\nUniversidad Carlos III de Madrid - Spain\n\nmiguel@tsc.uc3m.es\n\nAbstract\n\nWe introduce a novel variational method that allows to approximately integrate\nout kernel hyperparameters, such as length-scales, in Gaussian process regression.\nThis approach consists of a novel variant of the variational framework that has\nbeen recently developed for the Gaussian process latent variable model which additionally\nmakes use of a standardised representation of the Gaussian process. 
We\nconsider this technique for learning Mahalanobis distance metrics in a Gaussian\nprocess regression setting and provide experimental evaluations and comparisons\nwith existing methods by considering datasets with high-dimensional inputs.\n\n1 Introduction\n\nGaussian processes (GPs) have found many applications in machine learning and statistics ranging\nfrom supervised learning tasks to unsupervised learning and reinforcement learning. However, while\nGP models are advertised as Bayesian models, it is rarely the case that a full Bayesian procedure is\nconsidered for training. In particular, the commonly used procedure is to find point estimates of\nthe kernel hyperparameters by maximizing the marginal likelihood, which is the likelihood obtained\nonce the latent variables associated with the GP function have been integrated out (Rasmussen and\nWilliams, 2006). Such a procedure provides a practical algorithm that is expected to be robust to\noverfitting when the number of hyperparameters that need to be tuned is small relative\nto the amount of data. In contrast, when the number of hyperparameters is large this approach will\nsuffer from the shortcomings of a typical maximum likelihood method such as overfitting. To avoid\nthe above problems, in GP models, the use of kernel functions with few kernel hyperparameters\nis common practice, although this can lead to limited flexibility when modelling the data. For\ninstance, in regression or classification problems with high dimensional input data the typical kernel\nfunctions used are restricted to have the simplest possible form, such as a squared exponential with\ncommon length-scale across input dimensions, while more complex kernel functions such as ARD\nor Mahalanobis kernels (Vivarelli and Williams, 1998) are not considered due to the large number\nof hyperparameters that need to be estimated by maximum likelihood. 
On the other hand, while full\nBayesian inference for GP models could be useful, it is pragmatically a very challenging task that\ncurrently has been attempted only by using expensive MCMC techniques such as the recent method\nof Murray and Adams (2010). Deterministic approximations, and particularly the variational Bayes\nframework, have not been applied so far for the treatment of kernel hyperparameters in GP models.\nTo this end, in this work we introduce a variational method for approximate Bayesian inference over\nhyperparameters in GP regression models with squared exponential kernel functions. This approach\nconsists of a novel variant of the variational framework introduced in (Titsias and Lawrence, 2010)\nfor the Gaussian process latent variable model. Furthermore, this method uses the concept of a\nstandardised GP and allows for learning Mahalanobis distance metrics (Weinberger and\nSaul, 2009; Xing et al., 2003) in Gaussian process regression settings using Bayesian inference. In\nthe experiments, we compare the proposed algorithm with several existing methods by considering\nseveral datasets with high-dimensional inputs.\nThe remainder of this paper is organised as follows: Section 2 provides the motivation and theoretical\nfoundation of the variational method, and Section 3 demonstrates the method on a number of\nchallenging regression datasets and provides a comprehensive comparison with existing methods.\nFinally, the paper concludes with a discussion in Section 4.\n\n2 Theory\n\nSection 2.1 discusses Bayesian GP regression and motivates the variational method. Section 2.2\nexplains the concept of the standardised representation of a GP model that is used by the variational\nmethod described in Section 2.3. Section 2.4 discusses setting the prior over the kernel hyperparameters\ntogether with an analytical way to reduce the number of parameters to be\noptimised during variational inference. 
Finally, Section 2.5 discusses prediction at novel test inputs.\n\n2.1 Bayesian GP regression and motivation for the variational method\nSuppose we have data $\\{y_i, x_i\\}_{i=1}^n$, where each $x_i \\in \\mathbb{R}^D$ and each $y_i$ is a real-valued scalar output.\nWe denote by y the vector of all output data and by X all input data. In GP regression, we assume\nthat each observed output is generated according to $y_i = f(x_i) + \\epsilon_i$, $\\epsilon_i \\sim \\mathcal{N}(0, \\sigma^2)$, where the\nfull length latent function $f(x)$ is assigned a zero-mean GP prior with a certain covariance or kernel\nfunction $k_f(x, x')$ that depends on hyperparameters $\\theta$. Throughout the paper we will consider the\nfollowing squared exponential kernel function\n\n$k_f(x, x') = \\sigma_f^2 e^{-\\frac{1}{2} (x-x')^T W^T W (x-x')} = \\sigma_f^2 e^{-\\frac{1}{2} ||Wx-Wx'||^2} = \\sigma_f^2 e^{-\\frac{1}{2} d_W^2(x,x')}$,   (1)\n\nwhere $d_W(x, x') = ||Wx - Wx'||$. In the above, $\\sigma_f$ is a global scale parameter while the matrix\n$W \\in \\mathbb{R}^{K \\times D}$ quantifies a linear transformation that maps x into a linear subspace with dimension\nat most K. In the special case where W is a square and diagonal matrix, the above kernel function\nreduces to\n\n$k_f(x, x') = \\sigma_f^2 e^{-\\frac{1}{2} \\sum_{d=1}^D w_d^2 (x_d - x_d')^2}$,   (2)\n\nwhich is the well-known ARD squared exponential kernel commonly used in GP regression\napplications (Rasmussen and Williams, 2006). In other cases where $K < D$, $d_W(x, x')$ defines\na Mahalanobis distance metric (Weinberger and Saul, 2009; Xing et al., 2003) that allows for\nsupervised dimensionality reduction to be applied in a GP regression setting (Vivarelli and Williams,\n1998).\nIn a full Bayesian formulation, the hyperparameters $\\theta = (\\sigma_f, W)$ are assigned a prior distribution\n$p(\\theta)$ and the Bayesian model follows the hierarchical structure depicted in Figure 1(a). 
According\nto this structure the random function $f(x)$ and the hyperparameters $\\theta$ are a priori coupled since the\nformer quantity is generated conditional on the latter. This can make approximate, and in particular\nvariational, inference over the hyperparameters troublesome. To clarify this, observe that the\njoint density induced by the finite data is\n\n$p(y, f, \\theta) = \\mathcal{N}(y|f, \\sigma^2 I) \\mathcal{N}(f|0, K_{f,f}) p(\\theta)$,   (3)\n\nwhere the vector f stores the latent function values at inputs X and $K_{f,f}$ is the $n \\times n$ kernel matrix\nobtained by evaluating the kernel function on those inputs. Clearly, in the term $\\mathcal{N}(f|0, K_{f,f})$ the\nhyperparameters $\\theta$ appear non-linearly inside the inverse and determinant of the kernel matrix $K_{f,f}$.\nWhile there exists a recently developed variational inference method applied to the Gaussian process\nlatent variable model (GP-LVM) (Titsias and Lawrence, 2010) that approximately integrates out\ninputs that appear inside a kernel matrix, this method is still not applicable to the case of kernel\nhyperparameters such as length-scales. This is because the augmentation with auxiliary variables\nused in (Titsias and Lawrence, 2010), which allows one to bypass the intractable term $\\mathcal{N}(f|0, K_{f,f})$, leads\nto an inversion of a matrix $K_{u,u}$ that still depends on the kernel hyperparameters. More precisely,\nthe $K_{u,u}$ matrix is defined on auxiliary values u comprising points of the function $f(x)$ at some\narbitrary and freely optimisable inputs (Snelson and Ghahramani, 2006a; Titsias, 2009). While this\nkernel matrix does not depend on the inputs X any more (which need to be integrated out in the\nGP-LVM case), it still depends on $\\theta$, making a possible variational treatment of those parameters\nintractable. In Section 2.3, we present a novel modification of the approach in (Titsias and Lawrence,\n2010) which overcomes the above intractability. 
Such an approach makes use of the so-called\nstandardised representation of the GP model that is described next.\n\n2.2 The standardised representation\nConsider a function $s(z)$, where $z \\in \\mathbb{R}^K$, which is taken to be a random sample drawn from a GP\nindexed by elements in the low K-dimensional space and assumed to have a zero mean function and\nthe following squared exponential kernel function:\n\n$k_s(z, z') = e^{-\\frac{1}{2} ||z-z'||^2}$,   (4)\n\nwhere the kernel length-scales and global scale are equal to unity. The above GP is referred to as the\nstandardised process, whereas a sample path $s(z)$ is referred to as a standardised function. The\ninteresting property of a standardised process is that it does not depend on kernel hyperparameters\nsince it is defined in a space where all hyperparameters have been neutralised to take the value one.\nHaving sampled a function $s(z)$ in the low dimensional input space $\\mathbb{R}^K$, we can deterministically\nexpress a function $f(x)$ in the high dimensional input space $\\mathbb{R}^D$ according to\n\n$f(x) = \\sigma_f s(Wx)$,   (5)\n\nwhere the scalar $\\sigma_f$ and the matrix $W \\in \\mathbb{R}^{K \\times D}$ are exactly the hyperparameters defined in the\nprevious section. The above simply says that the value of $f(x)$ at a certain input x is the value of the\nstandardised function $s(z)$, for $z = Wx \\in \\mathbb{R}^K$, times a global scale $\\sigma_f$ that changes the amplitude\nor power of the new function. Given $(\\sigma_f, W)$, the above assumptions induce a GP prior on the\nfunction $f(x)$, which has zero mean and the following kernel function\n\n$k_f(x, x') = E[\\sigma_f s(Wx) \\sigma_f s(Wx')] = \\sigma_f^2 e^{-\\frac{1}{2} d_W^2(x,x')}$,   (6)\n\nwhich is precisely the kernel function given in eq. (1) and therefore, the above construction leads to\nthe same GP prior distribution described in Section 2.1. 
Nevertheless, the representation using the\nstandardised process also implies a reparametrisation of the GP regression model where a priori the\nhyperparameters $\\theta$ and the GP function are independent. 
More precisely, one can now represent the\nGP model according to the following structure:\n\n$s(z) \\sim \\mathcal{GP}(0, k_s(z, z')), \\quad \\theta \\sim p(\\theta)$\n$f(x) = \\sigma_f s(Wx)$\n$y_i \\sim \\mathcal{N}(y_i|f(x_i), \\sigma^2), \\quad i = 1, \\ldots, n$   (7)\n\nwhich is depicted graphically in Figure 1(b). The interesting property of this representation is that\nthe GP function $s(z)$ and the hyperparameters $\\theta$ interact only inside the likelihood function while\na priori they are independent. Furthermore, according to this representation one could now consider a\nmodification of the variational method in (Titsias and Lawrence, 2010) so that the auxiliary variables\nu are defined to be points of the function $s(z)$ so that the resulting kernel matrix $K_{u,u}$ which needs\nto be inverted does not depend on the hyperparameters but only on some freely optimisable inputs.\nNext we discuss the details of this variational method.\n\n2.3 Variational inference using auxiliary variables\nWe define a set of m auxiliary variables $u \\in \\mathbb{R}^m$ such that each $u_i$ is a value of the standardised\nfunction so that $u_i = s(z_i)$ and the input $z_i \\in \\mathbb{R}^K$ lives in dimension K. The set of all inputs\n$Z = (z_1, \\ldots, z_m)$ are referred to as inducing inputs and consist of freely-optimisable parameters\nthat can improve the accuracy of the approximation. The inducing variables u follow the Gaussian\ndensity\n\n$p(u) = \\mathcal{N}(u|0, K_{u,u})$,   (8)\n\nwhere $[K_{u,u}]_{ij} = k_s(z_i, z_j)$ and $k_s$ is the standardised kernel function given by eq. (4). Notice that\nthe density $p(u)$ does not depend on the kernel hyperparameters and particularly on the matrix W.\nThis is a rather critical point, that essentially allows the variational method to be applicable, and\nconstitutes the novelty of our method compared to the initial framework in (Titsias and Lawrence,\n2010). 
The vector f of noise-free latent function values, such that $[f]_i = \\sigma_f s(Wx_i)$, covaries with\nthe vector u based on the cross-covariance function\n\n$k_{f,u}(x, z) = E[\\sigma_f s(Wx) s(z)] = \\sigma_f E[s(Wx) s(z)] = \\sigma_f e^{-\\frac{1}{2} ||Wx-z||^2} = \\sigma_f k_s(Wx, z)$.   (9)\n\nFigure 1: The panel in (a) shows the usual hierarchical structure of a GP model where the middle node\ncorresponds to the full length function $f(x)$ (although only a finite vector f is associated with the data). The\npanel in (b) shows an equivalent representation of the GP model expressed through the standardised\nrandom function $s(z)$, that does not depend on hyperparameters, and interacts with the hyperparameters at the\ndata generation process. The rectangular node for $f(x)$ corresponds to a deterministic operation representing\n$f(x) = \\sigma_f s(Wx)$. The panel in (c) shows how the latent dimensionality of the Puma dataset is inferred to be\n4, roughly corresponding to input dimensions 4, 5, 15 and 16 (see Section 3.3).\n\nBased on this function, we can compute the cross-covariance matrix $K_{f,u}$ and subsequently express\nthe conditional Gaussian density (often referred to as conditional GP prior):\n\n$p(f|u, W) = \\mathcal{N}(f|K_{f,u} K_{u,u}^{-1} u, K_{f,f} - K_{f,u} K_{u,u}^{-1} K_{f,u}^T)$,\n\nso that $p(f|u, W)p(u)$ allows us to obtain the initial conditional GP prior $p(f|W)$, used in eq. (3), after\na marginalisation over the inducing variables, i.e. $p(f|W) = \\int p(f|u, W) p(u) du$. We would now like\nto apply variational inference in the augmented joint model$^1$\n\n$p(y, f, u, W) = \\mathcal{N}(y|f, \\sigma^2 I) p(f|u, W) p(u) p(W)$,\n\nin order to approximate the intractable posterior distribution $p(f, W, u|y)$. 
We introduce the variational\ndistribution\n\n$q(f, W, u) = p(f|u, W) q(W) q(u)$,   (10)\n\nwhere $p(f|u, W)$ is the conditional GP prior that appears in the joint model, $q(u)$ is a free-form\nvariational distribution that after optimisation is found to be Gaussian (see Section B.1 in the\nsupplementary material), while $q(W)$ is restricted to be the following factorised Gaussian:\n\n$q(W) = \\prod_{k=1}^K \\prod_{d=1}^D \\mathcal{N}(w_{kd}|\\mu_{kd}, \\sigma_{kd}^2)$.   (11)\n\nThe variational lower bound, whose maximisation minimises the Kullback-Leibler (KL) divergence between the\nvariational and the exact posterior distribution, can be written in the form\n\n$F = F_1 - \\mathrm{KL}(q(W)||p(W))$,   (12)\n\nwhere the analytical form of $F_1$ is given in Section B.1 of the supplementary material, whereas the\nKL divergence term $\\mathrm{KL}(q(W)||p(W))$ that depends on the prior distribution over W is described\nin the next section.\nThe variational lower bound is maximised using gradient-based methods over the variational\nparameters $\\{\\mu_{kd}, \\sigma_{kd}^2\\}_{k=1,d=1}^{K,D}$, the inducing inputs Z (which are also variational parameters) and the\nhyperparameters $(\\sigma_f, \\sigma^2)$.\n\n$^1$The scale parameter $\\sigma_f$ and the noise variance $\\sigma^2$ are not assigned prior distributions, but instead they are\ntreated by Type II ML. 
Notice that the treatment of $(\\sigma_f, \\sigma^2)$ in a Bayesian manner is easier and approximate\ninference could be done with the standard conjugate variational Bayesian framework (Bishop, 2006).\n\n[Figure 1(c) axes: relevance against latent dimension (sorted); input dimension against latent dimension (sorted).]\n\n2.4 Prior over p(W) and analytical reduction of the number of optimisable parameters\nTo set the prior distribution for the parameters W, we follow the automatic relevance determination\n(ARD) idea introduced in (MacKay, 1994; Neal, 1998) and subsequently considered in several\nmodels such as sparse linear models (Tipping, 2001) and variational Bayesian PCA (Bishop, 1999).\nSpecifically, the prior distribution takes the form\n\n$p(W) = \\prod_{k=1}^K \\prod_{d=1}^D \\mathcal{N}(w_{kd}|0, \\ell_k^2)$,   (13)\n\nwhere the elements of each row of W follow a zero-mean Gaussian distribution with a common\nvariance. Learning the set of variances $\\{\\ell_k^2\\}_{k=1}^K$ allows the dimensionality\nassociated with the Mahalanobis distance metric $d_W(x, x')$ to be selected automatically. This could be carried out by either\napplying a Type II ML estimation procedure or a variational Bayesian approach, where the latter\nassigns a conjugate inverse Gamma prior on the variances and optimises a variational distribution $q(\\{\\ell_k^2\\}_{k=1}^K)$\nover them. The optimisable quantities in both these procedures can be removed analytically and\noptimally from the variational lower bound as described next.\nConsider the case where we apply Type II ML for the variances $\\{\\ell_k^2\\}_{k=1}^K$. These parameters appear\nonly in the $\\mathrm{KL}(q(W)||p(W))$ term (denoted by KL in the following) of the lower bound in eq. 
(12)\nwhich can be written in the form:\n\n$\\mathrm{KL} = \\frac{1}{2} \\sum_{k=1}^K \\left[ \\frac{\\sum_{d=1}^D (\\sigma_{kd}^2 + \\mu_{kd}^2)}{\\ell_k^2} - D - \\sum_{d=1}^D \\log \\frac{\\sigma_{kd}^2}{\\ell_k^2} \\right]$.\n\nBy first minimizing this term with respect to these hyperparameters we find that\n\n$\\ell_k^2 = \\frac{\\sum_{d=1}^D (\\sigma_{kd}^2 + \\mu_{kd}^2)}{D}, \\quad k = 1, \\ldots, K$,   (14)\n\nand then by substituting back these optimal values into the KL divergence we obtain\n\n$\\mathrm{KL} = \\frac{1}{2} \\sum_{k=1}^K \\left[ D \\log \\left( \\sum_{d=1}^D (\\sigma_{kd}^2 + \\mu_{kd}^2) \\right) - \\sum_{d=1}^D \\log \\sigma_{kd}^2 - D \\log D \\right]$,   (15)\n\nwhich now depends only on variational parameters. When we treat $\\{\\ell_k^2\\}_{k=1}^K$ in a Bayesian manner,\nwe assign an inverse Gamma prior to each variance $\\ell_k^2$, $p(\\ell_k^2) = \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)} (\\ell_k^2)^{-\\alpha-1} e^{-\\beta/\\ell_k^2}$. Then, by following\na similar procedure as the one above we can remove optimally the variational factor $q(\\{\\ell_k^2\\}_{k=1}^K)$\n(see Section B.2 in the supplementary material) to obtain\n\n$\\mathrm{KL} = \\left( \\frac{D}{2} + \\alpha \\right) \\sum_{k=1}^K \\log \\left( 2\\beta + \\sum_{d=1}^D (\\mu_{kd}^2 + \\sigma_{kd}^2) \\right) - \\frac{1}{2} \\sum_{k=1}^K \\sum_{d=1}^D \\log(\\sigma_{kd}^2) + \\mathrm{const}$,   (16)\n\nwhich, as expected, has the nice property that when $\\alpha = \\beta = 0$, so that the prior over variances\nbecomes improper, it reduces to the quantity in (15) up to an additive constant.\nFinally, it is important to notice that different and particularly non-Gaussian priors for the parameters\nW can be also accommodated by our variational method. More precisely, any alternative prior for\nW changes only the form of the negative KL divergence term in the lower bound in eq. (12). 
This\nterm remains analytically tractable even for priors such as the Laplace or certain types of spike and\nslab priors. In the experiments we have used the ARD prior described above while the investigation\nof alternative priors is left for future work.\n\n2.5 Predictions\nAssume we have a test input $x_*$ and we would like to predict the corresponding output $y_*$. The exact\npredictive density $p(y_*|y)$ is intractable and therefore we approximate it with the density obtained\nby averaging over the variational posterior distribution:\n\n$q(y_*|y) = \\int \\mathcal{N}(y_*|f_*, \\sigma^2) p(f_*|f, u, W) p(f|u, W) q(u) q(W) df_* df du dW$,   (17)\n\nwhere $p(f|u, W) q(u) q(W)$ is the variational distribution and $p(f_*|f, u, W)$ is the conditional GP\nprior over the test value $f_*$ given the training function values f and the inducing variables u. By\nperforming first the integration over f, we obtain $\\int p(f_*|f, u, W) p(f|u, W) df = p(f_*|u, W)$,\nwhich follows as a consequence of the consistency property of the Gaussian process prior. Given\nthat $p(f_*|u, W)$ and $q(u)$ (see Section B.1 in the supplementary material) are Gaussian densities\nwith respect to $f_*$ and u, the above can be further simplified to\n\n$q(y_*|y) = \\int \\mathcal{N}(y_*|\\mu_*(W), \\sigma_*^2(W) + \\sigma^2) q(W) dW$,\n\nwhere the mean $\\mu_*(W)$ and variance $\\sigma_*^2(W)$ have closed-form expressions and are non-linear\nfunctions of W, making the above integral intractable. However, by applying Monte Carlo\nintegration based on drawing independent samples $W^{(t)}$, $t = 1, \\ldots, T$, from the Gaussian distribution $q(W)$ we can\nefficiently approximate the above according to\n\n$q(y_*|y) \\approx \\frac{1}{T} \\sum_{t=1}^T \\mathcal{N}(y_*|\\mu_*(W^{(t)}), \\sigma_*^2(W^{(t)}) + \\sigma^2)$,   (18)\n\nwhich is the quantity used in our experiments. 
Furthermore, although the predictive density is not\nGaussian, its mean and variance can be computed analytically as explained in Section B.1 of the\nsupplementary material.\n\n3 Experiments\nIn this section we will use standard data sets to assess the performance of the proposed VDMGP\nin terms of normalised mean square error (NMSE) and negative log-probability density (NLPD).\nWe will use as benchmarks a full GP with automatic relevance determination (ARD) and the\nstate-of-the-art SPGP-DR model, which is described below. Also, see Section A of the supplementary\nmaterial for an illustration of dimensionality reduction on a simple toy example.\n\n3.1 Review of SPGP-DR\nThe sparse pseudo-input GP (SPGP) from Snelson and Ghahramani (2006a) is a well-known sparse\nGP model that allows the computational cost of GP regression to scale linearly with the number of\nsamples in the dataset. This model is sometimes referred to as FITC (fully independent training\nconditional) and uses an active set of m pseudo-inputs that control the speed vs. performance\ntrade-off of the method. SPGP is often used when dealing with datasets containing more than a few\nthousand samples, since in those cases the cost of a full GP becomes impractical.\nIn Snelson and Ghahramani (2006b), a version of SPGP with dimensionality reduction (SPGP-DR)\nis presented. SPGP-DR applies the SPGP model to a linear projection of the inputs. The $K \\times D$\nprojection matrix W is learned so as to maximise the evidence of the model. This can be seen\nsimply as a specialisation of SPGP in which the covariance function is a squared exponential with a\nMahalanobis distance defined by $W^\\top W$. The idea had already been applied to the standard GP in\n(Vivarelli and Williams, 1998).\nDespite the apparent similarities between SPGP-DR and VDMGP, there are important differences\nworth clarifying. 
First, SPGP\u2019s pseudo-inputs are model parameters and, as such, fitting a large\nnumber of them can result in overfitting, whereas the inducing inputs used in VDMGP are variational\nparameters whose optimisation can only result in a better fit of the posterior densities. Second,\nSPGP-DR does not place a prior on the linear projection matrix W; it is instead fitted using Maximum\nLikelihood, just as the pseudo-inputs. In contrast, VDMGP does place a prior on W and\nvariationally integrates it out.\nThese differences yield an important consequence: VDMGP can infer automatically the latent\ndimensionality K of data, but SPGP-DR is unable to, since increasing K is never going to decrease\nits likelihood. Thus, VDMGP follows Occam\u2019s razor on the number of latent dimensions K.\n\nFigure 2: Average NMSE and NLPD for several real datasets, showing the effect of different training set sizes.\nPanels: (a) Temp, avg. NMSE \u00b1 1 std. dev. of avg.; (b) SO2, avg. NMSE \u00b1 1 std. dev. of avg.; (c) Puma, avg. NMSE \u00b1 1 std. dev. of avg.;\n(d) Temp, avg. NLPD \u00b1 1 std. dev. of avg.; (e) SO2, avg. NLPD \u00b1 1 std. dev. of avg.; (f) Puma, avg. NLPD \u00b1 1 std. dev. of avg.\n\n3.2 Temp and SO2 datasets\nWe will assess VDMGP on real-world datasets. For this purpose we will use the two data sets\nfrom the WCCI-2006 Predictive Uncertainty in Environmental Modeling Competition run by Gavin\nCawley$^2$, called Temp and SO2. In dataset Temp, maximum daily temperature measurements have\nto be predicted from 106 input variables representing large-scale circulation information. 
For the\nSO2 dataset, the task is to predict the concentration of SO2 in an urban environment twenty-four\nhours in advance, using information on current SO2 levels and meteorological conditions.$^3$ These\nare the same datasets on which SPGP-DR was originally tested (Snelson and Ghahramani, 2006b),\nand it is worth mentioning that SPGP-DR\u2019s only entry in the competition (for the Temp dataset) was\nthe winning one.\nWe ran SPGP-DR and VDMGP using the exact same initialisation for the projection matrix on\nboth algorithms and tested the effect of using a reduced number of training data. For SPGP-DR\nwe tested several possible latent dimensions $K \\in \\{2, 5, 10, 15, 20, 30\\}$, whereas for VDMGP we\nfixed $K = 20$ and let the model infer the number of dimensions. The number of inducing variables\n(pseudo-inputs for SPGP-DR) was set to 10 for Temp and 20 for SO2. Varying sizes for the training\nset between 100 and the total amount of available samples were considered. Twenty independent\nrealisations were performed.\nAverage NMSE as a function of training set size is shown in Figures 2(a) and 2(b). The multiple\ndotted blue lines correspond to SPGP-DR with different choices of latent dimensionality K. The\ndashed black line represents the full GP, which has been run for training sets up to size 2000.\nVDMGP is shown as a solid red line. Similarly, average NLPD is shown as a function of training set\nsize in Figures 2(d) and 2(e).\nWhen feasible, the full GP performs best, but since it requires the inversion of the full kernel matrix,\nit cannot be applied to large-scale problems such as the ones considered in this subsection. Also,\neven in reasonably-sized problems, the full GP may run into trouble if several noise-only input\ndimensions are present. SPGP-DR works well for large training set sizes, since there is enough\ninformation for it to avoid overfitting and the advantage of using a prior on W is reduced. 
However,\nfor smaller training sets, its performance is poor and the choice of K, which must be selected\nthrough cross-validation, becomes very relevant. Finally, VDMGP results in scalable performance: it is\nable to perform dimensionality reduction and achieve high accuracy both on small and large datasets,\nwhile still being faster than a full GP.\n\n[Figure 2 plots: the horizontal axis is training set size (100 up to 7117 for Temp, 15304 for SO2 and 7168 for Puma); curves are shown for Full GP, VDMGP and SPGP-DR.]\n\n$^2$Available at http://theoval.cmp.uea.ac.uk/~gcc/competition/\nTemp: 106 dimensions, 7117/3558 training/testing data; SO2: 27 dimensions, 15304/7652 training/testing data.\n$^3$For SO2, which contains only positive labels $y_n$, a logarithmic transformation of the type $\\log(a + y_n)$ was\napplied, just as the authors of (Snelson and Ghahramani, 2006b) did. However, reported NMSE and NLPD\nfigures still correspond to the original labels.\n\n3.3 Puma dataset\nIn this section we consider the 32-input, moderate noise version of the Puma dataset.$^4$ This is\na realistic simulation of the dynamics of a Puma 560 robot arm. Labels represent angular accelerations\nof one of the robot arm\u2019s links, which have to be predicted based on the angular positions, velocities\nand torques of the robot arm. 
7168 samples are available for training and 1024 for testing.\nIt is well-known from previous works (Snelson and Ghahramani, 2006a) that only 4 out of the 32\ninput dimensions are relevant for the prediction task, and that identifying them is not always easy.\nIn particular, SPGP (the standard version, with no dimensionality reduction) fails at this task unless\ninitialised from a \u201cgood guess\u201d about the relevant dimensions coming from a different model, as\ndiscussed in (Snelson and Ghahramani, 2006a). We thought it would be interesting to assess the\nperformance of the discussed models on this dataset, again considering different training set sizes,\nwhich are generated by randomly sampling from the training set.\nResults are shown in Figures 2(c) and 2(f). VDMGP determines that there are 4 latent dimensions,\nas shown in Figure 1(c). The conclusions to be drawn here are similar to those of the previous\nsubsection: SPGP-DR has trouble with \u201csmall\u201d datasets (where the threshold for a dataset being\nconsidered small enough may vary among different datasets) and requires a parameter to be validated,\nwhereas VDMGP seems to perform uniformly well.\n\n3.4 A note on computational complexity\nThe computational complexity of VDMGP is $O(NM^2K + NDK)$, where N is the number of training points\nand M the number of inducing inputs, just as that of SPGP-DR, which\nis much smaller than the $O(N^3 + N^2 D)$ required by a full GP. However, since the computation of the\nvariational bound of VDMGP involves more steps than the computation of the evidence of SPGP-DR,\nVDMGP is slower than SPGP-DR. 
In two typical cases using 500 and 5000 training points,\nthe full GP runs in 0.24 and 34 seconds respectively,\nVDMGP runs in 0.35 and 3.1 seconds, while SPGP-DR runs in 0.01 and 0.10 seconds.\n\n4 Discussion and further work\nA typical approach to regression when the number of input dimensions is large is to first use a\nlinear projection of input data to reduce dimensionality (e.g., PCA) and then apply some regression\ntechnique. Instead of approaching the problem in two steps, a monolithic approach allows the\ndimensionality reduction to be tailored to the specific regression problem.\nIn this work we have shown that it is possible to variationally integrate out the linear projection\nof the inputs of a GP, which, as a particular case, corresponds to integrating out its length-scale\nhyperparameters. By placing a prior on the linear projection, we avoid overfitting problems that may\narise in other models, such as SPGP-DR. Only two parameters (noise variance and scale) are free in\nthis model, whereas the remaining parameters appearing in the bound are free variational parameters,\nand optimizing them can only result in improved posterior estimates. This allows us to automatically\ninfer the number of latent dimensions that are needed for regression in a given problem, which is\nalso not possible using SPGP-DR. Finally, the range of data set sizes that the proposed model can\nhandle is much wider than that of SPGP-DR, which performs badly on small-size data.\nOne interesting topic for future work is to investigate non-Gaussian sparse priors for the parameters\nW. Furthermore, given that W represents length-scales it could be replaced by a random function\n$W(x)$, such as a GP random function, which would render the length-scales input-dependent, making\nsuch a formulation useful in situations with varying smoothness across input space. 
Such a\nsmoothness-varying GP is also an interesting subject of further work.\n\nAcknowledgments\nMKT gratefully acknowledges support from \u201cResearch Funding at AUEB for Excellence and\nExtroversion, Action 1: 2012-2014\u201d. MLG acknowledges support from Spanish CICYT TIN2011-24533.\n\n$^4$Available from Delve, see http://www.cs.toronto.edu/~delve/data/pumadyn/desc.html.\n\nReferences\nBishop, C. M. (1999). Variational principal components. In Proceedings Ninth International\nConference on Artificial Neural Networks, ICANN\u201999, pages 509\u2013514.\n\nBishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and\nStatistics). Springer, 1st edition.\n\nMacKay, D. J. (1994). Bayesian non-linear modelling for the energy prediction competition. ASHRAE\nTransactions, 4:448\u2013472.\n\nMurray, I. and Adams, R. P. (2010). Slice sampling covariance hyperparameters of latent Gaussian\nmodels. In Lafferty, J., Williams, C. K. I., Zemel, R., Shawe-Taylor, J., and Culotta, A., editors,\nAdvances in Neural Information Processing Systems 23, pages 1723\u20131731.\n\nNeal, R. M. (1998). Assessing relevance determination methods using delve. Neural Networks and\nMachine Learning, pages 97\u2013129.\n\nRasmussen, C. and Williams, C. (2006). Gaussian Processes for Machine Learning. Adaptive\nComputation and Machine Learning. MIT Press.\n\nSnelson, E. and Ghahramani, Z. (2006a). Sparse Gaussian processes using pseudo-inputs. In\nAdvances in Neural Information Processing Systems 18, pages 1259\u20131266. MIT Press.\n\nSnelson, E. and Ghahramani, Z. (2006b). Variable noise and dimensionality reduction for sparse\nGaussian processes. In Uncertainty in Artificial Intelligence.\n\nTipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of\nMachine Learning Research, 1:211\u2013244.\n\nTitsias, M. K. (2009). 
Variational learning of inducing variables in sparse Gaussian processes. In\nProc. of the 12th International Workshop on AI Stats.\n\nTitsias, M. K. and Lawrence, N. D. (2010). Bayesian Gaussian process latent variable model.\nJournal of Machine Learning Research - Proceedings Track, 9:844\u2013851.\n\nVivarelli, F. and Williams, C. K. I. (1998). Discovering hidden features with Gaussian processes\nregression. In Advances in Neural Information Processing Systems, pages 613\u2013619.\n\nWeinberger, K. Q. and Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor\nclassification. J. Mach. Learn. Res., 10:207\u2013244.\n\nXing, E., Ng, A., Jordan, M., and Russell, S. (2003). Distance metric learning, with application to\nclustering with side-information.", "award": [], "sourceid": 223, "authors": [{"given_name": "Michalis", "family_name": "Titsias RC AUEB", "institution": "Athens University of Economics and Business"}, {"given_name": "Miguel", "family_name": "Lazaro-Gredilla", "institution": "Universidad Carlos III de Madrid"}]}