{"title": "Scalable Inference for Gaussian Process Models with Black-Box Likelihoods", "book": "Advances in Neural Information Processing Systems", "page_first": 1414, "page_last": 1422, "abstract": "We propose a sparse method for scalable automated variational inference (AVI) in a large class of models with Gaussian process (GP) priors, multiple latent functions, multiple outputs and non-linear likelihoods. Our approach maintains the statistical efficiency property of the original AVI method, requiring only expectations over univariate Gaussian distributions to approximate the posterior with a mixture of Gaussians. Experiments on small datasets for various problems including regression, classification, Log Gaussian Cox processes, and warped GPs show that our method can perform as well as the full method under high levels of sparsity. On larger experiments using the MNIST and the SARCOS datasets we show that our method can provide superior performance to previously published scalable approaches that have been handcrafted to specific likelihood models.", "full_text": "Scalable Inference for Gaussian Process Models with\n\nBlack-Box Likelihoods\n\nAmir Dezfouli\n\nThe University of New South Wales\n\nakdezfuli@gmail.com\n\nEdwin V. Bonilla\n\nThe University of New South Wales\n\ne.bonilla@unsw.edu.au\n\nAbstract\n\nWe propose a sparse method for scalable automated variational inference (AVI) in\na large class of models with Gaussian process (GP) priors, multiple latent func-\ntions, multiple outputs and non-linear likelihoods. Our approach maintains the\nstatistical ef\ufb01ciency property of the original AVI method, requiring only expec-\ntations over univariate Gaussian distributions to approximate the posterior with a\nmixture of Gaussians. 
Experiments on small datasets for various problems including regression, classification, Log Gaussian Cox processes, and warped GPs show that our method can perform as well as the full method under high sparsity levels. On larger experiments using the MNIST and the SARCOS datasets we show that our method can provide superior performance to previously published scalable approaches that have been handcrafted to specific likelihood models.\n\n1 Introduction\n\nDeveloping automated yet practical approaches to Bayesian inference is a problem that has attracted considerable attention within the probabilistic machine learning community (see e.g. [1, 2, 3, 4]). In the case of models with Gaussian process (GP) priors, the main challenge is that of dealing with a large number of highly coupled latent variables. Although promising directions within the sampling community such as Elliptical Slice Sampling (ESS, [5]) have been proposed, they have been shown to be considerably slower than variational methods. In particular, [6] showed that their automated variational inference (AVI) method can provide posterior distributions that are practically indistinguishable from those obtained by ESS, while running orders of magnitude faster.\n\nOne of the fundamental properties of the method proposed in [6] is its statistical efficiency, which means that, in order to approximate a posterior distribution via maximization of the evidence lower bound (ELBO), it only requires expectations over univariate Gaussian distributions, regardless of the likelihood model. Remarkably, this property holds for a large class of models involving multiple latent functions and multiple outputs. 
However, this method is still impractical for large datasets, as it inherits the cubic computational cost of GP models in the number of observations (N). While there have been several approaches to large-scale inference in GP models [7, 8, 9, 10, 11], these have focused on regression and classification problems. The main obstacle to applying these approaches to inference with general likelihood models is that it is unclear how they can be extended to frameworks such as those in [6] while maintaining that desirable property of statistical efficiency.\n\nIn this paper we build upon the inducing-point approach underpinning most sparse approximations to GPs [12, 13] in order to scale up the automated inference method of [6]. In particular, for models with multiple latent functions, multiple outputs and non-linear likelihoods (such as in multi-class classification and Gaussian process regression networks [14]) we propose a sparse approximation whose computational complexity is O(M^3) in time, where M \\ll N is the number of inducing points. This approximation maintains the statistical efficiency property of the original AVI method. As the resulting ELBO decomposes over the training data points, our method can scale up to a very large number of observations and is amenable to stochastic optimization and parallel computation. Moreover, it can, in principle, approximate arbitrary posterior distributions, as it uses a mixture of Gaussians (MoG) as the family of approximate posteriors. We refer to our method as SAVIGP, which stands for scalable automated variational inference for Gaussian process models.\n\nOur experiments on small datasets for problems including regression, classification, Log Gaussian Cox processes, and warped GPs [15] show that SAVIGP can perform as well as the full method under high levels of sparsity. 
On a larger experiment on the MNIST dataset, our approach outperforms the distributed variational inference method of [9], which used a class-conditional density modeling approach. Our method, unlike [9], uses a single discriminative multi-class framework. Finally, we use SAVIGP to perform inference for the Gaussian process regression network model [14] on the SARCOS dataset, concerning an inverse robot dynamics problem [16]. We show that we can outperform previously published scalable approaches that used likelihood-specific inference algorithms.\n\n2 Related work\n\nThere has been long-standing interest in the GP community in overcoming the cubic scaling of inference in standard GP models [17, 18, 12, 13, 8]. However, none of these approaches dealt with the harder tasks of developing scalable inference methods for multi-output problems and general likelihood models. The former (the multiple-output problem) has been addressed, notably, by [19] and [20] using the convolution process formalism. Nevertheless, such approaches were specific to regression problems. The latter problem (general likelihood models) has been tackled from a sampling perspective [5] and within an optimization framework using variational inference [21]. In particular, the work of [21] proposes an efficient full Gaussian posterior approximation for GP models with iid observations. Our work pushes this breakthrough further by allowing multiple latent functions, multiple outputs, and, more importantly, scalability to large datasets.\n\nA related area of research is that of modeling complex data with deep belief networks based on Gaussian process mappings [22]. Unlike our approach, these models target the unsupervised problem of discovering structure in high-dimensional data, do not deal with black-box likelihoods, and focus on small-data applications. 
Finally, very recent developments in speeding up probabilistic kernel machines [9, 23, 24] show that the types of problems we are addressing here are highly relevant to the machine learning community. In particular, [23] has proposed efficient inference methods for large-scale GP classification and [9] has developed a distributed variational approach for GP models, with a focus on regression and classification problems. Our work, unlike these approaches, allows practitioners and researchers to investigate new models with GP priors and complex likelihoods for which currently there is no machinery that can scale to very large datasets.\n\n3 Gaussian process priors and multiple-output nonlinear likelihoods\n\nWe are given a dataset D = \\{x_n, y_n\\}_{n=1}^{N}, where x_n is a D-dimensional input vector and y_n is a P-dimensional output. Our goal is to learn the mapping from inputs to outputs, which can be established via Q underlying latent functions \\{f_j\\}_{j=1}^{Q}. A sensible modeling approach to the above problem is to assume that the Q latent functions \\{f_j\\} are uncorrelated a priori and that they are drawn from Q zero-mean Gaussian processes [25]:\n\np(f) = \\prod_{j=1}^{Q} p(f_{\\cdot j}) = \\prod_{j=1}^{Q} \\mathcal{N}(f_{\\cdot j}; 0, K_j), (1)\n\nwhere f is the set of all latent function values; f_{\\cdot j} = \\{f_j(x_n)\\}_{n=1}^{N} denotes the values of latent function j; and K_j is the covariance matrix induced by the covariance function \\kappa_j(\\cdot, \\cdot), evaluated at every pair of inputs. Along with the prior in Equation (1), we can also assume that our multi-dimensional observations \\{y_n\\} are iid given the corresponding set of latent functions \\{f_{n \\cdot}\\}:\n\np(y | f) = \\prod_{n=1}^{N} p(y_n | f_{n \\cdot}), (2)\n\nwhere y is the set of all output observations; y_n is the nth output observation; and f_{n \\cdot} = \\{f_j(x_n)\\}_{j=1}^{Q} is the set of latent function values on which y_n depends. In short, we are interested in models for which the following criteria are satisfied: (i) factorization of the prior over the latent functions; and (ii) factorization of the conditional likelihood over the observations given the latent functions. Interestingly, a large class of problems can be well modeled under these assumptions: binary classification [7, 26], warped GPs [15], log Gaussian Cox processes [27], multi-class classification [26], and multi-output regression [14] all belong to this family of models.\n\n3.1 Automated variational inference\n\nOne of the key inference challenges in the above models is that of computing the posterior distribution over the latent functions, p(f | y). Ideally, we would like an efficient method that does not need to know the details of the likelihood in order to carry out posterior inference. This is exactly the main result in [6], which approximates the posterior with a mixture of Gaussians within a variational inference framework. This entails the optimization of an evidence lower bound, which decomposes as a KL-divergence term and an expected log likelihood (ELL) term. As the KL-divergence term is relatively straightforward to deal with, we focus on their main result regarding the ELL term. [6], Th. 1: \u201cThe expected log likelihood and its gradients can be approximated using samples from univariate Gaussian distributions\u201d. More generally, we say that the ELL term and its gradients can be estimated using expectations over univariate Gaussian distributions. We refer to this result as that of statistical efficiency. One of the main limitations of this method is its poor scalability to large datasets, as it has cubic time complexity in the number of data points, i.e. O(N^3). 
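The statistical-efficiency property can be illustrated in isolation. The sketch below is our own illustration, not the authors' code: it estimates an expected log likelihood using only samples from a univariate Gaussian, and checks the estimate against a Gaussian likelihood whose expectation is available in closed form.

```python
# Sketch: estimating E_{f ~ N(m, v)}[log p(y | f)] with univariate
# Gaussian samples only, for an arbitrary black-box log-likelihood.
import numpy as np

def expected_log_lik(log_lik, y, m, v, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{f ~ N(m, v)}[log_lik(y, f)]."""
    rng = np.random.default_rng(seed)
    f = m + np.sqrt(v) * rng.standard_normal(n_samples)
    return float(np.mean(log_lik(y, f)))

# Check against a Gaussian likelihood with noise variance s2, where
# E[log N(y; f, s2)] = -0.5*log(2*pi*s2) - ((y - m)**2 + v) / (2*s2).
s2, y, m, v = 0.5, 1.0, 0.3, 0.2
mc = expected_log_lik(
    lambda y, f: -0.5 * np.log(2 * np.pi * s2) - (y - f) ** 2 / (2 * s2),
    y, m, v)
exact = -0.5 * np.log(2 * np.pi * s2) - ((y - m) ** 2 + v) / (2 * s2)
```

The same estimator applies verbatim to a classification or Poisson log likelihood; only `log_lik` changes, which is the point of the black-box framework.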
In the next section we describe our inference method, which scales to large datasets while maintaining the statistical efficiency property of the original model.\n\n4 Scalable inference\n\nIn order to make inference scalable we redefine our prior to be sparse by conditioning the latent processes on a set of inducing variables \\{u_{\\cdot j}\\}_{j=1}^{Q}, which lie in the same space as \\{f_{\\cdot j}\\} and are drawn from the same zero-mean GP priors. As before, we assume factorization of the prior across the Q latent functions. Hence the resulting sparse prior is given by:\n\np(u) = \\prod_{j=1}^{Q} \\mathcal{N}(u_{\\cdot j}; 0, \\kappa(Z_j, Z_j)), \\quad p(f | u) = \\prod_{j=1}^{Q} \\mathcal{N}(f_{\\cdot j}; \\tilde{\\mu}_j, \\tilde{K}_j), (3)\n\n\\tilde{\\mu}_j = \\kappa(X, Z_j) \\kappa(Z_j, Z_j)^{-1} u_{\\cdot j}, (4)\n\n\\tilde{K}_j = \\kappa_j(X, X) - A_j \\kappa(Z_j, X), \\quad \\text{with} \\quad A_j = \\kappa(X, Z_j) \\kappa(Z_j, Z_j)^{-1}, (5)\n\nwhere u_{\\cdot j} are the inducing variables for latent process j; u is the set of all the inducing variables; Z_j are all the inducing inputs (i.e. locations) for latent process j; X is the matrix of all input locations \\{x_i\\}; and \\kappa(U, V) is the covariance matrix induced by evaluating the covariance function \\kappa_j(\\cdot, \\cdot) at all pairs of rows of the matrices U and V. We note that while each of the inducing variables in u_{\\cdot j} lies in the same space as the elements in f_{\\cdot j}, each of the M inducing inputs in Z_j lies in the same space as each input data point x_n. Given the latent function values f_{n \\cdot}, the conditional likelihood factorizes across data points and is given by Equation (2).\n\n4.1 Approximate posterior\n\nWe will approximate the posterior using variational inference. Motivated by the fact that the true joint posterior is given by p(f, u | y) = p(f | u, y) p(u | y), our approximate posterior has the form:\n\nq(f, u | y) = p(f | u) q(u), (6)\n\nwhere p(f | u) is the conditional prior given in Equation (3) and q(u) is our approximate (variational) posterior. 
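The sparse-prior quantities in Equations (4) and (5) reduce to standard kernel algebra. A minimal sketch under our own assumptions (an RBF covariance and a small jitter on the inducing-point Gram matrix for numerical stability; neither is prescribed by the paper):

```python
# Sketch of Eqs. (4)-(5) under our own assumptions: an RBF covariance
# and a small jitter on kappa(Z, Z) for numerical stability (the paper
# does not prescribe either).
import numpy as np

def kern(U, V, lengthscale=1.0):
    """RBF covariance matrix between the rows of U and V."""
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def conditional_prior(X, Z, jitter=1e-6):
    """A_j = K_xz K_zz^{-1} and K~_j = K_xx - A_j K_zx; the conditional
    mean of f given u is then mu~_j = A_j u (Eq. 4)."""
    Kzz = kern(Z, Z) + jitter * np.eye(len(Z))
    Kxz = kern(X, Z)
    A = np.linalg.solve(Kzz, Kxz.T).T      # K_xz K_zz^{-1}
    Ktilde = kern(X, X) - A @ Kxz.T        # residual covariance
    return A, Ktilde

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
Z = X[:10]                                  # M = 10 inducing inputs
A, Ktilde = conditional_prior(X, Z)
```

At the inducing inputs themselves the residual variance diag(K~) collapses to (numerically) zero, which is the usual sanity check for this construction.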
This decomposition has proved effective in problems with a single latent process and a single output (see e.g. [13]). Our variational distribution is a mixture of Gaussians (MoG):\n\nq(u | \\lambda) = \\sum_{k=1}^{K} \\pi_k q_k(u | m_k, S_k) = \\sum_{k=1}^{K} \\pi_k \\prod_{j=1}^{Q} \\mathcal{N}(u_{\\cdot j}; m_{kj}, S_{kj}), (7)\n\nwhere \\lambda = \\{\\pi_k, m_{kj}, S_{kj}\\} are the variational parameters: the mixture proportions \\{\\pi_k\\}, and the posterior means \\{m_{kj}\\} and posterior covariances \\{S_{kj}\\} of the inducing variables corresponding to mixture component k and latent function j. We also note that each of the mixture components q_k(u | m_k, S_k) is a Gaussian with mean m_k and block-diagonal covariance S_k.\n\n5 Posterior approximation via optimization of the evidence lower bound\n\nFollowing variational inference principles, the log marginal likelihood log p(y) (or evidence) is lower bounded by the variational objective:\n\nlog p(y) \\geq L_{elbo} = \\underbrace{\\int q(u | \\lambda) p(f | u) \\log p(y | f) \\, df \\, du}_{L_{ell}} \\; \\underbrace{- \\, KL(q(u | \\lambda) \\| p(u))}_{L_{kl}}, (8)\n\nwhere the evidence lower bound (L_{elbo}) decomposes as the sum of an expected log likelihood term (L_{ell}) and a KL-divergence term (L_{kl}). Our goal is to estimate our posterior distribution q(u | \\lambda) via maximization of L_{elbo}. We consider first the L_{ell} term, as it is the most difficult to deal with, since we do not know the details of the implementation of the conditional likelihood p(y | f).\n\n5.1 Expected log likelihood term\n\nHere we need to compute the expectation of the log conditional likelihood log p(y | f) over the joint approximate posterior given in Equation (6). Our goal is to obtain expressions for the L_{ell} term and its gradients wrt the variational parameters while maintaining the statistical efficiency property of needing only expectations over univariate Gaussians. 
For this we first introduce an intermediate distribution q(f | \\lambda) that is obtained by integrating out u from the joint approximate posterior:\n\nL_{ell}(\\lambda) = \\int_f \\int_u q(u | \\lambda) p(f | u) \\log p(y | f) \\, df \\, du = \\int_f \\log p(y | f) \\underbrace{\\int_u p(f | u) q(u | \\lambda) \\, du}_{q(f | \\lambda)} \\, df. (9)\n\nGiven our approximate posterior in Equation (7), q(f | \\lambda) can be obtained analytically:\n\nq(f | \\lambda) = \\sum_{k=1}^{K} \\pi_k q_k(f | \\lambda_k) = \\sum_{k=1}^{K} \\pi_k \\prod_{j=1}^{Q} \\mathcal{N}(f_{\\cdot j}; b_{kj}, \\Sigma_{kj}), \\quad \\text{with} (10)\n\nb_{kj} = A_j m_{kj}, \\quad \\Sigma_{kj} = \\tilde{K}_j + A_j S_{kj} A_j^T, (11)\n\nwhere \\tilde{K}_j and A_j are given in Equation (5). Now we can rewrite Equation (9) as:\n\nL_{ell}(\\lambda) = \\sum_{k=1}^{K} \\pi_k E_{q_k(f | \\lambda_k)}[\\log p(y | f)] = \\sum_{n=1}^{N} \\sum_{k=1}^{K} \\pi_k E_{q_{k(n)}(f_{n \\cdot})}[\\log p(y_{n \\cdot} | f_{n \\cdot})], (12)\n\nwhere E_{q(x)}[g(x)] denotes the expectation of function g(x) over the distribution q(x). Here we have used the mixture decomposition of q(f | \\lambda) in Equation (10) and the factorization of the likelihood over the data points in Equation (2). Now we are ready to state our main result formally.\n\nTheorem 1 For the sparse GP model with prior defined in Equations (3) to (5), and likelihood defined in Equation (2), the expected log likelihood over the variational distribution in Equation (7) and its gradients can be estimated using expectations over univariate Gaussian distributions.\n\nGiven the result in Equation (12), the proof is trivial for the computation of L_{ell}, as we only need to realize that q_k(f | \\lambda_k) = \\mathcal{N}(f; b_k, \\Sigma_k) given in Equation (10) has a block-diagonal covariance structure. Consequently, q_{k(n)}(f_{n \\cdot}) is a Q-dimensional Gaussian with diagonal covariance. 
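Equation (11) reduces to simple matrix products. The sketch below (ours, not the paper's implementation) computes, for one mixture component k and one latent function j, the per-point means and variances that the expected log likelihood consumes; only the diagonal of Sigma_kj is needed.

```python
# Sketch of Eq. (11) for one mixture component k and one latent
# function j (our illustration): only per-point means and variances,
# i.e. the diagonal of Sigma_kj, are needed downstream.
import numpy as np

def component_marginals(A, Ktilde, m, S):
    """Means b = A m and variances diag(K~ + A S A^T) of q_k(f_{.j})."""
    b = A @ m
    var = np.diag(Ktilde) + np.einsum('nm,mp,np->n', A, S, A)
    return b, var
```

Per data point n, collecting the jth entries across the Q latent functions yields the diagonal Q-dimensional Gaussian used in the expected log likelihood.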
For the gradients of L_{ell} wrt the variational parameters, we use the following identity:\n\n\\nabla_{\\lambda_k} E_{q_{k(n)}(f_{n \\cdot})}[\\log p(y_n | f_{n \\cdot})] = E_{q_{k(n)}(f_{n \\cdot})}[\\nabla_{\\lambda_k} \\log q_{k(n)}(f_{n \\cdot}) \\log p(y_n | f_{n \\cdot})], (13)\n\nfor \\lambda_k \\in \\{m_k, S_k\\}, and the result for \\{\\pi_k\\} is straightforward. This completes the proof.\n\nExplicit computation of L_{ell}\n\nWe now provide explicit expressions for the computation of L_{ell}. We know that q_{k(n)}(f_{n \\cdot}) is a Q-dimensional Gaussian with:\n\nq_{k(n)}(f_{n \\cdot}) = \\mathcal{N}(f_{n \\cdot}; b_{k(n)}, \\Sigma_{k(n)}), (14)\n\nwhere \\Sigma_{k(n)} is a diagonal matrix. The jth element of the mean and the (j, j)th entry of the covariance are given by:\n\n[b_{k(n)}]_j = [A_j]_{n,:} m_{kj}, \\quad [\\Sigma_{k(n)}]_{j,j} = [\\tilde{K}_j]_{n,n} + [A_j]_{n,:} S_{kj} [A_j^T]_{:,n}, (15)\n\nwhere [A]_{n,:} and [A]_{:,n} denote the nth row and nth column of matrix A, respectively. Hence we can compute L_{ell} as follows:\n\n\\{f_{n \\cdot}^{(k,i)}\\}_{i=1}^{S} \\sim \\mathcal{N}(f_{n \\cdot}; b_{k(n)}, \\Sigma_{k(n)}), \\quad k = 1, \\ldots, K, (16)\n\n\\hat{L}_{ell} = \\sum_{n=1}^{N} \\sum_{k=1}^{K} \\pi_k \\frac{1}{S} \\sum_{i=1}^{S} \\log p(y_{n \\cdot} | f_{n \\cdot}^{(k,i)}). (17)\n\nThe gradients of L_{ell} wrt the variational parameters are given in the supplementary material.\n\n5.2 KL-divergence term\n\nWe now turn our attention to the KL-divergence term, which can be decomposed as follows:\n\n-KL(q(u | \\lambda) \\| p(u)) = \\underbrace{E_q[-\\log q(u | \\lambda)]}_{L_{ent}} + \\underbrace{E_q[\\log p(u)]}_{L_{cross}}, (18)\n\nwhere the entropy term (L_{ent}) can be lower bounded using Jensen\u2019s inequality:\n\nL_{ent} \\geq - \\sum_{k=1}^{K} \\pi_k \\log \\sum_{\\ell=1}^{K} \\pi_\\ell \\mathcal{N}(m_k; m_\\ell, S_k + S_\\ell) \\overset{def}{=} \\hat{L}_{ent}. (19)\n\nThe negative cross-entropy term (L_{cross}) can be computed exactly:\n\nL_{cross} = -\\frac{1}{2} \\sum_{k=1}^{K} \\pi_k \\sum_{j=1}^{Q} [M \\log 2\\pi + \\log |\\kappa(Z_j, Z_j)| + m_{kj}^T \\kappa(Z_j, Z_j)^{-1} m_{kj} + \\text{tr}(\\kappa(Z_j, Z_j)^{-1} S_{kj})]. (20)\n\nThe gradients of the above terms wrt the variational parameters are given in the supplementary material.\n\n5.3 Hyperparameter learning and scalability to large datasets\n\nFor simplicity of notation we have omitted the parameters of the covariance functions and the likelihood parameters from the ELBO. However, in our experiments we optimize these along with the variational parameters in a variational-EM alternating optimization framework. The gradients of the ELBO wrt these parameters are given in the supplementary material.\n\nThe original framework of [6] is completely infeasible for large datasets, as its complexity is dominated by the inversion of the Gram matrix on all the training data, which is an O(N^3) operation, where N is the number of training points. 
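The estimator in Equations (16) and (17) can be sketched directly. The code below is our illustration (the array shapes and the `log_lik` interface are our own choices, not the paper's code): it draws S samples per mixture component from the diagonal per-point marginals and averages the black-box log likelihood.

```python
# Sketch of the estimator in Eqs. (16)-(17); shapes and the log_lik
# interface are our own choices, not the paper's code.
import numpy as np

def ell_estimate(log_lik, y, pi, b, var, S=1000, seed=0):
    """Estimate sum_n sum_k pi_k E_{q_k(n)}[log p(y_n | f_n)].

    b, var: shape (K, N, Q) component means/variances per data point;
    log_lik(y_n, f) maps an (S, Q) sample block to S log-likelihoods.
    """
    rng = np.random.default_rng(seed)
    K, N, Q = b.shape
    total = 0.0
    for k in range(K):
        for n in range(N):
            f = b[k, n] + np.sqrt(var[k, n]) * rng.standard_normal((S, Q))
            total += pi[k] * np.mean(log_lik(y[n], f))
    return total
```

Because every sample is drawn coordinate-wise from a univariate Gaussian, the same code serves any likelihood that can be evaluated pointwise.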
Our sparse framework makes automated variational inference practical for large datasets, as its complexity is dominated by inversions of the kernel matrix on the inducing points, which is an O(M^3) operation, where M is the number of inducing points per latent process. Furthermore, as L_{ell} and its gradients decompose over the training points, and the L_{kl} term decomposes over the latent processes, our method is amenable to stochastic optimization and/or parallel computation, which makes it scalable to a very large number of input observations, output dimensions and latent processes. In our experiments in Section 6 we show that our sparse framework can achieve similar performance to the full method [6] on small datasets under high levels of sparsity. Moreover, we carried out experiments on larger datasets for which it is practically impossible to apply the full (i.e. non-sparse) method.\n\nFigure 1: The SSE and NLPD for warped GPs on the Abalone dataset, where lower values on both measures are better. Three approximate posteriors are used: FG (full Gaussian), MoG1 (diagonal Gaussian), and MoG2 (mixture of two diagonal Gaussians), along with various sparsity factors (SF = M/N). The smaller the SF the sparser the model, with SF=1 corresponding to no sparsity.\n\n6 Experiments\n\nOur experiments first consider the same six benchmarks with various likelihood models analyzed by [6]. The number of training points (N) on these benchmarks ranges from 300 to 1233 and their input dimensionality (D) ranges from 1 to 256. The goal of this first set of experiments is to show that SAVIGP can attain performance as good as the full method under high sparsity levels. We also carried out experiments at a larger scale using the MNIST dataset and the SARCOS dataset [16]. 
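Because L_ell is a sum over data points, an unbiased estimate can be computed from a random minibatch, which is what makes stochastic optimization applicable. A toy sketch of this decomposition (our illustration; the paper does not specify a particular minibatch scheme):

```python
# Toy sketch of the decomposition over data points (our illustration):
# a random minibatch B gives the unbiased estimate
# (N / |B|) * sum_{n in B} ell_n of the full sum over n.
import numpy as np

def minibatch_ell(per_point_ell, N, batch, rng):
    idx = rng.choice(N, size=batch, replace=False)
    return (N / batch) * per_point_ell[idx].sum()

rng = np.random.default_rng(0)
per_point = rng.standard_normal(1000)   # stand-in per-point ELL values
full = per_point.sum()
# Averaging many minibatch estimates recovers the full-data sum.
est = np.mean([minibatch_ell(per_point, 1000, 200, rng) for _ in range(500)])
```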
The application of the original automated variational inference framework to these datasets is infeasible. We refer the reader to the supplementary material for the details of our experimental set-up.\n\nWe used two performance measures in each experiment: the standardized squared error (SSE) and the negative log predictive density (NLPD) for continuous-output problems, and the error rate and the negative log probability (NLP) for discrete-output problems. We use three versions of SAVIGP: FG, MoG1, and MoG2, corresponding to a full Gaussian, a diagonal Gaussian, and a mixture of diagonal Gaussians with 2 components, respectively. We refer to the ratio of the number of inducing points over the number of training points (M/N) as the sparsity factor.\n\n6.1 Small-scale experiments\n\nIn this section we describe the results on three (out of six) benchmarks used by [6] and analyze the performance of SAVIGP. The other three benchmarks are described in the supplementary material.\n\nWarped Gaussian processes (WGP), Abalone dataset [28], p(y_n | f_n) = \\nabla_{y_n} t(y_n) \\mathcal{N}(t(y_n) | f_n, \\sigma^2). For this task we used the same neural-net transformation as in [15], and the results for the Abalone dataset are shown in Figure 1. We see that the performance of SAVIGP is practically indistinguishable across all sparsity factors for SSE and NLPD. Here we note that [6] showed that automated variational inference performed competitively when compared to hand-crafted methods for warped GPs [15].\n\nLog Gaussian Cox process (LGCP), Coal-mining disasters dataset [29], p(y_n | f_n) = \\lambda_n^{y_n} \\exp(-\\lambda_n) / y_n!. Here we used the LGCP to model the number of coal-mining disasters between 1851 and 1962. We note that [6] reported that automated variational inference (the focus of this paper) produced practically indistinguishable distributions (but ran orders of magnitude faster) when compared to sampling methods such as Elliptical Slice Sampling [5]. The results for our sparse models are shown in Figure 2, where we see that both models (FG and MoG1) remain mostly unaffected when using high levels of sparsity. We also confirm the findings in [6] that the MoG1 model underestimates the variance of the predictions.\n\nBinary classification, Wisconsin breast cancer dataset [28], p(y_n = 1) = 1/(1 + \\exp(-f_n)). Classification error rates and the negative log probability (NLP) on the Wisconsin breast cancer dataset are shown in Figure 3. We see that the error rates are comparable across all models and sparsity factors. Interestingly, sparser models achieved lower NLP values, suggesting overconfident predictions by the less sparse models, especially for the mixtures of diagonal Gaussians.\n\nFigure 2: Left: the coal-mining disasters data. Right: the posteriors for a Log Gaussian Cox process on these data when using a full Gaussian (FG) and a diagonal Gaussian (MoG1), for various sparsity factors (SF = M/N). The smaller the SF the sparser the model, with SF=1 corresponding to no sparsity. The solid line is the posterior mean and the shaded area shows the 90% confidence interval.\n\nFigure 3: Error rates and NLP for binary classification on the Wisconsin breast cancer dataset. Three approximate posteriors are used: FG (full Gaussian), MoG1 (diagonal Gaussian), and MoG2 (mixture of two diagonal Gaussians), along with various sparsity factors (SF = M/N). The smaller the SF the sparser the model, with SF=1 corresponding to the original model without sparsity. Error bars on the left plot indicate the 95% confidence interval around the mean.\n\n6.2 Large-scale experiments\n\nIn this section we show the results of the experiments carried out on larger datasets with non-linear, non-Gaussian likelihoods.\n\nMulti-class classification on the MNIST dataset. 
We first considered a multi-class classification task on the MNIST dataset using the softmax likelihood. This dataset has been used extensively by the machine learning community and contains 50,000 examples for training, 10,000 for validation and 10,000 for testing, with 784-dimensional input vectors. Unlike most previous approaches, we did not tune additional parameters using the validation set. Instead, we used our variational framework to learn all the model parameters using all the training and validation data. This setting most likely provides a lower bound on test accuracy, but our goal here is simply to show that we can achieve competitive performance with highly sparse models, as our inference algorithm does not know the details of the conditional likelihood. Figure 4 (left and middle) shows error rates and NLPs, where we see that, although the performance decreases with sparsity, the method is able to attain an accuracy of 97.49% while using only around 2000 inducing points (SF = 0.04).\n\nTo the best of our knowledge, we are the first to train a multi-class Gaussian process classifier using a single discriminative probabilistic framework on all classes on MNIST. For example, [17] used a 1-vs-rest approach and [23] focused on the binary classification task of distinguishing the odd digits from the even digits. Finally, [9] trained one model for each digit and used it as a density model, achieving an error rate of 5.95%. Our experiments show that by having a single discriminative probabilistic framework, even without exploiting the details of the conditional likelihood, we can bring this error rate down to 2.51%. 
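The softmax likelihood enters the inference algorithm only as a black box evaluated at sampled latent values. A sketch of such a plug-in (our own, assuming one latent function per class and integer class labels; the exact parameterization used in the experiments is not spelled out here):

```python
# The softmax likelihood as a black-box plug-in (our sketch): one
# latent function per class, integer labels; inference only ever
# evaluates it at sampled latent values.
import numpy as np

def softmax_loglik(y, f):
    """log p(y | f) for one data point; f has shape (S, Q)."""
    fmax = f.max(axis=1, keepdims=True)          # subtract for stability
    logZ = fmax[:, 0] + np.log(np.exp(f - fmax).sum(axis=1))
    return f[:, y] - logZ
```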
As a reference, previous literature reports about a 12% error rate for linear classifiers and less than a 1% error rate for state-of-the-art large/deep convolutional nets.\n\nFigure 4: Left and middle: classification error rates and negative log probabilities (NLP) for the multi-class problem on MNIST. Here we used the FG (full Gaussian) approximation with various sparsity factors (SF = M/N). The smaller the SF the sparser the model. Right: the SMSE for a Gaussian process regression network model on the SARCOS dataset when learning the 4th and 7th torques (output 1 and output 2) with a FG (full Gaussian) approximation and 0.04 sparsity factor.\n\nOur results show that our method, while solving the harder problem of full posterior estimation, can reduce the gap between GPs and deep nets.\n\nGaussian process regression networks on the SARCOS dataset. Here we apply our SAVIGP inference method to the Gaussian process regression networks (GPRNs) model of [14], using the SARCOS dataset as a test bed. GPRNs are a very flexible regression approach where the P outputs are a linear combination of Q latent Gaussian processes, with the weights of the linear combination also drawn from Gaussian processes. This yields a non-linear multiple-output likelihood model where the correlations between the outputs can be spatially adaptive, i.e. input dependent. The SARCOS dataset concerns an inverse dynamics problem for a 7-degrees-of-freedom anthropomorphic robot arm [16]. The data consist of 44,484 training examples mapping from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. 
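As a concrete black-box example, the GPRN observation model can be written as a likelihood over the stacked latent values. The sketch below is our simplified illustration: it omits the node-noise term of the full GPRN, and the flattening convention and the noise variance `sigma2` are our own choices.

```python
# Simplified GPRN observation model as a black-box likelihood (our
# illustration: node noise omitted; flattening convention and sigma2
# are our own choices, not the paper's).
import numpy as np

def gprn_loglik(y, latents, P, Q, sigma2=0.1):
    """log p(y | W, f) with y ~ N(W f, sigma2 I); `latents` has shape
    (S, P*Q + Q): first the P*Q mixing weights W(x), then the Q nodes f(x)."""
    S = latents.shape[0]
    W = latents[:, :P * Q].reshape(S, P, Q)
    f = latents[:, P * Q:]
    mean = np.einsum('spq,sq->sp', W, f)     # input-dependent mix W(x) f(x)
    resid2 = ((y - mean) ** 2).sum(axis=1)
    return -0.5 * P * np.log(2 * np.pi * sigma2) - resid2 / (2 * sigma2)
```

Both the weights and the nodes are treated as latent GP values, so the same univariate-Gaussian sampling machinery applies unchanged.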
Similarly to the work in [10], we consider joint learning of the 4th and 7th torques, which we refer to as output 1 and output 2 respectively, and make predictions on 4,449 test points per output.\n\nFigure 4 (right) shows the standardized mean square error (SMSE) with the full Gaussian approximation (FG) using SF = 0.04, i.e. fewer than 2000 inducing points. The results are considerably better than those reported by [10] (0.2631 and 0.0127 for each output, respectively), although their setting was much sparser than ours on the first output. This also corroborates previous findings that, on this problem, having more data does help [16]. To the best of our knowledge, we are the first to perform inference in GPRNs on problems at this scale.\n\n7 Conclusion\n\nWe have presented a scalable approximate inference method for models with Gaussian process (GP) priors, multiple outputs, and nonlinear likelihoods. One of the key properties of this method is its statistical efficiency, in that it requires only expectations over univariate Gaussian distributions to approximate the posterior with a mixture of Gaussians. Extensive experimental evaluation shows that our approach can attain excellent performance under high sparsity levels and that it can outperform previous inference methods that have been handcrafted to specific likelihood models. Overall, this work makes a substantial contribution towards the goal of developing generic yet scalable Bayesian inference methods for models based on Gaussian processes.\n\nAcknowledgments\n\nThis work has been partially supported by UNSW\u2019s Faculty of Engineering Research Grant Program project # PS37866 and an AWS in Education Research Grant award. 
AD was also supported by a grant from the Australian Research Council # DP150104878.\n\nReferences\n\n[1] Pedro Domingos, Stanley Kok, Hoifung Poon, Matthew Richardson, and Parag Singla. Unifying logical and statistical AI. In AAAI, 2006.\n[2] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: A language for generative models. In UAI, 2008.\n[3] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. JMLR, 15(1):1593\u20131623, 2014.\n[4] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In AISTATS, 2014.\n[5] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical slice sampling. In AISTATS, 2010.\n[6] Trung V. Nguyen and Edwin V. Bonilla. Automated variational inference for Gaussian process models. In NIPS, 2014.\n[7] Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. JMLR, 9(10), 2008.\n[8] James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. In UAI, 2013.\n[9] Yarin Gal, Mark van der Wilk, and Carl Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In NIPS, 2014.\n[10] Trung V. Nguyen and Edwin V. Bonilla. Collaborative multi-output Gaussian processes. In UAI, 2014.\n[11] Trung V. Nguyen and Edwin V. Bonilla. Fast allocation of Gaussian process experts. In ICML, 2014.\n[12] Joaquin Qui\u00f1onero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939\u20131959, 2005.\n[13] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.\n[14] Andrew G. Wilson, David A. 
Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In ICML, 2012.\n[15] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In NIPS, 2003.\n[16] Sethu Vijayakumar and Stefan Schaal. Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space. In ICML, 2000.\n[17] Neil D. Lawrence, Matthias Seeger, and Ralf Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In NIPS, 2002.\n[18] Ed Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, 2006.\n[19] Mauricio A. \u00c1lvarez and Neil D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. JMLR, 12(5):1459\u20131500, 2011.\n[20] Mauricio A. \u00c1lvarez, David Luengo, Michalis K. Titsias, and Neil D. Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, 2010.\n[21] Manfred Opper and C\u00e9dric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786\u2013792, 2009.\n[22] Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In AISTATS, 2013.\n[23] James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In AISTATS, 2015.\n[24] Zichao Yang, Andrew Gordon Wilson, Alexander J. Smola, and Le Song. \u00c0 la carte \u2014 learning fast kernels. In AISTATS, 2015.\n[25] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.\n[26] Christopher K.I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342\u20131351, 1998.\n[27] Jesper M\u00f8ller, Anne Randi Syversveen, and Rasmus Plenge Waagepetersen. 
Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451\u2013482, 1998.\n[28] K. Bache and M. Lichman. UCI machine learning repository, 2013.\n[29] R.G. Jarrett. A note on the intervals between coal-mining disasters. Biometrika, 66(1):191\u2013193, 1979.", "award": [], "sourceid": 865, "authors": [{"given_name": "Amir", "family_name": "Dezfouli", "institution": "The University of New South Wales"}, {"given_name": "Edwin", "family_name": "Bonilla", "institution": "University of New South Wales"}]}