{"title": "Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 3140, "page_last": 3148, "abstract": "We present a new variational inference algorithm for Gaussian processes with non-conjugate likelihood functions. This includes binary and multi-class classification, as well as ordinal regression. Our method constructs a convex lower bound, which can be optimized by using an efficient fixed point update method. We then show empirically that our new approach is much faster than existing methods without any degradation in performance.", "full_text": "Fast Bayesian Inference for Non-Conjugate\n\nGaussian Process Regression\n\nMohammad Emtiyaz Khan, Shakir Mohamed, and Kevin P. Murphy\n\nDepartment of Computer Science, University of British Columbia\n\nAbstract\n\nWe present a new variational inference algorithm for Gaussian process regres-\nsion with non-conjugate likelihood functions, with application to a wide array of\nproblems including binary and multi-class classi\ufb01cation, and ordinal regression.\nOur method constructs a concave lower bound that is optimized using an ef\ufb01cient\n\ufb01xed-point updating algorithm. We show that the new algorithm has highly com-\npetitive computational complexity, matching that of alternative approximate infer-\nence methods. We also prove that the use of concave variational bounds provides\nstable and guaranteed convergence \u2013 a property not available to other approaches.\nWe show empirically for both binary and multi-class classi\ufb01cation that our new\nalgorithm converges much faster than existing variational methods, and without\nany degradation in performance.\n\n1\n\nIntroduction\n\nGaussian processes (GP) are a popular non-parametric prior for function estimation. For real-valued\noutputs, we can combine the GP prior with a Gaussian likelihood and perform exact posterior in-\nference in closed form. 
However, in other cases, such as classi\ufb01cation, the likelihood is no longer\nconjugate to the GP prior, and exact inference is no longer tractable.\nVarious approaches are available to deal with this intractability. One approach is Markov Chain\nMonte Carlo (MCMC) techniques [1, 11, 22, 9]. Although this can be accurate, it is often quite\nslow, and assessing convergence is challenging. There is therefore great interest in deterministic ap-\nproximate inference methods. One recent approach is the Integrated Nested Laplace Approximation\n(INLA) [21], which uses numerical integration to approximate the marginal likelihood. Unfortu-\nnately, this method is limited to six or fewer hyperparameters, and is thus not suitable for models\nwith a large number of hyperparameters. Expectation propagation (EP) [17] is a popular alterna-\ntive, and is a method that approximates the posterior distribution by maintaining expectations and\niterating until these expectations are consistent for all variables. Although this is fast and accurate\nfor the case of binary classi\ufb01cation [15, 18], there are dif\ufb01culties extending EP to many other cases,\nsuch as multi-class classi\ufb01cation and parameter learning [24, 13]. In addition, EP is known to have\nconvergence issues and can be numerically unstable.\nIn this paper, we use a variational approach, where we compute a lower bound to the log marginal\nlikelihood using Jensen\u2019s inequality. Unlike EP, this approach does not suffer from numerical issues\nand convergence problems, and can easily handle multi-class and other likelihoods. This is an active\narea of research and many solutions have been proposed, see for example, [23, 6, 5, 19, 14]. Un-\nfortunately, most of these methods are slow, since they attempt to solve for the posterior covariance\nmatrix, which has size O(N 2), where N is the number of data points. 
In [19], a reparameterization was proposed that only requires computing O(N) variational parameters. Unfortunately, this method relies on a non-concave lower bound. In this paper, we propose a new lower bound that is concave, and derive an efficient iterative algorithm for its maximization. Since the original objective is unimodal, we reach the same global optimum as the other methods, but we do so much faster.\n\n$$p(z \\mid X, \\theta) = \\mathcal{N}(z \\mid \\mu, \\Sigma) \\tag{1}$$\n\n$$p(y \\mid z) = \\prod_{n=1}^{N} p(y_n \\mid z_n) \\tag{2}$$\n\nType | Distribution | $p(y \\mid z)$\nBinary | Bernoulli logit | $p(y = 1 \\mid z) = \\sigma(z)$\nCategorical | Multinomial logit | $p(y = k \\mid z) = e^{z_k - \\mathrm{lse}(z)}$\nOrdinal | Cumulative logit | $p(y \\leq k \\mid z) = \\sigma(\\phi_k - z)$\nCount | Poisson | $p(y = k \\mid z) = e^{-e^{z}} e^{kz} / k!$\n\nTable 1: Gaussian process regression (top left) and its graphical model (right), along with the example likelihoods for outputs (bottom left). Here, $\\sigma(z) = 1/(1 + e^{-z})$, lse(\u00b7) is the log-sum-exp function, k indexes over discrete output values, and $\\phi_k$ are real numbers such that $\\phi_1 < \\phi_2 < \\ldots < \\phi_K$ for K ordered categories.\n\n2 Gaussian Process Regression\n\nGaussian process (GP) regression is a powerful method for non-parametric regression that has gained a great deal of attention as a flexible and accurate modeling approach. Consider N data points with the n\u2019th observation denoted by $y_n$, with corresponding features $x_n$. A Gaussian process model uses a non-linear latent function z(x) to obtain the distribution of the observation y using an appropriate likelihood [15, 18]. 
For example, when y is binary, a Bernoulli logit/probit likelihood is appropriate. Similarly, for count observations, a Poisson distribution can be used.\nA Gaussian process [20] specifies a distribution over z(x), and is a stochastic process that is characterized by a mean function \u00b5(x) and a covariance function \u03a3(x, x\u2032), which are specified using a kernel function that depends on the observed features x. Assuming a GP prior over z(x) implies that a random vector is associated with every input x, such that given all inputs $X = [x_1, x_2, \\ldots, x_N]$, the joint distribution over $z = [z(x_1), z(x_2), \\ldots, z(x_N)]$ is Gaussian.\nThe GP prior is shown in Eq. 1. Here, \u00b5 is a vector with $\\mu(x_i)$ as its i\u2019th element, \u03a3 is a matrix with $\\Sigma(x_i, x_j)$ as the (i, j)\u2019th entry, and \u03b8 are the hyperparameters of the mean and covariance functions. We assume throughout a zero mean function and a squared-exponential covariance function (also known as radial-basis function or Gaussian) defined as: $\\Sigma(x_i, x_j) = \\sigma^2 \\exp[-(x_i - x_j)^T (x_i - x_j)/(2s)]$. The set of hyperparameters is \u03b8 = (s, \u03c3). We also define $\\Omega = \\Sigma^{-1}$.\nGiven the GP prior, the observations are modeled using the likelihood shown in Eq. 2. The exact form of the distribution $p(y_n \\mid z_n)$ depends on the type of observations, and different choices instantiate many existing models for GP regression [15, 18, 10, 14]. We consider frequently encountered data such as binary, ordinal, categorical and count observations, and describe their likelihoods in Table 1. For the case of categorical observations, the latent function z is a vector whose k\u2019th element is the latent function for the k\u2019th category. A graphical model for Gaussian process regression is also shown.\nGiven these models, there are three tasks that are to be performed: posterior inference, prediction at test inputs, and model selection. 
In all cases, the likelihoods we consider are not conjugate to the Gaussian prior distribution and as a result, the posterior distribution is intractable. Similarly, the integrations required in computing the predictive distribution and the marginal likelihood are intractable. To deal with this intractability we make use of variational methods.\n\n3 Variational Lower Bound to the Log Marginal Likelihood\n\nInference and model selection are difficult in Gaussian process regression with non-conjugate likelihoods because the marginal likelihood contains an intractable integral. In this section, we derive a tractable variational lower bound to the marginal likelihood. We show that the lower bound takes a well-known form and can be maximized using concave optimization. Throughout the section, we assume scalar $z_n$, with extension to the vector case being straightforward.\nWe begin with the intractable log marginal likelihood L(\u03b8) in Eq. 3 and introduce a variational posterior distribution q(z|\u03b3). We use a Gaussian posterior with mean m and covariance V. The full set of variational parameters is thus \u03b3 = {m, V}. As log is a concave function, we obtain a lower bound $L_J(\\theta, \\gamma)$ using Jensen\u2019s inequality, given in Eq. 4. The first integral is simply the Kullback\u2013Leibler (KL) divergence from the variational Gaussian posterior q(z|m, V) to the GP prior p(z|\u00b5, \u03a3) as shown in Eq. 5, and has a closed-form expression that we substitute to get the first term in Eq. 6 (inside square brackets), with $\\Omega = \\Sigma^{-1}$.\nThe second integral can be expressed in terms of the expectation with respect to the marginal $q(z_n \\mid m_n, V_{nn})$ as shown in the second term of Eq. 5. Here $m_n$ is the n\u2019th element of m and $V_{nn}$ is the n\u2019th diagonal element of V, the two variables collectively denoted by $\\gamma_n$. 
The lower bound $L_J$ is still intractable since the expectation of $\\log p(y_n \\mid z_n)$ is not available in closed form for most of the distributions listed in Table 1. To derive a tractable lower bound, we make use of local variational bounds (LVB) $f_b$, defined such that $E[\\log p(y_n \\mid z_n)] \\geq f_b(y_n, m_n, V_{nn})$, giving us Eq. 6.\n\n$$L(\\theta) = \\log \\int p(z \\mid \\theta)\\, p(y \\mid z)\\, dz = \\log \\int q(z \\mid \\gamma)\\, \\frac{p(z \\mid \\theta)\\, p(y \\mid z)}{q(z \\mid \\gamma)}\\, dz \\tag{3}$$\n\n$$\\geq L_J(\\theta, \\gamma) := -\\int q(z \\mid \\gamma) \\log \\frac{q(z \\mid \\gamma)}{p(z \\mid \\theta)}\\, dz + \\int q(z \\mid \\gamma) \\log p(y \\mid z)\\, dz \\tag{4}$$\n\n$$= -D_{KL}[q(z \\mid \\gamma) \\,\\|\\, p(z \\mid \\theta)] + \\sum_{n=1}^{N} E_{q(z_n \\mid \\gamma_n)}[\\log p(y_n \\mid z_n)] \\tag{5}$$\n\n$$\\geq L_J(\\theta, \\gamma) := \\frac{1}{2} \\left[ \\log |V \\Omega| - \\mathrm{tr}(V \\Omega) - (m - \\mu)^T \\Omega (m - \\mu) + N \\right] + \\sum_{n=1}^{N} f_b(y_n, m_n, V_{nn}). \\tag{6}$$\n\nWe discuss the choice of LVBs in the next section, but first discuss the well-known form that the lower bound of Eq. 6 takes. Given V, the optimization function with respect to m is a nonlinear least-squares function. Similarly, the function with respect to V is similar to the graphical lasso [8] or covariance selection problem [7], but is different in that the argument is a covariance matrix instead of a precision matrix [8]. These two objective functions are coupled through the non-linear term $f_b(\\cdot)$. Usually this term arises due to the prior distribution and may be non-smooth, for example, in graphical lasso. In our case, this term arises from the likelihood, and is smooth and concave as we discuss in the next section.\nIt is straightforward to show that the variational lower bound is strictly concave with respect to \u03b3 if $f_b$ is jointly concave with respect to $m_n$ and $V_{nn}$. Strict concavity of terms other than $f_b$ is well-known since both the least squares and covariance selection problems are concave. 
Similar concavity results have been discussed by Braun and McAuliffe [5] for the discrete choice model, and more recently by Challis and Barber [6] for the Bayesian linear model, who consider concavity with respect to the Cholesky factor of V. We consider concavity with respect to V instead of its Cholesky factor, which allows us to exploit the special structure of V, as explained in Section 5.\n\n4 Concave Local Variational Bounds\n\nIn this section, we describe concave LVBs for various likelihoods. For simplicity, we suppress the dependence on n and consider the log-likelihood of a scalar observation y given a predictor z distributed according to $q(z \\mid \\gamma) = \\mathcal{N}(z \\mid m, v)$ with \u03b3 = {m, v}. We describe the LVBs for the likelihoods given in Table 1 with z being a scalar for count, binary, and ordinal data, but a vector of length K for categorical data, K being the number of classes. When V is a matrix, we denote its diagonal by v.\nFor the Poisson distribution, the expectation is available in closed form and we do not need any bounding: $E[\\log p(y \\mid z)] = ym - \\exp(m + v/2) - \\log y!$. This function is jointly concave with respect to m and v since the exponential is a convex function.\nFor binary data, we use the piecewise linear/quadratic bounds proposed by [16], which bound the logistic-log-partition (LLP) function log(1 + exp(x)) and can be used to obtain a bound over the sigmoid function \u03c3(x). The final bound can be expressed as a sum of R pieces: $E[\\log p(y \\mid z)] \\geq f_b(y, m, v) = ym - \\sum_{r=1}^{R} f_{br}(m, v)$, where $f_{br}$ is the expectation of the r\u2019th quadratic piece. The function $f_{br}$ is jointly concave with respect to m and v, and its gradients are available in closed form. An important property of the piecewise bound is that its maximum error is bounded and can be driven to zero by increasing the number of pieces. This means that the lower bound in Eq. 
6 can be made arbitrarily tight by increasing the number of pieces. For this reason, this bound always performs better than other existing bounds, such as Jaakkola\u2019s bound [12], given that the number of pieces is chosen appropriately. Finally, the cumulative logit likelihood for ordinal observations depends on \u03c3(x) and its expectation can be bounded using piecewise bounds in a similar way.\nFor the multinomial logit distribution, we can use the bounds proposed by [3] and [4], both leading to concave LVBs. The first bound takes the form $f_b(y, m, V) = y^T m - \\mathrm{lse}(m + v/2)$ with y represented using a 1-of-K encoding. This function is jointly concave with respect to m and v, which can be shown by noting the fact that the log-sum-exp function is convex. The second bound is the product of sigmoids bound proposed by [4], which bounds the likelihood with a product of sigmoids (see Eq. 3 in [4]), with each sigmoid bounded using Jaakkola\u2019s bound [12]. We can also use the piecewise linear/quadratic bound to bound each sigmoid. Alternatively, we can use the recently proposed stick-breaking likelihood of [14], which uses piecewise bounds as well.\nFinally, note that the original log-likelihood may not be concave itself, but if it is such that $L_J$ has a unique solution, then designing a concave variational lower bound will allow us to use concave optimization to efficiently maximize the lower bound.\n\n5 Existing Algorithms for Variational Inference\n\nIn this section, we assume that for each output $y_n$ there is a corresponding scalar latent function $z_n$. All our results can be easily extended to the case of multi-class outputs where the latent function is a vector. In variational inference, we find the approximate Gaussian posterior distribution with mean m and covariance V that maximizes Eq. 6. 
The simplest approach is to use gradient-based methods for optimization, but this can be problematic since the number of variational parameters is quadratic in N due to the covariance matrix V. The authors of [19] speculate that this may perhaps be the reason behind the limited use of Gaussian variational approximations.\nWe now show that the problem is simpler than it appears to be, and in fact the number of parameters can be reduced to O(N) from $O(N^2)$. First, we write the gradients with respect to m and v in Eq. 7 and 8 and equate to zero, using $g^m_n := \\partial f_b(y_n, m_n, v_n)/\\partial m_n$ and $g^v_n := \\partial f_b(y_n, m_n, v_n)/\\partial v_n$. Also, $g^m$ and $g^v$ are the vectors of these gradients, and $\\mathrm{diag}(g^v)$ is the matrix with $g^v$ as its diagonal.\n\n$$-\\Omega (m - \\mu) + g^m = 0 \\tag{7}$$\n\n$$\\frac{1}{2} \\left( V^{-1} - \\Omega \\right) + \\mathrm{diag}(g^v) = 0 \\tag{8}$$\n\nAt the solution, we see that V is completely specified if $g^v$ is known. This property can be exploited to reduce the number of variational parameters.\nOpper and Archambeau [19] (and [18]) propose a reparameterization to reduce the number of parameters to O(N). From the fixed-point equation, we note that at the solution m and V will have the following form,\n\n$$V = (\\Sigma^{-1} + \\mathrm{diag}(\\lambda))^{-1} \\tag{9}$$\n\n$$m = \\mu + \\Sigma \\alpha, \\tag{10}$$\n\nwhere \u03b1 and \u03bb are real vectors with $\\lambda_d > 0, \\forall d$. At the maximum (but not everywhere), \u03b1 will be equal to $g^m$ and \u03bb will be equal to $-2 g^v$, as can be seen by comparing Eq. 7 and 8 with Eq. 9 and 10. Therefore, instead of solving the fixed-point equations to obtain m and V, we can reparameterize the lower bound with respect to \u03b1 and \u03bb. Substituting Eq. 9 and 10 in Eq. 
6 and after simplification using the matrix inversion and determinant lemmas, we get the following new objective function (for a detailed derivation, see [18]),\n\n$$\\frac{1}{2} \\left[ -\\log(|B_\\lambda|\\,|\\mathrm{diag}(\\lambda)|) + \\mathrm{Tr}(B_\\lambda^{-1} \\Sigma) - \\alpha^T \\Sigma \\alpha \\right] + \\sum_{n=1}^{N} f_b(y_n, m_n, V_{nn}), \\tag{11}$$\n\nwith $B_\\lambda = \\mathrm{diag}(\\lambda)^{-1} + \\Sigma$. Since the mapping between {\u03b1, \u03bb} and {m, V} is one-to-one, we can recover the latter given the former. The one-to-one relationship also implies that the new objective function has a unique maximum. The new lower bound involves vectors of size N, reducing the number of variational parameters to O(N).\nThe problem with this reparameterization is that the new lower bound is no longer concave, even though it has a unique maximum. To see this, consider the 1-D case. We collect all the terms involving V from Eq. 6, except the LVB term, to define the function $f(V) = [\\log(V \\Sigma^{-1}) - V \\Sigma^{-1}]/2$. We substitute the reparameterization $V = (\\Sigma^{-1} + \\lambda)^{-1}$ to get a new function $f(\\lambda) = [-\\log(1 + \\Sigma\\lambda) - (1 + \\Sigma\\lambda)^{-1}]/2$. The second derivative of this function is $f''(\\lambda) = \\frac{1}{2} \\Sigma^2 (\\Sigma\\lambda - 1)/(1 + \\Sigma\\lambda)^3$. Clearly, this derivative is negative for \u03bb < 1/\u03a3 and non-negative otherwise, making the function neither concave nor convex.\nThe objective function is still unimodal and the maximum of (11) is equal to the maximum of (6). With the reparameterization, we lose concavity and therefore the algorithm may have slow convergence. Our experimental results (Section 7) confirm the slow convergence.\n\n6 Fast Convergent Variational Inference using Coordinate Ascent\n\nWe now derive an algorithm that reduces the number of variational parameters to 2N while maintaining concavity. 
Our algorithm uses simple scalar fixed-point updates to obtain the diagonal elements of V. The complete algorithm is shown in Algorithm 1.\nTo derive the algorithm, we first note that the fixed-point equation Eq. 8 has an attractive property: at the solution, the off-diagonal elements of $V^{-1}$ are the same as the off-diagonal elements of \u03a9, i.e. if we denote $K := V^{-1}$, then $K_{ij} = \\Omega_{ij}$ for i \u2260 j. We need only find the diagonal elements of K to get the full V. This is difficult, however, since the gradient $g^v$ depends on v.\nWe take the approach of optimizing each diagonal element $K_{ii}$ while fixing all others (and fixing m as well). We partition V as shown on the left side of Eq. 12, indexing the last row by 2 and the rest of the rows by 1. We consider a similar partitioning of K and \u03a9. Our goal is to compute $v_{22}$ and $k_{22}$ given all other elements of K. Matrices K and V are related through the blockwise inversion, as shown below.\n\n$$\\begin{bmatrix} V_{11} & v_{12} \\\\ v_{12}^T & v_{22} \\end{bmatrix} = \\begin{bmatrix} K_{11}^{-1} + \\dfrac{K_{11}^{-1} k_{12} k_{12}^T K_{11}^{-1}}{k_{22} - k_{12}^T K_{11}^{-1} k_{12}} & -\\dfrac{K_{11}^{-1} k_{12}}{k_{22} - k_{12}^T K_{11}^{-1} k_{12}} \\\\ -\\dfrac{k_{12}^T K_{11}^{-1}}{k_{22} - k_{12}^T K_{11}^{-1} k_{12}} & \\dfrac{1}{k_{22} - k_{12}^T K_{11}^{-1} k_{12}} \\end{bmatrix} \\tag{12}$$\n\nFrom the right bottom corner, we have the first relation below, which we simplify further.\n\n$$v_{22} = 1/(k_{22} - k_{12}^T K_{11}^{-1} k_{12}) \\;\\Rightarrow\\; k_{22} = \\tilde{k}_{22} + 1/v_{22} \\tag{13}$$\n\nwhere we define $\\tilde{k}_{22} := k_{12}^T K_{11}^{-1} k_{12}$. We also know from the fixed point Eq. 8 that the optimal $v_{22}$ and $k_{22}$ satisfy Eq. 14 at the solution, where $g^v_{22}$ is the gradient of $f_b$ with respect to $v_{22}$. Substitute the value of $k_{22}$ from Eq. 13 in Eq. 14 to get Eq. 15. 
It is easy to check (by taking the derivative) that the value $v_{22}$ that satisfies this fixed-point equation can be found by maximizing the function defined in Eq. 16.\n\n$$0 = k_{22} - \\Omega_{22} + 2 g^v_{22} \\tag{14}$$\n\n$$0 = \\tilde{k}_{22} + 1/v_{22} - \\Omega_{22} + 2 g^v_{22} \\tag{15}$$\n\n$$f(v) = \\log(v) - (\\Omega_{22} - \\tilde{k}_{22}) v + 2 f_b(y_2, m_2, v) \\tag{16}$$\n\nThe function f(v) is a strictly concave function and can be optimized by iterating the following update: $v_{22} \\leftarrow 1/(\\Omega_{22} - \\tilde{k}_{22} - 2 g^v_{22})$. We will refer to this as a \u201cfixed-point iteration\u201d.\nSince all elements of K, except $k_{22}$, are fixed, $\\tilde{k}_{22}$ can be computed beforehand and need not be evaluated at every fixed-point iteration. In fact, we do not need to compute it explicitly, since we can obtain its value using Eq. 13: $\\tilde{k}_{22} = k_{22} - 1/v_{22}$, and we do this before starting a fixed-point iteration. The complexity of these iterations depends on the number of gradient evaluations $g^v_{22}$, which is usually constant and very low.\nAfter convergence of the fixed-point iterations, we update V using Eq. 12. It turns out that this is a rank-one update, the complexity of which is $O(N^2)$. To show these updates, let us denote the new values obtained after the fixed-point iterations by $k_{22}^{new}$ and $v_{22}^{new}$ respectively, and denote the old values by $k_{22}^{old}$ and $v_{22}^{old}$. We use the right top corner of Eq. 12 to get the first equality in Eq. 17. Using Eq. 13, we get the second equality. Similarly, we use the top left corner of Eq. 12 to get the first equality in Eq. 18, and use Eq. 13 and 17 to get the second equality.\n\n$$K_{11}^{-1} k_{12} = -(k_{22}^{old} - \\tilde{k}_{22})\\, v_{12}^{old} = -v_{12}^{old} / v_{22}^{old} \\tag{17}$$\n\n$$K_{11}^{-1} = V_{11}^{old} - \\frac{K_{11}^{-1} k_{12} k_{12}^T K_{11}^{-1}}{k_{22}^{old} - \\tilde{k}_{22}} = V_{11}^{old} - v_{12}^{old} (v_{12}^{old})^T / v_{22}^{old} \\tag{18}$$\n\nNote that both $K_{11}^{-1}$ and $k_{12}$ do not change after the fixed-point iteration. We use this fact to obtain $V^{new}$. We use Eq. 12 to write updates for $V^{new}$ and use Eq. 17, 18, and 13 to simplify.\n\n$$v_{12}^{new} = -\\frac{K_{11}^{-1} k_{12}}{k_{22}^{new} - \\tilde{k}_{22}} = \\frac{v_{22}^{new}}{v_{22}^{old}}\\, v_{12}^{old} \\tag{19}$$\n\n$$V_{11}^{new} = K_{11}^{-1} + \\frac{K_{11}^{-1} k_{12} k_{12}^T K_{11}^{-1}}{k_{22}^{new} - \\tilde{k}_{22}} = V_{11}^{old} + \\frac{v_{22}^{new} - v_{22}^{old}}{(v_{22}^{old})^2}\\, v_{12}^{old} (v_{12}^{old})^T \\tag{20}$$\n\nAfter updating V, we update m by optimizing the following non-linear least squares problem,\n\n$$\\max_m \\; -\\frac{1}{2} (m - \\mu)^T \\Omega (m - \\mu) + \\sum_{n=1}^{N} f_b(y_n, m_n, V_{nn}) \\tag{21}$$\n\nWe use Newton\u2019s method, the cost of which is $O(N^3)$.\n\n6.1 Computational complexity\n\nThe final procedure is shown in Algorithm 1. The main advantage of our algorithm is its fast convergence, as we show in the results section. The overall computational complexity is $O(N^3 + \\sum_n I_n^{fp})$. The first term is due to the $O(N^2)$ update of V for all n and also due to the optimization of m. The second term is for the $I_n^{fp}$ fixed-point iterations, the total cost of which is linear in N due to the summation. In all our experiments, $I_n^{fp}$ is usually 3 to 5, adding very little cost.\n\n6.2 Proof of convergence\n\nProposition 2.7.1 in [2] states that the coordinate ascent algorithm converges if the maximization with respect to each coordinate is uniquely attained. This is indeed the case for us since each fixed-point iteration solves a concave problem of the form given by Eq. 16. Similarly, optimization with respect to m is also strictly concave. Hence, convergence of our algorithm is assured.\n\n6.3 Proof that V will always be positive definite\n\nLet us assume that we start with a positive definite K, for example, we can initialize it with \u03a9. Now consider the update of $v_{22}$ and $k_{22}$. Note that $v_{22}^{new}$ will be positive since it is the maximum of Eq. 16, which involves the log term. Using this and Eq. 13, we get $k_{22}^{new} > k_{12}^T K_{11}^{-1} k_{12}$. Hence, the Schur complement $k_{22}^{new} - k_{12}^T K_{11}^{-1} k_{12} > 0$. Using this and the fact that $K_{11}$ is positive definite, it follows that $K^{new}$ will also be positive definite, and hence $V^{new}$ will be positive definite.\n\nAlgorithm 1 Fast convergent coordinate-ascent algorithm\n1. Initialize $K \\leftarrow \\Omega$, $V \\leftarrow \\Omega^{-1}$, $m \\leftarrow \\mu$, where $\\Omega := \\Sigma^{-1}$.\n2. Alternate between updating the diagonal of V and then m until convergence, as follows:\n  (a) Update the i\u2019th diagonal of V for all i = 1, . . . , N:\n    i. Rearrange V and \u03a9 so that the i\u2019th column is the last one.\n    ii. $\\tilde{k}_{22} \\leftarrow k_{22} - 1/v_{22}$.\n    iii. Store old value $v_{22}^{old} \\leftarrow v_{22}$.\n    iv. Run fixed-point iterations for a few steps: $v_{22} \\leftarrow 1/(\\Omega_{22} - \\tilde{k}_{22} - 2 g^v_{22})$.\n    v. Update V.\n      A. $V_{11} \\leftarrow V_{11} + (v_{22} - v_{22}^{old})\\, v_{12} v_{12}^T / (v_{22}^{old})^2$.\n      B. $v_{12} \\leftarrow v_{22}\\, v_{12} / v_{22}^{old}$.\n    vi. Update $k_{22} \\leftarrow \\tilde{k}_{22} + 1/v_{22}$.\n  (b) Update m by maximizing the least-squares problem of Eq. 21.\n\n7 Results\n\nWe now show that the proposed algorithm leads to a significant gain in the speed of Gaussian process regression. The software to reproduce the results of this section is available online (http://www.cs.ubc.ca/emtiyaz/software/codeNIPS2012.html). We evaluate the performance of our fast variational inference algorithm against existing inference methods for binary and multi-class classification. For binary classification, we use the UCI ionosphere data (with 351 data examples containing 34 features). For multi-class classification, we use the UCI forensic glass data set with 214 data examples, each with a 6-category output and features of length 8. 
In both cases, we use 80% of the dataset for training and the rest for testing.\nWe consider GP classification using the Bernoulli logit likelihood, for which we use the piecewise bound of [16] with 20 pieces. We compare our algorithm with the approach of Opper and Archambeau [19] (Eq. 11). For the latter, we use the L-BFGS method for optimization. We also compared to the naive method of optimizing with respect to the full m and V, e.g. the method of [5], but do not present these results since these algorithms have very slow convergence.\nWe examine the computational cost for each method in terms of the number of floating point operations (flops) for four hyperparameter settings \u03b8 = {log(s), log(\u03c3)}. This comparison is shown in Figure 1(a). The y-axis shows (the negative of) the value of the lower bound, and the x-axis shows the number of flops. We draw markers at iterations 1, 2, 4, 50 and in steps of 50 from then on. In all cases, due to non-concavity, the optimization of the Opper and Archambeau reparameterization (black curve with squares) converges slowly, passing through flat regions of the objective and requiring a large number of computations to reach convergence. The proposed algorithm (blue curve with circles) has consistently faster convergence than the existing method. For this dataset, our algorithm always converged in 5 iterations.\nWe also compare the total cost to convergence, where we count the total number of flops until the successive increase in the objective function is below $10^{-3}$. Each entry is a different setting of {log(s), log(\u03c3)}. Rows correspond to values of log(s) while columns correspond to log(\u03c3), with units M, G, T denoting Mega-, Giga-, and Tera-flops. 
We can see that the proposed algorithm takes a much smaller number of operations compared to the existing algorithm.\n\nProposed Algorithm\nlog(s) \\ log(\u03c3) | -1 | 1 | 3\n-1 | 6M | 7M | 7M\n1 | 26M | 20M | 22M\n3 | 47M | 81M | 75M\n\nOpper and Archambeau\nlog(s) \\ log(\u03c3) | -1 | 1 | 3\n-1 | 20G | 212G | 6T\n1 | 101G | 24T | 24T\n3 | 38G | 1T | 24T\n\nWe also applied our method to two more datasets of [18], namely the \u2019sonar\u2019 and \u2019usps-3vs5\u2019 datasets, and observed similar behavior.\nNext, we apply our algorithm to the problem of multi-class classification, following [14], using the stick-breaking likelihood, and compare to inference using the approach of Opper and Archambeau [19] (Eq. 11). We show results comparing the lower bound vs the number of flops taken in Figure 1(b), for four hyperparameter settings {log(s), log(\u03c3)}. We show markers at iterations 1, 2, 10, 100 and every 100th iteration thereafter. The results follow those discussed for binary classification, where both methods reach the same lower bound value, but the existing approach converges much more slowly, while our algorithm always converged within 20 iterations.\n\nFigure 1: Convergence results for (a) the binary classification on the ionosphere data set and (b) the multi-class classification on the glass dataset. We plot the negative of the lower bound vs the number of flops. Each plot shows the progress of algorithms for a hyperparameter setting {log(s), log(\u03c3)} shown at the top of the plot. The proposed algorithm always converges faster than the other method, in fact, in less than 5 iterations.\n\n8 Discussion\n\nIn this paper we have presented a new variational inference algorithm for non-conjugate GP regression. We derived a concave variational lower bound to the log marginal likelihood, and used concavity to develop an efficient optimization algorithm. 
We demonstrated the ef\ufb01cacy of our new\nalgorithm on both binary and multiclass GP classi\ufb01cation, demonstrating signi\ufb01cant improvement\nin convergence.\nOur proposed algorithm is related to many existing methods for GP regression. For example, the\nobjective function that we consider is exactly the KL minimization method discussed in [18], for\nwhich a gradient based optimization was used. Our algorithm uses an ef\ufb01cient approach where we\nupdate the marginals of the posterior and then do a rank one update of the covariance matrix. Our\nresults show that this leads to fast convergence.\nOur algorithm also takes a similar form to the popular EP algorithm [17], e.g. see Algorithm 3.5 in\n[20]. Both EP and our algorithm update posterior marginals, followed by a rank-one update of\nthe covariance. Therefore, the computational complexity of our approach is similar to that of EP.\nThe advantage of our approach is that, unlike EP, it does not suffer from any numerical issues (for\nexample, no negative variances) and is guaranteed to converge.\nThe derivation of our algorithm is based on the observation that the posterior covariance has a special\nstructure, and does not directly use the concavity of the lower bound. An alternate derivation based\non the Fenchel duality exists and shows that the \ufb01xed-point iterations compute dual variables which\nare related to the gradients of fb. We skip this derivation since it is tedious, and present the more\nintuitive derivation instead. The alternative derivation will be made available in an online appendix.\n\nAcknowledgements\n\nWe thank the reviewers for their valuable suggestions. 
SM is supported by the Canadian Institute for Advanced Research (CIFAR).\n\n[Figure 1 plots. Panel titles (hyperparameter settings): (\u22121.0, \u22121.0), (\u22121.0, 2.5), (3.5, 3.5), (1.0, 1.0) and (\u22121.0, \u22121.0), (\u22121.0, 2.5), (2.5, 2.5), (1.0, 1.0). x-axis: Mega-flops; y-axis: neg-LogLik; curves: Opper-Arch, proposed.]\n\nReferences\n\n[1] J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. J. of the Am. Stat. Assoc., 88(422):669\u2013679, 1993.\n[2] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.\n[3] D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems, 2006.\n[4] G. Bouchard. Efficient bounds for the softmax and applications to approximate inference in hybrid models. In NIPS 2007 Workshop on Approximate Inference in Hybrid Models, 2007.\n[5] M. Braun and J. McAuliffe. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324\u2013335, 2010.\n[6] E. Challis and D. Barber. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 6, page 7, 2011.\n[7] A. Dempster. Covariance selection. Biometrics, 28(1), 1972.\n[8] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.\n[9] S. Fr\u00fchwirth-Schnatter and R. Fr\u00fchwirth. 
Data augmentation and MCMC for binary and multinomial logit models. Statistical Modelling and Regression Structures, pages 111\u2013132, 2010.\n\n[10] M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790\u20131817, 2006.\n\n[11] C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145\u2013168, 2006.\n\n[12] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression problems and their extensions. In AI + Statistics, 1996.\n\n[13] P. Jyl\u00e4nki, J. Vanhatalo, and A. Vehtari. Robust Gaussian process regression with a Student-t likelihood. The Journal of Machine Learning Research, 12:3227\u20133257, 2011.\n\n[14] M. Khan, S. Mohamed, B. Marlin, and K. Murphy. A stick-breaking likelihood for categorical data analysis with latent Gaussian models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2012.\n\n[15] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. J. of Machine Learning Research, 6:1679\u20131704, 2005.\n\n[16] B. Marlin, M. Khan, and K. Murphy. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In Intl. Conf. on Machine Learning, 2011.\n\n[17] T. Minka. Expectation propagation for approximate Bayesian inference. In UAI, 2001.\n\n[18] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(10), 2008.\n\n[19] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786\u2013792, 2009.\n\n[20] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[21] H. Rue, S. Martino, and N. Chopin. 
Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. J. of Royal Stat. Soc. Series B, 71:319\u2013392, 2009.\n\n[22] S. L. Scott. Data augmentation, frequentist estimation, and the Bayesian analysis of multinomial logit models. Statistical Papers, 52(1):87\u2013109, 2011.\n\n[23] M. Seeger. Bayesian inference and optimal design in the sparse linear model. J. of Machine Learning Research, 9:759\u2013813, 2008.\n\n[24] M. Seeger and H. Nickisch. Fast convergent algorithms for expectation propagation approximate Bayesian inference. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2011.\n", "award": [], "sourceid": 1448, "authors": [{"given_name": "Emtiyaz", "family_name": "Khan", "institution": null}, {"given_name": "Shakir", "family_name": "Mohamed", "institution": null}, {"given_name": "Kevin", "family_name": "Murphy", "institution": null}]}