{"title": "Efficient Optimization for Sparse Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1097, "page_last": 1105, "abstract": "We propose an efficient discrete optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates this inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in the training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-of-art performance in the discrete case and competitive results in the continuous case.", "full_text": "Ef\ufb01cient Optimization for\n\nSparse Gaussian Process Regression\n\nYanshuai Cao1 Marcus A. Brubaker2 David J. Fleet1 Aaron Hertzmann1,3\n\n1Department of Computer Science\n\nUniversity of Toronto\n\n2TTI-Chicago\n\n3Adobe Research\n\nAbstract\n\nWe propose an ef\ufb01cient optimization algorithm for selecting a subset of train-\ning data to induce sparsity for Gaussian process regression. The algorithm esti-\nmates an inducing set and the hyperparameters using a single objective, either the\nmarginal likelihood or a variational free energy. The space and time complexity\nare linear in training set size, and the algorithm can be applied to large regression\nproblems on discrete or continuous domains. Empirical evaluation shows state-of-\nart performance in discrete cases and competitive results in the continuous case.\n\nIntroduction\n\n1\nGaussian Process (GP) learning and inference are computationally prohibitive with large datasets,\nhaving time complexities O(n3) and O(n2), where n is the number of training points. Sparsi\ufb01cation\nalgorithms exist that scale linearly in the training set size (see [10] for a review). 
They construct a low-rank approximation to the GP covariance matrix over the full dataset using a small set of inducing points. Some approaches select inducing points from the training points [7, 8, 12, 13], but these methods select the inducing points using ad hoc criteria; i.e., they use different objective functions to select inducing points and to optimize GP hyperparameters. More powerful sparsification methods [14, 15, 16] use a single objective function and allow inducing points, learned via gradient descent, to move freely over the input domain. This continuous relaxation is not feasible, however, if the input domain is discrete, or if the kernel function is not differentiable in the input variables. As a result, there are problems in myriad domains, like bio-informatics, linguistics and computer vision, where current sparse GP regression methods are inapplicable or ineffective.

We introduce an efficient sparsification algorithm for GP regression. The method optimizes a single objective for the joint selection of inducing points and GP hyperparameters. Notably, it optimizes either the marginal likelihood or a variational free energy [15], exploiting the QR factorization of a partial Cholesky decomposition to efficiently approximate the covariance matrix. Because it chooses inducing points from the training data, it is applicable to problems on discrete or continuous input domains. To our knowledge, it is the first method for selecting discrete inducing points and hyperparameters that optimizes a single objective, with linear space and time complexity. It is shown to outperform other methods on discrete datasets from bio-informatics and computer vision. On continuous domains it is competitive with the Sparse Pseudo-input GP [14] (SPGP).

1.1 Previous Work

Efficient state-of-the-art sparsification methods are O(m^2 n) in time and O(mn) in space for learning.
They compute the predictive mean and variance in time O(m) and O(m^2), respectively. Methods based on continuous relaxation, when applicable, entail learning O(md) continuous parameters, where d is the input dimension. In the discrete case, combinatorial optimization is required to select the inducing points, and this is, in general, intractable. Existing discrete sparsification methods therefore use other criteria to greedily select inducing points [7, 8, 12, 13]. Although their criteria are justified, each in its own way (e.g., [8, 12] take an information-theoretic perspective), they are greedy and do not use the same objective to select inducing points and to estimate GP hyperparameters.

The variational formulation of Titsias [15] treats inducing points as variational parameters, and gives a unified objective for discrete and continuous inducing-point models. In the continuous case, it uses gradient-based optimization to find inducing points and hyperparameters. In the discrete case, our method optimizes the same variational objective as Titsias [15], but is a significant improvement over greedy forward selection using the variational objective, or some other criterion, for selection. In particular, given the cost of evaluating the variational objective on all training points, Titsias [15] evaluates the objective function on a small random subset of candidates at each iteration, and then selects the best element from the subset. This approximation is often slow to achieve good results, as we explain and demonstrate below in Section 4.1. The approach in [15] also uses greedy forward selection, which provides no way to refine the inducing set after hyperparameter optimization, except to discard all previous inducing points and restart selection. Hence, the objective is not guaranteed to decrease after each restart.
By comparison, our formulation considers all candidates at each step; revisiting previous selections is efficient, and guaranteed to decrease the objective or terminate.

Our low-rank decomposition is inspired by the Cholesky with Side Information (CSI) algorithm for kernel machines [1]. We extend that approach to GP regression. First, we alter the form of the low-rank matrix factorization in CSI to be suitable for GP regression with a full-rank diagonal term in the covariance. Second, the CSI algorithm selects inducing points in a single greedy pass using an approximate objective; we propose an iterative optimization algorithm that swaps previously selected points with new candidates that are guaranteed to lower the objective. Finally, we perform inducing set selection jointly with gradient-based hyperparameter estimation, instead of the grid search in CSI. Our algorithm selects inducing points in a principled fashion, optimizing the variational free energy or the log likelihood. It does so with time complexity O(m^2 n), and in practice provides an improved quality-speed trade-off over other discrete selection methods.

2 Sparse GP Regression

Let $y \in \mathbb{R}$ be the noisy output of a function, $f$, of input $x$. Let $X = \{x_i\}_{i=1}^{n}$ denote $n$ training inputs, each belonging to an input space $\mathcal{D}$, which is not necessarily Euclidean. Let $\mathbf{y} \in \mathbb{R}^n$ denote the corresponding vector of training outputs.
Under a full zero-mean GP with covariance function
$$\mathbb{E}[y_i y_j] = \kappa(x_i, x_j) + \sigma^2 \mathbf{1}[i = j] \,, \quad (1)$$
where $\kappa$ is the kernel function, $\mathbf{1}[\cdot]$ is the usual indicator function, and $\sigma^2$ is the variance of the observation noise, the predictive distribution over the output $f_*$ at a test point $x_*$ is normally distributed. The mean and variance of the predictive distribution can be expressed as
$$\mu_* = \kappa(x_*)^\top (K + \sigma^2 I_n)^{-1} \mathbf{y}$$
$$v_*^2 = \kappa(x_*, x_*) - \kappa(x_*)^\top (K + \sigma^2 I_n)^{-1} \kappa(x_*)$$
where $I_n$ is the $n \times n$ identity matrix, $K$ is the kernel matrix whose $ij$th element is $\kappa(x_i, x_j)$, and $\kappa(x_*)$ is the column vector whose $i$th element is $\kappa(x_*, x_i)$.

The hyperparameters of a GP, denoted $\theta$, comprise the parameters of the kernel function and the noise variance $\sigma^2$. The natural objective for learning $\theta$ is the negative marginal log likelihood (NMLL) of the training data, $-\log P(\mathbf{y} | X, \theta)$, given up to a constant by
$$E_{\mathrm{full}}(\theta) = \big( \mathbf{y}^\top (K + \sigma^2 I_n)^{-1} \mathbf{y} + \log |K + \sigma^2 I_n| \big) / 2 \,. \quad (2)$$
The computational bottleneck lies in the O(n^2) storage and O(n^3) inversion of the full covariance matrix, $K + \sigma^2 I_n$. To lower this cost with a sparse approximation, Csató and Opper [5] and Seeger et al. [12] proposed the Projected Process (PP) model, wherein a set of m inducing points is used to construct a low-rank approximation of the kernel matrix.
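For concreteness, the full-model objective of Eq. (2) and its O(n^3) bottleneck can be sketched in a few lines. This is an illustrative numpy sketch, not the paper's code; the RBF kernel here is a hypothetical stand-in for any valid kernel function.

```python
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    # Illustrative squared-exponential kernel; any positive-definite kappa works.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def full_gp_nmll(X, y, sigma2, kernel=rbf_kernel):
    # Eq. (2): ( y^T (K + sigma^2 I)^{-1} y + log|K + sigma^2 I| ) / 2.
    # The O(n^3) Cholesky of the full n x n covariance is the bottleneck
    # that the sparse approximation removes.
    n = X.shape[0]
    C = kernel(X, X) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)                             # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return 0.5 * (y @ alpha + logdet)
```

Both the inversion and the log-determinant come from a single Cholesky factor, which is the standard trick the sparse methods below inherit in low-rank form.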
In the discrete case, where the inducing points are a subset of the training data with indices $\mathcal{I} \subset \{1, 2, \ldots, n\}$, this approach amounts to replacing the kernel matrix $K$ with the following Nyström approximation [11]:
$$K \simeq \hat{K} = K[:, \mathcal{I}] \, K[\mathcal{I}, \mathcal{I}]^{-1} \, K[\mathcal{I}, :] \,, \quad (3)$$
where $K[:, \mathcal{I}]$ denotes the sub-matrix of $K$ comprising the columns indexed by $\mathcal{I}$, and $K[\mathcal{I}, \mathcal{I}]$ is the sub-matrix of $K$ comprising the rows and columns indexed by $\mathcal{I}$. We assume the rank of $K$ is $m$ or higher so we can always find such rank-$m$ approximations. The PP NMLL is then algebraically equivalent to replacing $K$ with $\hat{K}$ in Eq. (2), i.e.,
$$E(\theta, \mathcal{I}) = \big( E^D(\theta, \mathcal{I}) + E^C(\theta, \mathcal{I}) \big) / 2 \,, \quad (4)$$
with data term $E^D(\theta, \mathcal{I}) = \mathbf{y}^\top (\hat{K} + \sigma^2 I_n)^{-1} \mathbf{y}$ and model complexity $E^C(\theta, \mathcal{I}) = \log |\hat{K} + \sigma^2 I_n|$. The reduction in computational cost from O(n^3) to O(m^2 n) for the new likelihood is achieved by applying the Woodbury inversion identity to $E^D(\theta, \mathcal{I})$ and $E^C(\theta, \mathcal{I})$. The objective in (4) can be viewed as an approximate log likelihood for the full GP model, or as the exact log likelihood for an approximate model, called the Deterministically Trained Conditional [10].

The same PP model can also be obtained by a variational argument, as in [15], for which the variational free energy objective can be shown to be Eq. (4) plus one extra term; i.e.,
$$F(\theta, \mathcal{I}) = \big( E^D(\theta, \mathcal{I}) + E^C(\theta, \mathcal{I}) + E^V(\theta, \mathcal{I}) \big) / 2 \,, \quad (5)$$
where $E^V(\theta, \mathcal{I}) = \sigma^{-2} \operatorname{tr}(K - \hat{K})$ arises from the variational formulation. It effectively regularizes the trace norm of the approximation residual of the covariance matrix.
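As a naive reference for these quantities, the Nyström approximation of Eq. (3) and the objectives of Eqs. (4) and (5) can be sketched as follows. The O(n^3) dense solves here are for clarity only (the paper's factored form evaluates the same quantities in O(m^2 n)), and the function and variable names are illustrative.

```python
import numpy as np

def nystrom(K, I):
    # Eq. (3): K_hat = K[:, I] K[I, I]^{-1} K[I, :]
    return K[:, I] @ np.linalg.solve(K[np.ix_(I, I)], K[I, :])

def sparse_objectives(K, y, I, sigma2):
    # Returns (E, F): PP negative log likelihood of Eq. (4) and the
    # variational free energy of Eq. (5), evaluated the slow, direct way.
    n = len(y)
    K_hat = nystrom(K, I)
    C = K_hat + sigma2 * np.eye(n)
    ED = y @ np.linalg.solve(C, y)        # data term
    EC = np.linalg.slogdet(C)[1]          # model complexity term
    EV = np.trace(K - K_hat) / sigma2     # variational trace regularizer
    return (ED + EC) / 2.0, (ED + EC + EV) / 2.0
```

Note that when the inducing set covers all training points the residual K − K̂ vanishes, so the trace term drops out and the two objectives coincide.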
The kernel machine of [1] also uses a regularizer of the form $\lambda \operatorname{tr}(K - \hat{K})$; however, there $\lambda$ is a free parameter that is set manually.

3 Efficient optimization

We now outline our algorithm for optimizing the variational free energy (5) to select the inducing set $\mathcal{I}$ and the hyperparameters $\theta$. (The negative log likelihood (4) is minimized similarly, by simply discarding the $E^V$ term.) The algorithm is a form of hybrid coordinate descent that alternates between discrete optimization of the inducing points and continuous optimization of the hyperparameters. We first describe the algorithm for selecting inducing points, and then discuss continuous hyperparameter optimization and termination criteria in Sec. 3.4.

Finding the optimal inducing set is a combinatorial problem; global optimization is intractable. Instead, the inducing set is initialized to a random subset of the training data, which is then refined by a fixed number of swap updates at each iteration.¹ In a single swap update, a randomly chosen inducing point is considered for replacement. If swapping does not improve the objective, then the original point is retained. There are n − m potential replacements for each swap update; the key is to efficiently determine which will maximally improve the objective. With the techniques described below, the computation time required to approximately evaluate all possible candidates and swap an inducing point is O(mn). Swapping all inducing points once takes O(m^2 n) time.

3.1 Factored representation

To support efficient evaluation of the objective and swapping, we use a factored representation of the kernel matrix. Given an inducing set $\mathcal{I}$ of $k$ points, for any $k \le m$, the low-rank Nyström approximation to the kernel matrix (Eq.
3) can be expressed in terms of a partial Cholesky factorization:
$$\hat{K} = K[:, \mathcal{I}] \, K[\mathcal{I}, \mathcal{I}]^{-1} \, K[\mathcal{I}, :] = L(\mathcal{I}) L(\mathcal{I})^\top \,, \quad (6)$$
where $L(\mathcal{I}) \in \mathbb{R}^{n \times k}$ is, up to a permutation of rows, a lower trapezoidal matrix (i.e., it has a $k \times k$ lower triangular top block, again up to row permutation). The derivation of Eq. (6) follows from Proposition 1 in [1], and the fact that, given the ordered sequence of pivots $\mathcal{I}$, the partial Cholesky factorization is unique.

Using this factorization and the Woodbury identities (dropping the dependence on $\theta$ and $\mathcal{I}$ for clarity), the terms of the negative marginal log likelihood (4) and variational free energy (5) become
$$E^D = \sigma^{-2} \big( \mathbf{y}^\top \mathbf{y} - \mathbf{y}^\top L (L^\top L + \sigma^2 I)^{-1} L^\top \mathbf{y} \big) \quad (7)$$
$$E^C = \log \big( (\sigma^2)^{n-k} \, |L^\top L + \sigma^2 I| \big) \quad (8)$$
$$E^V = \sigma^{-2} \big( \operatorname{tr}(K) - \operatorname{tr}(L^\top L) \big) \quad (9)$$
We can further simplify the data term by augmenting the factor matrix as $\tilde{L} = [L^\top, \sigma I_k]^\top$, where $I_k$ is the $k \times k$ identity matrix, and $\tilde{\mathbf{y}} = [\mathbf{y}^\top, \mathbf{0}_k^\top]^\top$ is the $\mathbf{y}$ vector with $k$ zeros appended:
$$E^D = \sigma^{-2} \big( \mathbf{y}^\top \mathbf{y} - \tilde{\mathbf{y}}^\top \tilde{L} (\tilde{L}^\top \tilde{L})^{-1} \tilde{L}^\top \tilde{\mathbf{y}} \big) \quad (10)$$
Now, let $\tilde{L} = QR$ be a QR factorization of $\tilde{L}$, where $Q \in \mathbb{R}^{(n+k) \times k}$ has orthogonal columns and $R \in \mathbb{R}^{k \times k}$ is invertible. The first two terms in the objective simplify further to
$$E^D = \sigma^{-2} \big( \|\mathbf{y}\|^2 - \|Q^\top \tilde{\mathbf{y}}\|^2 \big) \quad (11)$$
$$E^C = (n - k) \log(\sigma^2) + 2 \log |R| \,. \quad (12)$$

¹The inducing set can be incrementally constructed, as in [1]; however, we found no benefit to this.

3.2 Factorization update

Here we present the mechanics of the swap update algorithm; see [3] for pseudo-code.
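Before turning to the updates, the factored evaluation of Eqs. (11)-(12) can be sketched as follows. This naive reference recomputes the factorization from scratch rather than maintaining it incrementally, and the names are illustrative.

```python
import numpy as np

def factored_terms(K, y, I, sigma2):
    # E_D and E_C of Eqs. (11)-(12). L is a partial Cholesky factor with
    # pivots I (so L L^T = K_hat), L_tilde = [L; sigma I_k], y_tilde = [y; 0],
    # and L_tilde = Q R is a thin QR factorization.
    n, k = len(y), len(I)
    R0 = np.linalg.cholesky(K[np.ix_(I, I)])
    L = np.linalg.solve(R0, K[I, :]).T                 # n x k, L @ L.T = K_hat
    L_tilde = np.vstack([L, np.sqrt(sigma2) * np.eye(k)])
    y_tilde = np.concatenate([y, np.zeros(k)])
    Q, R = np.linalg.qr(L_tilde)                       # (n+k) x k and k x k
    ED = (y @ y - np.sum((Q.T @ y_tilde) ** 2)) / sigma2            # Eq. (11)
    # abs() handles numpy's sign convention for the diagonal of R.
    EC = (n - k) * np.log(sigma2) + 2.0 * np.log(np.abs(np.diag(R))).sum()  # Eq. (12)
    return ED, EC
```

In the algorithm proper, Q and R are maintained incrementally across swaps, so these terms never require a fresh factorization.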
Suppose we wish to swap inducing point $i$ with candidate point $j$ in $\mathcal{I}_m$, the inducing set of size $m$. We first modify the factor matrices to remove point $i$ from $\mathcal{I}_m$, i.e., to downdate the factors. Then we update all the key terms using one step of Cholesky and QR factorization with the new point $j$.

Downdating to remove inducing point $i$ requires that we shift the corresponding columns/rows in the factorization to the right-most columns of $\tilde{L}$, $Q$, $R$ and to the last row of $R$. We can then simply discard these last columns and rows, and modify the related quantities. When permuting the order of the inducing points, the underlying GP model is invariant, but the matrices in the factored representation are not. If needed, any two points in $\mathcal{I}_m$ can be permuted, and the Cholesky or QR factors can be updated in time O(mn). This is done with the efficient pivot permutation presented in the Appendix of [1], with minor modifications to account for the augmented form of $\tilde{L}$. In this way, downdating and removing $i$ take O(mn) time, as does the update with point $j$.

After downdating, we have factors $\tilde{L}_{m-1}$, $Q_{m-1}$, $R_{m-1}$, and inducing set $\mathcal{I}_{m-1}$. To add $j$ to $\mathcal{I}_{m-1}$ and update the factors to rank $m$, one step of Cholesky factorization is performed with point $j$, for which the new column to append to $\tilde{L}$ is formed as
$$\ell_m = (K - \hat{K}_{m-1})[:, j] \Big/ \sqrt{(K - \hat{K}_{m-1})[j, j]} \,, \quad (13)$$
where $\hat{K}_{m-1} = L_{m-1} L_{m-1}^\top$. We then set $\tilde{L}_m = [\tilde{L}_{m-1} \; \tilde{\ell}_m]$, where $\tilde{\ell}_m$ is $\ell_m$ augmented with $\sigma e_m = [0, \ldots, 0, \sigma, 0, \ldots, 0]^\top$. The final updates are $Q_m = [Q_{m-1} \; q_m]$, where $q_m$ is given by Gram-Schmidt orthogonalization,
$$q_m = \big( (I - Q_{m-1} Q_{m-1}^\top) \tilde{\ell}_m \big) \Big/ \big\| (I - Q_{m-1} Q_{m-1}^\top) \tilde{\ell}_m \big\| \,,$$
and $R_m$ is updated from $R_{m-1}$ so that $\tilde{L}_m = Q_m R_m$.

3.3 Evaluating candidates

Next we show how to select candidates for inclusion in the inducing set. We first derive the exact change in the objective due to adding an element to $\mathcal{I}_{m-1}$; later we provide an approximation to this objective change that can be computed efficiently.

Given an inducing set $\mathcal{I}_{m-1}$, and matrices $\tilde{L}_{m-1}$, $Q_{m-1}$, and $R_{m-1}$, we wish to evaluate the change in Eq. (5) for $\mathcal{I}_m = \mathcal{I}_{m-1} \cup \{j\}$. That is, $\Delta F \equiv F(\theta, \mathcal{I}_{m-1}) - F(\theta, \mathcal{I}_m) = (\Delta E^D + \Delta E^C + \Delta E^V)/2$, where, based on the mechanics of the incremental updates above, one can show that
$$\Delta E^D = \sigma^{-2} \big( \tilde{\mathbf{y}}^\top (I - Q_{m-1} Q_{m-1}^\top) \tilde{\ell}_m \big)^2 \Big/ \big\| (I - Q_{m-1} Q_{m-1}^\top) \tilde{\ell}_m \big\|^2 \quad (14)$$
$$\Delta E^C = \log(\sigma^2) - \log \big\| (I - Q_{m-1} Q_{m-1}^\top) \tilde{\ell}_m \big\|^2 \quad (15)$$
$$\Delta E^V = \sigma^{-2} \|\ell_m\|^2 \quad (16)$$
This gives the exact decrease in the objective function after adding point $j$. For a single point this evaluation is O(mn), so evaluating all n − m points would be O(mn^2).

3.3.1 Fast approximate cost reduction

While O(mn^2) is prohibitive, computing the exact change is not required. Rather, we only need a ranking of the best few candidates.
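A minimal sketch of this exact per-candidate evaluation, under one possible padding convention for the augmented vectors (the names are illustrative):

```python
import numpy as np

def exact_objective_change(Q, y_tilde, ell, ell_tilde, sigma2):
    # Exact decrease (Delta E_D + Delta E_C + Delta E_V) / 2 of
    # Eqs. (14)-(16) for one candidate column. Q is the current
    # orthonormal factor, ell is the new partial-Cholesky column of
    # Eq. (13), and ell_tilde is its sigma-augmented version.
    resid = ell_tilde - Q @ (Q.T @ ell_tilde)       # (I - Q Q^T) l~_m
    r2 = resid @ resid
    dED = (y_tilde @ resid) ** 2 / (sigma2 * r2)    # Eq. (14)
    dEC = np.log(sigma2) - np.log(r2)               # Eq. (15)
    dEV = (ell @ ell) / sigma2                      # Eq. (16)
    return 0.5 * (dED + dEC + dEV)
```

Computing the residual projection against Q is the O(mn) step per candidate that the approximation of Sec. 3.3.1 avoids.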
Thus, instead of evaluating the change in the objective exactly, we use an efficient approximation based on a small number, z, of training points which provide information about the residual between the current low-rank covariance matrix (based on inducing points) and the full covariance matrix. After this approximation proposes a candidate, we use the actual objective to decide whether to include it. The techniques below reduce the complexity of evaluating all n − m candidates to O(zn).

To compute the change in objective for one candidate, we need the new column of the updated Cholesky factorization, $\ell_m$. In Eq. (13) this vector is a (normalized) column of the residual $K - \hat{K}_{m-1}$ between the full kernel matrix and the Nyström approximation. Now consider the full Cholesky decomposition $K = L^* L^{*\top}$, where $L^* = [L_{m-1}, L(\mathcal{J}_{m-1})]$ is constructed with $\mathcal{I}_{m-1}$ as the first pivots and $\mathcal{J}_{m-1} = \{1, \ldots, n\} \setminus \mathcal{I}_{m-1}$ as the remaining pivots, so the residual becomes $K - \hat{K}_{m-1} = L(\mathcal{J}_{m-1}) L(\mathcal{J}_{m-1})^\top$. We approximate $L(\mathcal{J}_{m-1})$ by a rank-$z$ ($z \ll n$) matrix, $L_z$, obtained by taking $z$ points from $\mathcal{J}_{m-1}$ and performing a partial Cholesky factorization of $K - \hat{K}_{m-1}$ using these pivots. The residual approximation becomes $K - \hat{K}_{m-1} \approx L_z L_z^\top$, and thus $\ell_m \approx (L_z L_z^\top)[:, j] \big/ \sqrt{(L_z L_z^\top)[j, j]}$. The pivots used to construct $L_z$ are called information pivots; their selection is discussed in Sec. 3.3.2.

The approximations to $\Delta E^D_k$, $\Delta E^C_k$ and $\Delta E^V_k$, Eqs. (14)-(16), for all candidate points involve the following terms: $\operatorname{diag}(L_z L_z^\top)$, $\mathbf{y}^\top L_z L_z^\top$, and $(Q_{k-1}[1{:}n, :])^\top L_z L_z^\top$.
The first term can be computed in time O(z^2 n), and the other two in O(zmn) with careful ordering of the matrix multiplications.² Computing $L_z$ costs O(z^2 n), but this can be avoided, since the information pivots change by at most one at a time (when an information pivot is added to the inducing set and needs to be replaced). The techniques in Sec. 3.2 bring the associated update cost to O(zn) by updating $L_z$ rather than recomputing it. These z information pivots are equivalent to the "look-ahead" steps of Bach and Jordan's CSI algorithm, but as described in Sec. 3.3.2, there is a more effective way to select them.

3.3.2 Ensuring a good approximation

The selection of the information pivots determines the approximate objective, and hence the candidate proposal. To ensure a good approximation, the CSI algorithm [1] greedily selects points to find an approximation of the residual $K - \hat{K}_{m-1}$ in Eq. (13) that is optimal in terms of a bound on the trace norm. The goal, however, is to approximate Eqs. (14)-(16). By analyzing the role of the residual matrix, we see that the information pivots provide a low-rank approximation to the orthogonal complement of the space spanned by the current inducing set. With a fixed set of information pivots, parts of that subspace may never be captured. This suggests that we should occasionally update the entire set of information pivots. Although information pivots are changed when one is moved into the inducing set, we find empirically that this is not sufficient. Instead, at regular intervals we replace the entire set of information pivots by random selection. We find this works better than optimizing the information pivots as in [1].

Figure 1 compares the exact and approximate cost reduction for candidate inducing points (left), and their respective rankings (right).
The approximation is shown to work well. It is also robust to changes in the number of information pivots and the frequency of updates. When bad candidates are proposed, they are rejected after evaluating the change in the true objective. We find that rejection rates are typically low during early iterations (< 20%), but increase as optimization nears convergence (to 30% or 40%). Rejection rates also increase for sparser models, where each inducing point plays a more critical role and is harder to replace.

Figure 1: Exact vs. approximate costs, based on the 1D example of Sec. 4, with z = 10, n = 200.

3.4 Hybrid optimization

The overall hybrid optimization procedure performs block coordinate descent in the inducing points and the continuous hyperparameters. It alternates between discrete and continuous phases until the improvement in the objective falls below a threshold or the computational time budget is exhausted.

In the discrete phase, inducing points are considered for swapping with the hyperparameters fixed. With the factorization and efficient candidate evaluation above, swapping an inducing point $i \in \mathcal{I}_m$ proceeds as follows: (I) downdate the factorization matrices as in Sec. 3.2 to remove $i$; (II) compute the true objective function value $F_{m-1}$ over the downdated model with $\mathcal{I}_m \setminus \{i\}$, using (11), (12) and (9); (III) select a replacement candidate using the fast approximate cost change from Sec. 3.3.1; (IV) evaluate the exact objective change using (14), (15), and (16); (V) add the exact change to the true objective $F_{m-1}$ to get the objective value with the new candidate.
If this improves the objective, we include the candidate in $\mathcal{I}$ and update the matrices as in Sec. 3.2; otherwise it is rejected and we revert to the factorization with $i$. (VI) If needed, update the information pivots as in Secs. 3.3.1 and 3.3.2.

²Both can be further reduced to O(zn) by appropriate caching during the updates of Q, R, $\tilde{L}$, and $L_z$.

Figure 2: Test performance on discrete datasets. (top row) BindingDB, where the value at each marker is the average of 150 runs (50-fold random train/test splits times 3 random initializations); (bottom row) HoG dataset, where each marker is the average of 10 randomly initialized runs.

After each discrete optimization step we fix the inducing set $\mathcal{I}$ and optimize the hyperparameters using non-linear conjugate gradients (CG). The equivalence in (6) allows us to compute the gradient with respect to the hyperparameters analytically using the Nyström form. In practice, because we alternate the two phases for many training epochs, attempting to swap every inducing point in each epoch is unnecessary, just as there is no need to run hyperparameter optimization until convergence. As long as all inducing set points are eventually considered, we find that optimized models can achieve similar performance with shorter learning times.

4 Experiments and analysis

For the experiments that follow we jointly learn inducing points and hyperparameters, a more challenging task than learning inducing points with known hyperparameters [12, 14]. For all but the 1D example, the number of inducing points swapped per epoch is min(60, m). The maximum number of function evaluations per epoch in CG hyperparameter optimization is min(20, max(15, 2d)), where d is the number of continuous hyperparameters.
Empirically we find the algorithm is robust to changes in these limits. We use two performance measures: (a) standardized mean square error (SMSE), $\frac{1}{N} \sum_{t=1}^{N} (\hat{y}_t - y_t)^2 / \hat{\sigma}_*^2$, where $\hat{\sigma}_*^2$ is the sample variance of the test outputs $\{y_t\}$; and (b) standardized negative log probability (SNLP), defined in [11].

4.1 Discrete input domain

We first show results on two discrete datasets with kernels that are not differentiable in the input variable x. Because continuous relaxation methods are not applicable, we compare to discrete selection methods, namely, random selection as a baseline (Random), the greedy subset-optimal selection of Titsias [15] with either 16 or 512 candidates (Titsias-16 and Titsias-512), and the Informative Vector Machine [8] (IVM). For learning continuous hyperparameters, each method optimizes the same objective using non-linear CG. Care is taken to ensure consistent initialization and termination criteria [3]. For our algorithm we use z = 16 information pivots with random selection (CholQR-z16). Later, we show how variants of our algorithm trade off speed and performance. Additionally, we also compare to least-squares kernel regression using CSI (in Fig. 3(c)).

The first discrete dataset, from bindingdb.org, concerns the prediction of binding affinity for a target (Thrombin) from the 2D chemical structure of small molecules (represented as graphs). We do 50-fold random splits into 3660 training points and 192 test points for repeated runs. We use a
We use a\ncompound kernel, comprising 14 different graph kernels, and 15 continuous hyperparameters (one\n\n6\n\n163264128256512\u22120.7\u22120.6\u22120.5\u22120.4\u22120.3\u22120.2number of inducing points (m)Testing SNLP CholQR\u2212z16IVMRandomTitsias\u221216Titsias\u22125121632641282565120.30.40.50.60.7number of inducing points (m)Testing SMSE163264128256512\u22121.2\u22121\u22120.8\u22120.6\u22120.4number of inducing points (m)Testing SNLP CholQR\u2212z16IVMRandomTitsias\u221216Titsias\u22125121632641282565120.10.150.20.250.30.350.4number of inducing points (m)Testing SMSE\f(a)\n\n(d)\n\n(b)\n\n(e)\n\n(c)\n\n(f)\n\nFigure 3: Training time versus test performance on discrete datasets. (a) the average BindingDB\ntraining time; (b) the average BindingDB objective function value at convergence; (d) and (e) show\ntest scores versus training time with m = 32 for a single run; (c) shows the trade-off between training\ntime and testing SMSE on the HoG dataset with m = 32, for various methods including multiple\nvariants of CholQR and CSI; (f) a zoomed-in version of (c) comparing the variants of CholQR.\n\nnoise variance and 14 data variances). In the second task, from [2], the task is to predict 3D human\njoint position from histograms of HoG image features [6]. Training and test sets have 4819 and\n4811 data points. Because our goal is the general purpose sparsi\ufb01cation method for GP regression,\nwe make no attempt at the more dif\ufb01cult problem of modelling the multivariate output structure in\nthe regression as in [2]. Instead, we predict the vertical position of joints independently, using a\nhistogram intersection kernel [9], having four hyperparameters: one noise variance, and three data\nvariances corresponding to the kernel evaluated over the HoG from each of three cameras. We select\nand show result on the representative left wrist here (see [3] for others joints, and more details about\nthe datasets and kernels used).\nThe results in Fig. 
2 and 3 show that CholQR-z16 outperforms the baseline methods in terms of test-time predictive power with significantly lower training time. Titsias-16 and Titsias-512 show similar test performance, but they are two to four orders of magnitude slower than CholQR-z16 (see Figs. 3(d) and 3(e)). Indeed, Fig. 3(a) shows that the training time for CholQR-z16 is comparable to IVM and Random selection, but with much better performance. The poor performance of Random selection highlights the importance of selecting good inducing points, as no amount of hyperparameter optimization can correct for poor inducing points. Fig. 3(a) also shows IVM to be somewhat slower due to the increased number of iterations needed, even though, per epoch, IVM is faster than CholQR. When stopped earlier, IVM test performance degrades further.

Finally, Figs. 3(c) and 3(f) show the trade-off between test SMSE and training time for variants of CholQR, with the baselines and CSI kernel regression [1]. For CholQR we consider different numbers of information pivots (denoted z8, z16, z64 and z128), and different strategies for their selection, including random selection, optimization as in [1] (denoted OI), and adaptively growing the information pivot set (denoted AA; see [3] for details). These variants of CholQR trade off speed and performance (Fig. 3(f)), and all significantly outperform the other methods (Fig. 3(c)); CSI, which uses grid search to select hyperparameters, is slow and exhibits higher SMSE.

4.2 Continuous input domain

Although CholQR was developed for discrete input domains, it can be competitive on continuous domains. To that end, we compare to SPGP [14] and IVM [8], using RBF kernels with one length-scale parameter per input dimension, $\kappa(x_i, x_j) = c \exp\big( -0.5 \sum_{t=1}^{d} b_t (x_i^{(t)} - x_j^{(t)})^2 \big)$. We show results from both the PP log likelihood and variational objectives, suffixed by MLE and VAR.

Figure 4: Snelson's 1D example, with panels (a)-(b) CholQR-MLE, (c) SPGP, (d)-(e) CholQR-VAR, (f) SPGP: prediction mean (red curves); one standard deviation in prediction uncertainty (green curves); inducing point initialization (black points at the top of each figure); learned inducing point locations (the cyan points at the bottom, also overlaid on the data for CholQR).

Figure 5: Test scores on KIN40K as a function of the number of inducing points: for each number of inducing points, the value plotted is averaged over 10 runs from 10 different (shared) initializations.

We use the 1D toy dataset of [14] to show how the PP likelihood with gradient-based optimization of inducing points is easily trapped in local minima. Figs. 4(a) and 4(d) show that for this dataset our algorithm does not get trapped when the initialization is poor (as in Fig. 1c of [14]). To simulate
To simulate the sparsity of data in high-dimensional problems, we also down-sample the dataset to 20 points (every 10th point). Here CholQR outperforms SPGP (see Figs. 4(b), 4(e), and 4(c)). By comparison, Fig. 4(f) shows that SPGP, learned with a more uniform initial distribution of inducing points, avoids this local optimum and achieves a better negative log likelihood of 11.34, compared to 14.54 in Fig. 4(c).

Finally, we compare CholQR to SPGP [14] and IVM [8] on a large dataset. KIN40K concerns nonlinear forward kinematic prediction. It has 8D real-valued inputs and scalar outputs, with 10K training and 30K test points. We perform linear de-trending and re-scaling as pre-processing. For SPGP we use the implementation of [14]. Fig. 5 shows that CholQR-VAR outperforms IVM in terms of both SMSE and SNLP. Both CholQR-VAR and CholQR-MLE outperform SPGP in terms of SMSE on KIN40K with large m, but SPGP exhibits better SNLP. This disparity between the SMSE and SNLP measures for CholQR-MLE is consistent with findings about the PP likelihood in [15]. Recently, Chalupka et al. [4] introduced an empirical evaluation framework for approximate GP methods, and showed that subset of data (SoD) often compares favorably to more sophisticated sparse GP methods. Our preliminary experiments using this framework suggest that CholQR outperforms SPGP in speed and predictive scores; and compared to SoD, CholQR is slower during training, but proportionally faster during testing, since CholQR finds a much sparser model that achieves the same predictive scores. In future work, we will report results on the complete suite of benchmark tests.

5 Conclusion
We describe an algorithm for selecting inducing points for Gaussian process sparsification. It optimizes principled objective functions, and is applicable to discrete domains and non-differentiable kernels.
On such problems it is shown to be as good as or better than competing methods, and for methods with similar predictive performance, ours is several orders of magnitude faster. On continuous domains the method is competitive. Extending the approach to the SPGP form of covariance approximation would be interesting future research.

References

[1] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. ICML, pp. 33–40, 2005.

[2] L. Bo and C. Sminchisescu. Twin Gaussian processes for structured prediction. IJCV, 87:28–52, 2010.

[3] Y. Cao, M. A. Brubaker, D. J. Fleet, and A. Hertzmann. Project page: supplementary material and software for efficient optimization for sparse Gaussian process regression. www.cs.toronto.edu/~caoy/opt_sgpr, 2013.

[4] K. Chalupka, C. K. I. Williams, and I. Murray. A framework for evaluating approximation methods for Gaussian process regression. JMLR, 14(1):333–350, February 2013.

[5] L. Csató and M. Opper. Sparse on-line Gaussian processes. Neural Comput., 14:641–668, 2002.

[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE CVPR, pp. 886–893, 2005.

[7] S. S. Keerthi and W. Chu. A matching pursuit approach to sparse Gaussian process regression. NIPS 18, pp. 643–650, 2006.

[8] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. NIPS 15, pp. 609–616, 2003.

[9] J. J. Lee. Libpmk: A pyramid match toolkit. TR: MIT-CSAIL-TR-2008-17, MIT CSAIL, 2008. URL http://hdl.handle.net/1721.1/41070.

[10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.

[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

[12] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. AI & Stats 9, 2003.

[13] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. NIPS 13, pp. 619–625, 2001.

[14] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. NIPS 18, pp. 1257–1264, 2006.

[15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. AISTATS, JMLR W&CP 5:567–574, 2009.

[16] C. Walder, K. I. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. ICML, pp. 1112–1119, 2008.