{"title": "Predictive App roaches for Choosing Hyperparameters in Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 631, "page_last": 637, "abstract": null, "full_text": "Predictive Approaches For Choosing \n\nHyperparameters in Gaussian Processes \n\nS. Sundararajan \n\nS. Sathiya Keerthi \n\nComputer Science and Automation \n\nIndian Institute of Science \nBangalore 560 012, India \nsundar@csa.iisc. ernet. in \n\nMechanical and Production Engg. \nNational University of Singapore \n\n10 Kentridge Crescent, Singapore 119260 \n\nmpessk@guppy. mpe. nus. edu. sg \n\nAbstract \n\nGaussian Processes are powerful regression models specified by \nparametrized mean and covariance functions. Standard approaches \nto estimate these parameters (known by the name Hyperparam(cid:173)\neters) are Maximum Likelihood (ML) and Maximum APosterior \n(MAP) approaches. In this paper, we propose and investigate pre(cid:173)\ndictive approaches, namely, maximization of Geisser's Surrogate \nPredictive Probability (GPP) and minimization of mean square er(cid:173)\nror with respect to GPP (referred to as Geisser's Predictive mean \nsquare Error (GPE)) to estimate the hyperparameters. We also \nderive results for the standard Cross-Validation (CV) error and \nmake a comparison. These approaches are tested on a number of \nproblems and experimental results show that these approaches are \nstrongly competitive to existing approaches. \n\n1 \n\nIntroduction \n\nGaussian Processes (GPs) are powerful regression models that have gained popular(cid:173)\nity recently, though they have appeared in different forms in the literature for years. \nThey can be used for classification also; see MacKay (1997), Rasmussen (1996) and \nWilliams and Rasmussen (1996). Here, we restrict ourselves to regression problems. 
\nNeal (1996) showed that a large class of neural network models converges to a Gaussian Process prior over functions in the limit of an infinite number of hidden units. Although GPs can be created using infinite networks, GPs are often specified directly using parametric forms for the mean and covariance functions (Williams and Rasmussen (1996)). We assume that the process is zero mean. Let Z_N = {X_N, y_N}, where X_N = {x(i): i = 1, ..., N} and y_N = {y(i): i = 1, ..., N}. Here, y(i) represents the output corresponding to the input vector x(i). Then, the Gaussian prior over the functions is given by \n\np(y_N | X_N, θ) = exp(-(1/2) y_N^T C_N^{-1} y_N) / ((2π)^{N/2} |C_N|^{1/2})   (1) \n\nwhere C_N is the covariance matrix with (i,j)th element [C_N]_{i,j} = C(x(i), x(j); θ) and C(.; θ) denotes the parametrized covariance function. Now, assuming that the observed output t_N is modeled as t_N = y_N + e_N, where e_N is zero mean multivariate Gaussian with covariance matrix σ²I_N and independent of y_N, we get \n\np(t_N | X_N, θ̄) = exp(-(1/2) t_N^T C̄_N^{-1} t_N) / ((2π)^{N/2} |C̄_N|^{1/2})   (2) \n\nwhere C̄_N = C_N + σ²I_N. Therefore, [C̄_N]_{i,j} = [C_N]_{i,j} + σ² δ_{i,j}, where δ_{i,j} = 1 when i = j and zero otherwise. Note that θ̄ = (θ, σ²) is the new set of hyperparameters. Then, the predictive distribution of the output y(N+1) for a test case x(N+1) is also Gaussian, with mean and variance \n\nŷ(N+1) = k_{N+1}^T C̄_N^{-1} t_N   (3) \n\nσ_y²(N+1) = b_{N+1} - k_{N+1}^T C̄_N^{-1} k_{N+1}   (4) \n\nwhere b_{N+1} = C̄(x(N+1), x(N+1); θ̄) and k_{N+1} is an N x 1 vector with ith element given by C(x(N+1), x(i); θ). Now, we need to specify the covariance function C(.; θ). Williams and Rasmussen (1996) found the following covariance function to work well in practice: \n\nC(x(i), x(j); θ) = a_0 + a_1 Σ_{p=1}^{M} x_p(i) x_p(j) + v_0 exp(-(1/2) Σ_{p=1}^{M} w_p (x_p(i) - x_p(j))²)   (5) \n\nwhere x_p(i) is the pth component of the ith input vector x(i). The w_p are the Automatic Relevance Determination (ARD) parameters. 
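To make (3)-(5) concrete, the following sketch (illustrative NumPy code, not part of the original paper; the function names and the toy hyperparameter values below are our own) builds the covariance function (5), forms C̄_N = C_N + σ²I_N as in (2), and evaluates the predictive mean (3) and variance (4) at a test input:

```python
import numpy as np

def cov(xi, xj, a0, a1, v0, w):
    # Covariance function (5): constant + linear part + ARD squared-exponential part.
    return a0 + a1 * (xi @ xj) + v0 * np.exp(-0.5 * np.sum(w * (xi - xj) ** 2))

def gp_predict(X, t, x_new, a0, a1, v0, w, sigma2):
    # C_bar = C_N + sigma2 * I_N, the covariance of the noisy targets, as in (2).
    N = X.shape[0]
    C = np.array([[cov(X[i], X[j], a0, a1, v0, w) for j in range(N)]
                  for i in range(N)])
    C_bar = C + sigma2 * np.eye(N)
    k = np.array([cov(x_new, X[i], a0, a1, v0, w) for i in range(N)])
    b = cov(x_new, x_new, a0, a1, v0, w) + sigma2
    mean = k @ np.linalg.solve(C_bar, t)      # predictive mean, eq. (3)
    var = b - k @ np.linalg.solve(C_bar, k)   # predictive variance, eq. (4)
    return mean, var
```

As a quick sanity check, predicting at a training input with a small σ² approximately reproduces the stored target, and the predictive variance shrinks toward σ².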
Note that C̄(x(i), x(j); θ̄) = C(x(i), x(j); θ) + σ² δ_{i,j}. Also, all the parameters are positive, and it is convenient to work on a logarithmic scale. Hence, θ̄ is given by log(a_0, a_1, v_0, w_1, ..., w_M, σ²). Then, the question is: how do we handle θ̄? More sophisticated techniques like Hybrid Monte Carlo (HMC) methods (Rasmussen (1996) and Neal (1997)) are available, which can numerically integrate over the hyperparameters to make predictions. Alternately, we can estimate θ̄ from the training data. We restrict ourselves to the latter approach here. In the classical approach, θ̄ is assumed to be deterministic but unknown, and the estimate is found by maximizing the likelihood (2). That is, θ̄_ML = argmax_θ̄ p(t_N | X_N, θ̄). In the Bayesian approach, θ̄ is assumed to be random and a prior p(θ̄) is specified. Then, the MAP estimate θ̄_MP is obtained as θ̄_MP = argmax_θ̄ p(t_N | X_N, θ̄) p(θ̄), with the motivation that the predictive distribution p(y(N+1) | x(N+1), Z_N) can be approximated as p(y(N+1) | x(N+1), Z_N, θ̄_MP). With this background, in this paper we propose and investigate different predictive approaches to estimate the hyperparameters from the training data. \n\n2 Predictive approaches for choosing hyperparameters \n\nGeisser (1975) proposed the Predictive Sample Reuse (PSR) methodology, which can be applied to both model selection and parameter estimation problems. The basic idea is to define a partition scheme P(N, n, Γ) such that P_i^{(N,n)} = (Z_{N-n}^{i}; Z_n^{i}) is the ith partition belonging to a set Γ of partitions, with Z_{N-n}^{i}, Z_n^{i} representing the N - n retained and n omitted data sets respectively. Then, the unknown θ̄ is estimated (or a model M_j is chosen among a set of models indexed by j = 1, ..., J) by optimizing a predictive measure that quantifies the predictive performance on the n omitted observations Z_n^{i} using the N - n retained observations Z_{N-n}^{i}, averaged over the partitions (i ∈ Γ). 
In the special case of n = 1, we have the leave-one-out strategy. Note that this approach was independently presented under the name of cross-validation (CV) by Stone (1974). Well known examples are the standard CV error and the negative of the average predictive likelihood. Geisser and Eddy (1979) proposed to maximize Π_{i=1}^{N} p(t(i) | x(i), Z_N^{(i)}, M_j) (known as Geisser's surrogate Predictive Probability (GPP)) by synthesizing Bayesian and PSR methodology in the context of (parametrized) model selection. Here, we propose to maximize Π_{i=1}^{N} p(t(i) | x(i), Z_N^{(i)}, θ̄) to estimate θ̄, where Z_N^{(i)} is obtained from Z_N by removing the ith sample. Note that p(t(i) | x(i), Z_N^{(i)}, θ̄) is nothing but the predictive distribution p(y(i) | x(i), Z_N^{(i)}, θ̄) evaluated at y(i) = t(i). Also, we introduce the notion of Geisser's Predictive mean square Error (GPE), defined as (1/N) Σ_{i=1}^{N} E((y(i) - t(i))²) (where the expectation is taken with respect to p(y(i) | x(i), Z_N^{(i)}, θ̄)), and propose to estimate θ̄ by minimizing GPE. \n\n2.1 Expressions for GPP and its gradient \n\nThe objective function corresponding to GPP is given by \n\nG(θ̄) = -(1/N) Σ_{i=1}^{N} log p(t(i) | x(i), Z_N^{(i)}, θ̄)   (6) \n\nFrom (3) and (4) we get \n\nG(θ̄) = (1/N) Σ_{i=1}^{N} (t(i) - ŷ(i))² / (2σ_y²(i)) + (1/(2N)) Σ_{i=1}^{N} log σ_y²(i) + (1/2) log 2π   (7) \n\nwhere ŷ(i) = [c̄_i^{(i)}]^T [C̄_N^{(i)}]^{-1} t_N^{(i)} and σ_y²(i) = C̄_{ii} - [c̄_i^{(i)}]^T [C̄_N^{(i)}]^{-1} c̄_i^{(i)}. Here, C̄_N^{(i)} is the (N-1) x (N-1) matrix obtained from C̄_N by removing the ith column and ith row. Similarly, t_N^{(i)} and c̄_i^{(i)} are obtained from t_N and c̄_i (i.e., the ith column of C̄_N) respectively by removing the ith element. Then, G(θ̄) and its gradient can be computed efficiently using the following result. 
\n\nTheorem 1 The objective function G(θ̄) under the Gaussian Process model is given by \n\nG(θ̄) = (1/(2N)) Σ_{i=1}^{N} q_N²(i)/c̄_{ii} - (1/(2N)) Σ_{i=1}^{N} log c̄_{ii} + (1/2) log 2π   (8) \n\nwhere c̄_{ii} denotes the ith diagonal entry of C̄_N^{-1} and q_N(i) denotes the ith element of q_N = C̄_N^{-1} t_N. Its gradient is given by \n\n∂G(θ̄)/∂θ_j = (1/(2N)) Σ_{i=1}^{N} (1 + q_N²(i)/c̄_{ii}) (s_{j,i}/c̄_{ii}) + (1/N) Σ_{i=1}^{N} q_N(i) (r_j(i)/c̄_{ii})   (9) \n\nwhere s_{j,i} = c̄_i^T (∂C̄_N/∂θ_j) c̄_i, r_j = -C̄_N^{-1} (∂C̄_N/∂θ_j) q_N, and c̄_i denotes the ith column of the matrix C̄_N^{-1}. \n\nThus, using (8) and (9) we can compute GPP and its gradient. We will give meaningful interpretations to the different terms shortly. \n\n2.2 Expressions for the CV function and its gradient \n\nWe define the CV function as \n\nH(θ̄) = (1/N) Σ_{i=1}^{N} (t(i) - ŷ(i))²   (10) \n\nwhere ŷ(i) is the mean of the conditional predictive distribution as given above. Now, using the following result we can compute H(θ̄) efficiently. \n\nTheorem 2 The CV function H(θ̄) under the Gaussian model is given by \n\nH(θ̄) = (1/N) Σ_{i=1}^{N} q_N²(i)/c̄_{ii}²   (11) \n\nand its gradient is given by \n\n∂H(θ̄)/∂θ_j = (2/N) Σ_{i=1}^{N} (q_N(i)/c̄_{ii}²) (r_j(i) + q_N(i) s_{j,i}/c̄_{ii})   (12) \n\nwhere s_{j,i}, r_j, q_N(i) and c̄_{ii} are as defined in Theorem 1. \n\n2.3 Expressions for GPE and its gradient \n\nThe GPE function is defined as \n\nG_E(θ̄) = (1/N) Σ_{i=1}^{N} ∫ (t(i) - y(i))² p(y(i) | x(i), Z_N^{(i)}, θ̄) dy(i)   (13) \n\nwhich can be readily simplified to \n\nG_E(θ̄) = (1/N) Σ_{i=1}^{N} (t(i) - ŷ(i))² + (1/N) Σ_{i=1}^{N} σ_y²(i)   (14) \n\nOn comparing (14) with (10), we see that while the CV error minimizes the deviation from the predictive mean, GPE also takes the predictive variance into account. Now, the gradient can be written as \n\n∂G_E(θ̄)/∂θ_j = ∂H(θ̄)/∂θ_j + (1/N) Σ_{i=1}^{N} s_{j,i}/c̄_{ii}²   (15) \n\nwhere we have used the results σ_y²(i) = 1/c̄_{ii}, ∂c̄_{ii}/∂θ_j = e_i^T (∂C̄_N^{-1}/∂θ_j) e_i and ∂C̄_N^{-1}/∂θ_j = -C̄_N^{-1} (∂C̄_N/∂θ_j) C̄_N^{-1}. 
Here, e_i denotes the ith column vector of the identity matrix I_N. \n\n2.4 Interpretations \n\nMore insight can be obtained by reparametrizing the covariance function as follows: \n\nC̄(x(i), x(j); θ̄) = σ² (ã_0 + ã_1 Σ_{p=1}^{M} x_p(i) x_p(j) + ṽ_0 exp(-(1/2) Σ_{p=1}^{M} w_p (x_p(i) - x_p(j))²) + δ_{i,j})   (16) \n\nwhere a_0 = σ² ã_0, a_1 = σ² ã_1 and v_0 = σ² ṽ_0. Let us define P(x(i), x(j); θ̄) = (1/σ²) C̄(x(i), x(j); θ̄). Then P_N^{-1} = σ² C̄_N^{-1}. Therefore, c̄_{i,j} = p̄_{i,j}/σ², where c̄_{i,j}, p̄_{i,j} denote the (i,j)th elements of the matrices C̄_N^{-1} and P_N^{-1} respectively. From Theorem 2 (see (10) and (11)) we have t(i) - ŷ(i) = q_N(i)/c̄_{ii} = q̃_N(i)/p̄_{ii}. Then, we can rewrite (8) as \n\nG(θ̄) = (1/(2Nσ²)) Σ_{i=1}^{N} q̃_N²(i)/p̄_{ii} - (1/(2N)) Σ_{i=1}^{N} log p̄_{ii} + (1/2) log 2πσ²   (17) \n\nHere, q̃_N = P_N^{-1} t_N and p̄_i, p̄_{ii} denote, respectively, the ith column and the ith diagonal entry of the matrix P_N^{-1}. Now, by setting the derivative of (17) with respect to σ² to zero, we can infer the noise level as \n\nσ̂² = (1/N) Σ_{i=1}^{N} q̃_N²(i)/p̄_{ii}   (18) \n\nSimilarly, the CV error (10) can be rewritten as \n\nH(θ̄) = (1/N) Σ_{i=1}^{N} q̃_N²(i)/p̄_{ii}²   (19) \n\nNote that H(θ̄) depends only on the ratios of the hyperparameters (i.e., on ã_0, ã_1, ṽ_0) apart from the ARD parameters. Therefore, we cannot infer the noise level uniquely. However, we can estimate the ARD parameters and the ratios ã_0, ã_1, ṽ_0. Once we have estimated these parameters, we can use (18) to estimate the noise level. Next, we note that the noise level preferred by the GPE criterion is zero. To see this, first let us rewrite (14) under the reparametrization as \n\nG_E(θ̄) = (1/N) Σ_{i=1}^{N} q̃_N²(i)/p̄_{ii}² + (σ²/N) Σ_{i=1}^{N} 1/p̄_{ii}   (20) \n\nSince q̃_N(i) and p̄_{ii} are independent of σ², it follows that GPE prefers zero as the noise level, which need not be true. 
Therefore, this approach can be applied when either the noise level is known or a good estimate of it is available. \n\n3 Simulation results \n\nWe carried out simulations on four data sets. We considered MacKay's robot arm problem and its modified version introduced by Neal (1996). We used the same data set as MacKay (2 inputs and 2 outputs), with 200 examples in the training set and 200 in the test set. This data set is referred to as 'data set 1' in Table 1. Next, to evaluate the ability of the predictive approaches in estimating the ARD parameters, we carried out simulations on the robot arm data with 6 inputs (Neal's version), denoted as 'data set 2' in Table 1. This data set was generated by adding four further inputs: two copies of the two original inputs corrupted by additive zero mean Gaussian noise of standard deviation 0.02, and two irrelevant Gaussian noise inputs with zero mean and unit variance (Williams and Rasmussen (1996)). The performance measures chosen were the average Test Set Error (TSE, normalized by the true noise level of 0.0025) and the average negative logarithm of predictive probability (NLPP) (computed from the Gaussian density function with (3) and (4)). Friedman's (1991) data sets 1 and 2 were based on the problem of predicting impedance and phase respectively from four parameters of an electrical circuit. Training sets of three different sizes (50, 100, 200) with a signal-to-noise ratio of about 3:1 were replicated 100 times and, for each training set (at each sample size N), the scaled integral squared error (ISE = ∫_D (y(x) - ŷ(x))² dx / var_D y(x)) and NLPP were computed using 5000 data points randomly generated from a uniform distribution over D (Friedman (1991)). In the case of GPE (denoted as GE in the tables), we used a noise level estimate generated from a Gaussian distribution with mean NL_T (the true noise level) and standard deviation 0.03 NL_T. 
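The computations above lean on the single-inverse leave-one-out identities behind (8), (11) and (18), namely t(i) - ŷ(i) = q_N(i)/c̄_ii and σ_y²(i) = 1/c̄_ii. A small numerical check (illustrative NumPy code, not from the paper; the squared-exponential test covariance and jitter value are arbitrary choices) compares them against brute-force leave-one-out:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 12
X = rng.normal(size=(N, 3))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

# An arbitrary fixed covariance C_bar = C_N + sigma2*I_N
# (squared-exponential part of (5), sigma2 = 0.01).
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
C_bar = np.exp(-0.5 * d2) + 0.01 * np.eye(N)

Cinv = np.linalg.inv(C_bar)
q = Cinv @ t                 # q_N = C_bar^{-1} t_N, as in Theorem 1
c_diag = np.diag(Cinv)       # c_ii, the diagonal of C_bar^{-1}

# Theorem 2: CV error from a single inverse, H = (1/N) sum_i q_N(i)^2 / c_ii^2.
H_fast = np.mean((q / c_diag) ** 2)

# Brute-force leave-one-out, as in (10): refit on N-1 points for every i.
res = []
for i in range(N):
    mask = np.arange(N) != i
    Ci = C_bar[np.ix_(mask, mask)]
    ki = C_bar[mask, i]
    yhat = ki @ np.linalg.solve(Ci, t[mask])            # LOO mean, cf. (3)
    res.append(t[i] - yhat)
    var = C_bar[i, i] - ki @ np.linalg.solve(Ci, ki)    # LOO variance, cf. (4)
    assert abs(var - 1.0 / c_diag[i]) < 1e-8            # sigma_y^2(i) = 1/c_ii
H_slow = np.mean(np.array(res) ** 2)
assert abs(H_fast - H_slow) < 1e-8                      # matches (11)
```

This is why a single O(N³) inverse per gradient step suffices, rather than N separate (N-1)-point fits.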
In the case of CV, we estimated the hyperparameters in the reparametrized form and estimated the noise level using (18). In the case of MAP (denoted as MP in the tables), we used the same prior given in Rasmussen (1996). \n\nTable 1: Results on robot arm data sets. Average of normalized test set error (TSE) and negative logarithm of predictive probability (NLPP) for various methods. \n\n     Data Set: 1      Data Set: 2 \n     TSE    NLPP      TSE    NLPP \nML   1.126  -1.512    1.131  -1.512 \nMP   1.131  -1.489    1.181  -1.511 \nGp   1.115  -1.516    1.116  -1.524 \nCV   1.112  -1.514    1.146  -1.518 \nGE   1.111  -1.524    1.112  -1.524 \n\nTable 2: Results on Friedman's data sets. Average of scaled integral squared error and negative logarithm of predictive probability (given in brackets) for different training sample sizes and various methods. \n\n     Data Set: 1                            Data Set: 2 \n     N = 50       N = 100      N = 200      N = 50       N = 100      N = 200 \nML   0.43 (7.24)  0.19 (6.71)  0.10 (6.49)  0.26 (1.05)  0.16 (0.82)  0.11 (0.68) \nMP   0.42 (7.18)  0.22 (6.78)  0.12 (6.56)  0.25 (1.01)  0.16 (0.82)  0.11 (0.69) \nGp   0.47 (7.29)  0.20 (6.65)  0.10 (6.44)  0.33 (1.25)  0.20 (0.86)  0.12 (0.70) \nCV   0.55 (7.27)  0.22 (6.67)  0.10 (6.44)  0.42 (1.36)  0.21 (0.91)  0.13 (0.70) \nGE   0.35 (7.10)  0.15 (6.60)  0.08 (6.37)  0.28 (1.20)  0.18 (0.85)  0.12 (0.63) \n\nThe GPP approach is denoted as Gp in the tables. For all these methods, the conjugate gradient (CG) algorithm (Rasmussen (1996)) was used to optimize the hyperparameters. The termination criterion (relative function error) with a tolerance of 10^{-7} was used, but with a constraint on the maximum number of CG iterations set to 100. In the case of the robot arm data sets, the algorithm was run with ten different initial conditions and the best solution (chosen from the respective best objective function value) is reported. 
The optimization was carried out separately for the two outputs, and the results reported are the average TSE and NLPP. In the case of Friedman's data sets, the optimization algorithm was run with three different initial conditions and the best solution was picked. When N = 200, the optimization algorithm was run with only one initial condition. For all the data sets, both the inputs and outputs were normalized to zero mean and unit variance. \n\nFrom Table 1, we see that the performances (both TSE and NLPP) of the predictive approaches are better than those of the ML and MAP approaches for both data sets. In the case of data set 2, we observed that, like the ML and MAP methods, all the predictive approaches rightly identified the irrelevant inputs. The performance of the GPE approach is the best on the robot arm data and demonstrates the usefulness of this approach when a good noise level estimate is available. In the case of Friedman's data set 1 (see Table 2), the important observation is that the performances (both ISE and NLPP) of the GPP and CV approaches are relatively poor at low sample size (N = 50) and improve very well as N increases. Note that the performances of the predictive approaches are better than those of the ML and MAP methods from N = 100 onwards (see NLPP). Again, GPE gives the best performance, and its performance at low sample size (N = 50) is also quite good. In the case of Friedman's data set 2, the ML and MAP approaches perform better than the predictive approaches, except for GPE. The performances of GPP and CV improve as N increases and are very close to those of the ML and MAP methods when N = 200. Next, it is clear that the MAP method gives the best performance at low sample size. This behavior, we believe, arises because the prior plays an important role at low sample size and hence is very useful. 
Also, note that unlike on data set 1, the performance of GPE is inferior to the ML and MAP approaches at low sample sizes and improves over these approaches (see NLPP) as N increases. This suggests that knowledge of the noise level alone is not the only issue. The basic issue, we think, is that the predictive approaches estimate the predictive performance of a given model from the training samples. Clearly, the quality of this estimate becomes better as N increases. Also, knowing the noise level improves the quality of the estimate. \n\n4 Discussion \n\nSimulation results indicate that the size N required to get good estimates of predictive performance is problem dependent. When N is sufficiently large, we find that the predictive approaches perform better than the ML and MAP approaches. The sufficient number of samples can be as low as 100, as is evident from our results on Friedman's data set 1. Also, the MAP approach is the best when N is very low. As one would expect, the performances of the ML and MAP approaches become nearly the same as N increases. The comparison with existing approaches indicates that the predictive approaches developed here are strongly competitive. The overall cost of computing the function and the gradient (for all three predictive approaches) is O(M N³). The cost of making predictions is the same as that for the ML and MAP methods. The proofs of the results and detailed simulation results will be presented in another paper (Sundararajan and Keerthi, 1999). \n\nReferences \n\nFriedman, J.H. (1991) Multivariate Adaptive Regression Splines, Annals of Statistics, 19, 1-141. \n\nGeisser, S. (1975) The Predictive Sample Reuse Method with Applications, Journal of the American Statistical Association, 70, 320-328. \n\nGeisser, S., and Eddy, W.F. (1979) A Predictive Approach to Model Selection, Journal of the American Statistical Association, 74, 153-160. \n\nMacKay, D.J.C. 
(1997) Gaussian Processes - A Replacement for Neural Networks?, available in Postscript via URL http://www.wol.ra.phy.cam.ac.uk/mackayj. \n\nNeal, R.M. (1996) Bayesian Learning for Neural Networks, New York: Springer-Verlag. \n\nNeal, R.M. (1997) Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification, Tech. Rep. No. 9702, Dept. of Statistics, University of Toronto. \n\nRasmussen, C. (1996) Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression, Ph.D. Thesis, Dept. of Computer Science, University of Toronto. \n\nStone, M. (1974) Cross-Validatory Choice and Assessment of Statistical Predictions (with discussion), Journal of the Royal Statistical Society, Ser. B, 36, 111-147. \n\nSundararajan, S., and Keerthi, S.S. (1999) Predictive Approaches for Choosing Hyperparameters in Gaussian Processes, submitted to Neural Computation, available at: http://guppy.mpe.nus.edu.sg/~mpessk/gp/gp.html. \n\nWilliams, C.K.I., and Rasmussen, C.E. (1996) Gaussian Processes for Regression. In Advances in Neural Information Processing Systems 8, ed. by D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo. MIT Press. \n", "award": [], "sourceid": 1767, "authors": [{"given_name": "S.", "family_name": "Sundararajan", "institution": null}, {"given_name": "S.", "family_name": "Keerthi", "institution": null}]}