{"title": "Learning Gaussian Process Kernels via Hierarchical Bayes", "book": "Advances in Neural Information Processing Systems", "page_first": 1209, "page_last": 1216, "abstract": null, "full_text": " Learning Gaussian Process Kernels via\n Hierarchical Bayes\n\n\n\n Anton Schwaighofer Volker Tresp, Kai Yu\n Fraunhofer FIRST Siemens Corporate Technology\n Intelligent Data Analysis (IDA) Information and Communications\n Kekulestrasse 7, 12489 Berlin 81730 Munich, Germany\n anton@first.fhg.de {volker.tresp,kai.yu}@siemens.com\n\n\n\n Abstract\n\n We present a novel method for learning with Gaussian process regres-\n sion in a hierarchical Bayesian framework. In a first step, kernel matri-\n ces on a fixed set of input points are learned from data using a simple\n and efficient EM algorithm. This step is nonparametric, in that it does\n not require a parametric form of covariance function. In a second step,\n kernel functions are fitted to approximate the learned covariance matrix\n using a generalized Nystrom method, which results in a complex, data\n driven kernel. We evaluate our approach as a recommendation engine\n for art images, where the proposed hierarchical Bayesian method leads\n to excellent prediction performance.\n\n\n1 Introduction\n\nIn many real-world application domains, the available training data sets are quite small,\nwhich makes learning and model selection difficult. For example, in the user preference\nmodelling problem we will consider later, learning a preference model would amount to\nfitting a model based on only 20 samples of a user's preference data. Fortunately, there\nare situations where individual data sets are small, but data from similar scenarios can\nbe obtained. Returning to the example of preference modelling, data for many different\nusers are typically available. This data stems from clearly separate individuals, but we can\nexpect that models can borrow strength from data of users with similar tastes. 
Typically,\nsuch problems have been handled by either mixed effects models or hierarchical Bayesian\nmodelling.\n\nIn this paper we present a novel approach to hierarchical Bayesian modelling in the context\nof Gaussian process regression, with an application to recommender systems. Here, hier-\narchical Bayesian modelling essentially means to learn the mean and covariance function\nof the Gaussian process.\n\nIn a first step, a common collaborative kernel matrix is learned from the data via a simple\nand efficient EM algorithm. This circumvents the problem of kernel design, as no paramet-\nric form of kernel function is required here. Thus, this form of learning a covariance matrix\nis also suited for problems with complex covariance structure (e.g. nonstationarity).\n\nA portion of the learned covariance matrix can be explained by the input features and, thus,\n\n\f\ngeneralized to new objects via a content-based kernel smoother. Thus, in a second step,\nwe generalize the covariance matrix (learned by the EM-algorithm) to new items using a\ngeneralized Nystrom method. The result is a complex content-based kernel which itself\nis a weighted superposition of simple smoothing kernels. This second part could also be\napplied to other situations where one needs to extrapolate a covariance matrix on a finite\nset (e.g. a graph) to a continuous input space, as, for example, required in induction for\nsemi-supervised learning [14].\n\nThe paper is organized as follows. Sec. 2 casts Gaussian process regression in a hierarchical\nBayesian framework, and shows the EM updates to learn the covariance matrix in the first\nstep. Extrapolating the covariance matrix is shown in Sec. 3. We illustrate the function of\nthe EM-learning on a toy example in Sec. 4, before applying the proposed methods as a\nrecommender system for images in Sec. 
4.1.

1.1 Previous Work

In statistics, modelling data from related scenarios is typically done via mixed effects models or hierarchical Bayesian (HB) modelling [6]. In HB, parameters of models for individual scenarios (e.g. users in recommender systems) are assumed to be drawn from a common (hyper)prior distribution, allowing the individual models to interact and regularize each other. Recent examples of HB modelling in machine learning include [1, 2]. In other contexts, this learning framework is called multi-task learning [4]. Multi-task learning with Gaussian processes has been suggested by [8], yet with the rather stringent assumption that one has observations on the same set of points in each individual scenario. Based on sparse approximations of GPs, a more general GP multi-task learner with parametric covariance functions has been presented in [7]. In contrast, the approach presented in this paper only considers covariance matrices (and is thus non-parametric) in the first step. Only in a second extrapolation step does kernel smoothing lead to predictions based on a covariance function that is a data-driven combination of simple kernel functions.

2 Learning GP Kernel Matrices via EM

The learning task we are concerned with can be stated as follows: The data are observations from M different scenarios. In the i-th scenario, we have observations y^i = (y^i_1, ..., y^i_{N^i}) on a total of N^i points, X^i = {x^i_1, ..., x^i_{N^i}}. In order to analyze this data in a hierarchical Bayesian way, we assume that the data for each scenario is a noisy sample of a Gaussian process (GP) with unknown mean and covariance function. 
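For concreteness, this data-generating assumption can be pictured with a small synthetic sketch (NumPy). The squared exponential covariance, point counts, and noise level below are illustrative choices of ours, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared pool of N input points and one shared covariance matrix K and mean m.
N = 50
X = np.sort(rng.uniform(-1, 1, N))
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.1 ** 2)
K += 1e-8 * np.eye(N)  # jitter for numerical positive definiteness
m = np.zeros(N)
sigma2 = 0.01  # observation noise variance

# Each scenario draws its own latent function from the SAME N(m, K), then
# observes a random subset of the points with Gaussian noise; scenarios
# overlap because they share the pool X.
M = 5
scenarios = []
for i in range(M):
    f_i = rng.multivariate_normal(m, K)
    idx = np.sort(rng.choice(N, size=10, replace=False))  # index set I(i)
    y_i = f_i[idx] + np.sqrt(sigma2) * rng.standard_normal(len(idx))
    scenarios.append((idx, y_i))

print(len(scenarios), scenarios[0][1].shape)
```

Each `(idx, y_i)` pair plays the role of (I(i), y^i) in the notation below.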
We assume that mean and covariance function are shared across different scenarios.¹

In the first modelling step presented in this section, we consider transductive learning ("labelling a partially labelled data set"), that is, we are interested in the model's behavior only on points X, with X = ∪_{i=1}^M X^i and cardinality N = |X|. This situation is relevant for most collaborative filtering applications. Thus, test points are the unlabelled points in each scenario. This reduces the whole "infinite dimensional" Gaussian process to its finite dimensional projection on points X, which is an N-variate Gaussian distribution with covariance matrix K and mean vector m. For the EM algorithm to work, we also require that there is some overlap between scenarios, that is, X^i ∩ X^j ≠ ∅ for some i ≠ j. Coming back to the user modelling problem mentioned above, this means that at least some items have been rated by more than one user.

Thus, our first modelling step focusses on directly learning the covariance matrix K and m from the data via an efficient EM algorithm. This may be of particular help in problems where one would need to specify a complex (e.g. nonstationary) covariance function.

¹Alternative HB approaches for collaborative filtering, like that discussed in [5], assume that model weights are drawn from a shared Gaussian distribution.

Following the hierarchical Bayesian assumption, the data observed in each scenario is thus a partial sample from N(y | m, K + σ²1), where 1 denotes the unit matrix. The joint model is simply

    p(m, K) Π_{i=1}^M p(y^i | f^i) p(f^i | m, K),    (1)

where p(m, K) denotes the prior distribution for mean and covariance. We assume a Gaussian likelihood p(y^i | f^i) with diagonal covariance matrix σ²1.

2.1 EM Learning

For the above hierarchical Bayesian model, Eq. (1), the marginal likelihood becomes

    p(m, K) Π_{i=1}^M ∫ p(y^i | f^i) p(f^i | m, K) df^i. 
(2)

To obtain simple and stable solutions when estimating m and K from the data, we consider point estimates of the parameters m and K, based on a penalized likelihood approach with conjugate priors.² The conjugate prior for mean m and covariance K of a multivariate Gaussian is the so-called Normal-Wishart distribution [6], which decomposes into the product of an inverse Wishart distribution for K and a Normal distribution for m,

    p(m, K) = N(m | μ, τ⁻¹K) Wi⁻¹(K | α, U).    (3)

That is, the prior for the Gram matrix K is given by an inverse Wishart distribution with scalar parameter α > (N − 1)/2 and U being a symmetric positive-definite matrix. Given the covariance matrix K, m is Gaussian distributed with mean μ and covariance τ⁻¹K, where τ is a positive scalar. The parameters can be interpreted in terms of an equivalent data set for the mean (this data set has size A, with A = τ, and mean μ) and a data set for the covariance that has size B, with α = (B + N)/2, and covariance S, U = (B/2)S.

In order to write down the EM algorithm in a compact way, we denote by I(i) the set of indices of those data points that have been observed in the i-th scenario, that is, I(i) = {j | j ∈ {1, ..., N} and x_j ∈ X^i}. Keep in mind that in most applications of interest N^i ≪ N, such that most targets are missing in training. K_{I(i),I(i)} denotes the square submatrix of K that corresponds to points I(i), that is, the covariance matrix for points in the i-th scenario. By K_{·,I(i)} we denote the covariance matrix of all N points versus those in the i-th scenario.

2.1.1 E-step

In the E-step, one first computes f̃^i, the expected value of the functional values on all N points for each scenario i. 
The expected value is given by the standard equations for the predictive mean of Gaussian process models, where the covariance functions are replaced by corresponding sub-matrices of the current estimate for K:

    f̃^i = K_{·,I(i)} (K_{I(i),I(i)} + σ²1)⁻¹ (y^i − m_{I(i)}) + m,    i = 1, ..., M.    (4)

Also, covariances between all pairs of points are estimated, based on the predictive covariance for the GP models (ᵀ denotes matrix transpose):

    C̃^i = K − K_{·,I(i)} (K_{I(i),I(i)} + σ²1)⁻¹ K_{·,I(i)}ᵀ,    i = 1, ..., M.    (5)

²An efficient EM-based solution for the case σ² = 0 is also given by [9].

2.1.2 M-step

In the M-step, the vector of mean values m, the covariance matrix K and the noise variance σ² are updated. Denoting the updated quantities by m′, K′, and (σ²)′, we get

    m′ = 1/(M + A) [ Aμ + Σ_{i=1}^M f̃^i ]

    K′ = 1/(M + B) [ A(m′ − μ)(m′ − μ)ᵀ + B·S + Σ_{i=1}^M ( (f̃^i − m′)(f̃^i − m′)ᵀ + C̃^i ) ]

    (σ²)′ = 1/(Σ_{i=1}^M N^i) Σ_{i=1}^M [ ‖y^i − f̃^i_{I(i)}‖² + trace C̃^i_{I(i),I(i)} ]

An intuitive explanation of the M-step is as follows: The new mean m′ is a weighted combination of the prior mean, weighted by the equivalent sample size, and the predictive means. The covariance update is a sum of four terms. The first term is typically irrelevant; it is a result of the coupling of the Gaussian and the inverse Wishart prior distributions via K. The second term contains the prior covariance matrix, again weighted by the equivalent sample size. As the third term, we get the empirical covariance, based on the estimated and measured functional values f̃^i. Finally, the fourth term gives a correction term to compensate for the fact that the functional values f̃^i are only estimates, so that the empirical covariance alone would be too small.

3 Learning the Covariance Function via Generalized Nystrom

Using the EM algorithm described in Sec. 2.1, one can easily and efficiently learn a covariance matrix K and mean vector m from data obtained in different related scenarios. 
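The EM procedure of Sec. 2.1 can be sketched compactly in NumPy. This is an illustrative sketch, not the authors' code: the function name is ours, the iteration count is fixed rather than tested for convergence, and `scenarios` is assumed to be a list of (index set I(i), observations y^i) pairs:

```python
import numpy as np

def em_learn_kernel(scenarios, mu, S, A=1.0, B=1.0, sigma2=0.1, iters=20):
    """EM updates of Sec. 2.1: learn mean vector m, covariance matrix K, and
    noise variance sigma^2 from partially observed scenarios that share one GP.
    scenarios: list of (idx, y) pairs, where idx holds the observed indices I(i)."""
    m, K = mu.copy(), S.copy()          # initialize at the prior mean/covariance
    M = len(scenarios)
    n_obs = sum(len(idx) for idx, _ in scenarios)   # total number of targets
    for _ in range(iters):
        # E-step: GP predictive mean f~^i (Eq. 4) and covariance C~^i (Eq. 5).
        fs, Cs = [], []
        for idx, y in scenarios:
            G = np.linalg.solve(K[np.ix_(idx, idx)] + sigma2 * np.eye(len(idx)),
                                K[:, idx].T)        # (K_II + s^2 1)^-1 K_I,.
            fs.append(m + G.T @ (y - m[idx]))
            Cs.append(K - K[:, idx] @ G)
        # M-step: closed-form updates for m, K and sigma^2.
        m_new = (A * mu + sum(fs)) / (M + A)
        K = (A * np.outer(m_new - mu, m_new - mu) + B * S
             + sum(np.outer(f - m_new, f - m_new) + C for f, C in zip(fs, Cs))
             ) / (M + B)
        sigma2 = sum(np.sum((y - f[idx]) ** 2) + np.trace(C[np.ix_(idx, idx)])
                     for (idx, y), f, C in zip(scenarios, fs, Cs)) / n_obs
        m = m_new
    return m, K, sigma2
```

Note how the E-step is exactly standard GP regression per scenario, with sub-matrices of the current K in place of a covariance function.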
Once K is found, predictions within the set X can easily be made, by appealing to the same equations used in the EM algorithm (Eq. (4) for the predictive mean and Eq. (5) for the covariance). This would, for example, be of interest in a collaborative filtering application with a fixed set of items. In this section we describe how the covariance can be generalized to new inputs z ∉ X.

Note that, in all of the EM algorithm, the content features x^i_j do not contribute at all. In order to generalize the learned covariance matrix, we employ a kernel smoother with an auxiliary kernel function r(·, ·) that takes a pair of content features as input. As a constraint, we need to guarantee that the derived kernel is positive definite, such that straightforward interpolation schemes cannot readily be applied. Thus our strategy is to interpolate the eigenvectors of K instead and subsequently derive a positive definite kernel. This approach is related to the Nystrom method, which is primarily a method for extrapolating eigenfunctions that are only known at a discrete set of points. In contrast to Nystrom, the extrapolating smoothing kernel is not known in our setting and we employ a generic smoothing kernel r(·, ·) instead [12].

Let K = U Λ Uᵀ be the eigendecomposition of covariance matrix K, with a diagonal matrix of eigenvalues Λ and orthonormal eigenvectors U. With V = U Λ^{1/2}, the columns of V are scaled eigenvectors. We now approximate the i-th scaled eigenvector v_i by a Gaussian process with covariance function r(·, ·) and obtain as an approximation of the scaled eigenfunction

    ṽ_i(w) = Σ_{j=1}^N r(w, x_j) b_{i,j}    (6)

with weights b_i = (b_{i,1}, ..., b_{i,N})ᵀ = (R + λI)⁻¹ v_i. R denotes the Gram matrix for the smoothing kernel on all N points. An additional regularization term λI is introduced to stabilize the inverse. 
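This eigenvector smoothing can be sketched in a few lines of NumPy (an illustrative sketch under the assumptions above; the function name and the generic kernel `r` are ours). Summing the products of smoothed scaled eigenfunctions collapses to a closed form in which the eigenvectors never need to be computed:

```python
import numpy as np

def generalized_nystrom(X, K, r, lam=0.1):
    """Extrapolate a learned covariance matrix K on training inputs X to a
    kernel function l(w, z) by smoothing the scaled eigenvectors of K with an
    auxiliary kernel r. Summing over all smoothed eigenfunctions gives
        l(w, z) = r(w)^T (R + lam I)^-1 K (R + lam I)^-1 r(z),
    so the eigendecomposition itself is never needed."""
    N = len(X)
    R = np.array([[r(xi, xj) for xj in X] for xi in X])  # Gram matrix of r
    Rinv = np.linalg.inv(R + lam * np.eye(N))
    Mid = Rinv @ K @ Rinv          # shared middle factor, computed once
    def l(w, z):
        rw = np.array([r(x, w) for x in X])   # vector r(w)
        rz = np.array([r(x, z) for x in X])   # vector r(z)
        return rw @ Mid @ rz
    return l
```

With lam = 0, the returned kernel reproduces K exactly at the training points; larger lam shrinks l towards zero, matching the role of λ as a tuning parameter described below.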
Based on the approximate scaled eigenfunctions, the resulting kernel function is simply

    l(w, z) = Σ_i ṽ_i(w) ṽ_i(z) = r(w)ᵀ (R + λI)⁻¹ K (R + λI)⁻¹ r(z),    (7)

with r(w) = (r(x_1, w), ..., r(x_N, w))ᵀ. R (resp. L) are the Gram matrices at the training data points X for kernel function r (resp. l). λ is a tuning parameter that determines which proportion of K is explained by the content kernel. With λ = 0, L = K is reproduced, which means that all of K can be explained by the content kernel. With λ → ∞, l(w, z) → 0 and no portion of K is explained by the content kernel.³ Also, note that the eigenvectors are only required in the derivation, and do not need to be calculated when evaluating the kernel.⁴

Similarly, one can build a kernel smoother to extrapolate from the mean vector m to an approximate mean function m̂(·). The prediction for a new object v in scenario i thus becomes

    f^i(v) = m̂(v) + Σ_{j∈I(i)} l(v, x_j) α^i_j    (8)

with weights given by α^i = (K_{I(i),I(i)} + σ²I)⁻¹ (y^i − m_{I(i)}).

It is important to note that l has a much richer structure than the auxiliary kernel r. By expanding the expression for l, one can see that l amounts to a data-dependent covariance function that can be written as a superposition of kernels r,

    l(v, w) = Σ_{i=1}^N r(x_i, v) a^w_i    (9)

with input-dependent weights a^w = (R + λI)⁻¹ K (R + λI)⁻¹ r(w).

4 Experiments

We first illustrate the process of covariance matrix learning on a small toy example: Data is generated by sampling from a Gaussian process with the nonstationary "neural network covariance function" [11]. Independent Gaussian noise of variance 10⁻⁴ is added. Input points X are 100 randomly placed points in the interval [−1, 1]. We consider M = 20 scenarios, where each scenario has observations on a random subset X^i of X, with N^i ≈ 0.1N. In Fig. 1(a), each scenario corresponds to one "noisy line" of points.

Using the EM-based covariance matrix learning (Sec. 
2.1) on this data, the nonstationarity of the data no longer poses problems, as Fig. 1 illustrates. The (stationary) covariance matrix shown in Fig. 1(c) was used both as the initial value for K and for the prior covariance S in Eq. (3). While the learned covariance matrix Fig. 1(d) does not fully match the true covariance, it clearly captures the nonstationary effects.

4.1 A Recommendation Engine

As a testbed for the proposed methods, we consider an information filtering task. The goal is to predict individual users' preferences for a large collection of art images⁵, where each user rated a random subset out of a total of 642 paintings, with ratings "like" (+1), "dislike" (−1), or "not sure" (0). In total, ratings from M = 190 users were collected, where each user had rated 89 paintings on average. Each image is also described by a 275-dimensional feature vector (containing correlogram, color moments, and wavelet texture).

³Note that, also if the true interpolating kernel were known, i.e., r = k, and with λ = 0, we obtain l(w, z) = k(w)ᵀ K⁻¹ k(z), which is the approximate kernel obtained with Nystrom.
⁴A related form of kernel matrix extrapolation has been recently proposed by [10].
⁵http://honolulu.dbs.informatik.uni-muenchen.de:8080/paintings/index.jsp

Figure 1: Example to illustrate covariance matrix learning via EM. (a) Training data. (b) True covariance matrix. (c) Initial covariance matrix. (d) Covariance matrix learned via EM. The data shown in (a) was drawn from a Gaussian process with a nonstationary "neural network" covariance function. When initialized with the stationary matrix shown in (c), EM learning resulted in the covariance matrix shown in (d). Comparing the learned matrix (d) with the true matrix (b) shows that the nonstationary structure is captured well.

Fig. 
2(a) shows ROC curves for collaborative filtering when preferences of unrated items within the set of 642 images are predicted. Here, our transductive approach (Eq. (4), "GP with EM covariance") is compared with a collaborative approach using Pearson correlation [3] ("Collaborative Filtering") and an alternative nonparametric hierarchical Bayesian approach [13] ("Hybrid Filter"). All algorithms are evaluated in a 10-fold cross validation scheme (repeated 10 times), where we assume that ratings for 20 items are known for each test user. Based on the 20 known ratings, predictions can be made for all unrated items. We obtain an ROC curve by computing sensitivity and specificity for the proportion of truly liked paintings among the N top ranked paintings, averaged over N. The figure shows that our approach is considerably better than collaborative filtering with Pearson correlation and even gains a (yet small) advantage over the hybrid filtering technique.

Note that the EM algorithm converged⁶ very quickly, requiring about 46 EM steps to learn the covariance matrix K. Also, we found that the performance is rather insensitive with respect to the hyperparameters, that is, the choice of μ, S and the equivalent sample sizes A and B.

⁶S was set by learning a standard parametric GPR model from the preference data of one randomly chosen user, setting kernel parameters via marginal likelihood, and using this model to generate a full covariance matrix for all points.

Figure 2: ROC curves of different methods for predicting user preferences for art images. (a) Transductive methods. (b) Inductive methods.

Fig. 2(b) shows ROC curves for the inductive setting where predictions for items outside the training set are to be made (sometimes referred to as the "new item problem"). Shown is the performance obtained with the generalized Nystrom method (Eq. 
(8), "GP with Generalized Nystrom")⁷, and when predicting user preferences from image features via an SVM with squared exponential kernel ("SVM content-based filtering"). It is apparent that the new approach with the learned kernel is superior to the standard SVM approach. Still, the overall performance of the inductive approach is quite limited. The low-level content features are only very poor indicators for the high-level concept "liking an art image", and inductive approaches in general need to rely on content-dependent collaborative filtering. The purely content-independent collaborative effect, which is exploited in the transductive setting, cannot be generalized to new items; in our model, it can be viewed as correlated noise.

5 Summary and Conclusions

This article introduced a novel method of learning Gaussian process covariance functions from multi-task learning problems, using a hierarchical Bayesian framework. In the hierarchical framework, the GP models for individual scenarios borrow strength from each other via a common prior for mean and covariance. The learning task was solved in two steps: First, an EM algorithm was used to learn the shared mean vector and covariance matrix on a fixed set of points. In a second step, the learned covariance matrix was generalized to new points via a generalized form of the Nystrom method. Our initial experiments, where we used the method as a recommender system for art images, showed very promising results. Also, in our approach, a clear distinction is made between content-dependent and content-independent collaborative filtering.

We expect that our approach will be even more effective in applications where the content features are more powerful (e.g. 
in recommender systems for textual items such as news articles), and allow an even better prediction of user preferences.

Acknowledgements This work was supported in part by the IST Programme of the European Union, under the PASCAL Network of Excellence (EU # 506778).

⁷To obtain the kernel r, we fitted GP user preference models for a few randomly chosen users, with individual ARD weights for each input dimension in a squared exponential kernel. ARD weights for r are taken to be the medians of the fitted ARD weights.

References

[1] Bakker, B. and Heskes, T. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83-99, 2003.

[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[3] Breese, J. S., Heckerman, D., and Kadie, C. Empirical analysis of predictive algorithms for collaborative filtering. Tech. Rep. MSR-TR-98-12, Microsoft Research, 1998.

[4] Caruana, R. Multitask learning. Machine Learning, 28(1):41-75, 1997.

[5] Chapelle, O. and Harchaoui, Z. A machine learning approach to conjoint analysis. In L. Saul, Y. Weiss, and L. Bottou, eds., Neural Information Processing Systems 17. MIT Press, 2005.

[6] Gelman, A., Carlin, J., Stern, H., and Rubin, D. Bayesian Data Analysis. CRC Press, 1995.

[7] Lawrence, N. D. and Platt, J. C. Learning to learn with the informative vector machine. In R. Greiner and D. Schuurmans, eds., Proceedings of ICML 2004. Morgan Kaufmann, 2004.

[8] Minka, T. P. and Picard, R. W. Learning how to learn is learning with point sets, 1999. Unpublished manuscript. Revised 1999.

[9] Schafer, J. L. Analysis of Incomplete Multivariate Data. Chapman & Hall, 1997.

[10] Vishwanathan, S., Guttman, O., Borgwardt, K. M., and Smola, A. Kernel extrapolation, 2005. Unpublished manuscript.

[11] Williams, C. K. Computation with infinite neural networks. 
Neural Computation, 10(5):1203-1216, 1998.

[12] Williams, C. K. I. and Seeger, M. Using the Nystrom method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, eds., Advances in Neural Information Processing Systems 13, pp. 682-688. MIT Press, 2001.

[13] Yu, K., Schwaighofer, A., Tresp, V., Ma, W.-Y., and Zhang, H. Collaborative ensemble learning: Combining collaborative and content-based information filtering via hierarchical Bayes. In C. Meek and U. Kjærulff, eds., Proceedings of UAI 2003, pp. 616-623, 2003.

[14] Zhu, X., Ghahramani, Z., and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of ICML 2003. Morgan Kaufmann, 2003.

Appendix

To derive an EM algorithm for Eq. (2), we treat the functional values f^i in each scenario i as the unknown variables. In each EM iteration t, the parameters to be estimated are θ^(t) = {m^(t), K^(t), σ²^(t)}. In the E-step, the sufficient statistics are computed,

    Σ_{i=1}^M E[ f^i | y^i, θ^(t) ] = Σ_{i=1}^M f̃^{i,(t)}    (10)

    Σ_{i=1}^M E[ f^i (f^i)ᵀ | y^i, θ^(t) ] = Σ_{i=1}^M [ f̃^{i,(t)} (f̃^{i,(t)})ᵀ + C̃^{i,(t)} ]    (11)

with f̃^i and C̃^i defined in Eq. (4) and (5). In the M-step, the parameters are re-estimated as θ^(t+1) = argmax_θ Q(θ | θ^(t)), with

    Q(θ | θ^(t)) = E[ lp(θ | f, y) | y, θ^(t) ],    (12)

where lp stands for the penalized log-likelihood of the complete data,

    lp(θ | f, y) = log Wi⁻¹(K | α, U) + log N(m | μ, τ⁻¹K) + Σ_{i=1}^M log N(f^i | m, K) + Σ_{i=1}^M log N(y^i | f^i_{I(i)}, σ²1).    (13)

Updated parameters are obtained by setting the partial derivatives of Q(θ | θ^(t)) to zero.
", "award": [], "sourceid": 2595, "authors": [{"given_name": "Anton", "family_name": "Schwaighofer", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "Kai", "family_name": "Yu", "institution": null}]}