{"title": "Relational Learning with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 289, "page_last": 296, "abstract": null, "full_text": "Relational Learning with Gaussian Processes\n\nWei Chu\nCCLS\n\nColumbia Univ.\n\nNew York, NY 10115\n\nVikas Sindhwani\nDept. of Comp. Sci.\n\nUniv. of Chicago\nChicago, IL 60637\n\nZoubin Ghahramani\nDept. of Engineering\nUniv. of Cambridge\n\nCambridge, UK\n\nS. Sathiya Keerthi\nYahoo! Research\n\nMedia Studios North\nBurbank, CA 91504\n\nAbstract\n\nCorrelation between instances is often modelled via a kernel function using in-\nput attributes of the instances. Relational knowledge can further reveal additional\npairwise correlations between variables of interest. In this paper, we develop a\nclass of models which incorporates both reciprocal relational information and in-\nput attributes using Gaussian process techniques. This approach provides a novel\nnon-parametric Bayesian framework with a data-dependent covariance function\nfor supervised learning tasks. We also apply this framework to semi-supervised\nlearning. Experimental results on several real world data sets verify the usefulness\nof this algorithm.\n\n1\n\nIntroduction\n\nSeveral recent developments such as the growth of the world wide web and the maturation of ge-\nnomic technologies, have brought new domains of application to machine learning research. Many\nsuch domains involve relational data in which instances have \u201clinks\u201d or inter-relationships be-\ntween them that are highly informative for learning tasks, e.g. (Taskar et al., 2002). For exam-\nple, hyper-linked web-documents are often about similar topics, even if their textual contents are\ndisparate when viewed as bags of words. In document categorization, the citations are important\nas well since two documents referring to the same reference are likely to have similar content. 
In computational biology, knowledge about physical interactions between proteins can supplement genomic data for developing good similarity measures for protein network inference. In such cases, a learning algorithm can greatly benefit by taking into account the global network organization of such inter-relationships rather than relying on input attributes alone.

One simple but general type of relational information can be effectively represented in the form of a graph G = (V, E). The vertex set V represents a collection of input instances (which may contain the labelled inputs as a subset, but is typically a much larger set of instances). The edge set E ⊂ V × V represents the pairwise relations over these input instances. In this paper, we restrict our attention to undirected edges, i.e., reciprocal relations, though directionality may be an important aspect of some relational datasets. These undirected edges provide useful structural knowledge about correlation between the vertex instances. In particular, we allow edges to be of two types, "positive" or "negative", depending on whether the associated adjacent vertices are positively or negatively correlated, respectively. On many problems, only positive edges may be available.

This setting is also applicable to semi-supervised tasks even on traditional "flat" datasets, where the linkage structure may be derived from the input attributes. In graph-based semi-supervised methods, G is typically an adjacency graph constructed by linking each instance (labelled and unlabelled) to its neighbors according to some distance metric in the input space. The graph G then serves as an estimate of the global geometric structure of the data. Many algorithmic frameworks for semi-supervised (Sindhwani et al., 2005) and transductive learning, see e.g.
(Zhou et al., 2004; Zhu et al., 2003), have been derived under the assumption that data points nearby on this graph are positively correlated.

Several methods have been proposed recently to incorporate relational information within learning algorithms, e.g. for clustering (Basu et al., 2004; Wagstaff et al., 2001), metric learning (Bar-Hillel et al., 2003), and graphical modeling (Getoor et al., 2002). The reciprocal relations over input instances essentially reflect the network structure or the distribution underlying the data, which enriches our prior belief of how instances in the entire input space are correlated. In this paper, we integrate relational information with input attributes in a non-parametric Bayesian framework based on Gaussian processes (GP) (Rasmussen & Williams, 2006), which leads to a data-dependent covariance/kernel function. We highlight the following aspects of our approach: 1) We propose a novel likelihood function for undirected linkages and carry out approximate inference using efficient Expectation Propagation techniques under a Gaussian process prior. The covariance function of the approximate posterior distribution defines a relational Gaussian process, hereafter abbreviated as RGP. RGP provides a novel Bayesian framework with a data-dependent covariance function for supervised learning tasks. We also derive explicit formulae for linkage prediction over pairs of test points. 2) When applied to semi-supervised learning tasks involving labelled and unlabelled data, RGP is closely related to the warped reproducing kernel Hilbert space approach of (Sindhwani et al., 2005), with a novel graph regularizer. Unlike many recently proposed graph-based Bayesian approaches, e.g. (Zhu et al., 2003; Krishnapuram et al., 2004; Kapoor et al., 2005), which are mainly transductive by design, RGP delineates a decision boundary in the input space and provides probabilistic induction over unseen test points.
Furthermore, by maximizing the joint evidence of known labels and linkages, we explicitly involve unlabelled data in the model selection procedure. Such a semi-supervised hyper-parameter tuning method can be very useful when there are very few, possibly noisy, labels. 3) On a variety of classification tasks, RGP requires very few labels to provide high-quality generalization on unseen test examples, as compared to standard GP classification that ignores relational information. We also report experimental results on semi-supervised learning tasks, comparing with competitive deterministic methods.

The paper is organized as follows. In section 2 we develop relational Gaussian processes. Semi-supervised learning under this framework is discussed in section 3. Experimental results are presented in section 4. We conclude in section 5.

2 Relational Gaussian Processes

In the standard setting of learning from data, instances are usually described by a collection of input attributes, denoted as a column vector x ∈ X ⊂ R^d. The key idea in Gaussian process models is to introduce a random variable f_x for every point in the input space X. The values of these random variables {f_x}_{x ∈ X} are treated as outputs of a zero-mean Gaussian process. The covariance between f_x and f_z is fully determined by the coordinates of the data pair x and z, and is defined by any Mercer kernel function K(x, z). Thus, the prior distribution over f = [f_{x_1}, ..., f_{x_n}]^T associated with any collection of n points x_1, ..., x_n is a multivariate Gaussian,

P(f) = \frac{1}{(2\pi)^{n/2} \det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} f^T \Sigma^{-1} f \right)    (1)

where Σ is the n × n covariance matrix whose ij-th element is K(x_i, x_j). In the following, we consider the scenario with undirected linkages over a set of instances.

2.1 Undirected Linkages

Let the vertex set V in the relational graph be associated with n input instances x_1, ..., x_n. Consider a set of observed pairwise undirected linkages on these instances, denoted as E = {E_ij}. Each linkage is treated as a Bernoulli random variable, i.e. E_ij ∈ {+1, −1}. Here E_ij = +1 indicates that the instances x_i and x_j are "positively tied", and E_ij = −1 indicates that the instances are "negatively tied". We propose a new likelihood function to capture these undirected linkages, defined as follows:

P_{ideal}(E_{ij} | f_{x_i}, f_{x_j}) = \begin{cases} 1 & \text{if } f_{x_i} f_{x_j} E_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}    (2)

This formulation is for ideal, noise-free cases; it enforces that the variable values joined by positive and negative edges have the same and opposite signs, respectively. In the presence of uncertainty in observing E_ij, we assume the variable values f_{x_i} and f_{x_j} are contaminated with Gaussian noise, which allows some tolerance for noisy observations. The Gaussian noise has zero mean and unknown variance σ².¹ Let N(δ; μ, σ²) denote a Gaussian random variable δ with mean μ and variance σ². Then the likelihood function (2) becomes

P(E_{ij} = +1 | f_{x_i}, f_{x_j}) = \iint P_{ideal}(E_{ij} = +1 | f_{x_i} + \delta_i, f_{x_j} + \delta_j)\, N(\delta_i; 0, \sigma^2)\, N(\delta_j; 0, \sigma^2)\, d\delta_i\, d\delta_j
  = \Phi\!\left(\frac{f_{x_i}}{\sigma}\right) \Phi\!\left(\frac{f_{x_j}}{\sigma}\right) + \left(1 - \Phi\!\left(\frac{f_{x_i}}{\sigma}\right)\right) \left(1 - \Phi\!\left(\frac{f_{x_j}}{\sigma}\right)\right)    (3)

where Φ(z) = \int_{-\infty}^{z} N(\gamma; 0, 1)\, d\gamma. The integral in (3) evaluates the volume of a joint Gaussian in the first and third quadrants, where f_{x_i} and f_{x_j} have the same sign. Note that P(E_ij = −1 | f_{x_i}, f_{x_j}) = 1 − P(E_ij = +1 | f_{x_i}, f_{x_j}), and P(E_ij = +1 | f_{x_i}, f_{x_j}) = P(E_ij = +1 | −f_{x_i}, −f_{x_j}).

Remarks: One may consider other ways to define a likelihood function for the observed edges. For example, we could define P_l(E_ij = +1 | f_{x_i}, f_{x_j}) = 1 / (1 + exp(−ν f_{x_i} f_{x_j})) where ν > 0. However, the computation of the predictive probability (9) and its derivatives becomes complicated with this form. Instead of treating edges as Bernoulli variables, we could consider the graph itself as a random variable; the probability of observing the graph G can then simply be evaluated as P(G|f) = (1/Z) exp(−½ f^T Ψ f), where Ψ is a graph-regularization matrix (e.g. the graph Laplacian) and Z is a normalization factor that depends on the variable values f. Given that there are numerous graph structures over the instances, the normalization factor Z is intractable in general. In the rest of this paper, we use the likelihood function developed in (3).

2.2 Approximate Inference

Combining the Gaussian process prior (1) with the likelihood function (3), we obtain the posterior distribution as follows:

P(f|E) = \frac{1}{P(E)} P(f) \prod_{ij} P(E_{ij} | f_{x_i}, f_{x_j})    (4)

where f = [f_{x_1}, ..., f_{x_n}]^T and ij runs over the set of observed undirected linkages.
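Each factor P(E_ij | f_{x_i}, f_{x_j}) in (4) has the closed form (3). As a quick numerical check, the sketch below (our own function names, not the authors' code) evaluates that link likelihood using a standard-normal CDF built from the error function:

```python
# Illustrative sketch of the noisy link likelihood of Eq. (3):
#   P(E_ij = +1 | f_i, f_j) = Phi(f_i/sigma) Phi(f_j/sigma)
#                           + (1 - Phi(f_i/sigma)) (1 - Phi(f_j/sigma))
import math

def std_normal_cdf(z):
    """Phi(z), the standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def link_likelihood(e_ij, f_i, f_j, sigma=1.0):
    """P(E_ij | f_i, f_j) for e_ij in {+1, -1}, Gaussian noise of variance sigma^2."""
    p_i = std_normal_cdf(f_i / sigma)
    p_j = std_normal_cdf(f_j / sigma)
    p_pos = p_i * p_j + (1.0 - p_i) * (1.0 - p_j)  # same-sign probability mass
    return p_pos if e_ij == +1 else 1.0 - p_pos
```

The symmetry noted in the text, P(E_ij = +1 | f_i, f_j) = P(E_ij = +1 | −f_i, −f_j), and the complement rule for E_ij = −1 both fall out of this form directly.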
The normalization factor P(E) = \int P(E|f) P(f)\, df is known as the evidence of the model parameters, and serves as a yardstick for model selection.

The posterior distribution is non-Gaussian and multi-modal, with a saddle point at the origin. Clearly the posterior mean is at the origin as well. It is important to note that reciprocal relations update the correlation between examples but never change the individual means. To preserve computational tractability and the true posterior mean, we would rather approximate the posterior distribution by a joint Gaussian centered at the true mean than resort to sampling methods. A family of inference techniques can be applied for the Gaussian approximation. Popular methods include the Laplace approximation, mean-field methods, variational methods, and expectation propagation. It is inappropriate to apply the Laplace approximation in this case, since the posterior distribution is not unimodal and has a saddle point at the true posterior mean. The standard mean-field methods are also hard to use, due to the pairwise relations in the observations. Both variational methods and the expectation propagation (EP) algorithm (Minka, 2001) can be applied here. In this paper, we employ the EP algorithm to approximate the posterior distribution as a zero-mean Gaussian. Importantly, this still captures the posterior covariance structure, allowing prediction of link presence.

The key idea of our EP algorithm is to approximate P(f) \prod_{ij} P(E_{ij} | f_{x_i}, f_{x_j}) by a parametric product distribution² of the form

Q(f) = P(f) \prod_{ij} \tilde{t}(f_{ij}) = P(f) \prod_{ij} s_{ij} \exp\left( -\frac{1}{2} f_{ij}^T \Pi_{ij} f_{ij} \right)

where ij runs over the edge set, f_{ij} = [f_{x_i}, f_{x_j}]^T, and Π_ij is a symmetric 2 × 2 matrix. The parameters {s_ij, Π_ij} in {t̃(f_ij)} are successively optimized by locally minimizing the Kullback-Leibler divergence,

\tilde{t}(f_{ij})^{new} = \arg\min_{\tilde{t}(f_{ij})} \, KL\!\left( \frac{Q(f)}{\tilde{t}(f_{ij})^{old}}\, P(E_{ij}|f_{ij}) \;\Big\|\; \frac{Q(f)}{\tilde{t}(f_{ij})^{old}}\, \tilde{t}(f_{ij}) \right)    (5)

Since Q(f) is in the exponential family, this minimization can be solved simply by moment matching up to the second order. At the equilibrium, the EP algorithm returns a Gaussian approximation to the posterior distribution,

P(f|E) \approx N(0, A)    (6)

where A = (Σ^{-1} + Π)^{-1}, Π = \sum_{ij} \check{\Pi}_{ij}, and \check{\Pi}_{ij} is an n × n matrix with four non-zero entries augmented from Π_ij. Note that the matrix Π can be very sparse. The normalization factor of this Gaussian approximation serves as an approximate model evidence, which can be explicitly written as

P(E) \approx \frac{|A|^{1/2}}{|\Sigma|^{1/2}} \prod_{ij} s_{ij}    (7)

The detailed updating formulae are omitted here to save space. The approximate evidence (7) is an upper bound on the true value of P(E) (Wainwright et al., 2005). Its partial derivatives with respect to the model parameters can be derived analytically (Seeger, 2003), and a gradient-based procedure can then be employed for hyperparameter tuning. Although the EP algorithm is known to work quite well in practice, there is no guarantee of convergence to the equilibrium in general.

¹ We could specify different noise levels for weighted edges. In this paper, we focus on unweighted edges only.
² The likelihood function we defined could also be approximated by a Gaussian mixture of two symmetric components, but the difficulty is that the number of components grows exponentially after multiplication.
Opper and Winther (2005) proposed expectation consistent (EC) approximation as a new framework that requires two tractable distributions to match on a set of moments. We plan to investigate the EC algorithm as future work.

2.3 Data-dependent Covariance Function

After the approximate inference outlined above, the posterior process conditioned on E is explicitly given by a modified covariance function, defined in the following proposition.

Proposition: Given (6), for any finite collection of data points X, the latent random variables {f_x}_{x ∈ X} conditioned on E have a multivariate normal distribution N(0, \tilde{\Sigma}), where \tilde{\Sigma} is the covariance matrix whose elements are given by evaluating the kernel function \tilde{K}(x, z): X × X → R for x, z ∈ X, given by

\tilde{K}(x, z) = K(x, z) - k_x^T (I + \Pi \Sigma)^{-1} \Pi\, k_z    (8)

where I is the n × n identity matrix, k_x is the column vector [K(x_1, x), ..., K(x_n, x)]^T, Σ is the n × n covariance matrix of the vertex set V obtained by evaluating the base kernel K, and Π is defined as in (6).

A proof of this proposition involves some simple matrix algebra and is omitted for brevity. RGP is obtained by a Bayesian update of a standard GP using relational knowledge; it is closely related to the warped reproducing kernel Hilbert space approach (Sindhwani et al., 2005), with the novel graph regularizer Π in place of the standard graph Laplacian. Alternatively, we could simply employ the standard graph Laplacian as an approximation of the matrix Π. This efficient approach has been studied by (Sindhwani et al., 2007) for semi-supervised classification problems.

2.4 Linkage Prediction

Given a RGP, the joint distribution of the random variables f_rs = [f_{x_r}, f_{x_s}]^T, associated with a test pair x_r and x_s, is a Gaussian as well.
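The covariance update of Eq. (8), and the positive-link probability it induces (the arcsine formula (9) below), reduce to a few lines of linear algebra. The following is a minimal numpy sketch under our own naming, assuming the site-precision sum Π has already been produced by EP:

```python
# Sketch of Eq. (8): K~(x, z) = K(x, z) - k_x^T (I + Pi Sigma)^{-1} Pi k_z,
# and of the induced link probability 1/2 + arcsin(rho)/pi (Eq. (9)).
# Illustrative only; `warped_kernel` and `positive_link_prob` are our names.
import math
import numpy as np

def warped_kernel(base_k, X, Pi):
    """Return the data-dependent kernel K~ given vertex set X and matrix Pi."""
    n = len(X)
    Sigma = np.array([[base_k(a, b) for b in X] for a in X])  # n x n base covariance
    M = np.linalg.solve(np.eye(n) + Pi @ Sigma, Pi)           # (I + Pi Sigma)^{-1} Pi

    def k_tilde(x, z):
        kx = np.array([base_k(a, x) for a in X])
        kz = np.array([base_k(a, z) for a in X])
        return base_k(x, z) - kx @ M @ kz
    return k_tilde

def positive_link_prob(k_tilde, xr, xs):
    """P(E_rs = +1 | E) = 1/2 + arcsin(rho)/pi from the updated correlation rho."""
    rho = k_tilde(xr, xs) / math.sqrt(k_tilde(xr, xr) * k_tilde(xs, xs))
    return 0.5 + math.asin(rho) / math.pi
```

With Π = 0 (no observed edges) the warped kernel reduces to the base kernel, and a positive-semidefinite Π can only shrink the prior variance K~(x, x), as expected of a posterior.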
The linkage predictive distribution P(f_rs|E) can be explicitly written as a zero-mean bivariate Gaussian N(f_rs; 0, \tilde{\Sigma}_{rs}) with covariance matrix

\tilde{\Sigma}_{rs} = \begin{bmatrix} \tilde{K}(x_r, x_r) & \tilde{K}(x_r, x_s) \\ \tilde{K}(x_s, x_r) & \tilde{K}(x_s, x_s) \end{bmatrix}

where \tilde{K} is defined as in (8). The predictive probability of having a positive edge can be evaluated as

P(E_{rs}|E) = \int P_{ideal}(E_{rs}|f_{rs})\, N(f_{rs}; 0, \tilde{\Sigma}_{rs})\, df_{x_r}\, df_{x_s}

which can be simplified as

P(E_{rs}|E) = \frac{1}{2} + \frac{\arcsin(\rho\, E_{rs})}{\pi}    (9)

where \rho = \tilde{K}(x_r, x_s) / \sqrt{\tilde{K}(x_s, x_s)\, \tilde{K}(x_r, x_r)}. This essentially evaluates the updated correlation between f_{x_r} and f_{x_s} after we learn from the observed linkages.

3 Semi-supervised Learning

We now apply the RGP framework to semi-supervised learning, where a large collection of unlabelled examples is available and labelled data is scarce. Unlabelled examples often identify data clusters or low-dimensional data manifolds. It is commonly assumed that the labels of points within a cluster, or nearby on a manifold, are highly correlated (Chapelle et al., 2003; Zhu et al., 2003). To apply RGP, we construct positive reciprocal relations between examples within a K-nearest neighborhood. K can be heuristically set to the minimal number of nearest neighbors that yields a connected graph over the labelled and unlabelled examples, i.e. one in which there is a path between each pair of nodes. Learning on these constructed relational data results in a RGP as described in the previous section (see section 4.1 for an illustration). With the RGP as our new prior, supervised learning can be carried out in a straightforward way.
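The connected-graph heuristic for choosing K can be sketched as follows; this is an illustrative implementation with hypothetical helper names, assuming Euclidean distances and a symmetrized K-nearest-neighbor graph:

```python
# Find the smallest K whose K-NN graph over all (labelled + unlabelled)
# points is connected. Illustrative sketch; not the authors' code.
import math

def knn_edges(points, k):
    """Undirected edges linking each point to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def is_connected(n, edges):
    """Depth-first search from node 0; connected iff all n nodes are reached."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def minimal_connecting_k(points):
    """Smallest K whose K-NN graph over all points is connected."""
    for k in range(1, len(points)):
        if is_connected(len(points), knn_edges(points, k)):
            return k
    return len(points) - 1
```

For two well-separated clusters, K = 1 typically leaves the clusters disconnected, so the heuristic increases K until a bridging edge appears.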
In the following we focus on binary classification, but the procedure is also applicable to regression, multi-class classification, and ranking.

Given a set of labelled pairs {z_ℓ, y_ℓ}_{ℓ=1}^m where y_ℓ ∈ {+1, −1}, the Gaussian process classifier (Rasmussen & Williams, 2006) relates the variable f_{z_ℓ} at z_ℓ to the label y_ℓ through a probit noise model, i.e. P(y_ℓ | f_{z_ℓ}) = Φ(y_ℓ f_{z_ℓ} / σ_n), where Φ is the cumulative normal and σ_n² specifies the label noise level. Combining the probit likelihood with the RGP prior defined by the covariance function (8), we have the posterior distribution

P(f_ℓ | Y, E) = \frac{1}{P(Y|E)} P(f_ℓ | E) \prod_{ℓ} P(y_ℓ | f_{z_ℓ})

where f_ℓ = [f_{z_1}, ..., f_{z_m}]^T, P(f_ℓ|E) is a zero-mean Gaussian with an m × m covariance matrix \tilde{\Sigma}_ℓ whose entries are defined by (8), and P(Y|E) is the normalization factor. The posterior distribution can be approximated as a Gaussian as well, denoted N(μ, C), and the quantity P(Y|E) can be evaluated accordingly (Seeger, 2003). The predictive distribution of the variable f_{z_t} at a test case z_t then becomes a Gaussian, i.e. P(f_{z_t} | Y, E) ≈ N(μ_t, σ_t²), where μ_t = k_t^T \tilde{\Sigma}_ℓ^{-1} μ and σ_t² = \tilde{K}(z_t, z_t) − k_t^T (\tilde{\Sigma}_ℓ^{-1} − \tilde{\Sigma}_ℓ^{-1} C \tilde{\Sigma}_ℓ^{-1}) k_t, with k_t = [\tilde{K}(z_1, z_t), ..., \tilde{K}(z_m, z_t)]^T. One can compute the Bernoulli distribution over the test label y_t by

P(y_t | Y, E) = \Phi\left( \frac{\mu_t}{\sqrt{\sigma_n^2 + \sigma_t^2}} \right)    (10)

To summarize, we first incorporate linkage information into a standard GP, which leads to a RGP, and then perform standard inference with the RGP as the prior in supervised learning.
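Once μ_t and σ_t² are available, Eq. (10) is a one-line computation; a hedged sketch (our own helper name, with the default label noise σ_n² = 10⁻⁴ used in the experiments below):

```python
# Sketch of Eq. (10): P(y_t = +1 | Y, E) = Phi( mu_t / sqrt(sigma_n^2 + sigma_t^2) ).
import math

def label_prob(mu_t, var_t, var_n=1e-4):
    """Predictive probability of the positive class at a test point."""
    z = mu_t / math.sqrt(var_n + var_t)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Note how the predictive variance σ_t² acts as a moderation term: the larger the uncertainty at z_t, the closer the class probability is pulled toward 1/2.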
Although we describe RGP in two separate steps, these procedures can be seamlessly merged within the Bayesian framework. As for model selection, it is advantageous to directly use the joint evidence

P(Y, E) = P(Y|E)\, P(E)    (11)

to determine the model parameters (such as the kernel parameter, the edge noise level and the label noise level). Note that P(Y, E) explicitly involves unlabelled data in model selection. This can be particularly useful when labelled data is very scarce and possibly noisy.

4 Numerical Experiments

We start with a synthetic case to illustrate the proposed algorithm (RGP), and then verify the usefulness of this approach on three real world data sets. Throughout the experiments, we consistently compare with the standard Gaussian process classifier (GPC). RGP and GPC differ in the prior only. We employ the linear kernel K(x, z) = x · z or the Gaussian kernel K(x, z) = \exp(-\frac{\kappa}{2} \|x - z\|^2), and shift the origin of the kernel space to the empirical mean, i.e.

K(x, z) - \frac{1}{n} \sum_i K(x, x_i) - \frac{1}{n} \sum_i K(z, x_i) + \frac{1}{n^2} \sum_i \sum_j K(x_i, x_j)

where n is the number of available labelled and unlabelled data. The centralized kernel is then used as the base kernel in our experiments. The label noise level σ_n² in the GPC and RGP models is fixed at 10⁻⁴. The edge noise level σ² of the RGP models is usually varied from 5 to 0.05. The optimal setting of σ² and of the κ in the Gaussian kernel is determined by the joint evidence (11) in each trial.
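On a finite sample, the centering above is the usual Gram-matrix centering H K H with H = I − 11ᵀ/n; a short numpy sketch with our own helper name:

```python
# Center a Gram matrix K so the kernel-space origin is the empirical mean.
# Equivalent, entrywise, to K(x,z) - (1/n) sum_i K(x,x_i)
#                                  - (1/n) sum_i K(z,x_i) + (1/n^2) sum_ij K(x_i,x_j).
import numpy as np

def centered_gram(K):
    """Return H K H with H = I - (1/n) 1 1^T (the centering projector)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H
```

For the linear kernel K = X Xᵀ this coincides with building the Gram matrix from mean-subtracted inputs, which is a convenient correctness check.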
When constructing undirected K-nearest-neighbor graphs, K is fixed at the minimal integer required to have a connected graph.

[Figure 1 appears here; its panels show (a) the predictive curves P(+1|x) and P(−1|x), (b) log P(E) and log P(E, Y) as functions of κ, and (c) the learnt prior covariance matrix.]

Figure 1: Results on the synthetic dataset. The 30 samples drawn from the Gaussian mixture are presented as dots in (a), and the two labelled samples are indicated by a diamond and a circle respectively. The best κ value is marked by the cross in (b). The curves in (a) present the semi-supervised predictive distributions. The prior covariance matrix of RGP learnt from the data is presented in (c).

Table 1: The four universities are Cornell University, the University of Texas at Austin, the University of Washington and the University of Wisconsin. The numbers of categorized Web pages and undirected linkages in the four-university dataset are listed in the "Web & Link Number" columns. The averaged AUC scores of label prediction on unlabelled cases are recorded along with standard deviations over 100 trials.

Task    Web & Link Number                 Student or Not                              Other or Not
Univ.   Stud  Other  All   Link    GPC          LapSVM       RGP            GPC          LapSVM       RGP
Corn.   128   617    865   13177   0.825±0.016  0.987±0.008  0.989±0.009    0.708±0.021  0.865±0.038  0.884±0.025
Texa.   148   571    827   16090   0.899±0.016  0.994±0.007  0.999±0.001    0.799±0.021  0.932±0.026  0.906±0.026
Wash.   126   939    1205  15388   0.839±0.018  0.957±0.014  0.961±0.009    0.782±0.023  0.828±0.025  0.877±0.024
Wisc.   156   942    1263  21594   0.883±0.013  0.976±0.029  0.992±0.008    0.839±0.014  0.812±0.030  0.899±0.015

4.1 Demonstration

Suppose samples are distributed as a Gaussian mixture with two components in one-dimensional space, e.g. 0.4 · N(−2.5, 1) + 0.6 · N(2.0, 1). We randomly collected 30 samples from this distribution, shown as dots on the x axis of Figure 1(a). With K = 3, there are 56 "positive" edges over these 30 samples. We fixed σ² = 1 for all the edges, and varied the parameter κ from 0.01 to 10. At each setting, we carried out the Gaussian approximation by EP as described in section 2.2. Based on the approximate model evidence P(E) (7), presented in Figure 1(b), we located the best κ = 0.4. Figure 1(c) presents the posterior covariance function K̃ (8) at this optimal setting. Compared to the data-independent prior covariance function defined by the Gaussian kernel, the posterior covariance function captures the density information of the unlabelled samples. The pairs within the same cluster become positively correlated, whereas the pairs across the two clusters turn out to be negatively correlated. This is learnt without any explicit assumption on the density distributions. Given two labelled samples, one per class, indicated by the diamond and the circle in Figure 1(a), we carried out supervised learning on the basis of the new prior K̃, as described in section 3. The joint model evidence P(Y|E)P(E) is plotted in Figure 1(b). The corresponding predictive distribution (10) with the optimal κ = 0.4 is presented in Figure 1(a). Note that the decision boundary of the standard GPC would be around x = 1.
We observed that our decision boundary shifts significantly toward the low-density region, which respects the geometry of the data.

4.2 The Four University Dataset

We considered a subset of the WebKB dataset for categorization tasks.³ The subset, collected from the Web sites of the computer science departments of four universities, contains 4160 pages and 9998 hyperlinks interconnecting them. These pages have been manually classified into seven categories: student, course, faculty, staff, department, project and other. The text content of each Web page was preprocessed into a bag-of-words vector of "term frequency" components scaled by "inverse document frequency", which was used as the input attributes. The length of each document vector was normalized to unity. The hyperlinks were translated into 66249 undirected "positive" linkages over the pages, under the assumption that two pages are likely to be positively correlated if they are hyper-linked by the same hub page. Note there are no "negative" linkages in this case. We considered two classification tasks, student vs. non-student and other vs. non-other, for each of the four universities. The numbers of samples and linkages for the four universities are listed in Table 1. We randomly selected 10% of the samples as labelled data and used the remaining samples as unlabelled data. The selection was repeated 100 times.
The linear kernel was used as the base kernel in these experiments. We conducted this experiment in a transductive setting, where the entire linkage data was used to learn the RGP model, and comparisons were made with GPC for predicting the labels of unlabelled samples. We also make comparisons with a discriminant kernel approach to semi-supervised learning, the Laplacian SVM (Sindhwani et al., 2005), using the linear kernel and a graph-Laplacian-based regularizer. We recorded the average AUC for predicting labels of unlabelled cases in Table 1.⁴ Our RGP models significantly outperform the GPC models by incorporating the linkage information in modelling.

³ The dataset comes from the Web→KB project, see http://www-2.cs.cmu.edu/~webkb/.

[Figure 2 appears here; panels (a) and (b) plot the area under the ROC curve on test data against the percentage of labelled data in the training samples.]

Figure 2: Test AUC results of the two semi-supervised learning tasks, PCMAC in (a) and USPS in (b). The grouped boxes from left to right represent the results of GPC, LapSVM, and RGP respectively at different percentages of labelled samples over 100 trials. The notched boxes have lines at the lower quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to the most extreme data value within 1.5 interquartile ranges. Outliers are data with values beyond the ends of the whiskers, and are displayed as dots.
RGP is very competitive with LapSVM on "Student or Not", and yields better results on 3 out of the 4 tasks of "Other or Not". As future work, it would be interesting to utilize weighted linkages and to compare with other graph kernels.

4.3 Semi-supervised Learning

We chose a binary classification problem from the 20-newsgroups dataset: 985 PC documents vs. 961 MAC documents. The documents were preprocessed, in the same way as in the previous section, into vectors with 7510 elements. We randomly selected 1460 documents as training data and tested on the remaining 486 documents. We gradually varied the percentage of labelled data from 0.1% to 10%, and at each percentage repeated the random selection of labelled data 100 times. We used the linear kernel in the RGP and GPC models. With K = 4, we got 4685 edges over the 1460 training samples. The test results on the 486 documents are presented in Figure 2(a) as a boxplot. Model parameters for LapSVM were tuned using cross-validation with 50 labelled samples, since it is difficult for discriminant kernel approaches to carry out cross-validation when labelled samples are scarce. Our algorithm yields much better results than GPC and LapSVM, especially when the fraction of labelled data is less than 5%. When the labelled samples are few (a typical case in semi-supervised learning), cross-validation becomes hard to use, whereas our approach provides Bayesian model selection via the model evidence.

The U.S. Postal Service dataset (USPS) of handwritten digits consists of 16 × 16 gray-scale images. We focused on constructing a classifier to distinguish digit 3 from digit 5. We used the training/test split generated and used by (Lawrence & Jordan, 2005) in our experiment, for comparison purposes. This partition contains 1214 training samples (556 samples of digit 3 and 658 samples of digit 5) and 326 test samples. With K = 3, we obtained 2769 edges over the 1214 training samples.
We randomly picked a subset of the training samples as labelled data and treated the remaining samples as unlabelled. We gradually varied the percentage of labelled data from 0.1% to 10%, and at each percentage repeated the selection of labelled data 100 times. In this experiment, we employed the Gaussian kernel, varied the edge noise level σ² from 5 to 0.5, and tried the following values for κ: [0.001, 0.0025, 0.005, 0.0075, 0.01, 0.025, 0.05, 0.075, 0.1]. The optimal values of κ and σ² were decided by the joint evidence P(Y, E) (11). We report the error rate and AUC on the 326 test data in Figure 2(b) as a boxplot, along with the test results of GPC and LapSVM. When the percentage of labelled data is less than 5%, our algorithm achieved greatly better performance than GPC, and very competitive results compared with LapSVM (tuned with 50 labelled samples), even though RGP used fewer labelled samples in model selection. Comparing with the performance of the transductive SVM (TSVM) and the null category noise model for binary classification (NCNM) reported in (Lawrence & Jordan, 2005), we are encouraged to see that our approach outperforms TSVM and NCNM in this experiment.

⁴ AUC stands for the area under the Receiver Operating Characteristic (ROC) curve.

5 Conclusion

We developed a Bayesian framework to learn from relational data based on Gaussian processes. The resulting relational Gaussian processes provide a unified data-dependent covariance function for many learning tasks. We applied this framework to semi-supervised learning and validated the approach on several real world datasets. While this paper has focused on modelling symmetric (undirected) relations, the relational Gaussian process framework can be generalized to asymmetric (directed) relations as well as to multiple classes of relations. Recently, Yu et al.
(2006) have represented each relational pair by a tensor product of the attributes of the associated nodes, and have further proposed efficient algorithms. This is a promising direction.

Acknowledgements

W. Chu is partly supported by a research contract from Consolidated Edison. We thank Dengyong Zhou for sharing the preprocessed Web-KB data.

References

Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. Proceedings of the International Conference on Machine Learning (pp. 11–18).

Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 59–68).

Chapelle, O., Weston, J., & Schölkopf, B. (2003). Cluster kernels for semi-supervised learning. Neural Information Processing Systems 15 (pp. 585–592).

Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2002). Learning probabilistic models of link structure. Journal of Machine Learning Research, 3, 679–707.

Kapoor, A., Qi, Y., Ahn, H., & Picard, R. (2005). Hyperparameter and kernel learning for graph-based semi-supervised classification. Neural Information Processing Systems 18.

Krishnapuram, B., Williams, D., Xue, Y., Carin, L., Hartemink, A., & Figueiredo, M. (2004). On semi-supervised classification. Neural Information Processing Systems (NIPS).

Lawrence, N. D., & Jordan, M. I. (2005). Semi-supervised learning via Gaussian processes. Advances in Neural Information Processing Systems 17 (pp. 753–760).

Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. Ph.D. thesis, Massachusetts Institute of Technology.

Opper, M., & Winther, O. (2005). Expectation consistent approximate inference. Journal of Machine Learning Research, 6, 2177–2204.

Rasmussen, C. E., & Williams, C. K. I.
(2006). Gaussian processes for machine learning. The MIT Press.

Seeger, M. (2003). Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. Doctoral dissertation, University of Edinburgh.

Sindhwani, V., Chu, W., & Keerthi, S. S. (2007). Semi-supervised Gaussian process classification. Twentieth International Joint Conference on Artificial Intelligence. To appear.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005). Beyond the point cloud: from transductive to semi-supervised learning. Proceedings of the 22nd International Conference on Machine Learning (pp. 825–832).

Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data. Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained k-means clustering with background knowledge. Proceedings of the International Conference on Machine Learning (pp. 577–584).

Wainwright, M. J., Jaakkola, T., & Willsky, A. S. (2005). A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51, 2313–2335.

Yu, K., Chu, W., Yu, S., Tresp, V., & Xu, Z. (2006). Stochastic relational models for discriminative link prediction. Advances in Neural Information Processing Systems. To appear.

Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. Advances in Neural Information Processing Systems 16 (pp. 321–328).

Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions.
Proceedings of the 20th International Conference on Machine Learning.