{"title": "Hidden Common Cause Relations in Relational Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1345, "page_last": 1352, "abstract": null, "full_text": "Hidden Common Cause Relations in\n\nRelational Learning\n\nRicardo Silva\u2217\n\nGatsby Computational Neuroscience Unit\n\nUCL, London, UK WC1N 3AR\nrbas@gatsby.ucl.ac.uk\n\nWei Chu\n\nCenter for Computational Learning Systems\nColumbia University, New York, NY 10115\n\nchuwei@cs.columbia.edu\n\nZoubin Ghahramani\n\nDepartment of Engineering\n\nUniversity of Cambridge, UK CB2 1PZ\n\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nWhen predicting class labels for objects within a relational database, it is often\nhelpful to consider a model for relationships: this allows for information between\nclass labels to be shared and to improve prediction performance. However, there\nare different ways by which objects can be related within a relational database.\nOne traditional way corresponds to a Markov network structure: each existing\nrelation is represented by an undirected edge. This encodes that, conditioned on\ninput features, each object label is independent of other object labels given its\nneighbors in the graph. However, there is no reason why Markov networks should\nbe the only representation of choice for symmetric dependence structures. Here\nwe discuss the case when relationships are postulated to exist due to hidden com-\nmon causes. We discuss how the resulting graphical model differs from Markov\nnetworks, and how it describes different types of real-world relational processes.\nA Bayesian nonparametric classi\ufb01cation model is built upon this graphical repre-\nsentation and evaluated with several empirical studies.\n\n1 Contribution\n\nPrediction problems, such as classi\ufb01cation, can be easier when class labels share a sort of relational\ndependency that is not accounted by the input features [10]. 
If the variables to be predicted are attributes of objects in a relational database, such dependencies are often postulated from the relations that exist in the database. This paper proposes and evaluates a new method for building classifiers that uses information concerning the relational structure of the problem.

Consider the following standard example, adapted from [3]. There are different webpages, each one labeled according to some class (e.g., "student page" or "not a student page"). Features such as the word distribution within the body of each page can be used to predict each webpage's class. However, webpages do not exist in isolation: there are links connecting them. Two pages having a common set of links is evidence of similarity between such pages. For instance, if W1 and W3 both link to W2, this is commonly considered to be evidence for W1 and W3 having the same class. One way of expressing this dependency is through the following Markov network [5]:

∗Now at the Statistical Laboratory, University of Cambridge. E-mail: silva@statslab.cam.ac.uk

[Markov network over {F1, C1, F2, C2, F3, C3}: each Fi is linked to its Ci, and C1 − C2 − C3 form a chain]

Here Fi are the features of page Wi, and Ci is its respective page label. Other edges linking F variables to C variables (e.g., F1 − C2) can be added without affecting the main arguments presented in this section. The semantics of the graph, for a fixed input feature set {F1, F2, F3}, are as follows: C1 is marginally dependent on C3, but conditionally independent given C2. Depending on the domain, this might be either a suitable or unsuitable representation of relations. For instance, in some domains it could be the case that the most sensible model would state that C1 is only informative about C3 once we know what C2 is: that is, C1 and C3 are marginally independent, but dependent given C2. 
This can happen if the existence of a relation (Ci, Cj) corresponds to the existence of hidden common causes generating this pair of random variables.

Consider the following example, loosely based on a problem described by [12]. We have three objects, Microsoft (M), Sony (S) and Philips (P). The task is a regression task where we want to predict the stock market price of each company given its profitability from last year. The given relationships are that M and S are direct competitors (due to the videogame console market), as well as S and P (due to the TV set market).

Figure 1: (a) Assumptions that relate Microsoft, Sony and Philips stock prices through hidden common cause mechanisms, depicted as unlabeled gray vertices; (b) A graphical representation for generic hidden common cause relationships by using bi-directed edges; (c) A depiction of the same relationship skeleton by a Markov network model, which has different probabilistic semantics.

It is expected that several market factors that affect stock prices are unaccounted for by the predictor variable Past Year Profit. For example, a shortage of Microsoft consoles is a hidden common factor for both Microsoft's and Sony's stock. Another hidden common cause would be a high price for Sony's consoles. Assume here that these factors have no effect on Philips' stock value. A depiction of several hidden common causes that correspond to the relations Competitor(M, S) and Competitor(S, P) is given in Figure 1(a) as unlabeled gray vertices.

Consider a linear regression model for this setup. 
We assume that for each object Oi ∈ {M, S, P}, the stock price Oi.Stock, centered at the mean, is given by

Oi.Stock = β × Oi.Profit + εi     (1)

where each εi is a Gaussian random variable.

The fact that there are several hidden common causes between M and S can be modeled by the covariance of εm and εs, σms. That is, unlike in standard directed Gaussian models, σms is allowed to be non-zero. The same holds for σsp. Covariances of error terms of unrelated objects should be zero (σmp = 0). This setup is very closely related to the classic seemingly unrelated regression model popular in economics [12].

A graphical representation for this type of model is the directed mixed graph (DMG) [9, 11], with bi-directed edges representing the relationship of having hidden common causes between a pair of vertices. This is shown in Figure 1(b). Contrast this to the Markov network representation in Figure 1(c). The undirected representation encodes that εm and εp are marginally dependent, which does not correspond to our assumptions1. Moreover, the model in Figure 1(b) states that once we observe Sony's stock price, Philips' stock (and profit) should have a non-zero association with Microsoft's profit: this follows from an extension of d-separation to DMGs [9]. This is expected from the assumptions (Philips' stock should tell us something about Microsoft's once we know Sony's, but not before), but does not hold in the graphical model in Figure 1(c). While it is tempting to use Markov networks to represent relational models (free of concerns raised by cyclic directed representations), it is clear that there are problems for which they are not a sensible choice.

This is not to say that Markov networks are not the best representation for large classes of relational problems. 
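The independence pattern just described is easy to verify numerically. In the sketch below (the covariance values are illustrative, not from the paper), the bi-directed structure puts a zero at σmp in the error covariance matrix, while the corresponding entry of the precision matrix is non-zero: εm and εp are marginally independent, but dependent given εs.

```python
import numpy as np

# Error covariance for (eps_m, eps_s, eps_p): sigma_ms and sigma_sp are free,
# sigma_mp = 0 because Microsoft and Philips share no hidden common cause.
# The specific numbers are made up for illustration.
Sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.5],
                  [0.0, 0.5, 1.0]])

Precision = np.linalg.inv(Sigma)

# Marginal independence: zero covariance between eps_m and eps_p ...
print(Sigma[0, 2])        # 0.0
# ... but conditional dependence given eps_s: non-zero precision entry.
print(Precision[0, 2])    # ~0.5 (non-zero)
```

A Gaussian Markov random field would encode the opposite pattern: a zero in the precision matrix, hence marginal dependence of εm and εp.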
Conditional random fields [4] are well-motivated Markov network models for sequence learning. The temporal relationship is closed under marginalization: if we do not measure some steps in the sequence, we will still link the corresponding remaining vertices accordingly, as illustrated in Figure 2. Directed mixed graphs are not a good representation for this sequence structure.

Figure 2: (a) A conditional random field (CRF) graph for sequence data; (b) A hypothetical scenario where two of the time slices are not measured, as indicated by dashed boxes; (c) The resulting CRF graph for the remaining variables, which corresponds to the same criteria for construction of (a).

To summarize, the decision between using a Markov network or a DMG reduces to the following modeling issue: if two unlinked object labels yi, yj are statistically associated when some chain of relationships exists between yi and yj, then the Markov network semantics should apply (as in the case of temporal relationships). However, if the association arises only given the values of the other objects in the chain, then this is accounted for by the dependence semantics of the directed mixed graph representation. The DMG representation propagates training data information through other training points. The Markov network representation propagates training data information through test points. Propagation through training points is relevant in real problems. For instance, in a webpage domain where each webpage has links to pages of several kinds (e.g., [3]), a chain of intermediate points between two class labels yi and yj is likely to be more informative if we know the values of the labels in this chain. 
The respective Markov network would ignore all training points in this chain besides the endpoints.

In this paper, we introduce a non-parametric classification model for relational data that factorizes according to a directed mixed graph. Sections 2 and 3 describe the model and contrast it to a closely related approach which bears a strong analogy to the Markov network formulation. Experiments in text classification are described in Section 4.

2 Model

Chu et al. [2] describe an approach for Gaussian process classification using relational information, which we review and compare to our proposed model.

Previous approach: relational Gaussian processes through indicators − For each point x in the input space X, there is a corresponding function value fx. Given observed input points x1, x2, . . . , xn, a Gaussian process prior over f = [f1, f2, . . . , fn]^T has the shape

P(f) = (2π)^(−n/2) |Σ|^(−1/2) exp(−(1/2) f^T Σ^(−1) f)     (2)

1For Gaussian models, the absence of an edge in the undirected representation (i.e., Gaussian Markov random fields) corresponds to a zero entry in the inverse covariance matrix, whereas in the DMG it corresponds to a zero in the covariance matrix [9].

Figure 3: (a) A prediction problem where y3 is unknown and the training set is composed of the two other datapoints. Dependencies between f1, f2 and f3 are given by a Gaussian process prior and not represented in the picture. 
Indicators \u03beij are known and set to 1; (b) The extra associations that\narise by conditioning on \u03be = 1 can be factorized as the Markov network model here depicted, in the\nspirit of [9]; (c) Our proposed model, which ties the error terms and has origins in known statistical\nmodels such as seemingly unrelated regression and structural equation models [11].\n\nwhere the ijth entry of \u03a3 is given by a Mercer kernel function K(xi, xj) [8].\nThe idea is to start from a standard Gaussian process prior, and add relational information by con-\nditioning on relational indicators. Let \u03beij be an indicator that assumes different values, e.g., 1 or 0.\nThe indicator values are observed for each pair of data points (xi, xj): they are an encoding of the\ngiven relational structure. A model for P (\u03beij = 1|fi, fj) is de\ufb01ned. This evidence is incorporated\ninto the Gaussian process by conditioning on all indicators \u03beij that are positive. Essentially, the idea\nboils down to using P(f|\u03be = 1) as the prior for a Gaussian process classi\ufb01er. Figure 3(a) illus-\ntrates a problem with datapoints {(x1, y1), (x2, y2), (x3, y3)}. Gray vertices represent unobserved\nvariables. Each yi is a binary random variable, with conditional probability given by\n\nP(yi = 1|fi) = \u03a6(fi/\u03c3)\n\n(3)\n\nwhere \u03a6(\u00b7) is the standard normal cumulative function and \u03c3 is a hyperparameter. This can be\ninterpreted as the cumulative distribution of fi + \u0001i, where fi is given and \u0001i is a normal random\nvariable with zero mean and variance \u03c32.\nIn the example of Figure 3(a), one has two relations: (x1, x2), (x2, x3). This information is incorpo-\nrated by conditioning on the evidence (\u03be12 = 1, \u03be23 = 1). Observed points (x1, y1), (x2, y2) form\nthe training set. The prediction task is to estimate y3. 
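The probit link of Equation (3) can be evaluated with the standard normal cumulative distribution function. The minimal sketch below (with σ = 1, an illustrative value) writes Φ in terms of the error function; as noted above, Φ(fi/σ) is the probability that fi + εi exceeds zero when εi is N(0, σ²).

```python
import math

def probit_likelihood(f_i, sigma=1.0):
    """P(y_i = 1 | f_i) = Phi(f_i / sigma), with Phi the standard normal CDF,
    i.e. the probability that f_i + eps_i > 0 for eps_i ~ N(0, sigma^2)."""
    return 0.5 * (1.0 + math.erf(f_i / (sigma * math.sqrt(2.0))))

print(probit_likelihood(0.0))   # 0.5: no evidence either way
print(probit_likelihood(2.0))   # ~0.977: strongly positive latent value
```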
Notice that ξ12 is not used to predict y3: the Markov blanket for f3 includes (f1, f2, ξ23, y3, ε3) and the input features. Essentially, conditioning on ξ = 1 corresponds to a pairwise Markov network structure, as depicted in Figure 3(b) [9]2.

Our approach: mixed graph relational model − Figure 3(c) illustrates our proposed setup. For reasons that will become clear in the sequel, we parameterize the conditional probability of yi as

P(yi = 1 | gi, vi) = Φ(gi/√vi)     (4)

where gi = fi + ζi. As before, Equation (4) can be interpreted as the cumulative distribution of gi + ε*i, with ε*i a normal random variable with zero mean and variance vi = σ² − σ²ζi, the last term being the variance of ζi. That is, we break the original error term as εi = ζi + ε*i, where ε*i and ε*j are independent for all i ≠ j. Random vector ζ is a multivariate normal with zero mean and covariance matrix Σζ. The key aspect of our model is that the covariance of ζi and ζj is non-zero only if objects i and j are related (that is, the bi-directed edge yi ↔ yj is in the relational graph). Parameterizing Σζ for relational problems is non-trivial and is discussed in the next section.

In the example of Figure 3, one noticeable difference between our model 3(c) and the standard Markov network model 3(b) is that now the Markov blanket for f3 includes error terms for all variables (both ε and ζ terms), following the motivation presented in Section 1.

2In the figure, we are not representing explicitly that f1, f2 and f3 are not independent (the prior covariance matrix Σ is complete). 
The figure is meant as a representation of the extra associations that arise when conditioning on ξ = 1, and the way such associations factorize.

As before, the prior for f in our setup is the Gaussian process prior (2). This means that g has the following Gaussian process prior (implicitly conditioned on x):

P(g) = (2π)^(−n/2) |R|^(−1/2) exp(−(1/2) g^T R^(−1) g)     (5)

where R = K + Σζ is the covariance matrix of g = f + ζ, with Kij = K(xi, xj).

3 Parametrizing a mixed graph model for relational classification

For simplicity, in this paper we will consider only relationships that induce positive associations between labels. Ideally, the parameterization of Σζ has to fulfill two desiderata: (i) it should respect the marginal independence constraints encoded by the graphical model (i.e., zero covariance for vertices that are not adjacent), and be positive definite; (ii) it has to be parsimonious in order to facilitate hyperparameter selection, both computationally and statistically. Unlike the multivariate analysis problems in [11], the size of our covariance matrix grows with the number of data points.

As shown by [11], exact inference in models with covariance matrices with zero-entry constraints is computationally demanding. We provide two alternative parameterizations that are not as flexible, but which lead to covariance matrices that are simple to compute and easy to implement. We will work under the transductive scenario, where training and all test points are given in advance. The corresponding graph thus contains unobserved and observed label nodes.

3.1 Method I

The first method is an automated way to relax some of the independence constraints, while guaranteeing positive-definiteness, with a parameterization that depends on a single scalar ρ. 
This allows for more efficient inference and is done as follows:

1. Let Gζ be the corresponding bi-directed subgraph of our original mixed graph, and let U0 be a matrix with n × n entries, n being the number of nodes in Gζ;
2. Set U0ij to be the number of cliques in Gζ where yi and yj appear together;
3. Set U0ii to be the number of cliques containing yi, plus a small constant Δ;
4. Set U to be the corresponding correlation matrix obtained by interpreting U0 as a covariance matrix and rescaling it.

Finally, set Σζ = ρU, where ρ ∈ [0, 1] is a given hyperparameter. Matrix U is always guaranteed to be positive definite: it is equivalent to obtaining the covariance matrix of y from a linear latent variable model, where there is an independent standard Gaussian latent variable as a common parent to every clique, and every observed node yi is given by the sum of its parents plus an independent error term of variance Δ. Marginal independencies are respected, since independent random variables will never be in a same clique in Gζ. In practice, this method cannot be used as is, since the number of cliques will in general grow at an exponential rate as a function of n. Instead, we first triangulate the graph: in this case, extracting cliques can be done in polynomial time. This is a relaxation of the original goal, since some of the original marginal independence constraints will not be enforced due to the triangulation3.

3.2 Method II

The method suggested in the previous section is appealing under the assumption that vertices that appear in many common cliques are more likely to have more hidden common causes, and hence should have stronger associations. However, sometimes the triangulation introduces bad artifacts, with many marginal independence constraints being violated. In this case, this will often result in poor prediction performance. 
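For concreteness, steps 1 to 4 of Method I can be sketched as follows. The maximal cliques of a small, already-triangulated bi-directed subgraph are listed by hand here (a toy example; in the actual method they would be extracted from the triangulated graph):

```python
import numpy as np

def method_one_U(n, cliques, delta=1e-4):
    """Clique-count covariance of Method I: U0[i, j] counts the cliques
    containing both y_i and y_j (the diagonal additionally gets delta);
    U is U0 rescaled to a correlation matrix. Sketch of the construction
    described in the text, not the authors' code."""
    U0 = np.zeros((n, n))
    for c in cliques:
        for i in c:
            for j in c:
                U0[i, j] += 1.0
    U0 += delta * np.eye(n)
    d = np.sqrt(np.diag(U0))
    return U0 / np.outer(d, d)

# Toy graph on 4 labels with maximal cliques {y0, y1, y2} and {y2, y3}.
U = method_one_U(4, [[0, 1, 2], [2, 3]], delta=1e-4)
# Non-adjacent pair (y0, y3): zero correlation, as required.
print(U[0, 3])                            # 0.0
# Positive definite (sum of clique indicators plus delta * I).
print(np.linalg.eigvalsh(U).min() > 0)    # True
```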
A cheap alternative approach is not to generate cliques, and instead to obtain a marginal covariance matrix from a different latent variable model. In this model, we create an independent standard Gaussian variable for each edge yi ↔ yj instead of each clique. No triangulation will be necessary, and all marginal independence constraints will be respected. This, however, has shortcomings of its own: for all pairs (yi, yj) connected by an edge, it will be the case that U0ij = 1, while U0ii can be as large as n. This means that the resulting correlation Uij can be close to zero even if yi and yj are always in the same cliques. In Section 4, we will choose between Methods I and II according to the marginal likelihood of the model.

3The need for an approximation is not a shortcoming only of the DMG approach. Notice that the relational Gaussian process of [2] also requires an approximation of its relational kernel.

Figure 4: (a) The link matrix for the political books dataset. (b) The relational kernel matrix obtained with the approximated Method I. (c) The kernel matrix obtained with Method II, which tends to produce much weaker associations but does not introduce spurious relations.

3.3 Algorithm

Recall that our model is a Gaussian process classifier with error terms εi of variance σ² such that εi = ζi + ε*i. Without loss of generality, we will assume that σ = 1. This results in the following parameterization of the full error covariance matrix:

Σε = (1 − ρ)I + ρU     (6)

where I is an n × n identity matrix. Matrix (1 − ρ)I corresponds to the covariance matrix Σε*.

The usefulness of separating ε into ε* and ζ becomes evident when we use an expectation-propagation (EP) algorithm [7] to perform inference in our relational classifier. Instead of approximating the posterior of f, we approximate the posterior density P(g|D), D = {(x1, y1), . . . , (xn, yn)} being the given training data. The approximate posterior has the form Q(g) ∝ P(g) ∏i t̃i(gi), where P(g) is the Gaussian process prior with kernel matrix R = K + Σζ as defined in the previous section. Since the covariance matrix Σε* is diagonal, the true likelihood of y given g factorizes over each datapoint: P(y|g) = ∏(i=1..n) P(yi|gi), and standard EP algorithms for Gaussian process classification can be used [8] (with the variance given by Σε* instead of Σε, and kernel matrix R instead of K).

The final algorithm defines a whole new class of relational models, depends on a single hyperparameter ρ which can be optimized by grid search in [0, 1], and requires virtually no modification of code written for EP-based Gaussian process classifiers4.

4 Results

We now compare three different methods in relational classification tasks. We will compare a standard Gaussian process classifier (GPC), the relational Gaussian process (RGP) of [2] and our method, the mixed graph Gaussian process (XGP). A linear kernel K(x, z) = x · z is used, as described by [2]. We set Δ = 10^(−4) and the hyperparameter ρ is found by a grid search in the space {0.1, 0.2, 0.3, . . . 
, 1.0} maximizing the approximate EP marginal likelihood5.

4We provide MATLAB/Octave code for our method at http://www.statslab.cam.ac.uk/∼silva.
5For triangulation, we used the MATLAB implementation of the Reverse Cuthill-McKee vertex ordering available at http://people.scs.fsu.edu/∼burkardt/m_src/rcm/rcm.html

Table 1: The averaged AUC scores of citation prediction on test cases of the Cora database are recorded along with standard deviation over 100 trials. "n" denotes the number of papers in one class. "Citations" denotes the citation count within the two paper classes.

Group   n          Citations   GPC             GPC with Citations   XGP
5vs1    346/488    2466        0.905 ± 0.031   0.891 ± 0.022        0.945 ± 0.053
5vs2    346/619    3417        0.900 ± 0.032   0.905 ± 0.044        0.933 ± 0.059
5vs3    346/1376   3905        0.863 ± 0.040   0.893 ± 0.017        0.883 ± 0.013
5vs4    346/646    2858        0.916 ± 0.030   0.887 ± 0.018        0.951 ± 0.042
5vs6    346/281    1968        0.887 ± 0.054   0.843 ± 0.076        0.955 ± 0.041
5vs7    346/529    2948        0.869 ± 0.045   0.867 ± 0.041        0.926 ± 0.076

4.1 Political books

We consider first a simple classification problem where the goal is to classify whether a particular book is of liberal political inclination or not. The features of each book are given by the words on the Amazon.com front page for that particular book. The choice of books, labels, and relationships is given in the data collected by Valdis Krebs and available at http://www-personal.umich.edu/∼mejn/netdata. The data containing book features can be found at http://www.statslab.cam.ac.uk/∼silva. 
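As an illustration of the quantities entering these experiments, the sketch below assembles an XGP-style prior covariance R = K + Σζ from a linear kernel and a Method II-style edge matrix, together with the diagonal noise (1 − ρ)I of Equation (6). The data, edge list and hyperparameter values are toy assumptions, not the political books dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho, delta = 5, 3, 0.4, 1e-4

X = rng.normal(size=(n, d))          # toy input features
K = X @ X.T                          # linear kernel K(x, z) = x . z

# Method II: one independent standard Gaussian latent per bi-directed edge,
# so U0[i, j] = 1 on edges and U0[i, i] = degree(i) + delta.
edges = [(0, 1), (1, 2), (3, 4)]     # toy relational structure
U0 = delta * np.eye(n)
for i, j in edges:
    U0[[i, j], [i, j]] += 1.0        # each edge adds 1 to both diagonals
    U0[i, j] += 1.0
    U0[j, i] += 1.0
dd = np.sqrt(np.diag(U0))
U = U0 / np.outer(dd, dd)            # rescale to a correlation matrix

R = K + rho * U                      # prior covariance of g = f + zeta
noise = (1.0 - rho) * np.eye(n)      # diagonal covariance of the eps-star terms

# Unrelated pairs keep zero covariance in Sigma_zeta:
print(U[0, 3])                            # 0.0
print(np.linalg.eigvalsh(R).min() > 0)    # True (K is PSD, rho * U is PD)
```

EP-based GP classification code can then be run with kernel matrix R and per-point noise from `noise`, as described in Section 3.3.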
There are 105 books, 43 of which are labeled as liberal books. The relationships are pairs of books which are frequently purchased together by the same customer. Notice this is an easy problem, where labels are strongly associated if they share a relationship. We performed evaluation by sampling 100 times from the original pool of books, assigning half of them as training data. The evaluation criterion was the area under the curve (AUC) for this binary problem. This is a problem where Method I is suboptimal. Figure 4(a) shows the original binary link matrix. Figure 4(b) depicts the corresponding U0 matrix obtained with Method I, where entries closer to red correspond to stronger correlations. Method II gives a better performance here (Method I was better in the next two experiments). The AUC result for GPC was 0.92, while both RGP and XGP achieved 0.98 (the difference between XGP and GPC having a standard deviation of 0.02).

4.2 Cora

The Cora collection [6] contains over 50,000 computer science research papers, including bibliographic citations. We used a subset in our experiment. The subset consists of 4,285 machine learning papers categorized into 7 classes. The second column of Table 1 shows the class sizes. Each paper was preprocessed as a bag-of-words, a vector of "term frequency" components scaled by "inverse document frequency", and then normalized to unit length. This follows the pre-processing used in [2]. There is a total of 20,082 features. For each class, we randomly selected 1% of the labelled samples for training and tested on the remainder. The partition was repeated 100 times. 
We used the fact that the database is composed of fairly specialized papers as an illustration of when XGP might not be as optimal as RGP (whose AUC curves are very close to 1), since the population of links tends to be better separated between different classes (but this also means that the task is fairly easy, and differences disappear very rapidly with increasing sample sizes). The fact that there is very little training data also favors RGP, since XGP propagates information through training points. Still, XGP does better than the non-relational GPC. Notice that adding the citation adjacency matrix as a binary input feature for each paper does not improve the performance of the GPC, as shown in Table 1. Results for other classes are of a similar qualitative nature and not displayed here.

4.3 WebKB

The WebKB dataset consists of homepages from 4 different universities: Cornell, Texas, Washington and Wisconsin [3]. Each webpage belongs to one out of 7 categories: student, professor, course, project, staff, department and "other". The relations come from actual links in the webpages. There is relatively high heterogeneity of types of links in each page: in terms of mixed graph modeling, this linkage mechanism is explained by a hidden common cause (e.g., a student page and a course page are associated because that person's interest in enrolling as a student also creates demand for a course). The heterogeneity also suggests that two unlinked pages should not, on average, have an association if they link to a common page W. However, observing the type of page W might create

Table 2: Comparison of the three algorithms on the task "other" vs. "not-other" in the WebKB domain. Results for GPC and RGP taken from [2]. The same partitions for training and test are used to generate the results for XGP. 
Mean and standard deviation of AUC results are reported.

University   Other   All    Link    GPC             RGP             XGP
Cornell      617     865    13177   0.708 ± 0.021   0.884 ± 0.025   0.917 ± 0.022
Texas        571     827    16090   0.799 ± 0.021   0.906 ± 0.026   0.949 ± 0.015
Washington   939     1205   15388   0.782 ± 0.023   0.877 ± 0.024   0.923 ± 0.016
Wisconsin    942     1263   21594   0.839 ± 0.014   0.899 ± 0.015   0.941 ± 0.018

the association. We compare how the three algorithms perform when trying to predict whether a webpage is of class "other" or not (the other classifications are easier, with smaller differences; results are omitted for space reasons). The proportion of "other" to non-"other" is about 4:1, which makes the area under the curve (AUC) a more suitable measure of success. We used the same 100 subsamples from [2], where 10% of the whole data is sampled from the pool for a specific university, and the remainder is used for testing. We also used the same features as in [2], pre-processed as described in the previous section. The results are shown in Table 2. Both relational Gaussian processes are far better than the non-relational GPC. XGP gives significant improvements over RGP in all four universities.

5 Conclusion

We introduced a new family of relational classifiers by extending a classical statistical model [12] to non-parametric relational classification. This is inspired by recent advances in relational Gaussian processes [2] and Bayesian inference for mixed graph models [11]. We showed empirically that modeling the type of latent phenomena that our approach postulates can sometimes improve prediction performance in problems traditionally approached with Markov network structures.

Several interesting problems can be treated in the future. 
It is clear that there are many different ways\nby which the relational covariance matrix can be parameterized. Intermediate solutions between\nMethods I and II, approximations through matrix factorizations and graph cuts are only a few among\nmany alternatives that can be explored. Moreover, there is a relationship between our model and\nmultiple kernel learning [1], where one of the kernels comes from error covariances. This might\nprovide alternative ways of learning our models, including multiple types of relationships.\nAcknowledgements: We thank Vikas Sindhwani for the preprocessed Cora database.\n\nReferences\n[1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm.\n\n21st International Conference on Machine Learning, 2004.\n\n[2] W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes.\n\nNeural Information Processing Systems, 2006.\n\n[3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to\nextract symbolic knowledge from the World Wide Web. Proceedings of AAAI\u201998, pages 509\u2013516, 1998.\n[4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random \ufb01elds: Probabilistic models for segmenting\n\nand labeling sequence data. 18th International Conference on Machine Learning, 2001.\n\n[5] S. Lauritzen. Graphical Models. Oxford University Press, 1996.\n[6] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of Internet portals\n\nwith machine learning. Information Retrieval Journal, 3:127\u2013163, 2000.\n\n[7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, MIT, 2001.\n[8] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[9] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962\u20131030, 2002.\n[10] P. Sen and L. Getoor. Link-based classi\ufb01cation. 
Report CS-TR-4858, University of Maryland, 2007.\n[11] R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. UAI, 2006.\n[12] A. Zellner. An ef\ufb01cient method of estimating seemingly unrelated regression equations and tests for\n\naggregation bias. Journal of the American Statistical Association, 1962.\n\n\f", "award": [], "sourceid": 751, "authors": [{"given_name": "Ricardo", "family_name": "Silva", "institution": null}, {"given_name": "Wei", "family_name": "Chu", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}]}