{"title": "Stochastic Relational Models for Discriminative Link Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1553, "page_last": 1560, "abstract": null, "full_text": "Stochastic Relational Models for Discriminative Link Prediction\nKai Yu, NEC Laboratories America, Cupertino, CA 95014\nWei Chu, CCLS, Columbia University, New York, NY 10115\nShipeng Yu, Volker Tresp, Zhao Xu, Siemens AG, Corporate Research & Technology, 81739 Munich, Germany\n\nAbstract\nWe introduce a Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed. The key idea is to model the stochastic structure of entity relationships (i.e., links) via a tensor interaction of multiple GPs, each defined on one type of entity. These models in fact define a set of nonparametric priors on infinite dimensional tensor matrices, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is propagated to the dependency of entities, reflected by the adapted GP kernels. The framework offers a discriminative approach to link prediction, namely, predicting the existence, strength, or type of relationships based on the partially observed linkage network as well as the attributes of entities (if given). We discuss properties and variants of SRM and derive an efficient learning algorithm. Very encouraging experimental results are achieved on a toy problem and a user-movie preference link prediction task. Finally, we discuss extensions of SRM to general relational learning tasks.\n\n1 Introduction\n\nRelational learning concerns the modeling of physical, social, or other phenomena, where rich types of entities interact via complex relational structures. 
Compared to traditional machine learning settings, the entity relationships provide additional structural information. A simple example of a relational setting is a user-movie rating database, which contains user entities with user attributes (e.g., age, gender, education), movie entities with movie attributes (e.g., year, genre, director), and ratings (i.e., relations between users and movies). A typical relational learning problem is entity classification, for example, segmenting users into groups. One may apply the usual clustering or classification methods based on a flat structure of data, where each user is associated with a vector of user attributes. However, a sound model should additionally exploit the user-movie relations and even the movie attributes, since like-minded users tend to give similar ratings to the same movie, and may like (or dislike) movies of similar genres. Relational learning addresses this and similar situations where it is not natural to transform the data into a flat structure. Entity classification in a relational setting has gained considerable attention, for example webpage classification using both textual content and hyperlinks. However, on other occasions the relationships themselves are of central interest. For example, one may want to predict protein-protein interactions, or in another application, user ratings for products. This family of problems has been called link prediction (see footnote 1), which is the primary topic of this paper. In general, link prediction includes link existence prediction (i.e., does a link exist?), link classification (i.e., what type of relationship is it?), and link regression (i.e., how does the user rate the item?). In this paper we propose a family of stochastic relational models (SRM) for link prediction and other relational learning tasks. The key idea of SRM is a stochastic link-wise process induced by a tensor interplay of multiple entity-wise Gaussian processes (GP). 
These models in fact define a set of nonparametric priors on an infinite dimensional tensor matrix, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is propagated to the dependency of entities, reflected by the learned entity-wise GP kernels (i.e., GP covariance functions). SRM is discriminative because training is on a conditional model of links. We present various models of SRM and address the computational issue, which is crucial in link prediction because the number of potential links grows as the product of the entity-set sizes. SRM has shown encouraging results in our experiments. This paper is organized as follows. We introduce the stochastic relational models in Sec. 2, and describe the algorithms for inference and parameter estimation in Sec. 3 and Sec. 4, followed by Sec. 5 on implementation details. Then we discuss the related work in Sec. 6 and report experimental results in Sec. 7, followed by conclusions and extensions in Sec. 8.\n\n2 Stochastic Relational Models\n\nWe first consider pairwise asymmetric links r between entities u ∈ U and v ∈ V. The two sets U and V can be identical or different. We use u or v to represent the attribute vectors of entities, or their identity when entity attributes are unavailable. Note that r_{i,n} ≡ r(u_i, v_n) does not have to be identical to r_{n,i} when U = V, i.e., relationships can be asymmetric. Extensions to more than two entity sets, multi-way relations (i.e., links connecting more than two entities), and symmetric links are straightforward and will be briefly discussed in Sec. 8. 
We assume that the observable links r are derived as local measurements of a real-valued latent relational function t : U × V → ℝ, and each link r_{i,n} is solely dependent on its latent value t_{i,n}, modeled by the likelihood p(r_{i,n} | t_{i,n}). The focus of this paper is a family of stochastic relational processes acting on U × V, the space of entity pairs, to generate the latent relational function t, via a tensor interaction of two independent entity-specific GPs, one acting on U and the other on V. We call them processes because U and V can both encompass an infinite number of entities. Let the relational processes be characterized by hyperparameters Θ = {θ_Σ, θ_Ω}, with θ_Σ for the GP kernel function on U and θ_Ω for the GP kernel function on V. An SRM thus defines a Bayesian prior p(t | Θ) for the latent variables t. Let I be the index set of entity pairs having observed links. The marginal likelihood (also called evidence) under such a prior is\n\np(R_I | Θ) = ∫ ∏_{(i,n)∈I} p(r_{i,n} | t_{i,n}) p(t | Θ) dt,   Θ = {θ_Σ, θ_Ω}   (1)\n\nwhere R_I = {r_{i,n}}_{(i,n)∈I}. We estimate the hyperparameters Θ by maximizing the evidence, which is an empirical Bayesian approach to learning the relational structure of data. Once Θ is determined, we can predict the link for a new pair of entities via marginalization over the posterior p(t | R_I, Θ).\n\n2.1 Choices for the Prior p(t | Θ)\n\nIn order to represent a rich class of link structures, p(t | Θ) should be sufficiently expressive. In the following subsections, we present three cases of p(t | Θ), from specific to general, by gradually extending conventional GP models.\n\n2.1.1 A Brief Introduction to Gaussian Processes\n\nA GP defines a nonparametric prior distribution over functions in Bayesian inference. 
A random real-valued function f : X → ℝ follows a GP prior, denoted by GP(μ, Σ), if for every finite set {x_i}_{i=1}^N, f = {f(x_i)}_{i=1}^N follows a multivariate Gaussian distribution with mean μ = {μ(x_i)}_{i=1}^N and covariance (or kernel) Σ = {Σ(x_i, x_j; θ)}_{i,j=1}^N with parameter θ. Given D = {x_i, y_i}_{i=1}^N, where y_i is a measurement of f(x_i) corrupted by Gaussian noise, one can make predictions via the marginal likelihood p(y | x, D, θ) = ∫ p(y | f, x) p(f | D, θ) df. For non-Gaussian measurement processes, as in classification models, the integral cannot be solved analytically, and approximate inference is required. A comprehensive coverage of GP models can be found in [9].\n\nFootnote 1: We will use \"link\" and \"relationship\" interchangeably throughout this paper.\n\n2.1.2 Hierarchical Gaussian Processes\n\nBy observing the relational data collectively, one may notice that two entities u_i and u_j in U demonstrate correlated relationships to entities in V. For example, two users often show opposite or close opinions on movies, or two hub web pages are co-linked by a set of other authority web pages. In this case, the dependency structure of links can be reduced to a dependency structure of entities in U. A hierarchical GP (HGP) model [13], originally proposed for multi-task learning, can be conveniently applied in such a situation. The model assumes that, for every v ∈ V, its relational function t(·, v) : U → ℝ is an i.i.d. sample drawn from a common entity-wise GP with covariance Σ : U × U → ℝ. This provides a case of p(t | Θ) in an SRM, where Θ = θ_Σ determines the GP kernel function Σ. Optimizing the GP kernel Σ via evidence maximization means learning the dependency of entities in U, summarized over all the entities v ∈ V. HGP has a drawback when applied to link prediction: the model only learns a one-sided structure, while ignoring the dependency in V. 
In particular, the attributes of entities v cannot be incorporated even if they are available. However, for the mentioned applications, it also makes sense to explore the dependency between movies, or the dependency between authority web pages.\n\n2.1.3 Tensor Gaussian Processes\n\nTo overcome the shortcoming of HGP, we consider a more complex SRM, which employs two GP kernel functions Σ : U × U → ℝ and Ω : V × V → ℝ. The model explains the relational dependency through the entity dependencies of both U and V. Let Θ = {θ_Σ, θ_Ω}. We describe a stochastic relational process p(t | Θ) as follows.\n\nDefinition 2.1 (Tensor Gaussian Processes). Given two sets U and V, a collection of random variables {t(u, v) | t : U × V → ℝ} follows a tensor Gaussian process (TGP) if, for all finite sets {u_1, . . . , u_N} and {v_1, . . . , v_M}, the random variables T = {t(u_i, v_n)}, organized into an N × M matrix, have a matrix-variate normal distribution\n\nN_{N×M}(T | B, Σ, Ω) = (2π)^{-MN/2} |Σ|^{-M/2} |Ω|^{-N/2} etr[ -(1/2) Σ^{-1} (T - B) Ω^{-1} (T - B)^T ]\n\ncharacterized by mean B = {b(u_i, v_n)} and positive definite covariance matrices Σ = {Σ(u_i, u_j; θ_Σ)} and Ω = {Ω(v_n, v_m; θ_Ω)}. This random process is denoted as TGP(b, Σ, Ω) (see footnote 2).\n\nIn the above definition, etr[·] is shorthand for exp[trace(·)]. It is easy to see that the model reduces to the HGP model if Ω = I. As a key difference, the new model treats the relational function t as a whole sample from a TGP, instead of being formed by i.i.d. functions as in the HGP model. Let vec(T) = [t_{1,1}, t_{1,2}, . . . , t_{1,M}, t_{2,1}, . . . , t_{2,M}, . . . , t_{N,M}]^T. If T ~ N_{N×M}(T | B, Σ, Ω) with B = 0, then vec(T) ~ N(0, Υ), where the covariance Υ = Σ ⊗ Ω is the Kronecker product of the two covariance matrices [6]. In other words, TGP is in fact a GP for the relational function t, where the kernel function Υ : (U × V) × (U × V) → ℝ is defined via a tensor product of two GP kernels: Cov(t_{i,n}, t_{j,m}) = Υ[(u_i, v_n), (u_j, v_m)] = Σ(u_i, u_j) Ω(v_n, v_m). 
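
As a small numerical illustration of the Kronecker structure described above, the following sketch builds the covariance of vec(T) from two kernel matrices and draws one sample of T. It is only a sketch under stated assumptions: the RBF kernel, the entity positions, the jitter term, and all function names are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(x, length_scale=1.0):
    # Squared-exponential kernel on 1-D inputs (an illustrative kernel choice).
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def tgp_covariance(sigma, omega):
    # Covariance of vec(T) under row-major stacking [t_11, ..., t_1M, t_21, ...].
    return np.kron(sigma, omega)

def sample_tgp(sigma, omega, rng, jitter=1e-8):
    # Draw T = A Z B^T with Sigma = A A^T, Omega = B B^T and Z i.i.d. standard
    # normal, so that Cov(t_in, t_jm) = Sigma_ij * Omega_nm.
    n, m = sigma.shape[0], omega.shape[0]
    a = np.linalg.cholesky(sigma + jitter * np.eye(n))
    b = np.linalg.cholesky(omega + jitter * np.eye(m))
    return a @ rng.standard_normal((n, m)) @ b.T

u = 0.5 * np.arange(4)  # N = 4 hypothetical entities in U
v = 0.5 * np.arange(3)  # M = 3 hypothetical entities in V
sigma, omega = rbf_kernel(u), rbf_kernel(v)
cov = tgp_covariance(sigma, omega)  # (N*M) x (N*M) covariance of vec(T)
T = sample_tgp(sigma, omega, np.random.default_rng(0))
```

Sampling through the two small Cholesky factors avoids ever factorizing the full NM × NM matrix, which is the object that makes naive inference with this prior expensive.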
The model explains the dependence structure of links by the dependence structure of participating entities. It is well known that a linear predictive model has a GP interpretation if its linear weights follow a Gaussian prior. A similar connection exists for TGP.\n\nTheorem 2.2. Let U ⊆ ℝ^P, V ⊆ ℝ^Q, and W ~ N_{P×Q}(0, I_P, I_Q), where I_P denotes a P × P identity matrix and ⟨·,·⟩ denotes the inner product. Then the bilinear function t(u, v) = u^T W v follows TGP(0, Σ, Ω) with Σ(u_i, u_j) = ⟨u_i, u_j⟩ and Ω(v_n, v_m) = ⟨v_n, v_m⟩.\n\nThe proof is straightforward through Cov[t(u_i, v_n), t(u_j, v_m)] = ⟨u_i, u_j⟩⟨v_n, v_m⟩ and E[t(u_i, v_n)] = 0, where E[·] means expectation. In practice, the linear model will help to provide an efficient discriminative approach to link prediction. It appears that link prediction using TGP is almost the same as a normal GP regression or classification, except that the hyperparameters Θ now have two parts, θ_Σ for Σ and θ_Ω for Ω. Unfortunately TGP inference suffers from a serious computational problem: it does not scale well even for a small number of entities. For example, if there is a fixed portion of missing data for pairwise relationships between N u-entities and M v-entities, the number of observations scales as O(NM). Since GP inference has complexity cubic in the size of the data, the complexity O(N^3 M^3) of TGP is computationally prohibitive.\n\nFootnote 2: Hereafter we always assume b(u, v) = 0 in TGP for simplicity.\n\n2.1.4 A Family of Stochastic Processes for Entity Relationships\n\nTo improve the scalability of SRM, and also describe the relational dependency in a way similar to TGP, we propose a family of stochastic processes p(t | Θ) for entity relationships.\n\nDefinition 2.3 (Stochastic Relational Processes). A relational function t : U × V → ℝ is said to follow a stochastic relational process (SRP) if t(u, v) = (1/√d) ∑_{k=1}^d f_k(u) g_k(v), where f_k(u) ~ GP(0, Σ) i.i.d. and g_k(v) ~ GP(0, Ω) i.i.d. We denote t ~ SRP_d(0, Σ, Ω), where d is the degrees of freedom. 
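
The construction in the definition above can be sketched on a finite set of entities as follows. This is a minimal illustration, assuming toy kernel matrices and hypothetical names (sample_srp, the jitter term); it draws d independent function pairs from the two GPs and sums their outer products with the 1/sqrt(d) scaling.

```python
import numpy as np

def sample_srp(sigma, omega, d, rng, jitter=1e-8):
    # t(u_i, v_n) = (1/sqrt(d)) * sum_k f_k(u_i) g_k(v_n),
    # with f_k ~ N(0, Sigma) and g_k ~ N(0, Omega) drawn independently.
    n, m = sigma.shape[0], omega.shape[0]
    a = np.linalg.cholesky(sigma + jitter * np.eye(n))
    b = np.linalg.cholesky(omega + jitter * np.eye(m))
    F = a @ rng.standard_normal((n, d))  # column k holds f_k on the N entities
    G = b @ rng.standard_normal((m, d))  # column k holds g_k on the M entities
    return F @ G.T / np.sqrt(d)

rng = np.random.default_rng(0)
sigma = 0.5 * np.eye(5) + 0.5  # toy kernel on N = 5 entities (hypothetical)
omega = 0.5 * np.eye(4) + 0.5  # toy kernel on M = 4 entities (hypothetical)
T_small = sample_srp(sigma, omega, d=2, rng=rng)
T_large = sample_srp(sigma, omega, d=200, rng=rng)
```

A sampled matrix has rank at most d, so a small d corresponds to a low-rank relational function, while a large d lets the sample fill the whole N × M space.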
Interestingly, there exists an intimate connection between SRP and TGP.\n\nTheorem 2.4. SRP_d(0, Σ, Ω) converges to TGP(0, Σ, Ω) in the limit d → ∞.\n\nProof. By the central limit theorem, for every (u_i, v_n), t_{i,n} ≡ t(u_i, v_n) converges to a Gaussian random variable. Since every f_k and g_k has zero mean and the two are independent, E[t_{i,n}] = 0, and Cov(t_{i,n}, t_{j,m}) = E[t_{i,n} t_{j,m}] = (1/d) { ∑_{k=1}^d E[f_k(u_i) f_k(u_j) g_k(v_n) g_k(v_m)] + ∑_{k≠ℓ} E[f_k(u_i) f_ℓ(u_j) g_k(v_n) g_ℓ(v_m)] } = (1/d) ∑_{k=1}^d E[f_k(u_i) f_k(u_j) g_k(v_n) g_k(v_m)] = Σ(u_i, u_j) Ω(v_n, v_m). The cross terms with k ≠ ℓ vanish because draws with different indices are independent and have zero mean.\n\nThe theorem suggests that there is a constructive definition of TGP, where relationships are formed via interactions between infinitely many samples from two GPs. Moreover, given a sufficiently large d, SRP provides a close approximation to TGP. SRP is a general family of priors for modeling the relationships between entities, in which HGP and TGP are special cases. The generalization offers several advantages: (1) SRP can model symmetric links between the same set of entities. If we build a generative process where U = V, Σ = Ω and f_k = g_k, then on every finite set {u_i}_{i=1}^N, T = {t(u_i, u_j)} is always a symmetric matrix; (2) Given a fixed d, inference with SRP has a much reduced complexity. In Sec. 3 we introduce an inference algorithm that scales as O[d(N^3 + M^3)], which is much less than O(N^3 M^3).\n\n2.2 Choices for the Likelihood p(r_{i,n} | t_{i,n})\n\nThe likelihood term describes the dependency of observable relationships on the latent variables. It should be tailored to the types of observations to be modeled. Here we list three possible situations:\n• Binary Classification: Relationships may take categorical states, e.g., \"cue\" or \"no cue\" in disease-treatment relationship prediction. It is popular to consider binary classification and employ the probit function to model the Bernoulli distribution over class labels, i.e., p(r_{i,n} | t_{i,n}) = Φ(r_{i,n} t_{i,n}), where Φ(·) is the standard normal cumulative distribution function, and r_{i,n} ∈ {-1, +1}.\n• 
Regression: In this case we consider r_{i,n} taking continuous values. For example, one may want to predict the rating of user u for movie v. The corresponding likelihood function is essentially defined by a noise model, e.g., univariate Gaussian noise with variance σ^2 and zero mean.\n• One-class Problem: Sometimes one observes only the presence of links between entities, for example, the hyperlinks between web pages. Under the open-world assumption, if a web page does not link to another, it does not mean that they are unrelated. Therefore, we employ the likelihood p(r_{i,n} | t_{i,n}) = Φ(r_{i,n} t_{i,n} - b) for the observed links r_{i,n} = 1, where b is an offset.\n\n3 Inference with Laplacian Approximation\n\nWe have described the relational model under an SRP prior, of which HGP and TGP are special cases. Now we develop the inference algorithm to compute the sufficient statistics of the posterior distribution of latent variables. Let F = {f_{i,k}}, G = {g_{n,k}}, f_k = [f_{1,k}, . . . , f_{N,k}]^T and g_k = [g_{1,k}, . . . , g_{M,k}]^T, where f_{i,k} = f_k(u_i) and g_{n,k} = g_k(v_n). Then the posterior distribution p(F, G | R_I, Θ) is proportional to the joint distribution of the complete data:\n\np(R_I, F, G | Θ) ∝ ∏_{(i,n)∈I} p( r_{i,n} | (1/√d) ∑_{k=1}^d f_{i,k} g_{n,k} ) exp{ -(1/2) ∑_{k=1}^d ( f_k^T Σ^{-1} f_k + g_k^T Ω^{-1} g_k ) }\n\nExact inference is intractable due to the coupling between f_{i,k} and g_{n,k} in the likelihood term. In this paper we apply the Laplacian approximation, which approximates p(F, G | R_I, Θ) by a multivariate normal distribution q(F, G) specified by its sufficient statistics (means and covariances). In the first step, we compute the means by finding the mode of the posterior,\n\n{F*, G*} = arg min_{F,G} J(F, G),   J(F, G) = -log p(R_I, F, G | Θ)   (2)\n\nWe solve the minimization by the conjugate gradient method. The gradients are calculated by\n\n∂J(F, G)/∂F = (1/√d) S G + Σ^{-1} F,   ∂J(F, G)/∂G = (1/√d) S^T F + Ω^{-1} G,\n\nwhere S ∈ ℝ^{N×M} has elements s_{i,n} = ∂[-log p(r_{i,n} | t_{i,n})]/∂t_{i,n}, with t_{i,n} = ∑_{k=1}^d f_{i,k} g_{n,k}/√d, if (i, n) ∈ I, and s_{i,n} = 0 otherwise. 
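
To make the mode-finding step concrete, here is a hedged sketch of the objective J(F, G) and its gradients for the regression case, assuming a Gaussian noise likelihood (so that the derivative of the negative log-likelihood is (t - r) divided by the noise variance) and assuming the gradients take the form (1/sqrt(d)) S G + Σ^{-1} F and (1/sqrt(d)) S^T F + Ω^{-1} G. The function name, the boolean mask encoding of the observed index set, and the toy data are hypothetical.

```python
import numpy as np

def srm_objective_and_grads(F, G, R, mask, sigma_inv, omega_inv, noise_var=1.0):
    # F: (N, d), G: (M, d); R: (N, M) observed values; mask: True where observed.
    # sigma_inv and omega_inv are assumed symmetric (inverse kernel matrices).
    d = F.shape[1]
    T = F @ G.T / np.sqrt(d)               # latent relational values t_in
    resid = np.where(mask, T - R, 0.0)     # zero contribution from unobserved pairs
    nll = 0.5 * np.sum(resid ** 2) / noise_var          # up to an additive constant
    prior = 0.5 * np.sum(F * (sigma_inv @ F)) + 0.5 * np.sum(G * (omega_inv @ G))
    S = resid / noise_var                  # s_in = d[-log p(r_in | t_in)] / d t_in
    grad_F = S @ G / np.sqrt(d) + sigma_inv @ F
    grad_G = S.T @ F / np.sqrt(d) + omega_inv @ G
    return nll + prior, grad_F, grad_G

rng = np.random.default_rng(0)
N, M, d = 4, 3, 2
F = rng.standard_normal((N, d))
G = rng.standard_normal((M, d))
R = rng.standard_normal((N, M))
mask = rng.random((N, M)) < 0.7
J, grad_F, grad_G = srm_objective_and_grads(F, G, R, mask, np.eye(N), np.eye(M))
```

The returned value and gradients can be passed to any gradient-based optimizer; a finite-difference comparison on a few entries is a cheap way to check the derivation before running a full optimization.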
In the second step we calculate the covariance by C = H^{-1}, where H is the Hessian, i.e., the second-order derivatives of J(F, G) with respect to {F, G}. However, the inverse is prohibitive because H is a huge matrix. To reduce the complexity, we assume that there exist disjoint groups of latent variables, each group being second-order independent of any other at the mode. We factorize the approximating distribution as q(F, G) = ∏_{k=1}^d q(f_k | f_k*, Σ_k) q(g_k | g_k*, Ω_k), where f_k* and g_k* are the solution of Eq. (2), and Σ_k, Ω_k are the covariance matrices. This follows from two facts: (1) Each f_k (or g_k) interacts only indirectly with the other f_ℓ (or g_ℓ), ℓ ≠ k, through the sum ∑_{ℓ≠k} f_ℓ g_ℓ, indicating that the f_k (or g_k) across different k are only loosely dependent on each other, especially for a large d; (2) The dependency between f_{i,k} and g_{n,k} is witnessed via at most one observation in R_I. Therefore we can compute the Hessian for each group separately and obtain the covariances Σ_k = (Λ^{(k)} + Σ^{-1})^{-1} and Ω_k = (Γ^{(k)} + Ω^{-1})^{-1}, where Λ^{(k)} and Γ^{(k)} are diagonal matrices with\n\nλ^{(k)}_{i,i} = ∑_{n:(i,n)∈I} τ_{i,n} g_{n,k}^2 / d,   λ^{(k)}_{i,j} = 0 for i ≠ j   (3)\n\nγ^{(k)}_{n,n} = ∑_{i:(i,n)∈I} τ_{i,n} f_{i,k}^2 / d,   γ^{(k)}_{n,m} = 0 for n ≠ m   (4)\n\nwhere τ_{i,n} = ∂^2[-log p(r_{i,n} | t_{i,n})]/∂t_{i,n}^2. We thus obtain the sufficient statistics F*, G*, {Σ_k} and {Ω_k}. Finally we note that the posterior distribution of {F, G} has many modes (for instance, shuffling the order of the latent dimensions or changing the signs of both f_k and g_k does not change the probability). However, each mode is equally good for constructing the relational function t.\n\n4 Structural Learning by Hyperparameter Estimation\n\nWe assign a hyper prior p(Θ) and estimate Θ by maximizing a penalized marginal likelihood\n\nΘ* = arg max_{Θ = {θ_Σ, θ_Ω}} { log ∫ p(R_I, F, G | Θ) dF dG + log p(Θ) }   (5)\n\nSo far the optimization (5) is quite general. In principle, it allows learning parametric forms of kernel functions Σ(u_i, u_j; θ_Σ) and Ω(v_n, v_m; θ_Ω) that generalize to new entities. 
In this paper we particularly consider a situation where entity attributes are not fully informative or even absent. Therefore we introduce a direct parameterization θ_Σ = Σ, θ_Ω = Ω, and assign conjugate inverse-Wishart priors Σ ~ IW_N(βd, Σ_0) and Ω ~ IW_M(βd, Ω_0), namely\n\np(Σ | βd, Σ_0) ∝ det(Σ)^{-βd/2} etr[ -(βd/2) Σ^{-1} Σ_0 ],   p(Ω | βd, Ω_0) ∝ det(Ω)^{-βd/2} etr[ -(βd/2) Ω^{-1} Ω_0 ]\n\nwhere β > 0 so that βd denotes the degrees of freedom, and Σ_0 and Ω_0 are the base kernels. Then we apply an iterative expectation-maximization (EM) algorithm to solve problem (5). In the E-step, we follow Sec. 3 to compute q(F, G). In the M-step, we update the hyperparameters by maximizing the expected log-likelihood of the complete data\n\nmax_{Σ,Ω} E_q[log p(R_I, F, G | Σ, Ω)] + log p(Σ | βd, Σ_0) + log p(Ω | βd, Ω_0)\n\nwhere E_q[·] is the expectation over q(F, G). Due to the conjugacy of the hyper prior, the maximization has an analytical solution,\n\nΣ = [ βΣ_0 + (1/d) ∑_{k=1}^d (f_k* f_k*^T + Σ_k) ] / (β + 1),   Ω = [ βΩ_0 + (1/d) ∑_{k=1}^d (g_k* g_k*^T + Ω_k) ] / (β + 1).   (6)\n\n5 Implementation Details\n\nThe parameters Σ_0, Ω_0, d and β have to be pre-specified. We let the base kernels have the form Σ_0(u_i, u_j) = (1 - a_Σ) κ_Σ(u_i, u_j) + a_Σ δ_{i,j} and Ω_0(v_n, v_m) = (1 - a_Ω) κ_Ω(v_n, v_m) + a_Ω δ_{n,m}, where 1 ≥ a_Σ, a_Ω ≥ 0, δ is a Dirac delta kernel (δ_{i,j} = 1 if i = j, otherwise δ_{i,j} = 0), and κ_Σ(·,·) and κ_Ω(·,·) are kernel functions defined on entity attributes, which reflect our prior notion of similarities between entities. We use a_Σ and a_Ω to penalize the effects of κ_Σ(·,·) and κ_Ω(·,·), respectively, when entity attributes are deficient. If the attributes are unavailable, we set a_Σ = a_Ω = 1. The dimensionality d should be chosen with care, since too small a d reduces the modeling flexibility. We determine d and β based on the prediction performance on a validation set of links. The learning algorithm iterates the E-step with Eqs. (2), (3), (4), and the M-step with Eq. (6) until convergence. In the experiments of this paper we use p(r_{i,n} | t*_{i,n}) to make predictions, where t*_{i,n} is computed from F* and G*. 
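
The M-step kernel update can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the update averages the posterior second moments of the d latent functions and shrinks them toward the base kernel with a prior weight, here called beta (our name for the prior-strength parameter). The function name and the toy inputs are hypothetical.

```python
import numpy as np

def m_step_kernel(mode, post_covs, base_kernel, beta):
    # mode: (N, d) matrix whose column k is the posterior mode of the k-th function
    # post_covs: list of d posterior covariance matrices, each of shape (N, N)
    # Assumed update: (beta * base + (1/d) * sum_k (f_k f_k^T + cov_k)) / (beta + 1)
    d = mode.shape[1]
    second_moment = mode @ mode.T + sum(post_covs)  # sum over k of E[f_k f_k^T]
    return (beta * base_kernel + second_moment / d) / (beta + 1.0)

F_mode = np.array([[1.0, 0.0], [0.0, 2.0]])  # toy modes for N = 2, d = 2
post_covs = [np.eye(2), np.eye(2)]           # toy posterior covariances
sigma_new = m_step_kernel(F_mode, post_covs, base_kernel=np.eye(2), beta=1.0)
```

With a large beta the updated kernel stays close to the base kernel, while a small beta lets the links dominate; the same function applies to the second entity set by passing the corresponding modes, covariances, and base kernel.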
In a longer version the predictive uncertainty of t_{i,n} will be considered.\n\n6 Related Work\n\nThere is a history of probabilistic relational models (PRM) [8] in machine learning. Getoor et al. [5] introduced link uncertainty and defined a generative model for both entity attributes and links. Recently, [12] and [7] independently introduced an infinite (hidden) relational model to avoid the difficulty of structural learning in PRM by explaining links via a potentially infinite number of hidden states of entities. Since discriminatively trained models generally outperform generative models in prediction tasks, Taskar et al. proposed relational Markov networks (RMNs) for link prediction [11], which describe a conditional distribution of links given entity attributes and other links. RMN has to define a class of potential functions on cliques of random variables based on the observed relational structure. Compared to RMN, SRM is nonparametric because structural information (e.g., cliques as well as the classes of potential functions) is not pre-defined but learned from data. Very recently a GP model was developed to learn from undirected graphs [4], which turns out to be a special rank-one case of SRM with d = 1, Σ = Ω, and f_k = g_k. In another work [1] an SVM using a tensor kernel based on user and item attributes was used to predict user ratings on items, which is similar to our TGP case and suffers from a scalability problem. When attributes are deficient or unavailable, that model does not work well, while SRM can learn informative kernels purely from links (see Sec. 7). SRM is interestingly related to the recent fast maximum-margin matrix factorization (MMMF) in [10]. If we fix Σ and Ω as uninformative Dirac kernels, the mode of our Laplacian approximation is equivalent to the solution of Eq. (5) in [10] with the loss function l(r_{i,n}, t_{i,n}) = -log p(r_{i,n} | t_{i,n}). 
However, SRM significantly differs from MMMF in two important aspects: (1) SRM is a supervised predictive model because entity attributes enter the model by forming informative priors (Σ, Ω) and hyper priors (Σ_0, Ω_0); (2) More importantly, SRM deals with structural learning by adapting the kernels and marginalizing out the latent relational function, while MMMF only estimates the mode of the latent relational function with fixed Dirac kernels.\n\nFigure 1: Link prediction on synthetic data: (a) training data, where black entries are positive links, white entries are negative links, and gray entries are missing; (b) prediction of MMMF (classification rate 0.906); (c) prediction of SRM with noninformative prior (classification rate 0.942); (d) prediction of SRM with informative prior (classification rate 0.965); (e-f) informative Σ_0 and Ω_0; (g-h) learned Σ and Ω with noninformative prior; (i-j) learned Σ and Ω with informative prior.\n\n7 Experiments\n\nSynthetic Data: We generated two sets of entities U = {u_i}_{i=1}^{20} and V = {v_n}_{n=1}^{30} on a real line such that u_i = 0.1i and v_n = 0.1n. The positions of entities were used to compute two RBF kernels that serve as informative Σ_0 and Ω_0. Then we further deformed the real line to form 2 clusters in U and 3 clusters in V. The RBF function computed on the deformed scale gives two kernel matrices Σ and Ω whose block-diagonal structure reflects the clusters. Binary links between U and V are obtained by taking the sign of a function that is a sample from TGP(0, Σ, Ω). We randomly selected 50% of the links for training and held out the rest for testing (see Fig. 1-(a)). We ran two variants of SRM, one with informative Σ_0 and Ω_0 (see Fig. 
1-(e,f)) and the other with noninformative Dirac kernels Σ_0 = Ω_0 = I, and compared them with MMMF [10]. In all cases we set d = 20. The classification accuracies of the two SRMs, 0.942 and 0.965, are both better than the 0.906 of MMMF. As shown in Fig. 1, the block structures of the learned kernels indicate that both SRMs can learn the cluster structure of entities from links. The structural kernel optimization enables SRM to outperform MMMF, even with a completely noninformative prior. Note that the informative prior helps SRM to achieve the best accuracy.\n\nEachMovie Data: We tested our algorithms on a data set from [3], which is a subset of the EachMovie data, containing 5000 users' ratings, i.e., 1, 2, 3, 4, 5, or 6, on 1623 movies. We selected the first 1000 users and organized the data into a 1000 × 1623 table with 63,592 observed ratings. We compared SRM with MMMF in a regression task to predict the \"rating link\" between users and movies. In SRM we set Σ_0 = Ω_0 = I. For both methods the dimensionality was chosen as d = 20. In MMMF we used the squared error loss. We repeated the experiment 10 times, each time randomly selecting 70% of the ratings for training and holding out the rest for testing. Root-mean-square error (RMSE) and mean absolute error (MAE) were used to evaluate the accuracy. The results of all the repeats, as well as the means and standard deviations, are shown in Table 1 and Table 2. Compared to MMMF, SRM reduces the prediction error by over 12% in terms of both RMSE and MAE.\n\n8 Conclusions and Future Extensions\n\nIn this paper we proposed a stochastic relational model (SRM) for learning relational data. Entity relationships are modeled by a tensor interaction of multiple Gaussian processes (GPs). We proposed a family of relational processes and showed its convergence to a tensor Gaussian process as the degrees of freedom go to infinity. 
The process imposes an effective prior on the entity relationships, and leads to a discriminative link prediction model. We demonstrated the excellent results of SRM on a synthetic data set and a user-movie rating prediction problem. Though the current work focused on the application of link prediction, the model can be used for general relational learning purposes. There are several directions to extend the current model: (1) SRM can describe a joint distribution of entity links and entity classes conditioned on entity-wise GP kernels. Therefore entity classification can be solved in a relational context; (2) One can extend SRM to model multi-way relations where more than two entities participate in a single relationship; (3) SRM can also be applied to model pairwise relations between multiple entity sets, where kernel updates amount to propagation of information through the entire relational network; (4) As discussed in Sec. 2.1.2, SRM is a natural extension of hierarchical Bayesian multi-task models, by explicitly modeling the dependency over tasks. In a recent work [2] a tensor GP for multi-task learning was independently suggested; (5) Finally, it is extremely important to make the algorithm scalable to very large relational data, like the Netflix problem, containing about 480,000 users and 17,000 movies.\n\nTable 1: User-movie rating prediction error measured by RMSE\nRepeat | MMMF | SRM\n1 | 1.366 | 1.195\n2 | 1.367 | 1.199\n3 | 1.372 | 1.192\n4 | 1.377 | 1.200\n5 | 1.363 | 1.198\n6 | 1.368 | 1.209\n7 | 1.356 | 1.204\n8 | 1.380 | 1.208\n9 | 1.358 | 1.189\n10 | 1.373 | 1.209\nmean ± std. | 1.368 ± 0.008 | 1.200 ± 0.007\n\nTable 2: User-movie rating prediction error measured by MAE\nRepeat | MMMF | SRM\n1 | 1.067 | 0.924\n2 | 1.066 | 0.928\n3 | 1.074 | 0.924\n4 | 1.076 | 0.923\n5 | 1.066 | 0.924\n6 | 1.073 | 0.934\n7 | 1.060 | 0.931\n8 | 1.074 | 0.932\n9 | 1.062 | 0.918\n10 | 1.072 | 0.933\nmean ± std. | 1.060 ± 0.006 | 0.927 ± 0.005\n\nAcknowledgement\nThe authors thank Andreas Krause, Chris Williams, Shenghuo Zhu, and Wei Xu for the fruitful discussions.\n\nReferences\n[1] J. 
Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.\n[2] E. V. Bonilla, F. V. Agakov, and C. K. I. Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), 2007. To appear.\n[3] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI), 1998.\n[4] W. Chu, V. Sindhwani, Z. Ghahramani, and S. S. Keerthi. Relational learning with Gaussian processes. In Neural Information Processing Systems (NIPS), 2007. To appear.\n[5] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic models of text and link structure for hypertext classification. In Proceedings of the IJCAI Workshop on Text Learning: Beyond Supervision, 2001.\n[6] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. 1999.\n[7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006.\n[8] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 1998.\n[9] C. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[10] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.\n[11] B. Taskar, M. F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Neural Information Processing Systems Conference (NIPS), 2004.\n[12] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. 
In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006.\n[13] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.\n", "award": [], "sourceid": 2998, "authors": [{"given_name": "Kai", "family_name": "Yu", "institution": null}, {"given_name": "Wei", "family_name": "Chu", "institution": null}, {"given_name": "Shipeng", "family_name": "Yu", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "Zhao", "family_name": "Xu", "institution": null}]}