{"title": "Probabilistic Relational PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1123, "page_last": 1131, "abstract": "One crucial assumption made by both principal component analysis (PCA) and probabilistic PCA (PPCA) is that the instances are independent and identically distributed (i.i.d.). However, this common i.i.d. assumption is unreasonable for relational data. In this paper, by explicitly modeling covariance between instances as derived from the relational information, we propose a novel probabilistic dimensionality reduction method, called probabilistic relational PCA (PRPCA), for relational data analysis. Although the i.i.d. assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised easily like those for PPCA which makes explicit use of the i.i.d. assumption. Experiments on real-world data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance.", "full_text": "Probabilistic Relational PCA\n\nWu-Jun Li\n\nDit-Yan Yeung\n\nDept. of Comp. Sci. and Eng.\n\nHong Kong University of Science and Technology\n\nHong Kong, China\n\n{liwujun,dyyeung}@cse.ust.hk\n\nZhihua Zhang\n\nSchool of Comp. Sci. and Tech.\n\nZhejiang University\n\nZhejiang 310027, China\n\nzhzhang@cs.zju.edu.cn\n\nAbstract\n\nOne crucial assumption made by both principal component analysis (PCA) and\nprobabilistic PCA (PPCA) is that the instances are independent and identically\ndistributed (i.i.d.). However, this common i.i.d. assumption is unreasonable for\nrelational data. In this paper, by explicitly modeling covariance between instances\nas derived from the relational information, we propose a novel probabilistic di-\nmensionality reduction method, called probabilistic relational PCA (PRPCA), for\nrelational data analysis. Although the i.i.d. 
assumption is no longer adopted in\nPRPCA, the learning algorithms for PRPCA can still be devised easily like those\nfor PPCA which makes explicit use of the i.i.d. assumption. Experiments on real-\nworld data sets show that PRPCA can effectively utilize the relational information\nto dramatically outperform PCA and achieve state-of-the-art performance.\n\n1 Introduction\n\nUsing a low-dimensional embedding to summarize a high-dimensional data set has been widely\nused for exploring the structure in the data. The methods for discovering such low-dimensional\nembedding are often referred to as dimensionality reduction (DR) methods. Principal component\nanalysis (PCA) [13] is one of the most popular DR methods with great success in many applications.\nAs a more recent development, probabilistic PCA (PPCA) [21] provides a probabilistic formula-\ntion of PCA [13] based on a Gaussian latent variable model [1]. Compared with the original non-\nprobabilistic derivation of PCA in [12], PPCA possesses a number of practical advantages. For ex-\nample, PPCA can naturally deal with missing values in the data; the expectation-maximization (EM)\nalgorithm [9] used to learn the parameters in PPCA may be more ef\ufb01cient for high-dimensional data;\nit is easy to generalize the single model in PPCA to the mixture model case; furthermore, PPCA as\na probabilistic model can naturally exploit Bayesian methods [2].\n\nLike many existing DR methods, both PCA and PPCA are based on some assumptions about the\ndata. One assumption is that the data should be represented as feature vectors all of the same\ndimensionality. Data represented in this form are sometimes referred to as \ufb02at data [10]. Another\none is the so-called i.i.d. 
assumption, which means that the instances are assumed to be independent\nand identically distributed (i.i.d.).\n\nHowever, the data in many real-world applications, such as web pages and research papers, contain\nrelations or links between (some) instances in the data in addition to the textual content informa-\ntion which is represented in the form of feature vectors. Data of this sort, referred to as relational\ndata1 [10, 20], can be found in such diverse application areas as web mining [3, 17, 23, 24], bioinfor-\nmatics [22], social network analysis [4], and so on. On one hand, the link structure among instances\n\n1In this paper, we use document classi\ufb01cation as a running example for relational data analysis. Hence, for\nconvenience of illustration, the speci\ufb01c term \u2018textual content information\u2019 is used in the paper to refer to the\nfeature vectors describing the instances. However, the algorithms derived in this paper can be applied to any\nrelational data in which the instance feature vectors can represent any attribute information.\n\n1\n\n\fcannot be exploited easily when traditional DR methods such as PCA are applied to relational data.\nVery often, the useful relational information is simply discarded. For example, a citation/reference\nrelation between two papers provides very strong evidence for them to belong to the same topic even\nthough they may bear low similarity in their content due to the sparse nature of the bag-of-words\nrepresentation, but the relational information is not exploited at all when applying PCA or PPCA.\nOne possible use of the relational information in PCA or PPCA is to \ufb01rst convert the link structure\ninto the format of \ufb02at data by extracting some additional features from the links. However, as ar-\ngued in [10], this approach fails to capture some important structural information in the data. On the\nother hand, the i.i.d. assumption underlying PCA and PPCA is unreasonable for relational data. 
In relational data, the attributes of the connected (linked) instances are often correlated and the class label of one instance may have an influence on that of a linked instance. For example, in biology, interacting proteins are more likely to have the same biological functions than those without interaction. Therefore, PCA and PPCA, or more generally most existing DR methods based on the i.i.d. assumption, are not suitable for relational data analysis.\n\nIn this paper, a novel probabilistic DR method called probabilistic relational PCA (PRPCA) is proposed for relational data analysis. By explicitly modeling the covariance between instances as derived from the relational information, PRPCA seamlessly integrates relational information and textual content information into a unified probabilistic framework. Two learning algorithms, one based on a closed-form solution and the other based on an EM algorithm [9], are proposed to learn the parameters of PRPCA. Although the i.i.d. assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised as easily as those for PPCA, which makes explicit use of the i.i.d. assumption. Extensive experiments on real-world data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance.\n\n2 Notation\n\nWe use boldface uppercase letters, such as K, to denote matrices, and boldface lowercase letters, such as z, to denote vectors. The ith row and the jth column of a matrix K are denoted by Ki\u2217 and K\u2217j, respectively. Kij denotes the element at the ith row and jth column of K. zi denotes the ith element of z. KT is the transpose of K, and K\u22121 is the inverse of K. K \u2ab0 0 means that K is positive semi-definite (psd) and K \u227b 0 means that K is positive definite (pd). tr(\u00b7) denotes the trace of a matrix and etr(\u00b7) \u225c exp(tr(\u00b7)). 
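The two less common operators defined above can be illustrated in a few lines (a toy sketch with made-up matrices; NumPy assumed):

```python
import numpy as np

# Toy matrices (made up) to illustrate two of the operators defined above.
P = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Q = np.eye(3)

K = np.kron(P, Q)              # P ⊗ Q: Kronecker product, block matrix of P_ij * Q
etr_P = np.exp(np.trace(P))    # etr(P) := exp(tr(P))

assert K.shape == (6, 6)       # (2*3) x (2*3)
```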
P \u2297 Q denotes the Kronecker product [11] of P and Q. |\u00b7| denotes the determinant of a matrix. In is the identity matrix of size n \u00d7 n. e is a vector of 1s, the dimensionality of which depends on the context. We overload N (\u00b7) for both multivariate normal distributions and matrix variate normal distributions [11]. \u27e8\u00b7\u27e9 denotes the expectation operation and cov(\u00b7) denotes the covariance operation.\n\nNote that in relational data, there exist both content and link observations. As in [21], {tn}_{n=1}^N denotes a set of observed d-dimensional data (content) vectors, the d \u00d7 q matrix W denotes the q principal axes (also called factor loadings), \u00b5 denotes the data sample mean, and xn = WT (tn \u2212 \u00b5) denotes the corresponding q principal components (also called latent variables) of tn. We further use the d \u00d7 N matrix T to denote the content matrix with T\u2217n = tn, and the q \u00d7 N matrix X to denote the latent variables of T with X\u2217n = WT (tn \u2212 \u00b5). For relational data, the N \u00d7 N matrix A denotes the adjacency (link) matrix of the N instances. In this paper, we assume that the links are undirected. For data with directed links, we will convert the directed links into undirected links in a way that preserves the original physical meaning of the links. This will be described in detail in Section 4.1.1, and an example will be given in Section 5. Hence, Aij = 1 if there exists a relation between instances i and j, and otherwise Aij = 0. 
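As one concrete illustration of such a conversion, the WebKB-style co-link preprocessing described in Section 5 could be sketched as follows (a hedged sketch with a made-up directed link matrix; this is one plausible reading of the co-link rule, not the authors' exact code):

```python
import numpy as np

# Toy directed link matrix (made up): D[i, j] = 1 if page i links to page j.
D = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])

# One way to implement the WebKB-style preprocessing from Section 5:
# connect i and j if they link to a common page (D D^T > 0) or are
# co-linked by a common page (D^T D > 0); the original links are dropped.
A = (((D @ D.T) + (D.T @ D)) > 0).astype(int)
np.fill_diagonal(A, 0)           # no self-links: A_ii = 0

assert (A == A.T).all()          # the resulting links are undirected
```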
Moreover, we always assume that there exist no self-links, i.e., Aii = 0.\n\n3 Probabilistic PCA\n\nTo set the stage for the next section which introduces our PRPCA model, we first briefly present the derivation for PPCA [21], which was originally based on (vector-based) multivariate normal distributions, from the perspective of matrix variate normal distributions [11].\n\nIf we use \u03a5 to denote the Gaussian noise process and assume that \u03a5 and the latent variable matrix X follow these distributions:\n\n\u03a5 \u223c Nd,N (0, \u03c32Id \u2297 IN ), X \u223c Nq,N (0, Iq \u2297 IN ), (1)\n\nwe can express a generative model as follows: T = WX + \u00b5eT + \u03a5. Based on some properties of matrix variate normal distributions in [11], we get the following results:\n\nT | X \u223c Nd,N (WX + \u00b5eT , \u03c32Id \u2297 IN ), T \u223c Nd,N (\u00b5eT , (WWT + \u03c32Id) \u2297 IN ). (2)\n\nLet C = WWT + \u03c32Id. The corresponding log-likelihood of the observation matrix T is then\n\nL = ln p(T) = \u2212(N/2)[d ln(2\u03c0) + ln|C| + tr(C\u22121S)], (3)\n\nwhere S = (T \u2212 \u00b5eT )(T \u2212 \u00b5eT )T /N = (1/N) \u2211_{n=1}^N (T\u2217n \u2212 \u00b5)(T\u2217n \u2212 \u00b5)T. We can see that S is just the sample covariance matrix of the content observations. It is easy to see that this log-likelihood form is the same as that in [21]. Using matrix notations, the graphical model of PPCA based on matrix variate normal distributions is shown in Figure 1(a).\n\n(a) Model of PPCA (b) Model of PRPCA\n\nFigure 1: Graphical models of PPCA and PRPCA, in which T is the observation matrix, X is the latent variable matrix, \u00b5, W and \u03c32 are the parameters to learn, and the other quantities are kept constant.\n\n4 Probabilistic Relational PCA\n\nPPCA assumes that all the observations are independent and identically distributed. Although this i.i.d. 
assumption can make the modeling process much simpler and has achieved great success in\nmany traditional applications, this assumption is however very unreasonable for relational data [10].\nIn relational data, the attributes of connected (linked) instances are often correlated.\n\nIn this section, a probabilistic relational PCA model, called PRPCA, is proposed to integrate both\nthe relational information and the content information seamlessly into a uni\ufb01ed framework by elim-\ninating the i.i.d. assumption. Based on our reformulation of PPCA using matrix variate notations as\npresented in the previous section, we can obtain PRPCA just by introducing some relatively simple\n(but very effective) modi\ufb01cations. A promising property is that the computation needed for PRPCA\nis as simple as that for PPCA even though we have eliminated the restrictive i.i.d. assumption.\n\n4.1 Model Formulation\n\nAssume that the latent variable matrix X has the following distribution:\n\nX \u223c Nq,N (0, Iq \u2297 \u03a6).\n\n(4)\nAccording to Corollary 2.3.3.1 in [11], we can get cov(Xi\u2217) = \u03a6 (i \u2208 {1, . . . , q}), which means\nthat \u03a6 actually re\ufb02ects the covariance between the instances. From (1), we can see that cov(Xi\u2217) =\nIN for PPCA, which also coincides with the i.i.d. assumption of PPCA.\nHence, to eliminate the i.i.d. assumption for relational data, one direct way is to use a non-identity\ncovariance matrix \u03a6 for the distribution of X in (4). This \u03a6 should re\ufb02ect the physical meaning\n(semantics) of the relations between instances, which will be discussed in detail later. Similarly, we\ncan also change the IN in (1) to \u03a6 for \u03a5 to eliminate the i.i.d. 
assumption for the noise process.\n\n4.1.1 Relational Covariance Construction\n\nBecause the covariance matrix \u03a6 in PRPCA is constructed from the relational information in the data, we refer to it as relational covariance here.\n\nThe goal of PCA and PPCA is to find those principal axes onto which the retained variance under projection is maximal [13, 21]. For one specific X, the retained variance is tr[XXT ]. If we rewrite p(X) in (1) as p(X) = exp{tr[\u2212(1/2)XXT ]}/(2\u03c0)^{qN/2} = exp{\u2212(1/2)tr[XXT ]}/(2\u03c0)^{qN/2}, we have the following observation:\n\nObservation 1 For PPCA, the larger the retained variance of X, i.e., the more X approaches the destination point, the lower is the probability density at X given by the prior.\n\nHere, the destination point refers to the point where the goal of PPCA is achieved, i.e., the retained variance is maximal. Moreover, we use the retained variance as a measure to define the gap between two different points. The smaller the gap between the retained variance of two points, the more they approach each other.\n\nBecause the design principle of PRPCA is similar to that of PPCA, our working hypothesis here is that Observation 1 can also guide us to design the relational covariance of PRPCA. Its effectiveness will be empirically verified in Section 5.\n\nIn PRPCA, we assume that the attributes of two linked instances are positively correlated.2 Under this assumption, the ideal goal of PRPCA should be to make the latent representations of two instances as close as possible if there exists a relation (link) between them. Hence, the measure to define the gap between two points refers to the closeness of the linked instances, i.e., the summation of the Euclidean distances between the linked instances. 
Based on Observation 1, the more X approaches the destination point, the lower should be the probability density at X given by the prior. Hence, under the latent space representation X, the closer the linked instances are, the lower should be the probability density at X given by the prior. We will prove that if we set \u03a6 = \u2206\u22121, where \u2206 \u225c \u03b3IN + (IN + A)T (IN + A) with \u03b3 being typically a very small positive number to make \u2206 \u227b 0, we can get an appropriate prior for PRPCA. Note that Aij = 1 if there exists a relation between instances i and j, and otherwise Aij = 0. Because AT = A, we can also express \u2206 as \u2206 = \u03b3IN + (IN + A)(IN + A).\n\nLet \u02dcD denote a diagonal matrix whose diagonal elements are \u02dcDii = \u2211_j Aij. It is easy to prove that (AA)ii = \u02dcDii. Let B = AA \u2212 \u02dcD, which means that Bij = (AA)ij if i \u2260 j and Bii = 0. We can get \u2206 = (1+\u03b3)IN + 2A + AA = (1+\u03b3)IN + \u02dcD + (2A + B). Because Bij = \u2211_{k=1}^N AikAkj for i \u2260 j, we can see that Bij is the number of paths, each with path length 2, from instance i to instance j in the original adjacency graph A. Because the attributes of two linked instances are positively correlated, Bij actually reflects the degree of correlation between instance i and instance j. Let us take the paper citation graph as an example to illustrate this. The existence of a citation relation between two papers often implies that they are about the same topic. If paper i cites paper k and paper k cites paper j, it is highly likely that paper i and paper j are about the same topic. If there exists another paper a \u2260 k linking both paper i and paper j as well, the confidence that paper i and paper j are about the same topic will increase. Hence, the larger Bij is, the stronger is the correlation between instance i and instance j. Because Bij = \u2211_{k=1}^N AikAkj = AT\u2217iA\u2217j, Bij can also be seen as the similarity between the link vectors of instance i and instance j. 
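The construction of \u2206, \u02dcD and B above can be checked numerically (a toy sketch with a made-up 4-instance adjacency matrix; NumPy assumed):

```python
import numpy as np

# Toy 4-instance adjacency matrix (made up): undirected, no self-links.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = A.shape[0]
gamma = 1e-6

# Delta = gamma*I_N + (I_N + A)^T (I_N + A)
Delta = gamma * np.eye(N) + (np.eye(N) + A).T @ (np.eye(N) + A)

D_tilde = np.diag(A.sum(axis=1))   # diagonal degree matrix: D~_ii = sum_j A_ij
B = A @ A - D_tilde                # B_ij = number of length-2 paths i -> j (i != j)

# The decomposition derived above: Delta = (1+gamma)I_N + D~ + (2A + B).
assert np.allclose(Delta, (1 + gamma) * np.eye(N) + D_tilde + (2 * A + B))
assert B[1, 2] == 1                # one length-2 path (0-indexed): 1 -> 0 -> 2
```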
Therefore, B can be seen as a weight matrix (corresponding to a weight graph) derived from the original adjacency matrix A, and B is also consistent with the physical meaning underlying A.\n\nLetting G = 2A + B,3 we can find that G actually combines the original graph reflected by A and the derived graph reflected by B to get a new graph, and puts a weight 2Aij + Bij on the edge between instance i and instance j in the new graph. The new weight graph reflected by G is also consistent with the physical meaning underlying A. Letting L \u225c D \u2212 G, where D is a diagonal matrix whose diagonal elements are Dii = \u2211_j Gij and L is called the Laplacian matrix [6] of G, we can get \u2206 = (1+\u03b3)IN + \u02dcD + D \u2212 L. If we define another diagonal matrix \u02c6D \u225c (1+\u03b3)IN + \u02dcD + D, we can get \u2206 = \u02c6D \u2212 L. Then we have\n\ntr[X\u2206XT ] = \u2211_{i=1}^N \u02c6Dii\u2016X\u2217i\u2016^2 \u2212 (1/2) \u2211_{i=1}^N \u2211_{j=1}^N Gij\u2016X\u2217i \u2212 X\u2217j\u2016^2. (5)\n\n2Links with other physical meanings, such as the directed links in web graphs [25], can be transformed into links satisfying the assumption in PRPCA via some preprocessing strategies. One such strategy to preprocess the WebKB data set [8] will be given as an example in Section 5.\n\n3This means that we put a 2:1 ratio between A and B. Other ratios can be obtained by setting \u2206 = \u03b3IN + (\u03b1IN + A)(\u03b1IN + A) = \u03b3IN + \u03b1^2 IN + 2\u03b1A + B. 
Preliminary results show that PRPCA is not sensitive to \u03b1 as long as \u03b1 is not too large, but we omit the detailed results here because they are out of the scope of this paper.\n\nLetting \u03a6 = \u2206\u22121, we can get p(X) = exp{tr[\u2212(1/2)X\u2206XT ]}/((2\u03c0)^{qN/2}|\u2206|^{\u2212q/2}) = exp{\u2212(1/2)tr[X\u2206XT ]}/((2\u03c0)^{qN/2}|\u2206|^{\u2212q/2}).\n\nThe first term \u2211_{i=1}^N \u02c6Dii\u2016X\u2217i\u2016^2 in (5) can be treated as a measure of weighted variance of all the instances in the latent space. We can see that the larger \u02c6Dii is, the more weight will be put on instance i, which is reasonable because \u02c6Dii mainly reflects the degree of instance i in the graph. It is easy to see that, for those latent representations having a fixed value of weighted variance \u2211_{i=1}^N \u02c6Dii\u2016X\u2217i\u2016^2, the closer the latent representations of two linked entities are, the larger is their contribution to tr[X\u2206XT ], and subsequently the less is their contribution to p(X). This means that under the latent space representation X, the closer the linked instances are, the lower is the probability density at X given by the prior. Hence, we can get an appropriate prior for X by setting \u03a6 = \u2206\u22121 in (4).\n\n4.1.2 Model\n\nWith the constructed relational covariance \u03a6, the generative model of PRPCA is defined as follows:\n\n\u03a5 \u223c Nd,N (0, \u03c32Id \u2297 \u03a6), X \u223c Nq,N (0, Iq \u2297 \u03a6), T = WX + \u00b5eT + \u03a5,\n\nwhere \u03a6 = \u2206\u22121. We can further obtain the following results:\n\nT | X \u223c Nd,N (WX + \u00b5eT , \u03c32Id \u2297 \u03a6), T \u223c Nd,N (\u00b5eT , (WWT + \u03c32Id) \u2297 \u03a6). (6)\n\nThe graphical model of PRPCA is illustrated in Figure 1(b), from which we can see that the difference between PRPCA and PPCA lies solely in the difference between \u03a6 and IN . 
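The generative model above can be simulated directly. A minimal sketch (toy sizes and a made-up adjacency matrix; NumPy assumed), using the fact that for X \u223c Nq,N (0, Iq \u2297 \u03a6) each row of X is an independent N (0, \u03a6) vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and a made-up adjacency matrix, purely for illustration.
d, q, N = 5, 2, 4
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Delta = 1e-6 * np.eye(N) + (np.eye(N) + A).T @ (np.eye(N) + A)
Phi = np.linalg.inv(Delta)           # relational covariance Phi = Delta^{-1}
Phi = (Phi + Phi.T) / 2              # symmetrize against round-off

W = rng.standard_normal((d, q))      # toy parameters
mu = rng.standard_normal(d)
sigma2 = 0.1

# X ~ N_{q,N}(0, I_q ⊗ Phi): rows are independent N(0, Phi) vectors,
# generated via the Cholesky factor Phi = L L^T.
L = np.linalg.cholesky(Phi)
X = rng.standard_normal((q, N)) @ L.T
Upsilon = np.sqrt(sigma2) * rng.standard_normal((d, N)) @ L.T

T = W @ X + np.outer(mu, np.ones(N)) + Upsilon   # T = WX + mu e^T + Upsilon
```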
Comparing (6) to (2), we can find that the observations of PPCA are sampled independently while those of PRPCA are sampled with correlation. In fact, PPCA may be seen as a degenerate case of PRPCA as detailed below in Remark 1:\n\nRemark 1 When the i.i.d. assumption holds, i.e., all Aij = 0, PRPCA degenerates to PPCA by setting \u03b3 = 0. Note that the only role that \u03b3 plays is to make \u2206 \u227b 0. Hence, in our implementation, we always set \u03b3 to a very small positive value, such as 10\u22126. Actually, we may even set \u03b3 to 0, because \u2206 does not have to be pd. When \u2206 \u2ab0 0, we say T follows a singular matrix variate normal distribution [11], and all the derivations for PRPCA are still correct. In our experiment, we find that the performance under \u03b3 = 0 is almost the same as that under \u03b3 = 10\u22126. Further deliberation is out of the scope of this paper.\n\nAs in PPCA, we set C = WWT + \u03c32Id. Then the log-likelihood of the observation matrix T in PRPCA is\n\nL1 = ln p(T) = \u2212(N/2)[d ln(2\u03c0) + ln|C| + tr(C\u22121H)] + c, (7)\n\nwhere c = \u2212(d/2) ln|\u03a6| can be seen as a constant independent of the parameters \u00b5, W and \u03c32, and H = (T \u2212 \u00b5eT )\u2206(T \u2212 \u00b5eT )T /N. It is interesting to compare (7) with (3). We can find that to learn the parameters W and \u03c32, the only difference between PRPCA and PPCA lies in the difference between H and S. 
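Remark 1 and the H-versus-S correspondence in (7) can be checked with a few lines (a toy sketch with made-up data; NumPy assumed; the MLE of \u00b5 from Section 4.2 is used):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 3, 6
T = rng.standard_normal((d, N))    # made-up content matrix
e = np.ones(N)

# With no links (A = 0) and gamma = 0, Delta = I_N, so the relational
# statistic H reduces to the ordinary sample covariance S of PPCA.
A = np.zeros((N, N))
gamma = 0.0
Delta = gamma * np.eye(N) + (np.eye(N) + A).T @ (np.eye(N) + A)

mu = (T @ Delta @ e) / (e @ Delta @ e)   # MLE of mu; here just the sample mean
Tc = T - np.outer(mu, e)
H = Tc @ Delta @ Tc.T / N                # H = (T - mu e^T) Delta (T - mu e^T)^T / N
S = Tc @ Tc.T / N                        # ordinary sample covariance

assert np.allclose(H, S)
```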
Hence, all the learning techniques derived previously for PPCA are also potentially applicable to PRPCA simply by substituting S with H.\n\n4.2 Learning\n\nBy setting the gradient of L1 with respect to \u00b5 to 0, we can get the maximum-likelihood estimator (MLE) for \u00b5 as follows: \u00b5 = T\u2206e/(eT \u2206e). As in PPCA [21], we devise two methods to learn W and \u03c32 in PRPCA, one based on a closed-form solution and the other based on EM.\n\n4.2.1 Closed-Form Solution\n\nTheorem 1 The log-likelihood in (7) is maximized when\n\nWML = Uq(\u039bq \u2212 \u03c32_ML Iq)^{1/2} R, \u03c32_ML = (\u2211_{i=q+1}^d \u03bbi)/(d \u2212 q),\n\nwhere \u03bb1 \u2265 \u03bb2 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bbd are the eigenvalues of H, \u039bq is a q \u00d7 q diagonal matrix containing the q largest eigenvalues, Uq is a d \u00d7 q matrix in which the q column vectors are the principal eigenvectors of H corresponding to \u039bq, and R is an arbitrary q \u00d7 q orthogonal rotation matrix.\n\nThe proof of Theorem 1 makes use of techniques similar to those in Appendix A of [21] and is omitted here.\n\n4.2.2 EM Algorithm\n\nDuring the EM learning process, we treat {W, \u03c32} as parameters, X as missing data and {T, X} as complete data. The EM algorithm operates by alternating between the E-step and the M-step. Here we only briefly describe the updating rules; their derivation can be found in a longer version which can be downloaded from http://www.cse.ust.hk/\u223cliwujun.\n\nIn the E-step, the expectation of the complete-data log-likelihood with respect to the distribution of the missing data X is computed. To compute this expectation, we only need the following sufficient statistics:\n\n\u27e8X\u27e9 = M\u22121WT (T \u2212 \u00b5eT ), (8)\n\nwhere M = WT W + \u03c32Iq. 
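A minimal numerical sketch of these E-step statistics, including \u27e8X\u2206XT\u27e9 (made-up toy data; NumPy assumed; \u2206 is set to IN purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, q, N = 4, 2, 5
T = rng.standard_normal((d, N))    # made-up content matrix
Delta = np.eye(N)                  # toy choice; PRPCA builds Delta from A
e = np.ones(N)

W = rng.standard_normal((d, q))    # current parameter values (e.g., PCA init)
sigma2 = 0.5
mu = (T @ Delta @ e) / (e @ Delta @ e)

# E-step sufficient statistics from (8), under the current parameters.
M = W.T @ W + sigma2 * np.eye(q)
EX = np.linalg.solve(M, W.T @ (T - np.outer(mu, e)))       # <X> = M^{-1} W^T (T - mu e^T)
EXDXT = N * sigma2 * np.linalg.inv(M) + EX @ Delta @ EX.T  # <X Delta X^T>

assert EX.shape == (q, N)
```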
\u27e8X\u2206XT\u27e9 = N\u03c32M\u22121 + \u27e8X\u27e9\u2206\u27e8X\u27e9T . Note that all these statistics are computed based on the parameter values obtained from the previous iteration.\n\nIn the M-step, to maximize the expectation of the complete-data log-likelihood, the parameters {W, \u03c32} are updated as follows:\n\n\u02dcW = HW(\u03c32Iq + M\u22121WT HW)\u22121, \u02dc\u03c32 = tr(H \u2212 HWM\u22121\u02dcWT )/d. (9)\n\nNote that we use W here to denote the old value and \u02dcW for the updated new value.\n\n4.3 Complexity Analysis\n\nSuppose there are \u03b4 nonzero elements in \u2206. We can see that the computation cost for H is O(dN + d\u03b4). In many applications \u03b4 is typically a constant multiple of N. Hence, we can say that the time complexity for computing H is O(dN). For the closed-form solution, we have to invert a d \u00d7 d matrix, so the computation cost is O(dN + d3). For EM, because d is typically larger than q, the computation cost is O(dN + d2qT ), where T is the number of EM iterations. If the data are of very high dimensionality, EM will be more efficient than the closed-form solution.\n\n5 Experiments\n\nAlthough PPCA possesses additional advantages when compared with the original non-probabilistic formulation of PCA, they will get similar DR results when there exist no missing values in the data. If the task is to classify instances in the low-dimensional embedding, the classifiers based on the embedding results of PCA and PPCA are expected to achieve comparable results. Hence, in this paper, we only adopt PCA as the baseline to study the performance of PRPCA. For the EM algorithm of PRPCA, we use PCA to initialize W, \u03c32 is initialized to 10\u22126, and \u03b3 = 10\u22126. 
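Under these initialization choices, one EM run implementing the updates in (8) and (9) can be sketched as follows (made-up toy data with \u2206 = IN for illustration; NumPy assumed; this is a sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(3)
d, q, N = 6, 2, 20
T = rng.standard_normal((d, N))            # made-up content matrix
Delta = np.eye(N)                          # toy Delta; PRPCA builds it from A
e = np.ones(N)

mu = (T @ Delta @ e) / (e @ Delta @ e)     # MLE of mu
Tc = T - np.outer(mu, e)
H = Tc @ Delta @ Tc.T / N

# EM for W and sigma^2, following (8) and (9); initialization as in Section 5.
W = np.linalg.eigh(H)[1][:, -q:]           # PCA-style init: top-q eigenvectors of H
sigma2 = 1e-6
for _ in range(5):                         # T = 5 iterations, as in the experiments
    M = W.T @ W + sigma2 * np.eye(q)
    Minv = np.linalg.inv(M)
    W_new = H @ W @ np.linalg.inv(sigma2 * np.eye(q) + Minv @ W.T @ H @ W)
    sigma2 = np.trace(H - H @ W @ Minv @ W_new.T) / d
    W = W_new

# Theorem 1 predicts sigma^2 -> sum of discarded eigenvalues / (d - q)
# at the maximum-likelihood solution.
lam = np.sort(np.linalg.eigvalsh(H))[::-1]
sigma2_ml = lam[q:].sum() / (d - q)
```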
Because the EM\nalgorithm and the closed-form solution achieve similar results, we only report the results of the EM\nalgorithm of PRPCA in the following experiments.\n\n5.1 Data Sets and Evaluation Scheme\n\nHere, we only brie\ufb02y describe the data sets and evaluation scheme for space saving. More detailed\ninformation about them can be found in the longer version.\n\n6\n\n\fWe use three data sets to evaluate PRPCA. The \ufb01rst two data sets are Cora [16] and WebKB [8]. We\nadopt the same strategy as that in [26] to preprocess these two data sets. The third data set is the\nPoliticalBook data set used in [19]. For WebKB, according to the semantics of authoritative pages\nand hub pages [25], we \ufb01rst preprocess the link structure of this data set as follows: if two web\npages are co-linked by or link to another common web page, we add a link between these two pages.\nThen all the original links are removed. After preprocessing, all the directed links are converted into\nundirected links.\n\nThe Cora data set contains four subsets: DS, HA, ML and PL. The WebKB data set also contains\nfour subsets: Cornell, Texas, Washington and Wisconsin. We adopt the same strategy as that in [26]\nto evaluate PRPCA on the Cora and WebKB data sets. For the PoliticalBook data set, we use the\ntesting procedure of the latent Wishart process (LWP) model [15] for evaluation.\n\n5.2 Convergence Speed of EM\n\nWe use the DS and Cornell data sets to illustrate the convergence speed of the EM learning procedure\nof PRPCA. The performance on other data sets has similar characteristics, which is omitted here.\nWith q = 50, the average classi\ufb01cation accuracy based on 5-fold cross validation against the number\nof EM iterations T is shown in Figure 2. We can see that PRPCA achieves very promising and stable\nperformance after a very small number of iterations. 
We set T = 5 in all our following experiments.\n\n5.3 Visualization\n\nWe use the PoliticalBook data set to visualize the DR results of PCA and PRPCA. For the sake of visualization, q is set to 2. The results are depicted in Figure 3. We can see that it is not easy to separate the two classes in the latent space of PCA. However, the two classes are better separated from each other in the latent space of PRPCA. Hence, better clustering or classification performance can be expected when the examples are clustered or classified in the latent space of PRPCA.\n\nFigure 2: Convergence speed of the EM learning procedure of PRPCA.\n\nPCA PRPCA\n\nFigure 3: Visualization of data points in the latent spaces of PCA and PRPCA for the PoliticalBook data set. The positive and negative examples are shown as red crosses and blue circles, respectively.\n\n5.4 Performance\n\nThe dimensionality of Cora and WebKB is moderately high, but the dimensionality of PoliticalBook is very high. We evaluate PRPCA on these two different kinds of data to verify its effectiveness in general settings.\n\nPerformance on Cora and WebKB The average classification accuracy with its standard deviation based on 5-fold cross validation against the dimensionality of the latent space q is shown in Figure 4. We can find that PRPCA can dramatically outperform PCA on all the data sets under any dimensionality, which confirms that the relational information is very informative and PRPCA can utilize it very effectively.\n\nWe also perform comparison between PRPCA and those methods evaluated in [26]. 
The methods include: SVM on content, which ignores the link structure in the data and applies SVM only on the content information in the original bag-of-words representation; SVM on links, which ignores the content information and treats the links as features, i.e., the ith feature is link-to-pagei; SVM on link-content, in which the content features and link features of the two methods above are combined to give the feature representation; directed graph regularization (DGR), which is introduced in [25]; PLSI+PHITS, which is described in [7]; link-content MF, which is the joint link-content matrix factorization (MF) method in [26]. Note that link-content sup. MF in [26] is not adopted here for comparison. Because during the DR procedure link-content sup. MF employs additional label information which is not employed by other DR methods, it is unfair to directly compare it with other methods.\n\nFigure 4: Comparison between PRPCA and PCA on Cora and WebKB (panels: DS, HA, ML, PL, Cornell, Texas, Washington, Wisconsin).\n\nAs in the link-content MF method, we set q = 50 for PRPCA. The results are shown in Figure 5. We can see that PRPCA and link-content MF achieve the best performance among all the evaluated methods. Compared with link-content MF, PRPCA performs slightly better on DS and HA while performing slightly worse on ML and Texas, and achieves comparable performance on the other data sets. We can conclude that the overall performance of PRPCA is comparable with that of link-content MF. Unlike link-content MF, which is transductive in nature, PRPCA naturally supports inductive inference. 
More speci\ufb01cally, we can apply the learned transformation matrix of PRPCA\nto perform DR for the unseen test data, while link-content MF can only perform DR for those data\navailable during the training phase. Very recently, another method proposed by us, called relation\nregularized matrix factorization (RRMF) [14], has achieved better performance than PRPCA on the\nCora data set. However, similar to link-content MF, RRMF cannot be used for inductive inference\neither.\n\nFigure 5: Comparison between PRPCA and other methods on Cora and WebKB.\n\nPerformance on PoliticalBook As in mixed graph Gaussian process (XGP) [19] and LWP [15], we\nrandomly choose half of the whole data for training and the rest for testing. This subsampling pro-\ncess is repeated for 100 rounds and the average area under the ROC curve (AUC) with its standard\ndeviation is reported in Table 1, where GPC is a Gaussian process classi\ufb01er [18] trained on the origi-\nnal feature representation, and relational Gaussian process (RGP) is the method in [5]. For PCA and\nPRPCA, we \ufb01rst use them to perform DR, and then a Gaussian process classi\ufb01er is trained based on\nthe low-dimensional representation. Here, we set q = 5 for both PCA and PRPCA. We can see that\non this data set, PRPCA also dramatically outperforms PCA and achieves performance comparable\nwith the state of the art. Note that RGP and XGP cannot learn a low-dimensional embedding for\nthe instances. Although LWP can also learn a low-dimensional embedding for the instances, the\ncomputation cost to obtain a low-dimensional embedding for a test instance is O(N 3) because it has\nto invert the kernel matrix de\ufb01ned on the training data.\n\nTable 1: Performance on the PoliticalBook data set. 
Results for GPC, RGP and XGP are taken from [19], where the standard deviation is not reported.

GPC    RGP    XGP    LWP            PCA            PRPCA
0.92   0.98   0.98   0.98 +/- 0.02  0.92 +/- 0.03  0.98 +/- 0.02

Acknowledgments

Li and Yeung are supported by General Research Fund 621407 from the Research Grants Council of Hong Kong. Zhang is supported in part by the 973 Program (Project No. 2010CB327903). We thank Yu Zhang for some useful comments.

References

[1] D. J. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis. Kendall's Library of Statistics 7, second edition, 1999.

[2] C. M. Bishop. Bayesian PCA. In NIPS 11, 1998.

[3] J. Chang and D. M. Blei. Relational topic models for document networks. In AISTATS, 2009.

[4] J. Chang, J. L. Boyd-Graber, and D. M. Blei. Connections between the lines: augmenting social networks with text. In KDD, pages 169-178, 2009.

[5] W. Chu, V. Sindhwani, Z. Ghahramani, and S. S. Keerthi. Relational learning with Gaussian processes. In NIPS 19, 2007.

[6] F. Chung. Spectral Graph Theory. Number 92 in Regional Conference Series in Mathematics. American Mathematical Society, 1997.

[7] D. A. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In NIPS 13, 2000.

[8] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. M. Mitchell, K. Nigam, and S.
Slattery. Learning to extract symbolic knowledge from the world wide web. In AAAI/IAAI, pages 509-516, 1998.

[9] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.

[10] L. Getoor and B. Taskar. Introduction to Statistical Relational Learning. The MIT Press, 2007.

[11] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000.

[12] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 27:417-441, 1933.

[13] I. T. Jolliffe. Principal Component Analysis. Springer, second edition, 2002.

[14] W.-J. Li and D.-Y. Yeung. Relation regularized matrix factorization. In IJCAI, 2009.

[15] W.-J. Li, Z. Zhang, and D.-Y. Yeung. Latent Wishart processes for relational kernel learning. In AISTATS, pages 336-343, 2009.

[16] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127-163, 2000.

[17] R. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for text and citations. In KDD, pages 542-550, 2008.

[18] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[19] R. Silva, W. Chu, and Z. Ghahramani. Hidden common cause relations in relational learning. In NIPS 20, 2008.

[20] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In UAI, pages 485-492, 2002.

[21] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.

[22] J.-P. Vert. Reconstruction of biological networks by supervised machine learning approaches. In Elements of Computational Systems Biology, 2009.

[23] T. Yang, R.
Jin, Y. Chi, and S. Zhu. A Bayesian framework for community detection integrating content and link. In UAI, 2009.

[24] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In KDD, pages 927-936, 2009.

[25] D. Zhou, B. Schölkopf, and T. Hofmann. Semi-supervised learning on directed graphs. In NIPS 17, 2004.

[26] S. Zhu, K. Yu, Y. Chi, and Y. Gong. Combining content and link for classification using matrix factorization. In SIGIR, 2007.