{"title": "Efficient Relational Learning with Hidden Variable Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 1234, "page_last": 1242, "abstract": "Markov networks (MNs) can incorporate arbitrarily complex features in modeling relational data. However, this flexibility comes at a sharp price of training an exponentially complex model. To address this challenge, we propose a novel relational learning approach, which consists of a restricted class of relational MNs (RMNs) called relation tree-based RMN (treeRMN), and an efficient Hidden Variable Detection algorithm called Contrastive Variable Induction (CVI). On one hand, the restricted treeRMN only considers simple (e.g., unary and pairwise) features in relational data and thus achieves computational efficiency; and on the other hand, the CVI algorithm efficiently detects hidden variables which can capture long range dependencies. Therefore, the resultant approach is highly efficient yet does not sacrifice its expressive power. Empirical results on four real datasets show that the proposed relational learning method can achieve similar prediction quality as the state-of-the-art approaches, but is significantly more efficient in training; and the induced hidden variables are semantically meaningful and crucial to improve the training speed and prediction qualities of treeRMNs.", "full_text": "Ef\ufb01cient Relational Learning with\n\nHidden Variable Detection\n\nNi Lao, Jun Zhu, Liu Liu, Yandong Liu, William W. Cohen\n\nCarnegie Mellon University\n\n{nlao,junzhu,liuliu,yandongl,wcohen}@cs.cmu.edu\n\n5000 Forbes Avenue, Pittsburgh, PA 15213\n\nAbstract\n\nMarkov networks (MNs) can incorporate arbitrarily complex features in modeling\nrelational data. However, this \ufb02exibility comes at a sharp price of training an expo-\nnentially complex model. 
To address this challenge, we propose a novel relational learning approach, which consists of a restricted class of relational MNs (RMNs) called relation tree-based RMN (treeRMN), and an efficient Hidden Variable Detection algorithm called Contrastive Variable Induction (CVI). On one hand, the restricted treeRMN considers only simple (e.g., unary and pairwise) features in relational data and thus achieves computational efficiency; on the other hand, the CVI algorithm efficiently detects hidden variables which can capture long-range dependencies. Therefore, the resultant approach is highly efficient yet does not sacrifice expressive power. Empirical results on four real datasets show that the proposed relational learning method can achieve prediction quality similar to that of state-of-the-art approaches while being significantly more efficient in training, and that the induced hidden variables are semantically meaningful and crucial to improving the training speed and prediction quality of treeRMNs.

1 Introduction
Statistical relational learning has attracted ever-growing interest in the last decade because of widely available relational data, which can be as complex as citation graphs, the World Wide Web, or relational databases. Relational Markov Networks (RMNs) are excellent tools for capturing the statistical dependencies among entities in a relational dataset, as has been shown in many tasks such as collective classification [22] and information extraction [18][2]. Unlike Bayesian networks, RMNs avoid the difficulty of defining a coherent generative model, thereby allowing tremendous flexibility in representing complex patterns [21]. For example, Markov Logic Networks [10] can be automatically instantiated as an RMN, given just a set of predicates representing attributes and relations among entities.
The algorithm can be applied to tasks in different domains without any change. Relational Bayesian networks [22], in contrast, would require expert knowledge to design proper model structures and parameterizations whenever the schema of the domain under consideration changes. However, this flexibility of RMNs comes at a high price: training very complex models. For example, work by Kok and Domingos [10][11][12] has shown that a prominent problem of relational undirected models is how to handle the exponentially many features, each of which is a conjunction of several neighboring variables (or "ground atoms" in terms of first-order logic). Much computation is spent on proposing and evaluating candidate features.

The main goal of this paper is to show that instead of learning a very expressive relational model, which can be extremely expensive, an alternative approach that uses Hidden Variable Detection (HVD) to compensate for a family of restricted relational models (e.g., treeRMNs) can yield a very efficient yet competent relational learning framework. First, to achieve efficient inference, we introduce a restricted class of RMNs called relation tree-based RMNs (treeRMNs), which consider only unary (single variable assignment) and pairwise (conjunction of two variable assignments) features. Since the Markov blanket of a variable is concisely defined by a relation tree on the schema, we can easily control the complexity of treeRMN models. Second, to compensate for the restricted expressive power of treeRMNs, we further introduce a hidden variable induction algorithm called Contrastive Variable Induction (CVI), which can effectively detect latent variables capturing long-range dependencies. It has been shown for relational Bayesian networks [24] that hidden variables can help propagate information across network structures, thus reducing the burden of extensive structural learning.
In this work, we explore the usefulness of hidden variables in learning RMNs. Our experiments on four real datasets show that the proposed relational learning framework can achieve prediction quality similar to that of state-of-the-art RMN models, but is significantly more efficient in training. Furthermore, the induced hidden variables are semantically meaningful and are crucial to improving the training speed of treeRMNs.

In the remainder of this paper, we first briefly review related work and the training of undirected graphical models with mean field contrastive divergence. Then we present the treeRMN model and the CVI algorithm for variable induction. Finally, we present experimental results and conclude the paper.

2 Related Work
There has been a series of works by Kok and Domingos [10][11][12] developing Markov Logic Networks (MLNs) and showing their flexibility in different applications. The treeRMN model introduced in this work is intended to be a simpler model than MLNs, one that can be trained more efficiently yet is still able to capture complex dependencies. Most existing RMN models construct Markov networks by applying templates to entity relation graphs [21][8]. The treeRMN model that we are going to introduce uses a type of template called a relation tree, which is very general and applicable to a wide range of applications. This relation tree template resembles the path-based feature generation approach for relational classifiers developed by Huang et al. [7]. Recently, much work has been done on inducing hidden variables for generative Bayesian networks [5][4][16][9][20][14]. However, previous studies [6][19] have pointed out that the generality of Bayesian networks is limited by their need for prior knowledge of the ordering of nodes. On the other hand, very little progress has been made in the direction of non-parametric hidden variable models based on discriminative Markov networks (MNs).
One recent attempt is the Multiple Relational Clustering (MRC) [11] algorithm, which performs top-down clustering of predicates and symbols. However, it is computationally expensive because of its need for parameter estimation when evaluating candidate structures. The CVI algorithm introduced in this work is most similar to the "ideal parent" algorithm [16] for Gaussian Bayesian networks. The "ideal parent" algorithm evaluates candidate hidden variables based on the estimated gain in log-likelihood they can bring to the Bayesian network. Similarly, the CVI algorithm evaluates candidate hidden variables based on the estimated gain in a regularized RMN log-likelihood, thus avoiding the costly step of parameter estimation.

3 Preliminaries
Before describing our model, let us briefly review undirected graphical models (a.k.a. Markov networks). Since our goal is to develop an efficient RMN model, we use the simple but very efficient mean field contrastive divergence [23] method. Our empirical results show that even the simplest naive mean field can yield very promising results. Extensions using more accurate (but also more expensive) inference methods, such as loopy BP [15] or structured mean fields, can be made similarly.

Here we consider the general case in which Markov networks have observed variables O, labeled variables Y, and hidden variables H. Let X = (Y, H) be the joint of hidden and labeled variables. The conditional distribution of X given observations O is p(x|o; θ) = exp(θ⊤f(x, o))/Z(θ), where f is a vector of feature functions fk; θ is a vector of weights; Z(θ) = Σx exp(θ⊤f(x, o)) is a normalization factor; and fk(x, o) counts the number of times the k-th feature fires in (x, o). Here we assume that the range of each variable is discrete and finite.
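To make this notation concrete, here is a minimal sketch (our own toy example, not a model from the paper) that evaluates the conditional distribution p(x|o; θ) by brute-force enumeration:

```python
import itertools
import math

# Toy instantiation of p(x|o; theta) = exp(theta^T f(x, o)) / Z(theta).
# The variables, features, and weights are illustrative: two binary variables
# x = (x1, x2), and features f1 = [x1 = 1], f2 = [x2 = 1], f3 = [x1 = x2].
# The observation o is unused by these particular features.
def features(x, o):
    x1, x2 = x
    return [float(x1 == 1), float(x2 == 1), float(x1 == x2)]

def conditional(theta, o):
    """Compute p(x|o; theta) by enumerating all assignments x."""
    scores = {x: math.exp(sum(t * f for t, f in zip(theta, features(x, o))))
              for x in itertools.product([0, 1], repeat=2)}
    z = sum(scores.values())  # Z(theta) = sum_x exp(theta^T f(x, o))
    return {x: s / z for x, s in scores.items()}

p = conditional([0.5, -0.2, 1.0], None)
assert abs(sum(p.values()) - 1.0) < 1e-9  # a properly normalized distribution
```

Exact enumeration is exponential in the number of variables, which is exactly why the approximate inference discussed next is needed for realistic relational models.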
Many commonly used graphical models have tied parameters, which allow a small number of parameters to govern a large number of features. For example, in a linear-chain CRF, each parameter is associated with a feature template: e.g., "the current node has label yt = 1 and the immediate next neighbor has label yt+1 = 1". After applying each template to all the nodes in a graph, we get a graphical model with a large number of features (i.e., instantiations of feature templates). In general, a model's order of Markov dependence is determined by the maximal number of neighboring steps considered by any one of its feature templates. In the context of relational learning, the templates can be defined similarly, except with richer representations: multiple types of entities and neighboring relations.

Given a set of training samples D = {(ym, om)}, m = 1, ..., M, the parameter estimation of an MN can be formulated as maximizing the following regularized log-likelihood

L(θ) = Σm lm(θ) − λ∥θ∥₁ − (β/2)∥θ∥₂²,  (1)

where λ and β are non-negative regularization constants for the ℓ1- and ℓ2-norm respectively. Because of its singularity at the origin, the ℓ1-norm can yield a sparse estimate, which is a desired property for hidden variable discovery, as we shall see. The differentiable ℓ2-norm is useful when there are strongly correlated features. The composite ℓ1/ℓ2-norm is known as the ElasticNet [27], which has been shown to have nice properties. The log-likelihood for a single sample is

l(θ) = log p(y|o; θ) = log Σh p(h, y|o; θ),  (2)

where ⟨·⟩p denotes the expectation under the distribution p.
Its gradient is ∇θl(θ) = ⟨f⟩py − ⟨f⟩p. To simplify notation, we use p to denote the distribution p(h, y|o; θ) and py to denote p(h|y, o; θ). For simple (e.g., tree-structured) MNs, message passing algorithms can be used to compute the marginal probabilities required in the gradient exactly. For general MNs, however, we need approximate strategies like variational or Monte Carlo methods. Here we use the simple mean field variational method [23]. By analogy with statistical physics, the free energy of any distribution q is defined as

F(q) = ⟨−θ⊤f(y, h, o)⟩q − H(q).  (3)

Therefore, F(p) = −log Z(θ), F(py) = −log Σh exp(θ⊤f(y, h, o)), and l(θ) = F(p) − F(py). Let q0 be the mean field approximation of p(h, y|o; θ) with y clamped to their true values, and qt be the approximation of p(h, y|o; θ) obtained by applying t steps of mean field updates to q0 with y free. Then F(q0) ≥ F(qt) ≥ F(q∞) ≥ F(p). As in [23], we set t = 1, and use

lCD1(θ) ≜ F(q1) − F(q0)  (4)

to approximate l(θ); its gradient is ∇θlCD1(θ) = ⟨f⟩q0 − ⟨f⟩q1. The new objective function LCD1(θ) uses lCD1(θ) in place of l(θ). One advantage of CD is that it avoids q being trapped in a possibly multimodal distribution of p(h, y|o; θ) [25][3]. With the above approximation, we can use orthant-wise L-BFGS [1] to estimate the parameters θ.

4 Relation Tree-Based RMNs
In the following, we formally define the treeRMN model with relation tree templates, which is very general and applicable to a wide range of applications.

A schema S (Figure 1, left) is a pair (T, R).
T = {Ti} is a set of entity types, which include both basic entity types (e.g., Person, Class) and composite entity types (e.g., ⟨Person, Person⟩, ⟨Person, Class⟩). Each entity type is associated with a set of attributes A(T) = {T.Ai}: e.g., A(Person) = {Person.gender}. R = {R} is a set of binary relations. We use dom(R) to denote the domain type of R and range(R) to denote its range. For each argument of a composite entity type, we define two relations, one with outward direction (e.g., PP1 maps a Person-Person pair to its first argument) and another with inward direction (e.g., PP1⁻¹). Here we use ⁻¹ to denote the inverse of a relation. We further introduce a Twin relation, which connects a composite entity type to itself; its semantics will become clear later. In principle, we can define other types of relations, such as those corresponding to functions in second-order logic (e.g., Person −FatherOf→ Person).

An entity relation graph G = IE(S) (Figure 1, right) is the instantiation of schema S on a set of basic entities E = {ei}. We define the instantiation of a basic entity type T as IE(T) = {e : e.T = T}, and similarly for a composite type IE(T = ⟨T1, ..., Tk⟩) = {⟨e1, ..., ek⟩ : ei.T = Ti}. In the given example, IE(Person) = {p1, p2} is the set of persons; IE(Class) = {c1} is the set of classes; IE(⟨Person, Person⟩) = {⟨p1, p2⟩, ⟨p2, p1⟩} is the set of person-person pairs; and IE(⟨Person, Class⟩) = {⟨p1, c1⟩, ⟨p2, c1⟩} is the set of person-class pairs. Each entity e has a set of variables {e.Xi} that correspond to the set of attributes of its entity type, A(e.T). For a composite entity that consists of two entities of the same type, we would like to capture its correlation with its twin, the composite entity made of the same basic entities but in reversed order.
Therefore, we add the Twin relation between all pairs of twin entities: e.g., from ⟨p1, p2⟩ to ⟨p2, p1⟩, and vice versa.

Figure 1: (Left) is a schema, where round and rectangular boxes represent basic and composite entity types respectively. (Right) is a corresponding entity relation graph with three basic entities: p1, p2, c1. For clarity we only show one direction of the relations and omit their labels.

Figure 2: Two-level relation trees for the Person type (left) and the ⟨Person, Person⟩ type (right).

Given a schema, we can conveniently express how one entity can reach another entity through the concept of a relation path. A relation path P is a sequence of relations R1...Rℓ for which the domains and ranges of adjacent relations are compatible, i.e., range(Ri) = dom(Ri+1). We define dom(R1...Rℓ) ≡ dom(R1) and range(R1...Rℓ) ≡ range(Rℓ), and when we wish to emphasize the types associated with each step in a path, we will write the path P = R1...Rℓ as T0 −R1→ ... −Rℓ→ Tℓ, where T0 = dom(R1) = dom(P), T1 = range(R1) = dom(R2), and so on. Note that, because some of the relations reflect one-to-one mappings, there are groups of paths that are equivalent: e.g., the trivial path at Person is actually equivalent to the path Person −PC1⁻¹→ ⟨Person, Class⟩ −PC1→ Person. To avoid creating these uninteresting paths, we add the constraint that outward composite relations (e.g., PP1, PC1) cannot be immediately preceded by their inverse.
We also constrain that the Twin relation should not be combined with any other relations. Now, the Markov blanket of an entity e ∈ T can be concisely defined by the set of all relation paths with domain T and of length ≤ ℓ (as shown in Figure 2). We call this set the relation tree of type T, and denote it as Tree(T, ℓ) = {P}. We define a unary template as T.Ai = a, where Ai is an attribute of type T, and a ∈ range(Ai). This template can be applied to any entity e of type T in the entity relation graph. We define a pairwise template as T.Ai = a ∧ P.Bj = b, where Ai is an attribute of type T, a ∈ range(Ai), P.Bj is an attribute of type range(P), dom(P) = T, and b ∈ range(Bj). This template can be applied to any entity pair (e1, e2), where e1.T = T and e2 ∈ e1.P. Here we define e.P as the set of entities reachable from entity e ∈ T through the relation path P. For example, the template

pp:coauthor = 1 ∧ (pp −PP1→ p −PP1⁻¹→ pp):advise = 1

can be applied to any person-person pair, and it fires whenever coauthor = 1 for this person pair and the first person (identified as pp −PP1→ p) also has advise = 1 with another person. Here we use p as a shorthand for the type Person, and pp as a shorthand for ⟨Person, Person⟩. In our current implementation, we systematically enumerate all possible unary and pairwise templates.

Given the above concepts, we define a treeRMN model M = (G, f, θ) as the tuple of an entity relation graph G, a set of feature functions f, and their weights θ. Each feature function fk counts the number of times the k-th template fires in G.
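As a concrete illustration of relation trees, the following sketch (our own data structures; only the type and relation names follow the schema of Figure 1) enumerates the relation paths of a type up to a given depth with the two pruning constraints above. We read the Twin constraint as "a Twin step may end a path but not be extended", which is one plausible reading:

```python
from collections import namedtuple

# Each relation stores its domain type, range type, and the name of its inverse.
Rel = namedtuple("Rel", "name dom range inverse")

def make_schema():
    rels = []
    def add(name, dom, rng):
        rels.append(Rel(name, dom, rng, name + "^-1"))
        rels.append(Rel(name + "^-1", rng, dom, name))
    add("PP1", "<Person,Person>", "Person")  # pair -> first argument
    add("PP2", "<Person,Person>", "Person")
    add("PC1", "<Person,Class>", "Person")
    add("PC2", "<Person,Class>", "Class")
    rels.append(Rel("Twin", "<Person,Person>", "<Person,Person>", "Twin"))
    return rels

def relation_tree(rels, T, depth):
    """All relation paths with domain T and length <= depth, pruning paths where
    an outward composite relation immediately follows its inverse, and paths
    that try to extend past a Twin step."""
    outward = {"PP1", "PP2", "PC1", "PC2"}
    paths, frontier = [], [[]]
    for _ in range(depth):
        nxt = []
        for p in frontier:
            here = T if not p else p[-1].range
            if p and p[-1].name == "Twin":  # Twin is not combined further
                continue
            for r in rels:
                if r.dom != here:
                    continue
                if p and r.name in outward and p[-1].name == r.inverse:
                    continue  # e.g. prune Person -PC1^-1-> ... -PC1-> Person
                nxt.append(p + [r])
        paths.extend(nxt)
        frontier = nxt
    return [".".join(r.name for r in p) for p in paths]
```

On this toy schema, the depth-2 relation tree for Person contains eight paths, such as PP1^-1.PP2 and PC1^-1.PC2, while redundant paths like PP1^-1.PP1 are pruned.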
Generally, the complexity of inference is exponential in the depth of the relation trees, because both the number of templates and the sizes of their Markov blankets grow exponentially with the depth ℓ. TreeRMN thus provides us with a very convenient way to control the complexity through the single parameter ℓ. Since treeRMN only considers pairwise and unary features, it is less expressive than Markov Logic Networks [10], which can define higher-order features by conjunctions of predicates; treeRMN is also less expressive than relational Bayesian networks [9][20][14], which have factor functions with three arguments. However, the limited expressive power of treeRMN can be effectively compensated for by detecting hidden variables, which is another key component of our relational learning approach, as explained in the next section.

Algorithm 1 Contrastive Variable Induction
  initialize a treeRMN M = (G, f, θ)
  while true do
    estimate parameters θ by L-BFGS
    (f′, θ′) = induceHiddenVariables(M)
    if no hidden variable is induced then
      break
    end if
  end while
  return M

Algorithm 2 Bottom-Up Clustering of Entities
  initialize clustering Γ = {Ii = {i}}
  while true do
    for any pair of clusters I1, I2 ∈ Γ do
      inc(I1, I2) = ΔI1∪I2 − ΔI1 − ΔI2
    end for
    if the largest increment ≤ 0 then
      break
    end if
    merge the pair with the largest increment
  end while
  return Γ

5 Contrastive Variable Induction (CVI)
As we have explained in the previous section, in order to compensate for the limited expressive power of a shallow treeRMN and capture long-range dependencies in complex relational data,
we propose to introduce hidden variables. These variables are detected effectively with the Contrastive Variable Induction (CVI) algorithm, as explained below.

The basic procedure (Algorithm 1) starts with a treeRMN model over the observed variables, which can be manually designed or automatically learned [13]; it then iteratively introduces new HVs into the model and re-estimates its parameters. The key to making this simple procedure highly efficient is a fast algorithm for evaluating and selecting good candidate HVs. We give closed-form expressions for the likelihood gain and the weights of newly added features under the contrastive divergence approximation [23] (other types of inference can be handled similarly). Therefore, the CVI process can be very efficient, adding only a small overhead to the training of a regular treeRMN.

Consider introducing a new HV H to the entity type T. In order for H to influence the model, it needs to be connected to the existing model. This is done by defining additional feature templates: we can denote a HV candidate by a tuple ({q(i)(H)}, fH, θH), where {q(i)(H)} is the set of distributions of the hidden variable H on all entities of type T, fH is a set of pairwise feature templates that connect H to the existing model, and θH is a vector of feature weights. Here we assume that any feature f ∈ fH is in the pairwise form fH=1 ∧ A=a, where a is the assignment to one of the existing variables A in the relation tree of type T. Ideally, we would like to identify the candidate HV which gives the maximal gain in the regularized objective function LCD1(θ).

For easy evaluation of H, we set its mean field variational parameters µH to either 0 or 1 on the entities of type T. This yields a lower bound on the gain of LCD1(θ). Therefore, a candidate HV can be represented as (I, fH, θH), where I is the set of indices of the entities with µH = 1.
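To make the candidate evaluation concrete, the closed-form per-feature score derived next (Eqs. 5 and 6) amounts to a soft-thresholded quadratic gain. In this sketch (the function names are ours), e and delta stand for the expectation difference eI[f] and the variance difference δI[f] defined in the derivation:

```python
def truncate(a, b):
    """Truncation operator induced by the l1 penalty:
    a - b if a > b; a + b if a < -b; 0 otherwise."""
    if a > b:
        return a - b
    if a < -b:
        return a + b
    return 0.0

def candidate_gain(e, delta, lam, beta):
    """Estimated gain (Eq. 5) and closed-form weight (Eq. 6) for one candidate
    feature, given expectation difference e, variance difference delta, and the
    l1/l2 regularization constants lam and beta."""
    t = truncate(-e, lam)
    theta_f = t / (delta + beta)         # Eq. (6)
    gain = 0.5 * t * t / (delta + beta)  # Eq. (5)
    return gain, theta_f
```

A feature whose expectation difference falls inside the ℓ1 dead zone gets zero weight and zero gain, so weak candidates are pruned without any parameter estimation.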
Using a second-order Taylor expansion, we can show that for a particular feature f ∈ fH the maximal gain

ΔI,f = (1/2) ⌊−eI[f]⌋λ² / (δI[f] + β)  (5)

is achieved at

θf = ⌊−eI[f]⌋λ / (δI[f] + β),  (6)

where ⌊·⌋ is a truncation operator: ⌊a⌋b = a − b, if a > b; a + b, if a < −b; 0, otherwise. The error eI[f] = ⟨f⟩q1,I − ⟨f⟩q0,I is the difference of f's expectations, and δI[f] = Var*q1,I[f] − Var*q0,I[f] is the difference of f's variances¹. Here we use q,I to denote the distribution q of the existing variables augmented by the distribution of H parameterized by the index set I. q0 and q1 are the wake and sleep distributions estimated by 1-step mean-field CD. The estimates in Eqs. (5) and (6) are simple, yet have nice intuitive interpretations of the effects of the ℓ1 and ℓ2 regularizers as used in Eq. (1): a large ℓ2-norm (i.e., large β) smoothly shrinks both the estimated likelihood gain and the feature weights, while the non-differentiable ℓ1-norm not only shrinks the estimated gains and feature weights but also drives some features to have zero gain, and can therefore automatically select features. If we assume that the gains of individual features are independent, then the estimated gain for H is

ΔI ≈ Σf∈fI ΔI,f,

where fI = {f : ΔI,f > 0} is the set of features that are expected to improve the objective function. However, finding the index set I that maximizes ΔI is still non-trivial: an NP-hard combinatorial optimization problem, which is often tackled by top-down or bottom-up procedures in the clustering literature. Algorithm 2 uses a simple bottom-up clustering algorithm to build a hierarchy of clusters. It starts with each sample as an individual cluster, and then repeatedly merges the two clusters that lead to the best increment of gain. The merging stops when the best increment is ≤ 0.

After clustering, we introduce a single categorical variable that treats each cluster with positive gain as a category; the remaining useless clusters are merged into a separate category. Introducing this categorical variable is equivalent to introducing a set of binary variables, one for each cluster with positive gain. From the above derivation, we can see that the essential part of the CVI algorithm is to compute the expectations and variances of RMN features, both of which can be done by any inference procedure, including the mean field method we have used. Therefore, in principle, the CVI algorithm can be extended to use other inference methods like belief propagation or exact inference.

¹Var*q,I[f] is intractable when we have tied parameters. Therefore, we approximate it by assuming that the occurrences of f are independent of each other: Var*q,I[f] = ΣV∈𝒱 Varq,I[f(V)] = ΣV∈𝒱 ⟨f(V)⟩q,I (1 − ⟨f(V)⟩q,I), where V is any specific subset of variables that f can be applied to.

Remark 1: After the induction step, the introduced HVs are treated as observations, i.e., their variational parameters are fixed to their initial 0 or 1 values. In the future, we would like to treat the HVs as free variables, which can potentially correct the errors made by the greedy clustering procedure.
The cardinalities of HVs may be adapted by operators like deleting, merging, or splitting of categories.

Remark 2: Currently, we only induce HVs for basic entity types. An extension to composite types could reveal interesting ternary relations such as "Abnormality can be PartOf Animals". However, this requires clustering over a much larger number of entities, which cannot be done by our simple implementation of bottom-up clustering.

6 Experiment
In this section, we present both qualitative and quantitative results for the treeRMN model. We demonstrate that CVI can discover semantically meaningful hidden variables, which can significantly improve the speed and quality of treeRMN models.

6.1 Datasets
Table 1 shows the statistics of the four datasets used in our experiments. These datasets are commonly used in previous work on relational learning [9][11][20][14]. The Animal dataset contains a set of animals and their attributes. It consists exclusively of unary predicates of the form A(a), where A is an attribute and a is an animal (e.g., Swims(Dolphin)). This is a simple propositional dataset with no relational structure, but it is useful as a base case for comparison. The Nation dataset contains attributes of nations and relations among them. The binary predicates are of the form R(n1, n2), where n1, n2 are nations and R is a relation between them (e.g., ExportsTo, GivesEconomicAidTo). The unary predicates are of the form A(n), where n is a nation and A is an attribute (e.g., Communist(China)). The UML dataset is a biomedical ontology called the Unified Medical Language System. It consists of binary predicates of the form R(c1, c2), where c1 and c2 are biomedical concepts and R is a relation between them (e.g., Treats(Antibiotic, Disease)). The Kinship dataset contains kinship relationships among members of the Alyawarra tribe from Central Australia. Predicates are of the form R(p1, p2), where R is a kinship term and p1, p2 are persons. Except for the Animal data, the number of composite entities is the square of the number of basic entities.

            Basic          Composite
            #E     #A      #E       #A
Animal      50     80      0        0
Nation      14     111     196      56
UML         135    0       18,225   49
Kinship     104    0       10,816   1*

Table 1: Number of entities (#E) and attributes (#A) for four datasets. *The Kinship data has only one attribute, which has 26 possible values.

6.2 Characterization of treeRMN and CVI
In this section, we analyze the properties of the discovered hidden variables and demonstrate the behavior of the CVI algorithm. For the simple non-relational Animal data, if we start with a full model with all pairwise features, CVI decides not to introduce any hidden variables. If we run CVI starting from a model with only unary features, however, CVI decides to introduce one hidden variable H0 with 8 categories. Table 2 shows the associated entities and features for the first four categories.
We can see that they nicely identify marine mammals, predators, rodents, and primates.

     Entities                                Positive Features                       Negative Features
C0   KillerWhale Seal Dolphin BlueWhale      Flippers Ocean Water Swims Fish         Quadrapedal Ground Furry
     Walrus HumpbackWhale                    Hairless Coastal Arctic ...             Strainteeth Walks ...
C1   GrizzlyBear Tiger GermanShepherd        Stalker Fierce Meat Meatteeth Claws     Timid Vegetation Weak Grazer
     Leopard Wolf Weasel Raccoon Fox         Hunter Nocturnal Paws Smart Pads ...    Toughskin Hooves Domestic ...
     Bobcat Lion
C2   Hamster Skunk Mole Rabbit Rat           Hibernate Buckteeth Weak Small          Strong Muscle Big
     Raccoon Mouse                           Fields Nestspot Paws ...                Toughskin ...
C3   SpiderMonkey Gorilla Chimpanzee         Tree Jungle Bipedal Hands               Plains Fields Patches ...
                                             Vegetation Forest ...

Table 2: The associated entities and features (sorted by decreasing magnitude of feature weights) for the first four categories of the induced hidden variable a.H0 on the Animal data. The features are in the form a.H0 = Ci ∧ a.A = 1, where A is any of the variables in the last two columns.

     Entities                                  Positive Features
C0   AcquiredAbnormality                       c −CC2⁻¹→ cc.Causes, c −CC1⁻¹→ cc.PartOf,
     AnatomicalAbnormality                     c −CC2⁻¹→ cc.Complicates, c −CC1⁻¹→ cc.CooccursWith ...
     CongenitalAbnormality
C1   Alga Plant                                c −CC2⁻¹→ cc.LocationOf ...
C2   Amphibian Animal Bird Invertebrate        c −CC1⁻¹→ cc.InteractsWith, c −CC1⁻¹→ cc.PropertyOf,
     Fish Mammal Reptile Vertebrate            c −CC2⁻¹→ cc.InteractsWith, c −CC2⁻¹→ cc.PartOf ...

Table 3: The associated entities and features (sorted by decreasing magnitude of feature weights) for the first three categories of the induced hidden variable c.H0 on the UML data. The features are in the form c.H0 = Ci ∧ A = 1, where A is any of the variables in the last column.

For the three relational datasets, we use UML as an example. The induction processes for the Nation and Kinship datasets are similar, and we omit their details due to space limitations. For the UML task, CVI induces two multinomial hidden variables H0 and H1. As we can see from Figure 3, the inclusion of each hidden variable significantly improves the conditional log-likelihood of the model. The first hidden variable c.H0 has 43 categories, and Table 3 shows the top three of them. We can see that these categories represent the hidden concepts Abnormalities, Plants, and Animals, respectively.
Abnormalities can be caused or treated by other concepts, and they can also be a part of other concepts. Plants can be the location of some other concepts; and some other concepts can be part of or the property of Animals. These groupings of concepts are similar to those reported by Kok and Domingos [11].

Figure 3: Change of the conditional log likelihood during training for the UML data.

6.3 Overall Performance
Now we present a quantitative evaluation of the treeRMN model, and compare it with other relational learning methods including MLN structure learning (MSL) [10], Infinite Relational Models (IRM) [9] and Multiple Relational Clustering (MRC) [11]. Following the methodology of [11], we situate our experiments in prediction tasks. We perform 10-fold cross validation by randomly splitting all the variables into 10 sets. At each run, we treat one fold as hidden during training, and then evaluate the prediction of these variables conditioned on the observed variables during testing. The overall performance is measured by training time, average Conditional Log-Likelihood (CLL), and Area Under the precision-recall Curve (AUC) [11]. All implementation is done in Java 6.0.
Table 4 compares the overall performance of treeRMN (RMN), treeRMN with hidden variable discovery (RMNCVI), and other relational models (MSL, IRM and MRC) as reported in [11]. We use subscripts (0, 1, 2) to indicate the order of Markov dependency (depth of relation trees), and dimθ for the number of parameters. First, we can see that, without HVs, the treeRMNs with higher Markov orders generally perform better in terms of CLL and AUC. However, due to the complexity of high-order treeRMNs, this comes with large increases in training time. In some cases (e.g., the Kinship data), a high order treeRMN can perform worse than a low order one, probably due to the difficulty of inference with a large number of features.
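The two prediction metrics used in this evaluation can be computed directly from per-variable predictive probabilities. A minimal numpy sketch of average CLL for binary variables and of PR-AUC computed as average precision (the treeRMN inference that would produce the probabilities is not shown; the arrays below are toy values):

```python
import numpy as np

def avg_cll(p, y):
    """Average conditional log-likelihood of binary truths y under predicted probabilities p."""
    return float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def pr_auc(p, y):
    """Area under the precision-recall curve, computed as average precision."""
    order = np.argsort(-p)                    # rank variables by predicted probability
    y_sorted = y[order]
    tp = np.cumsum(y_sorted)                  # true positives accumulated at each rank
    precision = tp / np.arange(1, len(y_sorted) + 1)
    # average precision = mean of the precision values at the ranks of true positives
    return float(np.sum(precision * y_sorted) / y_sorted.sum())

# Toy predictions for four held-out binary variables.
p = np.array([0.9, 0.8, 0.3, 0.2])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(round(pr_auc(p, y), 3))  # average of precision 1/1 and 2/3 -> 0.833
```

In the 10-fold protocol above, these quantities would be averaged over the held-out fold of each run.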
Animal, λ=0.01, β=1
            CLL           AUC          dimθ    Time
RMN0        -0.34±0.03    0.88±0.02    3,655   5s
RMNCVI⋆     -0.33±0.02    0.89±0.02    4,349   9s
MSL†        -0.54±0.04    0.68±0.04            24h
MRC†        -0.43±0.04    0.80±0.04            10h
IRM†        -0.43±0.06    0.79±0.08            10h

Nation, λ=0.01, β=1
            CLL           AUC          dimθ     Time
RMN0        -0.40±0.01    0.63±0.04    7,812    15s
RMN1        -0.33±0.02    0.72±0.04    21,840   70s
RMN2        -0.38±0.03    0.71±0.04    40,489   446s
RMNCVI      -0.31±0.02    0.83±0.04    22,191   104s
MSL†        -0.33±0.04    0.77±0.04             24h
MRC†        -0.31±0.02    0.75±0.03             10h
IRM†        -0.32±0.02    0.75±0.03             10h

UML, λ=0.01, β=10
            CLL             AUC          dimθ    Time
RMN0        -0.056±0.005    0.70±0.02    1,081   0.3h
RMN1        -0.044±0.002    0.68±0.04    2,162   1.0h
RMN2        -0.028±0.003    0.71±0.02    6,440   14.5h
RMNCVI      -0.005±0.001    0.94±0.01    6,946   453s
MSL†        -0.025±0.002    0.47±0.06            24h
MRC†        -0.004±0.000    0.97±0.00            10h
IRM†        -0.011±0.001    0.79±0.01            10h

Kinship, λ=0.01, β=10
            CLL             AUC          dimθ    Time
RMN0§       -2.95±0.01      0.08±0.00    25      6s
RMN1§       -1.36±0.05      0.66±0.03    350     107s
RMN2§       -2.34±0.01      0.33±0.00    1,625   2.1h
RMNCVI⋆§    -1.04±0.03      0.81±0.01    900     402s
MSL†        -0.066±0.006    0.59±0.08            24h
MRC†        -0.048±0.002    0.84±0.01            10h
IRM†        -0.063±0.002    0.68±0.01            10h

Table 4: Overall performance. Bold identifies the best performance, and ± marks the standard deviations. Experiments were conducted with an Intel Xeon 2.33GHz CPU (E5410). ⋆These results were started with a treeRMN that only has unary features. §The CLL on the Kinship data is not comparable to previous approaches, because we treat each of its labels as one variable with 26 categories instead of 26 binary variables. †The results of existing methods were run on different machines (Intel Xeon 2.8GHz CPU), and their 10-fold data splits are independent of those used for the RMN models. They were allowed to run for up to 10-24 hours, and here we assume that these methods cannot achieve similar accuracy when the amount of training time is significantly reduced.

Second, training a treeRMN with CVI is only 2∼4 times slower than training a treeRMN of the same order of Markov dependency. On all three relational datasets, treeRMNs with CVI can significantly improve CLL and AUC. For the simple Animal dataset, the improvement is less significant because there is no long range dependency to be captured in this data. Although the CVI models have a similar number of features to the second order treeRMNs, their inference is much faster due to their much smaller Markov blankets. Finally, on all datasets, the treeRMNs with CVI achieve similar prediction quality to the existing methods (i.e., MSL, IRM and MRC), but are about two orders of magnitude more efficient in training. Specifically, they achieve significant improvements on the Animal and Nation data, but moderately worse results on the UML and Kinship data. Since both the UML and Kinship data have no attributes on basic entity types, composite entities become more important to model.
Therefore, we suspect that the MRC model achieves better performance because it can perform clustering on two-argument predicates, which correspond to composite entities.

7 Conclusions and Future Work

We have presented a novel approach for efficient relational learning, which consists of a restricted class of Relational Markov Networks (RMNs) called relation tree-based RMN (treeRMN) and an efficient hidden variable induction algorithm called Contrastive Variable Induction (CVI). By using simple treeRMNs, we achieve computational efficiency, and CVI can effectively detect hidden variables, which compensates for the limited expressive power of treeRMNs. Experiments on four real datasets show that the proposed relational learning approach can achieve state-of-the-art prediction accuracy and is much faster than existing relational Markov network models.
The presented approach can be improved in several aspects. First, to further speed up the treeRMN model, we can apply efficient Markov network feature selection methods [17][26] instead of systematically enumerating all possible feature templates. Second, as we explained at the end of Section 5, we would like to apply HVD to composite entity types. Third, we would also like to treat the introduced hidden variables as free variables and to make their cardinalities adaptive. Finally, we would like to explore high order features which involve more than two variable assignments.
Acknowledgements.
We gratefully acknowledge the support of NSF grant IIS-0811562 and NIH grant R01GM081293.

References
[1] Galen Andrew and Jianfeng Gao. Scalable training of ℓ1-regularized log-linear models. In ICML, 2007.
[2] Razvan C. Bunescu and Raymond J. Mooney. Collective information extraction with relational Markov networks. In ACL, 2004.
[3] Miguel A. Carreira-Perpinan and Geoffrey E. Hinton. On contrastive divergence learning.
In AISTATS, 2005.
[4] Gal Elidan and Nir Friedman. The information bottleneck EM algorithm. In UAI, 2003.
[5] Gal Elidan, Noam Lotner, Nir Friedman, and Daphne Koller. Discovering hidden variables: A structure-based approach. In NIPS, 2000.
[6] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In IJCAI, 1999.
[7] Yi Huang, Volker Tresp, and Stefan Hagen Weber. Predictive modeling using features derived from paths in relational graphs. Technical report, 2007.
[8] Ariel Jaimovich, Ofer Meshi, and Nir Friedman. Template-based inference in symmetric relational Markov random fields. In UAI, 2007.
[9] Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, 2006.
[10] Stanley Kok and Pedro Domingos. Learning the structure of Markov logic networks. In ICML, 2005.
[11] Stanley Kok and Pedro Domingos. Statistical predicate invention. In ICML, 2007.
[12] Stanley Kok and Pedro Domingos. Learning Markov logic networks using structural motifs. In ICML, 2010.
[13] Su-In Lee, Varun Ganapathi, and Daphne Koller. Efficient structure learning of Markov networks using ℓ1-regularization. In NIPS, 2006.
[14] Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. In NIPS, 2009.
[15] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In UAI, 1999.
[16] Iftach Nachman, Gal Elidan, and Nir Friedman. "Ideal parent" structure learning for continuous variable networks. In UAI, 2004.
[17] Simon Perkins, Kevin Lacker, and James Theiler. Grafting: Fast, incremental feature selection by gradient descent in function spaces. JMLR, 2003.
[18] Hoifung Poon and Pedro Domingos.
Joint inference in information extraction. In AAAI, 2007.
[19] Karen Sachs, Omar Perez, Dana Pe'er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 2005.
[20] Ilya Sutskever, Ruslan Salakhutdinov, and Josh Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In NIPS, 2009.
[21] Benjamin Taskar, Pieter Abbeel, and Daphne Koller. Discriminative probabilistic models for relational data. In UAI, 2002.
[22] Benjamin Taskar, Eran Segal, and Daphne Koller. Probabilistic classification and clustering in relational data. In IJCAI, 2001.
[23] Max Welling and Geoffrey E. Hinton. A new learning algorithm for mean field Boltzmann machines. In ICANN, 2001.
[24] Zhao Xu, Volker Tresp, Kai Yu, and Hans-Peter Kriegel. Infinite hidden relational models. In UAI, 2006.
[25] Alan Yuille. The convergence of contrastive divergence. In NIPS, 2004.
[26] Jun Zhu, Ni Lao, and Eric P. Xing. Grafting-light: Fast, incremental feature selection and structure learning of Markov random fields. In KDD, 2010.
[27] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 2005.