{"title": "Evidence-Specific Structures for Rich Tractable CRFs", "book": "Advances in Neural Information Processing Systems", "page_first": 352, "page_last": 360, "abstract": "We present a simple and effective approach to learning tractable conditional random fields with structure that depends on the evidence. Our approach retains the advantages of tractable discriminative models, namely efficient exact inference and exact parameter learning. At the same time, our algorithm does not suffer a large expressive power penalty inherent to fixed tractable structures. On real-life relational datasets, our approach matches or exceeds state-of-the-art accuracy of the dense models, and at the same time provides an order of magnitude speedup.", "full_text": "Evidence-Specific Structures for Rich Tractable CRFs

Anton Chechetka
Carnegie Mellon University
antonc@cs.cmu.edu

Carlos Guestrin
Carnegie Mellon University
guestrin@cs.cmu.edu

Abstract

We present a simple and effective approach to learning tractable conditional random fields with structure that depends on the evidence. Our approach retains the advantages of tractable discriminative models, namely efficient exact inference and arbitrarily accurate parameter learning in polynomial time. At the same time, our algorithm does not suffer a large expressive power penalty inherent to fixed tractable structures. On real-life relational datasets, our approach matches or exceeds state-of-the-art accuracy of the dense models, and at the same time provides an order of magnitude speedup.

1 Introduction

Conditional random fields (CRFs, [1]) have been successful in modeling complex systems, with applications from speech tagging [1] to heart motion abnormality detection [2].
A key advantage of CRFs over other probabilistic graphical models (PGMs, [3]) stems from the observation that in almost all applications some variables (we will denote such variables X) are unknown at test time, but others, called the evidence E, are known at test time. While other PGM formulations model the joint distribution P(X, E), CRFs directly model conditional distributions P(X | E).

The discriminative approach adopted by CRFs allows for better approximation quality of the learned conditional distribution P(X | E), because the representational power of the model is not "wasted" on modeling P(E). However, the better approximation comes at a cost of increased computational complexity for both structure [4] and parameter learning [1] as compared to generative models. In particular, unlike Bayesian networks or junction trees [3], (a) the likelihood of a CRF structure does not decompose into a combination of small subcomponent scores, making many existing approaches to structure learning inapplicable, and (b) instead of computing optimal parameters in closed form, with CRFs one has to resort to gradient-based methods. Moreover, computing the gradient of the log-likelihood with respect to the CRF parameters requires inference in the current model for every training datapoint. For high-treewidth models, even approximate inference is NP-hard [5].

To overcome the extra computational challenges posed by conditional random fields, practitioners usually resort to several of the following approximations throughout the process:

• CRF structure is specified by hand, leading to suboptimal structures.
• Approximate inference during parameter learning results in suboptimal parameters.
• Approximate inference at test time results in suboptimal results [5].
• Replacing the CRF conditional likelihood objective with a more tractable one (e.g.
[6]) results in suboptimal models (both in terms of learned structure and parameters).

Not only do all of the above approximation techniques lack any quality guarantees, but combining several of them in the same system serves to further compound the errors.

A well-known way to avoid approximations in CRF parameter learning is to restrict the models to have low treewidth, where the dependencies between the variables X have a tree-like structure. For such models, parameter learning and inference can be done exactly (see footnote 1); only structure learning involves approximations. The important dependencies between the variables X, however, usually cannot all be captured with a single tree-like structure, so low-treewidth CRFs are rarely used in practice.

In this paper, we argue that it is the commitment to a single CRF structure irrespective of the evidence E that makes tree-like CRFs an inferior option. We show that tree CRFs with evidence-dependent structure, learned by a generalization of the Chow-Liu algorithm [7], (a) yield results equal to or significantly better than densely connected CRFs on real-life datasets, and (b) are an order of magnitude faster than the dense models.
More specifically, our contributions are as follows:

• Formally define CRFs with evidence-specific (ES) structure.
• Observe that, given the ES structures, CRF feature weights can be learned exactly.
• Generalize the Chow-Liu algorithm [7] to learn evidence-specific structures for tree CRFs.
• Generalize tree CRFs with evidence-specific structure (ESS-CRFs) to the relational setting.
• Demonstrate empirically the superior performance of ESS-CRFs over densely connected models in terms of both accuracy and runtime on real-life relational models.

2 Conditional random fields

A conditional random field with pairwise features (see footnote 2) defines a conditional distribution P(X | E) as

P(X \mid E) = Z^{-1}(E) \exp\Big\{ \sum_{(i,j)\in T} \sum_k w_{ijk} f_{ijk}(X_i, X_j, E) \Big\},    (1)

where the functions f are called features, w are feature weights, Z(E) is the normalization constant (which depends on the evidence), and T is the set of edges of the model. To reflect the fact that P(X | E) depends on the weights w, we will write P(X | E, w). To apply a CRF model, one first defines the set of features f. A typical feature may mean that two pixels i and j in the same image segment tend to have similar colors: f(X_i, X_j, E) \equiv I(X_i = X_j, |color_i - color_j| < \delta), where I(\cdot) is an indicator function.
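As a concrete illustration of the pixel-similarity feature above, here is a minimal sketch in code. The helper name and the evidence layout (a dict of colors) are hypothetical, introduced only for this example:

```python
def make_similarity_feature(i, j, delta):
    """Pairwise indicator feature in the spirit of the example above:
    f(x_i, x_j, E) = I(x_i = x_j and |color_i - color_j| < delta).
    Label agreement is only rewarded when the evidence says the two
    pixels look alike."""
    def f(x_i, x_j, evidence):
        same_label = (x_i == x_j)
        similar = abs(evidence["color"][i] - evidence["color"][j]) < delta
        return float(same_label and similar)
    return f

# Two pixels whose colors differ by 4 units: the feature fires
# only when the labels agree.
f01 = make_similarity_feature(0, 1, delta=10)
E = {"color": {0: 120, 1: 124}}
```

Note that the feature depends on the evidence only through the colors, so whether it can fire at all is fixed once E is observed; this observation is what Section 3 builds on.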
Given the features f and training data D that consists of fully observed assignments to X and E, the optimal feature weights w* maximize the conditional log-likelihood (CLLH) of the data:

w^* = \arg\max_w \sum_{(X,E)\in D} \log P(X \mid E, w) = \arg\max_w \sum_{(X,E)\in D} \Big( \sum_{(i,j)\in T,\, k} w_{ijk} f_{ijk}(X_i, X_j, E) - \log Z(E, w) \Big).    (2)

The problem (2) does not have a closed-form solution, but it has a unique global optimum that can be found using any gradient-based optimization technique because of the following fact [1]:

Fact 1. The conditional log-likelihood (2), abbreviated CLLH, is concave in w. Moreover,

\frac{\partial \log P(X \mid E, w)}{\partial w_{ijk}} = f_{ijk}(X_i, X_j, E) - \mathbb{E}_{P(X_i, X_j \mid E, w)}\big[ f_{ijk}(X_i, X_j, E) \big],    (3)

where \mathbb{E}_P denotes expectation with respect to a distribution P.

Convexity of the negative CLLH objective and the closed-form expression for the gradient let us use convex optimization techniques such as L-BFGS [9] to find the unique optimum w*. However, the gradient (3) contains the conditional distribution over X_i, X_j, so computing (3) requires inference in the model for every datapoint. Time complexity of exact inference is exponential in the treewidth of the graph defined by the edges T [5]. Therefore, exact evaluation of the CLLH objective (2) and gradient (3), and exact inference at test time, are all only feasible for models with low-treewidth T.

Unfortunately, restricting the space of models to only those with low treewidth severely decreases the expressive power of CRFs. Complex dependencies of real-life distributions usually cannot be adequately captured by a single tree-like structure, so most of the models used in practice have high treewidth, making exact inference infeasible.
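For intuition, on a model small enough to normalize by brute-force enumeration, the objective (2) and gradient (3) can be computed directly. The sketch below (a hypothetical helper, not the paper's code) handles a single datapoint of a CRF over two binary variables; note how each gradient coordinate is exactly "feature value minus model expectation" as in (3):

```python
import math
from itertools import product

def cllh_and_grad(w, feats, x_obs, evidence):
    """Conditional log-likelihood and gradient for a toy CRF over two
    binary variables (X0, X1), normalized by enumerating all 4 states.
    grad_k = f_k(x_obs, E) - E_{P(X|E,w)}[f_k(X, E)]   (Eq. 3)."""
    def score(x):
        return sum(wk * fk(x[0], x[1], evidence) for wk, fk in zip(w, feats))
    states = list(product([0, 1], repeat=2))
    log_z = math.log(sum(math.exp(score(x)) for x in states))
    probs = {x: math.exp(score(x) - log_z) for x in states}
    grad = [fk(*x_obs, evidence)
            - sum(p * fk(x[0], x[1], evidence) for x, p in probs.items())
            for fk in feats]
    return score(x_obs) - log_z, grad
```

With a single agreement feature and w = 0, all four states are equally likely, so the model expectation of the feature is 0.5 and the gradient at an observed agreeing pair is 0.5. In a real tree CRF the enumeration is replaced by exact message passing, which is what keeps the treewidth restriction attractive.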
Instead, approximate inference techniques, such as\n\n1Here and in the rest of the paper, by \u201cexact parameter learning\u201d we will mean \u201cwith arbitrary accuracy\nin polynomial time\u201d using standard convex optimization techniques. This is in contrast to closed form exact\nparameter learning possible for generative low-treewidth models representing the joint distribution P (X ,E).\n2In this paper, we only consider the case of pairwise dependencies, that is, features f that depend on at most\ntwo variables from X (but may depend on arbitrary many variables from E). Our approach can be in principle\nextended to CRFs with higher order dependencies, but Chow-Liu algorithm for structure learning will have to\nbe replaced with an algorithm that learns low-treewidth junction trees, such as [8].\n\n2\n\n\fbelief propagation [10, 11] or sampling [12] are used for parameter learning and at test time. Ap-\nproximate inference is NP-hard [5], so approximate inference algorithms have very few result quality\nguarantees. Greater expressive power of the models is thus obtained at the expense of worse quality\nof estimated parameters and inference. Here, we show an alternative way to increase expressive\npower of tree-like structured CRFs without sacri\ufb01cing optimal weights learning and exact inference\nat test time. In practice, our approach is much better suited for relational than for propositional\nsettings, because of much higher parameters dimensionality in the propositional case. However, we\n\ufb01rst present in detail the propositional case theory to better convey the key high-level ideas.\n\n3 Evidence-speci\ufb01c structure for CRFs\n\nObserve that, given a particular evidence value E, the set of edges T in the CRF formulation (1)\nactually can be viewed as a supergraph of the conditional model over X . 
An edge (r, s) \in T can be "disabled" in the following sense: if for E = E the edge features are identically zero, f_{rsk}(X_r, X_s, E) \equiv 0, regardless of the values of X_r and X_s, then

\sum_{(i,j)\in T} \sum_k w_{ijk} f_{ijk}(X_i, X_j, E) \equiv \sum_{(i,j)\in T \setminus (r,s)} \sum_k w_{ijk} f_{ijk}(X_i, X_j, E),

and so for evidence value E, the model (1) with edges T is equivalent to (1) with (r, s) removed from T. The following notion of effective CRF structure captures the extra sparsity:

Definition 2. Given the CRF model (1) and evidence value E = E, the effective conditional model structure T(E = E) is the set of edges corresponding to features that are not identically zero:

T(E = E) = \{ (i, j) \mid (i, j) \in T, \exists k, x_i, x_j \text{ s.t. } f_{ijk}(x_i, x_j, E) \neq 0 \}.

If T(E) has low treewidth for all values of E, inference and parameter learning using the effective structure are tractable, even if the a priori structure T has high treewidth. Unfortunately, in practice the treewidth of T(E) is usually not much smaller than the treewidth of T. Low-treewidth effective structures are rarely used, because treewidth is a global property of the graph (even computing treewidth is NP-complete [13]), while feature design is a local process. In fact, it is the ability to learn optimal weights for a set of mutually correlated features without first understanding the inter-feature dependencies that is the key advantage of CRFs over other PGM formulations. Achieving low treewidth for the effective structures requires elaborate feature design, making model construction very difficult.
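Definition 2 is directly computable for discrete models: an edge survives exactly when some feature on it fires for some joint assignment under the given evidence. A minimal sketch (hypothetical names, binary variables assumed):

```python
def effective_structure(edges, features, evidence, x_values=(0, 1)):
    """Effective structure T(E) of Definition 2: keep edge (i, j) iff at
    least one of its features is not identically zero for this evidence."""
    kept = []
    for (i, j) in edges:
        fires = any(f(xi, xj, evidence) != 0
                    for f in features[(i, j)]
                    for xi in x_values
                    for xj in x_values)
        if fires:
            kept.append((i, j))
    return kept
```

For instance, a feature of the form `I(x_i = x_j) * e` disables its edge whenever the evidence term `e` is zero, which is exactly the sparsity the definition captures.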
Instead, in this work, we separate the construction of low-treewidth effective structures from feature design and weight learning, to combine the advantages of exact inference and discriminative weight learning, the high expressive power of high-treewidth models, and local feature design.

Observe that the CRF definition (1) can be written equivalently as

P(X \mid E, w) = Z^{-1}(E, w) \exp\Big\{ \sum_{ij} \sum_k w_{ijk} \times \big( I((i,j) \in T) \cdot f_{ijk}(X_i, X_j, E) \big) \Big\}.    (4)

Even though (1) and (4) are equivalent, in (4) the structure of the model is explicitly encoded as a multiplicative component of the features. In addition to the feature values f, the effective structure of the model is now controlled by the indicator functions I(\cdot). These indicator functions provide us with a way to control the treewidth of the effective structures independently of the features.

Traditionally, it has been assumed that the a priori structure T of a CRF model is fixed. However, such an assumption is not necessary. In this work, we assume that the structure is determined by the evidence E and some parameters u: T = T(E, u). The resulting model, which we call a CRF with evidence-specific structure (ESS-CRF), defines a conditional distribution P(X | E, w, u) as follows:

P(X \mid E, w, u) = Z^{-1}(E, w, u) \exp\Big\{ \sum_{ij} \sum_k w_{ijk} \big( I((i,j) \in T(E, u)) \cdot f_{ijk}(X_i, X_j, E) \big) \Big\}.    (5)

The dependence of the structure T on E and u can have different forms. We will provide one example of an algorithm for constructing evidence-specific CRF structures shortly.

ESS-CRFs have an important advantage over the traditional parametrization: in (5) the parameters u that determine the model structure are decoupled from the feature weights w. As a result, the problem of structure learning (i.e., optimizing u) can be decoupled from feature selection (choosing f) and feature weight learning (optimizing w).
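The gating in (5) is mechanically simple: an edge contributes its log-potential only if the structure algorithm selected it for this evidence value. A sketch of one edge's contribution to the exponent (hypothetical helper, not the paper's code):

```python
def gated_log_potential(i, j, weights, feats, tree_edges, x_i, x_j, evidence):
    """Edge (i, j)'s contribution to the ESS-CRF exponent of Eq. (5):
    I((i, j) in T(E, u)) * sum_k w_ijk * f_ijk(x_i, x_j, E).
    `tree_edges` is the output of the structure algorithm T(E, u)."""
    if (i, j) not in tree_edges:
        return 0.0  # edge gated off for this evidence value
    return sum(w * f(x_i, x_j, evidence) for w, f in zip(weights, feats))
```

The key point is that `weights` are shared across all evidence values, while `tree_edges` changes with E; this is exactly the decoupling of w from u described above.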
Such a decoupling makes it much easier to guarantee that the effective structure of the model has low treewidth, by relegating all the necessary global computation to the structure construction algorithm T = T(E, u). For any fixed choice of a structure construction algorithm T(·,·) and structure parameters u, as long as T(·,·) is guaranteed to return low-treewidth structures, learning optimal feature weights w* and inference at test time can be done exactly, because Fact 1 directly extends to the feature weights w in ESS-CRFs:

Algorithm 1: Standard CRF approach
1. Define features f_{ijk}(X_i, X_j, E), implicitly defining the high-treewidth CRF structure T.
2. Optimize weights w to maximize the conditional LLH (2) of the training data; use approximate inference to compute the CLLH objective (2) and gradient (3).
3. foreach E in test data do
4.   Use the conditional model (1) to define the conditional distribution P(X | E, w); use approximate inference to compute the marginals or the most likely assignment to X.

Algorithm 2: CRF with evidence-specific structures approach
1. Define features f_{ijk}(X_i, X_j, E); choose a structure learning algorithm T(E, u) that is guaranteed to return low-treewidth structures.
2. Define, or learn from data, the parameters u for the structure construction algorithm T(·,·); optimize weights w to maximize the conditional LLH log P(X | E, u, w) of the training data, using exact inference to compute the CLLH objective (2) and gradient (3).
3. foreach E in test data do
4.   Use the conditional model (5) to define the conditional distribution P(X | E, w, u); use exact inference to compute the marginals or the most likely assignment to X.

Observation 3. The conditional log-likelihood log P(X | E, w, u) of ESS-CRFs (5) is concave in w.
Also,

\frac{\partial \log P(X \mid E, w, u)}{\partial w_{ijk}} = I((i,j) \in T(E, u)) \big( f_{ijk}(X_i, X_j, E) - \mathbb{E}_{P(X_i, X_j \mid E, w, u)}\big[ f_{ijk}(X_i, X_j, E) \big] \big).    (6)

To summarize, instead of the standard CRF workflow (Alg. 1), we propose ESS-CRFs (Alg. 2). The standard approach has approximations (with little, if any, guarantees on the result quality) at every stage (lines 1, 2, 4), while in our ESS-CRF approach only structure selection (line 1) involves an approximation. Next, we present a simple but effective algorithm for learning evidence-specific tree structures, based on an existing algorithm for generative models. Many other existing structure learning algorithms can be similarly adapted to learn evidence-specific models of higher treewidth.

4 Conditional Chow-Liu algorithm for tractable evidence-specific structures

Learning the most likely PGM structure from data is in most cases intractable. Even for Markov random fields (MRFs), which are a special case of CRFs with no evidence, learning the most likely structure is NP-hard (c.f. [8]). However, for one very simple class of MRFs, namely tree-structured models, an efficient algorithm exists [7] that finds the most likely structure. In this section, we adapt this algorithm (called the Chow-Liu algorithm) to learning evidence-specific structures for CRFs.

Pairwise Markov random fields are graphical models that define a distribution over X as a normalized product of low-dimensional potentials: P(X) \equiv Z^{-1} \prod_{(i,j)\in T} \psi(X_i, X_j). Notice that pairwise MRFs are a special case of CRFs with f_{ij} = \log \psi_{ij}, w_{ij} = 1 and E = \emptyset. Unlike tree CRFs, however, the likelihood of tree MRF structures decomposes into contributions of individual edges:

LLH(T) = \sum_{(i,j)\in T} I(X_i, X_j) - \sum_{X_i \in X} H(X_i),    (7)

where I(·,·) is the mutual information and H(·) is the entropy.
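The decomposition (7) is what makes tree structure learning tractable: since the entropy term does not depend on T, maximizing (7) reduces to a maximum spanning tree problem over mutual-information edge weights, as recalled next. A pure-Python sketch using Kruskal's algorithm (hypothetical names; mutual informations are assumed precomputed):

```python
def chow_liu_tree(n, mutual_info):
    """Chow-Liu structure: maximum spanning tree of the complete graph
    on n variables, where edge (i, j) has weight I(X_i; X_j).
    `mutual_info` maps (i, j) pairs to mutual information values."""
    parent = list(range(n))  # union-find forest for cycle detection

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    tree = []
    # Kruskal on weights sorted in DECREASING order => maximum spanning tree.
    for (i, j) in sorted(mutual_info, key=mutual_info.get, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

The conditional variant of Alg. 3 runs exactly this procedure at test time, with the weights replaced by mutual informations computed under the estimated conditionals for the observed evidence.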
Therefore, as shown in [7], the most likely structure can be obtained by taking the maximum spanning tree of a fully connected graph, where the weight of an edge (i, j) is I(X_i, X_j). Pairwise marginals have relatively low dimensionality, so the marginals and the corresponding mutual informations can be estimated from data accurately, which makes the Chow-Liu algorithm a useful one for learning tree-structured models.

Given a concrete value E of the evidence E, one can write down the conditional version of the tree structure likelihood (7) for that particular value of evidence:

LLH(T \mid E) = \sum_{(i,j)\in T} I_{P(\cdot \mid E)}(X_i, X_j) - \sum_{X_i \in X} H_{P(\cdot \mid E)}(X_i).    (8)

If exact conditional distributions P(X_i, X_j | E) were available, then the same Chow-Liu algorithm would find the optimal conditional structure. Unfortunately, estimating conditional distributions P(X_i, X_j | E) with fixed accuracy in general requires an amount of data exponential in the dimensionality of E [14]. However, we can still plug in approximate conditionals \hat{P}(\cdot \mid E) learned from data using any standard density estimation technique (see footnote 3). In particular, with the same features f_{ijk} that are used in the CRF model, one can train a logistic regression model for \hat{P}(\cdot \mid E):

\hat{P}(X_i, X_j \mid E, u_{ij}) = Z_{ij}^{-1}(E, u_{ij}) \exp\Big\{ \sum_k u_{ijk} f_{ijk}(X_i, X_j, E) \Big\}.    (9)

Essentially, a logistic regression model is a small CRF over only two variables. Exact optimal weights u* can be found efficiently using standard convex optimization techniques.

The resulting evidence-specific structure learning algorithm T(E, u) is summarized in Alg. 3. Alg. 3 always returns a tree, and the better the quality of the estimators (9), the better the quality of the resulting structures. Importantly, Alg. 3 is by no means the only choice for the ESS-CRF approach. Other edge scores, e.g. from [4], and edge selection procedures, e.g. [8, 15] for higher-treewidth junction trees, can be used as components in the same way as the Chow-Liu algorithm is used in Alg. 3.

Algorithm 3: Conditional Chow-Liu algorithm for learning evidence-specific tree structures
// Parameter learning stage. u* is found, e.g., using L-BFGS with \hat{P}(\cdot) as in (9)
1. foreach X_i, X_j \in X do u*_{ij} \leftarrow \arg\max \sum_{(X,E)\in D_{train}} \log \hat{P}(X_i, X_j \mid E, u_{ij})
// Constructing structures at test time
2. foreach E \in D_{test} do
3.   foreach X_i, X_j \in X do set edge weight r_{ij}(E, u*_{ij}) \leftarrow I_{\hat{P}(X_i, X_j \mid E, u*_{ij})}(X_i, X_j)
4.   T(E, u*) \leftarrow maximum spanning tree(r(E, u*))

Algorithm 4: Relational ESS-CRF algorithm, parameter learning stage
1. Learn structure parameters u* using the conditional Chow-Liu algorithm (Alg. 3).
2. Let P(X | E, R, w, u) be defined as in (11).
3. w* \leftarrow \arg\max_w \log P(X \mid E, R, w, u*) // Find, e.g., with L-BFGS using the gradient (12)

5 Relational CRFs with evidence-specific structure

Traditional (also called propositional) PGMs are not well suited for dealing with relational data, where every variable is an entity of some type, and entities are related to each other via different types of links. Usually, there are relatively few entity types and link types. For example, webpages on the internet are linked via hyperlinks, and social networks link people via friendship relationships. Relational data violates the i.i.d. data assumption of traditional PGMs, and the huge dimensionalities of relational datasets preclude learning meaningful propositional models.
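The edge estimators of Eq. (9) that drive Alg. 3 are just tiny log-linear models over a pair of variables, normalized by enumerating their four joint states; the relational variant discussed next merely shares (templates) their weights across links. A minimal sketch (hypothetical names, binary variables):

```python
import math

def edge_conditional(u, feats, evidence, x_values=(0, 1)):
    """Eq. (9): P_hat(X_i, X_j | E, u_ij), a 'small CRF over two variables'.
    Normalization enumerates the joint states of (X_i, X_j)."""
    def score(xi, xj):
        return sum(uk * fk(xi, xj, evidence) for uk, fk in zip(u, feats))
    log_z = math.log(sum(math.exp(score(a, b))
                         for a in x_values for b in x_values))
    return {(a, b): math.exp(score(a, b) - log_z)
            for a in x_values for b in x_values}
```

With all weights at zero the four joint states are equiprobable; training the weights by maximizing log-likelihood is an ordinary logistic-regression fit, which is why exact optimal u* is cheap to obtain per edge.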
Instead, several formulations of relational PGMs have been proposed [16] to work with relational data, including relational CRFs. The key property of all these formulations is that the model is defined using a few template potentials defined at the abstract level of variable types and replicated as necessary for concrete entities.

More concretely, in relational CRFs every variable X_i is assigned a type m_i out of the set M of possible types. A binary relation R \in \mathcal{R}, corresponding to a specific type of link between two variables, specifies the types of its input arguments, a set of features f^R_k(\cdot, \cdot, E), and feature weights w^R_k. We will write X_i, X_j \in inst(R, X) if the types of X_i and X_j match the input types specified by the relation R and there is a link of type R between X_i and X_j in the data (for example, a hyperlink between two webpages). The conditional distribution P(X | E) is then generalized from the propositional CRF (1) by copying the template potentials for every instance of a relation:

P(X \mid E, \mathcal{R}, w) = Z^{-1}(E, w) \exp\Big\{ \sum_{R\in\mathcal{R}} \sum_{X_i, X_j \in inst(R, X)} \sum_k w^R_k f^R_k(X_i, X_j, E) \Big\}.    (10)

Observe that the only meaningful difference of the relational CRF (10) from the propositional formulation (1) is that the former shares the same parameters between different edges. By accounting for parameter sharing, it is straightforward to adapt our ESS-CRF formulation to the relational setting. We define the relational ESS-CRF conditional distribution as

P(X \mid E, \mathcal{R}, w, u) \propto \exp\Big\{ \sum_{R\in\mathcal{R}} \sum_{X_i, X_j \in inst(R, X)} I((i,j) \in T(E, u)) \sum_k w^R_k f^R_k(X_i, X_j, E) \Big\}.    (11)

Footnote 3: Notice that the approximation error from \hat{P}(\cdot) is the only source of approximation in our entire approach.

Figure 1: Left: test LLH for TEMPERATURE. Middle: TRAFFIC.
Right: classification errors for WebKB.

Given a structure learning algorithm T(·,·) that is guaranteed to return low-treewidth structures, one can learn optimal feature weights w* and perform inference at test time exactly:

Observation 4. The relational ESS-CRF log-likelihood is concave with respect to w. Moreover,

\frac{\partial \log P(X \mid E, \mathcal{R}, w, u)}{\partial w^R_k} = \sum_{X_i, X_j \in inst(R, X)} I((i,j) \in T(E, u)) \big( f^R_k(X_i, X_j, E) - \mathbb{E}_{P(\cdot \mid E, \mathcal{R}, w, u)}\big[ f^R_k(X_i, X_j, E) \big] \big).    (12)

The conditional Chow-Liu algorithm (Alg. 3) can also be extended to the relational setting by using templated logistic regression weights for estimating edge conditionals. The resulting algorithm is shown as Alg. 4. Observe that the test phase of Alg. 4 is exactly the same as for Alg. 3. In the relational setting, one only needs to learn O(|R|) parameters, regardless of the dataset size, for both structure selection and feature weights, as opposed to O(|X|^2) parameters in the propositional case. Thus, relational ESS-CRFs are typically much less prone to overfitting than propositional ones.

6 Experiments

We have tested the ESS-CRF approach on both propositional and relational data. With the large number of parameters needed in the propositional case (O(|X|^2)), our approach is only practical when data is abundant, so our experiments with propositional data serve only as a proof of concept, verifying that ESS-CRF can successfully learn a model better than a single-tree baseline. In contrast to the propositional setting, in the relational case the relatively low parameter space dimensionality almost eliminates the overfitting problem. As a result, on relational datasets ESS-CRF is a very attractive approach in practice.
Our experiments show ESS-CRFs comfortably outperforming state-of-the-art high-treewidth discriminative models on several real-life relational datasets.

6.1 Propositional models

We compare ESS-CRFs with fixed tree CRFs, where the tree structure is learned by the Chow-Liu algorithm using P(X). We used the TEMPERATURE sensor network data [17] (52 discretized variables) and the San Francisco TRAFFIC data [18] (we selected 32 variables). In both cases, 5 variables were used as the evidence E and the rest as the unknowns X. The results are in Fig. 1. We have found it useful to regularize the conditional Chow-Liu algorithm (Alg. 3) by choosing at test time only from the edges that have been selected often enough during training. In Fig. 1 we plot results for both the regularized (red) and unregularized (blue) versions. One can see that in the limit of plentiful data ESS-CRF does indeed outperform the fixed-tree baseline. However, because the space of available models is much larger for ESS-CRF, overfitting becomes an important issue and regularization is important.

6.2 Relational models

Face recognition. We evaluate ESS-CRFs on two relational models. The first model, called FACES, aims to improve face recognition in collections of related images using information about similarity between different faces in addition to the standard single-face features. The key idea is that whenever two people in different images look similar, they are more likely to be the same person. Our model has a variable X_i, denoting the label, for every face blob. Pairwise features f(X_i, X_j, E), based on blob color similarity, indicate how close two faces are in appearance. Single-variable features f(X_i, E) encode information such as the output of an off-the-shelf standalone face classifier or the face location within the image (see [19] for details).
The model is used in a semi-supervised way: at test time, a PGM is instantiated jointly over the train and test entities, the values of the train entities are fixed to the ground truth, and inference finds the (approximately) most likely labels for the test entities.

[Figure 1 plots: test LLH vs. train set size for TEMPERATURE and TRAFFIC (ESS-CRF, ESS-CRF + structure reg., Chow-Liu CRF), and WebKB classification error for SVM, ESS-CRF, RMN, and M3N.]

Figure 2: Results for FACES datasets. Top: evolution of classification accuracy as inference progresses over time. Stars show the moment when ESS-CRF finishes running. The horizontal dashed line indicates the resulting accuracy. For FACES 3, sum-product and max-product gave the same accuracy. Bottom: time to convergence.

We compare ESS-CRFs with a dense relational PGM encoded by a Markov logic network (MLN, [20]) using the same features. We used a state-of-the-art MLN implementation in the Alchemy package [21], with the MC-SAT sampling algorithm for discriminative parameter learning and belief propagation [22] for inference. For the MLN, we had to threshold the pairwise features indicating the likelihood of label agreement and set those under the threshold to 0 to prevent (a) oversmoothing and (b) very long inference times. Also, to prevent oversmoothing by the MLN, we have found it useful to scale down the pairwise feature weights learned during training, thus weakening the smoothing effect of any single edge in the model (see footnote 4). We denote models with weights adjusted in this way as MLN+.
No thresholding or weight adjustment was done for ESS-CRFs.

Figure 2 shows the results on three separate datasets: FACES 1 with 1720 images, 4 unique people, and 100 training images in every fold; FACES 2 with 245 images, 9 unique people, and 50 training images; and FACES 3 with 352 images, 24 unique people, and 70 training images. We tried both sum-product and max-product BP for inference, denoted sum and max correspondingly in Fig. 2. For ESS-CRF the choice made no difference. One can see that (a) the ESS-CRF model provides accuracy superior (FACES 2 and 3) or equal (FACES 1) to the dense MLN model, even with the extra heuristic weight tweaking for the MLN, and (b) ESS-CRF is more than an order of magnitude faster. For the FACES model, ESS-CRF is clearly superior to the high-treewidth alternative.

Hypertext data. For the WebKB data (see [23] for details), the task is to label webpages from four computer science departments as course, faculty, student, project, or other, given their text and link structure. We compare ESS-CRFs to high-treewidth relational Markov networks (RMNs, [23]), max-margin Markov networks (M3Ns, [24]), and a standalone SVM classifier. All the relational PGMs use the same single-variable features encoding the webpage text, and pairwise features encoding the link structure. The baseline SVM classifier only uses single-variable features. RMNs and ESS-CRFs are trained to maximize the conditional likelihood of the labels, while M3Ns maximize the margin in likelihood between the correct assignment and all of the incorrect ones, explicitly targeting the classification. The results are in Fig. 1. Observe that ESS-CRF matches the accuracy of the high-treewidth RMNs, again showing that the smaller expressive power of tree models can be fully compensated by exact parameter learning and inference. ESS-CRF is much faster than the RMN, taking only 50 sec. to train and 0.3 sec.
to test on a single core of a 2.7GHz Opteron CPU. The RMN and M3N models take about 1500 sec. each to train on a 700MHz Pentium III. Even accounting for the CPU speed difference, the speedup is significant. ESS-CRF does not achieve the accuracy of M3Ns, which use a different objective more directly related to the classification problem, as opposed to density estimation. Still, the RMN results indicate that it may be possible to match the M3N accuracy with much faster tractable ESS models by replacing the CRF conditional likelihood objective with the max-margin objective, which is an important direction of future work.

Footnote 4: Because the number of pairwise relations in the model grows quadratically with the number of variables, the "per-variable force of smoothing" grows with the dataset size, hence the need to adjust.

[Figure 2 plots: accuracy vs. time and time to convergence (parameter learning and inference) for FACES 1-3, comparing ESS-CRF with MLN and MLN+ under sum-product and max-product BP.]

7 Related work and conclusions

Related work. Two cornerstones of our ESS-CRF approach, namely using models that become more sparse when evidence is instantiated, and using multiple tractable models to avoid the restrictions on expressive power inherent to low-treewidth models, have been discussed in the existing literature.
First, context-specific independence (CSI, [25]) has long been used both for speeding up\ninference [25] and regularizing the model parameters [26]. However, so far CSI has been treated\nas a local property of the model, which made reasoning about the resulting treewidth of\nevidence-specific models impossible. Thus, the full potential of exact inference for models with CSI remained\nunused. Our work is a step towards fully exploiting that potential. Multiple tractable models, such\nas trees, are widely used as components of mixtures (e.g. [27]), including mixtures of all possible\ntrees [28], to approximate distributions with rich inherent structure. Unlike the mixture models, our\napproach of selecting a single structure for any given evidence value has the advantage of allowing\nfor efficient exact decoding of the most probable assignment to the unknowns X using the Viterbi\nalgorithm [29]. Both for the mixture models and our approach, joint optimization of the structure\nand weights (u and w in our notation) is infeasible due to the many local optima of the objective. Our\none-shot structure learning algorithm, as we empirically demonstrated, works well in practice. It is\nalso much faster than expectation maximization [30], the standard way to train mixture models.\nLearning the CRF structure in general is NP-hard, which follows from the hardness results for\ngenerative models (cf. [8]). Moreover, CRF structure learning is further complicated by the fact that\nthe CRF structure likelihood does not decompose into scores of local graph components, as do the\nscores for some generative models [3]. Existing work on CRF structure learning thus provides only\nlocal guarantees. In practice, the hardness of CRF structure learning leads to the high popularity of\nheuristics: chain and skip-chain [32] structures are often used, as well as grid-like structures. All\nthe approaches that do learn structure from data can be broadly divided into three categories. 
First,\nthe CRF structure can be defined via the sparsity pattern of the feature weights, so one can use an L1\nregularization penalty to achieve sparsity during weight learning [2]. The second type of approach\ngreedily adds features to the CRF model so as to maximize the immediate improvement in the\n(approximate) model likelihood (e.g. [31]). Finally, one can try to approximate the CRF structure\nscore as a combination of local scores [15, 4] and use an algorithm for learning generative structures\n(where the score actually decomposes). ESS-CRF also falls in this category of approaches. Although\nthere are some negative theoretical results about the learnability of even the simplest CRF structures\nusing local scores [4], such approaches often work well in practice [15].\nLearning the weights is straightforward for tractable CRFs, because the log-likelihood is concave [1]\nand the gradient (3) can be used with mature convex optimization techniques. So far, exact weight\nlearning has mostly been used for special hand-crafted structures, such as chains [1, 32], but in this\nwork we use arbitrary trees. For dense structures, computing the gradient (3) exactly is intractable,\nas even approximate inference in general models is NP-hard [5]. As a result, approximate inference\ntechniques, such as belief propagation [10, 11] or Gibbs sampling [12], are employed, without\nguarantees on the quality of the result. Alternatively, an approximation of the objective (e.g. [6]) is used,\nalso yielding suboptimal weights. Our experiments showed that exact weight learning for tractable\nmodels gives an advantage in approximation quality and efficiency over dense structures.\nConclusions and future work. 
To summarize, we have shown that in both propositional and relational\nsettings, tractable CRFs with evidence-specific structures, a class of models with expressive\npower greater than any single tree-structured model, can be constructed by relying only on the\nglobally optimal results of efficient algorithms (logistic regression, the Chow-Liu algorithm, exact\ninference in tree-structured models, L-BFGS for convex differentiable functions). Whereas the\ntraditional CRF workflow (Alg. 1) involves approximations without quality guarantees at multiple\nstages of the process, our approach, ESS-CRF (Alg. 2), has just one source of approximation, namely\nthe conditional structure scores. We have demonstrated on real-life relational datasets that our\napproach matches or exceeds the accuracy of state of the art dense discriminative models, and at the\nsame time provides more than an order of magnitude speedup. Important future work directions are\ngeneralizing ESS-CRF to larger treewidths and max-margin weight learning for better classification.\nAcknowledgements. This work is supported by NSF Career IIS-0644225 and ARO MURI\nW911NF0710287 and W911NF0810242. We thank Ben Taskar for sharing the WebKB data.\nThe FACES model and data were developed jointly with Denver Dash and Matthai Philipose.\n\nReferences\n[1] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for\nsegmenting and labeling sequence data. In ICML, 2001.\n[2] M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion\nabnormality detection. In CVPR, 2008.\n[3] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. 2009.\n[4] J. K. Bradley and C. Guestrin. Learning tree conditional random fields. In ICML, to appear, 2010.\n[5] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1-2), 1996.\n[6] C. Sutton and A. McCallum. 
Piecewise pseudolikelihood for efficient CRF training. In ICML, 2007.\n[7] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE\nTrans. on Inf. Theory, 14(3), 1968.\n[8] D. Karger and N. Srebro. Learning Markov networks: Maximum bounded tree-width graphs. In SODA, 2001.\n[9] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization.\nMathematical Programming, 45(3), 1989.\n[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988.\n[11] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS, 2000.\n[12] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of\nimages. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6), 1984.\n[13] S. Arnborg, D. G. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k-tree. SIAM\nJournal on Algebraic and Discrete Methods, 8(2), 1987.\n[14] W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models. 2004.\n[15] D. Shahaf, A. Chechetka, and C. Guestrin. Learning thin junction trees via graph cuts. In AISTATS, 2009.\n[16] L. Getoor and B. Taskar. Introduction to Statistical Relational Learning. The MIT Press, 2007.\n[17] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in\nsensor networks. In VLDB, 2004.\n[18] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI, 2005.\n[19] A. Chechetka, D. Dash, and M. Philipose. Relational learning for collective classification of entities in\nimages. In AAAI Workshop on Statistical Relational AI, 2010.\n[20] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2), 2006.\n[21] S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, D. Lowd, and P. Domingos. 
The Alchemy system\nfor statistical relational AI. Technical report, University of Washington, Seattle, WA, 2009.\n[22] J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In\nAISTATS, 2009.\n[23] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In UAI, 2002.\n[24] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.\n[25] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian\nnetworks. In UAI, 1996.\n[26] M. desJardins, P. Rathod, and L. Getoor. Bayesian network learning with abstraction hierarchies and\ncontext-specific independence. In ECML, 2005.\n[27] B. Thiesson, C. Meek, D. Chickering, and D. Heckerman. Learning mixtures of DAG models. In UAI, 1997.\n[28] M. Meilă and M. I. Jordan. Learning with mixtures of trees. JMLR, 1, 2001.\n[29] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm.\nIEEE Transactions on Information Theory, IT-13, 1967.\n[30] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational\nStatistics & Data Analysis, 19(2), 1995.\n[31] A. Torralba, K. P. Murphy, and W. T. Freeman. Contextual models for object detection using boosted\nrandom fields. In NIPS, 2004.\n[32] C. Sutton and A. McCallum. Collective segmentation and labeling of distant entities in information\nextraction. In ICML Workshop on Statistical Relational Learning and Its Connections, 2004.\n