{"title": "Learning Efficient Markov Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 748, "page_last": 756, "abstract": "We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature or its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach compares favorably with thin junction trees and other Markov network structure learners.", "full_text": "Learning Efficient Markov Networks

Vibhav Gogate William Austin Webb Pedro Domingos

Department of Computer Science & Engineering
University of Washington
Seattle, WA 98195. USA

{vgogate,webb,pedrod}@cs.washington.edu

Abstract

We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. 
Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners.

1 Introduction

Markov networks (also known as Markov random fields, etc.) are an attractive class of joint probability models because of their generality and flexibility. However, this generality comes at a cost. Inference in Markov networks is intractable [25], and approximate inference schemes can be unreliable, and often require much hand-crafting. Weight learning has no closed-form solution, and requires convex optimization. Computing the gradient for optimization in turn requires inference. Structure learning – the problem of finding the features of the Markov network – is also intractable [15], and has weight learning and inference as subroutines.

Intractable inference and weight optimization can be avoided if we restrict ourselves to decomposable Markov networks [22]. A decomposable model can be expressed as a product of distributions over the cliques in the graph divided by the product of the distributions of their intersections. An arbitrary Markov network can be converted into a decomposable one by triangulation (adding edges until every cycle of length four or more has at least one chord). 
The resulting structure is called a junction tree. Goldman [13] proposed a method for learning Markov networks without numeric optimization based on this idea. Unfortunately, the triangulated network can be exponentially larger than the original one, limiting the applicability of this method. More recently, a series of papers have proposed methods for directly learning junction trees of bounded treewidth ([2, 21, 8], etc.). Unfortunately, since the complexity of inference (and typically of learning) is exponential in the treewidth, only models of very low treewidth (typically 2 or 3) are feasible in practice, and thin junction trees have not found wide applicability.

Fortunately, low treewidth is an overly strong condition. Models can have high treewidth and still allow tractable inference and closed-form weight learning from a reasonable number of samples, by exploiting context-specific independence [6] and determinism [7]. Both of these result in clique distributions that can be compactly expressed even if the cliques are large. In this paper we propose a learning algorithm based on this observation. Inference algorithms that exploit context-specific independence and determinism [7, 26, 11] have a common structure: they search for partial assignments to variables that decompose the remaining variables into independent subsets, and recurse on these smaller problems until trivial ones are obtained. Our algorithm uses a similar strategy, but at learning time: it recursively attempts to find features (i.e., partial variable assignments) that decompose the problem into smaller (nearly) independent subproblems, and stops when the data does not warrant further decomposition.

Decomposable models can be expressed as both Markov networks and Bayesian networks, and state-of-the-art Bayesian network learners extensively exploit context-specific independence [9]. However, they typically still learn intractable models. 
Lowd and Domingos [18] learned tractable high-treewidth Bayesian networks by penalizing inference complexity along with model complexity in a standard Bayesian network learner. Our approach can learn exponentially more compact models by exploiting the additional flexibility of Markov networks, where features can overlap in arbitrary ways. It can greatly speed up learning relative to standard Markov network learners because it avoids weight optimization and inference, while Lowd and Domingos' algorithm is much slower than standard Bayesian network learning (where, given complete data, weight optimization and inference are already unnecessary). Perhaps most significantly, it is also more fundamental in that it is based on identifying what makes inference tractable and directly exploiting it, potentially leading to a much better accuracy/inference cost trade-off. As a result, our approach has formal guarantees, which Lowd and Domingos' algorithm lacks.

We provide both theoretical guarantees and empirical evidence for our approach. First, we provide probabilistic performance guarantees for our algorithm by making certain assumptions about the underlying distribution. These results rely on exhaustive search over features up to length k. (The treewidth of the resulting model can still be as large as the number of variables.) We then propose greedy heuristics for more efficient learning, and show empirically that the Markov networks learned in this way are more accurate than thin junction trees as well as networks learned using the algorithm of Della Pietra et al. [12] and L1 regularization [16, 24], while allowing much faster inference (which in practice translates into more accurate query answers).

2 Background: Junction Trees and Feature Graphs

We denote sets by capital letters and members of a set by small letters. A double capital letter denotes a set of subsets. 
We assume that all random variables have binary domains {0,1} (or {false,true}). We make this assumption for simplicity of exposition; our analysis extends trivially to multi-valued variables.

We begin with some necessary definitions. An atomic feature or literal is an assignment of a value to a variable. x denotes the assignment x = 1 while ¬x denotes x = 0 (note that the distinction between an atomic feature x and the variable which is also denoted by x is usually clear from context). A feature, denoted by F, defined over a subset of variables V(F) is formed by conjoining atomic features or literals, e.g., x1 ∧ ¬x2 is a feature formed by conjoining two atomic features x1 and ¬x2. Given an assignment, denoted by V(F), to all variables of F, F is said to be satisfied or assigned the value 1 iff for all literals l ∈ F, it also holds that l ∈ V(F). A feature that is not satisfied is said to be assigned the value 0. Often, given a feature F, we will abuse notation and write V(F) as F.

A Markov network or a log-linear model is defined as a set of pairs (Fi, wi) where Fi is a feature and wi is its weight. It represents the following joint probability distribution:

P(V) = (1/Z) exp( Σ_i wi × Fi(V_{V(Fi)}) )    (1)

where V is a truth-assignment to all variables V = ∪_i V(Fi), Fi(V_{V(Fi)}) = 1 if V_{V(Fi)} satisfies Fi, and 0 otherwise, and Z is the normalization constant, often called the partition function.

Next, we define junction trees. Let C = {C1, . . . , Cm} be a collection of subsets of V such that: (a) ∪_{i=1}^m Ci = V and (b) for each feature Fj, there exists a Ci ∈ C such that all variables of Fj are contained in Ci. Each Ci is referred to as a clique.

[Figure 1, recoverable content:
(a) A feature tree: the root F-node splits on x1 ∧ x2; under the assignment 0 it has child F-nodes x3 ∧ x4 and x5 ∧ x6, and under the assignment 1 it has child F-nodes x3 ∧ x5 and x4 ∧ x6, each with two leaf A-nodes labeled 0 and 1.
(b) A Markov network, with features and weights:
¬(x1 ∧ x2) ∧ ¬(x3 ∧ x4) : w1
¬(x1 ∧ x2) ∧ (x3 ∧ x4) : w2
¬(x1 ∧ x2) ∧ ¬(x5 ∧ x6) : w3
¬(x1 ∧ x2) ∧ (x5 ∧ x6) : w4
(x1 ∧ x2) ∧ ¬(x3 ∧ x5) : w5
(x1 ∧ x2) ∧ (x3 ∧ x5) : w6
(x1 ∧ x2) ∧ ¬(x4 ∧ x6) : w7
(x1 ∧ x2) ∧ (x4 ∧ x6) : w8
(c) A junction tree with cliques {x1, x2, x4, x5, x6} and {x1, x2, x3, x4, x5} joined by the separator {x1, x2, x4, x5}.]

Figure 1: Figure showing (a) a feature tree, (b) the Markov network corresponding to the leaf features of (a) and (c) the (optimal) junction tree for the Markov network in (b). A leaf feature is formed by conjoining the feature assignments along the path from the leaf to the root. For example, the feature corresponding to the rightmost leaf node is: (x1 ∧ x2) ∧ (x4 ∧ x6). For the feature tree, ovals denote F-nodes and rectangles denote A-nodes. For the junction tree, ovals denote cliques and rectangles denote separators. Notice that each F-node in the feature tree has a feature of size bounded by 2 while the maximum clique in the junction tree is of size 5. Moreover, notice that the A-node corresponding to (x1 ∧ x2) = 0 induces a different variable decomposition as compared with the A-node corresponding to (x1 ∧ x2) = 1.

DEFINITION 1. A tree T = (C, E) is a junction tree iff it satisfies the running intersection property, i.e., ∀Ci, Cj, Ck ∈ C, i ≠ j ≠ k, such that Ck lies on the unique simple path between Ci and Cj, x ∈ Ci ∩ Cj ⇒ x ∈ Ck. The treewidth of T, denoted by w, is the size of the largest clique in C minus one. 
The set Sij ≡ Ci ∩ Cj is referred to as the separator corresponding to the edge (i − j) ∈ E.

The space complexity of representing a junction tree is O(Σ_{i=1}^m 2^{|Ci|}) ≡ O(n × 2^{w+1}).

Our goal is to exploit context-specific and deterministic dependencies that are not explicitly represented in junction trees. Representations that do this include arithmetic circuits [10] and AND/OR graphs [11]. We will use a more convenient form for our purposes, which we call feature graphs. Inference in feature graphs is linear in the size of the graph. For readers familiar with AND/OR graphs [11], a feature tree (or graph) is simply an AND/OR tree (or graph) with OR nodes corresponding to features and AND nodes corresponding to feature assignments.

DEFINITION 2. A feature tree, denoted by ST, is a rooted tree that consists of alternating levels of feature nodes or F-nodes and feature assignment nodes or A-nodes. Each F-node F is labeled by a feature F and has two child A-nodes labeled by 0 and 1, corresponding to the false and the true assignments of F respectively. Each A-node A has k ≥ 0 child F-nodes that satisfy the following requirement. Let {FA,1, . . . , FA,k} be the set of child F-nodes of A and let D(FA,i) be the union of all variables involved in the features associated with FA,i and all its descendants; then ∀i, j ∈ {1, . . . , k}, i ≠ j, D(FA,i) ∩ D(FA,j) = ∅.

Semantically, each F-node represents conditioning while each A-node represents partitioning of the variables into conditionally independent subsets. The space complexity of representing a feature tree is the number of its A-nodes. A feature graph, denoted by SG, is formed by merging identical subtrees of a feature tree ST. 
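As an illustrative aside (not from the original paper), the alternating F-node/A-node structure of Definition 2 can be mirrored by a small recursive data type. The sketch below builds the feature tree of Figure 1(a) and counts its A-nodes, the measure of space cost used here; names such as `FNode`, `ANode` and `pair_fnode` are our own.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ANode:
    """Feature-assignment node; an A-node with no children is a leaf."""
    children: List["FNode"] = field(default_factory=list)

@dataclass
class FNode:
    """Feature node: a conjunction of literals, e.g. {'x1': 1, 'x2': 1},
    with one child A-node per assignment (0 = false, 1 = true)."""
    feature: Dict[str, int]
    on_false: ANode = field(default_factory=ANode)
    on_true: ANode = field(default_factory=ANode)

def count_anodes(f: FNode) -> int:
    """Space cost of a feature tree = the number of its A-nodes."""
    return sum(1 + sum(count_anodes(c) for c in a.children)
               for a in (f.on_false, f.on_true))

def pair_fnode(u: str, v: str) -> FNode:
    """F-node for the feature u ∧ v, with two leaf A-nodes."""
    return FNode({u: 1, v: 1})

# Figure 1(a): the root splits on x1 ∧ x2; each assignment induces a
# different decomposition of the remaining variables into independent pairs.
root = FNode({"x1": 1, "x2": 1},
             on_false=ANode([pair_fnode("x3", "x4"), pair_fnode("x5", "x6")]),
             on_true=ANode([pair_fnode("x3", "x5"), pair_fnode("x4", "x6")]))

print(count_anodes(root))  # → 10, the space cost quoted for Figure 1(a)
```

The disjointness requirement of Definition 2 (child F-nodes of an A-node cover disjoint variable sets) is not enforced by this sketch; it is a property the learner must maintain.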
It is easy to show that a feature graph generalizes a junction tree; in fact, any model that can be represented using a junction tree having treewidth k can also be represented by a feature graph that uses only O(n × 2^k) space [11]. In some cases, a feature graph can be exponentially smaller than a junction tree because it can capture context-specific independence [6].

A feature tree can be easily converted to a Markov network. The corresponding Markov network has one feature for each leaf node, formed by conjoining all feature assignments from the root to the leaf. The following example demonstrates the relationship between a feature tree, a Markov network and a junction tree.

EXAMPLE 1. Figure 1(a) shows a feature tree. Figure 1(b) shows the Markov network corresponding to the leaf features of the feature tree given in Figure 1(a). Figure 1(c) shows the junction tree for the Markov network given in 1(b). Notice that because the feature tree uses context-specific independence, all the F-nodes in the feature tree have a feature of size bounded by 2 while the maximum clique size of the junction tree is 5. The junction tree given in Figure 1(c) requires 2^5 × 2 = 64 potential values while the feature tree given in Figure 1(a) requires only 10 A-nodes.

In this paper, we will present structure learning algorithms to learn feature trees only. We can do this without loss of generality, because a feature graph can be constructed by caching information and merging identical nodes, while learning (constructing) a feature tree.

The distribution represented by a feature tree ST can be defined procedurally as follows (for more details see [11]). We assume that each leaf A-node Al is associated with a weight w(Al). For each A-node A and each F-node F, we associate a value, denoted by v(A) and v(F) respectively. We compute these values recursively as follows, from the leaves to the root. 
The value of all A-nodes is initialized to 1 while the value of all F-nodes is initialized to 0. The value of a leaf A-node Al is w(Al) × #(M(Al)), where #(M(Al)) is the number of (full) variable assignments that satisfy the constraint M(Al) formed by conjoining the feature assignments from the root to Al. The value of an internal F-node is the sum of the values of its child A-nodes. The value of an internal A-node Ap that has k children is the product of the values of its child F-nodes divided by [#(M(Ap))]^{k−1} (the division takes care of double counting). Let v(Fr) be the value of the root node, computed as described above. Let V be an assignment to all variables V of the feature tree; then:

P(V) = v_V(Fr) / v(Fr)

where v_V(Fr) is the value of the root node of ST computed as above, in which each leaf A-node is instead initialized to w(Al) if V satisfies the constraint formed by conjoining the feature assignments from the root to Al, and to 0 otherwise.

3 Learning Efficient Structure

Algorithm 1: LMIP: Low Mutual Information Partitioning
Input: A variable set V, sample data D, mutual information subroutine I, a feature assignment F, threshold δ, max set size q.
Output: A set of subsets of V
QF = {Q1, . . . , Q|V|}, where Qi = {xi} // QF is a set of singletons
if the subset of D that satisfies F is too small then
    return QF
else
    for A ⊆ V, |A| ≤ q do
        if min_{X⊂A} I(X, A\X | F) > δ then
            // find the min using Queyranne's algorithm [23] applied to the subset of D satisfying F
            merge all Qi ∈ QF s.t. Qi ∩ A ≠ ∅
    return QF

We propose a feature-based structure learning algorithm that searches for a feature that divides the configuration space into subspaces. 
We will assume that the selected feature or its negation divides the (remaining) variables into conditionally independent partitions (we don't require this assumption to be always satisfied, as we explain in the section on greedy heuristics and implementation details). In practice, the notion of conditional independence is too strong. Therefore, as in previous work [21, 8], we instead use conditional mutual information, denoted by I, to partition the set of variables. For this we use the LMIP subroutine (see Algorithm 1), a variant of Chechetka and Guestrin's [8] LTCI algorithm that outputs a partitioning of V. The runtime guarantees of LMIP follow from those of LTCI and correctness guarantees follow in an analogous fashion. In general, estimating mutual information between sets of random variables has time and sample complexity exponential in the number of variables considered. However, we can be more efficient, as we show below. We start with a required definition.

DEFINITION 3. Given a feature assignment F, a distribution P(V) is (j, ε, F)-coverable if there exists a set of cliques C such that for every Ci ∈ C, |Ci| ≤ j and I(Ci, V \ Ci | F) ≤ ε. Similarly, given a feature F, a distribution P(V) is (j, ε, F)-coverable if it is both (j, ε, F = 0)-coverable and (j, ε, F = 1)-coverable.

LEMMA 1. Let A ⊂ V. Suppose there exists a distribution on V that is (j, ε, F)-coverable and ∀X ⊂ V where |X| ≤ j, it holds that I(X ∩ A, X ∩ (V \ A) | F) ≤ δ. Then, I(A, V \ A | F) ≤ |V|(2ε + δ).

Lemma 1 immediately leads to the following lemma:

LEMMA 2. Let P(V) be a distribution that is (j, ε, F)-coverable. Then LMIP, for q ≥ j, returns a partitioning of V into disjoint subsets {Q1, . . . , Qm} such that ∀i, I(Qi, V \ Qi | F) ≤ |V|(2ε + (j − 1)δ).

We summarize the time and space complexity of LMIP in the following lemma.

LEMMA 3. The time and space complexity of LMIP is O( (n choose q) × n × J^{MI}_q ), where J^{MI}_q is the time complexity of estimating the mutual information between two disjoint sets which have combined cardinality q.

Note that our actual algorithm will use a subroutine that estimates mutual information from data, and the time complexity of this routine will be described in the section on sample complexity and probabilistic performance guarantees.

Algorithm 2: LEM: Learning Efficient Markov Networks
Input: Variable set V, sample data S, mutual information subroutine I, feature length k, set size parameter q, threshold δ, an A-node A.
Output: A feature tree M
for each feature F of length k constructible for V do
    QF=1 = LMIP(V, S, I, F = 1, δ, q);
    QF=0 = LMIP(V, S, I, F = 0, δ, q)
G = argmax_F (Score(QF=0) + Score(QF=1)) // G is a feature
if |QG=0| = 1 and |QG=1| = 1 then
    Create a feature tree corresponding to all possible assignments to the atomic features. Add this feature tree as a child of A;
    return
Create an F-node G with G as its feature, and add it as a child of A;
Create two child A-nodes AG,0 and AG,1 for G;
for i ∈ {0, 1} do
    if |QG=i| > 1 then
        for each component (subset of V) C ∈ QG=i do
            SC = ProjectC({X ∈ S : X satisfies G = i}) // SC is the set of instantiations of V in S that satisfy G = i, restricted to the variables in C
            LEM(C, SC, I, k, q, δ, AG,i) // Recursion
    else
        Create a feature tree corresponding to all possible assignments to the atomic features. Add this feature tree as a child of AG,i.

Next, we present our structure learning algorithm, called LEM (see Algorithm 2), which utilizes the LMIP subroutine to learn feature trees from data. 
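To make the LMIP subroutine concrete, here is a simplified Python sketch of Algorithm 1. It is illustrative only: `toy_cmi` stands in for an empirical estimator of I(X, Y | F) computed from the samples satisfying F, the minimization over subsets is done exhaustively rather than with Queyranne's algorithm, and the data-size guard is omitted.

```python
from itertools import combinations

def lmip(variables, data_F, cmi, delta, q):
    """Simplified LMIP: start from singletons and merge any blocks touched
    by a set A (|A| <= q) whose every split has conditional mutual
    information above delta. `cmi(X, Y, data_F)` is an assumed estimator
    of I(X, Y | F) on the samples satisfying the feature assignment F."""
    partition = [{v} for v in variables]
    for size in range(2, q + 1):
        for A in combinations(variables, size):
            A = set(A)
            # min over nonempty proper subsets X of A (exhaustive here;
            # the paper uses Queyranne's algorithm for this minimization)
            m = min(cmi(set(X), A - set(X), data_F)
                    for r in range(1, len(A))
                    for X in combinations(sorted(A), r))
            if m > delta:
                touched = [Q for Q in partition if Q & A]
                merged = set().union(*touched)
                partition = [Q for Q in partition if not (Q & A)] + [merged]
    return partition

# Toy demonstration with a hand-coded "estimator": a and b are strongly
# dependent, c is independent of both.
def toy_cmi(X, Y, _data):
    return 1.0 if ('a' in X and 'b' in Y) or ('b' in X and 'a' in Y) else 0.0

parts = lmip(['a', 'b', 'c'], None, toy_cmi, delta=0.5, q=2)
print(sorted(sorted(Q) for Q in parts))  # → [['a', 'b'], ['c']]
```

The exhaustive minimization costs O(2^q) per candidate set A, which is why the paper substitutes Queyranne's O(q^3) algorithm.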
The algorithm has probabilistic performance guarantees if we make some assumptions on the type of the distribution. We present these guarantees in the next subsection. Algorithm 2 operates as follows. First, it runs the LMIP subroutine on all possible features of length k constructible from V. Recall that given a feature assignment F, the LMIP subroutine partitions the variables into (approximately) conditionally independent components. It then selects a feature G having the highest score. Intuitively, to reduce the inference time and the size of the model, we should try to balance the trade-off between increasing the number of partitions and maintaining partition size uniformity (namely, we would want the partition sizes to be almost equal). The following score function achieves this objective. Let Q = {Q1, . . . , Qm} be an m-partition of V; then the score of Q is given by: Score(Q) = 1 / Σ_{i=1}^m 2^{|Qi|}, where the denominator bounds worst-case inference complexity.

After selecting a feature G, the algorithm creates an F-node corresponding to G and two child A-nodes corresponding to the true and the false assignments of G. Then, corresponding to each element of QG=1, it recursively creates a child node for G = 1 (and similarly for G = 0 using QG=0). An interesting special case is when either |QG=1| = 1 or |QG=0| = 1, or when both conditions hold. In this case, no partitioning of V exists for one or both of the value assignments of G and therefore we return a feature tree which has 2^{|V|} leaf A-nodes corresponding to all possible instantiations of the remaining variables. In practice, because of the exponential dependence on |V|, we would want this condition to hold only when a few variables remain. To obtain guarantees, however, we need stronger conditions to be satisfied. 
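Returning to the score used for feature selection above: it can be computed directly from a candidate partition. A minimal sketch (illustrative, not the authors' code):

```python
def score(partition):
    """Score(Q) = 1 / sum_i 2^|Q_i|: the denominator bounds the worst-case
    inference cost of treating each block Q_i as a single clique."""
    return 1.0 / sum(2 ** len(Q) for Q in partition)

# Splitting six variables into three pairs scores higher than an
# unbalanced 5 + 1 split, matching the stated preference for many,
# roughly equal-sized blocks.
balanced = [{'x1', 'x2'}, {'x3', 'x4'}, {'x5', 'x6'}]
unbalanced = [{'x1', 'x2', 'x3', 'x4', 'x5'}, {'x6'}]
print(score(balanced) > score(unbalanced))  # → True
```

In LEM, a candidate feature G is scored by Score(Q_{G=0}) + Score(Q_{G=1}), so a feature must induce good partitions under both of its assignments.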
We describe these guarantees next.

3.1 Theoretical Guarantees

To derive performance guarantees and to guarantee polynomial complexity, we make some fundamental assumptions about the data and the distribution P(V) that we are trying to learn. Intuitively, if there exists a feature F such that the distribution P(V) at each recursive call to LEM is (j, ε, F)-coverable, then the LMIP subroutine is guaranteed to return at least a two-way partitioning of V. Assume that P(V) is such that at each recursive call to LEM, there exists a unique F (such that the distribution at the recursive call is (j, ε, F)-coverable). Then, LEM is guaranteed to find this unique feature tree. However, the trouble is that at each step of the recursion, there may exist m > 1 candidate features that satisfy this property. Therefore, we want this coverability requirement to hold not only recursively but also for each candidate feature (at each recursive call). The following two definitions and Theorem 1 capture this intuition.

DEFINITION 4. Given a constant δ > 0, we say that a distribution P(V) satisfies the (j, ε, m, G) assumption if |V| ≤ j or if the following property is satisfied. For every feature F, and each assignment F of F such that |V(F)| ≤ m, P(V) is (j, ε, F)-coverable, and for any partitioning S1, ..., Sz of V with z ≥ 2 such that for each i, I(Si, V \ Si | F ∧ G) ≤ |V|(2ε + δ), the distributions P(S1), ..., P(Sz) each satisfy the (j, ε, m, G ∧ F) assumption.

DEFINITION 5. We say that a sequence of pairs (F^n, S^n), (F^{n−1}, S^{n−1}), . . . , (F^0, S^0 = V) satisfies the nested context independence condition for (θ, w) if ∀i, S^i ⊆ S^{i−1} and the distribution on V conditioned on the satisfaction of G^{i−1} = (F^{i−1} ∧ F^{i−2} ∧ . . . ∧ F^0) is such that I(S^i, S^{i−1} \ S^i | G^{i−1}) ≤ |S^{i−1}|(2θ + w).

THEOREM 1. Given a distribution P(V) that satisfies the (j, ε, m, true) assumption and a perfect mutual information oracle I, LEM(V, S, I, k, j + 1, δ) returns a feature tree ST such that each leaf feature of ST satisfies the nested context independence condition for (ε, j × δ).

3.1.1 Sample Complexity and Probabilistic Performance Guarantees

The foregoing analysis relies on a perfect, deterministic mutual information subroutine I. In reality, all we have is sample data and probabilistic mutual information subroutines. As the following lemma shows, we can get estimates of I(A, B | F) with accuracy ±Δ and probability 1 − γ with a number of samples and running time polynomial in 1/Δ and log(1/γ).

LEMMA 4 (Hoffgen [14]). The entropy of a probability distribution over 2k + 2 discrete variables with domain size R can be estimated with accuracy Δ with probability at least 1 − γ using F(k, R, Δ, γ) = O( (R^{4k+4}/Δ^2) log^2(R^{2k+2}/Δ^2) log(R^{2k+2}/γ) ) samples and the same amount of time.

To ensure that our algorithm doesn't run out of data somewhere in the recursion, we have to strengthen our assumptions, as we define below.

DEFINITION 6. If P(V) satisfies the (j, ε, m, true) assumption and a set of sample data H drawn from the distribution is such that for any G^{i−1} = F^{i−1} ∧ . . . ∧ F^0, neither F^i = 0 nor F^i = 1 holds in less than some constant fraction c of the subset of H that satisfies G^{i−1}, then we say that H satisfies the c-strengthened (j, ε, m, true) assumption.

THEOREM 2 (Probabilistic performance guarantees). 
Let P(V) be a distribution that satisfies the (j, ε, m, true) assumption and let H be the training data which satisfies the c-strengthened (j, ε, m, true) assumption, from which we draw S samples of size T = (1/c)^D × F( (j−1)/2, |V|, Δ, γ / (n^{m+j+2}(j+1)^3) ), where D is the worst-case length of any leaf feature returned by the algorithm. Given a mutual information subroutine Î implied by Lemma 4, LEM(V, S, Î, m, j + 1, ε + Δ) returns a feature tree, the leaves of which satisfy the nested context independence condition for (ε, j × (ε + Δ)), with probability 1 − γ.

4 Greedy Heuristics and Implementation Details

When implemented naively, Algorithm 2 may be computationally infeasible. The most expensive step in LEM is the LMIP subroutine, which is called O(n^k) times at each A-node of the feature graph. Given a max set size of q, LMIP requires running Queyranne's algorithm [23] (complexity O(q^3)) to minimize min_{X⊂A} I(X, V \ X | F) over every |A| ≤ q. Thus, its overall time complexity is O(n^q × q^3). Also, our theoretical analysis assumes access to a mutual information oracle, which is not available in practice, and one has to compute I(X, V \ X | F) from data. In our implementation, we used Moore and Lee's AD-trees [19] to pre-compute and cache the sufficient statistics (counts) in advance, so that at each step I(X, V \ X | F) can be computed efficiently. A second improvement that we considered is due to Chechetka and Guestrin [8]. It is based on the observation that if A is a subset of a connected component Q ∈ QF, then we don't need to compute min_{X⊂A} I(X, V \ X | F), because merging all Qi ∈ QF s.t. Qi ∩ A ≠ ∅ would not change QF. In spite of these improvements, our algorithm is not practical for q > 3 and k > 3. 
Note, however, that low values of q and k are not entirely problematic for our approach, because we may still be able to induce large-treewidth models by taking advantage of context-specific independence, as depicted in Figure 1.

To further improve the performance of our algorithm, we fix q to 3 and use a greedy heuristic to construct the features. The greedy heuristic is able to split on arbitrarily long features by only calling LMIP k × n times instead of O(n^k) times, but does not have any guarantees. It starts with a set of atomic features (i.e., just the variables in the domain), runs LMIP on each, and selects the (best) feature with the highest score. Then, it creates candidate features by conjoining this best feature from the previous step with each atomic feature, runs LMIP on each, and then selects a best feature for the next iteration. It repeats this process until the feature length equals k or the score does not improve. This heuristic is loosely based on the greedy approach of Della Pietra et al. [12]. We also use a balance heuristic to reduce the size of the model learned, which imposes a form of regularization constraint and biases our search towards sparser models, in order to avoid over-fitting. Here, given a set of features with similar scores, we select a feature F such that the difference between the scores of F = 0 and F = 1 is the smallest. The intuition behind this heuristic is that by maintaining balance we reduce the height of the feature graph and thus its size. Finally, in our implementation, we do not return all possible instantiations of the variables when a feature assignment yields only one partition, unless the number of remaining variables is smaller than 5. This is because even though a feature may not partition the set of variables, it may still partition the data, thereby reducing complexity.

5 Experimental Evaluation

We evaluated LEM on one synthetic data set and four real-world ones. 
Figure 2(f) lists the five data sets and the number of atomic features in each. The synthetic domain consists of samples from the Alarm Bayesian network [3]. From the UCI machine learning repository [5], we used the Adult and MSNBC anonymous Web data domains. Temperature and Traffic are sensor network data sets and were used in Chechetka and Guestrin [8].

We compared LEM to the standard Markov network structure learning algorithm of Della Pietra et al. [12] (henceforth called the DL scheme), the L1 approach of Ravikumar et al. [24] and the lazy thin-junction tree algorithm (LPACJT) of Chechetka and Guestrin [8]. We used the following parameters for LEM: q = 3 and δ = 0.05. We found that the results were insensitive to the value of δ used. We suggest using any reasonably small value ≤ 0.1. The LPACJT implementation available from the authors requires entropies (computed from the data) as input. We were unable to compute the entropies in the required format because they use proprietary software that we did not have access to, and therefore we use the results provided by the authors for the temperature, traffic and alarm domains. We were unable to run LPACJT on the other two domains. We altered the DL algorithm to only evaluate candidate features that match at least one example. This simple extension vastly reduces the number of candidate features and greatly improves the algorithm's efficiency. For implementing DL, we used pseudo-likelihood [4] as a scoring function and optimized it via the limited-memory BFGS algorithm [17]. For implementing L1, we used the OWL-QN software package of Andrew and Gao [1]. The neighborhood structures for L1 can be merged in two ways (logical-OR or logical-AND of the structures); we tried both and used the best one for plotting the results. 
For the regularization, we tried penalty values in {1, 2, 5, 10, 20, 25, 50, 100, 200, 500, 1000} and used a tuning set to pick the one that gave the best results. We used a time bound of 24 hours for each algorithm.

For each domain, we evaluated the algorithms on training set sizes varying from 100 to 10000, performing a five-fold train-test split. For the temperature, traffic and alarm domains, we used the test set sizes provided in Chechetka and Guestrin [8]. For the MSNBC and Adult domains, we selected test sets consisting of 58265 and 7327 examples, respectively. We evaluate the performance

[Figure 2 appears here: five log-likelihood plots and a summary table, described in the caption below.]

Figure 2: Panels (a)-(e) show the average log-likelihood as a function of the training set size for LEM,
DL, L1 and LPACJT. Panel (f) reports the run-time in minutes for LEM, DL and L1 on training sets of size 10000.

based on the average log-likelihood of the test data, given the learned model. The log-likelihood of the test data was computed exactly for the models output by LPACJT and LEM, because inference is tractable in these models. The sizes of the feature graphs learned by LEM ranged from O(n^2) to O(n^3), comparable to those generated by LPACJT; exact inference on the learned feature graphs took only milliseconds. For the Markov networks output by DL and L1, we computed the log-likelihood approximately using loopy belief propagation [20].

Figure 2 summarizes the results for the five domains. LEM significantly outperforms L1 on all domains except Alarm. It is better than the greedy DL scheme on three of the five domains, and it is always better than LPACJT. Figure 2(f) shows the timing results for LEM, DL and L1: L1 is substantially faster than DL and LEM, and DL is the slowest scheme.

6 Conclusions

We have presented an algorithm for learning a class of high-treewidth Markov networks that admit tractable inference and closed-form parameter learning. This class is much richer than thin junction trees because it exploits context-specific independence and determinism. We showed that our algorithm has probabilistic performance guarantees under the recursive assumption that the distribution at each node of the (rooted) feature graph (which is defined only over a decreasing subset of variables as we move away from the root) is itself representable by a polynomial-sized feature graph in which the maximum feature size at each node is bounded by k. We believe that these new theoretical insights further the understanding of structure learning in Markov networks, especially those having high treewidth.
In addition to the theoretical guarantees, we showed that our algorithm performs well in practice, usually achieving higher test-set likelihood than competing approaches. Although learning may be slow, inference always has a quick and predictable runtime, linear in the size of the feature graph. Intuitively, our method seems likely to perform well on large, sparsely dependent data sets.

Acknowledgements

This research was partly funded by ARO grant W911NF-08-1-0242, AFRL contract FA8750-09-C-0181, DARPA contracts FA8750-05-2-0283, FA8750-07-D-0185, HR0011-06-C-0025, HR0011-07-C-0060 and NBCH-D030010, NSF grants IIS-0534881 and IIS-0803481, and ONR grant N00014-08-1-0670. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, DARPA, NSF, ONR, or the United States Government.

References

[1] G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), pages 33-40, 2007.

[2] F. R. Bach and M. I. Jordan. Thin junction trees. In Advances in Neural Information Processing Systems, pages 569-576, 2001.

[3] I. Beinlich, J. Suermondt, M. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In European Conference on AI in Medicine, 1988.

[4] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179-195, 1975.

[5] C. Blake and C. J. Merz. UCI repository of machine learning databases. Machine-readable data repository, Department of Information and Computer Science, University of California at Irvine, Irvine, CA, 2000. http://www.ics.uci.edu/~mlearn/MLRepository.html.

[6] C. Boutilier. Context-specific independence in Bayesian networks.
In Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 115-123, 1996.

[7] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772-799, April 2008.

[8] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In Advances in Neural Information Processing Systems (NIPS), December 2007.

[9] D. M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: Search methods and experimental results. In Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics (AISTATS), pages 112-128, 1995.

[10] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280-305, 2003.

[11] R. Dechter and R. Mateescu. AND/OR search spaces for graphical models. Artificial Intelligence, 171(2-3):73-106, 2007.

[12] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:380-392, 1997.

[13] S. Goldman. Efficient methods for calculating maximum entropy distributions. Master's thesis, Massachusetts Institute of Technology, 1987.

[14] K. Höffgen. Learning and robust learning of product distributions. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory (COLT), pages 77-83, 1993.

[15] D. R. Karger and N. Srebro. Learning Markov networks: maximum bounded tree-width graphs. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 392-401, 2001.

[16] S. Lee, V. Ganapathi, and D. Koller.
Efficient structure learning of Markov networks using L1-regularization. In Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems (NIPS), pages 817-824, 2006.

[17] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3):503-528, 1989.

[18] D. Lowd and P. Domingos. Learning arithmetic circuits. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), pages 383-392, 2008.

[19] A. W. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8:67-91, 1997.

[20] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 467-475, 1999.

[21] M. Narasimhan and J. Bilmes. PAC-learning bounded tree-width graphical models. In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI), pages 410-417, 2004.

[22] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.

[23] M. Queyranne. Minimizing symmetric submodular functions. Mathematical Programming, 82(1):3-12, 1998.

[24] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using L1-regularized logistic regression. Annals of Statistics, 38(3):1287-1319, 2010.

[25] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82:273-302, 1996.

[26] T. Sang, P. Beame, and H. Kautz. Performing Bayesian inference by weighted model counting.
In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI), pages 475-482, 2005.