{"title": "Decomposing Parameter Estimation Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1565, "page_last": 1573, "abstract": "We propose a technique for decomposing the parameter learning problem in Bayesian networks into independent learning problems. Our technique applies to incomplete datasets and exploits variables that are either hidden or observed in the given dataset. We show empirically that the proposed technique can lead to orders-of-magnitude savings in learning time. We explain, analytically and empirically, the reasons behind our reported savings, and compare the proposed technique to related ones that are sometimes used by inference algorithms.", "full_text": "Decomposing Parameter Estimation Problems\n\nKhaled S. Refaat, Arthur Choi, Adnan Darwiche\n\nComputer Science Department\n\nUniversity of California, Los Angeles\n\n{krefaat,aychoi,darwiche}@cs.ucla.edu\n\nAbstract\n\nWe propose a technique for decomposing the parameter learning problem in\nBayesian networks into independent learning problems. Our technique applies\nto incomplete datasets and exploits variables that are either hidden or observed\nin the given dataset. We show empirically that the proposed technique can lead\nto orders-of-magnitude savings in learning time. We explain, analytically and\nempirically, the reasons behind our reported savings, and compare the proposed\ntechnique to related ones that are sometimes used by inference algorithms.\n\n1\n\nIntroduction\n\nLearning Bayesian network parameters is the problem of estimating the parameters of a known\nstructure given a dataset. This learning task is usually formulated as an optimization problem that\nseeks maximum likelihood parameters: ones that maximize the probability of a dataset.\nA key distinction is commonly drawn between complete and incomplete datasets. In a complete\ndataset, the value of each variable is known in every example. 
In this case, maximum likelihood\nparameters are unique and can be easily estimated using a single pass over the dataset. However,\nwhen the data is incomplete, the optimization problem is generally non-convex, has multiple local\noptima, and is commonly solved by iterative methods, such as EM [5, 7], gradient descent [13] and,\nmore recently, EDML [2, 11, 12].\nIncomplete datasets may still exhibit a certain structure. In particular, certain variables may always\nbe observed in the dataset, while others may always be unobserved (hidden). We exploit this structure by decomposing the parameter learning problem into smaller learning problems that can be\nsolved independently. In particular, we show that the stationary points of the likelihood function can\nbe characterized by those of the smaller problems. This implies that algorithms such as EM and\ngradient descent can be applied to the smaller problems while preserving their guarantees. Empirically, we show that the proposed decomposition technique can lead to orders-of-magnitude savings.\nMoreover, we show that the savings are amplified when the dataset grows in size. Finally, we explain these significant savings analytically by examining the impact of our decomposition technique\non the dynamics of the convergence test used, and on the properties of the datasets associated with\nthe smaller learning problems.\nThe paper is organized as follows. In Section 2, we provide some background on learning Bayesian\nnetwork parameters. In Section 3, we present the decomposition technique and then prove its soundness in Section 4. 
Section 5 is dedicated to empirical results and to analyzing the reported savings.\nWe discuss related work in Section 6 and finally close with some concluding remarks in Section 7.\nThe proofs are given in the appendix of the supplementary material.\n\n2 Learning Bayesian Network Parameters\n\nWe use upper case letters (X) to denote variables and lower case letters (x) to denote their values.\nVariable sets are denoted by bold-face upper case letters (X) and their instantiations by bold-face\nlower case letters (x). Generally, we will use X to denote a variable in a Bayesian network and U\nto denote its parents.\nA Bayesian network is a directed acyclic graph with a conditional probability table (CPT) associated\nwith each node X and its parents U. For every variable instantiation x and parent instantiation u,\nthe CPT of X includes a parameter \u03b8x|u that represents the probability Pr (X = x|U = u). We will\nuse \u03b8 to denote the set of all network parameters. Parameter learning in Bayesian networks is the\nprocess of estimating these parameters \u03b8 from a given dataset.\nA dataset is a multi-set of examples. Each example is an instantiation of some network variables.\nWe will use D to denote a dataset and d1, . . . , dN to denote its N examples. The following is a\ndataset over four binary variables (\u201c?\u201d indicates a missing value of a variable in an example):\n\nexample  E  B  A  C\nd1       e  b  a  ?\nd2       ?  b  a  ?\nd3       e  b  a  ?\n\nA variable X is observed in a dataset iff the value of X is known in each example of the dataset (i.e.,\n\u201c?\u201d cannot appear in the column corresponding to variable X). Variables A and B are observed in\nthe above dataset. Moreover, a variable X is hidden in a dataset iff its value is unknown in every\nexample of the dataset (i.e., only \u201c?\u201d appears in the column of variable X). Variable C is hidden in\nthe above dataset. 
When all variables are observed in a dataset, the dataset is said to be complete.\nOtherwise, the dataset is incomplete. The above dataset is incomplete.\nGiven a dataset D with examples d1, . . . , dN , the likelihood of parameter estimates \u03b8 is defined as:\n\nL(\u03b8|D) = \u220f_{i=1}^{N} Pr_\u03b8(di).\n\nHere, Pr_\u03b8 is the distribution induced by the network structure and parameters \u03b8. One typically seeks\nmaximum likelihood parameters\n\n\u03b8* = argmax_\u03b8 L(\u03b8|D).\n\nWhen the dataset is complete, maximum likelihood estimates are unique and easily obtainable using\na single pass over the dataset (e.g., [3, 6]). For incomplete datasets, the problem is generally non-convex and has multiple local optima. Iterative algorithms are usually used in this case to try to\nobtain maximum likelihood estimates. This includes EM [5, 7], gradient descent [13], and the more\nrecent EDML algorithm [2, 11, 12]. The fixed points of these algorithms correspond to the stationary\npoints of the likelihood function. Hence, these algorithms are not guaranteed to converge to global\noptima. As such, they are typically applied to multiple seeds (initial parameter estimates), while\nretaining the best estimates obtained across all seeds.\n\n3 Decomposing the Learning Problem\n\nWe now show how the problem of learning Bayesian network parameters can be decomposed into\nindependent learning problems. The proposed technique exploits two aspects of a dataset: hidden\nand observed variables.\nProposition 1 The likelihood function L(\u03b8|D) does not depend on the parameters of variable X if\nX is hidden in dataset D and is a leaf of the network structure.\n\nIf a hidden variable appears as a leaf in the network structure, it can be removed from the structure\nwhile setting its parameters arbitrarily (assuming no prior). This process can be repeated until there\nare no leaf variables that are also hidden. 
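As an illustration, this repeated pruning of hidden leaves can be sketched in code; the adjacency representation and the function name are illustrative assumptions, not code from the paper:

```python
# Illustrative sketch (assumed representation): `parents` maps each
# variable to its set of parents; `hidden` is the set of variables that
# are hidden in the dataset. Repeatedly drop any leaf (a node with no
# children) that is hidden, since by Proposition 1 its parameters do not
# affect the likelihood.
def prune_hidden_leaves(parents, hidden):
    parents = {x: set(ps) for x, ps in parents.items()}  # work on a copy
    while True:
        has_child = {p for ps in parents.values() for p in ps}
        leaves = [x for x in parents if x not in has_child and x in hidden]
        if not leaves:
            return parents  # no hidden leaf remains
        for x in leaves:
            del parents[x]  # remove the leaf (its CPT can be set arbitrarily)
```

For a chain A \u2192 B \u2192 C with B and C hidden, pruning C makes B a leaf, which is then pruned as well, leaving only A.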
The soundness of this technique follows from [14, 15].\n\nOur second decomposition technique will exploit the observed variables of a dataset. In a nutshell, we will (a) decompose the Bayesian network into a number of sub-networks, (b) learn the parameters of each sub-network independently, and\nthen (c) assemble parameter estimates for the original network from the estimates obtained in each sub-network.\n\nDefinition 1 (Component) Let G be a network, O be some observed variables in G and let G|O be the network which results from deleting all edges from G which are outgoing from O. A component of G|O is a maximal set of nodes that are connected in G|O.\n\nFigure 1: Identifying components of network G given O = {V, X, Z}.\n\nConsider the network G in Figure 1, with observed variables O = {V, X, Z}. Then G|O has three components in this case: S1 = {V }, S2 = {X}, and\nS3 = {Y, Z}.\nThe components of a network partition its parameters into groups, one group per component. In the\nabove example, the network parameters are partitioned into the following groups:\n\nS1 : {\u03b8v, \u03b8v\u0304}\nS2 : {\u03b8x|v, \u03b8x|v\u0304, \u03b8x\u0304|v, \u03b8x\u0304|v\u0304}\nS3 : {\u03b8y|x, \u03b8y|x\u0304, \u03b8y\u0304|x, \u03b8y\u0304|x\u0304, \u03b8z|y, \u03b8z|y\u0304, \u03b8z\u0304|y, \u03b8z\u0304|y\u0304}.\n\nWe will later show that the learning problem can be decomposed into independent learning problems,\neach induced by one component. To define these independent problems, we need some definitions.\nDefinition 2 (Boundary Node) Let S be a component of G|O. If edge B \u2192 S appears in G, B \u2209 S\nand S \u2208 S, then B is called a boundary for component S.\nConsidering Figure 1, node X is the only boundary for component S3 = {Y, Z}. Moreover, node\nV is the only boundary for component S2 = {X}. 
Component S1 = {V } has no boundary nodes.\nThe independent learning problems are based on the following sub-networks.\nDefinition 3 (Sub-Network) Let S be a component of G|O with boundary variables B. The\nsub-network of component S is the subset of network G induced by variables S \u222a B.\n\nFigure 2 depicts the three sub-networks which correspond to our running example.\nThe parameters of a sub-network will be learned using projected datasets.\nDefinition 4 Let D = d1, . . . , dN be a dataset over variables X and let Y be a subset of variables X. The projection\nof dataset D on variables Y is the set of examples e1, . . . , eN ,\nwhere each ei is the subset of example di which pertains to\nvariables Y.\n\nWe show below a dataset for the full Bayesian network in\nFigure 1, followed by three projected datasets, one for each\nof the sub-networks in Figure 2.\n\nFigure 2: The sub-networks induced by adding boundary variables to components.\n\nexample  V   X   Y  Z\nd1       v   x   ?  z\nd2       v\u0304   x   ?  z\nd3       v\u0304   x\u0304   ?  z\u0304\n\nexample  V   count\ne1       v   1\ne2       v\u0304   2\n\nexample  V   X   count\ne1       v   x   1\ne2       v\u0304   x   1\ne3       v\u0304   x\u0304   1\n\nexample  X   Y  Z   count\ne1       x   ?  z   2\ne2       x\u0304   ?  z\u0304   1\n\nThe projected datasets are \u201ccompressed\u201d as we only represent unique examples, together with a\ncount of how many times each example appears in a dataset. Using compressed datasets is crucial\nto realizing the full potential of decomposition, as it ensures that the size of a projected dataset is at\nmost exponential in the number of variables appearing in its sub-network (more on this later).\n\nWe are now ready to describe our decomposition technique. Given a Bayesian network structure G\nand a dataset D that observes variables O, we can get the stationary points of the likelihood function\nfor network G as follows:\n\n1. Identify the components S1, . . . , SM of G|O (Definition 1).\n2. Construct a sub-network for each component Si and its boundary variables Bi (Definition 3).\n3. Project the dataset D on the variables of each sub-network (Definition 4).\n4. Identify a stationary point for each sub-network and its projected dataset (using, e.g., EM, EDML or gradient descent).\n5. Recover the learned parameters of non-boundary variables from each sub-network.\n\nWe will next prove that (a) these parameters are a stationary point of the likelihood function for\nnetwork G, and (b) every stationary point of the likelihood function can be generated this way\n(using an appropriate seed).\n\n4 Soundness\n\nThe soundness of our decomposition technique is based on three steps. 
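As an aside before the proof, Steps 1\u20133 of the procedure above can be sketched in code. This is a minimal sketch under assumed representations (an edge list, and examples as dictionaries), with hypothetical helper names; it also assumes the edges of the running example form the chain V \u2192 X \u2192 Y \u2192 Z:

```python
from collections import Counter

def components(nodes, edges, observed):
    # Step 1: delete edges outgoing from observed nodes, then take the
    # connected components of what remains (viewed as an undirected graph).
    kept = [(u, v) for (u, v) in edges if u not in observed]
    adj = {n: set() for n in nodes}
    for u, v in kept:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            m = stack.pop()
            if m in comp:
                continue
            comp.add(m)
            stack.extend(adj[m])
        seen |= comp
        comps.append(comp)
    return comps

def subnetwork(comp, edges):
    # Step 2: add the boundary nodes (parents outside the component).
    boundary = {u for (u, v) in edges if v in comp and u not in comp}
    return comp | boundary

def project(data, variables):
    # Step 3: project the dataset on `variables` and compress it,
    # keeping unique (projected) examples with their counts.
    key = sorted(variables)
    return Counter(tuple(d.get(x, "?") for x in key) for d in data)
```

On the running example with O = {V, X, Z}, `components` yields {V}, {X}, and {Y, Z}, and `subnetwork` adds the boundary X to {Y, Z}, matching the sub-networks of Figure 2.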
We first introduce the notion\nof a parameter term, on which our proof rests. We then show how the likelihood function for\nthe Bayesian network can be decomposed into component likelihood functions, one for each sub-network. We finally show that the stationary points of the likelihood function (network) can be\ncharacterized by the stationary points of component likelihood functions (sub-networks).\nTwo parameters are compatible iff they agree on the state of their common variables. For example,\nparameters \u03b8z|y and \u03b8y|x are compatible, but parameters \u03b8z|y and \u03b8y\u0304|x are not compatible, as y \u2260 y\u0304.\nMoreover, a parameter is compatible with an example iff they agree on the state of their common\nvariables. Parameter \u03b8y|x is compatible with example x, y, z, but not with example x, y\u0304, z.\n\nDefinition 5 (Parameter Term) Let S be network variables and let d be an example. A\nparameter term for S and d, denoted \u0398^d_S, is a product of compatible network parameters, one for\neach variable in S, that are also compatible with example d.\n\nConsider the network X \u2192 Y \u2192 Z. If S = {Y, Z} and d = x, z, then \u0398^d_S will denote either \u03b8y|x\u03b8z|y or \u03b8y\u0304|x\u03b8z|y\u0304. Moreover, if S = {X, Y, Z}, then \u0398^d_S will denote either \u03b8x\u03b8y|x\u03b8z|y or\n\u03b8x\u03b8y\u0304|x\u03b8z|y\u0304. In this case, Pr (d) = \u2211_{\u0398^d_S} \u0398^d_S. This holds more generally, whenever S is the set of all\nnetwork variables.\nWe will now use parameter terms to show how the likelihood function can be decomposed into\ncomponent likelihood functions.\nTheorem 1 Let S be a component of G|O and let R be the remaining variables of network G. 
If variables O are observed in example d, we have\n\nPr_\u03b8(d) = [ \u2211_{\u0398^d_S} \u0398^d_S ] [ \u2211_{\u0398^d_R} \u0398^d_R ].\n\nIf \u03b8 denotes all network parameters, and S is a set of network variables, then \u03b8 : S will denote the\nsubset of network parameters which pertain to the variables in S. Each component S of a Bayesian\nnetwork induces its own likelihood function over parameters \u03b8 : S.\nDefinition 6 (Component Likelihood) Let S be a component of G|O. For dataset D =\nd1, . . . , dN , the component likelihood for S is defined as\n\nL(\u03b8 : S|D) = \u220f_{i=1}^{N} \u2211_{\u0398^{di}_S} \u0398^{di}_S.\n\nIn our running example, the components are S1 = {V }, S2 = {X} and S3 = {Y, Z}. Moreover,\nthe observed variables are O = {V, X, Z}. Hence, the component likelihoods are\n\nL(\u03b8 : S1|D) = [\u03b8v] [\u03b8v\u0304] [\u03b8v\u0304]\nL(\u03b8 : S2|D) = [\u03b8x|v] [\u03b8x|v\u0304] [\u03b8x\u0304|v\u0304]\nL(\u03b8 : S3|D) = [\u03b8y|x\u03b8z|y + \u03b8y\u0304|x\u03b8z|y\u0304] [\u03b8y|x\u03b8z|y + \u03b8y\u0304|x\u03b8z|y\u0304] [\u03b8y|x\u0304\u03b8z\u0304|y + \u03b8y\u0304|x\u0304\u03b8z\u0304|y\u0304]\n\nThe parameters of component likelihoods partition the network parameters. That is, the parameters\nof two component likelihoods are always non-overlapping. Moreover, the parameters of component\nlikelihoods account for all network parameters.1\nWe can now state our main decomposition result, which is a direct corollary of Theorem 1.\nCorollary 1 Let S1, . . . , SM be the components of G|O. If variables O are observed in dataset D,\n\nL(\u03b8|D) = \u220f_{i=1}^{M} L(\u03b8 : Si|D).\n\nHence, the network likelihood decomposes into a product of component likelihoods. This leads to\nanother important corollary (see Lemma 1 in the Appendix):\nCorollary 2 Let S1, . . . , SM be the components of G|O. 
If variables O are observed in dataset D,\nthen \u03b8* is a stationary point of the likelihood L(\u03b8|D) iff, for each i, \u03b8* : Si is a stationary point for\nthe component likelihood L(\u03b8 : Si|D).\nThe search for stationary points of the network likelihood is now decomposed into independent\nsearches for stationary points of component likelihoods.\nWe will now show that the stationary points of a component likelihood can be identified using any\nalgorithm that identifies such points for the network likelihood.\n\nTheorem 2 Consider a sub-network G which is induced by component S and boundary variables\nB. Let \u03b8 be the parameters of sub-network G, and let D be a dataset for G that observes boundary\nvariables B. Then \u03b8* is a stationary point for the sub-network likelihood, L(\u03b8|D), only if \u03b8* : S\nis a stationary point for the component likelihood L(\u03b8 : S|D). Moreover, every stationary point for\nL(\u03b8 : S|D) is part of some stationary point for L(\u03b8|D).\n\nGiven an algorithm that identifies stationary points of the likelihood function of Bayesian networks\n(e.g., EM), we can now identify all stationary points of a component likelihood. That is, we just apply this algorithm to the sub-network of each component S, and then extract the parameter estimates\nof variables in S while ignoring the parameters of boundary variables. This proves the soundness of\nour proposed decomposition technique.\n\n5 The Computational Benefit of Decomposition\n\nWe will now illustrate the computational benefits of the proposed decomposition technique, showing\norders-of-magnitude reductions in learning time. Our experiments are structured as follows. Given\na Bayesian network G, we generate a dataset D while ensuring that a certain percentage of variables\nare observed, with all others hidden. Using dataset D, we estimate the parameters of network G\nusing two methods. 
The first uses the classical EM on network G and dataset D. The second\ndecomposes network G into its sub-networks G1, . . . , GM , projects the dataset D on each sub-network, and then applies EM to each sub-network and its projected dataset. This method is called\nD-EM (for Decomposed EM). We use the same seed for both EM and D-EM.\nBefore we present our results, we have the following observations on our data generation model.\nFirst, we made all unobserved variables hidden (as opposed to missing at random) as this leads to\na more difficult learning problem, especially for EM (even with the pruning of hidden leaf nodes).\n\n1The sum-to-one constraints that underlie each component likelihood also partition the sum-to-one constraints of the likelihood function.\n\nFigure 3: Speed-up of D-EM over EM on chain networks: three chains (180, 380, and 500 variables) (left),\nand tree networks (63, 127, 255, and 511 variables) (right), with three random datasets per network/observed\npercentage, and 2^10 examples per dataset.\n\nObserved %  alarm    andes    diagnose\n95.0%       267.67x  155.54x  43.03x\n90.0%       173.47x  52.63x   17.16x\n80.0%       115.4x   14.27x   11.86x\n70.0%       87.67x   2.96x    3.25x\n60.0%       92.65x   0.77x    3.48x\n50.0%       12.09x   1.01x    3.73x\n\nObserved %  win95pts  pigs     water\n95.0%       591.38x   235.63x  811.48x\n90.0%       112.57x   37.61x   110.27x\n80.0%       22.41x    34.19x   7.23x\n70.0%       17.92x    16.23x   1.5x\n60.0%       4.8x      4.1x     2.03x\n50.0%       7.99x     3.16x    4.4x\n\nTable 1: Speed-up of D-EM over EM on UAI networks. 
Three random datasets per network/observed percentage, with 2^10 examples per dataset.\n\nSecond, it is not uncommon to have a significant number of variables that are always observed in\nreal-world datasets. For example, in the UCI repository: the internet advertisements dataset has\n1558 variables, only 3 of which have missing values; the automobile dataset has 26 variables, where\n7 have missing values; the dermatology dataset has 34 variables, where only age can be missing;\nand the mushroom dataset has 22 variables, where only one variable has missing values [1].\nWe performed our experiments on three sets of networks: synthesized chains, synthesized complete\nbinary trees, and some benchmarks from the UAI 2008 evaluation together with other standard\nbenchmarks (called UAI networks): alarm, win95pts, andes, diagnose, water, and pigs. Figure 3 and Table 1\ndepict the obtained time savings. As can be seen from these results, decomposing chains and trees\nleads to two orders-of-magnitude speed-ups for almost all observed percentages. For UAI networks,\nwhen observing 70% of the variables or more, one obtains one-to-two orders-of-magnitude speed-ups. We note here that the time used for D-EM includes the time needed for decomposition (i.e.,\nidentifying the sub-networks and their projected datasets). Similar results for EDML are shown in\nthe supplementary material.\nThe reported computational savings appear quite surprising. We now shed some light on the reasons\nbehind these savings. We also argue that some of the most prominent tools for Bayesian networks\ndo not appear to employ the proposed decomposition technique when learning network parameters.\nOur first analytic explanation for the obtained savings is based on understanding the role of\ndata projection, which can be illustrated by the following example. Consider a chain network over\nbinary variables X1, . . . , Xn, where n is even. 
Consider also a dataset D in which variable Xi is\nobserved for all odd i. There are n/2 sub-networks in this case. The first sub-network is X1. The\nremaining sub-networks are of the form Xi\u22121 \u2192 Xi \u2192 Xi+1 for i = 2, 4, . . . , n \u2212 2 (node Xn\nwill be pruned). The dataset D can have up to 2^{n/2} distinct examples. If one learns parameters\nwithout decomposition, one would need to call the inference engine once for each distinct example,\nin each iteration of the learning algorithm. With m iterations, the inference engine may be called\nup to m \u00b7 2^{n/2} times. When learning with decomposition, however, each projected dataset will have\nat most 2 distinct examples for sub-network X1, and at most 4 distinct examples for sub-network\nXi\u22121 \u2192 Xi \u2192 Xi+1 (variable Xi is hidden, while variables Xi\u22121 and Xi+1 are observed). Hence,\nif sub-network i takes mi iterations to converge, then the inference engine would need to be called\nat most 2m1 + 4(m2 + m4 + . . . + mn\u22122) times. We will later show that mi is generally significantly\nsmaller than m. Hence, with decomposed learning, the number of calls to the inference engine can\nbe significantly smaller, which can contribute significantly to the obtained savings.2\n\nFigure 4: Left: Speed-up of D-EM over EM as a function of dataset size. This is for a chain network with 180\nvariables, while observing 50% of the variables. Right pair: Graphs showing the number of iterations required\nby each sub-network, sorted in descending order. The problem is learning network pigs while observing 90% of\nthe variables, with convergence based on parameters (left), and on likelihood (right).\n\nOur analysis suggests that the savings obtained\nfrom decomposing the learning problem would\namplify as the dataset gets larger. 
This can be\nseen clearly in Figure 4 (left), which shows that the speed-up of D-EM over EM grows linearly\nwith the dataset size. Hence, decomposition can be critical when learning with very large datasets.\nInterestingly, two of the most prominent (non-commercial) tools for Bayesian networks do not\nexhibit this behavior on the chain network discussed above. This is shown in Figure 5, which\ncompares D-EM to the EM implementations of the GENIE/SMILE and SAMIAM systems,3 both\nof which were represented in previous inference evaluations [4]. In particular, we ran these systems on a chain network X0 \u2192 \u00b7\u00b7\u00b7 \u2192 X100, where each variable has 10 states, and using datasets\nwith alternating observed and hidden variables. Each plot point represents an average over 20 simulated datasets, where we recorded the time to execute each EM algorithm (excluding the time to\nread networks and datasets from file, which was negligible compared to learning time).\nClearly, D-EM scales better in terms of time than both SMILE and SAMIAM as the size of the\ndataset increases. As explained in the above analysis, the number of calls to the inference engine by\nD-EM is not necessarily linear in the dataset size. Note here that D-EM used a stricter convergence\nthreshold and obtained better likelihoods than both SMILE and SAMIAM in all cases. Yet, D-EM\nwas able to achieve one-to-two orders-of-magnitude speed-ups as the dataset grows in size. On the\nother hand, SAMIAM was more efficient than SMILE, but got worse likelihoods in all cases, using\ntheir default settings (the same seed was used for all algorithms).\nOur second analytic explanation for the obtained savings is based on understanding the dynamics of the convergence test used by iterative algorithms such as EM. Such algorithms employ\na convergence test based on either parameter or likelihood change. 
According to the first test, one\ncompares the parameter estimates obtained at iteration i of the algorithm to those obtained at iteration i \u2212 1. If the estimates are close enough, the algorithm converges. The likelihood test is similar,\nexcept that the likelihood of estimates is compared across iterations. In our experiments, we used\na convergence test based on parameter change. In particular, when the absolute change in every\nparameter falls below the set threshold of 10^\u22124, convergence is declared by EM.\nWhen learning with decomposition, each sub-network is allowed to converge independently, which\ncan contribute significantly to the obtained savings. In particular, with enough observed variables,\nwe have found that the vast majority of sub-networks converge very quickly, sometimes in one\niteration (when the projected dataset is complete). In fact, due to this phenomenon, the convergence\nthreshold for sub-networks can be further tightened without adversely affecting the total running\ntime.\n\nFigure 5: Effect of dataset size (log-scale) on learning time in seconds.\n\n2The analysis in this section was restricted to chains to make the discussion concrete. This analysis, however, can be generalized to arbitrary networks if enough variables are observed in the corresponding dataset.\n\n3Available at http://genie.sis.pitt.edu/ and http://reasoning.cs.ucla.edu/samiam/.\nSMILE\u2019s C++ API was used to run EM, using default options, except we suppressed the randomized parameters option. SAMIAM\u2019s Java API was used to run EM (via the CodeBandit feature), also using default options,\nand the Hugin algorithm as the underlying inference engine.\n\n
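The parameter-change convergence test just described can be sketched as follows; this is a minimal illustrative sketch, and the dictionary representation of parameter estimates is an assumption:

```python
# Declare convergence when every parameter's absolute change between
# two consecutive iterations falls below the threshold.
def converged(old_params, new_params, threshold=1e-4):
    return all(abs(new_params[k] - old_params[k]) < threshold
               for k in old_params)
```

The likelihood-based test is analogous, comparing (log-)likelihood values across consecutive iterations instead of individual parameters.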
In our experiments, we used a threshold of 10^\u22125 for D-EM, which is tighter than the threshold\nused for EM. Figure 4 (right pair) illustrates decomposed convergence, by showing the number\nof iterations required by each sub-network to converge, sorted in descending order, with the convergence test\nbased on parameters (left) and likelihood (right). The vast majority of sub-networks converged\nvery quickly. Here, convergence was declared when the change in parameters or log-likelihood,\nrespectively, fell below the set threshold of 10^\u22125.\n\n6 Related Work\n\nThe decomposition techniques we discussed in this paper have long been utilized in the context of\ninference, but apparently not in learning. In particular, leaf nodes that do not appear in evidence\ne have been called barren nodes in [14], which showed the soundness of their removal during\ninference with evidence e. Similarly, deleting edges outgoing from evidence nodes has been called\nevidence absorption and its soundness was shown in [15]. Interestingly enough, both of these techniques are employed by the inference engines of SAMIAM and SMILE,4 even though neither seems\nto employ them when learning network parameters as we propose here (see earlier experiments).\nWhen employed during inference, these techniques simplify the network to reduce the time needed\nto compute queries (e.g., the conditional marginals which are needed by learning algorithms). However,\nwhen employed in the context of learning, these techniques reduce the number of calls that need\nto be made to an inference engine. The difference is therefore fundamental, and the effects of the\ntechniques are orthogonal. In fact, the inference engine we used in our experiments does employ\ndecomposition techniques. Yet, we were still able to obtain orders-of-magnitude speed-ups when\ndecomposing the learning problem. 
On the other hand, our proposed decomposition techniques do\nnot apply fully to Markov random fields (MRFs), as the partition function cannot be decomposed\neven when the data is complete (evaluating the partition function is independent of the data). However, distributed learning algorithms have been proposed in the literature. For example, the recently\nproposed LAP algorithm is a consistent estimator for MRFs under complete data [10]. A similar\nmethod to LAP was independently introduced by [9] in the context of Gaussian graphical models.\n\n7 Conclusion\n\nWe proposed a technique for decomposing the problem of learning Bayesian network parameters\ninto independent learning problems. The technique applies to incomplete datasets and is based on\nexploiting variables that are either hidden or observed. Our empirical results suggest that orders-of-magnitude speed-ups can be obtained from this decomposition technique when enough variables are\nhidden or observed in the dataset. The proposed decomposition technique is orthogonal to the one\nused for optimizing inference, as one reduces the time of inference queries, while the other reduces\nthe number of such queries. The latter effect is due to decomposing the dataset and the convergence\ntest. The decomposition process incurs little overhead, as it can be performed in time that is linear\nin the structure size and dataset size. Hence, given the potential savings it may lead to, it appears\nthat one should always try to decompose before learning network parameters.\n\nAcknowledgments\n\nThis work has been partially supported by ONR grant #N00014-12-1-0423 and NSF grant #IIS-1118122.\n\n4SMILE actually employs a more advanced technique known as relevance reasoning [8].\n\nReferences\n\n[1] K. Bache and M. Lichman. UCI machine learning repository. Technical report, Irvine, CA:\nUniversity of California, School of Information and Computer Science, 2013.\n\n[2] Arthur Choi, Khaled S. 
Refaat, and Adnan Darwiche. EDML: A method for learning parameters in Bayesian networks.\nIn Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2011.\n\n[3] Adnan Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University\nPress, 2009.\n\n[4] Adnan Darwiche, Rina Dechter, Arthur Choi, Vibhav Gogate, and Lars Otten. Results\nfrom the probabilistic inference evaluation of uncertainty in artificial intelligence UAI-08.\nhttp://graphmod.ics.uci.edu/uai08/Evaluation/Report, 2008.\n\n[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via\nthe EM algorithm. Journal of the Royal Statistical Society B, 39:1\u201338, 1977.\n\n[6] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques.\nMIT Press, 2009.\n\n[7] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191\u2013201, 1995.\n\n[8] Yan Lin and Marek Druzdzel. Computational advantages of relevance reasoning in Bayesian\nbelief networks. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial\nIntelligence, 1997.\n\n[9] Z. Meng, D. Wei, A. Wiesel, and A. O. Hero III. Distributed learning of Gaussian graphical\nmodels via marginal likelihoods. In Proceedings of the International Conference on Artificial\nIntelligence and Statistics, 2013.\n\n[10] Yariv Dror Mizrahi, Misha Denil, and Nando de Freitas. Linear and parallel learning of Markov\nrandom fields. In International Conference on Machine Learning (ICML), 2014.\n\n[11] Khaled S. Refaat, Arthur Choi, and Adnan Darwiche. New advances and theoretical insights\ninto EDML. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages\n705\u2013714, 2012.\n\n[12] Khaled S. Refaat, Arthur Choi, and Adnan Darwiche. EDML for learning parameters in directed and undirected graphical models. 
In Neural Information Processing Systems, 2013.\n\n[13] S. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic networks with\nhidden variables. In Proceedings of the Fourteenth International Joint Conference on Artificial\nIntelligence, 1995.\n\n[14] R. Shachter. Evaluating influence diagrams. Operations Research, 1986.\n\n[15] R. Shachter. Evidence absorption and propagation through evidence reversals. In Proceedings\nof the Fifth Conference on Uncertainty in Artificial Intelligence, 1989.\n", "award": [], "sourceid": 822, "authors": [{"given_name": "Khaled", "family_name": "Refaat", "institution": "UCLA"}, {"given_name": "Arthur", "family_name": "Choi", "institution": "UCLA"}, {"given_name": "Adnan", "family_name": "Darwiche", "institution": "UCLA"}]}