{"title": "MAS: a multiplicative approximation scheme for probabilistic inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1761, "page_last": 1768, "abstract": "We propose a multiplicative approximation scheme (MAS) for inference problems in graphical models, which can be applied to various inference algorithms. The method uses $\\epsilon$-decompositions which decompose functions used throughout the inference procedure into functions over smaller sets of variables with a known error $\\epsilon$. MAS translates these local approximations into bounds on the accuracy of the results. We show how to optimize $\\epsilon$-decompositions and provide a fast closed-form solution for an $L_2$ approximation. Applying MAS to the Variable Elimination inference algorithm, we introduce an algorithm we call DynaDecomp which is extremely fast in practice and provides guaranteed error bounds on the result. The superior accuracy and efficiency of DynaDecomp is demonstrated.", "full_text": "MAS: a multiplicative approximation scheme for probabilistic inference\n\nYdo Wexler\nMicrosoft Research\nRedmond, WA 98052\nydow@microsoft.com\n\nChristopher Meek\nMicrosoft Research\nRedmond, WA 98052\nmeek@microsoft.com\n\nAbstract\n\nWe propose a multiplicative approximation scheme (MAS) for inference problems in graphical models, which can be applied to various inference algorithms. The method uses \u03b5-decompositions which decompose functions used throughout the inference procedure into functions over smaller sets of variables with a known error \u03b5. MAS translates these local approximations into bounds on the accuracy of the results. We show how to optimize \u03b5-decompositions and provide a fast closed-form solution for an L2 approximation. Applying MAS to the Variable Elimination inference algorithm, we introduce an algorithm we call DynaDecomp which is extremely fast in practice and provides guaranteed error bounds on the result. 
The superior accuracy and efficiency of DynaDecomp is demonstrated.\n\n1 Introduction\n\nProbabilistic graphical models have gained popularity in recent decades due to their intuitive representation and because they enable the user to query the value distribution of variables of interest [19]. Although very appealing, these models suffer from the problem that performing inference in the model (e.g. computing marginal probabilities or its likelihood) is NP-hard [6].\nAs a result, a variety of approximate inference methods have been developed. Among these methods are loopy message propagation algorithms [24], variational methods [16, 12], mini buckets [10], edge deletion [8], and a variety of Monte Carlo sampling techniques [13, 19, 21, 4, 25]. Approximation algorithms that have useful error bounds and speedup while maintaining high accuracy include the work of Dechter and colleagues [2, 3, 10, 17], which provides both upper and lower bounds on probabilities, upper bounds suggested by Wainwright et al. [23], and variational lower bounds [16].\nIn this paper we present an approximation scheme called the Multiplicative Approximation Scheme (MAS), that provides error bounds for the computation of likelihood of evidence, marginal probabilities, and the Most Probable Explanation (MPE) in discrete directed and undirected graphical models. The approximation is based on a local operation called an \u03b5-decomposition, that decomposes functions used in the inference procedure into functions over smaller subsets of variables, with a guarantee on the error introduced. The main difference from existing approximations is the ability to translate the error introduced in the local decompositions performed during execution of the algorithm into bounds on the accuracy of the entire inference procedure. 
We note that this approximation can also be applied to the more general class of multiplicative models introduced in [27].\nWe explore optimization of \u03b5-decompositions and provide a fast optimal closed-form solution for the L2 norm. We also show that for the Kullback-Leibler divergence the optimization problem can be solved using variational algorithms on local factors. MAS can be applied to various inference algorithms. As an example we show how to apply MAS to the Variable Elimination (VE) algorithm [9, 20], and present an algorithm called DynaDecomp, which dynamically decomposes functions in the VE algorithm. In the results section we compare the performance of DynaDecomp with that of Mini-buckets [10], GMF [28] and variational methods [26] for various types of models. We find that our method achieves orders of magnitude better accuracy on all datasets.\n\n2 Multiplicative Approximation Scheme (MAS)\n\nWe propose an approximation scheme, called the Multiplicative Approximation Scheme (MAS), for inference problems in graphical models. The basic operations of the scheme are local approximations called \u03b5-decompositions that decouple the dependency of variables. Every such local decomposition has an associated error that our scheme combines into an error bound on the result.\nConsider a graphical model for n variables X = {X1, . . . , Xn} that encodes a probability distribution $P(X) = \\prod_j \\psi_j(d_j)$ where Dj \u2286 X are sets determined by the model. Throughout the paper we denote variables and sets of variables with capital letters and denote a value assigned to them with lowercase letters. We denote the observed variables in the model by E = X \\ H where E = e. To simplify the proofs we assume \u03c8j(dj) > 1. When this is not the case, as in BNs, every function \u03c8j can be multiplied by a constant zj such that the assumption holds, and the result is obtained after dividing by $\\prod_j z_j$. 
Thus, here we assume positivity but discuss how this can be relaxed below.\n\nIn addition to approximating functions \u03c8 by which the original model is defined, we also may wish to approximate other functions such as intermediate functions created in the course of an inference algorithm. We can write the result of marginalizing out a set of hidden variables as a product of functions fi. The log of the probability distribution the model encodes after such marginalization can then be written as\n\n$$\\log P(A, E) = \\log \\prod_i f_i(U_i) = \\sum_i \\phi_i(U_i) \\qquad (1)$$\n\nwhere A \u2286 H. When A = H we can choose sets Ui = Di and functions fi(Ui) = \u03c8i(Di).\nDefinition 1 (\u03b5-decomposition) Given a set of variables W, and a function \u03c6(W) that assigns real values to every instantiation W = w, a set of m functions $\\tilde{\\phi}_l(W_l)$, l = 1 . . . m, where Wl \u2286 W, is an \u03b5-decomposition if $\\bigcup_l W_l = W$, and\n\n$$\\frac{1}{1 + \\epsilon} \\le \\frac{\\sum_l \\tilde{\\phi}_l(w_l)}{\\phi(w)} \\le 1 + \\epsilon \\qquad (2)$$\n\nfor some \u03b5 \u2265 0, where wl is the projection of w on Wl.\nNote that an \u03b5-decomposition is not well defined for functions \u03c6 that equal zero or are infinite for some instantiations. These functions can still be \u03b5-decomposed for certain choices of subsets Wl by defining 0/0 = 1 and \u221e/\u221e = 1. We direct the interested reader to the paper of Geiger et al. [12] for a discussion on choosing such subsets. 
We also note that when approximating models in which some assignments have zero probability, the theoretical error bounds can be arbitrarily bad, yet in practice the approximation can sometimes yield good results.\nThe following theorems show that using \u03b5-decompositions the log-likelihood, log P(e), the log of marginal probabilities, the log of the Most Probable Explanation (MPE) and the log of the Maximum A Posteriori probability (MAP) can all be approximated within a multiplicative factor using a set of \u03b5-decompositions.\nLemma 1 Let A \u2286 H, and let P(A, E) factor according to Eq. 1, then the log of the joint probability P(a, e) can be approximated within a multiplicative factor of 1 + \u03b5max using a set of \u03b5i-decompositions, where \u03b5max = maxi{\u03b5i}.\nProof:\n\n$$\\log \\tilde{P}(a, e) \\equiv \\log \\prod_{i,l} e^{\\tilde{\\phi}_{il}(u_{il})} = \\sum_{i,l} \\tilde{\\phi}_{il}(u_{il}) \\le \\sum_i (1 + \\epsilon_i) \\phi_i(u_i) \\le (1 + \\epsilon_{max}) \\log P(a, e)$$\n\n$$\\log \\tilde{P}(a, e) \\equiv \\log \\prod_{i,l} e^{\\tilde{\\phi}_{il}(u_{il})} = \\sum_{i,l} \\tilde{\\phi}_{il}(u_{il}) \\ge \\sum_i \\frac{1}{1 + \\epsilon_i} \\phi_i(u_i) \\ge \\frac{1}{1 + \\epsilon_{max}} \\log P(a, e)$$\n\nTheorem 1 For a set A' \u2286 A the expression $\\log \\sum_{a'} P(a, e)$ can be approximated within a multiplicative factor of 1 + \u03b5max using a set of \u03b5i-decompositions.\nProof: Recall that $\\sum_j (c_j)^r \\le \\left(\\sum_j c_j\\right)^r$ for any set of numbers cj \u2265 0 and r \u2265 1. Therefore, using Lemma 1 summing out any set of variables A' \u2286 A does not increase the error:\n\n$$\\log \\sum_{a'} \\tilde{P}(a, e) \\le \\log \\sum_{a'} \\left( \\prod_i e^{\\phi_i(u_i)} \\right)^{1 + \\epsilon_{max}} \\le \\log \\left( \\sum_{a'} \\prod_i e^{\\phi_i(u_i)} \\right)^{1 + \\epsilon_{max}} = (1 + \\epsilon_{max}) \\log \\sum_{a'} P(a, e)$$\n\nSimilarly for the upper bound approximation we use the fact that $\\sum_j (c_j)^r \\ge \\left(\\sum_j c_j\\right)^r$ for any set of numbers cj \u2265 0 and 0 < r \u2264 1.\nNote that whenever E = \u2205, Theorem 1 claims that the log of all marginal probabilities can be approximated within a multiplicative factor of 1 + \u03b5max. In addition, for any E \u2286 X by setting A' = A the log-likelihood log P(e) can be approximated with the same factor.\nA similar analysis can also be applied with minor modifications to the computation of related problems like the MPE and MAP. We adopt the simplification of the problems suggested in [10], reducing the problem of the Most Probable Explanation (MPE) to computing $P(h^*, e) = \\max_h P(h, e)$ and the problem of the Maximum A Posteriori probability (MAP) to computing $P(a^*, e) = \\max_a \\sum_{H \\setminus A = h^-} P(h, e)$ for a set A \u2286 H.\nDenote the operator \u2295 as either a sum or a max operator. Then, similar to Eq. 1, for a set H' \u2286 H we can write\n\n$$\\log \\oplus_{h'} P(h, e) = \\log \\prod_i f_i(U_i) = \\sum_i \\phi_i(U_i) \\qquad (3)$$\n\nTheorem 2 Given a set A \u2286 H, the log of the MAP probability $\\log \\max_a \\sum_{H \\setminus A = h^-} P(h, e)$ can be approximated within a multiplicative factor of 1 + \u03b5max using a set of \u03b5i-decompositions.\nProof: The proof follows that of Theorem 1 with the addition of the fact that $\\max_j (c_j)^r = (\\max_j c_j)^r$ for any set of real numbers cj \u2265 0 and r \u2265 0.\nAn immediate conclusion from Theorem 2 is that the MPE probability can also be approximated with the same error bounds, by choosing A = H.\n\n2.1 Compounded Approximation\n\nThe results on using \u03b5-decompositions assume that we decompose functions fi as in Eqs. 1 and 3. Here we consider decompositions of any function created during the inference procedure, and in particular compounded decompositions of functions that were already decomposed. 
Suppose that a function $\\tilde{\\phi}(W)$, that already incurs an error \u03b51 compared to a function \u03c6(W), can be decomposed with an error \u03b52. Then, according to Eq. 2, this results in a set of functions $\\hat{\\phi}_l(W_l)$, such that the error of $\\sum_l \\hat{\\phi}_l(W_l)$ is (1 + \u03b51) \u00b7 (1 + \u03b52) with respect to \u03c6(W).\nTo understand what the guaranteed error is for an entire inference procedure, consider a directed graph where the nodes represent functions of the inference procedure, and each node v has an associated error rv. The nodes representing the initial potential functions of the model \u03c8i have no parents in the model and are associated with zero error (rv = 1). Every multiplication operation is denoted by edges directed from the nodes S, representing the multiplied functions, to a node t representing the resulting function, the error of which is $r_t = \\max_{s \\in S} r_s$. An \u03b5-decomposition on the other hand has a single source node s with an associated error rs, representing the decomposed function, and several target nodes T, with an error rt = (1 + \u03b5)rs for every t \u2208 T. The guaranteed error for the entire inference procedure is then the error associated with the sink function in the graph. In Figure 1 we illustrate such a graph for an inference procedure that starts with four functions (fa, fb, fc and fd) and decomposes three functions, fa, fg and fj, with errors \u03b51, \u03b52 and \u03b53 respectively. In this example we assume that \u03b51 > \u03b52 and that 1 + \u03b51 < (1 + \u03b52)(1 + \u03b53).\n\n2.2 \u03b5-decomposition Optimization\n\n\u03b5-decompositions can be utilized in inference algorithms to reduce the computational cost by parsimoniously approximating factors that occur during the course of computation. 
As we discuss in Section 3, both the selection of the form of the \u03b5-decomposition (i.e., the sets Wi) and which factors to approximate impact the overall accuracy and runtime of the algorithm. Here we consider the problem of optimizing the approximating functions $\\tilde{\\phi}_i$ given a selected factorization Wi.\nGiven a function $f(W) = e^{\\phi(W)}$ and the sets Wi, the goal is to optimize the functions $\\tilde{\\phi}_i(W_i)$ in order to minimize the error \u03b5f introduced in the decomposition. The objective function is therefore\n\n$$\\min_{(\\tilde{\\phi}_1, \\ldots, \\tilde{\\phi}_m)} \\max_{w \\in W} \\left\\{ \\frac{\\sum_i \\tilde{\\phi}_i(w_i)}{\\phi(w)}, \\frac{\\phi(w)}{\\sum_i \\tilde{\\phi}_i(w_i)} \\right\\} \\qquad (4)$$\n\nThis problem can be formalized as a convex problem using the following notations. Let\n\n$$t = \\max_{w \\in W} \\left\\{ \\frac{\\sum_i \\tilde{\\phi}_i(w_i)}{\\phi(w)}, \\frac{\\phi(w)}{\\sum_i \\tilde{\\phi}_i(w_i)} \\right\\} \\quad \\text{and} \\quad S_w = \\frac{\\phi(w)}{\\sum_i \\tilde{\\phi}_i(w_i)} \\qquad (5)$$\n\nNow we can reformulate the problem as\n\n$$\\min_{(\\tilde{\\phi}_1, \\ldots, \\tilde{\\phi}_m)} t \\quad \\text{s.t.} \\quad \\forall (W = w): \\; S_w \\le t \\;\\; \\text{and} \\;\\; S_w^{-1} \\le t \\qquad (6)$$\n\nThis type of problem can be solved with geometric programming techniques, and in particular using interior-point methods [18]. Unfortunately, in the general case solving this problem requires $O(m^3 |W|^3)$ time, and hence can be too expensive for functions over a large domain. On the other hand, many times functions defined over a small domain cannot be decomposed without introducing a large error. Thus, when trying to limit the error introduced, a significant amount of time is needed for such optimization. To reduce the computational cost of the optimization we resort to minimizing similar measures, in the hope that they will lead to a small error \u03b5f. Note that by deviating from Eq. 
4 to choose the functions $\\tilde{\\phi}_i$ we may increase the worst case penalty error but not necessarily the actual error achieved by the approximation. In addition, even when using different measures for the optimization we can still compute \u03b5f exactly.\n\n2.2.1 Minimizing the L2 Norm\n\nAn alternative minimization measure, the L2 norm, is closely related to that in Eq. 4 and given as:\n\n$$\\min_{(\\tilde{\\phi}_1, \\ldots, \\tilde{\\phi}_m)} \\sqrt{\\sum_{w \\in W} \\left[ \\left( \\sum_i \\tilde{\\phi}_i(w_i) \\right) - \\phi(w) \\right]^2}$$\n\nWe give a closed form analytic solution for this minimization problem when the sets Wi are disjoint, but first we can remove the square root from the optimization formula due to the monotonicity of the square root for positive values. Hence we are left with the task of minimizing:\n\n$$\\min_{(\\tilde{\\phi}_1, \\ldots, \\tilde{\\phi}_m)} \\sum_{w \\in W} \\left[ \\left( \\sum_i \\tilde{\\phi}_i(w_i) \\right) - \\phi(w) \\right]^2 \\qquad (7)$$\n\nFigure 1: A schematic description of an inference procedure along with the associated error. The procedure starts with four functions (fa, fb, fc and fd) and decomposes three functions, fa, fg and fj, with errors \u03b51, \u03b52 and \u03b53 respectively. In this example we assume that \u03b51 > \u03b52, which results in an error rk = 1 + \u03b51, and assume that 1 + \u03b51 < (1 + \u03b52)(1 + \u03b53), which results in the errors rm = ro = (1 + \u03b52)(1 + \u03b53).\n\nFigure 2: An irreducible minor graph of a 4 \u00d7 4 Ising model that can be obtained via VE without creating functions of more than 3 variables. Applying MAS, only one function over three variables needs to be decomposed into two functions over overlapping sets of variables in order to complete inference using only functions over three or fewer variables.\n\nWe use the notation w \u2248 wk to denote an instantiation W = w that is consistent with the instantiation Wk = wk. To find the optimal value of $\\tilde{\\phi}_i(w_i)$ we differentiate Eq. 7 with respect to each $\\tilde{\\phi}_k(w_k)$ and set to zero. Choosing the constraint $\\sum_{w_i} \\tilde{\\phi}_i(w_i) = \\frac{\\sum_w \\phi(w)}{m \\prod_{j \\ne i} |W_j|}$ in the resulting under-constrained set of linear equations we get\n\n$$\\tilde{\\phi}_k(w_k) = \\frac{\\sum_{w \\approx w_k} \\phi(w)}{\\prod_{i \\ne k} |W_i|} - \\sum_{i \\ne k} \\frac{\\sum_w \\phi(w)}{m \\prod_j |W_j|}$$\n\nAs the last term is independent of the index i we finally obtain\n\n$$\\tilde{\\phi}_k(w_k) = \\frac{\\sum_{w \\approx w_k} \\phi(w)}{\\prod_{i \\ne k} |W_i|} - \\frac{(m - 1) \\sum_w \\phi(w)}{m |W|} \\qquad (8)$$\n\nThe second term of Eq. 8 is computed once for a decomposition operation. Denoting |W| = N this term can be computed in O(N) time. Computing the first term of Eq. 
8 also takes O(N) time but it needs to be computed for every resulting function $\\tilde{\\phi}_k$, hence taking an overall time of O(Nm).\n\n2.2.2 Minimizing the KL Divergence\n\nThe Kullback-Leibler (KL) divergence is another common alternative measure used for optimization:\n\n$$\\min_{(\\tilde{\\phi}_1, \\ldots, \\tilde{\\phi}_m)} \\sum_{w \\in W} \\left[ \\sum_i \\tilde{\\phi}_i(w_i) \\right] \\log \\frac{\\sum_i \\tilde{\\phi}_i(w_i)}{\\phi(w)} \\qquad (9)$$\n\nAlthough no closed form solution is known for this minimization problem, iterative algorithms were devised for variational approximation, which start with arbitrary functions $\\tilde{\\phi}_i(W_i)$ and converge to a local minimum [16, 12]. Despite the drawbacks of unbounded convergence time and lack of guarantee to converge to the global optimum, these methods have proven quite successful. In our context this approach has the benefit of allowing overlapping sets Wi.\n\n3 Applying MAS to Inference Algorithms\n\nOur multiplicative approximation scheme offers a way to reduce the computational cost of inference by decoupling variables via \u03b5-decompositions. The fact that many existing inference algorithms compute and utilize multiplicative factors during the course of computation means that the scheme can be applied widely. The approach does require a mechanism to select functions to decompose; however, the flexibility of the scheme allows a variety of alternative mechanisms. One simple cost-focused strategy is to decompose a function whenever its size exceeds some threshold. An alternative quality-focused strategy is to choose an \u03b5 and search for \u03b5-decompositions Wi. Below we consider the application of our approximation scheme to variable elimination with yet another selection strategy. We note that heuristics for choosing approximate factorizations exist for the selection of disjoint sets [28] and for overlapping sets [5] and could be utilized. 
The ideal application of our scheme is likely to depend both on the specific inference algorithm and the application of interest.\n\n3.1 Dynamic Decompositions\n\nOne family of decomposition strategies of particular interest consists of those which allow for dynamic decompositions during the inference procedure. In this dynamic framework, MAS can be incorporated into known exact inference algorithms for graphical models, provided that local functions can be bounded according to Eq. 2. A dynamic decomposition strategy applies \u03b5-decompositions to functions by which the original model is defined and to intermediate functions created in the course of the inference algorithm, according to Eq. 1 or Eq. 3, based on the current state of the algorithm, and the accuracy introduced by the possible decompositions.\nUnlike other approximation methods, such as the variational approach [16] or the edge deletion approach [8], dynamic decompositions have the capability of decoupling two variables in some contexts while maintaining their dependence in others. If we wish to restrict ourselves to functions over three or fewer variables when performing inference on a 4 \u00d7 4 Ising model, the model in Figure 2 is an inevitable minor, and from this point of the elimination, approximation is mandatory. In the variational framework, an edge in the graph should be removed, disconnecting the direct dependence between two or more variables (e.g. removing the edge A-C would result in breaking the set ABC into the sets AB and BC and breaking the set ACD into AD and CD). The same is true for the edge deletion method, with the difference in the new potentials associated with the new sets. Dynamic decompositions allow for a more refined decoupling, where the dependence is removed only in some of the functions. 
In our example breaking the set ABC into AB and BC while keeping the set ACD intact is possible and is also sufficient for reducing the complexity of inference to functions of no more than three variables (the elimination order would be: A,B,F,H,C,E,D,G). Moreover, if decomposing the set ABC can be done with an error \u03b5ABC, as defined in Eq. 2, then we are guaranteed not to exceed this error for the entire approximate inference procedure. An extreme example will be the functions for the sets ABC and ACD as appear in the tables of Figure 2. It is possible to decompose the function over the set ABC into two functions over the sets AB and BC with an arbitrarily small error, while the same is not possible for the function over the set ACD. Hence, in this example the result of our method will be nearly equal to the solution of exact inference on the model, and the theoretical error bounds will be arbitrarily small, while other approaches, such as the variational method, can yield arbitrarily bad approximations.\nWe discuss how to incorporate MAS into the Variable Elimination (VE) algorithm for computing the likelihood of a graphical model [9, 20]. In this algorithm variables V \u2208 H are summed out iteratively after multiplying all existing functions that include V, yielding intermediate functions f(W \u2286 X) where V \u2209 W. MAS can be incorporated into the VE algorithm by identifying \u03b5-decompositions for some of the intermediate functions f. This results in the elimination of f from the pool of functions and adding instead the functions $\\tilde{f}_i(W_i) = e^{\\tilde{\\phi}_i(W_i)}$. Note that the sets Wi are not necessarily disjoint and can have common variables. Using \u03b5-decompositions reduces the computational complexity, as some variables are decoupled at specific points during execution of the algorithm. 
Throughout the algorithm the maximal error \u03b5max introduced by the decompositions can be easily computed by associating functions with errors, as explained in Section 2.1. In our experiments we restrict attention to non-compounded decompositions. Our algorithm decomposes a function only if it is over a given size M, and if it introduces no more than \u03b7 error. The approximating functions in this algorithm are strictly disjoint, of size no more than \u221aM, and with the variables assigned randomly to the functions. We call this algorithm DynaDecomp (DD) and provide pseudo-code in Algorithm 1. There we use the notation \u2297(T) to denote multiplication of the functions f \u2208 T, and \u2298(f) to denote decomposition of function f. The outcome of \u2298(f) is a pair $(\\epsilon, \\tilde{F})$ where the functions $\\tilde{f}_i \\in \\tilde{F}$ are over a disjoint set of variables.\n\nTable 1: Accuracy and speedup for grid-like models. Upper panel: attractive Ising models; middle panel: repulsive Ising models; lower panel: Bayesian network grids with random probabilities.\nModel | Num Values | Accuracy | Bounds | Speedup | DD time (secs)\nAttractive Ising models\n10 \u00d7 10 | 5 | 2.4e-4 | 0.0096 | 49.2 | 0.04\n10 \u00d7 10 | 2 | 2.1e-4 | 0.0094 | 2.5 | 0.01\n15 \u00d7 15 | 5 | 1.2e-4 | 0.0099 | 223.3 | 0.21\n15 \u00d7 15 | 2 | 2.2e-4 | 0.0096 | 8.3 | 0.04\n20 \u00d7 20 | 2 | 1.2e-4 | 0.0095 | 12.9 | 0.08\n25 \u00d7 25 | 2 | 2.6e-5 | 0.0092 | 20.9 | 0.10\n30 \u00d7 30 | 2 | 5.7e-4 | 0.0097 | 236.7 | 0.11\nRepulsive Ising models\n10 \u00d7 10 | 5 | 3.2e-4 | 0.0099 | 38.2 | 0.04\n10 \u00d7 10 | 2 | 3.5e-4 | 0.0098 | 2.3 | 0.01\n15 \u00d7 15 | 5 | 3.2e-3 | 0.0099 | 568.4 | 0.12\n15 \u00d7 15 | 2 | 8.6e-4 | 0.0094 | 7.2 | 0.05\n20 \u00d7 20 | 2 | 4.5e-4 | 0.0091 | 14.3 | 0.10\n25 \u00d7 25 | 2 | 3.1e-5 | 0.0094 | 22.8 | 0.11\n30 \u00d7 30 | 2 | 8.1e-5 | 0.0099 | 218.7 | 0.10\nBayesian network grids\n10 \u00d7 10 | 2 | 3.0e-3 | 0.0098 | 1.1 | 0.01\n12 \u00d7 12 | 2 | 8.1e-3 | 0.0096 | 11.3 | 0.02\n15 \u00d7 15 | 2 | 1.7e-3 | 0.0098 | 201.4 | 0.05\n18 \u00d7 18 | 2 | 3.0e-4 | 0.0090 | 1782.8 | 0.15\n20 \u00d7 20 | 2 | 1.8e-3 | 0.0097 | 7112.9 | 1.30\n10 \u00d7 10 | 5 | 2.8e-5 | 0.0095 | 49.3 | 0.03\n12 \u00d7 12 | 5 | 5.5e-4 | 0.0096 | 458.6 | 0.05\n7 \u00d7 7 | 10 | 1.8e-4 | 0.0093 | 7.8 | 0.03\n8 \u00d7 8 | 10 | 1.4e-4 | 0.0098 | 8.4 | 0.15\n\nAlgorithm 1: DynaDecomp\nInput: A model for n variables X = {X1, . . . , Xn} and functions \u03c8i(Di \u2286 X), that encodes $P(X) = \\prod_i \\psi_i(D_i)$; a set E = X \\ H of observed variables and their assignment E = e; an elimination order R over the variables in H; scalars M and \u03b7.\nOutput: The log-likelihood log P(e); an error \u03b5.\nInitialize: \u03b5 = 0; F \u2190 {\u03c8i(Di)}; I(\u03c8i) = true;\nfor i = 1 to n do\n    k \u2190 R[i];\n    T \u2190 {f : f contains Xk, f \u2208 F};\n    F \u2190 F \\ T;\n    f' \u2190 $\\sum_{x_k}$ \u2297(T);\n    I(f') = $\\bigwedge_{f \\in T}$ I(f);\n    if |f'| \u2265 M and I(f') = true then\n        $(\\epsilon_{f'}, \\tilde{F})$ \u2190 \u2298(f');\n        if $\\epsilon_{f'}$ \u2264 \u03b7 then\n            $\\forall \\tilde{f} \\in \\tilde{F}$: $I(\\tilde{f})$ = false;\n            F \u2190 F \u222a $\\tilde{F}$;\n            \u03b5 = max{\u03b5, $\\epsilon_{f'}$};\n        else\n            F \u2190 F \u222a f';\n    else\n        F \u2190 F \u222a f';\nmultiply all constant functions in F and put in p;\nreturn log p, \u03b5;\n\nWe note that MAS can also be used on top of other common algorithms for exact inference in probabilistic models which are widely used, thus gaining similar benefits as those algorithms. 
For example, applying MAS to the junction tree algorithm [14], a decomposition can decouple variables in messages sent from one node in the junction tree to another, and approximate all marginal distributions of single variables in the model in a single run, with similar guarantees on the error. This extension is analogous to how the mini-clusters algorithm [17] extends the mini-bucket algorithm [10].\n\n4 Results\n\nWe demonstrate the power of MAS by reporting the accuracy and theoretical bounds for our DynaDecomp algorithm for a variety of models. Our empirical study focuses on approximating the likelihood of evidence, except when comparing to the results of Xing et al. [28] on grid models. The quality of approximation is measured in terms of accuracy and speedup. The accuracy is reported as $\\max\\left\\{ \\frac{\\log L}{\\log \\tilde{L}}, \\frac{\\log \\tilde{L}}{\\log L} \\right\\} - 1$ where L is the likelihood and $\\tilde{L}$ is the approximate likelihood achieved by DynaDecomp. We also report the theoretical accuracy which is the maximum error introduced by decomposition operations. The speedup is reported as a ratio of run-times for obtaining the approximated and exact solutions, in addition to the absolute time of approximation. In all experiments a random partition was used to decompose the functions, and the L2 norm optimization introduced in Section 2.2.1 was applied to minimize the error. The parameter M was set to 10,000 and the guaranteed accuracy \u03b7 was set to 1%; however, as is evident from the results, the algorithm usually achieves better accuracy.\nWe compared the performance of DynaDecomp with the any-time Mini-buckets (MB) algorithm [10]. The parameters i and m, which are the maximal number of variables and functions in a mini-bucket, were initially set to 3 and 1 respectively. The parameter \u03b5 was set to zero, not constraining the possible accuracy. 
Generally we allowed MB to run the same time it took DynaDecomp to approximate the model, but not less than one iteration (with the initial parameters).\nWe used two types of grid-like models. The first is an Ising model with random attractive or repulsive pair-wise potentials, as was used in [28]. When computing likelihood in these models we randomly assigned values to 10% of the variables in the model. The other kind of grids were Bayesian networks where every variable Xij at position (i, j) in the grid has the variables Xi\u22121,j and Xi,j\u22121 as parents in the model. In addition, every variable Xij has a corresponding observed variable Yij connected to it. Probabilities in these models were uniformly distributed between zero and one. Inference on these models, often used in computer vision [11], is usually harder than on Ising models, due to reduced factorization. We used models where the variables had either two, five or ten values. The results are shown in Table 1. In addition, we applied DynaDecomp to two 100 \u00d7 100 Ising grid models with binary variables. Inference in these models is intractable. We estimate the time for exact computation using VE on current hardware to be 3 \u00b7 10^15 seconds. This is longer than the time since the disappearance of the dinosaurs. Setting \u03b7 to 2%, DynaDecomp computed the approximated likelihood in 7.09 seconds for the attractive model and 8.14 seconds for the repulsive one.\nComparing our results with those obtained by the MB algorithm with an equivalent amount of computations, we find that on the average the accuracy of MB across all models in Table 1 is 0.198 while the average accuracy of DynaDecomp is 9.8e\u22124, more than 200 times better than that of MB. In addition the theoretical guarantees are more than 30% for MB and 0.96% for DynaDecomp, a 30-fold improvement. 
As a side note, the MB algorithm performed significantly better on attractive Ising models than on repulsive ones. To compare our results with those reported in [28] we computed all the marginal probabilities (without evidence) and calculated the L1-based measure $\\sum_{i,j} \\sum_{x_{ij}} |P(x_{ij}) - \\tilde{P}(x_{ij})|$. Running on the Ising models DynaDecomp obtained an average of 1.86e\u22125 compared to 0.003 of generalized belief propagation (GBP) and 0.366 of generalized mean field (GMF). Although the run times are not directly comparable due to differences in hardware, the average run-time of DynaDecomp was less than 0.1 seconds, while the run-times of GBP and GMF were previously reported [28] to be 140 and 1.6 seconds respectively, on 8 \u00d7 8 grids.\nWe applied our method to probabilistic phylogenetic models. Inference on these large models, which can contain tens of thousands of variables, is used for model selection purposes. Previous works [15, 26] have obtained upper and lower bounds on the likelihood of evidence in the models suggested in [22] using variational methods, reporting an error of 1%. Using the data as in [26], we achieved less than 0.01% error on average within a few seconds, which improves over previous results by two orders of magnitude both in terms of accuracy and speedup.\nIn addition, we applied DynaDecomp to 24 models from the UAI\u201906 evaluation of probabilistic inference repository [1] with \u03b7 = 1%. Only models that did not have zeros and that our exact inference algorithm could solve in less than an hour were used. The average accuracy of DynaDecomp on these models was 0.0038 with an average speedup of 368.8 and average run-time of 0.79 seconds. We also applied our algorithm to two models from the CPCS benchmark (cpcs360b and cpcs422b). DynaDecomp obtained an average accuracy of 0.008 versus 0.056 obtained by MB. 
We note that\nthe results obtained by MB are consistent with those reported in [10] for the MPE problem.\n\nReferences\n[1] Evaluation of probabilistic inference systems: http://tinyurl.com/3k9l4b, 2006.\n[2] Bidyuk and Dechter. An anytime scheme for bounding posterior beliefs. AAAI 2006.\n[3] Bidyuk and Dechter. Improving bound propagation. ECAI 342\u2013346, 2006.\n[4] Cheng and Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. JAIR 13:155\u2013188, 2000.\n[5] Choi and Darwiche. A variational approach for approximating Bayesian networks by edge deletion. UAI 2006.\n[6] Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. AI 42(2-3):393\u2013405, 1990.\n[7] Dagum and Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. AI 60(1):141\u2013153, 1993.\n[8] Darwiche, Chan, and Choi. On Bayesian network approximation by edge deletion. UAI 2005.\n[9] Dechter. Bucket elimination: A unifying framework for reasoning. AI 113(1-2):41\u201385, 1999.\n[10] Dechter and Rish. Mini-buckets: A general scheme for bounded inference. J. ACM 50:107\u2013153, 2003.\n[11] Freeman, Pasztor, and Carmichael. Learning low-level vision. IJCV 40:25\u201347, 2000.\n[12] Geiger, Meek, and Wexler. A variational inference procedure allowing internal structure for overlapping clusters and deterministic constraints. JAIR 27:1\u201323, 2006.\n[13] Henrion. Propagating uncertainty in Bayesian networks by probabilistic logic sampling. UAI 1988.\n[14] Jensen, Lauritzen, and Olesen. Bayesian updating in causal probabilistic networks by local computations. Comp. Stat. Quarterly 4:269\u2013282, 1990.\n[15] Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman. Efficient approximations for learning phylogenetic HMM models from data. ISMB 2004.\n[16] Jordan, Ghahramani, Jaakkola, and Saul. 
An introduction to variational methods for graphical models. Machine Learning 37(2):183\u2013233, 1999.\n[17] Mateescu, Dechter, and Kask. Partition-based anytime approximation for belief updating. 2001.\n[18] Boyd and Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[19] Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.\n[20] Shachter, D\u2019Ambrosio, and Del Favero. Symbolic probabilistic inference in belief networks. AAAI 1990.\n[21] Shachter and Peot. Simulation approaches to general probabilistic inference on belief networks. UAI 1989.\n[22] Siepel and Haussler. Combining phylogenetic and HMMs in biosequence analysis. RECOMB 2003.\n[23] Wainwright, Jaakkola, and Willsky. A new class of upper bounds on the log partition function. IEEE Trans. Info. Theory 51(7):2313\u20132335, 2005.\n[24] Weiss. Belief propagation and revision in networks with loops. Technical Report AIM-1616, 1997.\n[25] Wexler and Geiger. Importance sampling via variational optimization. UAI 2007.\n[26] Wexler and Geiger. Variational upper bounds for probabilistic phylogenetic models. RECOMB 2007.\n[27] Wexler and Meek. Inference for multiplicative models. UAI 2008.\n[28] Xing, Jordan, and Russell. Graph partition strategies for generalized mean field inference. UAI 2004.\n", "award": [], "sourceid": 707, "authors": [{"given_name": "Ydo", "family_name": "Wexler", "institution": null}, {"given_name": "Christopher", "family_name": "Meek", "institution": null}]}