{"title": "Dynamic Importance Sampling for Anytime Bounds of the Partition Function", "book": "Advances in Neural Information Processing Systems", "page_first": 3196, "page_last": 3204, "abstract": "Computing the partition function is a key inference task in many graphical models. In this paper, we propose a dynamic importance sampling scheme that provides anytime finite-sample bounds for the partition function. Our algorithm balances the advantages of the three major inference strategies, heuristic search, variational bounds, and Monte Carlo methods, blending sampling with search to refine a variationally defined proposal. Our algorithm combines and generalizes recent work on anytime search and probabilistic bounds of the partition function. By using an intelligently chosen weighted average over the samples, we construct an unbiased estimator of the partition function with strong finite-sample confidence intervals that inherit both the rapid early improvement rate of sampling and the long-term benefits of an improved proposal from search. This gives significantly improved anytime behavior, and more flexible trade-offs between memory, time, and solution quality. We demonstrate the effectiveness of our approach empirically on real-world problem instances taken from recent UAI competitions.", "full_text": "Dynamic Importance Sampling for Anytime Bounds\n\nof the Partition Function\n\nQi Lou\n\nComputer Science\n\nUniv. of California, Irvine\nIrvine, CA 92697, USA\nqlou@ics.uci.edu\n\nRina Dechter\n\nComputer Science\n\nUniv. of California, Irvine\nIrvine, CA 92697, USA\ndechter@ics.uci.edu\n\nAlexander Ihler\nComputer Science\n\nUniv. of California, Irvine\nIrvine, CA 92697, USA\nihler@ics.uci.edu\n\nAbstract\n\nComputing the partition function is a key inference task in many graphical models.\nIn this paper, we propose a dynamic importance sampling scheme that provides\nanytime \ufb01nite-sample bounds for the partition function. 
Our algorithm balances\nthe advantages of the three major inference strategies, heuristic search, variational\nbounds, and Monte Carlo methods, blending sampling with search to refine a\nvariationally defined proposal. Our algorithm combines and generalizes recent\nwork on anytime search [16] and probabilistic bounds [15] of the partition function.\nBy using an intelligently chosen weighted average over the samples, we construct\nan unbiased estimator of the partition function with strong finite-sample confidence\nintervals that inherit both the rapid early improvement rate of sampling and the\nlong-term benefits of an improved proposal from search. This gives significantly\nimproved anytime behavior, and more flexible trade-offs between memory, time,\nand solution quality. We demonstrate the effectiveness of our approach empirically\non real-world problem instances taken from recent UAI competitions.\n\n1 Introduction\n\nProbabilistic graphical models, including Bayesian networks and Markov random fields, provide a\nframework for representing and reasoning with probabilistic and deterministic information [5, 6, 8].\nReasoning in a graphical model often requires computing the partition function, or normalizing\nconstant of the underlying distribution. Exact computation of the partition function is known to be\n#P-hard [19] in general, leading to the development of many approximate schemes. Two important\nproperties for a good approximation are that (1) it provides bounds or confidence guarantees on the\nresult, so that the degree of approximation can be measured; and that (2) it can be improved in an\nanytime manner, so that the approximation becomes better as more computation is available.\nIn general, there are three major paradigms for approximate inference: variational bounds, heuristic\nsearch, and Monte Carlo sampling. Each method has advantages and disadvantages. 
Variational\nbounds [21], and closely related approximate elimination methods [7, 14] provide deterministic\nguarantees on the partition function. However, these bounds are not anytime; their quality often\ndepends on the amount of memory available, and they do not improve without additional memory. Search\nalgorithms [12, 20, 16] explicitly enumerate over the space of configurations and eventually provide\nan exact answer; however, while some problems are well-suited to search, others only improve their\nquality very slowly with more computation. Importance sampling [e.g., 4, 15] gives probabilistic\nbounds that improve with more samples at a predictable rate; in practice this means bounds that\nimprove rapidly at first, but are slow to become very tight. Several algorithms combine two strategies:\napproximate hash-based counting combines sampling (of hash functions) with CSP-based search [e.g.,\n3, 2] or other MAP queries [e.g., 9, 10], although these are not typically formulated to provide anytime\nbehavior. Most closely related to this work are [16] and [15], which perform search and sampling,\nrespectively, guided by variational bounds.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this work, we propose a dynamic importance sampling algorithm that provides anytime probabilistic\nbounds (i.e., they hold with probability 1 \u2212 \u03b4 for some confidence parameter \u03b4). Our algorithm\ninterleaves importance sampling with best-first search [16], which is used to refine the proposal\ndistribution of successive samples. In practice, our algorithm enjoys the rapid bound improvement\ncharacteristic of importance sampling [15], while also benefiting significantly from search\non problems where search is relatively effective, or when given enough computational resources,\neven when these points are not known in advance. 
Since our samples are drawn from a sequence of\ndifferent, improving proposals, we devise a weighted average estimator that upweights higher-quality\nsamples, giving excellent anytime behavior.\n\nMotivating example. We illustrate the focus and contributions of our work on an example problem\ninstance (Fig. 1). Search [16] provides strict bounds (gray) but may not improve rapidly, particularly\nonce memory is exhausted; on the other hand, importance sampling [15] provides probabilistic bounds\n(green) that improve at a predictable rate, but require more and more samples to become tight. We\nfirst describe a \u201ctwo-stage\u201d sampling process that uses a search tree to improve the baseline bound\nfrom which importance sampling starts (blue), greatly improving its long-term performance, then\npresent our dynamic importance sampling (DIS) algorithm, which interleaves the search and sampling\nprocesses (sampling from a sequence of proposal distributions) to give bounds that are strong in an\nanytime sense.\n\nFigure 1: Example: bounds on logZ for protein instance 1bgc.\n\n2 Background\n\nLet X = (X1, ..., XM) be a vector of random variables, where each Xi takes values in a discrete\ndomain Xi; we use lower case letters, e.g. xi \u2208 Xi, to indicate a value of Xi, and x to indicate an\nassignment of X. A graphical model over X consists of a set of factors F = {f\u03b1(X\u03b1) | \u03b1 \u2208 I},\nwhere each factor f\u03b1 is defined on a subset X\u03b1 = {Xi | i \u2208 \u03b1} of X, called its scope.\nWe associate an undirected graph G = (V, E) with F, where each node i \u2208 V corresponds to\na variable Xi and we connect two nodes, (i, j) \u2208 E, iff {i, j} \u2286 \u03b1 for some \u03b1. The set I then\ncorresponds to cliques of G. 
We can interpret F as an unnormalized probability measure, so that\n\n$$f(x) = \\prod_{\\alpha \\in I} f_\\alpha(x_\\alpha), \\qquad Z = \\sum_{x} \\prod_{\\alpha \\in I} f_\\alpha(x_\\alpha).$$\n\nZ is called the partition function, and normalizes f(x). Computing Z is often a key task in evaluating\nthe probability of observed data, model selection, or computing predictive probabilities.\n\n2.1 AND/OR search trees\n\nWe first require some notation from search. AND/OR search trees are able to exploit the conditional\nindependence properties of the model, as expressed by a pseudo tree:\nDefinition 1 (pseudo tree). A pseudo tree of an undirected graph G = (V, E) is a directed tree\nT = (V, E\u2032) sharing the same set of nodes as G. The tree edges E\u2032 form a subset of E, and we\nrequire that each edge (i, j) \u2208 E \\ E\u2032 be a \u201cback edge\u201d, i.e., the path from the root of T to j passes\nthrough i (denoted i \u2264 j). G is called the primal graph of T.\nFig. 2(a)-(b) show an example primal graph and pseudo tree. Guided by the pseudo tree, we can\nconstruct an AND/OR search tree T consisting of alternating levels of OR and AND nodes. Each OR\nnode s is associated with a variable, which, with a slight abuse of notation, we denote Xs; the children of\ns, ch(s), are AND nodes corresponding to the possible values of Xs. The root \u2205 of the AND/OR\nsearch tree corresponds to the root of the pseudo tree. Let pa(c) = s indicate the parent of c, and\nan(c) = {n | n \u2264 c} be the ancestors of c (including itself) in the tree.\n\n[Figure 1 plot: upper bound versus time (sec); curves: search, sampling, two-stage, DIS.]\n\nFigure 2: (a) A primal graph of a graphical model over 7 variables. (b) A pseudo tree for the primal\ngraph consistent with elimination order G, F, E, D, C, B, A. (c) AND/OR search tree guided by the\npseudo tree. 
One full solution tree is marked red and one partial solution tree is marked blue.\n\nAs the pseudo tree defines a partial ordering on the variables Xi, the AND/OR tree extends this to one\nover partial configurations of X. Specifically, any AND node c corresponds to a partial configuration\nx\u2264c of X, defined by its assignment and that of its ancestors: x\u2264c = x\u2264p \u222a {Xs = xc}, where\ns = pa(c), p = pa(s). For completeness, we also define x\u2264s for any OR node s, which is the same\nas that of its AND parent, i.e., x\u2264s = x\u2264pa(s). For any node n, the variables corresponding to\nx\u2264n are denoted X\u2264n. Let de(Xn) be the set of variables below Xn in the pseudo tree; we define\nX>n = de(Xn) if n is an AND node; X>n = de(Xn) \u222a {Xn} if n is an OR node.\nThe notion of a partial solution tree captures partial configurations of X respecting the search order:\nDefinition 2 (partial solution tree). A partial solution tree T of an AND/OR search tree is a subtree\nsatisfying three conditions: (1) T contains the root of the search tree; (2) if an OR node is in T, at most one of its\nchildren is in T; (3) if an AND node is in T, all of its children or none of its children are in T.\n\nAny partial solution tree T defines a partial configuration xT of X; if xT is a complete configuration\nof X, we call T a full solution tree, and use Tx to denote the corresponding solution tree of a complete\nassignment x. Fig. 2(c) illustrates these concepts.\nWe also associate a weight wc with each AND node, defined to be the product of all factors f\u03b1 that\nare instantiated at c but not before:\n\n$$w_c = \\prod_{\\alpha \\in I_c} f_\\alpha(x_\\alpha), \\qquad I_c = \\{\\alpha \\mid X_c \\in X_\\alpha \\subseteq X_{\\le c}\\}.$$\n\nFor completeness, define ws = 1 for any OR node s. 
It is then easy to see that, for any node n, the\nproduct of weights on a path to the root, $g_n = \\prod_{a \\le n} w_a$ (termed the cost of the path), equals the value\nof the factors whose scope is fully instantiated at n, i.e., fully instantiated by x\u2264n. We can extend this\ncost notion to any partial solution tree T by defining g(T) as the product of all factors fully instantiated\nby xT; we will slightly abuse notation by using g(T) and g(xT) interchangeably. In particular, the\ncost of any full solution tree equals the value of its corresponding complete configuration. We use\ng(x>n|x\u2264n) (termed the conditional cost) to denote the quotient g([x\u2264n, x>n])/g(x\u2264n), where x>n\nis any assignment of X>n, the variables below n in the search tree.\nWe give a \u201cvalue\u201d vn to each node n equal to the total conditional cost of all configurations below n:\n\n$$v_n = \\sum_{x_{>n}} g(x_{>n} \\mid x_{\\le n}). \\tag{1}$$\n\nThe value of the root is simply the partition function, v\u2205 = Z. Equivalently, vn can be defined\nrecursively: if n is an AND node corresponding to a leaf of the pseudo tree, let vn = 1; otherwise,\n\n$$v_n = \\begin{cases} \\prod_{c \\in \\mathrm{ch}(n)} v_c, & \\text{if AND node } n \\\\ \\sum_{c \\in \\mathrm{ch}(n)} w_c v_c, & \\text{if OR node } n \\end{cases} \\tag{2}$$\n\n2.2 AND/OR best-first search for bounding the partition function\n\nAND/OR best-first search (AOBFS) can be used to bound the partition function in an anytime fashion\nby expanding and updating bounds defined on the search tree [16]. Beginning with only the root\n\u2205, AOBFS expands the search tree in a best-first manner. More precisely, it maintains an explicit\nAND/OR search tree of visited nodes, denoted S. 
For each node n in the AND/OR search tree,\nAOBFS maintains un, an upper bound on vn, initialized via a pre-compiled heuristic $v_n \\le h_n^+$, and\nsubsequently updated during search using information propagated from the frontier:\n\n$$u_n = \\begin{cases} \\prod_{c \\in \\mathrm{ch}(n)} u_c, & \\text{if AND node } n \\\\ \\sum_{c \\in \\mathrm{ch}(n)} w_c u_c, & \\text{if OR node } n \\end{cases} \\tag{3}$$\n\nThus, the upper bound at the root, u\u2205, is an anytime deterministic upper bound of Z. Note that this\nupper bound depends on the current search tree S, so we write US = u\u2205.\nIf all nodes below n have been visited, then un = vn; we call n solved and can remove the subtree\nbelow n from memory. Hence we can partition the frontier nodes into two sets: solved frontier nodes,\nSOLVED(S), and unsolved ones, OPEN(S). AOBFS assigns a priority to each node and expands a\ntop-priority (unsolved) frontier node at each iteration. We use the \u201cupper priority\u201d from [16],\n\n$$U_n = g_n u_n \\prod_{s \\in \\mathrm{branch}(n)} u_s \\tag{4}$$\n\nwhere branch(n) are the OR nodes that are siblings of some node \u2264 n. Un quantifies n\u2019s contribution\nto the global bound US, so this priority attempts to reduce the upper bound on Z as quickly as possible.\nWe can also interpret our bound US as a sum of bounds on each of the partial configurations covered\nby S. Concretely, let TS be the set of projections of full solution trees on S (in other words, TS are\npartial solution trees whose leaves are frontier nodes of S); then,\n\n$$U_S = \\sum_{T \\in T_S} U_T, \\qquad \\text{where} \\qquad U_T = g(T) \\prod_{s \\in \\mathrm{leaf}(T)} u_s \\tag{5}$$\n\nand leaf(T) are the leaf nodes of the partial solution tree T.\n\n2.3 Weighted mini-bucket for heuristics and sampling\n\nTo construct a heuristic function for search, we can use a class of variational bounds called weighted\nmini-bucket (WMB, [14]). 
WMB corresponds to a relaxed variable elimination procedure, respecting\nthe search pseudo tree order, that can be tightened using reparameterization (or \u201ccost-shifting\u201d) operations.\nImportantly for this work, this same relaxation can also be used to define a proposal distribution\nfor importance sampling that yields finite-sample bounds [15]. We describe both properties here.\nLet n be any node in the search tree; then, one can show that WMB yields the following reparameterization\nof the conditional cost below n [13]:\n\n$$g(x_{>n} \\mid x_{\\le n}) = h_n^+ \\prod_{k} \\prod_{j} b_{kj}(x_k \\mid x_{\\mathrm{an}_j(k)})^{\\rho_{kj}}, \\qquad X_k \\in X_{>n} \\tag{6}$$\n\nwhere Xanj(k) are the ancestors of Xk in the pseudo tree that are included in the j-th mini-bucket\nof Xk. The size of Xanj(k) is controlled by a user-specified parameter called the ibound. The\nbkj(xk|xanj(k)) are conditional beliefs, and the non-negative weights \u03c1kj satisfy $\\sum_j \\rho_{kj} = 1$.\n\nSuppose that we define a conditional distribution q(x>n|x\u2264n) by replacing the geometric mean over\nthe bkj in (6) with their arithmetic mean:\n\n$$q(x_{>n} \\mid x_{\\le n}) = \\prod_{k} \\sum_{j} \\rho_{kj} b_{kj}(x_k \\mid x_{\\mathrm{an}_j(k)}) \\tag{7}$$\n\nApplying the arithmetic-geometric mean inequality, we see that $g(x_{>n} \\mid x_{\\le n})/h_n^+ \\le q(x_{>n} \\mid x_{\\le n})$.\nSumming over x>n shows that $h_n^+$ is a valid upper bound heuristic for vn:\n\n$$v_n = \\sum_{x_{>n}} g(x_{>n} \\mid x_{\\le n}) \\le h_n^+$$\n\nThe mixture distribution q can also be used as a proposal for importance sampling, by drawing\nsamples from q and averaging the importance weights, g/q. 
For any node n, we have that\n\n$$g(x_{>n} \\mid x_{\\le n})/q(x_{>n} \\mid x_{\\le n}) \\le h_n^+, \\qquad \\mathbb{E}\\big[ g(x_{>n} \\mid x_{\\le n})/q(x_{>n} \\mid x_{\\le n}) \\big] = v_n \\tag{8}$$\n\ni.e., the importance weight g(x>n|x\u2264n)/q(x>n|x\u2264n) is an unbiased and bounded estimator of vn.\nIn [15], this property was used to give finite-sample bounds on Z which depended on the WMB\nbound, $h_\\emptyset^+$. To be more specific, note that g(x>n|x\u2264n) = f(x) when n is the root \u2205, and thus\n$f(x)/q(x) \\le h_\\emptyset^+$; the boundedness of f(x)/q(x) results in the following finite-sample upper bound\non Z that holds with probability at least 1 \u2212 \u03b4:\n\n$$Z \\le \\frac{1}{N} \\sum_{i=1}^{N} \\frac{f(x^i)}{q(x^i)} + \\sqrt{\\frac{2\\,\\widehat{\\mathrm{Var}}(\\{f(x^i)/q(x^i)\\}_{i=1}^{N}) \\ln(2/\\delta)}{N}} + \\frac{7 \\ln(2/\\delta)\\, h_\\emptyset^+}{3(N-1)} \\tag{9}$$\n\nwhere $\\{x^i\\}_{i=1}^{N}$ are i.i.d. samples drawn from q(x), and $\\widehat{\\mathrm{Var}}(\\{f(x^i)/q(x^i)\\}_{i=1}^{N})$ is the unbiased\nempirical variance. This probabilistic upper bound usually becomes tighter than $h_\\emptyset^+$ very quickly. A\ncorresponding finite-sample lower bound on Z exists as well [15].\n\n3 Two-step sampling\n\nThe finite-sample bound (9) suggests that improvements to the upper bound on Z may be translatable\ninto improvements in the probabilistic, sampling bound. In particular, if we define a proposal that\nuses the search tree S and its bound US, we can improve our sample-based bound as well. 
This\nmotivates us to design a two-step sampling scheme that exploits the refined upper bound from search;\nit is a top-down procedure starting from the root:\n\nStep 1 For an internal node n: if it is an AND node, all its children are selected; if n is an OR node,\none child c \u2208 ch(n) is randomly selected with probability wcuc/un.\n\nStep 2 When a frontier node n is reached, if it is unsolved, draw a sample of X>n from q(x>n|x\u2264n);\nif it is solved, quit.\n\nThe behavior of Step 1 can be understood by the following proposition:\nProposition 1. Step 1 returns a partial solution tree T \u2208 TS with probability UT/US (see (5)). Any\nfrontier node of S will be reached with probability proportional to its upper priority defined in (4).\nNote that at Step 2, although the sampling process terminates when a solved node n is reached, we\nassociate every configuration x>n of X>n with probability g(x>n|x\u2264n)/vn, which is appropriate in\nlight of (1). Thus, we can show that this two-step sampling scheme induces a proposal distribution,\ndenoted qS(x), which can be expressed as:\n\n$$q_S(x) = \\prod_{n \\in \\mathrm{AND}(T_x \\cap S)} \\frac{w_n u_n}{u_{\\mathrm{pa}(n)}} \\prod_{n' \\in \\mathrm{OPEN}(S) \\cap T_x} q(x_{>n'} \\mid x_{\\le n'}) \\prod_{n'' \\in \\mathrm{SOLVED}(S) \\cap T_x} \\frac{g(x_{>n''} \\mid x_{\\le n''})}{v_{n''}}$$\n\nwhere AND(Tx \u2229 S) is the set of all AND nodes of the partial solution tree Tx \u2229 S. By applying (3),\nand noticing that the upper bound is the initial heuristic for any node in OPEN(S) and is exact at any\nsolved node, we re-write qS(x) as\n\n$$q_S(x) = \\frac{g(T_x \\cap S)}{U_S} \\prod_{n' \\in \\mathrm{OPEN}(S) \\cap T_x} h_{n'}^+ \\, q(x_{>n'} \\mid x_{\\le n'}) \\prod_{n'' \\in \\mathrm{SOLVED}(S) \\cap T_x} g(x_{>n''} \\mid x_{\\le n''}) \\tag{10}$$\n\nqS(x) actually provides bounded importance weights that can use the refined upper bound US:\nProposition 2. 
Importance weights from qS(x) are bounded by the upper bound of S, and are\nunbiased estimators of Z, i.e.,\n\n$$f(x)/q_S(x) \\le U_S, \\qquad \\mathbb{E}\\big[ f(x)/q_S(x) \\big] = Z \\tag{11}$$\n\nProof. Note that f(x) can be written as\n\n$$f(x) = g(T_x \\cap S) \\prod_{n' \\in \\mathrm{OPEN}(S) \\cap T_x} g(x_{>n'} \\mid x_{\\le n'}) \\prod_{n'' \\in \\mathrm{SOLVED}(S) \\cap T_x} g(x_{>n''} \\mid x_{\\le n''}) \\tag{12}$$\n\nNoticing that for any n\u2032 \u2208 OPEN(S), $g(x_{>n'} \\mid x_{\\le n'}) \\le h_{n'}^+ \\, q(x_{>n'} \\mid x_{\\le n'})$ by (8), and comparing\nwith (10), we see f(x)/qS(x) is bounded by US. Its unbiasedness is trivial.\n\nAlgorithm 1 Dynamic importance sampling (DIS)\nRequire: Control parameters Nd, Nl; memory budget, time budget.\nEnsure: N, HM(U), $\\widehat{\\mathrm{Var}}(\\{\\hat{Z}_i/U_i\\}_{i=1}^{N})$, \u1e90, \u2206.\nInitialize S \u2190 {\u2205} with the root \u2205.\nwhile within the time budget\n    if within the memory budget\n        Expand Nd nodes via AOBFS (Alg. 1 of [16]) with the upper priority defined in (4).    // update S and its associated upper bound US\n    end if\n    Draw Nl samples via TWOSTEPSAMPLING(S).\n    After drawing each sample:\n        Update N, HM(U), $\\widehat{\\mathrm{Var}}(\\{\\hat{Z}_i/U_i\\}_{i=1}^{N})$.\n        Update \u1e90, \u2206 via (13), (14).\nend while\nfunction TWOSTEPSAMPLING(S)\n    Start from the root of the search tree S:\n        For an internal node n: select all its children if it is an AND node; select exactly\n        one child c \u2208 ch(n) with probability wcuc/un if it is an OR node.\n        At any unsolved frontier node n, draw one sample from q(x>n|x\u2264n) in (7).\nend function\n\nThus, importance weights resulting from our two-step sampling can enjoy the same type of bounds\ndescribed in (9). 
Moreover, note that at any solved node, our sampling procedure incorporates the\n\u201cexact\u201d value of that node into the importance weights, which serves as Rao-Blackwellisation and can\npotentially reduce variance.\nWe can see that if S = \u2205 (before search), qS(x) is the proposal distribution of [15]; as search\nproceeds, the quality of the proposal distribution improves (gradually approaching the underlying\ndistribution f(x)/Z as S approaches the complete search tree). If we perform search first, up to some\nmemory limit, and then sample, which we refer to as two-stage sampling, our probabilistic bounds\nwill proceed from an improved baseline, giving better bounds at moderate to long computation times.\nHowever, doing so sacrifices the quick improvement early on given by basic importance sampling. In\nthe next section, we describe our dynamic importance sampling procedure, which balances these two\nproperties.\n\n4 Dynamic importance sampling\n\nTo provide good anytime behavior, we would like to do both sampling and search, so that early\nsamples can improve the bound quickly, while later samples obtain the benefits of the search tree\u2019s\nimproved proposal. To do so, we define a dynamic importance sampling (DIS) scheme, presented in\nAlg. 1, which interleaves drawing samples and expanding the search tree.\nOne complication of such an approach is that each sample comes from a different proposal distribution,\nand thus has a different bound value entering into the concentration inequality. Moreover, each sample\nis of a different quality: later samples should have lower variance, since they come from an improved\nproposal. To this end, we construct an estimator of Z that upweights higher-quality samples. Let\n$\\{x^i\\}_{i=1}^{N}$ be a series of samples drawn via Alg. 
1, with $\\{\\hat{Z}_i = f(x^i)/q_{S_i}(x^i)\\}_{i=1}^{N}$ the corresponding\nimportance weights, and $\\{U_i = U_{S_i}\\}_{i=1}^{N}$ the corresponding upper bounds on the importance weights,\nrespectively. We introduce an estimator \u1e90 of Z:\n\n$$\\hat{Z} = \\frac{\\mathrm{HM}(U)}{N} \\sum_{i=1}^{N} \\frac{\\hat{Z}_i}{U_i}, \\qquad \\mathrm{HM}(U) = \\Big[ \\frac{1}{N} \\sum_{i=1}^{N} \\frac{1}{U_i} \\Big]^{-1} \\tag{13}$$\n\nwhere HM(U) is the harmonic mean of the upper bounds Ui. \u1e90 is an unbiased estimator of Z\n(since it is a weighted average of independent, unbiased estimators). Additionally, since Z/HM(U),\n\u1e90/HM(U), and \u1e90i/Ui are all within the interval [0, 1], we can apply an empirical Bernstein\nbound [17] to derive finite-sample bounds:\n\nTheorem 1. Define the deviation term\n\n$$\\Delta = \\mathrm{HM}(U) \\Big( \\sqrt{\\frac{2\\,\\widehat{\\mathrm{Var}}(\\{\\hat{Z}_i/U_i\\}_{i=1}^{N}) \\ln(2/\\delta)}{N}} + \\frac{7 \\ln(2/\\delta)}{3(N-1)} \\Big) \\tag{14}$$\n\nwhere $\\widehat{\\mathrm{Var}}(\\{\\hat{Z}_i/U_i\\}_{i=1}^{N})$ is the unbiased empirical variance of $\\{\\hat{Z}_i/U_i\\}_{i=1}^{N}$. Then \u1e90 + \u2206 and \u1e90 \u2212 \u2206\nare upper and lower bounds of Z with probability at least 1 \u2212 \u03b4, respectively, i.e., Pr[Z \u2264 \u1e90 + \u2206] \u2265\n1 \u2212 \u03b4 and Pr[Z \u2265 \u1e90 \u2212 \u2206] \u2265 1 \u2212 \u03b4.\nIt is possible that \u1e90 \u2212 \u2206 < 0 at first; if so, we may replace \u1e90 \u2212 \u2206 with any non-trivial lower bound\nof Z. In the experiments, we use \u1e90\u03b4, a (1 \u2212 \u03b4) probabilistic bound by the Markov inequality [11].\nWe can also replace \u1e90 + \u2206 with the current deterministic upper bound if the latter is tighter.\n\nIntuitively, our DIS algorithm is similar to Monte Carlo tree search (MCTS) [1], which also grows\nan explicit search tree while sampling. However, in MCTS, the sampling procedure is used to grow\nthe tree, while DIS uses a classic search priority. This ensures that the DIS samples are independent,\nsince samples do not influence the proposal distribution of later samples. 
This also distinguishes DIS\nfrom methods such as adaptive importance sampling (AIS) [18].\n\n5 Empirical evaluation\n\nWe evaluate our approach (DIS) against AOBFS (search, [16]) and WMB-IS (sampling, [15]) on\nseveral benchmarks of real-world problem instances from recent UAI competitions. Our benchmarks\ninclude pedigree, 22 genetic linkage instances from the UAI\u201908 inference challenge [footnote 1]; protein, 50\nrandomly selected instances made from the \u201csmall\u201d protein side-chains of [22]; and BN, 50 randomly\nselected Bayesian networks from the UAI\u201906 competition [footnote 2]. These three sets are selected to illustrate\ndifferent problem characteristics; for example, protein instances are relatively small (M = 100\nvariables on average, and average induced width 11.2) but high cardinality (average max|Xi| = 77.9),\nwhile pedigree and BN have more variables and higher induced width (average M = 917.1 and 838.6,\naverage width 25.5 and 32.8), but lower cardinality (average max|Xi| = 5.6 and 12.4).\nWe allotted 1GB memory to all methods, first computing the largest ibound that fits the memory budget,\nand using the remaining memory for search. All the algorithms used the same upper bound heuristics,\nwhich also means DIS and AOBFS had the same amount of memory available for search. For AOBFS,\nwe use the memory-limited version (Alg. 2 of [16]) with \u201cupper\u201d priority, which continues improving\nits bounds past the memory limit. Additionally, we let AOBFS access a lower bound heuristic for no\ncost, to facilitate comparison between DIS and AOBFS. We show DIS for two settings, (Nl=1, Nd=1)\nand (Nl=1, Nd=10), balancing the effort between search and sampling. Note that WMB-IS can be\nviewed as DIS with (Nl=Inf, Nd=0), i.e., it runs pure sampling without any search, and two-stage\nsampling can be viewed as DIS with (Nl=1, Nd=Inf), i.e., it searches to the memory limit then samples. We\nset \u03b4 = 0.025 and ran each algorithm for 1 hour. 
All implementations are in C/C++.\nAnytime bounds for individual instances. Fig. 3 shows the anytime behavior of all methods on\ntwo instances from each benchmark. We observe that compared to WMB-IS, DIS provides better\nupper and lower bounds on all instances. In 3(d)\u2013(f), WMB-IS is not able to produce tight bounds\nwithin 1 hour, but DIS quickly closes the gap. Compared to AOBFS, in 3(a)\u2013(c),(e), DIS improves\nmuch faster, and in (d),(f) it remains nearly as fast as search. Note that four of these examples are\nsufficiently hard to be unsolved by a variable elimination-based exact solver, even with several orders\nof magnitude more computational resources (200GB memory, 24 hour time limit).\nThus, DIS provides excellent anytime behavior; in particular, (Nl=1, Nd=10) seems to work well,\nperhaps because expanding the search tree is slightly faster than drawing a sample (since the tree\ndepth is less than the number of variables). On the other hand, two-stage sampling gives weaker early\nbounds, but is often excellent at longer time settings.\n\nFigure 3: Anytime bounds on logZ for two instances per benchmark: (a) pedigree/pedigree33, (b) protein/1co6,\n(c) BN/BN_30, (d) pedigree/pedigree37, (e) protein/1bgc, (f) BN/BN_129. Dotted line sections on some\ncurves indicate Markov lower bounds. In examples where search is very effective (d,f), or where\nsampling is very effective (a), DIS is equal or nearly so, while in (b,c,e) DIS is better than either.\n\nAggregated results across the benchmarks. To quantify anytime performance of the methods in\neach benchmark, we introduce a measure based on the area between the upper and lower bound of\nlogZ. For each instance and method, we compute the area of the interval between the upper and\nlower bound of logZ for that instance and method. To avoid vacuous lower bounds, we provide each\nalgorithm with an initial lower bound on logZ from WMB. To facilitate comparison, we normalize\nthe area of each method by that of WMB-IS on each instance, then report the geometric mean of the\nnormalized areas across each benchmark in Table 1. This shows the average relative quality compared\nto WMB-IS; smaller values indicate tighter anytime bounds. We see that on average, search is more\neffective than sampling on the BN instances, but much less effective on pedigree. Across all three\nbenchmarks, DIS (Nl=1, Nd=10) produces the best result by a significant margin, while DIS (Nl=1,\nNd=1) is also very competitive, and two-stage sampling does somewhat less well.\n\nTable 1: Mean area between upper and lower bounds of logZ, normalized by WMB-IS, for each\nbenchmark. Smaller numbers indicate better anytime bounds. 
The best for each benchmark is marked with an asterisk.\n\nBenchmark | AOBFS | WMB-IS | DIS (Nl=1, Nd=1) | DIS (Nl=1, Nd=10) | two-stage\npedigree | 16.638 | 1 | 0.711 | 0.585* | 1.321\nprotein | 1.576 | 1 | 0.110 | 0.095* | 2.511\nBN | 0.233 | 1 | 0.340 | 0.162* | 0.865\n\nFootnote 1: http://graphmod.ics.uci.edu/uai08/Evaluation/Report/Benchmarks/\nFootnote 2: http://melodi.ee.washington.edu/~bilmes/uai06InferenceEvaluation/\n\n6 Conclusion\n\nWe propose a dynamic importance sampling algorithm that embraces the merits of best-first search\nand importance sampling to provide anytime finite-sample bounds for the partition function. The\nAOBFS search process improves the proposal distribution over time, while our particular weighted\naverage of importance weights gives the resulting estimator quickly decaying finite-sample bounds,\nas illustrated on several UAI problem benchmarks. 
Our work also opens up several avenues for future\nresearch, including investigating different weighting schemes for the samples, more flexible balances\nbetween search and sampling (for example, changing over time), and more closely integrating the\nvariational optimization process into the anytime behavior.\n\n[Figure 3 plots, six panels: bounds on logZ versus time (sec) for AOBFS, WMB-IS, DIS (Nl=1, Nd=1), DIS (Nl=1, Nd=10), and two-stage. The panel titles give logZ as \u2212124.979 for the first panel and \u2212268.435 for the fourth; logZ is marked unknown for the other four.]\n\nAcknowledgements\n\nWe thank William Lam, Wei Ping, and all the reviewers for their helpful feedback.\nThis work is sponsored in part by NSF grants IIS-1526842, IIS-1254071, and by the United States\nAir Force under Contract No. FA8750-14-C-0011 and FA9453-16-C-0508.\n\nReferences\n[1] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1\u201343, 2012.\n\n[2] S. Chakraborty, K. S. Meel, and M. Y. Vardi. Algorithmic improvements in approximate counting for probabilistic inference: From linear to logarithmic SAT calls. IJCAI\u201916.\n\n[3] S. Chakraborty, D. J. Fremont, K. S. Meel, S. A. Seshia, and M. Y. Vardi. Distribution-aware sampling and weighted model counting for SAT. 
AAAI\u201914, pages 1722\u20131730. AAAI Press, 2014.\n\n[4] P. Dagum and M. Luby. An optimal approximation algorithm for Bayesian inference. Artificial Intelligence, 93(1-2):1\u201327, 1997.\n\n[5] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.\n\n[6] R. Dechter. Reasoning with probabilistic and deterministic graphical models: Exact algorithms. Synthesis Lectures on Artificial Intelligence and Machine Learning, 7(3):1\u2013191, 2013.\n\n[7] R. Dechter and I. Rish. Mini-buckets: A general scheme of approximating inference. Journal of ACM, 50(2):107\u2013153, 2003.\n\n[8] R. Dechter, H. Geffner, and J. Y. Halpern. Heuristics, Probability and Causality. A Tribute to Judea Pearl. College Publications, 2010.\n\n[9] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In International Conference on Machine Learning, pages 334\u2013342, 2013.\n\n[10] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman. Low-density parity constraints for hashing-based discrete integration. In International Conference on Machine Learning, pages 271\u2013279, 2014.\n\n[11] V. Gogate and R. Dechter. Sampling-based lower bounds for counting queries. Intelligenza Artificiale, 5(2):171\u2013188, 2011.\n\n[12] M. Henrion. Search-based methods to bound diagnostic probabilities in very large belief nets. In Proceedings of the 7th conference on Uncertainty in Artificial Intelligence, pages 142\u2013150, 1991.\n\n[13] Q. Liu. Reasoning and Decisions in Probabilistic Graphical Models\u2013A Unified Framework. PhD thesis, University of California, Irvine, 2014.\n\n[14] Q. Liu and A. Ihler. Bounding the partition function using H\u00f6lder\u2019s inequality. In Proceedings of the 28th International Conference on Machine Learning (ICML), New York, NY, USA, 2011.\n\n[15] Q. Liu, J. W. Fisher, III, and A. T. Ihler. 
Probabilistic variational bounds for graphical models. In Advances in Neural Information Processing Systems, pages 1432\u20131440, 2015.\n\n[16] Q. Lou, R. Dechter, and A. Ihler. Anytime anyspace AND/OR search for bounding the partition function. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.\n\n[17] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In COLT, 2009.\n\n[18] M.-S. Oh and J. O. Berger. Adaptive importance sampling in Monte Carlo integration. Journal of Statistical Computation and Simulation, 41(3-4):143\u2013168, 1992.\n\n[19] L. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189\u2013201, 1979.\n\n[20] C. Viricel, D. Simoncini, S. Barbe, and T. Schiex. Guaranteed weighted counting for affinity computation: Beyond determinism and structure. In International Conference on Principles and Practice of Constraint Programming, pages 733\u2013750. Springer, 2016.\n\n[21] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends\u00ae in Machine Learning, 1(1-2):1\u2013305, 2008.\n\n[22] C. Yanover and Y. Weiss. Approximate inference and protein-folding. In Advances in Neural Information Processing Systems, pages 1457\u20131464, 2002.\n", "award": [], "sourceid": 1807, "authors": [{"given_name": "Qi", "family_name": "Lou", "institution": "UCI"}, {"given_name": "Rina", "family_name": "Dechter", "institution": "UCI"}, {"given_name": "Alexander", "family_name": "Ihler", "institution": "UC Irvine"}]}