{"title": "Accelerated Adaptive Markov Chain for Partition Function Computation", "book": "Advances in Neural Information Processing Systems", "page_first": 2744, "page_last": 2752, "abstract": "We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of ``null moves'' in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories.", "full_text": "Accelerated Adaptive Markov Chain\nfor Partition Function Computation\u2217\n\nStefano Ermon, Carla P. Gomes\n\nDept. of Computer Science\n\nCornell University\n\nIthaca NY 14853, U.S.A.\n\nAshish Sabharwal\n\nBart Selman\n\nIBM Watson Research Ctr.\n\nDept. of Computer Science\n\nYorktown Heights\nNY 10598, U.S.A.\n\nCornell University\n\nIthaca NY 14853, U.S.A.\n\nAbstract\n\nWe propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute\nthe partition function. In particular, we show how to accelerate a \ufb02at histogram\nsampling technique by signi\ufb01cantly reducing the number of \u201cnull moves\u201d in the\nchain, while maintaining asymptotic convergence properties. Our experiments\nshow that our method converges quickly to highly accurate solutions on a range of\nbenchmark instances, outperforming other state-of-the-art methods such as IJGP,\nTRW, and Gibbs sampling both in run-time and accuracy. 
We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories.\n\n1 Introduction\n\nWe propose a novel and general method to approximate the partition function of intricate probability distributions defined over combinatorial spaces. Computing the partition function is a notoriously hard computational problem. Only a few tractable cases are known. In particular, if the corresponding graphical model has low treewidth, then the problem can be solved exactly using methods based on tree decompositions, such as the junction tree algorithm [1]. The partition function for planar graphs with binary variables and no external field can also be computed in polynomial time [2].\nWe will consider an adaptive MCMC sampling strategy, inspired by the Wang-Landau method [3], which is a so-called flat histogram sampling strategy from statistical physics. Given a combinatorial space and an energy function (for instance, describing the negative log-likelihood of each configuration), a flat histogram method is a sampling strategy based on a Markov Chain that converges to a steady state where it spends approximately the same amount of time in states with a low density of configurations (which are usually low energy states) as in states with a high density.\nWe propose two key improvements to the Wang-Landau method, namely energy saturation and a focused-random walk component, leading to a new and more efficient algorithm called FocusedFlatSAT. Energy saturation allows the chain to visit fewer energy levels, and the random walk style moves reduce the number of \u201cnull moves\u201d in the Markov chain. 
Both improvements maintain the same global stationary distribution, while allowing us to go well beyond the domain of spin glasses where the Wang-Landau method has been traditionally applied.\nWe demonstrate the effectiveness of our approach by a comparison with state-of-the-art methods to approximate the partition function or bound it, such as Tree Reweighted Belief Propagation [4], IJGP-SampleSearch [5], and Gibbs sampling [6]. Our experiments show that our approach outperforms these approaches in a variety of problem domains, both in terms of accuracy and run-time.\nThe density of states serves as a rich description of the underlying probabilistic model. Once computed, it can be used to efficiently evaluate the partition function for all parameter settings without the need for further inference steps \u2014 a stark contrast with competing methods for partition function computation. For instance, in statistical physics applications, we can use it to evaluate the partition function Z(T) for all values of the temperature T. This level of abstraction can be a fundamental advantage for machine learning methods: in fact, in a learning problem we can parameterize Z(\u00b7) according to the model parameters that we want to learn from the training data. For example, in the case of a Markov Logic theory [7, 8] with weights w1, . . . , wK of its K first order formulas, we can parameterize the partition function as Z(w1, . . . , wK).\n\n\u2217Supported by NSF Expeditions in Computing award for Computational Sustainability (grant 0832782).\n\n
Upon defining an appropriate energy function and obtaining the corresponding density of states, we can then use efficient evaluations of the partition function to search for model parameters that best fit the training data, thus obtaining a promising new approach to learning in Markov Logic Networks and graphical models.\n\n2 Probabilistic model and the partition function\n\nWe focus on intricate probability distributions defined over a set of configurations, i.e., assignments to a set of N discrete variables {x1, . . . , xN}, assumed here to be Boolean for simplicity. The probability distribution is specified through a set of combinatorial features or constraints over these variables. Such constraints can be either hard or soft, with the i-th soft constraint Ci being associated with a weight wi. Let \u03c7i(x) = 1 if a configuration x violates Ci, and 0 otherwise. The probability Pw(x) of x is defined as 0 if x violates any hard constraint, and as\n\nPw(x) = (1/Z(w)) exp(\u2212 \u03a3_{Ci \u2208 Csoft} wi \u03c7i(x))    (1)\n\notherwise, where Csoft is the set of soft constraints. The partition function, Z(w), is simply the normalization constant for this probability distribution, and is given by:\n\nZ(w) = \u03a3_{x \u2208 Xhard} exp(\u2212 \u03a3_{Ci \u2208 Csoft} wi \u03c7i(x))    (2)\n\nwhere Xhard \u2286 {0, 1}N is the set of configurations satisfying all hard constraints. Note that as wi \u2192 \u221e, the soft constraint Ci effectively becomes a hard constraint. This factored representation is closely related to a graphical model where we use weighted Boolean formulas to specify clique potentials. 
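As a concrete illustration of Equations (1)-(2), a brute-force evaluation of Z(w) on a tiny instance can be sketched as follows; this is our own sketch, and the constraint representation and helper names are illustrative assumptions, not part of the paper:

```python
import itertools
import math

def partition_function(n_vars, hard, soft):
    """Brute-force Z(w) from Eq. (2), feasible only for tiny N.

    `hard` is a list of functions chi(x) -> 1 if assignment x violates
    that hard constraint, else 0; `soft` is a list of (w_i, chi_i)
    pairs for the soft constraints.  (Illustrative interface only.)
    """
    Z = 0.0
    for x in itertools.product([0, 1], repeat=n_vars):
        if any(chi(x) for chi in hard):
            continue  # violates a hard constraint: excluded from X_hard
        energy = sum(w_i * chi_i(x) for w_i, chi_i in soft)
        Z += math.exp(-energy)
    return Z

# Two variables, one soft constraint "x1 = x2" with weight w = 1:
# two satisfying and two violating assignments, so Z = 2 + 2*exp(-1).
soft = [(1.0, lambda x: 1 if x[0] != x[1] else 0)]
Z = partition_function(2, [], soft)
```

Sending a weight to infinity makes the corresponding exp term vanish on violating assignments, recovering the hard-constraint behavior described above.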
This is a natural framework for combining purely logical and probabilistic inference,\nused for example to de\ufb01ne grounded Markov Logic Networks [8, 9].\nThe partition function is a very important quantity but computing it is a well-known computational\nchallenge, which we propose to address by employing the \u201cdensity of states\u201d method to be discussed\nshortly. We will compare our approach against several state-of-the-art methods available for com-\nputing the partition function or obtaining bounds on it. Wainwright et al. [4], for example, proposed\na variational method known as tree re-weighting (TRW) to obtain bounds on the partition function\nof graphical models. Unlike standard Belief Propagation schemes which are based on Bethe free en-\nergies [10], the TRW approach uses a tree-reweighted (TRW) free energy which consists of a linear\ncombination of free energies de\ufb01ned on spanning trees of the model. Using convexity arguments it\nis then possible to obtain upper bounds on various quantities, such as the partition function.\nBased on iterated join-graph propagation, IJGP-SampleSearch [5] is a popular solver for the proba-\nbility of evidence problem (i.e., partition function computation with a subset of \u201cevidence\u201d variables\n\ufb01xed) for general graphical models. This method is based on an importance sampling scheme which\nis augmented with systematic constraint-based backtracking search. An alternative approach is to\nuse Gibbs sampling to estimate the partition function by estimating, using sample average, a se-\nquence of multipliers that correspond to the ratios of the partition function evaluated at different\nweight levels [6]. Lastly, the partition function for planar graphs where all variables are binary and\nhave only pairwise interactions (i.e., the zero external \ufb01eld case) can be calculated exactly in poly-\nnomial time [2]. 
Although we are interested in algorithms for the general (intractable) case, we used the software associated with this approach to obtain the ground truth for planar graphs and evaluate the accuracy of the estimates obtained by other methods.\n\n3 Density of states\n\nOur approach for computing the partition function is based on solving the density of states problem. Given a combinatorial space such as the one defined earlier and an energy function E : {0, 1}N \u2192 R, the density of states (DOS) n is a function n : range(E) \u2192 N that maps energy levels to the number of configurations with that energy, i.e., n(k) = |{\u03c3 \u2208 {0, 1}N | E(\u03c3) = k}|. In our context, we are interested in computing the number of configurations that satisfy certain properties that are specified using an appropriate energy function. For instance, we might define the energy E(\u03c3) of a configuration \u03c3 to be the number of hard constraints that are violated by \u03c3. Or we may use the sum of the weights of the violated soft constraints.\nOnce we are able to compute the full density of states, i.e., the number of configurations at each possible energy level, it is straightforward to evaluate the partition function Z(w) for any weight vector w, by summing up terms of the form n(i) exp(\u2212E(i)), where E(i) denotes the energy of every configuration in state i. This is the method we use in this work for estimating the partition function. More complex energy functions may be defined for other related tasks, such as weight learning, i.e., given some training data x \u2208 X = {0, 1}N, computing arg max_w Pw(x) where Pw(x) is given by Equation (1). Here we can define the energy E(\u03c3) to be w \u00b7 \u2113, where \u2113 = (\u2113_1, . . . , \u2113_M) gives the number of constraints of weight wi violated by \u03c3. 
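As an illustration of this evaluation step (our own sketch, not code from the paper), once a density of states is available, Z(w) for the energy "number of violated constraints, each of weight w" reduces to a single weighted sum over energy levels:

```python
import math

def Z_from_dos(dos, w):
    """Evaluate Z(w) = sum_E n(E) * exp(-w * E) from a density of
    states `dos`, a dict mapping an energy level E (here, the number
    of violated constraints) to the count n(E)."""
    return sum(n * math.exp(-w * E) for E, n in dos.items())

# Toy DOS: 3 independent soft constraints "x_i = 1" over 3 Boolean
# variables, so n(k) = C(3, k) configurations violate exactly k of them.
dos = {0: 1, 1: 3, 2: 3, 3: 1}
# The same DOS serves every weight; here Z(w) = (1 + exp(-w))^3.
```

This is the abstraction exploited throughout the paper: the DOS is computed once, after which Z(w) costs one pass over the energy buckets for any w.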
Our focus in the rest of the paper will thus be on computing the density of states efficiently.\n\n3.1 The MCMCFlatSAT algorithm\n\nMCMCFlatSAT [11] is an Adaptive Markov Chain Monte Carlo (adaptive MCMC) method for computing the density of states for combinatorial problems, inspired by the Wang-Landau algorithm [3] from statistical physics. Interestingly, this algorithm does not make any assumption about the form or semantics of the energy. At least in principle, the only thing it needs is a partitioning of the state space, where the \u201cenergy\u201d just provides an index over the subsets that compose the partition.\nThe algorithm is based on the flat histogram idea and works by trying to construct a reversible Markov Chain on the space {0, 1}N of all configurations such that the steady state probability of a configuration \u03c3 is inversely proportional to the density of states n(E(\u03c3)). In this way, the stationary distribution is such that all the energy levels are visited equally often (i.e., when we count the visits to each energy level, we see a flat visit histogram). Specifically, we define a Markov Chain with the following transition probability:\n\np_{\u03c3\u2192\u03c3'} = (1/N) min{1, n(E(\u03c3))/n(E(\u03c3'))} if dH(\u03c3, \u03c3') = 1, and p_{\u03c3\u2192\u03c3'} = 0 if dH(\u03c3, \u03c3') > 1    (3)\n\nwhere dH(\u03c3, \u03c3') is the Hamming distance between \u03c3 and \u03c3'. The probability of a self-loop p_{\u03c3\u2192\u03c3} is given by the normalization constraint p_{\u03c3\u2192\u03c3} + \u03a3_{\u03c3'|dH(\u03c3,\u03c3')=1} p_{\u03c3\u2192\u03c3'} = 1. The detailed balance equation P(\u03c3) p_{\u03c3\u2192\u03c3'} = P(\u03c3') p_{\u03c3'\u2192\u03c3} is satisfied by P(\u03c3) \u221d 1/n(E(\u03c3)). This means1 that the Markov Chain will reach a stationary probability distribution P (regardless of the initial state) such that the probability of a configuration \u03c3 with energy E = E(\u03c3) is inversely proportional to the number of configurations with energy E. This leads to an asymptotically flat histogram of the energies of the states visited, because P(E) = \u03a3_{\u03c3:E(\u03c3)=E} P(\u03c3) \u221d n(E) \u00b7 1/n(E) = 1 (i.e., independent of E).\nSince the density of states is not known a priori, and computing it is precisely the goal of the algorithm, it is not possible to construct directly a random walk with transition probability (3). However it is possible to start with an initial guess g(\u00b7) for n(\u00b7) and keep updating this estimate g(\u00b7) in a systematic way to produce a flat energy histogram and simultaneously make the estimate g(E) converge to the true value n(E) for every energy level E. The estimate is adjusted using a modification factor F which controls the trade-off between the convergence rate of the algorithm and its accuracy (large initial values of F lead to fast convergence to a rather inaccurate solution). For completeness, we provide the pseudo-code as Algorithm 1; see [11] for details.\n\n1The chain is finite, irreducible, and aperiodic, therefore ergodic.\n\nAlgorithm 1 MCMCFlatSAT algorithm to compute the density of states\n1: Start with a guess g(E) = 1 for all E = 1, . . . , m\n2: Initialize H(E) = 0 for all E = 1, . . . , m\n3: Start with a modification factor F = F0 = 1.5\n4: repeat\n5:   Randomly pick a configuration \u03c3\n6:   repeat\n7:     Generate a new configuration \u03c3' (by flipping a variable)\n8:     Let E = E(\u03c3) and E' = E(\u03c3') (saturated energies)\n9:     Set \u03c3 = \u03c3' with probability min{1, g(E)/g(E')} (move acceptance/rejection step)\n10:    Let Ec = E(\u03c3) be the current energy level\n11:    Adjust the density g(Ec) = g(Ec) \u00d7 F\n12:    Update visit histogram H(Ec) = H(Ec) + 1\n13:   until H is flat (all the values are at least 90% of the maximum value)\n14:   Reduce F, F \u2190 \u221aF\n15:   Reset the visit histogram H\n16: until F is close enough to 1\n17: Normalize g so that \u03a3_E g(E) = 2^N\n18: return g as estimate of n\n\n4 FocusedFlatSAT: Efficient computation of density of states\n\nWe propose two crucial improvements to MCMCFlatSAT, namely energy saturation and the introduction of a focused-random walk component, leading to a new algorithm called FocusedFlatSAT. As we will see in Table 1, FocusedFlatSAT provides the same accuracy as MCMCFlatSAT but is about 10 times faster on that benchmark. Moreover, our results for the Ising model (described below) in Figure 2 demonstrate that FocusedFlatSAT scales much better.\n\nEnergy saturation. The time needed for each iteration of MCMCFlatSAT to converge is significantly affected by the number of different non-empty energy levels (buckets). In many cases, the weights defining the probability distribution Pw(x) are all positive (i.e., there is an incentive to satisfy the constraints), and as an effect of the exponential discounting in Equation (1), configurations that violate a large number of constraints have a negligible contribution to the sum defining the partition function Z. We therefore define a new saturated energy function E'(\u03c3) = min{E(\u03c3), K}, where K is a user-defined parameter. 
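For concreteness, the flat-histogram loop of Algorithm 1, combined with the saturated energy E'(σ) = min{E(σ), K}, can be sketched as follows. This is our own simplified sketch, not the authors' implementation: it works in log space to avoid overflow and runs a fixed number of sweeps per iteration instead of the 90% flatness test:

```python
import math
import random

def flat_histogram_dos(n_vars, energy, K, F0=1.5, n_iters=10, steps=20000):
    """Simplified sketch of Algorithm 1 with energy saturation.

    `energy` maps a 0/1 list to an integer energy; levels are saturated
    at K, so only buckets 0..K are tracked.  Returns log-estimates of
    n(E), normalized so that the counts sum to 2^n_vars.
    """
    log_g = [0.0] * (K + 1)                 # log g(E); initial guess g = 1
    log_F = math.log(F0)                    # log of modification factor F
    sigma = [random.randint(0, 1) for _ in range(n_vars)]
    E = min(energy(sigma), K)               # saturated energy of sigma
    for _ in range(n_iters):
        for _ in range(steps):
            i = random.randrange(n_vars)    # propose a single variable flip
            sigma[i] ^= 1
            E_new = min(energy(sigma), K)
            # accept with probability min{1, g(E)/g(E')}
            if math.log(random.random()) < log_g[E] - log_g[E_new]:
                E = E_new
            else:
                sigma[i] ^= 1               # reject: undo the flip
            log_g[E] += log_F               # g(E) <- g(E) * F
        log_F /= 2.0                        # F <- sqrt(F)
    # normalize so that sum_E g(E) = 2^n_vars
    m = max(log_g)
    log_total = m + math.log(sum(math.exp(v - m) for v in log_g))
    return [v + n_vars * math.log(2) - log_total for v in log_g]
```

On a toy energy such as the number of ones in σ, the returned estimates approach log C(n, k); accuracy is governed by the F schedule, exactly as the modification-factor trade-off discussed above suggests.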
For the positive weights case, the partition function Z' associated with the saturated energy function is a guaranteed upper bound on the original Z, for any K. When all constraints are hard, Z' = Z for any value K \u2265 1 because only the first energy bucket matters. In general, when soft constraints are present, the bound gets tighter as K increases, and we can obtain theoretical worst-case error bounds when K is chosen to be a percentile of the energy distribution (e.g., saturation at median energy yields a 2x bound). In our experiments, we set K to be the average number of constraints violated by a random configuration, and we found that the error introduced by the saturation is negligible compared to other inherent approximations in density of states estimation. Intuitively, this is because the states where the probability is concentrated turn out to typically have a much lower energy than K, and thus an exponentially larger contribution to Z. Furthermore, energy saturation preserves the connectivity of the chain.\n\nFocused Random Walk. Both in the original Wang-Landau method and in MCMCFlatSAT, new configurations are generated by flipping a variable selected uniformly at random [3, 11]. Let us call this configuration selection distribution the proposal distribution, and let T_{\u03c3\u2192\u03c3'} denote the probability of generating a \u03c3' from this distribution while in configuration \u03c3. In the Wang-Landau algorithm, proposed configurations are then rejected with a probability that depends on the density of states of the respective energy levels. Move rejections obviously lengthen the mixing time of the underlying Markov Chain. We introduce here a novel proposal distribution that significantly reduces the number of move rejections, resulting in much faster convergence rates. 
It is inspired by local search SAT solvers [12] and is especially critical for the class of highly combinatorial energy functions we consider in this work. We note that if the acceptance probability is taken to be\n\nmin{1, [n(E(\u03c3)) T_{\u03c3'\u2192\u03c3}] / [n(E(\u03c3')) T_{\u03c3\u2192\u03c3'}]}\n\n
As a result, most of the proposed moves are to higher energy levels\nand are in turn rejected by the algorithm in the move acceptance/rejection step discussed above.\nIn order to address this issue, we propose to modify the proposal distribution in a way that increases\nthe chance of proposing moves to the same or lower energy levels, despite the fact that there are\nrelatively few such moves. Inspired by local search SAT solvers, we enhance MCMCFlatSAT with\na focused random walk component that gives preference to selecting variables to \ufb02ip from violated\nconstraints (if any), thereby introducing an indirect bias towards lower energy states. Speci\ufb01cally,\nif the given con\ufb01guration \u03c3 is a satisfying assignment, pick a variable uniformly at random to be\n\ufb02ipped (thus T\u03c3\u2192\u03c30 = 1/N when the Hamming distance dH(\u03c3, \u03c30) = 1, zero otherwise). If \u03c3 is\nnot a solution, then with probability p a variable to be \ufb02ipped is chosen uniformly at random from\na randomly chosen violated constraint, and with probability 1 \u2212 p a variable is chosen uniformly at\nrandom. With this approach, when \u03c3 is not solution and \u03c3 and \u03c30 differ only on the i-th variable,\n\nwhere \u03c7c(\u03c3) = 1 iff \u03c3 violates constraint c and |c| denotes the number of variables in constraint c.\nWith this proposal distribution we ensure that for all 1 > p \u2265 0 whenever T\u03c3\u2192\u03c30 > 0, we also have\nT\u03c30\u2192\u03c3 > 0. Moreover, the connectivity of the Markov Chain is preserved (since we don\u2019t remove\nany edge from the original Markov Chain). We therefore have the following result:\nProposition 1 For all p \u2208 [0, 1), the Markov Chain with proposal distribution T\u03c3\u2192\u03c30 de\ufb01ned above\nis irreducible and aperiodic. 
Therefore it has a unique stationary distribution, given by 1/n(E(\u03c3)).\n\nThe right panel of Figure 1 shows the move acceptance/rejection histogram when FocusedFlatSAT\nis used, i.e., with the above proposal distribution. The same instance now needs under 1.2M visits\nper energy level for the method to converge. Moreover, the number of rejected moves (shown in\npurple and green) in low energy states is signi\ufb01cantly fewer than the dominating purple region in the\nleft panel. This allows the Markov Chain to move more freely in the space and to converge faster.\nFigure 2 shows a runtime comparison of FocusedFlatSAT against MCMCFlatSAT on n \u00d7 n Ising\nmodels (details to be discussed in Section 5). As we see, incorporating energy saturation reduces the\ntime to convergence (while achieving the same level of accuracy), and using focused random walk\nmoves further decreases the convergence time, especially as n increases.\n\n5\n\nT\u03c3\u2192\u03c30 = (1 \u2212 p)\n\n1\nN\n\n+ p\n\nP\nP\nc\u2208C|i\u2208c \u03c7c(\u03c3) \u00b7 1/|c|\n\nc\u2208C \u03c7c(\u03c3)\n\n050000010000001500000200000025000003000000123456789111133155177199221243265287309331353375Number of movesEnergy levelMCMCFlatSATAcc. upAcc. sameAcc. downRej. upRej. sameRej. down0200000400000600000800000100000012000001400000123456789111133155177199221243265287309331353375Number of movesEnergy levelFocusedFlatSATAcc. upAcc. sameAcc. downRej. upRej. sameRej. down\fFigure 2: Runtime comparison on ferromagnetic Ising models on square lattices of size n \u00d7 n.\n\nTable 1: Comparison with model counters; only hard constraints. 
Runtime is in seconds.\n\nInstance | n | m | Exact # | FocusedFlatSat Models (Time) | MCMC-FlatSat Models (Time) | SampleCount Models (Time) | SampleMiniSAT Models (Time)\n2bitmax_6 | 252 | 766 | 2.10 \u00d7 10^29 | 1.96 \u00d7 10^29 (156) | 1.91 \u00d7 10^29 (1863) | \u2265 2.40 \u00d7 10^28 (29) | 2.08 \u00d7 10^29 (345)\nwff-3-3.5 | 150 | 525 | 1.40 \u00d7 10^14 | 1.34 \u00d7 10^14 (20) | 1.43 \u00d7 10^14 (393) | \u2265 1.60 \u00d7 10^13 (145) | 1.60 \u00d7 10^13 (240)\nwff-3.1.5 | 100 | 150 | 1.80 \u00d7 10^21 | 1.83 \u00d7 10^21 (1) | 1.86 \u00d7 10^21 (21) | \u2265 1.00 \u00d7 10^20 (240) | 1.58 \u00d7 10^21 (128)\nwff-4-5.0 | 100 | 500 | \u2014 | 8.64 \u00d7 10^16 (5) | 9.31 \u00d7 10^16 (189) | \u2265 8.00 \u00d7 10^15 (120) | 1.09 \u00d7 10^17 (191)\nls8-norm | 301 | 1603 | 5.40 \u00d7 10^11 | 5.93 \u00d7 10^11 (231) | 5.78 \u00d7 10^11 (2693) | \u2265 3.10 \u00d7 10^10 (1140) | 2.22 \u00d7 10^11 (168)\n\n5 Experimental evaluation\n\nWe compare FocusedFlatSAT against several state-of-the-art methods for computing an estimate of or bound on the partition function.2 An evaluation such as this is inherently challenging as the ground truth is very hard to obtain and computational bounds can be orders of magnitude off from the truth, making a comparison of estimates not very meaningful. We therefore propose to evaluate the methods on either small instances whose ground truth can be evaluated by \u201cbrute force,\u201d or larger instances whose ground truth (or bounds on it) can be computed analytically or through other tools such as efficient model counters. We also consider planar cases for which a specialized polynomial time exact algorithm is available. 
Efficient methods for handling instances of small treewidth are also well known; here we push the boundaries to instances of relatively higher treewidth.\nFor partition function evaluation, we compare against the tree re-weighting (TRW) variational method for upper bounds, the iterated join-graph propagation (IJGP), and Gibbs sampling; see Section 2 for a very brief discussion of these approaches. For weight learning, we compare against the Alchemy system. Unless otherwise specified, the energy function used is always the number of violated constraints, and we use a 50% ratio of random moves (p = 0.5). The algorithm is run for 20 iterations, with an initial modification factor F0 = 1.5. The experiments were conducted on a 16-core 2.4 GHz Intel Xeon machine with 32 GB memory, running RedHat Linux.\n\nHard constraints. First, consider models with only hard constraints, which define a uniform measure on the set of satisfying assignments. In this case, the problem of computing the partition function is equivalent to standard model counting. We compare the performance of FocusedFlatSAT with MCMC-FlatSat and with two state-of-the-art approximate model counters: SampleCount [13] and SampleMiniSATExact [14]. The instances used are taken from earlier work [11]. The results in Table 1 show that FocusedFlatSAT almost always obtains much more accurate solution counts, and is often significantly faster (about an order of magnitude faster than MCMC-FlatSat).\n\nSoft Constraints. We consider Ising Models defined on an n \u00d7 n square lattice where P(\u03c3) = exp(\u2212E(\u03c3)) / \u03a3_{\u03c3'} exp(\u2212E(\u03c3')), with E(\u03c3) = \u03a3_{(i,j)} wij I[\u03c3i \u2260 \u03c3j]. Here I is the indicator function. This imposes a penalty wij if spins \u03c3i and \u03c3j are not aligned. 
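As a sanity check of this definition (our own illustration, not the specialized planar solver of [2]), Z for a tiny lattice can be computed by direct enumeration:

```python
import itertools
import math

def ising_Z(n, w):
    """Brute-force Z = sum_sigma exp(-E(sigma)) for the n x n lattice
    model above, with E(sigma) = sum over lattice edges (i,j) of
    w * I[sigma_i != sigma_j]; feasible only for tiny n."""
    edges = []
    for r in range(n):
        for c in range(n):
            if r + 1 < n:
                edges.append((r * n + c, (r + 1) * n + c))  # vertical edge
            if c + 1 < n:
                edges.append((r * n + c, r * n + c + 1))    # horizontal edge
    Z = 0.0
    for s in itertools.product([0, 1], repeat=n * n):
        E = sum(w for (i, j) in edges if s[i] != s[j])
        Z += math.exp(-E)
    return Z
```

At w = 0 every configuration contributes 1, so Z = 2^(n^2); as w grows, Z approaches 2, the two fully aligned ground states, matching the high-weight behavior discussed below for the ferromagnetic case.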
We consider a ferromagnetic case where wij = w > 0 for all edges, and a frustrated case with a mixture of positive and negative interactions. The partition function for these planar models is computable with a specialized polynomial time algorithm, as long as there is no external magnetic field [2]. In Figure 3, we compare the true value of the partition function Z\u2217 with the estimate obtained using FocusedFlatSAT and with the upper bound given by TRW (which is generally much faster but inaccurate), for a range of w values. What is plotted is the accuracy, log Z \u2212 log Z\u2217. We see that the estimate provided by FocusedFlatSAT is very accurate throughout the range of w values. For the ferromagnetic model, the bounds obtained by TRW, on the other hand, are tight only when the weights are sufficiently high, when essentially only the two ground states of energy zero matter. On spin glasses, where computing ground states is itself an intractable problem, TRW is unsurprisingly inaccurate even in the high weights regime. The consistent accuracy of FocusedFlatSAT here is a strong indication that the method is accurately computing the density of most of the underlying states. This is because, as the weight w changes, the value of the partition function is dominated by the contributions of a different set of states.\nTable 2 (top) shows a comparison with IJGP-SampleSearch and Gibbs Sampling for the ferromagnetic case with w = 1. Here FocusedFlatSAT provides the most accurate estimates, even when other methods are given a longer running time. E.g., IJGP is two orders of magnitude off for the 32 \u00d7 32 grid.3 Results with other weights are similar but omitted due to limited space. FocusedFlatSAT also significantly outperforms IJGP and Gibbs sampling in accuracy on the circuit synthesis instance 2bitmax6. All methods perform well on randomly generated 3-SAT instances, but FocusedFlatSAT is much faster.\n\n2Benchmark instances available online at http://www.cs.cornell.edu/~ermonste\n\nFigure 3: Error in log10(Z). Left: 40 \u00d7 40 ferromagnetic grid. Right: 32 \u00d7 32 spin glass grid.\n\nTable 2: Log partition function for weighted formulas.\n\nInstance | n | m | Weight | log10 Z(w) (exact or lower bound) | FocusedFlatSat log10 Z(w) (Time) | IJGP-SampleSearch log10 Z(w) (Time) | Gibbs log10 Z(w) (Time)\ngrid32x32 | 1024 | 3968 | 1 | 16.0920 | 16.0964 (628) | 14.4330 (600) | 15.4856 (651)\ngrid32x32 | 1024 | 3968 | 1 | 16.0920 | 16.0964 (628) | 13.8980 (2000) | \u2014\ngrid40x40 | 1600 | 6240 | 1 | 23.5434 | 23.4844 (1522) | 15.9386 (2000) | 22.3125 (1650)\n2bitmax6 | 252 | 766 | 5 | > 29.3222 | 30.4373 (360) | 12.0526 (600) | 25.1274 (732)\n2bitmax6 | 252 | 766 | 5 | > 29.3222 | 30.4373 (360) | 12.3802 (2000) | \u2014\nwff.100.150 | 100 | 150 | 5 | > 21.2553 | 21.3187 (5) | 21.3373 (200) | 21.3992 (40)\nwff.100.150 | 100 | 150 | 8 | > 21.2553 | 21.2551 (5) | 21.2694 (200) | 21.3107 (40)\nls8-normalized | 301 | 1603 | 3 | > 11.7324 | 17.6655 (589) | 16.5458 (600) | 8.6825 (708)\nls8-normalized | 301 | 1603 | 6 | > 11.7324 | 11.7974 (589) | -2.3987 (600) | -17.356 (770)\nls8-normalized | 301 | 1603 | 6 | > 11.7324 | 11.7974 (589) | -1.7459 (1200) | \u2014\nls8-normalized | 301 | 1603 | 6 | > 11.7324 | 11.7974 (589) | -1.8578 (2000) | \u2014\nls8-simplified-2 | 172 | 673 | 6 | > 4.3083 | 4.3379 (100) | -1.8305 (1200) | 2.8516 (300)\nls8-simplified-4 | 119 | 410 | 6 | > 2.2479 | 2.3399 (63) | 2.7037 (1200) | -6.7132 (174)\nls8-simplified-5 | 83 | 231 | 6 | > 1.3424 | 1.3880 (40) | 1.3688 (600) | 1.3420 (51)\n\nAs another test case, we use formulas from a previously used model counting benchmark involving n \u00d7 n Latin Square completion [11], and add a weight w to each constraint. 
Since these instances have high treewidth, are non-planar, and beyond the reach of direct enumeration, we don\u2019t have ground truth for this benchmark. However, we are able to provide a lower bound,4 which is given by the number of models of the original formula. Our results are reported in Table 2. Our lower bound indicates that the estimate given by FocusedFlatSAT is more accurate, even when other methods are given a longer running time. As the last 3 lines of the table show, IJGP and Gibbs sampling improve in performance as the problem is simplified more and more, by fixing the values of 2, 4, or 5 \u201ccells\u201d and simplifying the instance. Nonetheless, on the un-simplified ls8-normalized with weight 6, both IJGP and Gibbs sampling underestimate by over 12 orders of magnitude.\n\n3On smaller instances with limited treewidth, IJGP-SampleSearch quickly provides good estimates.\n4The upper bound provided by TRW is very loose on this benchmark (possibly because of the conversion to a pairwise field) and not reported.\n\nTable 3: Weight learning: likelihood of the training data x computed using learned weights.\n\nType | Training Data | Optimal Likelihood (O) | FocusedFlatSAT Accuracy (F/O) | Alchemy Accuracy (A/O)\nThreeChain(30) | x = data-30-1 | 4.09 \u00d7 10^-27 | 1.0 | 0.08\nThreeChain(30) | x = data-30-2 | 9.31 \u00d7 10^-10 | 1.0 | 0.93\nFourChain(5) | x = dataFC-5-1 | 5.77 \u00d7 10^-6 | 1.0 | 0.61\nFourChain(5) | x = dataFC-5-2 | 3.84 \u00d7 10^-3 | 1.0 | 0.000097\nHChain(10) | x = dataH-10-1 | 1.19 \u00d7 10^-9 | 1.0 | 0.87\nHChain(10) | x = dataH-10-2 | 2.62 \u00d7 10^-9 | 1.0 | 0.53\nSocialNetwork(5) | x = data-SN-1 | 2.98 \u00d7 10^-8 | 1.0 | 0.69\nSocialNetwork(5) | x = data-SN-2 | 2.44 \u00d7 10^-9 | 1.0 | 0.2\n\n
Weight learning. Suppose the set of soft constraints Csoft is composed of M disjoint sets of constraints S1, . . . , SM, where all the constraints c ∈ Si have the same weight wi that we wish to learn from data (for instance, these constraints can all be groundings of the same first-order formula in Markov Logic [8]). Let us assume for simplicity that there are no hard constraints. The probability Pw(x) can be parameterized by a weight vector w = (w1, . . . , wM). The key observation is that the partition function can be written as

Z(w) = Σ_{ℓ1, . . . , ℓM} n(ℓ1, . . . , ℓM) exp(−w · ℓ),

where n(ℓ1, . . . , ℓM) gives the number of configurations that violate ℓi constraints of type Si for i = 1, . . . , M. This function n(ℓ1, . . . , ℓM) is precisely the density of states required to compute Z(w) for all values of w, without additional inference steps.

Given training data x ∈ {0, 1}N, the problem of weight learning is that of finding arg maxw Pw(x), where Pw(x) is given by Eqn. (1). Once we compute n(ℓ1, . . . , ℓM) using FocusedFlatSAT, we can efficiently evaluate Z(w), and therefore Pw(x), as a function of the parameters w = (w1, . . . , wM). Using this efficient evaluation as a black box, we can solve the weight learning problem using a numerical optimization package, with no additional inference steps required.5

We evaluate this learning method on relatively simple instances on which commonly used software such as Alchemy can be a few orders of magnitude off from the optimal likelihood of the training data. Specifically, Table 3 compares the likelihood of the training data under the weights learned by FocusedFlatSAT and by Generative Weight Learning [7], as implemented in Alchemy, for four types of Markov Logic theories.
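The black-box evaluation described above can be sketched end-to-end on a toy model. The constraint types, problem size, and grid search below are our own illustrative choices; the paper computes n(ℓ1, . . . , ℓM) with FocusedFlatSAT rather than by enumeration, and would use a numerical optimization package rather than a grid.

```python
import itertools
import math

# Toy setting (our own choice, for illustration): N = 5 Boolean variables and
# two constraint types -- type 1 says x[i] is true, type 2 says consecutive
# variables agree.
N = 5

def counts(x):
    """(l1, l2): how many constraints of each type x violates."""
    l1 = sum(1 for i in range(N) if not x[i])
    l2 = sum(1 for i in range(N - 1) if x[i] != x[i + 1])
    return (l1, l2)

# Density of states n(l1, l2), here by brute-force enumeration.
n = {}
for x in itertools.product((False, True), repeat=N):
    key = counts(x)
    n[key] = n.get(key, 0) + 1

def log_z(w1, w2):
    """log Z(w), with Z(w) = sum over (l1, l2) of n(l1, l2)*exp(-(w1*l1 + w2*l2))."""
    return math.log(sum(c * math.exp(-(w1 * l1 + w2 * l2))
                        for (l1, l2), c in n.items()))

def log_likelihood(x, w1, w2):
    l1, l2 = counts(x)
    return -(w1 * l1 + w2 * l2) - log_z(w1, w2)

# Weight learning: every candidate w reuses the same table n(l1, l2), so the
# optimization loop involves no further inference or sampling.
data = (True, True, True, False, False)
best = max(((w1 / 4, w2 / 4) for w1 in range(13) for w2 in range(13)),
           key=lambda w: log_likelihood(data, *w))
```

Each likelihood evaluation only touches the handful of nonzero entries of n, which is what makes the repeated evaluations inside an optimizer cheap once the density of states is in hand.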
The Optimal Likelihood value is obtained using a partition function computed either by direct enumeration or using analytic results for the synthetic instances.

The instance ThreeChain(K) is a grounding of the first-order formulas ∀x P(x) ⇒ Q(x), ∀x Q(x) ⇒ R(x), ∀x R(x) ⇒ P(x), while FourChain(K) is a similar chain of 4 implications. The instance HChain(K) is a grounding of ∀x P(x) ∧ Q(x) ⇒ R(x), ∀x R(x) ⇒ P(x), where x ∈ {a1, a2, . . . , aK}. The instance SocialNetwork(K) (from the Alchemy Tutorial) is a grounding of the following first-order formulas, where x, y ∈ {a1, a2, . . . , aK}: ∀x ∀y Friend(x, y) ⇒ (Smokes(x) ⇔ Smokes(y)), ∀x Smokes(x) ⇒ Cancer(x).

Table 3 shows the accuracy of FocusedFlatSAT and Alchemy for the weight learning task, as measured by the resulting likelihood of observing the data in the learned model, which we are trying to maximize. The accuracy is measured as the ratio of the likelihood in the learned model (F and A, resp.) to the optimal likelihood (O). In these instances, FocusedFlatSAT always matches the optimal likelihood up to two digits of precision, while Alchemy can underestimate it by several orders of magnitude, e.g., by over 4 orders in the case of FourChain(5).

6 Conclusion

We introduced FocusedFlatSAT, a Markov Chain Monte Carlo technique based on the flat histogram method with a random-walk-style component to estimate the partition function from the density of states. We demonstrated the effectiveness of our approach on several types of problems. Our method outperforms the current state-of-the-art techniques on a variety of instances, at times by several orders of magnitude. Moreover, from the density of states we can directly obtain the partition function Z(w) as a function of the model parameters w.
We show an application of this property to weight learning in Markov Logic Networks.

5 Storing the full density function n(ℓ1, . . . , ℓM) of course requires space (and hence time) that is exponential in M. One must use a relatively coarse partitioning of the state space for scalability when M is large.

References

[1] Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008.

[2] N.N. Schraudolph and D. Kamenetsky. Efficient exact inference in planar Ising models. In Proc. of NIPS-08, 2008.

[3] F. Wang and D.P. Landau. Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters, 86(10):2050–2053, 2001.

[4] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.

[5] Vibhav Gogate and Rina Dechter. SampleSearch: A scheme that searches for consistent samples. Journal of Machine Learning Research, 2:147–154, 2007.

[6] Mark Jerrum and Alistair Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration, pages 482–520. PWS Publishing Co., Boston, MA, USA, 1997.

[7] P. Domingos, S. Kok, H. Poon, M. Richardson, and P. Singla. Unifying logical and statistical AI. In Proc. of AAAI-06, pages 2–7, Boston, Massachusetts, 2006. AAAI Press.

[8] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.

[9] H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In Proc. of AAAI-06, pages 458–463, 2006.

[10] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.

[11] S. Ermon, C. Gomes, and B. Selman. Computing the density of states of Boolean formulas. In Proc. of CP-2010, 2010.

[12] B. Selman, H.A. Kautz, and B. Cohen. Local search strategies for satisfiability testing. In DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1996.

[13] C.P. Gomes, J. Hoffmann, A. Sabharwal, and B. Selman. From sampling to model counting. In Proc. of IJCAI-07, 2007.

[14] V. Gogate and R. Dechter. Approximate counting by sampling the backtrack-free search space. In Proc. of AAAI-07, pages 198–203, 2007.