{"title": "Bayesian Model Scoring in Markov Random Fields", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1080, "abstract": null, "full_text": "Bayesian Model Scoring in Markov Random Fields\n\nBren School of Information and Computer Science\n\nSridevi Parise\n\nUC Irvine\n\nIrvine, CA 92697-3425\n\nsparise@ics.uci.edu\n\nBren School of Information and Computer Science\n\nMax Welling\n\nUC Irvine\n\nIrvine, CA 92697-3425\n\nwelling@ics.uci.edu\n\nAbstract\n\nScoring structures of undirected graphical models by means of evaluating the\nmarginal likelihood is very hard. The main reason is the presence of the parti-\ntion function which is intractable to evaluate, let alone integrate over. We propose\nto approximate the marginal likelihood by employing two levels of approximation:\nwe assume normality of the posterior (the Laplace approximation) and approxi-\nmate all remaining intractable quantities using belief propagation and the linear\nresponse approximation. This results in a fast procedure for model scoring. Em-\npirically, we \ufb01nd that our procedure has about two orders of magnitude better\naccuracy than standard BIC methods for small datasets, but deteriorates when the\nsize of the dataset grows.\n\n1 Introduction\n\nBayesian approaches have become an important modeling paradigm in machine learning. They\noffer a very natural setting in which to address issues such as over\ufb01tting which plague standard\nmaximum likelihood approaches. A full Bayesian approach has its computational challenges as it\noften involves intractable integrals. While for Bayesian networks many of these challenges have\nbeen met successfully[3], the situation is quite reverse for Markov random \ufb01eld models. In fact, it\nis very hard to \ufb01nd any literature at all on model order selection in general MRF models. 
The main reason for this discrepancy is that MRF models have a normalization constant that depends on the parameters but is itself intractable to compute, let alone integrate over. In fact, the presence of this term even prevents one from drawing samples from the posterior distribution in most situations, except for some special cases¹.

¹If one can compute the normalization term exactly (e.g., graphs with small treewidth), or if one can draw perfect samples from the MRF [8] (e.g., positive interactions only), then one can construct a Markov chain for the posterior.

In terms of approximating the posterior, some new methods have become available recently. In [7] a number of approximate MCMC samplers are proposed. Two of them were reported to be most successful: one based on Langevin sampling with approximate gradients given by contrastive divergence, and one where the acceptance probability is approximated by replacing the log partition function with the Bethe free energy. Both of these methods are very general, but inefficient. In [2] MCMC methods based on the reversible jump formalism are explored for the Potts model. To compute acceptance ratios for dimension-changing moves they need to estimate the partition function using a separate estimation procedure, making this approach rather inefficient as well. In [6] and [8] MCMC methods are proposed that use perfect samples to circumvent the calculation of the partition function altogether. This approach is elegant, but limited in its application due to the need to draw perfect samples. Moreover, two approaches that approximate the posterior by a Gaussian distribution are proposed in [11] (based on expectation propagation) and [13] (based on the Bethe-Laplace approximation).

In this paper we focus on a different problem, namely that of approximating the marginal likelihood. This quantity is at the heart of Bayesian analysis because it allows one to compare models of different structure. 
One can use it either to optimize over model structures or to average over them. Even given an approximation to the posterior distribution, it is not at all obvious how to use it to compute a good estimate of the marginal likelihood. The most direct approach is to use samples from the posterior and compute importance weights,

p(D) ≈ (1/N) Σ_{n=1}^{N} p(D|θ_n) p(θ_n) / Q(θ_n|D),    θ_n ∼ Q(θ_n|D)    (1)

where Q(θ_n|D) denotes the approximate posterior. Unfortunately, this importance sampler suffers from very high variance when the number of parameters becomes large. It is not untypical for the estimate to be effectively based on a single sample.

We propose to use the Laplace approximation, including all O(1) terms, where the intractable quantities of interest are approximated by either belief propagation (BP) or the linear response theorem based on the solution of BP. We show empirically that the O(1) terms are indispensable for small N: their inclusion can improve accuracy by up to two orders of magnitude. At the same time, we observe that as a function of N, the O(1) term based on the covariance between features deteriorates and should be omitted for large N. We conjecture that this phenomenon is explained by the fact that the calculation of the covariance between features, which is equal to the second derivative of the log-normalization constant, becomes unstable if the bias in the MAP estimate of the parameters is of the order of the variance in the posterior. For any biased estimate of the parameters this is therefore bound to happen as we increase N, because the variance of the posterior distribution is expected to decrease with N.

In summary, we present a very accurate estimate of the marginal likelihood where it is most needed, i.e., for small N. 
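The importance-weight estimator of eqn. (1) is easy to check on a conjugate toy problem where the exact marginal likelihood is available in closed form. The sketch below (all numbers illustrative, not from the paper) scores a Bernoulli model with a flat prior: for a fixed binary sequence with k ones out of n observations, p(D) = B(k+1, n−k+1) exactly, and a deliberately imperfect Beta approximate posterior plays the role of Q(θ|D).

```python
import random
from math import factorial

def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a, b."""
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)

# Model: x ~ Bernoulli(theta) with a flat prior p(theta) = 1. The data D is
# a fixed sequence with k ones out of n, so p(D) = B(k+1, n-k+1) exactly.
n, k, N = 10, 7, 200_000
exact = beta_fn(k + 1, n - k + 1)

# A deliberately imperfect approximate posterior Q = Beta(6, 3)
# (the true posterior would be Beta(8, 4)).
a, b = 6, 3
random.seed(0)
total = 0.0
for _ in range(N):
    theta = random.betavariate(a, b)
    q = theta**(a - 1) * (1 - theta)**(b - 1) / beta_fn(a, b)  # Q(theta|D)
    total += theta**k * (1 - theta)**(n - k) / q  # p(D|theta) p(theta) / Q(theta|D)
estimate = total / N
print(estimate, exact)
```

With only one parameter the weights are well behaved and the estimate lands close to the exact value; the high-variance failure mode described in the text only appears as the parameter dimension grows.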
This work seems to be the first practical method for estimating the marginal evidence in undirected graphical models.

2 The Bethe-Laplace Approximation for log p(D)

Without loss of generality we represent an MRF as a log-linear model,

p(x|λ) = (1/Z(λ)) exp[ λ^T f(x) ]    (2)

where f(x) represents the features. In the following we will assume that the random variables x are observed. Generalizations to models with hidden variables exist in theory, but we defer the empirical evaluation of this case to future research.

To score a structure we will follow the Bayesian paradigm and aim to compute the log-marginal likelihood log p(D), where D represents a dataset of size N,

log p(D) = log ∫ dλ p(D|λ) p(λ)    (3)

where p(λ) is some arbitrary prior on the parameters λ.

In order to approximate this quantity we employ two approximations. Firstly, we expand both the log-likelihood and the log-prior around the MAP value λMP. For the log-likelihood this boils down to expanding the log-partition function,

log Z(λ) ≈ log Z(λMP) + κ^T δλ + (1/2) δλ^T C δλ    (4)

with δλ = (λ − λMP) and

C = E[f(x)f(x)^T]_{p(x)} − E[f(x)]_{p(x)} E[f(x)]^T_{p(x)},    κ = E[f(x)]_{p(x)}    (5)

and where all averages are taken over p(x|λMP).

Similarly, for the prior we find,

log p(λ) = log p(λMP) + g^T δλ + (1/2) δλ^T H δλ    (6)

where g is the first derivative of log p evaluated at λMP and H is the second derivative (or Hessian). The variables δλ represent fluctuations of the parameters around the MAP value λMP. 
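For a model small enough to enumerate, the quantities in eqns. (2), (4) and (5) can be computed exactly by brute force, which is useful both for intuition and for testing approximations later. A minimal sketch with an assumed 3-node, 2-edge Boltzmann machine and illustrative parameter values:

```python
import itertools
import math

# Exact Z(λ), κ = E[f(x)] and feature covariance C (eqn. 5) for a tiny
# Boltzmann machine, by exhaustive enumeration of the 2^3 states.
# Features: node features x_i and edge features x_i * x_j.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
theta = [0.2, -0.1, 0.3]          # node parameters (illustrative values)
w = {(0, 1): 0.5, (1, 2): -0.4}   # edge parameters (illustrative values)

def features(x):
    return [x[i] for i in nodes] + [x[i] * x[j] for (i, j) in edges]

F = len(nodes) + len(edges)
lam = theta + [w[e] for e in edges]

# unnormalized log-potential λ^T f(x) for every state
states = list(itertools.product([0, 1], repeat=len(nodes)))
logpots = [sum(l * f for l, f in zip(lam, features(x))) for x in states]
Z = sum(math.exp(lp) for lp in logpots)
probs = [math.exp(lp) / Z for lp in logpots]

kappa = [sum(p * features(x)[a] for p, x in zip(probs, states)) for a in range(F)]
C = [[sum(p * features(x)[a] * features(x)[b] for p, x in zip(probs, states))
      - kappa[a] * kappa[b] for b in range(F)] for a in range(F)]
print(math.log(Z), kappa)
```

This enumeration is exponential in the number of nodes; the whole point of sections 2.1 and beyond is to replace it with BP-based approximations.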
The marginal likelihood can now be approximated by integrating out the fluctuations δλ, considering λMP as a hyper-parameter,

log p(D) = log ∫ dδλ p(D|δλ, λMP) p(δλ|λMP)    (7)

Inserting the expansions of eqns. 4 and 6 into eqn. 7, we arrive at the standard expression for the Laplace approximation applied to MRFs,

log p(D) ≈ Σ_n λMP^T f(x_n) − N log Z(λMP) + log p(λMP) + (F/2) log(2π) − (F/2) log(N) − (1/2) log det(C − H/N)    (8)

with F the number of features.

The difference with Laplace approximations for Bayesian networks is the fact that many terms in the expression above cannot be evaluated. First of all, determining λMP requires running gradient ascent or iterative scaling to maximize the penalized log-likelihood, which requires the computation of the average sufficient statistics E[f(x)]_{p(x)}. Secondly, the expression contains the log-partition function Z(λMP) and the covariance matrix C, which are both intractable quantities.

2.1 The BP-Linear Response Approximation

To make further progress, we introduce a second layer of approximations based on belief propagation. In particular, we approximate the required marginals in the gradient for λMP with the ones obtained with BP. For fully observed MRFs the value of λMP will be very close to the solution obtained by pseudo-moment matching (PMM) [5], the influence of the prior being the only difference between the two. Hence, we use λPMM to initialize gradient descent. The approximation incurred by PMM is not always small [10], in which case other approximations such as contrastive divergence may be substituted instead. The term −log Z(λMP) will be approximated with the Bethe free energy. 
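Once the individual pieces of eqn. (8) are in hand, assembling the score is straightforward. A minimal sketch: the function below simply adds the six terms, taking log Z, C and the log-prior Hessian H as plain inputs (in the paper's setting these would come from the Bethe free energy and linear response; the numbers in the usage line are made up for illustration).

```python
import math

def log_det(A):
    """Log-determinant via Cholesky (A assumed symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(n))

def laplace_score(sum_stats, N, lam_mp, log_z, log_prior, C, H):
    """All six terms of eqn. (8). sum_stats[a] = sum_n f_a(x_n); C is the
    feature covariance and H the log-prior Hessian, both at lam_mp."""
    F = len(lam_mp)
    loglik = sum(l * s for l, s in zip(lam_mp, sum_stats)) - N * log_z
    A = [[C[a][b] - H[a][b] / N for b in range(F)] for a in range(F)]
    return (loglik + log_prior + 0.5 * F * math.log(2 * math.pi)
            - 0.5 * F * math.log(N) - 0.5 * log_det(A))

# toy single-feature example with illustrative numbers
score = laplace_score([7.0], 10, [0.5], math.log(1 + math.exp(0.5)),
                      -0.1, [[0.25]], [[-1.0]])
print(score)
```

Note that C − H/N must be positive definite for the log-determinant to exist; with a Gaussian prior H is negative definite, so the −H/N term acts as a regularizer for small N.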
This will involve running belief propagation on a model with parameters λMP and inserting the beliefs at their fixed points into the expression for the Bethe free energy [16].

To compute the covariance matrix C between the features (eqn. 5), we use the linear response algorithm of [15]. This approximation is based on the observation that C is the Hessian of the log-partition function w.r.t. the parameters. It is approximated by the Hessian of the Bethe free energy w.r.t. the parameters, which in turn depends on the partial derivatives of the BP beliefs w.r.t. the parameters,

C_{αβ} = ∂² log Z(λ) / ∂λ_α ∂λ_β ≈ − ∂² F_Bethe(λ) / ∂λ_α ∂λ_β = Σ_{x_α} f_α(x_α) ∂p^BP_α(x_α|λ) / ∂λ_β    (9)

where λ = λMP, p^BP_α is the marginal computed using belief propagation, and x_α is the collection of variables in the argument of feature f_α (e.g., nodes or edges). This approximate C is also guaranteed to be symmetric and positive semi-definite. In [15] two algorithms were discussed to compute C in the linear response approximation, one based on a matrix inverse, the other a local propagation algorithm. The main idea is to perform a Taylor expansion of the beliefs and messages in the parameters δλ = λ − λMP and keep track of first order terms in the belief propagation equations. One can show that the first order terms carry the information needed to compute the covariance matrix. We refer to [15] for more information. 
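The identity behind eqn. (9), that the feature covariance is the Hessian of log Z, can be verified numerically on a toy model. The sketch below uses the exact log Z of a 2-node Boltzmann machine (on a tree the Bethe free energy is exact, so this stands in for −F_Bethe) and compares a finite-difference Hessian against the moment-based covariance; parameter values are illustrative.

```python
import itertools
import math

# Check C_{ab} = ∂² log Z / ∂λ_a ∂λ_b (eqn. 9) on a 2-node model with
# features [x0, x1, x0*x1], comparing a central finite-difference Hessian
# of log Z against the covariance computed from exact moments.
states = list(itertools.product([0, 1], repeat=2))

def feats(x):
    return [x[0], x[1], x[0] * x[1]]

def log_z(lam):
    return math.log(sum(math.exp(sum(l * f for l, f in zip(lam, feats(x))))
                        for x in states))

def cov(lam):
    z = math.exp(log_z(lam))
    p = [math.exp(sum(l * f for l, f in zip(lam, feats(x)))) / z for x in states]
    m = [sum(pi * feats(x)[a] for pi, x in zip(p, states)) for a in range(3)]
    return [[sum(pi * feats(x)[a] * feats(x)[b] for pi, x in zip(p, states))
             - m[a] * m[b] for b in range(3)] for a in range(3)]

lam0, eps = [0.3, -0.2, 0.7], 1e-3

def hess(a, b):
    def shifted(da, db):
        l = list(lam0)
        l[a] += da
        l[b] += db
        return log_z(l)
    return (shifted(eps, eps) - shifted(eps, -eps)
            - shifted(-eps, eps) + shifted(-eps, -eps)) / (4 * eps * eps)

C_fd = [[hess(a, b) for b in range(3)] for a in range(3)]
C_ex = cov(lam0)
```

Linear response replaces the finite differences with analytic derivatives of the BP beliefs, which is both exact at the Bethe level and far cheaper.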
In appendix A we provide explicit equations for the case of Boltzmann machines, which is what is needed to reproduce the experiments in section 4.

Figure 1: Comparison of various scores on synthetic data (score/N vs. #edges in nested model sequences; true model: 5 nodes, 6 edges; (a) N = 50, (b) N = 10000)

3 Conditional Random Fields

Perhaps the most practical class of undirected graphical models is the conditional random field (CRF) model. Here we jointly model labels t and input variables x. The most significant modification relative to MRFs is that the normalization term now depends on the input variables. The probability of the labels given the input is given as,

p(t|x, λ) = (1/Z(λ, x)) exp[ λ^T f(t, x) ]    (10)

To approximate the log marginal evidence we obtain an expression very similar to eqn. 8 with the following replacements,

C → (1/N) Σ_{n=1}^{N} C_{x_n}    (11)

Σ_n λMP^T f(x_n) − N log Z(λMP) → Σ_n ( λMP^T f(t_n, x_n) − log Z(λMP, x_n) )    (12)

where

C_{x_n} = E[f(t, x_n) f(t, x_n)^T]_{p(t|x_n)} − E[f(t, x_n)]_{p(t|x_n)} E[f(t, x_n)]^T_{p(t|x_n)}    (13)

and where all averages are taken over the distributions p(t|x_n, λMP) at the MAP value λMP of the conditional log-likelihood Σ_n log p(t_n|x_n, λ).

4 Experiments

In the following experiments we probe the accuracy of the Bethe-Laplace (BP-LR) approximation. In these experiments we have focussed on comparing the value of the estimated log marginal likelihood with "annealed importance sampling" (AIS), which we treat as ground truth [9, 1]. We have focussed on this performance measure because the marginal likelihood is the relevant quantity for both Bayesian model averaging as well as model selection.

We perform experiments on synthetic data as well as a real-world dataset. 
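The input-dependent normalizer Z(λ, x) of section 3 is the reason every term of the CRF score involves a per-datacase computation, but for a linear chain it is tractable with a forward recursion. A minimal sketch with binary labels and a hypothetical two-parameter feature map (one state and one transition feature, not the paper's 24-feature set), checked against brute-force enumeration:

```python
import itertools
import math

# Chain-CRF normalizer log Z(λ, x) by a forward recursion, for binary
# labels t_i ∈ {0, 1}. The feature map below is illustrative only.
def node_score(t, xi, lam):
    return lam[0] * t * xi          # state feature  t_i * g(x_i)

def edge_score(s, t, xi, lam):
    return lam[1] * s * t * xi      # transition feature t_i * t_{i+1} * g(x_i)

def log_z_chain(x, lam):
    # alpha[t] = log-sum of the scores of all prefixes ending in label t
    alpha = [node_score(t, x[0], lam) for t in (0, 1)]
    for i in range(1, len(x)):
        alpha = [math.log(sum(math.exp(alpha[s] + edge_score(s, t, x[i - 1], lam))
                              for s in (0, 1))) + node_score(t, x[i], lam)
                 for t in (0, 1)]
    return math.log(sum(math.exp(a) for a in alpha))

def log_z_brute(x, lam):
    total = 0.0
    for ts in itertools.product((0, 1), repeat=len(x)):
        s = sum(node_score(ts[i], x[i], lam) for i in range(len(x)))
        s += sum(edge_score(ts[i], ts[i + 1], x[i], lam)
                 for i in range(len(x) - 1))
        total += math.exp(s)
    return math.log(total)

x, lam = [1, 0, 1, 1, 0], [0.8, -0.5]
```

The recursion is linear in the sequence length, versus exponential for the enumeration; this is why BP-LR is exact (BP-LR-ExactGrad = Laplace-Exact) on chains, as exploited in section 4.2.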
For the synthetic data, we use Boltzmann machine models (binary undirected graphical models with pairwise interactions), because we believe that the results will be representative of multi-state models and because the implementation of the linear response approximation is straightforward in this case (see appendix A).

Figure 2: Mean difference in scores with AIS (synthetic data; (a) N = 50, (b) N = 10000). Error-bars are too small to see.

Scores computed using the proposed method (BP-LR) were compared against MAP scores (or penalized log-likelihood), where we retain only the first three terms in equation (8), and the commonly used BIC-ML scores, where we ignore all O(1) terms (i.e., retain only terms 1, 2 and 5). BIC-ML uses the maximum likelihood value λML instead of λMP. We also evaluate two other scores: BP-LR-ExactGrad, where we use exact gradients to compute λMP, and Laplace-Exact, which is the same as BP-LR-ExactGrad but with C computed exactly as well. Note that these last two methods are practical only for models with small tree-width. Nevertheless, they are useful here to illustrate the effect of the bias from BP.

4.1 Synthetic Data

We generated 50 different random structures on 5 nodes. For each we sampled 6 different sets of parameters, with weights w ∼ U{[−d, −d + ε] ∪ [d, d + ε]}, d > 0, ε = 0.1/4, biases b ∼ U[−1, 1], and the edge strength d varying over {0.1/4, 0.2/4, 0.5/4, 1.0/4, 1.5/4, 2.0/4}. We then generated N = 10000 samples from each of these (50 × 6) models using exact sampling by exhaustive enumeration.

In the first experiment we picked a random dataset/model with d = 0.5/4 (the true structure had 6 edges) and studied the variation of the different scores with model complexity. We define an ordering on models based on complexity by using nested model sequences. These are such that a model appearing later in the sequence contains all edges of models appearing earlier. Figure (1) shows the results for two such random nested sequences around the true model, for N = 50 and N = 10000 datacases respectively. The error-bars for AIS are over 10 parallel annealing runs, and we see that they are very small. We repeated the plots for multiple such model sequences and the results were similar. Figure (2) shows the average absolute difference of each score with the AIS score over 50 sequences. From these one can see that BP-LR is very accurate at low N. As known in the literature, BIC-ML tends to over-penalize model complexity. At large N, the performance of all methods improves, but BP-LR does slightly worse than BIC-ML.

In order to better understand the performance of the various scores with N, we took the datasets at d = 0.5/4 and computed scores at various values of N. At each value, we find the absolute difference between the score assigned to the true structure and the corresponding AIS score. These are then averaged over the 50 datasets. The results are shown in figure (3). We note that all BP-LR methods are about two orders of magnitude more accurate than methods that ignore the O(1) term based on C. However, as we increase N, BP-LR based on λMP computed using BP significantly deteriorates. This does not happen with the two BP-LR methods based on λMP computed using exact gradients (i.e., BP-LR-ExactGrad and Laplace-Exact). 
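The nested model sequences used above are simple to construct: order the candidate edges so that the true edges come first, then read off prefixes. A minimal sketch (the specific true edge set below is illustrative, matching only the "5 nodes, 6 edges" shape of the experiments):

```python
import random
from itertools import combinations

# Build an ordered list of edge sets in which every later model contains
# all edges of every earlier one, and which passes through the true model.
def nested_sequence(n_nodes, true_edges, rng):
    all_edges = list(combinations(range(n_nodes), 2))
    below = list(true_edges)            # edges of the true structure
    rng.shuffle(below)
    above = [e for e in all_edges if e not in true_edges]
    rng.shuffle(above)
    order = below + above               # true edges enter the sequence first
    return [set(order[:k]) for k in range(len(order) + 1)]

rng = random.Random(0)
true_edges = {(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3)}  # 5 nodes, 6 edges
seq = nested_sequence(5, true_edges, rng)
```

By construction the model at position |true_edges| in the sequence is exactly the true structure, with strictly nested sub-models before it and super-models after it, which is what the x-axis of figures (1) and (2) ranges over.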
Since the latter two methods perform identically, we conclude that it is not the approximation of C by linear response that breaks down, but rather that the bias in λMP is the reason that the estimate of C becomes unreliable. We conjecture that this happens when the bias becomes of the order of the standard deviation of the posterior distribution. Since the bias is constant but the variance in the posterior decreases as O(1/N), this phenomenon is bound to happen for some value of N.

Figure 3: Variation of score accuracy with N

Figure 4: Variation of score accuracy with d

Finally, since our BP-LR method relies on the BP approximation, which is known to break down at strong interactions, we investigated the performance of the various scores with d. Again, at each value of d we compute the average absolute difference between the scores assigned to the true structure by a method and by AIS. We use N = 10000 to keep the effect of N minimal. Results are shown in figure (4). As expected, all BP-based methods deteriorate with increasing d. The exact methods show that one can improve performance by having a more accurate estimate of λMP.

4.2 Real-world Data

To see the performance of BP-LR on real-world data, we implemented a linear chain CRF on the "newsgroup FAQ dataset"² [4]. This dataset contains 48 files where each line can be either a header, a question or an answer. The problem is binarized by retaining only the question/answer lines. For each line we use 24 binary features g_a(x) ∈ {0, 1}, a = 1, ..., 24, as provided by [4]. 
These are used to define state and transition features via f_i^a(t_i, x_i) = t_i g_a(x_i) and f_i^a(t_i, t_{i+1}, x_i) = t_i t_{i+1} g_a(x_i), where i denotes the line in a document and a indexes the 24 features.

We generated a random sequence of models by incrementally adding some state features and then some transition features. We then score each model using MAP, BIC-MAP (which is the same as BIC-ML but with λMP), AIS and Laplace-Exact. Note that since the graph is a chain, BP-LR is equivalent to BP-LR-ExactGrad and Laplace-Exact. We use N = 2 files, each truncated to 100 lines. The results are shown in figure (5). Here again, Laplace-Exact agrees very closely with AIS compared to the other two methods. (Another, less relevant, observation is that the scores flatten out around the point where we stop adding state features, showing their importance compared to transition features.)

5 Discussion

The main conclusion from this study is that the Bethe-Laplace approximation can give an excellent approximation to the marginal likelihood for small datasets. We discovered an interesting phenomenon, namely that as N grows, the error in the O(1) term based on the covariance between features increases. We found that this term can give an enormous boost in accuracy for small N (up to two orders of magnitude), but its effect can be detrimental for large N. We conjecture that this switch-over point takes place when the bias in λMP becomes of the order of the standard deviation of the posterior (which decreases as 1/N). At that point the second derivative of the log-likelihood in the Taylor expansion becomes unreliable.

There are a number of ways to improve the accuracy of the approximation. One approach is to use higher order Kikuchi approximations to replace the Bethe approximation. 
Linear response results are also available for this case [12]. A second improvement could come from improving the estimate of λMP using alternative learning techniques such as contrastive divergence or other sample-based approaches. As discussed above, less bias in λMP will make the covariance term useful for larger N. Finally, the case of hidden variables needs to be addressed. It is not hard to imagine how to extend the techniques proposed in this paper to hidden variables in theory, but we haven't run the experiments necessary to make claims about their performance. This we leave for future study.

²Downloaded from: http://www.cs.umass.edu/∼mccallum/data/faqdata/

Figure 5: Comparison of various scores on the real-world dataset (score/N vs. #features; CRF, N = 2, sequence length 100)

A Computation of C for Boltzmann Machines

For binary variables and pairwise interactions we define the variables as λ = {θ_i, w_ij}, where θ_i is the parameter multiplying the node-feature x_i and w_ij the parameter multiplying the edge feature x_i x_j. Moreover, we define the following independent quantities: q_i = p(x_i = 1) and ξ_ij = p(x_i = 1, x_j = 1). Note that all other quantities, e.g. p(x_i = 1, x_j = 0), are functions of {q_i, ξ_ij}. In the following we will assume that {q_i, ξ_ij} are computed using belief propagation (BP). 
At the fixed points of BP the following relations hold [14],

θ_i = log[ (1 − q_i)^{z_i − 1} ∏_{j∈N(i)} (q_i − ξ_ij) / ( q_i^{z_i − 1} ∏_{j∈N(i)} (ξ_ij + 1 − q_i − q_j) ) ]

w_ij = log[ ξ_ij (ξ_ij + 1 − q_i − q_j) / ( (q_i − ξ_ij)(q_j − ξ_ij) ) ]    (14)

where N(i) denotes the neighboring nodes of node i in the graph and z_i = |N(i)| is the number of neighbors of node i.

To compute the covariance matrix we first compute its inverse from eqns. 14, arranged as the block matrix

C^{−1} = [ ∂θ/∂q   ∂θ/∂ξ ;  ∂w/∂q   ∂w/∂ξ ]

and subsequently take its inverse. 
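Eqns. (14) map BP pseudo-marginals (q_i, ξ_ij) back to parameters (θ_i, w_ij). On a tree BP is exact, so starting from the exact marginals of a 2-node Boltzmann machine the map must recover θ and w exactly, which gives a direct sanity check of the formulas (parameter values below are illustrative):

```python
import itertools
import math

# Eqn. (14): recover (θ_i, w_ij) from pseudo-marginals q_i = p(x_i=1) and
# ξ_ij = p(x_i=1, x_j=1). Edges are keyed as sorted pairs (i, j), i < j.
def params_from_marginals(q, xi, neighbors):
    theta, w = {}, {}
    for i, q_i in q.items():
        z = len(neighbors[i])
        num = (1 - q_i) ** (z - 1)
        den = q_i ** (z - 1)
        for j in neighbors[i]:
            e = (min(i, j), max(i, j))
            num *= q_i - xi[e]
            den *= xi[e] + 1 - q_i - q[j]
        theta[i] = math.log(num / den)
    for (i, j), x in xi.items():
        w[(i, j)] = math.log(x * (x + 1 - q[i] - q[j])
                             / ((q[i] - x) * (q[j] - x)))
    return theta, w

# exact marginals of p(x0, x1) ∝ exp(t1*x0 + t2*x1 + w12*x0*x1)
t1, t2, w12 = 0.4, -0.3, 0.9
pots = {(a, b): math.exp(t1 * a + t2 * b + w12 * a * b)
        for a, b in itertools.product((0, 1), repeat=2)}
Z = sum(pots.values())
q = {0: (pots[1, 0] + pots[1, 1]) / Z, 1: (pots[0, 1] + pots[1, 1]) / Z}
xi = {(0, 1): pots[1, 1] / Z}
theta, w = params_from_marginals(q, xi, {0: [1], 1: [0]})
```

On loopy graphs the recovered parameters are only the pseudo-moment-matching solution mentioned in section 2.1, not the exact ones.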
The four terms in this matrix are given by,

∂θ_i/∂q_k = [ (1 − z_i)/(q_i(1 − q_i)) + Σ_{j∈N(i)} ( 1/(q_i − ξ_ij) + 1/(ξ_ij + 1 − q_i − q_j) ) ] δ_ik    (15)

∂θ_i/∂ξ_jk = ( −1/(q_i − ξ_jk) − 1/(ξ_jk + 1 − q_j − q_k) ) (δ_ij + δ_ik)    (16)

∂w_ij/∂q_k = ( −1/(q_i − ξ_ij) − 1/(ξ_ij + 1 − q_i − q_j) ) δ_ik + ( −1/(q_j − ξ_ij) − 1/(ξ_ij + 1 − q_i − q_j) ) δ_jk    (17)

∂w_ij/∂ξ_kl = ( 1/ξ_ij + 1/(ξ_ij + 1 − q_i − q_j) + 1/(q_i − ξ_ij) + 1/(q_j − ξ_ij) ) δ_ik δ_jl    (18)

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 0447903.

References

[1] M.J. Beal and Z. Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics, pages 453–464. Oxford University Press, 2003.

[2] P. Green and S. Richardson. Hidden Markov models and disease mapping. Journal of the American Statistical Association, 97(460):1055–1070, 2002.

[3] D. Heckerman. A tutorial on learning with Bayesian networks. pages 301–354, 1999.

[4] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Int'l Conf. on Machine Learning, pages 591–598, San Francisco, 2000.

[5] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation via pseudo-moment matching. In AISTATS, 2003.

[6] J. Møller, A. Pettitt, K. Berthelsen, and R. Reeves. 
An efficient Markov chain Monte Carlo method for distributions with intractable normalisation constants. Biometrika, 93, 2006. To appear.

[7] I. Murray and Z. Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI-04), San Francisco, CA, 2004.

[8] I. Murray, Z. Ghahramani, and D.J.C. MacKay. MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), Pittsburgh, PA, 2006.

[9] R.M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.

[10] S. Parise and M. Welling. Learning in Markov random fields: An empirical study. In Proc. of the Joint Statistical Meeting – JSM2005, 2005.

[11] Y. Qi, M. Szummer, and T.P. Minka. Bayesian conditional random fields. In Artificial Intelligence and Statistics, 2005.

[12] K. Tanaka. Probabilistic inference by means of cluster variation method and linear response theory. IEICE Transactions in Information and Systems, E86-D(7):1228–1242, 2003.

[13] M. Welling and S. Parise. Bayesian random fields: The Bethe-Laplace approximation. In UAI, 2006.

[14] M. Welling and Y.W. Teh. Approximate inference in Boltzmann machines. Artificial Intelligence, 143:19–50, 2003.

[15] M. Welling and Y.W. Teh. Linear response algorithms for approximate inference in graphical models. Neural Computation, 16(1):197–221, 2004.

[16] J.S. Yedidia, W. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. Technical report, MERL, 2002. 
Technical Report\nTR-2002-35.\n\n\f", "award": [], "sourceid": 3149, "authors": [{"given_name": "Sridevi", "family_name": "Parise", "institution": null}, {"given_name": "Max", "family_name": "Welling", "institution": null}]}