{"title": "Query-Aware MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 2564, "page_last": 2572, "abstract": "Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user's interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models.", "full_text": "Query-Aware MCMC\n\nMichael Wick\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA\n\nmwick@cs.umass.edu\n\nAndrew McCallum\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA\n\nmccallum@cs.umass.edu\n\nAbstract\n\nTraditional approaches to probabilistic inference such as loopy belief propagation\nand Gibbs sampling typically compute marginals for all the unobserved variables\nin a graphical model. However, in many real-world applications the user\u2019s inter-\nests are focused on a subset of the variables, speci\ufb01ed by a query. In this case it\nwould be wasteful to uniformly sample, say, one million variables when the query\nconcerns only ten. 
In this paper we propose a query-speci\ufb01c approach to MCMC\nthat accounts for the query variables and their generalized mutual information\nwith neighboring variables in order to achieve higher computational ef\ufb01ciency.\nSurprisingly there has been almost no previous work on query-aware MCMC. We\ndemonstrate the success of our approach with positive experimental results on a\nwide range of graphical models.\n\n1\n\nIntroduction\n\nGraphical models are useful for representing relationships between large numbers of random vari-\nables in probabilistic models spanning a wide range of applications, including information extraction\nand data integration. Exact inference in these models is often computationally intractable due to the\ndense dependency structures required in many real world problems, thus there exists a large body\nof work on both variational and sampling approximations to inference that help manage large tree-\nwidth. More recently, however, inference has become dif\ufb01cult for a different reason: large data. The\nproliferation of interconnected data and the desire to model it has given rise to graphical models with\nmillions or even billions of random variables. Unfortunately, there has been little research devoted\nto approximate inference in graphical models that are large in terms of their number of variables.\nOther than acquiring more machines and parallelizing inference [1, 2], there have been few options\nfor coping with this problem.\nFortunately, many inference needs are instigated by queries issued by users interested in particular\nrandom variables. These real-world queries tend to be grounded (i.e., focused on speci\ufb01c data cases).\nFor example, a funding agency might be interested in the expected impact that funding a particular\nresearch group has on a certain scienti\ufb01c topic. 
In these situations not all variables are of equal\nrelevance to the user\u2019s query; some variables become observed given the query, others become\nstatistically independent given the query, and the remaining variables are typically marginalized.\nThus, a user-generated query provides a tremendous amount of information that can be exploited by\nan intelligent inference procedure. Unfortunately, traditional approaches to inference such as loopy\nbelief propagation (BP) and Gibbs sampling are query agnostic in the sense that they fail to take\nadvantage of this knowledge and treat each variable as equally relevant. Surprisingly, there has been\nlittle research on query speci\ufb01c inference and the only existing approaches focus on loopy BP [3, 4].\nIn this paper we propose a query-aware approach to Markov chain Monte Carlo (QAM) that exploits\nthe dependency structure of the graph and the query to achieve faster convergence to the answer. Our\nmethod selects variables for sampling in proportion to their in\ufb02uence on the query variables. We\ndetermine this in\ufb02uence using a computationally tractable generalization of mutual information be-\ntween the query variables and each variable in the graph. Because our query-speci\ufb01c approach to\ninference is based on MCMC, we can provide arbitrarily close approximations to the query answer\nwhile also scaling to graphs whose structure and unrolled factor density would ordinarily preclude\nboth exact and belief propagation inference methods. This is essential for the method to be de-\nployable in real-world probabilistic databases where even a seemingly innocuous relational algebra\nquery over a simple fully independent structure can produce an inference problem that is #P-hard\n[5]. 
We demonstrate dramatic improvements over traditional Markov chain Monte Carlo sampling\nmethods across a wide array of models of diverse structure.\n\n2 Background\n\n2.1 Graphical Models\n\nGraphical models are a \ufb02exible framework for capturing statistical relationships between random\nvariables. A factor graph G := \u27e8x, \u03c8\u27e9 is a bipartite graph consisting of n random variables x =\n{x_i}_{i=1}^n and m factors \u03c8 = {\u03c8_i}_{i=1}^m. Each variable xi has a domain Xi, and we denote the entire\ndomain space of the random variables (x) as X with associated \u03c3-algebra \u2126. Intuitively, a factor \u03c8i\nis a function that maps a subset of random variable values vi \u2208 Xi to a non-negative real-valued\nnumber, thus capturing the compatibility of an assignment to those variables. The factor graph then\nexpresses a probability measure over (X, \u2126); the probability of a particular event \u03c9 \u2208 \u2126 is given as\n\n\\[ \\pi(\\omega) = \\frac{1}{Z} \\sum_{v \\in \\omega} \\prod_{i=1}^{m} \\psi_i(v_i), \\qquad Z = \\sum_{v \\in X} \\prod_{i=1}^{m} \\psi_i(v_i). \\tag{1} \\]\n\nWe will assume that \u2126 is de\ufb01ned so that marginalization of any subset of the variables is well\nde\ufb01ned; this is important in the sequel.\n\n2.2 Queries on Graphical Models\n\nInformally, a query on a graphical model is a request for some quantity of interest that the graphical\nmodel is capable of providing. That is, a query is a function mapping the graphical model to an\nanswer set. Inference is required to recover these quantities and produce an answer to the query.\nWhile in the general case, a query may contain arbitrary functions over the support of a graphical\nmodel, for this work we consider queries of the marginal form. That is, a query Q consists of three\nparts, Q = \u27e8xq, xl, xe\u27e9. 
Here, xq is the set of query variables whose marginal distributions (or\nMAP con\ufb01guration) are the answer to the query, xe is a set of evidence variables whose values\nare observed, and xl is the set of latent variables over which one typically marginalizes to obtain\nthe statistically sound answer. Note that this class of queries is remarkably general and includes\nqueries that require expectations over arbitrary functions. We can see this because a function over\nthe graphical model (or a subset of the graphical model) is itself a random variable, and can therefore\nbe included in xq (see footnote 1). More precisely, a query over a graphical model is:\n\n\\[ Q(x_q, x_l, x_e, \\pi) = \\pi(x_q \\mid x_e = v_e) = \\sum_{v_l} \\pi(x_q, x_l \\mid x_e = v_e) \\tag{2} \\]\n\nwhere we assume that \u2126 is well de\ufb01ned with respect to marginalization over arbitrary subsets of variables.\n\n2.3 Markov Chain Monte Carlo\n\nMarkov chain Monte Carlo (MCMC) is an important inference method for graphical models where\ncomputing the normalization constant Z is intractable. In particular, for many MCMC schemes such\nas Gibbs sampling and more generally Metropolis-Hastings, Z cancels out of the computation for\ngenerating a single sample. MCMC has been successfully used in a wide variety of applications\nincluding information extraction [8], data integration [9], and machine vision [10]. For simplicity,\nin this work, we consider Markov chains over discrete state spaces. 
However, many of the results\npresented in this paper may be extended to arbitrary state spaces using more general statements with\nmeasure theoretic de\ufb01nitions.\n\nFootnote 1: Research in probabilistic databases has demonstrated that a large class of relational algebra queries can be\nrepresented as graphical models and answered using statistical queries of this form [6, 7].\n\nMarkov chain Monte Carlo produces a sequence of states {s_i}_{i=1}^\u221e in a state space S according to\na transition kernel K : S \u00d7 S \u2192 R+, which in the discrete case is a stochastic matrix: for all\ns \u2208 S, K(s,\u00b7) is a valid probability measure, and for all s \u2208 S, K(\u00b7, s) is a measurable function.\nSince we are concerned with MCMC for inference in graphical models, we will from now on let\nS := X, and use X instead. Under certain conditions the Markov chain is said to be ergodic; the\nchain then exhibits two types of convergence. The \ufb01rst is of practical interest: a law of large numbers\nconvergence\n\n\\[ \\lim_{t \\to \\infty} \\frac{1}{t} \\sum_{i=1}^{t} f(s_i) = \\int_{s \\in X} f(s) \\pi(s) \\, ds \\tag{3} \\]\n\nwhere the s_i are empirical samples from the chain.\nThe second type of convergence is to the distribution \u03c0. At each time step, the Markov chain is in a\ntime-speci\ufb01c distribution over the state space (encoding the probability of being in a particular state\nat time t). For example, given an initial distribution \u03c00 over the state space, the probability of being\nin a next state s\u2032 is the probability of all paths beginning in starting states s with probabilities \u03c00(s)\nand transitioning to s\u2032 with probabilities K(s, s\u2032). Thus the time-speci\ufb01c (t = 1) distribution over\nall states is given by \u03c0^(1) = \u03c00 K; more generally, the distribution at time t is given by \u03c0^(t) = \u03c00 K^t.\nUnder certain conditions and regardless of the initial distribution, the Markov chain will converge\nto the stationary (invariant) distribution \u03c0. 
A suf\ufb01cient (but not necessary) condition for this is to\nrequire that the Markov transition kernel obey detailed balance:\n\n\\[ \\pi(x) K(x, x') = \\pi(x') K(x', x) \\quad \\forall x, x' \\in X \\tag{4} \\]\n\nConvergence of the chain is established when repeated applications of the transition kernel main-\ntain the invariant distribution \u03c0 = \u03c0K, and convergence is traditionally quanti\ufb01ed using the total\nvariation norm:\n\n\\[ \\| \\pi^{(t)} - \\pi \\|_{tv} := \\sup_{A \\in \\Omega} | \\pi^{(t)}(A) - \\pi(A) | = \\frac{1}{2} \\sum_{x \\in X} | \\pi^{(t)}(x) - \\pi(x) | \\tag{5} \\]\n\nThe rate at which a Markov chain converges to the stationary distribution is proportional to the\nspectral gap of the transition kernel, and so there exists a large body of literature proving bounds on\nthe second eigenvalues.\n\n2.4 MCMC Inference in Graphical Models\n\nMCMC is used for inference in graphical models by constructing a Markov chain with invariant\ndistribution \u03c0 (given by the graphical model). One particularly successful approach is the Metropolis\nHastings (MH) algorithm. The idea is to devise a proposal distribution T : X \u00d7 X \u2192 [0, 1] for which\nit is always tractable to sample a next state s\u2032 given a current state s. Then, the proposed state s\u2032 is\naccepted with probability A(s, s\u2032), where\n\n\\[ A(s, s') = \\min\\left( 1, \\frac{\\pi(s') T(s', s)}{\\pi(s) T(s, s')} \\right) \\tag{6} \\]\n\nThe resulting transition kernel K_MH is given by\n\n\\[ K_{MH}(s, s') = \\begin{cases} T(s, s') & \\text{if } A(s, s') > 1,\\; s \\neq s' \\\\\\\\ T(s, s') A(s, s') & \\text{if } A(s, s') < 1 \\\\\\\\ T(s, s') + \\sum_{r : A(s, r) < 1} K(s, r)(1 - A(s, r)) & \\text{if } s = s' \\end{cases} \\tag{7} \\]\n\nFurther, observe that in the computation of A, the partition function Z cancels, as do factors out-\nside the Markov blanket of the variables that have changed. 
As a result, generating samples from\ngraphical models with Metropolis-Hastings is usually inexpensive.\n\n3 Query Speci\ufb01c MCMC\n\nGiven a query Q = \u27e8xq, xl, xe\u27e9, and a probability distribution \u03c0 encoded by a graphical model G\nwith factors \u03c8 and random variables x, the problem of query speci\ufb01c inference is to return the high-\nest \ufb01delity answer to Q given a possible time budget. We can put more precision on this statement\nby de\ufb01ning \u201chighest \ufb01delity\u201d as closest to the truth in total variation distance.\nOur approach for query speci\ufb01c inference is based on the Metropolis Hastings algorithm described\nin Section 2.4. A simple yet generic case of the Metropolis Hastings proposal distribution T (that\nhas been quite successful in practice) employs the following steps:\n1: Beginning in a current state s, select a random variable xi \u2208 x from a probability distribution p\nover the indices of the variables (1, 2,\u00b7\u00b7\u00b7 , n).\n2: Sample a new value for xi according to some distribution q(Xi) over that variable\u2019s domain,\nleave all other variables unchanged and return the new state s\u2032.\n\nIn brief, this strategy arrives at a new state s\u2032 from a current state s by simply updating the value\nof one variable at a time. In traditional MCMC inference, where the marginal distributions of all\nvariables are of equal interest, the variables are usually sampled in a deterministic order, or selected\nuniformly at random; that is, p(i) = 1/n induces a uniform distribution over the integers 1, 2,\u00b7\u00b7\u00b7 , n.\nHowever, given a query Q, it is reasonable to choose a p that more frequently selects the query\nvariables for sampling. Clearly, the query variable marginals depend on the remaining latent vari-\nables, so we must trade off sampling between query and non-query variables. A key observation is\nthat not all latent variables in\ufb02uence the query variables equally. 
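
The two-step proposal described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (mh_step, estimate_marginal), the uniform choice for q, and the two-variable toy table are all hypothetical. Because a uniform q makes the proposal symmetric, the acceptance ratio reduces to pi(s') / pi(s), so the partition function Z is never needed.

```python
import numpy as np

def mh_step(state, log_score, domains, p, rng):
    # Step 1: pick a variable index i from the selection distribution p.
    i = rng.choice(len(state), p=p)
    # Step 2: propose a new value for x_i from a uniform q over its domain
    # (assumption: q is uniform, which makes the proposal symmetric).
    proposal = list(state)
    proposal[i] = domains[i][rng.integers(len(domains[i]))]
    # Symmetric proposal, so the MH ratio is pi(s') / pi(s); Z cancels.
    log_ratio = log_score(proposal) - log_score(state)
    if np.log(rng.random()) < log_ratio:
        return tuple(proposal)
    return tuple(state)

def estimate_marginal(log_score, domains, p, query, steps, seed=0):
    # Empirical marginal of the query variable from a single chain.
    rng = np.random.default_rng(seed)
    state = tuple(d[0] for d in domains)
    counts = np.zeros(len(domains[query]))
    for _ in range(steps):
        state = mh_step(state, log_score, domains, p, rng)
        counts[domains[query].index(state[query])] += 1
    return counts / steps

# Hypothetical toy model: two binary variables with an unnormalized score table.
table = np.array([[3.0, 1.0], [1.0, 1.0]])

def toy_log_score(s):
    return np.log(table[s[0], s[1]])

# A query-weighted p that selects the query variable x_0 more often than x_1.
print(estimate_marginal(toy_log_score, [[0, 1], [0, 1]], [0.7, 0.3], 0, 20000))
```

Any p with full support leaves the stationary distribution unchanged here, since the kernel is a state-independent mixture of single-site kernels that each satisfy detailed balance; p only affects how quickly the query marginal converges.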
A fundamental question raised and\naddressed in this paper is: how do we pick a variable selection distribution p for a query Q to obtain\nthe highest \ufb01delity answer under a \ufb01nite time budget? We propose to select variables based on their\nin\ufb02uence on the query variable according to the graphical model.\nWe will now formalize a broad de\ufb01nition of in\ufb02uence by generalizing mutual information. The\nmutual information I(x, y) = \\sum_{x,y} \\pi(x, y) \\log \\frac{\\pi(x, y)}{\\pi(x)\\pi(y)} between two random variables measures\nthe strength of their dependence.\nIt is easy to check that this quantity is the KL divergence\nbetween the joint distribution of the variables and the product of the marginals: I(x, y) =\nKL(\u03c0(x, y)||\u03c0(x)\u03c0(y)). In this sense, mutual information measures dependence as a \u201cdistance\u201d\nbetween the full joint distribution and its independent approximation. Clearly, if x and y are inde-\npendent then this distance is zero and so is their mutual information. We produce a generalization\nof mutual information which we term the in\ufb02uence by substituting an arbitrary divergence function\nf in place of the KL divergence.\nDe\ufb01nition 1 (In\ufb02uence). Let x and y be two random variables with joint and marginal distributions\n\u03c0(x, y), \u03c0(x), \u03c0(y). Let f(\u03c01(\u00b7), \u03c02(\u00b7)) \u21a6 r, r \u2208 R+ be a non-negative real-valued divergence\nbetween probability distributions. The in\ufb02uence \u03b9(x, y) between x and y is\n\n\\[ \\iota(x, y) := f(\\pi(x, y), \\pi(x)\\pi(y)) \\tag{8} \\]\n\nIf we let f be the KL divergence then \u03b9 becomes the mutual information; however, because MCMC\nconvergence is more commonly assessed with total variation norm, we de\ufb01ne an in\ufb02uence metric\nbased on this choice for f. 
In particular we de\ufb01ne \u03b9tv(x, y) := \u2016\u03c0(x, y) \u2212 \u03c0(x)\u03c0(y)\u2016tv.\nAs we will now show, the total variation in\ufb02uence (between the query variable and the latent vari-\nables) has the important property that it is exactly the error incurred from ignoring a single latent\nvariable when sampling values for xq. For example, suppose we design an approximate query spe-\nci\ufb01c sampler that saves computational resources by ignoring a particular random variable xl. Then,\nthe variable xl will remain at its burned-in value xl=vl for the duration of query speci\ufb01c sampling.\nAs a result, the chain will converge to the invariant distribution \u03c0(\u00b7|xl=vl). If we use this conditional\ndistribution to approximate the marginal, then the expected error we incur is exactly the in\ufb02uence\nscore under total variation distance.\nProposition 1. If p(i) = 1(i \u2260 l)/(n \u2212 1) induces an MH kernel that neglects variable xl, then the\nexpected total variation error \u03betv of the resulting MH sampling procedure under the model is the\ntotal variation in\ufb02uence \u03b9tv.\n\nProof: The resulting chain has stationary distribution \u03c0(xq|xl = vl). The expected error is:\n\n\\[ \\begin{aligned}\nE_\\pi[\\xi_{tv}] &= \\sum_{v_l \\in X_l} \\pi(x_l = v_l) \\, \\| \\pi(x_q \\mid x_l = v_l) - \\pi(x_q) \\|_{tv} \\\\\\\\\n&= \\sum_{v_l \\in X_l} \\pi(x_l = v_l) \\, \\frac{1}{2} \\sum_{v_q \\in X_q} \\left| \\pi(x_q \\mid x_l = v_l) - \\pi(x_q) \\right| \\\\\\\\\n&= \\frac{1}{2} \\sum_{v_l \\in X_l} \\sum_{v_q \\in X_q} \\left| \\pi(x_q \\mid x_l = v_l) \\pi(x_l = v_l) - \\pi(x_q) \\pi(x_l = v_l) \\right| \\\\\\\\\n&= \\frac{1}{2} \\sum_{v_l \\in X_l} \\sum_{v_q \\in X_q} \\left| \\pi(x_q, x_l) - \\pi(x_q) \\pi(x_l) \\right| = \\iota_{tv}(x_q, x_l)\n\\end{aligned} \\]\n\nThis demonstrates that the expected cost of not sampling a variable is exactly that variable\u2019s in\ufb02u-\nence on the query variable. 
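
As a sanity check, the identity in Proposition 1 can be verified numerically on a small joint distribution. The sketch below is illustrative only: the helper names and the 2x2 joint table are hypothetical, and it assumes the conditional pi(x_q | x_l = v_l) is compared against the true marginal pi(x_q), with v_l drawn from pi(x_l).

```python
import numpy as np

def tv_distance(p, q):
    # Total variation distance between two discrete distributions (Equation 5).
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def tv_influence(joint):
    # joint[a, b] = pi(x_q = a, x_l = b).
    # Returns || pi(x_q, x_l) - pi(x_q) pi(x_l) ||_tv.
    pq = joint.sum(axis=1)
    pl = joint.sum(axis=0)
    return tv_distance(joint.ravel(), np.outer(pq, pl).ravel())

def expected_error(joint):
    # Expected TV error of answering with pi(x_q | x_l = v_l),
    # where the frozen value v_l is distributed according to pi(x_l).
    pq = joint.sum(axis=1)
    pl = joint.sum(axis=0)
    return sum(pl[b] * tv_distance(joint[:, b] / pl[b], pq)
               for b in range(joint.shape[1]))

# Hypothetical joint over a binary query variable and a binary latent variable.
joint = np.array([[0.30, 0.10], [0.15, 0.45]])
print(tv_influence(joint), expected_error(joint))
```

For an independent joint (an outer product of two marginals) both quantities are zero, matching the observation that neglecting an independent variable costs nothing.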
We are now justi\ufb01ed in selecting variables in proportion to their in\ufb02uence\nto reduce the error they induce in the query marginal. For example, if a variable\u2019s in\ufb02uence score is\nzero this also means that there is no cost incurred from neglecting that variable (if a query renders\nvariables statistically independent of the query variable then these variables will be correctly ignored\nunder the in\ufb02uence based sampling procedure).\nNote, however, that computing either \u03b9tv or the mutual information is as dif\ufb01cult as inference itself.\nThus, we de\ufb01ne a computationally ef\ufb01cient variant of in\ufb02uence that we term the in\ufb02uence trail score.\nThe idea is to approximate the true in\ufb02uence as a product of factors along an active trail in the graph.\nDe\ufb01nition 2 (In\ufb02uence Trail Score). Let \u03c1 = (x0, x1,\u00b7\u00b7\u00b7 , xr) be an active trail between the query\nvariable xq and xi where x0 = xq and xr = xi. Let \u03c6(xi, xj) be the approximate joint distribution\nbetween xi and xj according only to the mutual factors in their scopes. Let \u03c6(xi) = \\sum_{x_j} \\phi(x_i, x_j)\nbe a marginal distribution. The in\ufb02uence trail score with respect to an active trail \u03c1 is\n\n\\[ \\tau_\\rho(x_q, x_i) := \\prod_{i=1}^{r-1} f(\\phi_i(x_i, x_{i+1}), \\phi_i(x_i)\\phi_i(x_{i+1})) \\tag{9} \\]\n\nThe in\ufb02uence trail score is ef\ufb01cient to compute because all factors and variables outside the mutual\nscopes of each variable pair are ignored. 
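
A minimal sketch of how an influence trail score could be computed with f set to total variation is given below. Several choices here are assumptions rather than details from the paper: the function names are hypothetical, each pairwise factor is normalized on its own to obtain the approximate joint, the product runs over every edge of the trail, and the query variable itself is given a score of 1.

```python
import numpy as np

def pairwise_tv_influence(factor):
    # Normalize the shared pairwise factor into an approximate joint phi(x_i, x_j),
    # then measure its TV distance from the product of its own marginals.
    phi = factor / factor.sum()
    outer = np.outer(phi.sum(axis=1), phi.sum(axis=0))
    return 0.5 * np.abs(phi - outer).sum()

def influence_trail_score(trail_factors):
    # Product of pairwise influences along the trail (Definition 2, f = total variation).
    score = 1.0
    for factor in trail_factors:
        score *= pairwise_tv_influence(factor)
    return score

def selection_distribution(scores):
    # Variable selection distribution p proportional to the scores.
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()

# Hypothetical linear chain x_q - x_1 - x_2 with one shared pairwise factor.
factor = np.array([[2.0, 1.0], [1.0, 2.0]])
scores = [1.0,                                      # query variable (assumed score)
          influence_trail_score([factor]),          # trail x_q - x_1
          influence_trail_score([factor, factor])]  # trail x_q - x_1 - x_2
print(selection_distribution(scores))
```

Because each pairwise influence is at most 1, the score shrinks with every additional edge, so variables far from the query are selected less often under this sketch.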
In the experimental results we evaluate both the in\ufb02uence\nand the in\ufb02uence trail and \ufb01nd that they perform similarly and outperform competing graph-based\nheuristics for determining p.\nWhile in general it is dif\ufb01cult to uniformly state that one choice of p converges faster than another\nfor all models and queries, we present the following analysis showing that even an approximate\nquery aware sampler can exhibit faster \ufb01nite time convergence progress than an exact sampler. Let\nK be an exact MCMC kernel that converges to the correct stationary distribution \u03c0_K and let L be an\napproximate kernel that exclusively samples the query variable and thus converges to the conditional\ndistribution of the query variable, \u03c0_L. We now assume an ergodic scheme for the two samplers where\nthe convergence rates are geometrically bounded from above and below by constants \u03b3l and \u03b3k:\n\n\\[ \\| \\pi_0 L^t - \\pi_L \\|_{tv} = \\Theta(\\gamma_l^t) \\tag{10} \\]\n\\[ \\| \\pi_0 K^t - \\pi_K \\|_{tv} = \\Theta(\\gamma_k^t) \\tag{11} \\]\n\nBecause L only samples the query variable, the dimensionality of L\u2019s state space is much smaller\nthan K\u2019s state space, and we will assume that L converges more quickly to its own invariant distribu-\ntion, that is, \u03b3l \u226a \u03b3k. Extrapolating Proposition 1, we know that the error incurred from neglecting\nto sample the latent variables is the in\ufb02uence \u03b9tv between the joint distribution of the latent variables\nand the query variable. Observe that L is simultaneously making progress towards two distributions:\nits own invariant distribution and the invariant distribution of K plus an error term. If the error term\n\u03b9tv is suf\ufb01ciently small then we can write the following inequality:\n\n\\[ \\gamma_l^t + \\iota_{tv} \\leq \\gamma_k^t \\tag{12} \\]\n\nWe want this inequality to hold for as many time steps as possible. The amount of time that L (the\nquery only kernel) is closer to K\u2019s stationary distribution \u03c0_K can be determined by solving for t,\nyielding the \ufb01xed point iteration:\n\n\\[ t = \\frac{\\log(\\gamma_l^t + \\iota_{tv})}{\\log \\gamma_k} \\tag{13} \\]\n\nThe one-step approximation yields a non-trivial, but conservative bound: t \u2265 log(\u03b3l + \u03b9tv) / log \u03b3k. Thus, for a\nsuf\ufb01ciently small error, t can be positive. This implies that the strategy of exclusively sampling the\nquery variables can achieve faster short-term convergence to the correct invariant distribution even\nthough asymptotic convergence is to the incorrect invariant distribution. Indeed, we observe this\nphenomenon experimentally in Section 5.\n\n4 Related Work\n\nDespite the prevalence of probabilistic queries, the machine learning and statistics communities\nhave devoted little attention to the problem of query-speci\ufb01c inference. The only existing papers\nof which we are aware both build on loopy belief propagation [3, 4]; however, for many inference\nproblems, MCMC is a preferred alternative to loopy BP because it is (1) able to obtain arbitrarily close\napproximations to the true marginals and (2) better able to scale to models with large or real-valued\nvariable domains that are necessary for state-of-the-art results in data integration [9], information\nextraction [8], and deep vision tasks with many latent layers [11].\nTo the best of our knowledge, this paper is one of the \ufb01rst to propose a query-aware sampling\nstrategy for MCMC in either the machine learning or statistics community. The decayed MCMC\nalgorithm for \ufb01ltering [12] can be thought of as a special case of our method where the model is a\nlinear chain, and the query is for the last variable in the sequence. That paper proves \ufb01nite mixing\ntime bounds on in\ufb01nitely long sequences. 
In contrast we are interested in arbitrarily shaped graphs\nand in the practical consideration of large \ufb01nite models. MCMC has also recently been deployed\nin probabilistic databases [13] where it is possible to incorporate the deterministic constraints of a\nrelational algebra query directly into a Metropolis-Hastings proposal distribution to obtain quicker\nanswers [14, 15].\nA related idea from statistics is data augmentation (or auxiliary variable) approaches to sampling\nwhere latent variables are arti\ufb01cially introduced into the model to improve convergence of the orig-\ninal variables (e.g., Swendsen-Wang [16] and slice sampling [17]). In this setting, we see QAM\nas a way of determining a more sophisticated variable selection strategy that can balance sampling\nefforts between the original and auxiliary variables.\n\n5 Experiments\n\nIn this section we demonstrate the effectiveness and broad applicability of query aware MCMC\n(QAM) by demonstrating superior convergence rates to the query marginals across a diverse range\nof graphical models that vary widely in structure. In our experiments, we generate a wide range\nof graphical models and evaluate the convergence of each chain exactly, avoiding noisy empirical\nsampling error by performing exact computations with full transition kernels.\nWe evaluate the following query-aware samplers:\n1. Polynomial graph distance 1 (QAM-Poly1): p(xi)\u221dd(xq, xi)\u2212N , where d is shortest path;\n2. In\ufb02uence - Exact mutual information (QAM-MI): p(xi)\u221dI(xq, xi);\n3. In\ufb02uence - total variation distance (QAM-TV): p(xi)\u221d\u03b9tv(xq, xi);\n4. In\ufb02uence trail score - total variation (QAM-TV): p(xi) set according to Equation 9;\nand two baseline samplers\n7. Traditional Metropolis-Hastings (Uniform): p(xi)\u221d1;\n8. 
Query-only Metropolis-Hastings (qo): p(xi) = 1(xq = xi);\non six different graphical models with varying parameters generated from a Beta(2,2) distribution\n(this ensures an interesting dynamic range over the event space):\n1. Independent - each variable is statistically independent\n2. Linear chain - a linear-chain CRF (used in NLP and information extraction)\n3. Hoop - same as linear chain plus additional factor to close the loop\n4. Grid - or Ising model, used in statistical physics and vision\n5. Fully connected PW - each pair of variables has a pairwise factor\n6. Fully connected - every variable is connected through a single factor\nMirroring many real-world conditional random \ufb01eld applications, the non-unary factors (connect-\ning more than one variable) are generated from the same factor-template and thus share the same\nparameters (each generated from log(Beta(2,2))). Each variable has a corresponding observation\nfactor whose parameters are not shared and randomly set according to log(Beta(2,2))/2.\nFor our experiments we randomly generate ten parameter settings for each of the six model types\nand measure convergence of the six chains to the single-variable marginal query \u03c0(xq) for each\nvariable in each of the sixty realized models. Convergence is measured using the total variation\nnorm: \u2016\u03c0(xq) \u2212 \u03c0^(t)(xq)\u2016tv. In this set of experiments we do not wish to introduce empirical sam-\npling error so we generate models with nine variables per graph enabling us to (1) exactly compute\nthe answer to the marginal query, (2) fully construct the 2^n \u00d7 2^n transition matrices, and (3) alge-\nbraically compute the time t distributions for each chain \u03c0^(t) = \u03c00 K_MH^t given an initial uniform\ndistribution \u03c00(x) = 2^(\u22129).\nWe display marginal convergence results in Figure 1. 
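
The exact evaluation procedure described above (build the full transition matrix, then propagate the initial distribution) can be sketched as follows. This is a hypothetical reconstruction under stated assumptions, not the authors' code: the function names, the three-variable toy weights, and the uniform single-site value proposal are all chosen here for illustration.

```python
import itertools
import numpy as np

def mh_kernel(weights, p):
    # weights[s] is the unnormalized probability of joint binary state s (a tuple).
    states = list(itertools.product([0, 1], repeat=len(p)))
    index = {s: k for k, s in enumerate(states)}
    K = np.zeros((len(states), len(states)))
    for s in states:
        for i, p_i in enumerate(p):
            for value in (0, 1):
                r = s[:i] + (value,) + s[i + 1:]
                propose = p_i * 0.5  # pick variable i, then a uniform value
                accept = min(1.0, weights[r] / weights[s])
                K[index[s], index[r]] += propose * accept
                K[index[s], index[s]] += propose * (1.0 - accept)
    return states, K

def query_tv_over_time(weights, p, query, steps):
    # Exact TV distance between the time-t query marginal and the true marginal.
    states, K = mh_kernel(weights, p)
    pi = np.array([weights[s] for s in states], dtype=float)
    pi /= pi.sum()
    to_query = np.array([[1.0 if s[query] == v else 0.0 for v in (0, 1)]
                         for s in states])
    dist = np.full(len(states), 1.0 / len(states))  # uniform initial distribution
    errors = []
    for _ in range(steps):
        dist = dist @ K
        errors.append(0.5 * np.abs(dist @ to_query - pi @ to_query).sum())
    return errors

# Hypothetical three-variable chain model with a bias on the query variable x_0.
weights = {s: (2.0 if s[0] == s[1] else 1.0)
              * (2.0 if s[1] == s[2] else 1.0)
              * (3.0 if s[0] == 1 else 1.0)
           for s in itertools.product([0, 1], repeat=3)}

uniform = query_tv_over_time(weights, [1 / 3, 1 / 3, 1 / 3], 0, 50)
weighted = query_tv_over_time(weights, [0.6, 0.3, 0.1], 0, 50)
print(uniform[:5])
print(weighted[:5])
```

Because everything is computed with matrix products, the resulting curves contain no sampling noise; the cost is a 2^n by 2^n matrix, which is why this approach only suits very small models.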
Generally, all the query speci\ufb01c sampling\nchains converge more quickly than the uniform baseline in the early iterations across every model.\nIt is interesting to compare the convergence rates of the various QAM approaches at different time\nstages. The query-only and mutual information chains exhibit the most rapid convergence in the early\nstages of sampling, with the query-only chain converging to an incorrect distribution, and the mutual\ninformation chain slowly converging during the later time stages. While QAM-TV exhibits similar\nconvergence patterns to the polynomial chains, QAM-TV slightly outperforms them in the more\nconnected models (grid and fully-connected-pw). Finally, notice that the in\ufb02uence-trail variant of\ntotal variation in\ufb02uence converges at a similar rate to the actual total variation in\ufb02uence, and in some\ncases converges more quickly (e.g., in the grid and the latter stages of the full pairwise model).\nIn the next experiment, we demonstrate how the size of the graphical model affects conver-\ngence of the various chains.\nIn particular, we plot the convergence of all chains on six dif-\nferent hoop-structured models containing three, four, six, eight, ten, and twelve variables (Fig-\nure 2). Again, the results are averaged over ten randomly generated graphs, but this time we plot\nthe advantage over the uniform kernel. That is, we measure the difference in convergence rates\n\n\\[ \\| \\pi - \\pi_0 K_{Unif}^t \\|_{tv} - \\| \\pi - \\pi_0 K_{QAM}^t \\|_{tv} \\]\n\nso that points above the zero line mean the QAM is closer to\nthe answer than the uniform baseline and points below the line mean the QAM is further from the\nanswer. As expected, increasing the number of variables in the graph increases the opportunities for\nquery speci\ufb01c sampling and thus increases QAM\u2019s advantage over traditional MCMC.\n\n6 Conclusion\n\nIn this paper we presented a query-aware approach to MCMC, motivated by the need to answer\nqueries over large scale graphical models. We found that the query-aware sampling methods outper-\nform the traditional Metropolis Hastings sampler across all models in the early time steps. Further,\nas the number of variables in the models increases, the query aware samplers not only outperform the\nbaseline for longer periods of time, but also exhibit more dramatic convergence rate improvements.\nThus, query speci\ufb01c sampling is a promising approach for approximately answering queries on real-\nworld probabilistic databases (and relational models) that contain billions of variables. Successfully\ndeploying QAM in this setting will require algorithms for ef\ufb01ciently constructing and sampling the\nvariable selection distribution. An exciting area of future work is to combine query speci\ufb01c sam-\npling with adaptive MCMC techniques allowing the kernel to evolve in response to the underlying\ndistribution. Further, more rapid convergence could be obtained by mixing the kernels in a way\nthat combines the strength of each: some kernels converge quickly in the early stages of sampling\nwhile others converge more quickly in the later stages, thus together they could provide a very pow-\nerful query speci\ufb01c inference tool. There has been little theoretical work on analyzing marginal\nconvergence of MCMC chains and future work can help develop these tools.\n\nFigure 1: Convergence to the query marginals of the stationary distribution from an initial uniform\ndistribution.\n\nFigure 2: Improvement over uniform p as the number of variables increases. 
Above the zero line\nis an improvement in marginal convergence, and below is worse than the baseline. As the number of\nvariables increases, the improvements of the query speci\ufb01c techniques increase.\n\n[Figure 1 panels: Independent; Linear Chain; Hoop; Grid; Fully Connected (PW); Fully Connected. Horizontal axis: Time. Vertical axis: Total variation distance. Legend: Uniform, Query-only, QAM-Poly1, QAM-MI, QAM-TV, QAM-TV-G.]\n\n[Figure 2 panels: 3, 4, 6, 8, 10, and 12 Variables. Horizontal axis: Time. Vertical axis: Improvement over uniform. Legend: Uniform, Query-only, QAM-Poly1, QAM-Poly2, QAM-MI, QAM-TV, QAM-TV-G.]\n\n7 Acknowledgements\n\nThis work was supported in part by the Center for Intelligent Information Retrieval, in part by\nIARPA via DoI/NBC contract #D11PC20152, in part by IARPA and AFRL contract #FA8650-10-\nC-7060, and in part by UPenn NSF medium IIS-0803847. The U.S. Government is authorized to\nreproduce and distribute reprints for Governmental purposes notwithstanding any copyright annota-\ntion thereon. Any opinions, \ufb01ndings and conclusions or recommendations expressed in this material\nare the authors\u2019 and do not necessarily re\ufb02ect those of the sponsor. The authors would also like to\nthank Alexandre Passos and Benjamin Marlin for useful discussion.\n\nReferences\n[1] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M.\nHellerstein. Graphlab: A new parallel framework for machine learning. In Conference on\nUncertainty in Arti\ufb01cial Intelligence (UAI), Catalina Island, California, July 2010.\n\n[2] Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. Large-scale\ncross-document coreference using distributed inference and hierarchical models. In Associa-\ntion for Computational Linguistics: Human Language Technologies (ACL HLT), 2011.\n\n[3] Arthur Choi and Adnan Darwiche. Focusing generalizations of belief propagation on targeted\nqueries. In Association for the Advancement of Arti\ufb01cial Intelligence (AAAI), 2008.\n\n[4] Anton Chechetka and Carlos Guestrin. Focused belief propagation for query-speci\ufb01c inference.\nIn International Conference on Arti\ufb01cial Intelligence and Statistics (AI STATS), 2010.\n\n[5] Nilesh Dalvi and Dan Suciu. The dichotomy of conjunctive queries on probabilistic structures.\nTechnical Report 0612102, University of Washington, 2007.\n\n[6] Prithviraj Sen, Amol Deshpande, and Lise Getoor. Exploiting shared correlations in proba-\nbilistic databases. In Very Large Data Bases (VLDB), 2008.\n\n[7] Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein.\nBayesStore: Managing large, uncertain data repositories with probabilistic graphical models.\nIn Very Large Data Bases (VLDB), 2008.\n\n[8] Hoifung Poon and Pedro Domingos. Joint inference in information extraction. In Association\nfor the Advancement of Arti\ufb01cial Intelligence, pages 913\u2013918, Vancouver, Canada, 2007.\n\n[9] Aron Culotta, Michael Wick, Robert Hall, and Andrew McCallum. First-order probabilistic\nmodels for coreference resolution. In Human Language Technology Conf. of the North Ameri-\ncan Chapter of the Assoc. of Computational Linguistics (HLT/NAACL), pages 81\u201388, 2007.\n\n[10] Adrian Barbu and Song Chun Zhu. Generalizing Swendsen-Wang to sampling arbitrary pos-\nterior probabilities. IEEE Trans. Pattern Anal. Mach. 
Intell., 27(8):1239\u20131253, 2005.\n\n[11] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In International\nConference on Artificial Intelligence and Statistics (AI STATS), 2009.\n\n[12] Bhaskara Marthi, Hanna Pasula, Stuart Russell, and Yuval Peres. Decayed MCMC filtering.\nIn Conference on Uncertainty in Artificial Intelligence (UAI), pages 319\u2013326, 2002.\n\n[13] Michael Wick, Andrew McCallum, and Gerome Miklau. Scalable probabilistic databases with\nfactor graphs and MCMC. In Very Large Data Bases (VLDB), pages 794\u2013804, 2010.\n\n[14] Michael Wick, Andrew McCallum, and Gerome Miklau. Representing uncertainty in probabilistic\ndatabases with scalable factor graphs. Master\u2019s thesis, University of Massachusetts,\nproposed September 2008 and submitted April 2009.\n\n[15] Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, Joseph M. Hellerstein, and\nMichael L. Wick. Hybrid in-database inference for declarative information extraction. In Proceedings\nof the 2011 international conference on Management of data, SIGMOD \u201911, pages\n517\u2013528, New York, NY, USA, 2011. ACM.\n\n[16] R.H. Swendsen and J.S. Wang. Nonuniversal critical dynamics in MC simulations. Phys. Rev.\nLett., 58(2):86\u201388, 1987.\n\n[17] Radford Neal. Slice sampling. Annals of Statistics, 31:705\u2013767, 2000.\n", "award": [], "sourceid": 1397, "authors": [{"given_name": "Michael", "family_name": "Wick", "institution": null}, {"given_name": "Andrew", "family_name": "McCallum", "institution": null}]}