Collective Graphical Models

Advances in Neural Information Processing Systems, pages 1161-1169

Daniel Sheldon
Oregon State University
sheldon@eecs.oregonstate.edu

Thomas G. Dietterich
Oregon State University
tgd@eecs.oregonstate.edu

Abstract

There are many settings in which we wish to fit a model of the behavior of individuals but where our data consist only of aggregate information (counts or low-dimensional contingency tables). This paper introduces Collective Graphical Models, a framework for modeling and probabilistic inference that operates directly on the sufficient statistics of the individual model. We derive a highly efficient Gibbs sampling algorithm for sampling from the posterior distribution of the sufficient statistics conditioned on noisy aggregate observations, prove its correctness, and demonstrate its effectiveness experimentally.

1 Introduction

In fields such as ecology, marketing, and the social sciences, data about identifiable individuals are rarely available, either because of privacy issues or because of the difficulty of tracking individuals over time.
Far more readily available are aggregated data in the form of counts or low-dimensional contingency tables. Despite the fact that only aggregated data are available, researchers often seek to build models and test hypotheses about individual behavior. One way to build a model connecting individual-level behavior to aggregate data is to explicitly model each individual in the population, together with the aggregation mechanism that yields the observed data.

However, with large populations it is infeasible to reason about each individual. Luckily, for many purposes it is also unnecessary. To fit a probabilistic model of individual behavior, we only need the sufficient statistics of that model. This paper introduces a formalism in which one starts with a graphical model describing the behavior of individuals, and then derives a new graphical model, the Collective Graphical Model (CGM), on the sufficient statistics of a population drawn from that model. Remarkably, the CGM has a structure similar to that of the original model.

This paper is devoted to the problem of inference in CGMs, where the goal is to calculate conditional probabilities over the sufficient statistics given partial observations made at the population level. We consider both an exact observation model, where subtables of the sufficient statistics are observed directly, and a noisy observation model, where these counts are corrupted. A primary application is learning: for example, computing the expected value of the sufficient statistics comprises the "E" step of an EM algorithm for learning the individual model from aggregate data.

Main concepts. The ideas behind CGMs are best illustrated by an example.
Figure 1(a) shows the graphical model plate notation for the bird migration model from [1, 2], in which birds transition stochastically among a discrete set of locations (say, grid cells on a map) according to a Markov chain (the individual model). The variable X^m_t denotes the location of the m-th bird at time t, and birds are independent and identically distributed. This model gives an explicit way to reason about the interplay between individual-level behavior (inside the plate) and aggregate data. Suppose, for example, that very accurate surveys reveal the number of birds n_t(i) in each location i at each time t, and these numbers are collected into a single vector n_t for each time step. Then, for example, one can compute the likelihood of the survey data given parameters of the individual model by summing out the individual variables. However, this is highly impractical: if our map has L grid cells, then the variable elimination algorithm run on this model would instantiate tabular potentials of size L^M.

Figure 1: Collective graphical model of bird migration: (a) replicates of individual model connected to population-level observations, (b) CGM after marginalizing away individuals, (c) trellis graph on locations {i, j} for T = 3, M = 10; numbers on edges indicate flow amounts, (d) a degree-one cycle; flows remain non-negative for δ ∈ {−3, . . . , 1}, (e) a degree-two cycle; flows remain non-negative for δ ∈ {−2, . . . , 1}.

Figure 1(b) shows the CGM for this model, which we obtain by analytically marginalizing away the individual variables to get a new model on their sufficient statistics, which are the tables n_{t,t+1} with entries n_{t,t+1}(i, j) equal to the number of birds that fly from i to j between times t and t+1. A much better inference approach would be to conduct variable elimination or message passing directly in the CGM. However, this would still instantiate potentials that are much too big for realistic problems due to the huge state space: e.g., there are \binom{M+L^2−1}{L^2−1} = O(M^{L^2−1}) possible values for the table n_{t,t+1}.

Instead, we will perform approximate inference using MCMC. Here, we are faced with yet another challenge: the CGM has hard constraints encoded into its distribution, and our MCMC moves must preserve these constraints yet still connect the state space. To understand this, observe that the hidden variables in this example comprise a flow of M units through the trellis graph of the Markov chain, with the interpretation that n_{t,t+1}(i, j) birds "flow" along edge (i, j) at time t (see Figure 1(c) and [1]). The constraints are that (1) flow is conserved at each trellis node, and (2) the number of birds that enter location i at time t equals the observed number n_t(i). (In the case of noisy or partial observations, the latter constraint may not be present.)

How can we design a set of moves that connect any two M-unit flows while preserving these constraints? The answer is to make moves that send flow around cycles. Cycles of the form illustrated in Figure 1(d) preserve flow conservation but change the amount of flow through some trellis nodes. Cycles of the form in Figure 1(e) preserve both constraints. One can show by graph-theoretic arguments that moves of these two general classes are enough to connect any two flows.

This gives us the skeleton of an ergodic MCMC sampler: starting with a feasible flow, select cycles from these two classes uniformly at random and propose moves that send δ units of flow around the cycle. There is one unassuming but crucially important final question: how to select δ?
The following is a form of Gibbs sampler: from all values of δ that preserve non-negativity, select δ with probability proportional to that of the new flow. Such moves are always accepted. Remarkably, even though δ may take on as many as M different values, the resulting distribution over δ has an extremely tractable form (either binomial or hypergeometric), and thus it is possible to select δ in constant time, so we can make very large moves in time independent of the population size.

Contributions. This paper formally develops these concepts in a way that generalizes the construction of Figure 1 to allow arbitrary graphical models inside the plate, and a more general observation model that includes both noisy observations and observations involving multiple variables. We develop an efficient Gibbs sampler to conduct inference in CGMs that builds on existing work for conducting exact tests in contingency tables and makes several novel technical contributions. Foremost is the analysis of the distribution over the move size δ, which we show to be a discrete univariate distribution that generalizes both the binomial and hypergeometric distributions. In particular, we prove that it is always log-concave [3], so it can be sampled in constant expected running time. We show empirically that the resulting inference algorithm runs in time that is independent of the population size, and is dramatically faster than alternate approaches.

Related Work. The bird migration model of [1, 2] is a special case of CGMs where the individual model is a Markov chain and observations are made for single variables only.
That work considered only maximum a posteriori (MAP) inference; the method of this paper could be used for learning in that application. Sampling methods for exact tests in contingency tables (e.g., [4]) generate tables with the same sufficient statistics as an observed table. Our work differs in that our observations are not sufficient, and we are sampling the sufficient statistics instead of the complete contingency table. Diaconis and Sturmfels [5] broadly introduced the concept of Markov bases, which are sets of moves that connect the state space when sampling from conditional distributions by MCMC. We construct a Markov basis in Section 3.1 based on work of Dobra [6]. Lauritzen [7] discusses the problem of exact tests in nested decomposable models, a setup that is similar to ours. Inference in CGMs can be viewed as a form of lifted inference [8-12]. The counting arguments used to derive the CGM distribution (see below) are similar to the operations of counting elimination [9] and counting conversion [10] used in exact lifted inference algorithms for first-order probabilistic models. However, those algorithms do not replicate the CGM construction when applied to a first-order representation of the underlying population model. For example, when applied to the bird migration model, the C-FOVE algorithm of Milch et al. [10] cannot introduce contingency tables over pairs of variables (X_t, X_{t+1}) as required to represent the sufficient statistics; it can only introduce histograms over single variables X_t. Apsel and Brafman [13] have recently taken a step in this direction by introducing a lifting operation to construct the Cartesian product of two first-order formulas. In the applications we are considering, exact inference (even when lifted) is intractable.

2 Problem Setup

Let (X_1, X_2, . . .
, X_{|V|}) be a set of discrete random variables indexed by the finite set V, where X_v takes values in the set 𝒳_v. Let x = (x_1, . . . , x_{|V|}) denote a joint setting for these variables from the set 𝒳 = 𝒳_1 × · · · × 𝒳_{|V|}. For our individual model, we consider graphical models of the form:

    p(x) = (1/Z) ∏_{C ∈ C} φ_C(x_C).   (1)

Here, C is the set of cliques of the independence graph, the functions φ_C : 𝒳_C → R_+ are potentials, and Z is a normalization constant. For A ⊂ V, we use the notation x_A to indicate the sub-vector of variables with indices belonging to A, and use similar notation for the corresponding domain 𝒳_A. We also assume that p(x) > 0 for all x ∈ 𝒳, which is required for our sampler to be ergodic. Models that fail this restriction can be modified by adding a small positive amount to each potential.

A collection A is a set of subsets of V. For collections A and B, define A ⪯ B to mean that each A ∈ A is contained in some B ∈ B. A collection A is decomposable if there is a junction tree T = (A, E(T)) on vertex set A [7]. Any collection A can be extended to a decomposable collection B such that A ⪯ B; this corresponds to adding fill-in edges to a graphical model.

Consider a sample {x^(1), . . . , x^(M)} from the graphical model. A contingency table n = (n(i))_{i ∈ 𝒳} has entries n(i) = Σ_{m=1}^{M} I{x^(m) = i} that count the number of times each element i ∈ 𝒳 appears in the sample. We use index variables such as i, j ∈ 𝒳 (instead of x ∈ 𝒳) to refer to cells of the contingency table, where i = (i_1, . . . , i_{|V|}) is a vector of indices and i_A is the sub-vector corresponding to A ⊆ V. Let tbl(A) denote the set of all valid contingency tables on the domain 𝒳_A.
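As a concrete illustration of these definitions, the short sketch below (ours, not the paper's; the sample and variable sizes are hypothetical) builds a full table n ∈ tbl(V) by counting joint settings and computes marginal tables n ↓ A as axis sums:

```python
import numpy as np

def contingency_table(samples, sizes):
    """n(i) = number of sample points equal to i, for i in X_1 x ... x X_|V|."""
    n = np.zeros(sizes, dtype=int)
    for x in samples:
        n[tuple(x)] += 1
    return n

def marginal(n, A):
    """The marginal table n 'down' A: sum out every axis not in A."""
    drop = tuple(v for v in range(n.ndim) if v not in A)
    return n.sum(axis=drop)

# Hypothetical sample of M = 4 individuals over three binary variables.
samples = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
n = contingency_table(samples, (2, 2, 2))
n_12 = marginal(n, {0, 1})   # table over (X_1, X_2)
```

With A = ∅ the same function returns the grand total M, matching the convention above.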
A valid table is indexed by elements i_A ∈ 𝒳_A and has non-negative integer entries. For a full table n ∈ tbl(V) and A ⊆ V, let the marginal table n ↓ A ∈ tbl(A) be defined as

    (n ↓ A)(i_A) = Σ_{m=1}^{M} I{x^(m)_A = i_A} = Σ_{i_B ∈ 𝒳_{V∖A}} n(i_A, i_B).

When A = ∅, define n ↓ A to be the scalar M, the grand total of the table. Write n_A ⪯ n_B to mean that n_A is a marginal table of n_B (i.e., A ⊆ B and n_A = n_B ↓ A).

Our observation model is as follows. We assume that a sample {x^(1), . . . , x^(M)} is drawn from the individual model, resulting in a complete, but unobserved, contingency table n_V. We then observe the marginal tables n_D = n_V ↓ D for each set D in a collection of observed margins D, which we require to be decomposable. Write this overall collection of tables as n_D = {n_D}_{D ∈ D}. We consider noisy observations in Section 3.3.

Building the CGM. In a discrete graphical model, the sufficient statistics are the contingency tables n_C = {n_C}_{C ∈ C} over cliques. Our approach relies on the ability to derive a tractable probabilistic model for these statistics by marginalizing out the sample. If C is decomposable, this is possible, so let us assume that C has a junction tree T_C (if not, fill-in edges must be added to the original model). Let μ_C be the table of marginal probabilities for clique C (i.e., μ_C(i_C) = Pr(X_C = i_C)). Let S be the collection of separators of T_C (with repetition if the same set appears as a separator multiple times), and let n_S and μ_S be the tables of counts and marginal probabilities for the separator S ∈ S. The distribution of n_C was first derived by Sundberg [14]:

    p(n_C) = M! ( ∏_{C ∈ C} ∏_{i_C ∈ 𝒳_C} μ_C(i_C)^{n_C(i_C)} / n_C(i_C)! ) ( ∏_{S ∈ S} ∏_{i_S ∈ 𝒳_S} μ_S(i_S)^{n_S(i_S)} / n_S(i_S)! )^{−1},   (2)

which can be understood as a product of multinomial distributions corresponding to a sampling scheme for n_C (details omitted). It is this distribution that we call the collective graphical model; the parameters are the marginal probabilities of the individual model. To understand the conditional distribution given the observations, let us further assume that D ⪯ C (if not, add additional fill-in edges for variables that co-occur within D), so that each observed table is determined by some clique table. Write n_D ⪯ n_C to express the condition that the tables n_C produce the observations n_D: formally, this means that D ⪯ C and that D ⊆ C implies n_D ⪯ n_C. Let I{·} be an indicator variable. Then

    p(n_C | n_D) ∝ p(n_C, n_D) = p(n_C) I{n_D ⪯ n_C}.   (3)

In general, the number of contingency tables over even small sets of variables leads to huge state spaces that prohibit exact inference schemes using (2) and (3). Thus, our approach is based on Gibbs sampling. However, there are two constraints that significantly complicate sampling. First, the clique tables must match the observations (i.e., n_D ⪯ n_C). Second, implicit in (2) is the constraint that the tables n_C must be consistent in the sense that they are the sufficient statistics of some sample; otherwise p(n_C) = 0.

Definition 1. Refer to the set of contingency tables n_A = {n_A}_{A ∈ A} as a configuration. A configuration is (globally) consistent if there exists n_V ∈ tbl(V) such that n_A = n_V ↓ A for all A ∈ A.

Consistency requires, for example, that any two tables must agree on their common marginal, which yields the flow conservation constraints in the bird migration model. Table entries must be carefully updated in concert to maintain these constraints.
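As a sanity check on Equation (2) and the consistency requirement, one can verify for a tiny model that (2) sums to one over all consistent configurations. The sketch below (our illustration; the chain and numbers are hypothetical) does this for a chain X1-X2-X3 with cliques {1,2}, {2,3} and separator {2}:

```python
import itertools
import math
import numpy as np

M = 3
# Hypothetical individual model: binary chain X1 -> X2 -> X3.
p1 = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.2, 0.8]])   # transition matrix
mu12 = p1[:, None] * T                    # Pr(X1, X2), clique {1,2}
mu2 = mu12.sum(axis=0)                    # Pr(X2), separator {2}
mu23 = mu2[:, None] * T                   # Pr(X2, X3), clique {2,3}

def tables(total):
    """All 2x2 non-negative integer tables with the given grand total."""
    for a, b, c in itertools.product(range(total + 1), repeat=3):
        d = total - a - b - c
        if d >= 0:
            yield np.array([[a, b], [c, d]])

def term(mu, n):
    """prod_i mu(i)^n(i) / n(i)!  -- one factor of Equation (2)."""
    t = 1.0
    for p, k in zip(mu.flat, n.flat):
        t *= p ** int(k) / math.factorial(int(k))
    return t

total = 0.0
for n12 in tables(M):
    for n23 in tables(M):
        # Local consistency on the separator X2.
        if np.array_equal(n12.sum(axis=0), n23.sum(axis=1)):
            n2 = n12.sum(axis=0)
            total += math.factorial(M) * term(mu12, n12) * term(mu23, n23) / term(mu2, n2)
```

Summing over the locally consistent pairs (n12, n23) yields 1.0 up to floating-point error, as Equation (2) requires.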
A full discussion follows.

3 Inference

Our goal is to develop a sampler for p(n_C | n_D) given the observed tables n_D. We assume that the CGM specified in Equations (1) and (2) satisfies D ⪯ C, and that the configuration n_D is consistent.

Initialization. The first step is to construct a valid initial value for n_C, which must be a globally consistent configuration satisfying n_D ⪯ n_C. Doing so without instantiating huge intermediate tables requires a careful sequence of operations on the two junction trees T_C and T_D. We state one key theorem, but defer the full algorithm, which is lengthy and technical, to the supplement.

Theorem 1. Let A be a decomposable collection with junction tree T_A. Say that the configuration n_A is locally consistent if it agrees on edges of T_A, i.e., if n_A ↓ S = n_B ↓ S for all (A, B) ∈ E(T_A) with S = A ∩ B. If n_A is locally consistent, then it is also globally consistent.

In the bird migration example, Theorem 1 guarantees that preserving flow conservation is enough to maintain consistency. It is structurally equivalent to the "junction tree theorem" (e.g., [15]), which asserts that marginal probability tables {μ_A}_{A ∈ A} that are locally consistent are realizable as the marginals of some joint distribution p(x). Like that result, Theorem 1 also has a constructive proof, which is the foundation for our initialization algorithm. However, the integrality requirements of contingency tables necessitate a different style of construction.

3.1 Markov Basis

The first key challenge in designing the MCMC sampler is constructing a set of moves that preserve the constraints mentioned above, yet still connect any two points in the support of the distribution. Such a set of moves is called a Markov basis [5].

Definition 2.
A set of moves M is a Markov basis for the set F if, for any two configurations n, n′ ∈ F, there is a sequence of moves z_1, . . . , z_L ∈ M such that: (i) n′ = n + Σ_{ℓ=1}^{L} z_ℓ, and (ii) n + Σ_{ℓ=1}^{L′} z_ℓ ∈ F for all L′ = 1, . . . , L − 1.

In our problem, the set we wish to connect is the support of p(n_C | n_D). Our positivity assumption on p(x) implies that any consistent configuration n_C has positive probability, and thus the support of p(n_C | n_D) is exactly the set of consistent configurations that match the observations:

    F_{n_D} = {n_C : n_C is consistent and n_D ⪯ n_C}.

It is useful at this point to think of the configuration n_C as a vector obtained by sorting the table entries in any consistent fashion (e.g., lexicographically first by C ∈ C and then by i_C ∈ 𝒳_C). A move can then be expressed as n′_C = n_C + z, where z is an integer-valued vector of the same dimension as n_C that may have negative entries.

The Dobra Markov basis for complete tables. Dobra [6] showed how to construct a Markov basis for moves in a complete contingency table given a decomposable set of margins. Specifically, let A be decomposable and let n_A be consistent with ∪A = V, so that each variable is part of an observed margin. Define F*_{n_A} = {n_V ∈ tbl(V) : n_A ⪯ n_V}. Dobra gave a Markov basis for F*_{n_A} consisting of only degree-two moves:

Definition 3. Let (A, S, B) be a partition of V. A degree-two move z has two positive entries and two negative entries:

    z(i, j, k) = 1,  z(i, j, k′) = −1,  z(i′, j, k) = −1,  z(i′, j, k′) = 1,   (4)

where i ≠ i′ ∈ 𝒳_A, j ∈ 𝒳_S, k ≠ k′ ∈ 𝒳_B. Let M_{d=2}(A, S, B) be the set of all degree-two moves generated from this partition.

These are extensions of the well-known "swap moves" for two-dimensional contingency tables (e.g., [5]) to the subtable n(·, j, ·), and they can be visualized as a 2 × 2 sign pattern: +1 in cells (i, k) and (i′, k′), and −1 in cells (i, k′) and (i′, k). In this arrangement, it is clear that any such move preserves the marginal table n_A (row sums) and the marginal table n_B (column sums); in other words, z ↓ A = 0 and z ↓ B = 0. Moreover, because j is fixed, it is straightforward to see that z ↓ A ∪ S = 0 and z ↓ B ∪ S = 0. The cycle in Figure 1(e) is a degree-two move on the table n_{1,2}, with A = {X_1}, S = ∅, B = {X_2}.

Theorem 2 (Dobra [6]). Let A be decomposable with ∪A = V. Let M*_A be the union of the sets of degree-two moves M_{d=2}(A, S, B), where S is a separator of T_A and (A, S, B) is the corresponding decomposition of V. Then M*_A is a Markov basis for F*_{n_A}.

Adaptation of the Dobra basis to F_{n_D}. We now adapt the Dobra basis to our setting. Consider a complete table n ∈ tbl(V) and the configuration n_C = {n ↓ C}_{C ∈ C}. Because marginalization is a linear operation, there is a linear operator A such that n_C = A n_V. Moreover, F_{n_A} is the image of F*_{n_A} under A. Thus, the image of the Dobra basis under A is a Markov basis for F_{n_A}.

Lemma 1. Let M*_A be a Markov basis for F*_{n_A}. Then M_A = {Az : z ∈ M*_A} is a Markov basis for F_{n_A}. We call M_A the projected Dobra basis.

Proof. Let n_C, n′_C ∈ F_{n_A}. By consistency, there exist n_V, n′_V ∈ F*_{n_A} such that n_C = A n_V and n′_C = A n′_V. There is a sequence of moves z_1, . . . , z_L ∈ M*_A leading from n_V to n′_V, meaning that n′_V = n_V + Σ_{ℓ=1}^{L} z_ℓ. By applying the linear operator A to both sides of this equation, we have that n′_C = n_C + Σ_{ℓ=1}^{L} A z_ℓ. Furthermore, each intermediate configuration n_C + Σ_{ℓ=1}^{L′} A z_ℓ = A(n_V + Σ_{ℓ=1}^{L′} z_ℓ) ∈ F_{n_A}. Thus M_A = {Az : z ∈ M*_A} is a Markov basis for F_{n_A}.

Locality of moves.
First consider the case where all variables are part of some observed table, as in Dobra's setting. The practical message so far is that to sample from p(n_C | n_D), it suffices to generate moves from the projected Dobra basis M_D. This is done by first selecting a degree-two move z ∈ M*_D, and then marginalizing z onto each clique of C. Naively, it appears that a single move may require us to update each clique. However, we will show that z ↓ C will be zero for many cliques, a fact we can exploit to implement moves more efficiently. Let (A, S, B) be the partition used to generate z. We deduce from the discussion following Definition 3 that z ↓ C = 0 unless C has a nonempty intersection with both A and B, so we may restrict our attention to these cliques, which form a connected subtree (Proposition S.1 in the supplementary material). An implementation can then exploit this by pre-computing the connected subtrees for each separator and only generating the necessary components of the move. Algorithm 1 gives the details of generating moves.

Algorithm 1: The projected Dobra basis M_A
Input: junction tree T_A with separators S_A
Before sampling:
  1. For each S ∈ S_A, find the associated decomposition (A, S, B).
  2. Find the cliques C ∈ C that have non-empty intersection with both A and B. These form a subtree of T_C; denote them by C_S, and let V_S = ∪C_S.
  3. Let A_S = A ∩ V_S and B_S = B ∩ V_S.
During sampling, to generate a move for separator S ∈ S_A:
  4. Select z ∈ M_{d=2}(A_S, S, B_S).
  5. For each clique C ∈ C_S, calculate z ↓ C.

Unobserved variables. Let us now consider settings where some variables are not part of any observed table, which may happen when the individual model has hidden variables, or, later, with noisy observations. Additional moves are needed to connect two configurations that disagree on marginal tables involving unobserved variables. Several approaches are possible. All require the introduction of degree-one moves z ∈ M_{d=1}(A, B), which partition the variables into two sets (A, B) and have two nonzero entries z(i, j) = 1, z(i′, j) = −1 for i ≠ i′ ∈ 𝒳_A, j ∈ 𝒳_B. In the parlance of two-dimensional tables, these moves adjust two entries in a single column, so they preserve the column sums (n_B) but modify the row sums (n_A). The cycle in Figure 1(d) is a degree-one move which adjusts the marginal table over A = {X_2} but preserves the marginal table over B = {X_1, X_3}. We proceed once again by constructing a basis for complete tables and then marginalizing the moves onto cliques.

Theorem 3. Let U be any decomposable collection on the set of unobserved variables U = V ∖ ∪D, and let D′ = D ∪ U. Let M* consist of the moves M*_{D′} together with the moves M_{d=1}(A, V ∖ A) for each A ∈ U. Then M* is a Markov basis for F*_{n_D}, and M = {Az : z ∈ M*} is a Markov basis for F_{n_D}.

Theorem 3 is proved in the supplementary material. The degree-one moves also become local upon marginalization: it is easy to check that z ↓ C is zero unless C ∩ A is nonempty. These cliques also form a connected subtree. We recommend choosing U by restricting T_C to the variables in U. This has the effect of adding degree-one moves for each clique of C. By matching the structure of T_C, many of the additional degree-two moves become zero upon marginalization.

3.2 Constructing an efficient MCMC sampler

The second key challenge in constructing the MCMC sampler is utilizing the moves from the Markov basis in a way that efficiently explores the state space.
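The marginal-preservation properties of the two move types above are easy to verify numerically. A small sketch (ours; the 2×2×2 table with axes ordered as (A, S, B) is hypothetical):

```python
import numpy as np

# Degree-two move (Equation (4)) on a table indexed by (i, j, k) for the
# partition (A, S, B): +1 at (i,j,k) and (i',j,k'), -1 at (i,j,k') and (i',j,k).
z2 = np.zeros((2, 2, 2), dtype=int)
i, ip, j, k, kp = 0, 1, 0, 0, 1
z2[i, j, k] = z2[ip, j, kp] = 1
z2[i, j, kp] = z2[ip, j, k] = -1

# Degree-one move on the partition (A, B): +1 at (i, j), -1 at (i', j).
z1 = np.zeros((2, 2), dtype=int)
z1[i, j], z1[ip, j] = 1, -1
```

Summing z2 over the B axis gives the zero table (z ↓ A ∪ S = 0), and likewise over the A axis (z ↓ B ∪ S = 0); for z1, the column sums are zero (n_B preserved) while the row sums are (1, −1) (n_A modified), exactly as described above.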
A standard approach is to select a random move z, a direction δ = ±1 (each with probability 1/2), and then propose the move n_C + δz in a Metropolis-Hastings sampler. Although these moves are enough to connect any two configurations, we are particularly interested in problems where M is large, for which moving by increments of ±1 will be prohibitively slow.

For general Markov bases, Diaconis and Sturmfels [5] suggest instead to construct a Gibbs sampler that uses the moves as directions for longer steps, by choosing the value of δ from the following distribution:

    p(δ) ∝ p(n_C + δz | n_D),  δ ∈ {δ : n_C + δz ≥ 0}.   (5)

Lemma 2 (Adapted from Diaconis and Sturmfels [5]). Let M be a Markov basis for F_{n_D}. Consider the Markov chain with moves δz generated by first choosing z uniformly at random from M and then choosing δ according to (5). This is a connected, reversible, aperiodic Markov chain on F_{n_D} with stationary distribution p(n_C | n_D).

However, it is not obvious how to sample from p(δ). They suggest running a Markov chain in δ, again having the property of moving in increments of one (see also [16]). In our case, the support of p(δ) may be as big as the population size M, so this solution remains unsatisfactory.

Fortunately, p(δ) has several properties that allow us to create a very efficient sampling algorithm. For a separator S ∈ S, define z_S as z_C ↓ S for any clique C containing S. Now let C(z) be the set of cliques C for which z_C is nonzero, and let S(z) be defined analogously. For A ∈ S ∪ C, let I+(z_A) ⊆ 𝒳_A be the indices of the +1 entries of z_A and let I−(z_A) be the indices of the −1 entries. By ignoring constant terms in (2), we can write (5) as

    p(δ) ∝ ∏_{C ∈ C(z)} p_C(δ) ∏_{S ∈ S(z)} p_S(δ)^{−1},   (6)

    p_A(δ) := ∏_{i ∈ I+(z_A)} μ_A(i)^δ / (n_A(i) + δ)!  ∏_{j ∈ I−(z_A)} μ_A(j)^{−δ} / (n_A(j) − δ)!,  A ∈ S ∪ C.   (7)

To maintain the non-negativity of n_C, δ is restricted to the support δ_min, . . . , δ_max with:

    δ_min := − min_{C ∈ C(z), i ∈ I+(z_C)} n_C(i),  δ_max := min_{C ∈ C(z), j ∈ I−(z_C)} n_C(j).   (8)

Notably, each move in our basis satisfies |I+(z_A) ∪ I−(z_A)| ≤ 4, so p(δ) can be evaluated by examining at most four entries in each table for cliques in C(z).

Algorithm 2: Sampling from p(δ) in constant time
Input: move z and current configuration n_C, with |C(z)| > 1
  1. Calculate δ_min and δ_max using (8).
  2. Extend the function f(δ) := log p(δ) to the real line using the equality n! = Γ(n + 1) in Equation (7) for each constituent function f_A(δ) := log p_A(δ), A ∈ S(z) ∪ C(z).
  3. Use the logarithm of Equation (6) to evaluate f(δ) (for sampling) and its derivatives (for Newton's method): f^(q)(δ) = Σ_{C ∈ C(z)} f^(q)_C(δ) − Σ_{S ∈ S(z)} f^(q)_S(δ), for q = 0, 1, 2. Evaluate the derivatives of f_A(δ) using the logarithm of Equation (7) and the digamma and trigamma functions ψ(n) = (d/dn) log Γ(n) and ψ_1(n) = (d²/dn²) log Γ(n).
  4. Find the mode δ* by first using Newton's method to find δ_0 maximizing f(δ) over the real line, and then letting δ* be the value in {⌊δ_0⌋, ⌈δ_0⌉, δ_min, δ_max} that attains the maximum.
  5. Run the rejection sampling algorithm of Devroye [3].

Figure 2: Top: running time vs. M for a small CGM. Bottom: convergence of MCMC for random Bayes nets.
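The support computation in Equation (8) is simple to implement. A sketch, under the assumption (ours, not the paper's) that a move is stored as sparse (cell, sign) pairs per clique:

```python
def delta_range(move, tables):
    """move: {clique: [(cell, sign), ...]}; tables: {clique: {cell: count}}.
    Returns (delta_min, delta_max) from Equation (8)."""
    d_min = -min(tables[C][cell] for C, entries in move.items()
                 for cell, sign in entries if sign > 0)
    d_max = min(tables[C][cell] for C, entries in move.items()
                for cell, sign in entries if sign < 0)
    return d_min, d_max

# Hypothetical clique table and a degree-two move touching one clique.
tables = {"C1": {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 5}}
move = {"C1": [((0, 0), +1), ((1, 1), +1), ((0, 1), -1), ((1, 0), -1)]}
```

Here δ_min = −min(3, 5) = −3 and δ_max = min(1, 1) = 1, so only four table entries are examined, as noted above.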
It is worth noting that Equation (7) reduces to the binomial distribution for degree-one moves and the (noncentral) hypergeometric distribution for degree-two moves, so we may sample from these distributions directly when |C(z)| = 1. More importantly, we will now show that p(δ) is always a member of the log-concave class of distributions, which are unimodal and can be sampled very efficiently.

Definition 4. A discrete distribution {p_k} is log-concave if p_k² ≥ p_{k−1} p_{k+1} for all k [3].

Theorem 4. For any degree-one or degree-two move z, the distribution p(δ) is log-concave.

It is easy to show that both p_C(δ) and p_S(δ) are log-concave. The proof of Theorem 4, which is found in the supplementary material, then pairs each separator S with a clique C and uses properties of the moves to show that p_C(δ)/p_S(δ) is also log-concave. Then, by Equation (6), we see that p(δ) is a product of log-concave distributions, which is also log-concave.

We have implemented the rejection sampling algorithm of Devroye [3], which applies to any discrete log-concave distribution and is simple to implement. The expected number of times it evaluates p(δ) (up to normalization) is fewer than 5. We must also provide the mode of the distribution, which we find by Newton's method, usually taking only a few steps. The running time for each move is thus independent of the population size. Additional details are given in Algorithm 2.

3.3 Noisy Observations

Population-level counts from real survey data are rarely exact, and it is thus important to incorporate noisy observations into our model.
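Log-concavity in the sense of Definition 4 also matters for the noise models considered next, and it is easy to check numerically for the two special cases of Equation (7) noted above. A sketch (our check; the parameters are hypothetical):

```python
import math

def is_log_concave(p):
    """Definition 4: p_k^2 >= p_{k-1} p_{k+1}, with a small float tolerance."""
    return all(p[k]**2 >= p[k-1] * p[k+1] - 1e-12 for k in range(1, len(p) - 1))

# Binomial(n, q): the degree-one case.
n, q = 20, 0.3
binom = [math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)]

# Hypergeometric(N, K, m): the degree-two case (central version, for simplicity).
N, K, m = 30, 12, 10
hyper = [math.comb(K, k) * math.comb(N - K, m - k) / math.comb(N, m)
         for k in range(max(0, m - (N - K)), min(K, m) + 1)]
```

Both pmfs sum to one and pass the log-concavity check, consistent with Theorem 4.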
In this section, we describe how to modify the sampler for the case when all observations are noisy; it is a straightforward generalization to allow both noisy and exact observations. Suppose that we make noisy observations y_R = {y_R : R ∈ R} corresponding to the true marginal tables n_R for a collection R ⪯ C (that need not be decomposable). For simplicity, we restrict our attention to models where each entry n of the true table is corrupted independently according to a univariate noise model p(y | n).

We assume that the noise model is log-concave, meaning in this case that log p(y | n) is a concave function of the parameter n. Most commonly-used univariate densities are log-concave with respect to various parameters [17]. A canonical example from the bird migration model is p(y | n) = Poisson(αn), so the survey count is Poisson with mean proportional to the true number of birds present. This example and others are discussed in [2]. We also assume that the support of p(y | n) does not depend on n, so that observations do not restrict the support of the sampling distribution. For example, we must modify our Poisson noise model to be p(y | n) = Poisson(αn + λ_0) with a small background rate λ_0 to avoid the hard constraint that n must be positive whenever y is positive.

In analogy with (3), we can then write p(n_C | y_R) ∝ p(n_C) p(y_R | n_C); the hard constraint is now replaced with the likelihood term p(y_R | n_C). Given our assumption on p(y | n), the support of p(n_C | y_R) is the same as the support of p(n_C), and a Markov basis can be constructed using the tools from Section 3.1, with all variables treated as unobserved. In the sampler, the expression for p(δ) must now be updated to incorporate the likelihood term p(y_R | n_C + δz).
Following reasoning similar to before, we let R(z) be the sets in R for which z ↓ R is nonzero and find that Equation (6) gains the additional factor ∏_{R∈R(z)} pR(δ), where

pR(δ) = ∏_{i∈I+(zR)} p(yR(i) | nR(i) + δ) ∏_{j∈I−(zR)} p(yR(j) | nR(j) − δ).   (9)

Each factor in (9) is log-concave in δ by our assumption on p(y | n), and hence the overall distribution p(δ) remains log-concave. To update the sampler for p(δ), modify line 3 of Algorithm 2 in the obvious fashion to include these new factors when computing log p(δ) and its derivatives.

4 Experiments

We implemented our sampler in MATLAB using Murphy's Bayes net toolbox [18] for the underlying operations on graphical models and junction trees. Figure 2 (top) compares the running time of our method vs. exact inference in the CGM by variable elimination (VE) for a very small model. The task was to estimate E[n2,3 | n1, n3] in the bird migration model for L = 2, T = 3, and varying M. The running time of VE is O(M^{L^2−1}), which is cubic in M (linear on a log-log plot), while the time for our method to estimate the same quantity within 2% relative error actually decreases slightly with population size. Figure 2 (bottom) shows convergence of the sampler for more complex models. We generated 30 random Bayes nets on 10 binary variables, and generated two sets of observed tables for a population of M = 100,000: the set NODES has a table for each single variable, while the set CHAIN has tables for pairs of variables that are adjacent in a random ordering. We repeated the same process with the noise model p(y | n) = Poisson(0.2n + 0.1) to generate noisy observations. We then ran our sampler to estimate E[nC | nD] as would be done in the EM algorithm. The plots show relative error in this estimate as a function of time, averaged over the 30 nets.
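The noisy-observation step used in these experiments, corrupting each true table entry n with y ~ Poisson(0.2n + 0.1), can be sketched as follows (the true counts are hypothetical, and a simple textbook Poisson generator stands in for a library routine):

```python
import random
from math import exp

def poisson(lam, rng):
    """Sample Poisson(lam) by Knuth's product-of-uniforms method
    (adequate for the modest rates used in this illustration)."""
    L, k, p = exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p < L:
            return k - 1

rng = random.Random(0)
true_counts = [120, 45, 200, 8]   # hypothetical entries of a true marginal table
noisy = [poisson(0.2 * n + 0.1, rng) for n in true_counts]
print(noisy)                      # each entry is roughly 0.2 times the true count
```

Because the background rate 0.1 keeps the support of p(y | n) independent of n, the sampler's Markov basis is unaffected by which noisy values are observed.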
For more details, including how we derived the correct answer for comparison, see Section 4.1 in the supplementary material. The sampler converged quickly in all cases, with the more complex CHAIN observation model taking longer than NODES, and noisy observations taking slightly longer than exact ones. We found (not shown) that the biggest source of variability in convergence time was due to individual Bayes nets, while repeat trials using the same net demonstrated very similar behavior.

Concluding Remarks. An important area of future research is to further explore the use of CGMs within learning algorithms, as well as the limitations of that approach: when is it possible to learn individual models from aggregate data? We believe that the ability to model noisy observations will be an indispensable tool in real applications. For complex models, convergence may be difficult to diagnose. Some mixing results are known for samplers in related problems with hard constraints [16]; any such results for our model would be a great advance. The use of distributional approximations for the CGM model and other methods of approximate inference also hold promise.

Acknowledgments. We thank Lise Getoor for pointing out the connection between CGMs and lifted inference. This research was supported in part by grant DBI-0905885 from the NSF.

References

[1] D. Sheldon, M. A. S. Elmohamed, and D. Kozen. Collective inference on Markov models for modeling bird migration. In Advances in Neural Information Processing Systems (NIPS 2007), pages 1321–1328, Cambridge, MA, 2008. MIT Press.

[2] Daniel Sheldon. Manipulation of PageRank and Collective Hidden Markov Models. PhD thesis, Cornell University, 2009.

[3] L. Devroye. A simple generator for discrete log-concave distributions. Computing, 39(1):87–91, 1987.

[4] A. Agresti. A survey of exact inference for contingency tables.
Statistical Science, 7(1):131–153, 1992.

[5] P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. The Annals of Statistics, 26(1):363–397, 1998.

[6] A. Dobra. Markov bases for decomposable graphical models. Bernoulli, 9(6):1093–1108, 2003.

[7] S.L. Lauritzen. Graphical Models. Oxford University Press, USA, 1996.

[8] D. Poole. First-order probabilistic inference. In Proc. IJCAI, volume 18, pages 985–991, 2003.

[9] R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. Introduction to Statistical Relational Learning, page 433, 2007.

[10] B. Milch, L.S. Zettlemoyer, K. Kersting, M. Haimes, and L.P. Kaelbling. Lifted probabilistic inference with counting formulas. In Proc. 23rd AAAI, pages 1062–1068, 2008.

[11] P. Sen, A. Deshpande, and L. Getoor. Bisimulation-based approximate lifted inference. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 496–505. AUAI Press, 2009.

[12] J. Kisynski and D. Poole. Lifted aggregation in directed first-order probabilistic models. In Proc. IJCAI, volume 9, pages 1922–1929, 2009.

[13] Udi Apsel and Ronen Brafman. Extended lifted inference with joint formulas. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 11–18, Corvallis, Oregon, 2011. AUAI Press.

[14] R. Sundberg. Some results about decomposable (or Markov-type) models for multidimensional contingency tables: distribution of marginals and partitioning of tests. Scandinavian Journal of Statistics, 2(2):71–79, 1975.

[15] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[16] P. Diaconis, S. Holmes, and R.M.
Neal. Analysis of a nonreversible Markov chain sampler. The Annals of Applied Probability, 10(3):726–752, 2000.

[17] W.R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society, Series C (Applied Statistics), 41(2):337–348, 1992.

[18] K. Murphy. The Bayes net toolbox for MATLAB. Computing Science and Statistics, 33(2):1024–1034, 2001.