{"title": "Distributed Parameter Estimation in Probabilistic Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1700, "page_last": 1708, "abstract": "This paper presents foundational theoretical results on distributed parameter estimation for undirected probabilistic graphical models. It introduces a general condition on composite likelihood decompositions of these models which guarantees the global consistency of distributed estimators, provided the local estimators are consistent.", "full_text": "Distributed Parameter Estimation\nin Probabilistic Graphical Models\n\nYariv D. Mizrahi1 Misha Denil2 Nando de Freitas2,3,4\n\n1University of British Columbia, Canada\n2University of Oxford, United Kingdom\n3Canadian Institute for Advanced Research\n\n4Google DeepMind\n\nyariv@math.ubc.ca\n\n{misha.denil,nando}@cs.ox.ac.uk\n\nAbstract\n\nThis paper presents foundational theoretical results on distributed parameter es-\ntimation for undirected probabilistic graphical models.\nIt introduces a general\ncondition on composite likelihood decompositions of these models which guaran-\ntees the global consistency of distributed estimators, provided the local estimators\nare consistent.\n\n1\n\nIntroduction\n\nUndirected probabilistic graphical models, also known as Markov Random Fields (MRFs), are a\nnatural framework for modelling in networks, such as sensor networks and social networks [24, 11,\n20]. In large-scale domains there is great interest in designing distributed learning algorithms to\nestimate parameters of these models from data [27, 13, 19]. Designing distributed algorithms in\nthis setting is challenging because the distribution over variables in an MRF depends on the global\nstructure of the model.\nIn this paper we make several theoretical contributions to the design of algorithms for distributed\nparameter estimation in MRFs by showing how the recent works of Liu and Ihler [13] and of Mizrahi\net al. 
[19] can both be seen as special cases of distributed composite likelihood. Casting these two\nworks in a common framework allows us to transfer results between them, strengthening the results\nof both works.\nMizrahi et al. introduced a theoretical result, known as the LAP condition, to show that it is possible\nto learn MRFs with untied parameters in a fully-parallel but globally consistent manner. Their result\nled to the construction of a globally consistent estimator, whose cost is linear in the number of cliques\nas opposed to exponential as in centralised maximum likelihood estimators. While remarkable, their\nresults apply only to a speci\ufb01c factorisation, with the cost of learning being exponential in the size of\nthe factors. While their factors are small for lattice-MRFs and other models of low degree, they can\nbe as large as the original graph for other models, such as fully-observed Boltzmann machines [1]. In\nthis paper, we introduce the Strong LAP Condition, which characterises a large class of composite\nlikelihood factorisations for which it is possible to obtain global consistency, provided the local\nestimators are consistent. This much stronger condition enables us to construct linear and globally\nconsistent distributed estimators for a much wider class of models than Mizrahi et al., including\nfully-connected Boltzmann machines.\nUsing our framework we also show how the asymptotic theory of Liu and Ihler applies more gener-\nally to distributed composite likelihood estimators. 
In particular, the Strong LAP Condition provides\na sufficient condition to guarantee the validity of a core assumption made in the theory of Liu and\nIhler, namely that each local estimate for the parameter of a clique is a consistent estimator of the\ncorresponding clique parameter in the joint distribution. By applying the Strong LAP Condition to\nverify the assumption of Liu and Ihler, we are able to import their M-estimation results into the LAP\nframework directly, bridging the gap between LAP and consensus estimators.\n\nFigure 1: Left: A simple 2d-lattice MRF to illustrate our notation. For node j = 7 we have N(xj) =\n{x4, x8}. Centre left: The 1-neighbourhood of the clique q = {x7, x8} including additional edges\n(dashed lines) present in the marginal over the 1-neighbourhood. Factors of this form are used by\nthe LAP algorithm of Mizrahi et al. Centre right: The MRF used by our conditional estimator of\nSection 5 when using the same domain as Mizrahi et al. Right: A smaller neighbourhood which\nwe show is also sufficient to estimate the clique parameter of q.\n\n2 Background\n\nOur goal is to estimate the D-dimensional parameter vector \u03b8 of an MRF with the following Gibbs\ndensity or mass function:\n\n$$p(x \\mid \\theta) = \\frac{1}{Z(\\theta)} \\exp\\Big(-\\sum_{c} E(x_c \\mid \\theta_c)\\Big) \\qquad (1)$$\n\nHere c \u2208 C is an index over the cliques of an undirected graph G = (V, E), E(xc | \u03b8c) is known as\nthe energy or Gibbs potential, and Z(\u03b8) is a normalising term known as the partition function.\nWhen $E(x_c \\mid \\theta_c) = -\\theta_c^{T} \\phi_c(x_c)$, where $\\phi_c(x_c)$ is a local sufficient statistic derived from the values\nof the local data vector xc, this model is known as a maximum entropy or log-linear model. 
In\nthis paper we do not restrict ourselves to a specific form for the potentials, leaving them as general\nfunctions; we require only that their parameters are identifiable. Throughout this paper we focus\non the case where the xj\u2019s are discrete random variables; however, generalising our results to the\ncontinuous case is straightforward.\nThe j-th node of G is associated with the random variable xj for j = 1, . . . , M, and the edge connecting\nnodes j and k represents the statistical interaction between xj and xk. By the Hammersley-Clifford\nTheorem [10], the random vector x satisfies the Markov property with respect to the graph\nG, i.e., $p(x_j \\mid x_{-j}) = p(x_j \\mid x_{N(x_j)})$ for all j, where $x_{-j}$ denotes all variables in x excluding xj,\nand $x_{N(x_j)}$ are the variables in the neighbourhood of node j (variables associated with nodes in G\ndirectly connected to node j).\n\n2.1 Centralised estimation\n\nThe standard approach to parameter estimation in statistics is through maximum likelihood, which\nchooses parameters \u03b8 by maximising\n\n$$L_{ML}(\\theta) = \\prod_{n=1}^{N} p(x_n \\mid \\theta) \\qquad (2)$$\n\n(To keep the notation light, we reserve n to index the data samples. In particular, xn denotes the\nn-th |V|-dimensional data vector and xmn refers to the n-th observation of node m.)\nThis estimator has played a central role in statistics as it has many desirable properties including\nconsistency, efficiency and asymptotic normality. However, applying maximum likelihood estimation\nto an MRF is generally intractable, since computing the value of $\\log L_{ML}$ and its derivative\nrequires evaluating the partition function and an expectation over the model, respectively. Both of\nthese values involve a sum over exponentially many terms.\nTo surmount this difficulty it is common to approximate p(x | \u03b8) as a product over more tractable\nterms. 
This approach is known as composite likelihood and leads to an objective of the form\n\n$$L_{CL}(\\theta) = \\prod_{n=1}^{N} \\prod_{i=1}^{I} f^{i}(x_n, \\theta^{i}) \\qquad (3)$$\n\nwhere \u03b8^i denotes the (possibly shared) parameters of each composite likelihood factor f^i.\nComposite likelihood estimators are both well studied and widely applied [6, 14, 12, 7, 16, 2, 22,\n4, 21]. In practice the f^i terms are chosen to be easy to compute, and are typically local functions,\ndepending only on some local region of the underlying graph G.\nAn early and influential variant of composite likelihood is pseudo-likelihood (PL) [3], where\nf^i(x, \u03b8^i) is chosen to be the conditional distribution of xi given its neighbours,\n\n$$L_{PL}(\\theta) = \\prod_{n=1}^{N} \\prod_{m=1}^{M} p(x_{mn} \\mid x_{N(x_m)n}, \\theta^{m}) \\qquad (4)$$\n\nSince the joint distribution has a Markov structure with respect to the graph G, the conditional\ndistribution for xm depends only on its neighbours, namely $x_{N(x_m)}$. In general more statistically\nefficient composite likelihood estimators can be obtained by blocking, i.e. choosing the f^i(x, \u03b8^i) to\nbe conditional or marginal likelihoods over blocks of variables, which may be allowed to overlap.\nComposite likelihood estimators are often divided into conditional and marginal variants, depending\non whether the f^i(x, \u03b8^i) are formed from conditional or marginal likelihoods. In machine learning\nthe conditional variant is quite popular [12, 7, 16, 15, 4] while the marginal variant has received less\nattention. In statistics, both the marginal and conditional variants of composite likelihood are well\nstudied (see the comprehensive review of Varin et al. [26]).\nAn unfortunate difficulty with composite likelihood is that the estimators cannot be computed in\nparallel, since elements of \u03b8 are often shared between the different factors. 
For a fixed value of \u03b8\nthe terms of $\\log L_{CL}$ decouple over data and over blocks of the decomposition; however, if \u03b8 is not\nfixed then the terms remain coupled.\n\n2.2 Consensus estimation\n\nSeeking greater parallelism, researchers have investigated methods for decoupling the sub-problems\nin composite likelihood. This leads to the class of consensus estimators, which perform parameter\nestimation independently in each composite likelihood factor. This approach results in parameters\nthat are shared between factors being estimated multiple times, and a final consensus step is required\nto force agreement between the solutions from separate sub-problems [27, 13].\nCentralised estimators enforce sub-problem agreement throughout the estimation process, requiring\nmany rounds of communication in a distributed setting. Consensus estimators allow sub-problems\nto disagree during optimisation, enforcing agreement as a post-processing step which requires only\na single round of communication.\nLiu and Ihler [13] approach distributed composite likelihood by optimising each term separately\n\n$$\\hat{\\theta}^{i}_{\\beta_i} = \\arg\\max_{\\theta_{\\beta_i}} \\prod_{n=1}^{N} f^{i}(x_{A_i,n}, \\theta_{\\beta_i}) \\qquad (5)$$\n\nwhere Ai denotes the group of variables associated with block i, and $\\theta_{\\beta_i}$ is the corresponding set of\nparameters. In this setting the sets $\\beta_i \\subseteq V$ are allowed to overlap, but the optimisations are carried\nout independently, so multiple estimates for overlapping parameters are obtained. Following Liu\nand Ihler we have used the notation $\\theta^{i} = \\theta_{\\beta_i}$ to make this interdependence between factors explicit.\nThe analysis of this setting proceeds by embedding each local estimator $\\hat{\\theta}^{i}_{\\beta_i}$ into a degenerate estimator\n$\\hat{\\theta}^{i}$ for the global parameter vector \u03b8 by setting $\\hat{\\theta}^{i}_{c} = 0$ for $c \\notin \\beta_i$. The degenerate estimators\nare combined into a single non-degenerate global estimate using different consensus operators, e.g.\nweighted averages of the $\\hat{\\theta}^{i}$.\nThe analysis of Liu and Ihler assumes that for each sub-problem i and for each $c \\in \\beta_i$\n\n$$(\\hat{\\theta}^{i}_{\\beta_i})_c \\xrightarrow{p} \\theta_c \\qquad (6)$$\n\ni.e., each local estimate for the parameter of clique c is a consistent estimator of the corresponding\nclique parameter in the joint distribution. This assumption does not hold in general, and one of the\ncontributions of this work is to give a general condition under which this assumption holds.\nThe analysis of Liu and Ihler [13] considers the case where the local estimators in Equation 5 are arbitrary\nM-estimators [25]; however, their experiments address only the case of pseudo-likelihood. In\nSection 5 we prove that the factorisation used by pseudo-likelihood satisfies Equation 6, explaining\nthe good results in their experiments.\n\n2.3 Distributed estimation\n\nConsensus estimation dramatically increases the parallelism of composite likelihood estimates by\nrelaxing the requirements on enforcing agreement between coupled sub-problems. Recently Mizrahi\net al. [19] have shown that if the composite likelihood factorisation is constructed correctly then\nconsistent parameter estimates can be obtained without requiring a consensus step.\nIn the LAP algorithm of Mizrahi et al. [19] the domain of each composite likelihood factor (which\nthey call the auxiliary MRF) is constructed by surrounding each maximal clique q with the variables\nin its 1-neighbourhood\n\n$$A_q = \\bigcup_{c \\cap q \\neq \\emptyset} c$$\n\nwhich contains all of the variables of q itself as well as the variables with at least one neighbour in\nq; see Figure 1 for an example. 
For MRFs of low degree the sets Aq are small, and consequently\nmaximum likelihood estimates for parameters of MRFs over these sets can be obtained efficiently.\nThe parametric form of each factor in LAP is chosen to coincide with the marginal distribution over\nAq.\nThe factorisation of Mizrahi et al. is essentially the same as in Equation 5, but the domain of each\nterm is carefully selected, and the LAP theorems are proved only for the case where f^i(xAq, \u03b8q) =\np(xAq, \u03b8q).\nAs in consensus estimation, parameter estimation in LAP is performed separately and in parallel for\neach term; however, agreement between sub-problems is handled differently. Instead of combining\nparameter estimates from different sub-problems, LAP designates a specific sub-problem as authoritative\nfor each parameter (in particular the sub-problem with domain Aq is authoritative for the\nparameter \u03b8q). The global solution is constructed by collecting parameters from each sub-problem\nfor which it is authoritative and discarding the rest.\nIn order to obtain consistency for LAP, Mizrahi et al. [19] assume that both the joint distribution and\neach composite likelihood factor are parametrised using normalised potentials.\nDefinition 1. A Gibbs potential E(xc | \u03b8c) is said to be normalised with respect to zero if E(xc | \u03b8c) =\n0 whenever there exists t \u2208 c such that xt = 0.\nA perhaps under-appreciated existence and uniqueness theorem [9, 5] for MRFs states that there\nexists one and only one potential normalised with respect to zero corresponding to a Gibbs distribution.\nThis result ensures a one-to-one correspondence between Gibbs distributions and normalised\npotential representations of an MRF.\nThe consistency of LAP relies on the following observation. 
Suppose we have a Gibbs distribution\np(xV | \u03b8) that factors according to the clique system C, and suppose that the parametrisation is\nchosen so that the potentials are normalised with respect to zero. For a particular clique of interest\nq, the marginal over xAq can be written as follows (see Appendix A for a detailed derivation)\n\n$$p(x_{A_q} \\mid \\theta) = \\frac{1}{Z(\\theta)} \\exp\\Big(-E(x_q \\mid \\theta_q) - \\sum_{c \\in C_q \\setminus \\{q\\}} E(x_c \\mid \\theta_{V \\setminus q})\\Big) \\qquad (7)$$\n\nwhere Cq denotes the clique system of the marginal, which in general includes cliques not present in\nthe joint. The same distribution can also be written in terms of different parameters \u03b1\n\n$$p(x_{A_q} \\mid \\alpha) = \\frac{1}{Z(\\alpha)} \\exp\\Big(-E(x_q \\mid \\alpha_q) - \\sum_{c \\in C_q \\setminus \\{q\\}} E(x_c \\mid \\alpha_c)\\Big) \\qquad (8)$$\n\nwhich are also assumed to be normalised with respect to zero. As shown in Mizrahi et al. [19], the\nuniqueness of normalised potentials can be used to obtain the following result.\nProposition 2 (LAP argument [19]). If the parametrisations of p(xV | \u03b8) and p(xAq | \u03b1) are chosen\nto be normalised with respect to zero, and if the parameters are identifiable with respect to the\npotentials, then \u03b8q = \u03b1q.\n\nThis proposition enables Mizrahi et al. [19] to obtain consistency for LAP under the standard\nsmoothness and identifiability assumptions for MRFs [8].\n\n3 Contributions of this paper\n\nThe strength of the results of Mizrahi et al. [19] is to show that it is possible to perform parameter\nestimation in a completely distributed way without sacrificing global consistency. They prove that\nthrough careful design of a composite likelihood factorisation it is possible to obtain estimates for\neach parameter of the joint distribution in isolation, without requiring even a final consensus step\nto enforce sub-problem agreement. 
Their weakness is that the LAP algorithm is very restrictive,\nrequiring a specific composite likelihood factorisation.\nThe strength of the results of Liu and Ihler [13] is that they apply in a very general setting (arbitrary\nM-estimators) and make no assumptions about the underlying structure of the MRF. On the other\nhand they assume the convergence in Equation 6, and do not characterise the conditions under which\nthis assumption holds.\nThe key to unifying these works is to notice that the specific decomposition used in LAP is chosen\nessentially to ensure the convergence of Equation 6. This leads to our development of the Strong\nLAP Condition and an associated Strong LAP Argument, which is a drop-in replacement for the LAP\nargument of Mizrahi et al. and holds for a much larger range of composite likelihood factorisations\nthan their original proof allows.\nSince the purpose of the Strong LAP Condition is to guarantee the convergence of Equation 6, we\nare able to import the results of Liu and Ihler [13] into the LAP framework directly, bridging the\ngap between LAP and consensus estimators. The same Strong LAP Condition also provides the\nnecessary convergence guarantee for the results of Liu and Ihler to apply.\nFinally we show how the Strong LAP Condition can lead to the development of new estimators, by\ndeveloping a new distributed estimator which subsumes the distributed pseudo-likelihood and gives\nestimates that are both consistent and asymptotically normal.\n\n4 Strong LAP argument\n\nIn this section we present the Strong LAP Condition, which provides a general condition under\nwhich the convergence of Equation 6 holds. This turns out to be intimately connected to the structure\nof the underlying graph.\nDefinition 3 (Relative Path Connectivity). Let G = (V, E) be an undirected graph, and let A be a\ngiven subset of V. We say that two nodes i, j \u2208 A are path connected with respect to V \\ A if there\nexists a path P = {i, s1, s2, . . . 
, sn, j} \u2260 {i, j} with none of the sk \u2208 A. Otherwise, we say that\ni, j are path disconnected with respect to V \\ A.\nFor a given A \u2286 V we partition the clique system of G into two parts, $C^{in}_A$ that contains all of the\ncliques that are a subset of A, and $C^{out}_A = C \\setminus C^{in}_A$ that contains the remaining cliques of G. Using\nthis notation we can write the marginal distribution over xA as\n\n$$p(x_A \\mid \\theta) = \\frac{1}{Z(\\theta)} \\exp\\Big(-\\sum_{c \\in C^{in}_A} E(x_c \\mid \\theta_c)\\Big) \\sum_{x_{V \\setminus A}} \\exp\\Big(-\\sum_{c \\in C^{out}_A} E(x_c \\mid \\theta_c)\\Big) \\qquad (9)$$\n\nFigure 2: (a) Illustrating the concept of relative path connectivity. Here, A = {i, j, k}. While (k, j)\nare path connected via {3, 4} and (k, i) are path connected via {2, 1, 5}, the pair (i, j) is path\ndisconnected with respect to V \\ A. (b)-(d) Illustrating the difference between LAP and Strong LAP.\n(b) Shows a star graph with q highlighted. (c) Shows Aq required by LAP. (d) Shows an alternative\nneighbourhood allowed by Strong LAP. Thus, if the root node is a response variable and the leaves\nare covariates, Strong LAP states we can estimate each parameter separately and consistently.\n\nUp to a normalisation constant, $\\sum_{x_{V \\setminus A}} \\exp\\big(-\\sum_{c \\in C^{out}_A} E(x_c \\mid \\theta_c)\\big)$ induces a Gibbs density (and\ntherefore an MRF) on A, which we refer to as the induced MRF. (For example, as illustrated in\nFigure 1 centre-left, the induced MRF involves all the cliques over the nodes 4, 5 and 9.) By the\nHammersley-Clifford theorem this MRF has a corresponding graph which we refer to as the induced\ngraph and denote GA. Note that the induced graph does not have the same structure as the marginal;\nit contains only edges which are created by summing over $x_{V \\setminus A}$.\nRemark 4. 
To work in the general case, we assume throughout that if an MRF contains the\npath {i, j, k} then summing over j creates the edge (i, k) in the marginal.\nProposition 5. Let A be a subset of V, and let i, j \u2208 A. The edge (i, j) exists in the induced graph\nGA if and only if i and j are path connected with respect to V \\ A.\nProof. If i and j are path connected then there is a path P = {i, s1, s2, . . . , sn, j} \u2260 {i, j} with\nnone of the sk \u2208 A. Summing over sk forms an edge (sk\u22121, sk+1). By induction, summing over\ns1, . . . , sn forms the edge (i, j).\nIf i and j are path disconnected with respect to V \\ A then summing over any s \u2208 V \\ A cannot\nform the edge (i, j) or i and j would be path connected through the path {i, s, j}. By induction, if\nthe edge (i, j) is formed by summing over s1, . . . , sn this implies that i and j are path connected via\n{i, s1, . . . , sn, j}, contradicting the assumption.\nCorollary 6. B \u2286 A is a clique in the induced graph GA if and only if all pairs of nodes in B are\npath connected with respect to V \\ A.\nDefinition 7 (Strong LAP Condition). Let G = (V, E) be an undirected graph and let q \u2208 C be a\nclique of interest. We say that a set A such that q \u2286 A \u2286 V satisfies the Strong LAP Condition for q\nif there exist i, j \u2208 q such that i and j are path-disconnected with respect to V \\ A.\nProposition 8. Let G = (V, E) be an undirected graph and let q \u2208 C be a clique of interest. If\nAq satisfies the Strong LAP Condition for q then the joint distribution p(xV | \u03b8) and the marginal\np(xAq | \u03b8) share the same normalised potential for q.\nProof. If Aq satisfies the Strong LAP Condition for q then by Corollary 6 the induced MRF contains\nno potential for q. Inspection of Equation 9 reveals that the same E(xq | \u03b8q) appears as a potential\nin both the marginal and the joint distributions. 
The result follows by uniqueness of the normalised\npotential representation.\n\nWe now restrict our attention to a set Aq which satisfies the Strong LAP Condition for a clique of\ninterest q. The marginal p(xAq | \u03b8) can be written as in Equation 9 in terms of \u03b8, or in terms of\nauxiliary parameters \u03b1\n\n$$p(x_{A_q} \\mid \\alpha) = \\frac{1}{Z(\\alpha)} \\exp\\Big(-\\sum_{c \\in C_q} E(x_c \\mid \\alpha_c)\\Big) \\qquad (10)$$\n\nwhere Cq is the clique system over the marginal. We will assume both parametrisations are normalised\nwith respect to zero.\nTheorem 9 (Strong LAP Argument). Let q be a clique in G and let q \u2286 Aq \u2286 V. Suppose p(xV | \u03b8)\nand p(xAq | \u03b1) are parametrised so that their potentials are normalised with respect to zero and the\nparameters are identifiable with respect to the potentials. If Aq satisfies the Strong LAP Condition\nfor q then \u03b8q = \u03b1q.\nProof. From Proposition 8 we know that p(xV | \u03b8) and p(xAq | \u03b8) share the same clique potential\nfor q. Alternatively we can write the marginal distribution as in Equation 10 in terms of auxiliary\nvariables \u03b1. By uniqueness, both parametrisations must have the same normalised potentials.\nSince the potentials are equal, we can match terms between the two parametrisations. In particular\nsince E(xq | \u03b8q) = E(xq | \u03b1q) we see that \u03b8q = \u03b1q by identifiability.\n\n4.1 Efficiency and the choice of decomposition\n\nTheorem 9 implies that distributed composite likelihood is consistent for a wide class of decompositions\nof the joint distribution; however, it does not address the issue of statistical efficiency.\nThis question has been studied empirically in the work of Meng et al. [17, 18], who introduce a\ndistributed algorithm for Gaussian random fields and consider neighbourhoods of different sizes.\nMeng et al. 
find the larger neighbourhoods produce better empirical results and the following theorem\nconfirms this observation.\nTheorem 10. Let A be a set of nodes which satisfies the Strong LAP Condition for q. Let $\\hat{\\theta}_A$ be the\nML parameter estimate of the marginal over A. If B is a superset of A and $\\hat{\\theta}_B$ is the ML parameter\nestimate of the marginal over B, then (asymptotically):\n\n$$|\\theta_q - (\\hat{\\theta}_B)_q| \\le |\\theta_q - (\\hat{\\theta}_A)_q|.$$\n\nProof. Suppose that $|\\theta_q - (\\hat{\\theta}_B)_q| > |\\theta_q - (\\hat{\\theta}_A)_q|$. Then the estimates $\\hat{\\theta}_A$ over the various subsets A\nof B improve upon the ML estimates of the marginal on B. This contradicts the Cram\u00e9r-Rao lower\nbound achieved by the ML estimate of the marginal on B.\nIn general the choice of decomposition implies a trade-off in computational and statistical efficiency.\nLarger factors are preferable from a statistical efficiency standpoint, but increase computation and\ndecrease the degree of parallelism.\n\n5 Conditional LAP\n\nThe Strong LAP Argument tells us that if we construct composite likelihood factors using marginal\ndistributions over domains that satisfy the Strong LAP Condition then the LAP algorithm of Mizrahi\net al. [19] remains consistent. In this section we show that more can be achieved.\nOnce we have satisfied the Strong LAP Condition we know it is acceptable to match parameters\nbetween the joint distribution p(xV | \u03b8) and the auxiliary distribution p(xAq | \u03b1). To obtain a consistent\nLAP algorithm from this correspondence all that is required is to have a consistent estimate\nof \u03b1q. Mizrahi et al. [19] achieve this by applying maximum likelihood estimation to p(xAq | \u03b1),\nbut any consistent estimator is valid.\nWe exploit this fact to show how the Strong LAP Argument can be applied to create a consistent\nconditional LAP algorithm, where conditional estimation is performed in each auxiliary MRF. 
This\nallows us to apply the LAP methodology to a broader class of models. For some models, such as\nlarge densely connected graphs, we cannot rely on the LAP algorithm of Mizrahi et al. [19]. For\nexample, for a restricted Boltzmann machine (RBM) [23], the 1-neighbourhood of any pairwise\nclique includes the entire graph. Hence, the complexity of LAP is exponential in the size of V.\nHowever, it is linear for conditional LAP, without sacrificing consistency.\nTheorem 11. Let q be a clique in G and let xj \u2208 q \u2286 Aq \u2286 V. If Aq satisfies the Strong LAP\nCondition for q then p(xV | \u03b8) and p(xj | xAq\\{xj}, \u03b1) share the same normalised potential for q.\nProof. We can write the conditional distribution of xj given Aq \\ {xj} as\n\n$$p(x_j \\mid x_{A_q \\setminus \\{x_j\\}}, \\theta) = \\frac{p(x_{A_q} \\mid \\theta)}{\\sum_{x_j} p(x_{A_q} \\mid \\theta)} \\qquad (11)$$\n\nBoth the numerator and the denominator of Equation 11 are Gibbs distributions, and can therefore\nbe expressed in terms of potentials over clique systems.\nSince Aq satisfies the Strong LAP Condition for q we know that p(xAq | \u03b8) and p(xV | \u03b8) have the\nsame potential for q. Moreover, the domain of $\\sum_{x_j} p(x_{A_q} \\mid \\theta)$ does not include q, so it cannot\ncontain a potential for q. We conclude that the potential for q in p(xj | xAq\\{xj}, \u03b8) must be shared\nwith p(xV | \u03b8).\nRemark 12. There exists a Gibbs representation normalised with respect to zero for\np(xj | xAq\\{xj}, \u03b8). Moreover, the clique potential for q is unique in that representation.\nExistence in the above remark is an immediate result of the existence of a normalised representation\nfor both the numerator and the denominator of Equation 11, and the fact that the difference\nof normalised potentials is a normalised potential. For uniqueness, first note that p(xAq | \u03b8) =\np(xj | xAq\\{xj}, \u03b8) p(xAq\\{xj}, \u03b8). The variable xj is not part of p(xAq\\{xj}, \u03b8) and hence this distribution\ndoes not contain the clique q. Suppose there were two different normalised representations\nfor the conditional p(xj | xAq\\{xj}, \u03b8). This would then imply two normalised representations for\nthe joint, which contradicts the fact that the joint has a unique normalised representation.\nWe can now proceed as in the original LAP construction from Mizrahi et al. [19]. For a clique of\ninterest q we find a set Aq which satisfies the Strong LAP Condition for q. However, instead of\ncreating an auxiliary parametrisation of the marginal we create an auxiliary parametrisation of the\nconditional in Equation 11.\n\n$$p(x_j \\mid x_{A_q \\setminus \\{x_j\\}}, \\alpha) = \\frac{1}{Z_j(\\alpha)} \\exp\\Big(-\\sum_{c \\in C_{A_q}} E(x_c \\mid \\alpha_c)\\Big) \\qquad (12)$$\n\nFrom Theorem 11 we know that E(xq | \u03b1q) = E(xq | \u03b8q). Equality of the parameters is also\nobtained, provided they are identifiable.\nCorollary 13. If Aq satisfies the Strong LAP Condition for q then any consistent estimator of \u03b1q in\np(xj | xAq\\{xj}, \u03b1) is also a consistent estimator of \u03b8q in p(xV | \u03b8).\n\n5.1 Connection to distributed pseudo-likelihood and composite likelihood\n\nTheorem 11 tells us that if Aq satisfies the Strong LAP Condition for q then to estimate \u03b8q in\np(xV | \u03b8) it is sufficient to have an estimate of \u03b1q in p(xj | xAq\\{xj}, \u03b1) for any xj \u2208 q. This tells\nus that it is sufficient to use pseudo-likelihood-like conditional factors, provided that their domains\nsatisfy the Strong LAP Condition. The following remark completes the connection by telling us\nthat the Strong LAP Condition is satisfied by the specific domains used in the pseudo-likelihood\nfactorisation.\nRemark 14. Let q = {x1, x2, . . . , xm} be a clique of interest, with 1-neighbourhood $A_q = q \\cup \\{N(x_i)\\}_{x_i \\in q}$. 
Then for any xj \u2208 q, the set q \u222a N(xj) satisfies the Strong LAP Condition for q.\nMoreover, q \u222a N(xj) satisfies the Strong LAP Condition for all cliques in the graph that contain xj.\nImportantly, to estimate every unary clique potential we need to visit each node in the graph. However,\nto estimate pairwise clique potentials, visiting all nodes is redundant because the parameters of\neach pairwise clique are estimated twice. If a parameter is estimated more than once it is reasonable\nfrom a statistical standpoint to apply a consensus operator to obtain a single estimate. The theory of\nLiu and Ihler tells us that the consensus estimates are consistent and asymptotically normal, provided\nEquation 6 is satisfied. In turn, the Strong LAP Condition guarantees the convergence in Equation 6.\nWe can go beyond pseudo-likelihood and consider either marginal or conditional factorisations over\nlarger groups of variables. Since the asymptotic results of Liu and Ihler [13] apply to any distributed\ncomposite likelihood estimator where the convergence of Equation 6 holds, it follows that\nany distributed composite likelihood estimator where each factor satisfies the Strong LAP Condition\n(including LAP and the conditional composite likelihood estimator from Section 5) immediately\ngains asymptotic normality and variance guarantees as a result of their work and ours.\n\n6 Conclusion\n\nWe presented foundational theoretical results for distributed composite likelihood. The results provide\nus with sufficient conditions to apply the results of Liu and Ihler to a broad class of distributed\nestimators. The theory also led us to the construction of a new globally consistent estimator, whose\ncomplexity is linear even for many densely connected graphs. We view extending these results to\nmodel selection, tied parameters, models with latent variables, and inference tasks as very important\navenues for future research.\n\nReferences\n[1] D. H. 
Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive\n\nScience, 9:147\u2013169, 1985.\n\n[2] A. Asuncion, Q. Liu, A. Ihler, and P. Smyth. Learning with blocks: Composite likelihood and contrastive\n\ndivergence. In Arti\ufb01cial Intelligence and Statistics, pages 33\u201340, 2010.\n\n[3] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical\n\nSociety, Series B, 36:192\u2013236, 1974.\n\n[4] J. K. Bradley and C. Guestrin. Sample complexity of composite likelihood. In Arti\ufb01cial Intelligence and\n\nStatistics, pages 136\u2013160, 2012.\n\n[5] P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, 2001.\n[6] B. Cox. Composite likelihood methods. Contemporary Mathematics, 80:221\u2013239, 1988.\n[7] J. V. Dillon and G. Lebanon. Stochastic composite likelihood. Journal of Machine Learning Research,\n\n11:2597\u20132633, 2010.\n\n[8] S. E. Fienberg and A. Rinaldo. Maximum likelihood estimation in log-linear models. The Annals of\n\nStatistics, 40(2):996\u20131023, 2012.\n\n[9] D. Griffeath. Introduction to random \ufb01elds. In Denumerable Markov Chains, volume 40 of Graduate\n\nTexts in Mathematics, pages 425\u2013458. Springer, 1976.\n\n[10] J. M. Hammersley and P. Clifford. Markov \ufb01elds on \ufb01nite graphs and lattices. 1971.\n[11] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press,\n\n2009.\n\n[12] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood\n\nestimators. In International Conference on Machine Learning, pages 584\u2013591, 2008.\n\n[13] Q. Liu and A. Ihler. Distributed parameter estimation via pseudo-likelihood. In International Conference\n\non Machine Learning, 2012.\n\n[14] K. V. Mardia, J. T. Kent, G. Hughes, and C. C. Taylor. 
Maximum likelihood estimation using composite likelihoods for closed exponential families. Biometrika, 96(4):975\u2013982, 2009.\n[15] B. Marlin and N. de Freitas. Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In Uncertainty in Artificial Intelligence, pages 497\u2013505, 2011.\n[16] B. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machine learning. In Artificial Intelligence and Statistics, pages 509\u2013516, 2010.\n[17] Z. Meng, D. Wei, A. Wiesel, and A. O. Hero III. Distributed learning of Gaussian graphical models via marginal likelihoods. In Artificial Intelligence and Statistics, pages 39\u201347, 2013.\n[18] Z. Meng, D. Wei, A. Wiesel, and A. O. Hero III. Marginal likelihoods for distributed parameter estimation of Gaussian graphical models. Technical report, arXiv:1303.4756, 2014.\n[19] Y. Mizrahi, M. Denil, and N. de Freitas. Linear and parallel learning of Markov random fields. In International Conference on Machine Learning, 2014.\n[20] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.\n[21] S. Nowozin. Constructing composite likelihoods in general random fields. In ICML Workshop on Inferning: Interactions between Inference and Learning, 2013.\n[22] S. Okabayashi, L. Johnson, and C. Geyer. Extending pseudo-likelihood for Potts models. Statistica Sinica, 21(1):331\u2013347, 2011.\n[23] P. Smolensky. Information processing in dynamical systems: foundations of harmony theory. Parallel distributed processing: explorations in the microstructure of cognition, 1:194\u2013281, 1986.\n[24] D. Strauss and M. Ikeda. Pseudolikelihood estimation for social networks. Journal of the American Statistical Association, 85(409):204\u2013212, 1990.\n[25] A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, 1998.\n[26] C. 
Varin, N. Reid, and D. Firth. An overview of composite likelihood methods. Statistica Sinica, 21:5\u201342, 2011.\n[27] A. Wiesel and A. Hero III. Distributed covariance estimation in Gaussian graphical models. IEEE Transactions on Signal Processing, 60(1):211\u2013220, 2012.\n", "award": [], "sourceid": 897, "authors": [{"given_name": "Yariv", "family_name": "Mizrahi", "institution": "University of British Columbia"}, {"given_name": "Misha", "family_name": "Denil", "institution": "University of Oxford"}, {"given_name": "Nando", "family_name": "de Freitas", "institution": "University of Oxford"}]}