{"title": "Expectation Consistent Free Energies for Approximate Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1008, "abstract": null, "full_text": "Expectation Consistent Free Energies for Approximate Inference\n\n\nManfred Opper\nISIS, School of Electronics and Computer Science\nUniversity of Southampton\nSO17 1BJ, United Kingdom\nmo@ecs.soton.ac.uk\n\nOle Winther\nInformatics and Mathematical Modelling\nTechnical University of Denmark\nDK-2800 Lyngby, Denmark\nowi@imm.dtu.dk\n\n\nAbstract\n\nWe propose a novel framework for deriving approximations for intractable probabilistic models. This framework is based on a free energy (negative log marginal likelihood) and can be seen as a generalization of adaptive TAP [1, 2, 3] and expectation propagation (EP) [4, 5]. The free energy is constructed from two approximating distributions which encode different aspects of the intractable model, such as single node constraints and couplings, and which are by construction consistent on a chosen set of moments. We test the framework on a difficult benchmark problem with binary variables on fully connected graphs and 2D grid graphs. We find good performance using sets of moments which either specify factorized nodes or a spanning tree on the nodes (structured approximation). Surprisingly, the Bethe approximation gives markedly inferior results even on grids.\n\n\n1 Introduction\n\nThe development of tractable approximations for statistical inference with probabilistic data models is of central importance for realizing their full potential. The most prominent and widely developed [6] approximation technique is the so-called Variational Approximation (VA), in which the true intractable probability distribution is approximated by the closest one in a tractable family. 
The most important tractable families of distributions are multivariate Gaussians and distributions which factorize in all or in certain groups of variables [7]. Both choices have their drawbacks. While factorizing distributions neglect correlations, multivariate Gaussians retain a significant amount of the dependencies but are restricted to continuous random variables which have the entire real space as their natural domain (otherwise KL divergences become infinite).\n\nMore recently, a variety of non-variational approximations have been developed which can be understood from the idea of global consistency between local approximations. For example, in the Bethe-Kikuchi approach [8] the local neighborhood of each variable in a graphical model is implicitly approximated by a tree-like structure. Consistency is achieved by matching marginal distributions at the connecting edges of the graph. Thomas Minka's Expectation Propagation (EP) framework seems to provide a general framework for developing and unifying such consistency approximations [4, 5]. Although the new frameworks have led to a variety of promising applications, often outperforming VA schemes, the unsatisfactory division between the treatment of constrained and unconstrained continuous random variables seems to persist.\n\nIn this paper we propose an alternative approach, which we call the expectation consistent (EC) approximation, that is not plagued by this problem. We require consistency between two complementary global approximations (say, a factorizing and a Gaussian one) to the same probabilistic model, which may have different support. 
Our method is a generalization of the adaptive TAP approach (ADATAP) [2, 3] developed for inference on densely connected graphical models, which has been applied successfully to a variety of problems ranging from probabilistic ICA and Gaussian process models to bootstrap methods for kernel machines.\n\n\n2 Approximative inference\n\nWe consider the problem of computing expectations, i.e. certain sums or integrals involving a probability distribution with density\n\n    $p(x) = \\frac{1}{Z} f(x)$ ,    (1)\n\nfor a vector of random variables $x = (x_1, x_2, \\ldots, x_N)$ with the partition function $Z = \\int dx \\, f(x)$. We assume that the necessary exact operations are intractable, where the intractability arises either because the necessary sums are over too large a number of variables or because multivariate integrals cannot be evaluated exactly. In a typical scenario, $f(x)$ is expressed as a product of two functions\n\n    $f(x) = f_1(x) f_2(x)$    (2)\n\nwith $f_{1,2}(x) \\geq 0$, where $f_1$ is \"simple\" enough to allow for tractable computations. The idea of many approximate inference methods is to approximate the \"complicated\" part $f_2(x)$ by replacing it with a \"simpler\" function, say of some exponential form $\\exp(\\lambda^T g(x)) \\equiv \\exp(\\sum_{j=1}^{K} \\lambda_j g_j(x))$. The vector of functions g is chosen in such a way that the desired sums or integrals can be calculated in an efficient way, and the parameters $\\lambda$ are adjusted to optimize certain criteria. Hence, the word tractability should always be understood as relative to some approximating set of functions g.\n\nOur novel framework of approximation will be restricted to problems where both parts $f_1$ and $f_2$ can be considered as tractable relative to some suitable g, and the intractability of the density p arises from forming their product. 
Take, as an example, the density (with respect to the Lebesgue measure in $R^N$) given by\n\n \n p(x) = (x) exp xiJijxj\n i 0.\nWe compute the average one-norm error on the marginals, $\\sum_i |p(x_i = 1) - p(x_i = 1 \\mid \\mathrm{Method})| / N$ with $p(x_i = 1) = (1 + m_i)/2$, over 100 trials, testing the following methods: SP = sum-product (also known as loopy belief propagation (BP) or the Bethe approximation), LD = log-determinant maximization [12], EC factorized and EC structured. Results for SP and LD are taken from Ref. [12]. For EC, we minimize the EC free energy eq. (22), where Gq(m, M, 0) depends on the approximation we are using. For the factorized model we use the free energy eq. (12), and for the structured model we assume a single tractable potential (x) in eq. (3) which contains all couplings on a spanning tree. For Gq, we use the free energy eq. (13). The spanning tree is defined by the following simple heuristic: choose as the next pair of nodes to link the (so far unlinked) pair with the strongest absolute coupling |Jij| that will not cause a loop in the graph.\n\nThe results are summarized in Table 1. The Bethe approximation always gives inferior results compared to EC (note that only problem instances on which loopy BP converged were used to calculate the error [12]). This might be a bit surprising for the sparsely connected grids. It indicates that loopy BP, and to a lesser degree extensions building upon BP [5], should only be applied to really sparse graphs and/or weakly coupled nodes, where the error induced by not using a properly normalized distribution can be expected to be small. We also speculate that a structured variational approximation, using the same heuristic as described above to construct the spanning tree, will in many cases be superior to the Bethe approximation, as also observed in Ref. [5]. LD is a robust method which seems to be limited in its achievable precision. EC structured is uniformly superior to all other approaches. 
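
The spanning-tree heuristic described above (repeatedly link the so far unlinked pair with the strongest |J_ij| that does not close a loop) can be sketched as a greedy, Kruskal-style construction. This is only an illustrative sketch, not the authors' code: the function name, the dense symmetric coupling matrix J, the use of NumPy, and the choice to skip pairs with zero coupling (for instance non-neighbours on a grid) are all assumptions made here.

```python
import numpy as np

def strongest_coupling_spanning_tree(J):
    # Hypothetical helper: J is assumed to be a symmetric (N, N) array of
    # couplings J_ij; the diagonal is ignored.
    n = J.shape[0]
    parent = list(range(n))  # union-find forest used for loop detection

    def find(i):
        # Return the representative of the component containing node i,
        # shortening the path on the way (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All unordered pairs, sorted by absolute coupling, strongest first.
    pairs = [(abs(J[i, j]), i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda t: -t[0])

    edges = []
    for strength, i, j in pairs:
        if strength == 0.0:
            break  # assumption: pairs without a coupling are never linked
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # linking i and j does not create a loop
            parent[root_i] = root_j
            edges.append((i, j))
            if len(edges) == n - 1:
                break  # the tree already spans all nodes
    return edges

# Small usage example with made-up couplings.
J = np.array([[0.0, 2.0, -3.0, 0.5],
              [2.0, 0.0, 1.0, 0.0],
              [-3.0, 1.0, 0.0, -0.1],
              [0.5, 0.0, -0.1, 0.0]])
print(strongest_coupling_spanning_tree(J))  # -> [(0, 2), (0, 1), (0, 3)]
```

In this example the pair (1, 2) is skipped even though |J_12| = 1 exceeds |J_03| = 0.5, because nodes 1 and 2 are already connected through node 0. If the couplings do not connect the graph, the sketch returns a spanning forest with fewer than N - 1 edges.
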
Additional simulations (not included in the paper) also indicate that EC gives much improved estimates of free energies and two-node marginals when compared to the Bethe and Kikuchi approximations.\n\nTable 1: The average one-norm error on marginals for the Wainwright-Jordan set-up.\n\nGraph  Coupling    dcoup  SP (mean)  LD (mean)  EC fac (mean)  EC struct (mean)\nFull   Repulsive   0.25   0.037      0.020      0.003          0.0017\nFull   Repulsive   0.50   0.071      0.018      0.031          0.0143\nFull   Mixed       0.25   0.004      0.020      0.002          0.0013\nFull   Mixed       0.50   0.055      0.021      0.022          0.0151\nFull   Attractive  0.06   0.024      0.027      0.004          0.0031\nFull   Attractive  0.12   0.435      0.033      0.117          0.0211\nGrid   Repulsive   1.0    0.294      0.047      0.153          0.0031\nGrid   Repulsive   2.0    0.342      0.041      0.198          0.0021\nGrid   Mixed       1.0    0.014      0.016      0.011          0.0018\nGrid   Mixed       2.0    0.095      0.038      0.082          0.0068\nGrid   Attractive  1.0    0.440      0.047      0.125          0.0028\nGrid   Attractive  2.0    0.520      0.042      0.177          0.0024\n\n\n8 Conclusion and outlook\n\nWe have introduced a novel method for approximate inference which tries to overcome certain limitations of single approximating distributions by achieving consistency for two of these on the same problem. While we have demonstrated its accuracy in this paper only for a model with binary elements, it can also be applied to models with continuous random variables or hybrid models with both discrete and continuous variables. We expect that our method becomes most powerful when certain tractable substructures of variables with strong dependencies can be identified in a model. Our approach would then allow one to deal well with the weaker dependencies between the groups. A generalization of our method to treat graphical models beyond pair-wise interaction is obtained by iterating the approximation. This is useful in cases where an initial three-term approximation GEC = Gq + Gr - Gs still contains non-tractable component free energies G.\n\n\nReferences\n\n [1] M. Opper and O. 
Winther, \"Gaussian processes for classification: Mean field algorithms,\" Neural Computation, vol. 12, pp. 2655–2684, 2000.\n\n [2] M. Opper and O. Winther, \"Tractable approximations for probabilistic models: The adaptive Thouless-Anderson-Palmer mean field approach,\" Phys. Rev. Lett., vol. 86, p. 3695, 2001.\n\n [3] M. Opper and O. Winther, \"Adaptive and self-averaging Thouless-Anderson-Palmer mean field theory for probabilistic modeling,\" Phys. Rev. E, vol. 64, p. 056131, 2001.\n\n [4] T. P. Minka, \"Expectation propagation for approximate Bayesian inference,\" in UAI 2001, 2001, pp. 362–369.\n\n [5] T. Minka and Y. Qi, \"Tree-structured approximations by expectation propagation,\" in NIPS 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. MIT Press, Cambridge, MA, 2004.\n\n [6] C. M. Bishop, D. Spiegelhalter, and J. Winn, \"VIBES: A variational inference engine for Bayesian networks,\" in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds., pp. 777–784. MIT Press, Cambridge, MA, 2003.\n\n [7] H. Attias, \"A variational Bayesian framework for graphical models,\" in Advances in Neural Information Processing Systems 12, T. Leen et al., Eds. MIT Press, Cambridge, MA, 2000.\n\n [8] J. S. Yedidia, W. T. Freeman, and Y. Weiss, \"Generalized belief propagation,\" in Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds., 2001, pp. 689–695.\n\n [9] A. L. Yuille, \"CCCP algorithms to minimize the Bethe and Kikuchi free energies: convergent alternatives to belief propagation,\" Neural Computation, vol. 14, no. 7, pp. 1691–1722, 2002.\n\n[10] T. Heskes, K. Albers, and H. Kappen, \"Approximate inference and constrained optimization,\" in UAI-03, San Francisco, CA, 2003, pp. 313–320, Morgan Kaufmann Publishers.\n\n[11] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.\n\n[12] M. J. Wainwright and M. I. 
Jordan, \"Semidefinite methods for approximate inference on graphs with cycles,\" Tech. Rep. UCB/CSD-03-1226, UC Berkeley CS Division, 2003.\n", "award": [], "sourceid": 2661, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}