{"title": "Improving on Expectation Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 1241, "page_last": 1248, "abstract": "We develop as series of corrections to Expectation Propagation (EP), which is one of the most popular methods for approximate probabilistic inference. These corrections can lead to improvements of the inference approximation or serve as a sanity check, indicating when EP yields unrealiable results.", "full_text": "Improving on Expectation Propagation\n\nManfred Opper\n\nComputer Science, TU Berlin\n\nopperm@cs.tu-berlin.de\n\nUlrich Paquet\n\nComputer Laboratory, University of Cambridge\n\nulrich@cantab.net\n\nInformatics and Mathematical Modelling, Technical University of Denmark\n\nOle Winther\n\nowi@imm.dtu.dk\n\nAbstract\n\nA series of corrections is developed for the \ufb01xed points of Expectation Propaga-\ntion (EP), which is one of the most popular methods for approximate probabilistic\ninference. These corrections can lead to improvements of the inference approxi-\nmation or serve as a sanity check, indicating when EP yields unrealiable results.\n\n1 Introduction\n\nThe expectation propagation (EP) message passing algorithm is often considered as the method of\nchoice for approximate Bayesian inference when both good accuracy and computational ef\ufb01ciency\nare required [5]. One recent example is a comparison of EP with extensive MCMC simulations for\nGaussian process (GP) classi\ufb01ers [4], which has shown that not only the predictive distribution, but\nalso the typically much harder marginal likelihood (the partition function) of the data, are approxi-\nmated remarkably well for a variety of data sets. However, while such empirical studies hold great\nvalue, they can not guarantee the same performance on other data sets or when completely different\ntypes of Bayesian models are considered.\n\nIn this paper methods are developed to assess the quality of the EP approximation. 
We compute explicit expressions for the remainder terms of the approximation. This leads to various corrections for partition functions and posterior distributions. Under the hypothesis that the EP approximation works well, we identify quantities which can be assumed to be small and can be used in a series expansion of the corrections with increasing complexity. The computation of low-order corrections in this expansion is often feasible, typically requires only moderate computational effort, and can lead to an improvement of the EP approximation or to an indication that the approximation cannot be trusted.

2 Expectation Propagation in a Nutshell

Since the goal of this paper is to compute corrections to the EP approximation, we will not discuss details of EP algorithms but rather characterise the fixed points which are reached when such algorithms converge.

EP is applied to probabilistic models with an unobserved latent variable x having an intractable distribution p(x). In applications p(x) is usually the Bayesian posterior distribution conditioned on a set of observations. Since the dependency on the observations is not important for the subsequent theory, we will suppress it in our notation.

It is assumed that p(x) factorises into a product of terms f_n such that

    p(x) = \frac{1}{Z} \prod_n f_n(x) ,    (1)

where the normalising partition function Z = \int dx \prod_n f_n(x) is also intractable. We then assume an approximation to p(x) of the form

    q(x) = \prod_n g_n(x) ,    (2)

where the terms g_n(x) belong to a tractable, e.g. exponential, family of distributions. To compute the optimal parameters of the g_n term approximation, a set of auxiliary tilted distributions is defined via

    q_n(x) = \frac{1}{Z_n} \frac{q(x) f_n(x)}{g_n(x)} .    (3)

Here a single approximating term g_n is replaced by an original term f_n.
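The fixed-point characterisation above can be sketched numerically. The following toy example is our own illustrative sketch, not code from the paper: the factors, pseudo-data and logistic likelihood are all assumptions. It runs Gaussian EP on a 1-D grid, updating each site g_n until every tilted distribution q_n shares its first two moments with q:

```python
import numpy as np

# Toy 1-D EP sketch (illustrative assumptions throughout): the target is
# p(x) ∝ f0(x) ∏_n f_n(x), mixing an exactly-kept Gaussian factor f0 with
# logistic factors f_n.  Each site is g_n(x) = exp(h_n x - lam_n x^2 / 2).

x = np.linspace(-10.0, 10.0, 4001)

def moments(w):
    """Mean and variance of an unnormalised density tabulated on the grid."""
    Zw = np.trapz(w, x)
    m = np.trapz(x * w, x) / Zw
    v = np.trapz((x - m) ** 2 * w, x) / Zw
    return m, v

y = np.array([1.0, 1.0, -1.0])                      # pseudo-data (assumed)
factors = [1.0 / (1.0 + np.exp(-3.0 * yn * x)) for yn in y]
f0 = np.exp(-0.5 * x ** 2)                          # Gaussian factor, exact

h = np.zeros(len(y))
lam = np.zeros(len(y))
for _ in range(100):                                # EP sweeps
    for n in range(len(y)):
        h_c = h.sum() - h[n]                        # cavity natural params
        lam_c = lam.sum() - lam[n]
        cavity = f0 * np.exp(h_c * x - 0.5 * lam_c * x ** 2)
        m, v = moments(cavity * factors[n])         # tilted moments
        # pick g_n so that the Gaussian cavity * g_n has the tilted moments
        lam[n] = 1.0 / v - lam_c - 1.0              # f0 contributes precision 1
        h[n] = m / v - h_c

q = f0 * np.exp(h.sum() * x - 0.5 * lam.sum() * x ** 2)
mq, vq = moments(q)                                 # shared moments of q and all q_n
```

At convergence the moment-matching condition holds for every n simultaneously, which is exactly the fixed-point property used in the remainder of the paper; no Kullback-Leibler argument is needed.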
Assuming that this replacement leaves q_n still tractable, the parameters in g_n are determined by the condition that q(x) and all q_n(x) should be made as similar as possible. This is usually achieved by requiring that these distributions share a set of generalised moments (which usually coincide with the sufficient statistics of the exponential family). Note that we will not assume that this expectation consistency [8] for the moments is derived by minimising a Kullback-Leibler divergence, as was done in the original derivations of EP [5]. Such an assumption would limit the applicability of the approximate inference and exclude, for example, the approximation of models with binary (Ising) variables by a Gaussian model, as in one of the applications in the last section.

The corresponding approximation to the normalising partition function in (1) was given in [8] and [7] and reads in our present notation¹

    Z_{EP} = \prod_n Z_n .    (4)

3 Corrections to EP

An expression for the remainder terms which are neglected by the EP approximation can be obtained by solving for f_n in (3) and taking the product to get

    \prod_n f_n(x) = \prod_n \left( \frac{Z_n q_n(x) g_n(x)}{q(x)} \right) = Z_{EP}\, q(x) \prod_n \left( \frac{q_n(x)}{q(x)} \right) .    (5)

Hence Z = \int dx \prod_n f_n(x) = Z_{EP} R, with

    R = \int dx\, q(x) \prod_n \left( \frac{q_n(x)}{q(x)} \right)  \quad and \quad  p(x) = \frac{1}{R}\, q(x) \prod_n \left( \frac{q_n(x)}{q(x)} \right) .    (6)

This shows that corrections to EP are small when all distributions q_n are indeed close to q, justifying the optimality criterion of EP. For related expansions, see [2, 3, 9].

Exact probabilistic inference with the corrections described here again leads to intractable computations. However, we can derive exact perturbation expansions involving a series of corrections with increasing computational complexity. Assuming that EP already yields a good approximation, the computation of a small number of these terms may be sufficient to obtain the most dominant corrections.
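The identity Z = Z_EP R of eqs. (4)-(6) is purely algebraic: it holds for any choice of sites g_n with q = ∏_n g_n normalised, not only at an EP fixed point. A minimal numerical sketch (the concrete factors and site parameters below are our own illustrative assumptions):

```python
import numpy as np

# Numerical check of eqs (4)-(6) on a toy 1-D model.  All factors and site
# parameters are illustrative assumptions; the identity Z = Z_EP * R does
# not require the sites to be moment-matched.

x = np.linspace(-12.0, 12.0, 8001)

f1 = np.exp(-0.5 * (x - 1.0) ** 2)           # term f_1 (Gaussian bump)
f2 = 1.0 / (1.0 + np.exp(-2.0 * x))          # term f_2 (sigmoidal step)
Z = np.trapz(f1 * f2, x)                     # exact partition function

# arbitrary Gaussian sites; rescale g1 so that q = g1 * g2 is normalised
g1 = np.exp(0.8 * x - 0.5 * 1.0 * x ** 2)
g2 = np.exp(0.3 * x - 0.5 * 0.1 * x ** 2)
g1 /= np.trapz(g1 * g2, x)
q = g1 * g2

# tilted distributions q_n = q f_n / (Z_n g_n), eq. (3)
Z1 = np.trapz(q * f1 / g1, x)
Z2 = np.trapz(q * f2 / g2, x)
q1 = q * f1 / (Z1 * g1)
q2 = q * f2 / (Z2 * g2)

Z_EP = Z1 * Z2                               # eq. (4)
R = np.trapz(q * (q1 / q) * (q2 / q), x)     # eq. (6)
p = q * (q1 / q) * (q2 / q) / R              # corrected posterior, eq. (6)
```

Here Z_EP * R reproduces Z, and p reproduces f1 * f2 / Z, up to quadrature rounding; at an EP fixed point R would additionally be close to 1.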
On the other hand, when the leading corrections come out large or do not sufficiently decrease with order, this may indicate that the EP approximation is inaccurate. Two such perturbation expansions are presented in this section.

¹The definition of the partition functions Z_n is slightly different from previous works.

3.1 Expansion I: Clusters

The most basic expansion is based on the variables \varepsilon_n(x) = \frac{q_n(x)}{q(x)} - 1, which we can assume to be typically small when the EP approximation is good. Expanding the product in (6), we obtain the correction to the partition function

    R = \int dx\, q(x) \prod_n \left( 1 + \varepsilon_n(x) \right) = 1 + \sum_{n_1 < n_2} \int dx\, q(x)\, \varepsilon_{n_1}(x)\, \varepsilon_{n_2}(x) + \cdots ,

where the first-order terms vanish because each q_n is normalised, so that \int dx\, q(x)\, \varepsilon_n(x) = 0.

[...] a. This is a regression model y_n = x_n + \nu_n, where the i.i.d. noise variables \nu_n have a uniform distribution and the observed outputs are all zero, i.e. y_n = 0. For this case the exact posterior variance does not shrink to zero even if the number of data points goes to infinity. The EP approximation, however, has the variance decrease to zero, and our corrections increase with sample size.

4.3 Ising models

Somewhat surprising (and probably less known) is the fact that EP and our corrections apply well to a fairly limiting case of the GP model, where the terms are of the form t_n(x_n) = e^{\theta_n x_n} \left( \delta(x_n + 1) + \delta(x_n - 1) \right), where \delta(x) is the Dirac distribution. These terms, together with a "Gaussian" f_0(x) = \exp[\sum_i
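The cluster expansion of Section 3.1 can be checked numerically. The sketch below uses the same kind of assumed toy factors as before (our own illustration, not the paper's experiments); with only N = 2 terms the expansion terminates, so R = 1 + <eps_1 eps_2>_q holds exactly, while for N > 2 third- and higher-order cluster terms appear:

```python
import numpy as np

# Cluster expansion of R in eps_n = q_n/q - 1 (toy 1-D model; the factors
# and site parameters are illustrative assumptions).  First-order terms
# vanish by normalisation of the q_n; with N = 2 the pairwise term is the
# entire correction.

x = np.linspace(-12.0, 12.0, 8001)

f1 = np.exp(-0.5 * (x - 1.0) ** 2)
f2 = 1.0 / (1.0 + np.exp(-2.0 * x))

g1 = np.exp(0.8 * x - 0.5 * 1.0 * x ** 2)
g2 = np.exp(0.3 * x - 0.5 * 0.1 * x ** 2)
g1 /= np.trapz(g1 * g2, x)                  # make q = g1 * g2 normalised
q = g1 * g2

Z1 = np.trapz(q * f1 / g1, x)
Z2 = np.trapz(q * f2 / g2, x)
eps1 = (f1 / (Z1 * g1)) - 1.0               # eps_n = q_n/q - 1
eps2 = (f2 / (Z2 * g2)) - 1.0

first1 = np.trapz(q * eps1, x)              # vanishes: q_1 is normalised
first2 = np.trapz(q * eps2, x)              # vanishes: q_2 is normalised
pair = np.trapz(q * eps1 * eps2, x)         # second-order cluster term

R_exact = np.trapz(q * (1.0 + eps1) * (1.0 + eps2), x)
R_second_order = 1.0 + pair                 # equals R_exact here (N = 2)
```

Comparing R_second_order with 1 gives a direct numerical handle on the size of the EP correction, in the spirit of the sanity check proposed in the paper.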