{"title": "Learning unbelievable probabilities", "book": "Advances in Neural Information Processing Systems", "page_first": 738, "page_last": 746, "abstract": "Loopy belief propagation performs approximate inference on graphical models with loops. One might hope to compensate for the approximation by adjusting model parameters. Learning algorithms for this purpose have been explored previously, and the claim has been made that every set of locally consistent marginals can arise from belief propagation run on a graphical model. On the contrary, here we show that many probability distributions have marginals that cannot be reached by belief propagation using any set of model parameters or any learning algorithm. We call such marginals `unbelievable.' This problem occurs whenever the Hessian of the Bethe free energy is not positive-definite at the target marginals. All learning algorithms for belief propagation necessarily fail in these cases, producing beliefs or sets of beliefs that may even be worse than the pre-learning approximation. We then show that averaging inaccurate beliefs, each obtained from belief propagation using model parameters perturbed about some learned mean values, can achieve the unbelievable marginals.", "full_text": "Learning unbelievable probabilities\n\nXaq Pitkow\nDepartment of Brain and Cognitive Science\nUniversity of Rochester\nRochester, NY 14607\nxaq@neurotheory.columbia.edu\n\nYashar Ahmadian\nCenter for Theoretical Neuroscience\nColumbia University\nNew York, NY 10032\nya2005@columbia.edu\n\nKen D. Miller\nCenter for Theoretical Neuroscience\nColumbia University\nNew York, NY 10032\nken@neurotheory.columbia.edu\n\nAbstract\n\nLoopy belief propagation performs approximate inference on graphical models\nwith loops. One might hope to compensate for the approximation by adjusting\nmodel parameters. 
Learning algorithms for this purpose have been explored previously,\nand the claim has been made that every set of locally consistent marginals\ncan arise from belief propagation run on a graphical model. On the contrary, here\nwe show that many probability distributions have marginals that cannot be reached\nby belief propagation using any set of model parameters or any learning algorithm.\nWe call such marginals \u2018unbelievable.\u2019 This problem occurs whenever the Hessian\nof the Bethe free energy is not positive-definite at the target marginals. All learning\nalgorithms for belief propagation necessarily fail in these cases, producing\nbeliefs or sets of beliefs that may even be worse than the pre-learning approximation.\nWe then show that averaging inaccurate beliefs, each obtained from belief\npropagation using model parameters perturbed about some learned mean values,\ncan achieve the unbelievable marginals.\n\n1 Introduction\n\nCalculating marginal probabilities for a graphical model generally requires summing over exponentially\nmany states, and is NP-hard in general [1]. A variety of approximate methods have been used\nto circumvent this problem. One popular technique is belief propagation (BP), in particular the sum-product\nrule, which is a message-passing algorithm for performing inference on a graphical model\n[2]. Though exact and efficient on trees, it is merely an approximation when applied to graphical\nmodels with loops.\nA natural question is whether one can compensate for the shortcomings of the approximation by\nsetting the model parameters appropriately. In this paper, we prove that some sets of marginals\nsimply cannot be achieved by belief propagation. For these cases we provide a new algorithm that\ncan achieve much better results by using an ensemble of parameters rather than a single instance.\nWe are given a set of variables $x$ with a given probability distribution $P(x)$ of some data. 
We would\nlike to construct a model that reproduces certain of its marginal probabilities, in particular those over\nindividual variables $p_i(x_i) = \\sum_{x \\setminus x_i} P(x)$ for nodes $i \\in V$, and those over some relevant clusters\nof variables, $p_\\alpha(x_\\alpha) = \\sum_{x \\setminus x_\\alpha} P(x)$ for $\\alpha = \\{i_1, \\ldots, i_{d_\\alpha}\\}$. We will write the collection of all\nthese marginals as a vector $p$.\n\nWe assume a model distribution $Q_0(x)$ in the exponential family taking the form\n\n$$Q_0(x) = e^{-E(x)}/Z \\tag{1}$$\n\nwith normalization constant $Z = \\sum_x e^{-E(x)}$ and energy function\n\n$$E(x) = -\\sum_\\alpha \\theta_\\alpha \\cdot \\phi_\\alpha(x_\\alpha) \\tag{2}$$\n\nHere, $\\alpha$ indexes sets of interacting variables (factors in the factor graph [3]), and $x_\\alpha$ is a subset\nof variables whose interaction is characterized by a vector of sufficient statistics $\\phi_\\alpha(x_\\alpha)$ and\ncorresponding natural parameters $\\theta_\\alpha$. We assume without loss of generality that each $\\phi_\\alpha(x_\\alpha)$ is\nirreducible, meaning that the elements are linearly independent functions of $x_\\alpha$. We collect all these\nsufficient statistics and natural parameters in the vectors $\\phi$ and $\\theta$.\nNormally when learning a graphical model, one would fit its parameters so the marginal probabilities\nmatch the target. Here, however, we will not use exact inference to compute the marginals. Instead\nwe will use approximate inference via loopy belief propagation to match the target.\n\n2 Learning in Belief Propagation\n\n2.1 Belief propagation\n\nThe sum-product algorithm for belief propagation on a graphical model with energy function (2)\nuses the following equations [4]:\n\n$$m_{i \\to \\alpha}(x_i) \\propto \\prod_{\\beta \\in N_i \\setminus \\alpha} m_{\\beta \\to i}(x_i) \\qquad m_{\\alpha \\to i}(x_i) \\propto \\sum_{x_\\alpha \\setminus x_i} e^{\\theta_\\alpha \\cdot \\phi_\\alpha(x_\\alpha)} \\prod_{j \\in N_\\alpha \\setminus i} m_{j \\to \\alpha}(x_j) \\tag{3}$$\n\nwhere $N_i$ and $N_\\alpha$ are the neighbors of node $i$ or factor $\\alpha$ in the factor graph. 
Once these messages\nconverge, the single-node and factor beliefs are given by\n\n$$b_i(x_i) \\propto \\prod_{\\alpha \\in N_i} m_{\\alpha \\to i}(x_i) \\qquad b_\\alpha(x_\\alpha) \\propto e^{\\theta_\\alpha \\cdot \\phi_\\alpha(x_\\alpha)} \\prod_{i \\in N_\\alpha} m_{i \\to \\alpha}(x_i) \\tag{4}$$\n\nwhere the beliefs must each be normalized to one. For tree graphs, these beliefs exactly equal the\nmarginals of the graphical model $Q_0(x)$. For loopy graphs, the beliefs at stable fixed points are\noften good approximations of the marginals. While they are guaranteed to be locally consistent,\n$\\sum_{x_\\alpha \\setminus x_i} b_\\alpha(x_\\alpha) = b_i(x_i)$, they are not necessarily globally consistent: There may not exist a single\njoint distribution $B(x)$ of which the beliefs are the marginals [5]. This is why the resultant beliefs\nare called pseudomarginals, rather than simply marginals. We use a vector $b$ to refer to the set of\nboth node and factor beliefs produced by belief propagation.\n\n2.2 Bethe free energy\n\nDespite its limitations, BP is found empirically to work well in many circumstances. Some theoretical\njustification for loopy belief propagation emerged with proofs that its stable fixed points are local\nminima of the Bethe free energy [6, 7]. 
Free energies are important quantities in machine learning\nbecause the Kullback-Leibler divergence between the data and model distributions can be expressed\nin terms of free energies, so models can be optimized by minimizing free energies appropriately.\nGiven an energy function $E(x)$ from (2), the Gibbs free energy of a distribution $Q(x)$ is\n\n$$F[Q] = U[Q] - S[Q] \\tag{5}$$\n\nwhere $U$ is the average energy of the distribution\n\n$$U[Q] = \\sum_x E(x) Q(x) = -\\sum_\\alpha \\theta_\\alpha \\cdot \\sum_{x_\\alpha} \\phi_\\alpha(x_\\alpha) q_\\alpha(x_\\alpha) \\tag{6}$$\n\nwhich depends on the marginals $q_\\alpha(x_\\alpha)$ of $Q(x)$, and $S$ is the entropy\n\n$$S[Q] = -\\sum_x Q(x) \\log Q(x) \\tag{7}$$\n\nMinimizing the Gibbs free energy $F[Q]$ recovers the distribution $Q_0(x)$ for the graphical model (1).\nThe Bethe free energy $F^\\beta$ is an approximation to the Gibbs free energy,\n\n$$F^\\beta[Q] = U[Q] - S^\\beta[Q] \\tag{8}$$\n\nin which the average energy $U$ is exact, but the true entropy $S$ is replaced by an approximation, the\nBethe entropy $S^\\beta$, which is a sum over the factor and node entropies [6]:\n\n$$S^\\beta[Q] = \\sum_\\alpha S_\\alpha[q_\\alpha] + \\sum_i (1 - d_i) S_i[q_i] \\tag{9}$$\n\n$$S_\\alpha[q_\\alpha] = -\\sum_{x_\\alpha} q_\\alpha(x_\\alpha) \\log q_\\alpha(x_\\alpha) \\qquad S_i[q_i] = -\\sum_{x_i} q_i(x_i) \\log q_i(x_i) \\tag{10}$$\n\nThe coefficients $d_i = |N_i|$ are the number of factors neighboring node $i$, and compensate for the\novercounting of single-node marginals due to overlapping factor marginals. For tree-structured\ngraphical models, which factorize as $Q(x) = \\prod_\\alpha q_\\alpha(x_\\alpha) \\prod_i q_i(x_i)^{1 - d_i}$, the Bethe entropy is exact,\nand hence so is the Bethe free energy. On loopy graphs, the Bethe entropy $S^\\beta$ isn\u2019t really even an\nentropy (e.g. it may be negative) because it neglects all statistical dependencies other than those\npresent in the factor marginals. Nonetheless, the Bethe free energy is often close enough to the\nGibbs free energy that its minima approximate the true marginals [8]. 
Since stable fixed points of\nBP are minima of the Bethe free energy [6, 7], this helped explain why belief propagation is often\nso successful.\nTo emphasize that the Bethe free energy directly depends only on the marginals and not the joint\ndistribution, we will write $F^\\beta[q]$ where $q$ is a vector of pseudomarginals $q_\\alpha(x_\\alpha)$ for all $\\alpha$ and all $x_\\alpha$.\nPseudomarginal space is the convex set [5] of all $q$ that satisfy the positivity and local consistency\nconstraints,\n\n$$0 \\le q_\\alpha(x_\\alpha) \\le 1 \\qquad \\sum_{x_\\alpha \\setminus x_i} q_\\alpha(x_\\alpha) = q_i(x_i) \\qquad \\sum_{x_i} q_i(x_i) = 1 \\tag{11}$$\n\n2.3 Pseudo-moment matching\n\nWe now wish to correct for the deficiencies of belief propagation by identifying the parameters $\\theta$\nso that BP produces beliefs $b$ matching the true marginals $p$ of the target distribution $P(x)$. Since\nthe fixed points of BP are stationary points of $F^\\beta$ [6], one may simply try to find parameters $\\theta$ that\nproduce a stationary point in pseudomarginal space at $p$, which is a necessary condition for BP to\nreach a stable fixed point there. Simply evaluate the gradient at $p$, set it to zero, and solve for $\\theta$.\nNote that in principle this gradient could be used to directly minimize the Bethe free energy, but\n$F^\\beta[q]$ is a complicated function of $q$ that usually cannot be minimized analytically [8]. In contrast,\nhere we are using it to solve for the parameters needed to move beliefs to a target location. This is\nmuch easier, since the Bethe free energy is linear in $\\theta$. This approach to learning parameters has\nbeen described as \u2018pseudo-moment matching\u2019 [9, 10, 11].\nThe $L_q$-element vector $q$ is an overcomplete representation of the pseudomarginals because it must\nobey the local consistency constraints (11). 
It is convenient to express the pseudomarginals in terms\nof a minimal set of parameters $\\eta$ with the smaller dimensionality $L_\\theta$ of $\\theta$ and $\\phi$, using an affine\ntransform\n\n$$q = W \\eta + k \\tag{12}$$\n\nwhere $W$ is an $L_q \\times L_\\theta$ rectangular matrix. One example is the expectation parameters\n$\\eta_\\alpha = \\sum_{x_\\alpha} q_\\alpha(x_\\alpha) \\phi_\\alpha(x_\\alpha)$ [5], giving the energy simply as $U = -\\theta \\cdot \\eta$. The gradient with respect to\nthose minimal parameters is\n\n$$\\frac{\\partial F^\\beta}{\\partial \\eta} = \\frac{\\partial U}{\\partial \\eta} - \\frac{\\partial S^\\beta}{\\partial q} \\frac{\\partial q}{\\partial \\eta} = -\\theta - \\frac{\\partial S^\\beta}{\\partial q} W \\tag{13}$$\n\nThe Bethe entropy gradient is simplest in the overcomplete representation $q$,\n\n$$\\frac{\\partial S^\\beta}{\\partial q_\\alpha(x_\\alpha)} = -1 - \\log q_\\alpha(x_\\alpha) \\qquad \\frac{\\partial S^\\beta}{\\partial q_i(x_i)} = (-1 - \\log q_i(x_i))(1 - d_i) \\tag{14}$$\n\nSetting the gradient (13) to zero, we have a simple linear equation for the parameters $\\theta$ that tilt the\nBethe free energy surface (Figure 1A) enough to place a stationary point at the desired marginals $p$:\n\n$$\\theta = -\\left. \\frac{\\partial S^\\beta}{\\partial q} \\right|_p W \\tag{15}$$\n\nFigure 1: Landscape of Bethe free energy for the binary graphical model with pairwise interactions.\n(A) A slice through the Bethe free energy (solid lines) along one axis $v_1$ of pseudomarginal space,\nfor three different values of parameters $\\theta$. The energy $U$ is linear in the pseudomarginals (dotted\nlines), so varying the parameters only changes the tilt of the free energy. This can add or remove\nlocal minima. (B) The second derivatives of the free energies in (A) are all identical. Where the\nsecond derivative is positive, a local minimum can exist (cyan); where it is negative (yellow), no\nparameters can produce a local minimum. 
(C) A two-dimensional slice of the Bethe free energy,\ncolored according to the minimum eigenvalue $\\lambda_{\\min}$ of the Bethe Hessian. During a run of Bethe\nwake-sleep learning, the beliefs (blue dots) proceed along $v_2$ toward the target marginals $p$. Stable\nfixed points of BP can exist only in the believable region (cyan), but the target $p$ resides in an unbelievable\nregion (yellow). As learning equilibrates, the stable fixed points jump between believable\nregions on either side of the unbelievable zone.\n\n2.4 Unbelievable marginals\n\nIt is well known that BP may converge on stable fixed points that cannot be realized as marginals of\nany joint distribution. In this section we show that the converse is also true: There are some distributions\nwhose marginals cannot be realized as beliefs for any set of couplings. In these cases, existing\nmethods for learning often yield poor results, sometimes even worse than performing no learning\nat all. This is surprising in view of claims to the contrary: [9, 5] state that belief propagation run\nafter pseudo-moment matching can always reach a fixed point that reproduces the target marginals.\nWhile BP does technically have such fixed points, they are not always stable and thus may not be\nreachable by running belief propagation.\nDefinition 1. A set of marginals is \u2018unbelievable\u2019 if belief propagation cannot converge to them\nfor any set of parameters.\nFor belief propagation to converge to the target, namely the marginals $p$, a zero gradient is\nnot sufficient: The Bethe free energy must also be a local minimum [7] (footnote 1). This requires a positive-definite\nHessian of $F^\\beta$ (the \u2018Bethe Hessian\u2019 $H$) in the subspace of pseudomarginals that satisfies the\nlocal consistency constraints. 
Since the energy $U$ is linear in the pseudomarginals, the Hessian is\ngiven by the second derivative of the Bethe entropy,\n\n$$H = \\frac{\\partial^2 F^\\beta}{\\partial \\eta^2} = -W^\\top \\frac{\\partial^2 S^\\beta}{\\partial q^2} W \\tag{16}$$\n\nwhere projection by $W$ constrains the derivatives to the subspace spanned by the minimal parameters\n$\\eta$. If this Hessian is positive definite when evaluated at $p$ then the parameters $\\theta$ given by (15) give\n$F^\\beta$ a minimum at the target $p$. If not, then the target cannot be a stable fixed point of loopy belief\npropagation. In Section 3, we calculate the Bethe Hessian explicitly for a binary model with pairwise\ninteractions.\nTheorem 1. Unbelievable marginal probabilities exist.\nProof. Proof by example. The simplest unbelievable example is a binary graphical model with\npairwise interactions between four nodes, $x \\in \\{-1, +1\\}^4$, and the energy $E(x) = -J \\sum_{(ij)} x_i x_j$.\nBy symmetry and (1), marginals of this target $P(x)$ are the same for all nodes and pairs: $p_i(x_i) = \\frac{1}{2}$\nand $p_{ij}(x_i = x_j) = \\rho = (2 + 4/(1 + e^{2J} - e^{4J} + e^{6J}))^{-1}$. Substituting these marginals into\nthe appropriate Bethe Hessian (22) gives a matrix that has a negative eigenvalue for all $\\rho > \\frac{3}{8}$, or\n$J > 0.316$. The associated eigenvector $u$ has the same symmetry as the marginals, with single-node\ncomponents $u_i = \\frac{1}{2} (-2 + 7\\rho - 8\\rho^2 + \\sqrt{10 - 28\\rho + 81\\rho^2 - 112\\rho^3 + 64\\rho^4})$ and pairwise\ncomponents $u_{ij} = 1$. Thus the Bethe free energy does not have a minimum at the marginals of these\n$P(x)$. Stable fixed points of BP occur only at local minima of the Bethe free energy [7], and so BP\ncannot reproduce the marginals $p$ for any parameters. Hence these marginals are unbelievable.\n\nFootnote 1: Even this is not sufficient, but it is necessary.\n\nNot only do unbelievable marginals exist, but they are actually quite common, as we will see in\nSection 3. 
Graphical models with multinomial or gaussian variables and at least two loops always\nhave some pseudomarginals for which the Hessian is not positive definite [12]. On the other hand,\nall marginals with sufficiently small correlations are believable because they are guaranteed to have\na positive-definite Bethe Hessian [12]. Stronger conditions have not yet been described.\n\n2.5 Bethe wake-sleep algorithm\n\nWhen pseudo-moment matching fails to reproduce unbelievable marginals, an alternative is to use a\ngradient descent procedure for learning, analogous to the wake-sleep algorithm used to train Boltzmann\nmachines [13]. That original rule can be derived as gradient descent of the Kullback-Leibler\ndivergence $D_{KL}$ between the target $P(x)$ and the Boltzmann distribution $Q_0(x)$ (1),\n\n$$D_{KL}[P \\| Q_0] = \\sum_x P(x) \\log \\frac{P(x)}{Q_0(x)} = F[P] - F[Q_0] \\ge 0 \\tag{17}$$\n\nwhere $F$ is the Gibbs free energy (5). Note that this free energy depends on the same energy function\n$E$ (2) that defines the Boltzmann distribution $Q_0$ (1), and achieves its minimal value of $-\\log Z$ for\nthat distribution. The Kullback-Leibler divergence is therefore bounded below by zero, with equality if and\nonly if $P = Q_0$. By changing the energy $E$ and thus $Q_0$ to decrease this divergence, the graphical\nmodel moves closer to the target distribution.\nHere we use a new cost function, the \u2018Bethe divergence\u2019 $D^\\beta[p \\| b]$, by replacing these free energies\nby Bethe free energies [14] evaluated at the true marginals $p$ and at the beliefs $b$ obtained from BP\nstable fixed points,\n\n$$D^\\beta[p \\| b] = F^\\beta[p] - F^\\beta[b] \\tag{18}$$\n\nWe use gradient descent to optimize this cost, with gradient\n\n$$\\frac{dD^\\beta}{d\\theta} = \\frac{\\partial D^\\beta}{\\partial \\theta} + \\frac{\\partial D^\\beta}{\\partial b} \\frac{\\partial b}{\\partial \\theta} \\tag{19}$$\n\nThe data\u2019s free energy does not depend on the beliefs, so $\\partial F^\\beta[p] / \\partial b = 0$, and fixed points of\nbelief propagation are stationary points of the Bethe free energy, so $\\partial F^\\beta[b] / \\partial b = 0$. Consequently\n$\\partial D^\\beta / \\partial b = 0$. 
Furthermore, the entropy terms of the free energies do not depend explicitly on $\\theta$, so\n\n$$\\frac{dD^\\beta}{d\\theta} = \\frac{\\partial U(p)}{\\partial \\theta} - \\frac{\\partial U(b)}{\\partial \\theta} = -\\eta(p) + \\eta(b) \\tag{20}$$\n\nwhere $\\eta(q) = \\sum_x q(x) \\phi(x)$ are the expectations of the sufficient statistics $\\phi(x)$ under the pseudomarginals\n$q$. This gradient forms the basis of a simple learning algorithm. At each step in learning,\nbelief propagation is run, obtaining beliefs $b$ for the current parameters $\\theta$. The parameters are then\nchanged in the opposite direction of the gradient,\n\n$$\\Delta\\theta = -\\epsilon \\frac{dD^\\beta}{d\\theta} = \\epsilon (\\eta(p) - \\eta(b)) \\tag{21}$$\n\nwhere $\\epsilon$ is a learning rate. This generally increases the Bethe free energy for the beliefs while\ndecreasing that of the data, hopefully allowing BP to draw closer to the data marginals. We call this\nlearning rule the Bethe wake-sleep algorithm.\nWithin this algorithm, there is still the freedom of how to choose initial messages for BP at each\nlearning iteration. The result depends on these initial conditions because BP can have several stable\nfixed points. One might re-initialize the messages to a fixed starting point for each run of BP, choose\nrandom initial messages for each run, or restart the messages where they stopped on the previous\nlearning step. In our experiments we use the first approach, initializing to constant messages at the\nbeginning of each BP run.\nThe Bethe wake-sleep learning rule sometimes places a minimum of $F^\\beta$ at the true data distribution,\nsuch that belief propagation can give the true marginals as one of its (possibly multiple) stable fixed\npoints. However, for the reasons provided above, this cannot occur where the Bethe Hessian is not\npositive definite.\n\n2.6 Ensemble belief propagation\n\nWhen the Bethe wake-sleep algorithm attempts to learn unbelievable marginals, the parameters\nand beliefs do not reach a fixed point but instead continue to vary over time (Figure 2A,B). 
Still,\nif learning reaches equilibrium, then the temporal average of beliefs is equal to the unbelievable\nmarginals.\nTheorem 2. If the Bethe wake-sleep algorithm reaches equilibrium, then unbelievable marginals\nare matched by the belief propagation stable fixed points averaged over the equilibrium ensemble of\nparameters.\nProof. At equilibrium, the time average of the parameter changes is zero by definition, $\\langle \\Delta\\theta \\rangle_t = 0$.\nSubstitution of the Bethe wake-sleep equation, $\\Delta\\theta = \\epsilon (\\eta(p) - \\eta(b(t)))$ (21), directly implies\nthat $\\langle \\eta(b(t)) \\rangle_t = \\eta(p)$. Because the deterministic mapping (12) from the minimal representation to the\npseudomarginals is affine, it commutes with averaging, which gives $\\langle b(t) \\rangle_t = p$.\nAfter learning has equilibrated, stable fixed points of belief propagation occur with just the right\nfrequency so that they can be averaged together to reproduce the target distribution exactly (Figure\n2C). Note that none of the individual stable fixed points may be close to the true marginals. We call\nthis inference algorithm ensemble belief propagation (eBP).\nEnsemble BP produces perfect marginals by exploiting a constant, small amplitude learning, and\nthus assumes that the correct marginals are perpetually available. Yet it also works well when\nlearning is turned off, if parameters are drawn randomly from a gaussian distribution with mean\nand covariance matched to the equilibrium distribution, $\\theta \\sim N(\\bar\\theta, \\Sigma_\\theta)$. In the simulations below\n(Figures 2C\u2013D, 3B\u2013C), $\\Sigma_\\theta$ was always low-rank, and only one or two principal components were\nneeded for good performance. 
The gaussian ensemble is not quite as accurate as continued learning\n(Figure 3B,C), but the performance is still markedly better than any of the available stable fixed\npoints.\nIf the target is not within a convex hull of believable pseudomarginals, then learning cannot reach\nequilibrium: Eventually BP gets as close as it can but there remains a consistent difference $\\eta(p) - \\eta(b)$,\nso $\\theta$ must increase without bound. Though possible in principle, we did not observe this effect\nin any of our experiments. There may also be no equilibrium if belief propagation at each learning\niteration fails to converge.\n\n3 Experiments\n\nThe experiments in this section concentrate on the Ising model: $N$ binary variables, $x \\in \\{-1, +1\\}^N$,\nwith factors comprising individual variables $x_i$ and pairs $x_i, x_j$. The energy function is\n$E(x) = -\\sum_i h_i x_i - \\sum_{(ij)} J_{ij} x_i x_j$. Then the sufficient statistics are the various first and second moments,\n$x_i$ and $x_i x_j$, and the natural parameters are $h_i$, $J_{ij}$. We use this model both for the target distributions\nand the model. We parameterize pseudomarginals as $\\{q_i^+, q_{ij}^{++}\\}$ where $q_i^+ = q_i(x_i = +1)$\nand $q_{ij}^{++} = q_{ij}(x_i = x_j = +1)$ [8]. The remaining probabilities are linear functions of these\nvalues. Positivity constraints and local consistency constraints then appear as $0 \\le q_i^+ \\le 1$ and\n$\\max(0, q_i^+ + q_j^+ - 1) \\le q_{ij}^{++} \\le \\min(q_i^+, q_j^+)$. If all the interactions are finite, then the inequality\nconstraints are not active [15]. In this parameterization, the elements of the Bethe Hessian (16) are\n\n$$-\\frac{\\partial^2 S^\\beta}{\\partial q_i^+ \\partial q_j^+} = \\delta_{i,j} (1 - d_i) \\left[ (q_i^+)^{-1} + (1 - q_i^+)^{-1} \\right] + \\delta_{j \\in N_i} \\left[ (1 - q_i^+ - q_j^+ + q_{ij}^{++})^{-1} \\right] + \\delta_{i,j} \\sum_{k \\in N_i} \\left[ (q_i^+ - q_{ik}^{++})^{-1} + (1 - q_i^+ - q_k^+ + q_{ik}^{++})^{-1} \\right] \\tag{22}$$\n\nFigure 2: Averaging over variable couplings can produce marginals otherwise unreachable by belief\npropagation. 
(A) As learning proceeds, the Bethe wake-sleep algorithm causes parameters $\\theta$ to converge\non a discrete limit cycle when attempting to learn unbelievable marginals. (B) The same limit\ncycle, projected onto the first two principal components $u_1$ and $u_2$ of $\\theta$ during the cycle. (C) The\ncorresponding beliefs $b$ during the limit cycle (blue circles), projected onto the first two principal\ncomponents $v_1$ and $v_2$ of the trajectory through pseudomarginal space. Believable regions of pseudomarginal\nspace are colored with cyan and the unbelievable regions with yellow, and inconsistent\npseudomarginals are black. Over the limit cycle, the average beliefs $\\bar b$ (blue $\\times$) are precisely equal\nto the target marginals $p$ (black $\\ast$). The average $\\bar b$ (red +) over many stable fixed points of BP\n(red dots) generated from randomly perturbed parameters $\\bar\\theta + \\delta\\theta$ still produces a better approximation\nof the target marginals than any of the individual believable stable fixed points. (D) Even the\nbest amongst several BP stable fixed points cannot match unbelievable marginals (black and grey).\nEnsemble BP leads to much improved performance (red and pink).\n\nThe remaining elements of the Bethe Hessian (22) are\n\n$$-\\frac{\\partial^2 S^\\beta}{\\partial q_i^+ \\partial q_{jk}^{++}} = -\\delta_{i,j} \\left[ (q_i^+ - q_{ik}^{++})^{-1} + (1 - q_i^+ - q_k^+ + q_{ik}^{++})^{-1} \\right] - \\delta_{i,k} \\left[ (q_i^+ - q_{ij}^{++})^{-1} + (1 - q_i^+ - q_j^+ + q_{ij}^{++})^{-1} \\right]$$\n\n$$-\\frac{\\partial^2 S^\\beta}{\\partial q_{ij}^{++} \\partial q_{k\\ell}^{++}} = \\delta_{ij,k\\ell} \\left[ (q_{ij}^{++})^{-1} + (q_i^+ - q_{ij}^{++})^{-1} + (q_j^+ - q_{ij}^{++})^{-1} + (1 - q_i^+ - q_j^+ + q_{ij}^{++})^{-1} \\right]$$\n\nFigure 3A shows the fraction of marginals that are unbelievable for 8-node, fully-connected Ising\nmodels with random coupling parameters $h_i \\sim N(0, \\frac{1}{3})$ and $J_{ij} \\sim N(0, \\sigma_J)$. For $\\sigma_J \\gtrsim \\frac{1}{4}$, most\nmarginals cannot be reproduced by belief propagation with any parameters, because the Bethe Hessian\n(22) has a negative eigenvalue.\n\nFigure 3: Performance in learning unbelievable marginals. (A) Fraction of marginals that are unbelievable. 
Marginals were generated from fully connected, 8-node binary models with random biases\nand pairwise couplings, $h_i \\sim N(0, \\frac{1}{3})$ and $J_{ij} \\sim N(0, \\sigma_J)$. (B,C) Performance of five models on\n370 unbelievable random target marginals (Section 3), measured with Bethe divergence $D^\\beta[p \\| b]$\n(B) and Euclidean distance $|p - b|$ (C). Targets were generated as in (A) with $\\sigma_J = \\frac{1}{3}$, and selected\nfor unbelievability. Bars represent central quartiles, and the white line indicates the median. The five\nmodels are: (i) BP on the graphical model that generated the target distribution, (ii) BP after parameters\nare set by pseudomoment matching, (iii) the beliefs with the best performance encountered\nduring Bethe wake-sleep learning, (iv) eBP using exact parameters from the last 100 iterations of\nlearning, and (v) eBP with gaussian-distributed parameters with the same first- and second-order\nstatistics as iv.\n\nWe generated 500 Ising model targets using $\\sigma_J = \\frac{1}{3}$, selected the unbelievable ones, and evaluated\nthe performance of BP and ensemble BP for various methods of choosing parameters $\\theta$.\nEach run of BP used exponential temporal message damping of 5 time steps [16],\n$m_{t+1} = a m_t + (1 - a) m_{\\text{undamped}}$ with $a = e^{-1/5}$. Fixed points were declared when messages changed\nby less than $10^{-9}$ on a single time step. We evaluated BP performance for the actual parameters\nthat generated the target (1), pseudomoment matching (15), and at best-matching beliefs obtained at\nany time during Bethe wake-sleep learning. 
We also measured eBP performance for two parameter\nensembles: the last 100 iterations of Bethe wake-sleep learning, and parameters sampled from a\ngaussian $N(\\bar\\theta, \\Sigma_\\theta)$ with the same mean and covariance as that ensemble.\nBelief propagation gave a poor approximation of the target marginals, as expected for a model\nwith many strong loops. Even with learning, BP could never get the correct marginals, which was\nguaranteed by selection of unbelievable targets. Yet ensemble belief propagation gave excellent\nresults. Using the exact parameter ensemble gave orders of magnitude improvement, limited by the\nnumber of beliefs being averaged. The gaussian parameter ensemble also did much better than even\nthe best results of BP.\n\n4 Discussion\n\nOther studies have also made use of the Bethe Hessian to draw conclusions about belief propagation.\nFor instance, the Hessian reveals that the Ising model\u2019s paramagnetic state becomes unstable in\nBP for large enough couplings [17]. For another example, when the Hessian is positive definite\nthroughout pseudomarginal space, then the Bethe free energy is convex and thus BP has a unique\nstable fixed point [18]. Yet the stronger interpretation appears to be underappreciated: When the\nHessian is not positive definite for some pseudomarginals, then BP can never have a stable fixed\npoint there, for any parameters.\nOne might hope that by adjusting the parameters of belief propagation in some systematic way,\n$\\theta \\to \\theta_{BP}$, one could fix the approximation and so perform exact inference. In this paper we proved\nthat this is a futile hope, because belief propagation simply can never converge to certain marginals.\nHowever, we also provided an algorithm that does work: Ensemble belief propagation uses BP\non several different parameters with different stable fixed points and averages the results. 
This\napproach preserves the locality and scalability which make BP so popular, but corrects for some of\nits defects at the cost of running the algorithm a few times. Additionally, it raises the possibility that\na systematic compensation for the flaws of BP might exist, but only as a mapping from individual\nparameters to an ensemble of parameters $\\theta \\to \\{\\theta_{eBP}\\}$ that could be used in eBP.\nAn especially clear application of eBP is to discriminative models like Conditional Random Fields\n[19]. These models are trained so that known inputs produce known inferences, and then generalize\nto draw novel inferences from novel inputs. When belief propagation is used during learning, then\nthe model will fail even on known training examples if they happen to be unbelievable. Overall\nperformance will suffer. Ensemble BP can remedy those training failures and thus allow better\nperformance and more reliable generalization.\nThis paper addressed learning in fully-observed models only, where marginals for all variables were\navailable during training. Yet unbelievable marginals exist for models with hidden variables as well.\nEnsemble BP should work as in the fully-observed case, but training will require inference over the\nhidden variables during both wake and sleep phases.\nOne important inference engine is the brain. When inference is hard, neural computations may resort\nto approximations, perhaps including belief propagation [20, 21, 22, 23, 24]. It would be undesirable\nfor neural circuits to have big blind spots, i.e. reasonable inferences they cannot draw, yet that is\nprecisely what occurs in BP. By averaging over models with eBP, this blind spot can be eliminated. In\nthe brain, synaptic weights fluctuate due to a variety of mechanisms. 
Perhaps such \ufb02uctuations allow\naveraging over models and thereby reach conclusions unattainable by a deterministic mechanism.\nNote added in proof: After submission of this work, [25] presented partially overlapping results\nshowing that some marginals cannot be achieved by belief propagation.\n\nAcknowledgments\nThe authors thank Greg Wayne for helpful conversations.\n\n8\n\n\fReferences\n\n[1] Cooper G (1990) The computational complexity of probabilistic inference using bayesian belief networks.\n\nArti\ufb01cial intelligence 42: 393\u2013405.\n\n[2] Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan\n\nKaufmann Publishers, San Mateo CA.\n\n[3] Kschischang F, Frey B, Loeliger H (2001) Factor graphs and the sum-product algorithm. IEEE Transac-\n\ntions on Information Theory 47: 498\u2013519.\n\n[4] Bishop C (2006) Pattern recognition and machine learning. Springer New York.\n[5] Wainwright M, Jordan M (2008) Graphical models, exponential families, and variational inference. Foun-\n\ndations and Trends in Machine Learning 1: 1\u2013305.\n\n[6] Yedidia JS, Freeman WT, Weiss Y (2000) Generalized belief propagation. In: Advances in Neural Infor-\n\nmation Processing Systems 13. MIT Press, pp. 689\u2013695.\n\n[7] Heskes T (2003) Stable \ufb01xed points of loopy belief propagation are minima of the Bethe free energy.\n\nAdvances in Neural Information Processing Systems 15: 343\u2013350.\n\n[8] Welling M, Teh Y (2001) Belief optimization for binary networks: A stable alternative to loopy belief\npropagation. In: Uncertainty in Arti\ufb01cial Intelligence. Morgan Kaufmann Publishers Inc., pp. 554\u2013561.\n[9] Wainwright MJ, Jaakkola TS, Willsky AS (2003) Tree-reweighted belief propagation algorithms and ap-\n\nproximate ML estimation by pseudo-moment matching. In: Arti\ufb01cial Intelligence and Statistics.\n\n[10] Welling M, Teh Y (2003) Approximate inference in Boltzmann machines. 
Artificial Intelligence 143: 19\u201350.\n[11] Parise S, Welling M (2005) Learning in Markov random fields: An empirical study. In: Joint Statistical Meeting, volume 4.\n[12] Watanabe Y, Fukumizu K (2011) Loopy belief propagation, Bethe free energy and graph zeta function. arXiv cs.AI: 1103.0605v1.\n[13] Hinton G, Sejnowski T (1983) Analyzing cooperative computation. Proceedings of the Fifth Annual Cognitive Science Society, Rochester NY.\n[14] Welling M, Sutton C (2005) Learning in Markov random fields with contrastive free energies. In: Cowell RG, Ghahramani Z, editors, Artificial Intelligence and Statistics. pp. 397\u2013404.\n[15] Yedidia J, Freeman W, Weiss Y (2005) Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory 51: 2282\u20132312.\n[16] Mooij J, Kappen H (2005) On the properties of the Bethe approximation and loopy belief propagation on binary networks. Journal of Statistical Mechanics: Theory and Experiment 11: P11012.\n[17] Mooij J, Kappen H (2005) Validity estimates for loopy belief propagation on binary real-world networks. In: Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, pp. 945\u2013952.\n[18] Heskes T (2004) On the uniqueness of loopy belief propagation fixed points. Neural Computation 16: 2379\u20132413.\n[19] Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning: 282\u2013289.\n[20] Litvak S, Ullman S (2009) Cortical circuitry implementing graphical models. Neural Computation 21: 3010\u20133056.\n[21] Steimer A, Maass W, Douglas R (2009) Belief propagation in networks of spiking neurons. 
Neural Computation 21: 2502\u20132523.\n[22] Ott T, Stoop R (2007) The neurodynamics of belief propagation on binary Markov random fields. In: Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press, pp. 1057\u20131064.\n[23] Shon A, Rao R (2005) Implementing belief propagation in neural circuits. Neurocomputing 65\u201366: 393\u2013399.\n[24] George D, Hawkins J (2009) Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology 5: 1\u201326.\n[25] Heinemann U, Globerson A (2011) What cannot be learned with Bethe approximations. In: Uncertainty in Artificial Intelligence. Corvallis, Oregon: AUAI Press, pp. 319\u2013326.\n", "award": [], "sourceid": 500, "authors": [{"given_name": "Zachary", "family_name": "Pitkow", "institution": null}, {"given_name": "Yashar", "family_name": "Ahmadian", "institution": null}, {"given_name": "Ken", "family_name": "Miller", "institution": null}]}