{"title": "Amortized Bethe Free Energy Minimization for Learning MRFs", "book": "Advances in Neural Information Processing Systems", "page_first": 15546, "page_last": 15557, "abstract": "We propose to learn deep undirected graphical models (i.e., MRFs) with a non-ELBO objective for which we can calculate exact gradients. In particular, we optimize a saddle-point objective deriving from the Bethe free energy approximation to the partition function. Unlike much recent work in approximate inference, the derived objective requires no sampling, and can be efficiently computed even for very expressive MRFs. We furthermore amortize this optimization with trained inference networks. Experimentally, we find that the proposed approach compares favorably with loopy belief propagation, but is faster, and it allows for attaining better held out log likelihood than other recent approximate inference schemes.", "full_text": "Amortized Bethe Free Energy Minimization for\n\nLearning MRFs\n\nSam Wiseman\n\nToyota Technological Institute at Chicago\n\nChicago, IL, USA\n\nswiseman@ttic.edu\n\nYoon Kim\n\nHarvard University\n\nCambridge, MA, USA\n\nyoonkim@seas.harvard.edu\n\nAbstract\n\nWe propose to learn deep undirected graphical models (i.e., MRFs) with a non-\nELBO objective for which we can calculate exact gradients. In particular, we\noptimize a saddle-point objective deriving from the Bethe free energy approxima-\ntion to the partition function. Unlike much recent work in approximate inference,\nthe derived objective requires no sampling, and can be ef\ufb01ciently computed even\nfor very expressive MRFs. We furthermore amortize this optimization with trained\ninference networks. 
Experimentally, we find that the proposed approach compares favorably with loopy belief propagation, but is faster, and it allows for attaining better held out log likelihood than other recent approximate inference schemes.

1 Introduction

There has been much recent work on learning deep generative models of discrete data, in both the case where all the modeled variables are observed [35, 58, inter alia], and in the case where they are not [37, 36, inter alia]. Most of this recent work has focused on directed graphical models, and, when approximate inference is necessary, on variational inference. Here we instead consider undirected models, that is, Markov Random Fields (MRFs), which we take to be interesting for at least two reasons: first, some data are more naturally modeled using MRFs [25]; second, unlike their directed counterparts, many intractable MRFs of interest admit a learning objective which both approximates the log marginal likelihood and can be computed exactly (i.e., without sampling). In particular, log marginal likelihood approximations that make use of the Bethe free energy (BFE) [4] can be computed in time that effectively scales linearly with the number of factors in the MRF, provided that the factors are of low degree. Indeed, loopy belief propagation (LBP) [33], the classic approach to approximate inference in MRFs, can be viewed as minimizing the BFE [66]. However, while often quite effective, LBP is also an iterative message-passing algorithm, which is less amenable to GPU parallelization and can therefore slow down the training of deep generative models.

To address these shortcomings of LBP in the context of training deep models, we propose to train MRFs by minimizing the BFE directly during learning, without message-passing, using inference networks trained to output approximate minimizers.
This scheme gives rise to a saddle-point learning problem, and we show that learning in this way allows for quickly training MRFs that are competitive with or outperform those trained with LBP.

We also consider the setting where the discrete latent variable model to be learned admits both directed and undirected variants. For example, we might be interested in learning an HMM-like model, but we are free to parameterize the transition factors in a variety of ways, including such that all the transition factors are unnormalized and of low degree (see Figure 1). Such a parameterization makes BFE minimization particularly convenient, and indeed we show that learning such an undirected model with BFE minimization allows for outperforming the directed variant learned with amortized variational inference in terms of both held out log likelihood and speed. Thus, when possible, it may in fact be advantageous to consider transforming a directed model into an undirected variant, and learning it with BFE minimization.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Factor graphs of (a) a full 3rd order HMM, and (b) a 3rd order HMM-like model with only pairwise factors.

2 Background

Let G = (V ∪ F, E) be a factor graph [11, 26], with V the set of variable nodes, F the set of factor nodes, and E the set of undirected edges between elements of V and elements of F; see Figure 1 for examples. We will refer collectively to variables in V that are always observed as x, and to variables which are never observed as z.
We will take all variables to be discrete.

In a Markov Random Field (MRF), the joint distribution over x and z factorizes as P(x, z; θ) = (1/Z(θ)) ∏_α Ψ_α(x_α, z_α; θ), where the notation x_α and z_α is used to denote the (possibly empty) subvectors of x and z that participate in factor Ψ_α, the factors Ψ_α are assumed to be positive and are parameterized by θ, and where Z(θ) is the partition function: Z(θ) = Σ_{x′} Σ_{z′} ∏_α Ψ_α(x′_α, z′_α; θ).

In order to simplify the exposition we will assume all factors are either unary (functions of a single variable in V) or pairwise (functions of two variables in V), and we lose no generality in doing so [67, 60]. Thus, if a node v1 ∈ V may take on one of K1 discrete values, we view a unary factor Ψ_α(x_α, z_α; θ) = Ψ_α(v1; θ) as a function Ψ_α : {1, ..., K1} → R_+. Similarly, if nodes v1 and v2 may take on K1 and K2 discrete values respectively, we view a binary factor Ψ_β(x_β, z_β; θ) = Ψ_β(v1, v2; θ) as a function Ψ_β : {1, ..., K1} × {1, ..., K2} → R_+. It will also be convenient to use the (bolded) notation Ψ_α to refer to the vector of a factor's possible output values (in R_+^{K1} and R_+^{K1·K2} for unary and binary factors, respectively), and the notation |Ψ_α| to refer to the length of this vector. We will consider both scalar and neural parameterizations of factors.

When the model involves unobserved variables, we will also make use of the "clamped" partition function Z(x, θ) = Σ_{z′} ∏_α Ψ_α(x_α, z′_α; θ), with x clamped to a particular value. The clamped partition function gives the unnormalized marginal probability of x, and is the partition function of P(z | x; θ).

2.1 The Bethe Free Energy

Because calculation of Z(θ) or Z(x, θ) may be intractable, maximum likelihood learning of MRFs often makes use of approximations to these quantities. One such approximation makes use of the Bethe free energy (BFE), due to Bethe [4] and popularized by Yedidia et al. [66], which is defined in terms of the factor and node marginals of the corresponding factor graph. In particular, let τ_α(x′_α, z′_α) ∈ [0, 1] be the marginal probability of the event x′_α and z′_α, which are again (possibly empty) settings of the subvectors associated with factor Ψ_α. We will refer to the vector consisting of the concatenation of all possible marginals for each factor in G as τ ∈ [0, 1]^{M(G)}, where M(G) = Σ_{α∈F} |Ψ_α| is the total number of values output by all factors associated with the graph. As a concrete example, consider the 10 factors in Figure 1 (b): if each variable can take on only two possible values, then since each factor is pairwise (i.e., considers only two variables), there are 2² possible settings for each factor, and thus 2² corresponding marginals. In total, we then have 10 × 4 marginals and so τ ∈ [0, 1]^40. Following Yedidia et al.
[67], the BFE is then defined as

F(τ, θ) = Σ_α Σ_{x′_α, z′_α} τ_α(x′_α, z′_α) log [τ_α(x′_α, z′_α) / Ψ_α(x′_α, z′_α)] − Σ_{v∈V} (|ne(v)| − 1) Σ_{v′} τ_v(v′) log τ_v(v′),   (1)

where ne(v) gives the set of factor-neighbors node v has in the factor graph, and τ_v(v′) is the marginal probability of node v taking on the value v′.

Importantly, in the case of a distribution P_θ representable as a tree-structured model, we have min_τ F(τ, θ) = −log Z(θ), since (1) is precisely KL[Q || P_θ] − log Z(θ), where Q is another tree-representable distribution with marginals τ [17, 60, 13]. In the case where P_θ is not tree-structured (i.e., it has a loopy factor graph), we no longer have a KL divergence, and min_τ F(τ, θ) will in general give only an approximation, but not a bound, on the partition function: min_τ F(τ, θ) ≈ −log Z(θ) [60, 65, 61, 62].

Although minimizing the BFE only provides an approximation to −log Z(θ), it is attractive for our purposes because, while the BFE is exponential in the degree of each factor (since it sums over all assignments), it is only linear in the number of factors. Thus, evaluating (1) for a factor graph with a large number of small-degree (e.g., pairwise) factors remains tractable. Moreover, while restricting models to have low-degree factors severely limits the expressiveness of directed graphical models, it does not so limit the expressiveness of MRFs, since MRFs are free to have arbitrary pairwise dependence, as in Figure 1 (b).
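Once pseudo-marginals are in hand, Eq. (1) is cheap to evaluate. The following minimal numpy sketch (function and argument names are ours, not the paper's) computes F(τ, θ) for a discrete factor graph; on a graph consisting of a single pairwise factor (a tree), plugging in the true marginals recovers F = −log Z, as noted above.

```python
import numpy as np

def bethe_free_energy(factors, tau_f, tau_v, neighbors):
    """Evaluate the Bethe free energy of Eq. (1) for a discrete factor graph.

    factors:   dict mapping factor name -> positive potential table Psi_alpha
    tau_f:     dict mapping factor name -> pseudo-marginal table (same shape)
    tau_v:     dict mapping node name   -> node pseudo-marginal vector
    neighbors: dict mapping node name   -> list of factor names touching it
    """
    # factor term: sum_alpha sum_assignments tau_alpha * log(tau_alpha / Psi_alpha)
    f = sum((t * np.log(t / factors[a])).sum() for a, t in tau_f.items())
    # node correction: -sum_v (|ne(v)| - 1) * sum_{v'} tau_v(v') log tau_v(v')
    for v, t in tau_v.items():
        f -= (len(neighbors[v]) - 1) * (t * np.log(t)).sum()
    return f
```

Note that for the single-factor graph each node has exactly one neighbor, so the node-correction term vanishes and the factor term alone gives −log Z at the true marginals.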
Indeed, the idea of establishing complex dependencies through many pairwise factors in an MRF is what underlies product-of-experts style modeling [18].

2.2 Minimizing the Bethe Free Energy

Historically, the BFE has been minimized during learning with loopy belief propagation (LBP) [41, 33]. Yedidia et al. [66] show that the fixed points found by LBP correspond to stationary points of the optimization problem min_{τ∈C} F(τ, θ), where C contains vectors of length M(G), and in particular the concatenation of "pseudo-marginal" vectors τ_α(x_α, z_α) for each factor, subject to each pseudo-marginal vector being positive and summing to 1, and the pseudo-marginal vectors being locally consistent. Local consistency requires that the pseudo-marginal vectors associated with any two factors α, β sharing a variable v agree: Σ_{x′_α, z′_α \ v} τ_α(x′_α, z′_α) = Σ_{x′_β, z′_β \ v} τ_β(x′_β, z′_β); see also Heskes [17]. Note that even if τ satisfies these conditions, for loopy models it may still not correspond to the marginals of any distribution [60].

While LBP is quite effective in practice [33, 38, 67, 34], it does not integrate well with the current GPU-intensive paradigm for training deep generative models, since it is a typically sequential message-passing algorithm (though see Gonzalez et al. [12]), which may require a variable number of iterations and a particular message-passing schedule to converge [10, 13]. We therefore propose to drop the message-passing metaphor, and instead directly minimize the constrained BFE during learning using inference networks [51, 23, 22, 56], which are trained to output approximate minimizers. This style of training gives rise to a saddle-point objective for learning, detailed in the next section.

3 Learning with Amortized Bethe Free Energy Minimization

Consider learning an MRF consisting of only observed variables x via maximum likelihood, which requires minimizing −log P(x; θ) = −log P̃(x; θ) + log Z(θ), where log P̃(x; θ) = Σ_α log Ψ_α(x_α; θ). Using the Bethe approximation to log Z(θ) from the previous section, we then arrive at the objective:

ℓ_F(θ) = −log P̃(x; θ) − min_{τ∈C} F(τ, θ) ≈ −log P̃(x; θ) + log Z(θ),   (2)

and thus the saddle-point learning problem:

min_θ ℓ_F(θ) = min_θ [−log P̃(x; θ) − min_{τ∈C} F(τ, θ)] = min_θ max_{τ∈C} [−log P̃(x; θ) − F(τ, θ)].   (3)

While ℓ_F is neither an upper nor a lower bound on −log P(x; θ), it is an approximation, and indeed its gradients are precisely those that arise from approximating the true gradient of −log P(x; θ) by replacing the factor marginals in the gradient with pseudo-marginals; see Sutton et al. [53].

In the case where our MRF contains unobserved variables z, we wish to learn by minimizing −log Z(x, θ) + log Z(θ). Here we can additionally approximate the clamped partition function −log Z(x, θ) using the BFE.
In particular, we have min_{τ_x∈C_x} F(τ_x, θ) ≈ −log Z(x, θ), where τ_x contains the marginals of the MRF with its observed variables clamped to x (which is equivalent to replacing these variables with unary factors, and so τ_x will in general be smaller than τ). We thus arrive at the following saddle-point learning problem for MRFs with latent variables:

min_θ ℓ_{F,z}(θ) = min_θ [min_{τ_x∈C_x} F(τ_x, θ) − min_{τ∈C} F(τ, θ)] = min_{θ,τ_x} max_{τ∈C} [F(τ_x, θ) − F(τ, θ)].   (4)

3.1 Inference Networks

Optimizing ℓ_F and ℓ_{F,z} requires tackling a constrained, saddle-point optimization problem. While we could in principle optimize over τ or τ_x directly, we found this optimization to be difficult, and we instead follow recent work [51, 23, 22, 56] in replacing optimization over the variables of interest with optimization over the parameters φ of an inference network f(·; φ) outputting the variables of interest. Thus, an inference network consumes a graph G and predicts a pseudo-marginal vector; we provide additional details below.

We also note that because our inference networks consume graphs they are similar to graph neural networks [47, 29, 24, 68, inter alia]. However, because we are interested in being able to quickly learn MRFs, our inference networks do not do any iterative message-passing style updates; they simply consume either a symbolic representation of the graph or, in the "clamped" setting, a symbolic representation of the graph together with the observed variables.
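For intuition about the two log-partition quantities approximated by the F terms in (4), the exact Z(θ) and clamped Z(x, θ) can be computed by brute-force enumeration on toy graphs, which is useful for sanity-checking approximations. The helper names below are our own, and the enumeration is exponential-time, so this is only viable for very small models:

```python
import itertools
import numpy as np

def partition_fn(factors, card):
    """Z(theta): sum over all joint assignments of the product of factors.

    factors: list of (variable-name tuple, potential table) pairs
    card:    dict mapping variable name -> number of discrete values
    """
    names = list(card)
    total = 0.0
    for vals in itertools.product(*(range(card[v]) for v in names)):
        a = dict(zip(names, vals))
        total += np.prod([tab[tuple(a[v] for v in vs)] for vs, tab in factors])
    return total

def clamped_partition_fn(factors, card, x_obs):
    """Z(x, theta): the same sum, with observed variables fixed to x_obs."""
    latent = [v for v in card if v not in x_obs]
    total = 0.0
    for vals in itertools.product(*(range(card[v]) for v in latent)):
        a = {**x_obs, **dict(zip(latent, vals))}
        total += np.prod([tab[tuple(a[v] for v in vs)] for vs, tab in factors])
    return total
```

The ratio Z(x, θ)/Z(θ) then gives the marginal probability of the clamped observation x.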
We provide further details of our inference network parameterizations in Section 4 and in the Supplementary Material.

Handling Constraints on Predicted Marginals. The predicted pseudo-marginals output by our inference network f must respect the positivity, normalization, and local consistency constraints described in Section 2.2. Since the normalization and local consistency constraints are linear equality constraints, it is possible to optimize only in the subspace they define. However, such an approach requires the explicit calculation of a basis for the null space of the constraint matrix, which becomes unwieldy as the graph gets large. We accordingly adopt the much simpler and more scalable approach of handling the positivity and normalization constraints by optimizing over the "softmax basis" (i.e., over logits), and we handle the local consistency constraints by simply adding a term to our objective that penalizes constraint violation [7, 40].

In particular, let f(G, α; φ) ∈ R^{K1·K2} be the vector of scores given by inference network f to all configurations of the variables associated with factor α. We define the predicted factor marginals to be

τ_α(x_α, z_α; φ) = softmax(f(G, α; φ)).   (5)

We obtain predicted node marginals for each node v by averaging all the associated factor-level marginals:

τ_v(v; φ) = (1/|ne(v)|) Σ_{α∈ne(v)} Σ_{x′_α, z′_α \ v} τ_α(x′_α, z′_α; φ).   (6)

We obtain our final learning objective by adding a term penalizing the distance between the marginal associated with node v according to a particular factor, and τ_v(v; φ).
Thus, the optimization problem (3) becomes

min_θ max_φ [−log P̃(x; θ) − F(τ(φ), θ) − (λ/|F|) Σ_{v∈V} Σ_{α∈ne(v)} d(τ_v(v; φ), Σ_{x′_α, z′_α \ v} τ_α(x′_α, z′_α; φ))],   (7)

where d(·, ·) is a non-negative distance or divergence calculated between the marginals (typically L2 distance in experiments), λ is a tuning parameter, and the notation τ(φ) refers to the entire vector of concatenated predicted marginals. We note that the number of penalty terms in (7) scales with |F|, since we penalize agreement with node marginals; an alternative objective that penalizes agreement between factor marginals is possible, but would scale with |F|².

Finally, we note that we can obtain an analogous objective for the latent variable saddle-point problem (4) by introducing an additional inference network f_x, which additionally consumes x, and adding an additional set of penalty terms.

3.2 Learning

We learn by alternating I1 steps of gradient ascent on (7) with respect to φ with one step of gradient descent on (7) with respect to θ. When the MRF contains latent variables, we take I2 gradient descent steps to minimize the objective with respect to φ_x before updating θ. We show pseudo-code describing this procedure for a single minibatch in Algorithm 1.

Algorithm 1 Saddle-point MRF Learning
  for i = 1, ..., I1 do
    Obtain τ(φ) from f(·; φ) using Equations (5) and (6)
    φ ← φ + ∇_φ[−F(τ(φ), θ) − (λ/|F|) Σ_{v∈V} Σ_{α∈ne(v)} d(τ_v(v; φ), Σ_{x′_α, z′_α \ v} τ_α(x′_α, z′_α; φ))]
  if there are latents then
    for i = 1, ..., I2 do
      Obtain τ_x(φ_x) from f_x(·; φ_x) using Equations (5) and (6)
      φ_x ← φ_x − ∇_{φ_x}[F(τ_x(φ_x), θ) + (λ/|F|) Σ_{v∈z} Σ_{α∈ne(v)} d(τ_x(v; φ_x), Σ_{x′_α, z′_α \ v} τ_{x,α}(x′_α, z′_α; φ_x))]
    θ ← θ − ∇_θ[F(τ_x(φ_x), θ) − F(τ(φ), θ)]
  else
    θ ← θ − ∇_θ[−log P̃(x; θ) − F(τ(φ), θ)]

Before moving on to experiments we emphasize two of the attractive features of the learning scheme described in (7) and Algorithm 1, which we verify empirically in the next section. First, because there is no message-passing and because minimization with respect to the τ and τ_x pseudo-marginals is amortized using inference networks, we are often able to reap the benefits of training MRFs with LBP but much more quickly. Second, we emphasize that the objective (7) and its gradients can be calculated exactly, which stands in contrast to much recent work in variational inference for both directed models [43, 23] and undirected models [27], where the ELBO and its gradients must be approximated with sampling.
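The pseudo-marginal machinery of Equations (5)–(7) can be sketched concretely in numpy: map inference-network scores to factor pseudo-marginals via a softmax, average them into node marginals, and penalize local-consistency violations, here taking d to be squared L2 distance. The names and the `axis_of` bookkeeping are our own illustrative conventions, not the paper's:

```python
import numpy as np

def softmax(s):
    """Softmax over all entries of a factor's score table (Eq. 5)."""
    e = np.exp(s - s.max())
    return e / e.sum()

def predicted_marginals(logits):
    """Eq. (5): one pseudo-marginal table per factor, from raw scores."""
    return {a: softmax(s) for a, s in logits.items()}

def node_marginals(tau_f, neighbors, axis_of):
    """Eq. (6): average each node's marginal over its factor neighbors.

    axis_of[(v, a)] gives the axis of factor a's table corresponding to node v.
    """
    tau_v = {}
    for v, facs in neighbors.items():
        margs = []
        for a in facs:
            t = tau_f[a]
            keep = axis_of[(v, a)]
            # sum out every other variable participating in the factor
            margs.append(t.sum(axis=tuple(i for i in range(t.ndim) if i != keep)))
        tau_v[v] = np.mean(margs, axis=0)
    return tau_v

def consistency_penalty(tau_f, tau_v, neighbors, axis_of):
    """Local-consistency penalty from Eq. (7), with d = squared L2 distance."""
    pen = 0.0
    for v, facs in neighbors.items():
        for a in facs:
            t = tau_f[a]
            keep = axis_of[(v, a)]
            m = t.sum(axis=tuple(i for i in range(t.ndim) if i != keep))
            pen += ((tau_v[v] - m) ** 2).sum()
    return pen
```

When every factor's pseudo-marginals already agree on the shared nodes (e.g., uniform tables), the penalty is zero, so gradient ascent on (7) only pays a cost for inconsistent predictions.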
As the variance of ELBO gradient estimators is known to be an issue when learning models with discrete latent variables [37], if it is possible to develop undirected analogs of the models of interest it may be beneficial to do so, and then learn these models with the ℓ_F or ℓ_{F,z} objectives, rather than approximating the ELBO. We consider one such case in the next section.

4 Experiments

Our experiments are designed to verify that amortizing BFE minimization is an effective way of performing inference, that it allows for learning models that generalize, and that we can do this quickly. We accordingly consider learning and performing inference on three different kinds of popular MRFs, comparing amortized BFE minimization with standard baselines. We provide additional experimental details in the Supplementary Material, and code for duplicating experiments is available at https://github.com/swiseman/bethe-min.

4.1 Ising Models

We first study our approach as applied to Ising models. An n × n grid Ising model gives rise to a distribution over binary vectors x ∈ {−1, 1}^{n²} via the following parameterization: P(x; θ) = (1/Z(θ)) exp(Σ_{(i,j)∈E} J_ij x_i x_j + Σ_{i∈V} h_i x_i), where the J_ij are the pairwise log-potentials and the h_i are the node log-potentials. The generative model parameters are thus given by θ = {J_ij}_{(i,j)∈E} ∪ {h_i}_{i∈V}. While Ising models are conceptually simple, they are in fact quite general, since any binary pairwise MRF can be transformed into an equivalent Ising model [50].

In these experiments, we are interested in quantifying how well we can approximate the true marginal distributions with approximate marginal distributions obtained from the inference network.
We therefore experiment with model sizes for which exact inference is reasonably fast on modern hardware (up to 15 × 15).¹

Our inference network associates a learnable embedding vector e_i with each node and applies a single Transformer layer [59] to obtain a new node representation h_i, with [h_1, ..., h_{n²}] = Transformer([e_1, ..., e_{n²}]). The distribution over x_i, x_j for (i, j) ∈ E is given by concatenating h_i, h_j and applying an affine layer followed by a softmax: τ_ij(x_i, x_j; φ) = softmax(W[h_i; h_j] + b). The parameters φ of the inference network are given by the node embeddings and the parameters of the Transformer/affine layers. The node marginals τ_i(x_i; φ) are then obtained by averaging the pairwise factor marginals (Eq (6)).²

¹The calculation of the partition function in grid Ising models is exponential in n, but it is possible to reduce the running time from O(2^{n²}) to O(2^n) with dynamic programming (i.e., variable elimination).

Table 1: Correlation and mean L1 distance between the true vs. approximated marginals for the various methods.

           Correlation                              Mean L1 distance
 n    Mean Field  Loopy BP  Inference Network   Mean Field  Loopy BP  Inference Network
 5    0.835       0.950     0.988               0.128       0.057     0.032
 10   0.854       0.946     0.984               0.123       0.064     0.037
 15   0.833       0.942     0.981               0.132       0.065     0.040

Figure 2: For each method, we plot the approximate marginals (x-axis) against the true marginals (y-axis) for a 15 × 15 Ising model. Top shows the node marginals while bottom shows the pairwise factor marginals, and ρ denotes the Pearson correlation coefficient.

We first examine whether minimizing the BFE with an inference network gives rise to reasonable marginal distributions.
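The paper's exact baselines use the O(2^n) variable-elimination dynamic program mentioned in footnote 1; for very small n, exact node marginals can also be obtained by naive O(2^{n²}) enumeration, which is a convenient reference when checking approximate marginals. A sketch (names ours, arbitrary edge sets rather than grids):

```python
import itertools
import numpy as np

def ising_node_marginals(J, h):
    """Exact node marginals P(x_i = +1) of a small Ising model by enumeration.

    J: dict mapping edge (i, j) -> pairwise log-potential J_ij
    h: array of node log-potentials h_i
    Enumerates all 2^n configurations, so only viable for small n.
    """
    n = len(h)
    logw, configs = [], []
    for x in itertools.product([-1.0, 1.0], repeat=n):
        x = np.array(x)
        logw.append(sum(Jij * x[i] * x[j] for (i, j), Jij in J.items()) + h @ x)
        configs.append(x)
    w = np.exp(np.array(logw) - max(logw))  # subtract max for stability
    p = w / w.sum()
    X = np.stack(configs)
    return p @ (X == 1)  # P(x_i = +1) for each node i
```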
Concretely, for a fixed θ (sampled from a spherical Gaussian with unit variance), we minimize F(τ(φ), θ) (Eq (1)) with respect to φ, where τ(φ) denotes the full vector of marginal distributions obtained from the inference network. Table 1 shows the correlation and the mean L1 distance between the true marginals and the approximated marginals, where the numbers are averaged over 100 samples of θ. We find that, compared to the approximate marginals obtained from mean field and LBP, the inference network produces marginal distributions that are more accurate. Figure 2 shows a scatter plot of approximate marginals (x-axis) against the true marginals (y-axis) for a randomly sampled 15 × 15 Ising model. Interestingly, we observe that both loopy belief propagation and the inference network produce accurate node marginals (top), but the pairwise factor marginals from the inference network are much better (bottom). We find that this trend holds for Ising models with greater pairwise interaction strength as well; see the additional experiments in the Supplementary Material, where pairwise potentials are sampled from N(0, 3) and N(0, 5).

In Table 2 we show results from learning the generative model alongside the inference network. For a randomly generated Ising model, we obtain 1000 samples each for train, validation, and test sets, using a version of the forward-filtering backward-sampling algorithm to obtain exact samples in O(2^n). We then train a (randomly-initialized) Ising model via the saddle-point learning problem in Eq (7). While models trained with exact inference perform best, models trained with an inference network's approximation to log Z(θ) perform almost as well, and outperform both those trained with mean field and even those trained with LBP.
See the Supplementary Material for additional details.

Table 2: Held out NLL of learned Ising models. True Entropy refers to the NLL under the true model (i.e., E_{P(x;θ)}[−log P(x; θ)]), and "Exact" refers to an Ising model trained with the exact partition function.

 n    True Entropy  Rand. Init.  Mean Field  Loopy BP  Inference Network  Exact
 5    6.27          45.62        7.35        7.17      6.47               6.30
 10   25.76         162.53       29.70       28.34     26.80              25.89
 15   51.80         365.36       60.03       59.79     54.91              52.24

Table 3: Held out average NLL of learned RBMs, as estimated by AIS [46]. Neural Variational Inference results are taken from Kuleshov and Ermon [27].

                                     NLL      ℓ_F     Epochs to Converge  Seconds/Epoch
 Loopy BP                            25.47    53.02   8                   21617
 Inference Network                   23.43    23.11   38                  14
 PCD                                 21.24    N/A     29                  1
 Neural Variational Inference [27]   ≥ 24.5

4.2 Restricted Boltzmann Machines (RBMs)

We next consider learning Restricted Boltzmann Machines [49], a classic MRF model with latent variables. A binary RBM parameterizes the joint distribution over observed variables x ∈ {0, 1}^V and latent variables z ∈ {0, 1}^H as P(x, z; θ) = (1/Z(θ)) exp(xᵀWz + xᵀb + zᵀa). Thus, there is a pairwise factor for each (x_i, z_j) pair, and a unary factor for each x_i and z_j.

²As there are no latent variables in these experiments, inference via the inference network is not amortized in the traditional sense (i.e., across different data points as in Eq (4)) since it does not condition on x. However, inference is still amortized across each optimization step, and thus we still consider this to be an instance of amortized inference.

It is standard when learning RBMs to marginalize out the latent variables, which can be done tractably due to the structure of the model, and so we may train with the objective in (7).
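The tractable marginalization just mentioned works because, given x, an RBM's hidden units are conditionally independent, so the sum over z factorizes into a product over hidden units: log P̃(x) = xᵀb + Σ_j log(1 + exp(xᵀW_{:,j} + a_j)). A sketch (function name ours):

```python
import numpy as np

def rbm_unnormalized_loglik(x, W, b, a):
    """log of the unnormalized marginal P~(x) of a binary RBM, with the
    hidden units summed out analytically:
        log P~(x) = x^T b + sum_j log(1 + exp(x^T W[:, j] + a[j])).
    logaddexp(0, s) computes log(1 + e^s) stably."""
    return x @ b + np.logaddexp(0.0, x @ W + a).sum()
```

This quantity plays the role of log P̃(x; θ) in the fully-observed objective (7).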
Our inference network is similar to that used in our Ising model experiments: we associate a learnable embedding vector with each node in the model, which we concatenate with an embedding corresponding to an indicator feature for whether the node is in x or z. These V + H embeddings are then consumed by a bidirectional LSTM [20, 15], which outputs vectors h_{x,i} and h_{z,j} for each node.³ Finally, we obtain τ_ij(x_i, z_j; φ) = softmax(MLP([h_{x,i}; h_{z,j}])). We set the d(·, ·) penalty function to be the KL divergence, which worked slightly better than L2 distance in preliminary experiments.

We follow the experimental setting of Kuleshov and Ermon [27], who recently introduced a neural variational approach to learning MRFs, and train RBMs with 100 hidden units on the UCI digits dataset [1], which consists of 8 × 8 images of digits. We compare with persistent contrastive divergence (PCD) [54] and LBP, as well as with the best results reported in Kuleshov and Ermon [27].⁴ We used a batch size of 32, and selected hyperparameters through random search, monitoring validation expected pseudo-likelihood [3] for all models; see the Supplementary Material.

Table 3 reports the held out average NLL as estimated with annealed importance sampling (AIS) [39, 46], using 10 chains and 10³ intermediate distributions; it also reports average seconds per epoch, rounded to the nearest second.⁵ We see that while amortized BFE minimization is able to outperform all results except PCD, it does lag behind PCD. These results are consistent with previous claims in the literature [46] that LBP and its variants do not work well on RBMs. Amortizing BFE minimization does, however, again outperform LBP.
We also emphasize that PCD relies on being able to do fast\nblock Gibbs updates during learning, which will not be available in general, whereas amortized BFE\nminimization has no such requirement.\n\n4.3 High-order HMMs\n\nFinally, we consider a scenario where both Z(\u03b8) and Z(x, \u03b8) must be approximated, namely, that of\nlearning 3rd order neural HMMs [55] (as in Figure 1) with approximate inference. We consider this\nsetting in particular because it allows for the use of dynamic programs to compare the true NLL at-\ntained when learning with approximate inference. However, because these dynamic programs scale as\nO(T K L+1), where T, L, K are the sequence length, Markov order, and number of latent state values,\nrespectively, considering even higher-order models becomes dif\ufb01cult. A standard 3rd order neural\nHMM parameterizes the joint distribution over observed sequence x\u2208{1, . . . , V }T and latent se-\nquence z\u2208{1, . . . , K}T as P (x, z; \u03b8) = 1\nt=1 log \u03a8t,1(zt\u22123:t; \u03b8) + log \u03a8t,2(zt, xt; \u03b8)).\n\nZ(\u03b8) exp((cid:80)T\n\n3We found LSTMs to work somewhat better than Transformers for both the RBM and HMM experiments.\n4The corresponding NLL number reported in Table 3 is derived from a \ufb01gure in Kuleshov and Ermon [27].\n5While it is dif\ufb01cult to exactly compare the speed of different learning algorithms, speed results were\n\nmeasured on the same 1080 Ti GPU, averaged over 10 epochs, and used our fastest implementations.\n\n7\n\n\fDirected 3rd Order HMMs To further motivate the results of this section let us begin by\nconsidering using approximate inference techniques to learn directed 3rd order neural HMMs,\nwhich are obtained by having each factor output a normalized distribution.\nIn particular, we\nde\ufb01ne the emission distribution \u03a8t,2(zt=k, xt; \u03b8) = softmax(W LayerNorm(ek + MLP(ek))),\nwhere ek \u2208 Rd is an embedding corresponding to the k\u2019th discrete value zt can take 
on, W ∈ R^{V×d} is a word embedding matrix with a row for each word in the vocabulary, and layer normalization [2] is used to stabilize training. We also define the transition distribution Ψ_{t,1}(z_t, z_{t−1} = k_1, z_{t−2} = k_2) = softmax(U LayerNorm([e_{k_1}; e_{k_2}] + MLP([e_{k_1}; e_{k_2}]))), where U ∈ R^{K×2d} and the e_k are shared with the emission parameterization.

We now consider learning a K = 30 state 3rd order directed neural HMM on sentences from the Penn Treebank [32] (using the standard splits and preprocessing of Mikolov et al. [35]) of length at most 30. The top part of Table 4 compares the average NLL on the validation set obtained by learning such an HMM with exact inference against learning it with several variants of discrete VAE [43, 23] and the REINFORCE [64] gradient estimator. In particular, we consider two inference network architectures:

• Mean Field: we obtain approximate posteriors q(z_t | x_{1:T}) for each timestep t as softmax(Q LayerNorm(e_{x_t} + h_t)), where h_t ∈ R^{d_2} is the output of a bidirectional LSTM [19, 15] run over the observations x_{1:T}, e_{x_t} is the embedding of token x_t, and Q ∈ R^{K×d_2}.

• 1st Order: instead of assuming the approximate posterior q(z_{1:T} | x_{1:T}) factorizes independently over timesteps, we assume it is given by the posterior of a first-order (and thus more tractable) HMM. We parameterize this inference HMM identically to the neural directed HMM above, except that it conditions on the observed sequence x_{1:T} by concatenating the averaged hidden states of a bidirectional LSTM run over the sequence onto the e_k.

For the mean field architecture we consider optimizing either the ELBO with the REINFORCE gradient estimator together with an input-dependent baseline [37] for variance reduction, or the corresponding 10-sample IWAE objective [5].
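The score-function (REINFORCE) estimator with a baseline can be sketched for a single timestep's mean-field categorical posterior as follows. This is a generic illustration under stated assumptions, not the paper's implementation: `log_joint` is a hypothetical stand-in for the model's log p(x, z; θ) contribution, and `baseline` stands in for the learned input-dependent baseline b(x):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                          # number of latent states
logits = rng.normal(size=K)    # inference-network outputs for one timestep
q = np.exp(logits - logits.max())
q /= q.sum()                   # mean-field posterior q(z_t | x_{1:T})

def log_joint(z):
    # Hypothetical stand-in for the model's log p(x, z; theta) term.
    return -0.5 * z

baseline = -1.0                # stand-in for the input-dependent baseline b(x)

# One-sample REINFORCE estimate of d ELBO / d logits:
#   (log p(x, z) - log q(z) - b(x)) * d log q(z) / d logits.
# Subtracting b(x) leaves the estimator unbiased, since E_q[d log q] = 0.
z = rng.choice(K, p=q)
learning_signal = log_joint(z) - np.log(q[z]) - baseline
onehot = np.eye(K)[z]
grad_logits = learning_signal * (onehot - q)   # d log q(z)/d logits for softmax
```

Controlling the variance of `learning_signal` is exactly the difficulty the paper attributes to these estimators; the baseline shifts the signal without changing its expectation.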
When the 1st Order HMM inference network is used, we sample from it exactly using quantities calculated with the forward algorithm [42, 6, 48, 69]. We provide more details in the Supplementary Material.

As the top of Table 4 shows, exact inference significantly outperforms the approximate methods, perhaps due to the difficulty of controlling the variance of the ELBO gradient estimators.

Undirected 3rd Order HMMs  An alternative to learning a 3rd order HMM with variational inference, then, is to consider an analogous undirected model, which can be learned using BFE approximations, and therefore requires no sampling. In particular, we will consider the 3rd order undirected product-of-experts style HMM in Figure 1 (b), which contains only pairwise factors, and parameterizes the joint distribution of x and z as P(x, z; θ) = (1/Z(θ)) exp( Σ_{t=1}^T Σ_{s=max(t−3,1)}^{t−1} log Ψ_{t,1,s}(z_s, z_t; θ) + Σ_{t=1}^T log Ψ_{t,2}(z_t, x_t; θ) ). Note that while this variant captures only a subset of the distributions that can be represented by the full parameterization (Figure 1 (a)), it still captures 3rd order dependencies using pairwise factors.

In our undirected parameterization the transition factors Ψ_{t,1,s} are homogeneous (i.e., independent of the timestep) in order to allow for a fair comparison with the standard directed HMM, and are given by r_{k_2}^⊤ LayerNorm([a_{|t−s|}; e_{k_1}] + MLP([a_{|t−s|}; e_{k_1}])), where a_{|t−s|} is the embedding vector corresponding to factors relating two nodes that are |t − s| steps apart, and where e_{k_1} and r_{k_2} are again discrete state embedding vectors. The emission factors Ψ_{t,2} are those used in the directed case.

We train inference networks f and f_x to output pseudo-marginals τ and τ^x as in Algorithm 1, using I_1 = 1 and I_2 = 1 gradient updates per minibatch.
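For a pairwise MRF, the Bethe free energy that such inference networks are trained to minimize can be evaluated in closed form from edge and node pseudo-marginals. A minimal numpy sketch for the pairwise, unary-free case, on a toy 3-node chain; brute-force marginals stand in for the network outputs, and because the chain is a tree the BFE at the exact marginals equals −log Z, which the example checks:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# A 3-node chain MRF with pairwise log-potentials only (a tree, so the
# Bethe free energy is exact at the true marginals).
K = 4                                    # states per node
edges = [(0, 1), (1, 2)]
log_psi = {e: rng.normal(size=(K, K)) for e in edges}

# Brute-force Z and marginals (feasible only for tiny models; the paper's
# point is to avoid this by optimizing the BFE over pseudo-marginals).
configs = list(product(range(K), repeat=3))
scores = np.array([sum(log_psi[(i, j)][c[i], c[j]] for i, j in edges)
                   for c in configs])
logZ = np.logaddexp.reduce(scores)
p = np.exp(scores - logZ)

tau_edge = {}
for (i, j) in edges:
    m = np.zeros((K, K))
    for c, pc in zip(configs, p):
        m[c[i], c[j]] += pc
    tau_edge[(i, j)] = m
tau_node = []
for i in range(3):
    t = np.zeros(K)
    for c, pc in zip(configs, p):
        t[c[i]] += pc
    tau_node.append(t)

def bethe_free_energy(tau_edge, tau_node, log_psi, edges, n_nodes):
    """Bethe free energy: edge energies and entropies, plus the node
    entropy correction weighted by (1 - degree)."""
    deg = np.zeros(n_nodes)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    F = 0.0
    for e in edges:
        t = tau_edge[e]
        F += (t * (np.log(t) - log_psi[e])).sum()    # energy + edge entropy
    for i in range(n_nodes):
        t = tau_node[i]
        F += (1 - deg[i]) * (t * np.log(t)).sum()    # Bethe entropy correction
    return F

F = bethe_free_energy(tau_edge, tau_node, log_psi, edges, 3)
```

On loopy graphs the same formula is only an approximation to −log Z, but it remains cheap to evaluate: the cost scales with the number of (low-degree) factors, which is what makes it an attractive training objective.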
Because Z(θ) and Z(x, θ) depend only on the latent variables (since factors involving the x_t remain locally normalized), f and f_x are bidirectional LSTMs consuming embeddings corresponding to the z_t, where f_x also consumes x. In particular, f_x is almost identical to the mean field inference network described above, except it additionally consumes an embedding for the current node (as did the RBM and Ising model inference networks) and an embedding indicating the total number of nodes in the graph. The inference network f producing unclamped pseudo-marginals is identical, except it does not consume x. As the bottom of Table 4 shows, this amortized approach manages to outperform all the VAE variants both in terms of held-out average NLL and speed. It performs less well than true LBP, but is significantly faster.

Table 4: Average NLL of 3rd Order HMM variants learned with approximate and exact inference.

                                   NLL    −ELBO/ℓ_{F,z}   Epochs to Converge   Seconds/Epoch
Directed    Exact                 105.66      105.66              20                 137
            Mean-Field VAE + BL   119.27      175.46              14                  82
            Mean-Field IWAE-10    119.20      167.71               5                 876
            1st Order HMM VAE     118.35      118.88              12                 187
Undirected  Exact                 104.07      104.07              20                 122
            LBP                   108.74       99.89              20                 247
            Inference Network     115.86      114.75              11                  70

5 Related Work

Using neural networks to perform approximate inference is a popular way to learn deep generative models, leading to a family of models called variational autoencoders [23, 44, 37]. However, such methods have generally been employed in the context of learning directed graphical models.
Moreover, applying amortized inference to learn discrete latent variable models has proved challenging due to potentially high-variance gradient estimators that arise from sampling, though there have been some recent advances [21, 31, 57, 14].

Outside of directed models, several researchers have proposed to incorporate deep networks directly into message-passing inference operations, mostly in the context of computer vision applications. Heess et al. [16] and Lin et al. [30] train neural networks that learn to map input messages to output messages, while inference machines [45, 9] also directly estimate messages from inputs. In contrast, Li and Zemel [28] and Dai et al. [8] instead approximate iterations of mean field inference with neural networks.

Closely related to our work, Yoon et al. [68] employ a deep network over an underlying graphical model to obtain node-level marginal distributions. However, their inference network is trained against the true marginal distribution (i.e., not the Bethe free energy, as in the present work), and is therefore not applicable to settings where exact inference is intractable (e.g., RBMs). Also related is the earlier work of Welling and Teh [63], who consider direct (but unamortized) minimization of the BFE, though only for inference and not learning. Finally, Kuleshov and Ermon [27] also learn undirected models via a variational objective, cast as an upper bound on the partition function.

6 Conclusion

We have presented an approach to learning MRFs which amortizes the minimization of the Bethe free energy by training inference networks to output approximate minimizers. This approach allows for learning models that are competitive with loopy belief propagation and other approximate inference schemes, and yet takes less time to train.

Acknowledgments

We are grateful to Alexander M.
Rush and Justin Chiu for insightful conversations and suggestions. YK is supported by a Google AI PhD Fellowship.

References

[1] Fevzi Alimoglu, Ethem Alpaydin, and Yagmur Denizhan. Combining multiple classifiers for pen-based handwritten digit recognition. 1996.

[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[3] Julian Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 24(3):179–195, 1975.

[4] Hans A Bethe. Statistical theory of superlattices. Proceedings of the Royal Society of London. Series A-Mathematical and Physical Sciences, 150(871):552–575, 1935.

[5] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. In Proceedings of ICLR, 2015.

[6] Siddhartha Chib. Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics, 75(1):79–97, 1996.

[7] Richard Courant et al. Variational methods for the solution of problems of equilibrium and vibrations. Bull. Amer. Math. Soc., 49(1):1–23, 1943.

[8] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of ICML, 2016.

[9] Zhiwei Deng, Arash Vahdat, Hexiang Hu, and Greg Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In Proceedings of CVPR, 2016.

[10] Gal Elidan, Ian McGraw, and Daphne Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. arXiv preprint arXiv:1206.6837, 2012.

[11] Brendan J Frey, Frank R Kschischang, Hans-Andrea Loeliger, and Niclas Wiberg. Factor graphs and algorithms. In Proceedings of the Annual Allerton Conference on Communication Control and Computing, volume 35, pages 666–680.
University of Illinois, 1997.

[12] Joseph Gonzalez, Yucheng Low, and Carlos Guestrin. Residual splash for optimally parallelizing belief propagation. In Artificial Intelligence and Statistics, pages 177–184, 2009.

[13] Matthew Gormley and Jason Eisner. Structured belief propagation for NLP. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Tutorials, pages 9–10, 2014.

[14] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the Void: Optimizing Control Variates for Black-box Gradient Estimation. In Proceedings of ICLR, 2018.

[15] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 273–278. IEEE, 2013.

[16] Nicolas Heess, Daniel Tarlow, and John Winn. Learning to pass expectation propagation messages. In Proceedings of NIPS, 2013.

[17] Tom Heskes. Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In Advances in Neural Information Processing Systems, pages 359–366, 2003.

[18] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[19] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9:1735–1780, 1997.

[20] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

[21] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR, 2017.

[22] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[23] Diederik P. Kingma and Max Welling.
Auto-Encoding Variational Bayes. In Proceedings of ICLR, 2014.

[24] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[25] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[26] Frank R Kschischang, Brendan J Frey, Hans-Andrea Loeliger, et al. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

[27] Volodymyr Kuleshov and Stefano Ermon. Neural variational inference and learning in undirected graphical models. In Advances in Neural Information Processing Systems, pages 6734–6743, 2017.

[28] Yujia Li and Richard Zemel. Mean-field networks. arXiv preprint arXiv:1410.5884, 2014.

[29] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[30] Guosheng Lin, Chunhua Shen, Ian Reid, and Anton van den Hengel. Deeply learning the messages in message passing inference. In Proceedings of NIPS, 2015.

[31] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR, 2017.

[32] Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330, 1993.

[33] Robert J McEliece, David JC MacKay, and Jung-Fu Cheng. Turbo decoding as an instance of Pearl's “belief propagation” algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140–152, 1998.

[34] Ofer Meshi, Ariel Jaimovich, Amir Globerson, and Nir Friedman.
Convexifying the Bethe free energy. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 402–410. AUAI Press, 2009.

[35] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Cernocky. Empirical Evaluation and Combination of Advanced Language Modeling Techniques. In Proceedings of INTERSPEECH, 2011.

[36] Andriy Mnih and Danilo J. Rezende. Variational Inference for Monte Carlo Objectives. In Proceedings of ICML, 2016.

[37] Andriy Mnih and Karol Gregor. Neural Variational Inference and Learning in Belief Networks. In Proceedings of ICML, 2014.

[38] Kevin P Murphy, Yair Weiss, and Michael I Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467–475. Morgan Kaufmann Publishers Inc., 1999.

[39] Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[40] Jorge Nocedal and Stephen J Wright. Numerical Optimization, second edition. Springer, 2006.

[41] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3):241–288, 1986.

[42] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[43] Danilo J. Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. In Proceedings of ICML, 2015.

[44] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of ICML, 2014.

[45] Stephane Ross, Geoffrey J. Gordon, and Drew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of AISTATS, 2011.

[46] Ruslan Salakhutdinov and Iain Murray.
On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879. ACM, 2008.

[47] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[48] Steven L Scott. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97(457):337–351, 2002.

[49] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.

[50] David Sontag. Cutting plane algorithms for variational inference in graphical models. Technical report, MIT, 2007.

[51] Vivek Srikumar, Gourab Kundu, and Dan Roth. On amortizing inference cost for structured prediction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1114–1124. Association for Computational Linguistics, 2012.

[52] Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure. In Proceedings of AISTATS, 2011.

[53] Charles Sutton, Andrew McCallum, et al. An Introduction to Conditional Random Fields. Foundations and Trends® in Machine Learning, 4(4):267–373, 2012.

[54] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064–1071. ACM, 2008.

[55] Ke Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. Unsupervised Neural Hidden Markov Models.
In Proceedings of the Workshop on Structured Prediction for NLP, 2016.

[56] Lifu Tu and Kevin Gimpel. Learning approximate inference networks for structured prediction. In Proceedings of ICLR, 2018.

[57] George Tucker, Andriy Mnih, Chris J. Maddison, Dieterich Lawson, and Jascha Sohl-Dickstein. REBAR: Low-variance, Unbiased Gradient Estimates for Discrete Latent Variable Models. In Proceedings of NIPS, 2017.

[58] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel Recurrent Neural Networks. In Proceedings of ICML, 2016.

[59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Proceedings of NIPS, 2017.

[60] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.

[61] Adrian Weller and Tony Jebara. Approximating the Bethe partition function. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 858–867. AUAI Press, 2014.

[62] Adrian Weller, Kui Tang, David Sontag, and Tony Jebara. Understanding the Bethe approximation: when and how can it go wrong? In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 868–877. AUAI Press, 2014.

[63] Max Welling and Yee Whye Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 554–561. Morgan Kaufmann Publishers Inc., 2001.

[64] Ronald J. Williams. Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 1992.

[65] Alan S Willsky, Erik B Sudderth, and Martin J Wainwright. Loop series and Bethe variational bounds in attractive graphical models.
In Advances in Neural Information Processing Systems, pages 1425–1432, 2008.

[66] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems, pages 689–695, 2001.

[67] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, 8:236–239, 2003.

[68] KiJung Yoon, Renjie Liao, Yuwen Xiong, Lisa Zhang, Ethan Fetaya, Raquel Urtasun, Richard Zemel, and Xaq Pitkow. Inference in probabilistic graphical models by graph neural networks. arXiv preprint arXiv:1803.07710, 2018.

[69] Walter Zucchini, Iain L MacDonald, and Roland Langrock. Hidden Markov Models for Time Series: An Introduction Using R. Chapman and Hall/CRC, 2016.