{"title": "Faithful Inversion of Generative Models for Effective Amortized Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 3070, "page_last": 3080, "abstract": "Inference amortization methods share information across multiple posterior-inference problems, allowing each to be carried out more efficiently. Generally, they require the inversion of the dependency structure in the generative model, as the modeller must learn a mapping from observations to distributions approximating the posterior. Previous approaches have involved inverting the dependency structure in a heuristic way that fails to capture these dependencies correctly, thereby limiting the achievable accuracy of the resulting approximations. We introduce an algorithm for faithfully, and minimally, inverting the graphical model structure of any generative model. Such inverses have two crucial properties: (a) they do not encode any independence assertions that are absent from the model and; (b) they are local maxima for the number of true independencies encoded. We prove the correctness of our approach and empirically show that the resulting minimally faithful inverses lead to better inference amortization than existing heuristic approaches.", "full_text": "Faithful Inversion of Generative Models\n\nfor Effective Amortized Inference\n\nStefan Webb\u2217\n\nUniversity of Oxford\n\nAdam Goli\u00b4nski\n\nUniversity of Oxford\n\nRobert Zinkov\n\nUBC\n\nN. Siddharth\n\nUniversity of Oxford\n\nTom Rainforth\n\nUniversity of Oxford\n\nYee Whye Teh\n\nUniversity of Oxford\n\nFrank Wood\n\nUBC\n\nAbstract\n\nInference amortization methods share information across multiple posterior-\ninference problems, allowing each to be carried out more ef\ufb01ciently. Generally, they\nrequire the inversion of the dependency structure in the generative model, as the\nmodeller must learn a mapping from observations to distributions approximating\nthe posterior. 
Previous approaches have involved inverting the dependency structure in a heuristic way that fails to capture these dependencies correctly, thereby limiting the achievable accuracy of the resulting approximations. We introduce an algorithm for faithfully, and minimally, inverting the graphical model structure of any generative model. Such inverses have two crucial properties: (a) they do not encode any independence assertions that are absent from the model; and (b) they are local maxima for the number of true independencies encoded. We prove the correctness of our approach and empirically show that the resulting minimally faithful inverses lead to better inference amortization than existing heuristic approaches.

1 Introduction

Evidence from human cognition suggests that the brain reuses the results of past inferences to speed up subsequent related queries (Gershman & Goodman, 2014). In the context of Bayesian statistics, it is reasonable to expect that, given a generative model, p(x, z), over data x and latent variables z, inference on p(z | x1) is informative about inference on p(z | x2) for two related inputs, x1 and x2. Several algorithms (Kingma & Welling, 2014; Rezende et al., 2014; Stuhlmüller et al., 2013; Paige & Wood, 2016; Le et al., 2017, 2018; Maddison et al., 2017a; Naesseth et al., 2018) have been developed with this insight to perform amortized inference by learning an inference artefact q(z | x), which takes as input the values of the observed variables and—typically with the use of neural network architectures—returns a distribution over the latent variables approximating the posterior. 
These inference artefacts are known variously as inference networks, recognition models,\nprobabilistic encoders, and guide programs; we will adopt the term inference networks throughout.\nAlong with conventional \ufb01xed-model settings (Stuhlm\u00fcller et al., 2013; Le et al., 2017; Ritchie et al.,\n2016; Paige & Wood, 2016), a common application of inference amortization is in the training of\nvariational auto-encoders (VAEs) (Kingma & Welling, 2014), for which the inference network is\nsimultaneously learned alongside a generative model. It is well documented that de\ufb01ciencies in the\nexpressiveness or training of the inference network can also have a knock-on effect on the learned\ngenerative model in such contexts (Burda et al., 2016; Cremer et al., 2017, 2018; Rainforth et al.,\n2018), meaning that poorly chosen coarse-grained structures can be particularly damaging.\nImplicit in the factorization of the generative model and inference network in both \ufb01xed and learned\nmodel settings are probabilistic graphical models, commonly Bayesian networks (BNs), encoding\ndependency structures. We refer to these as the coarse-grain structure, in opposition to the \ufb01ne-grain\nstructure of the neural networks that form each inference (and generative) network factor. 
In this\nsense, amortized inference can be framed as the problem of graphical model inversion\u2014how to invert\nthe graphical model of the generative model to give a graphical model approximating the posterior.\n\n\u2217Correspondence to info@stefanwebb.me\n\n\fMany models from the deep generative modeling literature can be represented as BNs (Krishnan\net al., 2017; Gan et al., 2015; Neal, 1990; Kingma & Welling, 2014; Germain et al., 2015; van den\nOord et al., 2016b,a), and fall within this framework.\nIn this paper, we borrow ideas from the probabilistic graphical models literature, to address the previ-\nously open problem of how best to automate the design of the coarse-grain structure of the inference\nnetwork (Ritchie et al., 2016). Typically, the inverse graphical model is formed heuristically. At the\nsimplest level, some methods just invert the edges in the BN for the generative model, removing edges\nbetween observed variables (Kingma & Welling, 2014; Gan et al., 2015; Ranganath et al., 2015). In a\nmore principled, but still heuristic, approach, Stuhlm\u00fcller et al. (2013); Paige & Wood (2016) con-\nstruct the inference network by inverting the edges and additionally connecting the parents of children\nin the original graph (both of which are a subset of a variable\u2019s Markov blanket; see Appendix C).\nIn general, these heuristic methods introduce conditional inde-\npendencies into the inference network that are not present in the\noriginal distribution. Consequently, they cannot represent the\ntrue posterior even in the limit of in\ufb01nite neural network capaci-\nties. Take the simple generative model with branching structure\nof Figure 1a. The inference network formed by Stuhlm\u00fcller\u2019s\nmethod inverts the edges of the model as in Figure 1b. However,\nan inference network that is able to represent the true posterior\nrequires extra edges between the branches, as in Figure 1c.\nAnother approach, taken by Le et al. 
(2017), is to use a fully\nconnected BN for the inverse graphical model, such that every\nrandom choice made by the inference network depends on every previous one. Though such a model\nis expressive enough to correctly represent the data given in\ufb01nite capacity and training time, it ignores\nsubstantial available information from the forward model, inevitably leading to reduced performance\nfor \ufb01nite training budgets and/or network capacities.\nIn this paper, we develop a tractable framework to remedy these de\ufb01ciencies: the Natural Minimal\nI-map generator (NaMI). Given an arbitrary BN structure, NaMI can be used to construct an inverse\nBN structure that is provably both faithful and minimal. It is faithful in that it contains suf\ufb01cient edges\nto avoid encoding conditional independencies absent from the model. It is minimal in that it does not\ncontain any unnecessary edges; i.e., removing any edge would result in an unfaithful structure.\nNaMI chie\ufb02y draws upon variable elimination (Koller & Friedman, 2009, Ch 9,10), a well-known\nalgorithm from the graphical model literature for performing exact inference on discrete factor\ngraphs. The key idea in the operation of NaMI is to simulate variable elimination steps as a tool\nfor successively determining a minimal, faithful, and natural inverse structure, which can then be\nused to parametrize an inference network. 
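To make the contrast concrete, the edge-reversal heuristic discussed above (invert the edges, then additionally connect the parents of each node's children) can be sketched in a few lines. This is an illustrative sketch only: the `heuristic_inverse` function and the node-to-parents dict representation are our own conventions, not part of any cited implementation.

```python
def heuristic_inverse(parents, observed):
    """Stuhlmueller-style heuristic inverse structure (a sketch, not NaMI):
    each latent's parents in the inference network are its children in the
    generative model G plus the co-parents of those children (both subsets
    of its Markov blanket)."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    inverse = {}
    for v in parents:
        if v in observed:
            continue
        pa = set(children[v])              # reversed edges
        for c in children[v]:
            pa |= parents[c] - {v}         # connect parents of children
        inverse[v] = pa
    return inverse

# Branching model of Figure 1a: A -> B -> D and A -> C -> E; D, E observed.
G = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B"}, "E": {"C"}}
inv = heuristic_inverse(G, observed={"D", "E"})
# inv["B"] == {"D"}: B is conditioned only on its own branch, so the
# cross-branch edges of the faithful inverse (Figure 1c) are missing.
```

Running this on the branching model makes the failure mode visible: no edge ever crosses between the two branches, which is exactly the unfaithfulness that NaMI repairs.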
NaMI further draws on ideas such as the min-fill heuristic (Fishelson & Geiger, 2004) to choose the ordering in which variable elimination is simulated, which in turn influences the structure of the generated inverse.

Figure 1: (a) Generative model BN; (b) Inverse BN by Stuhlmüller's Algorithm; (c) Faithful inverse BN by our algorithm.

To summarize, our key contributions are:

i) framing generative model learning through amortized variational inference as a graphical model inversion problem, and

ii) using the simulation of exact inference algorithms to construct an algorithm for generating provably minimally faithful inverses.

Our work thus highlights the importance of constructing both minimal and faithful inverses, while providing the first approach to produce inverses satisfying these properties.

2 Method

Our algorithm builds upon the tools of probabilistic graphical models—a summary for unfamiliar readers is given in Appendix A.

2.1 General idea

Amortized inference algorithms make use of inference networks that approximate the posterior. To be able to represent the posterior accurately, the distribution of the inference network should not encode independence assertions that are absent from the generative model. 
An inference network that did encode additional independencies could not represent the true posterior, even in the non-parametric limit, with neural network factors whose capacity approaches infinity.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Let us define a stochastic inverse for a generative model p(x|z)p(z) that factors according to a BN structure G to be a factorization of q(z|x)q(x) over H (Stuhlmüller et al., 2013; Paige & Wood, 2016). The q(z|x) part of the stochastic inverse will define the factorization, or rather, coarse-grain structure, of the inference network. Recall from §1 that this involved two characteristics. We first require H to be an I-map for G:

Definition 1. Let G and H be two BN structures. Denote the set of all conditional independence assertions made by a graph, K, as I(K). We say H is an I-map for G if I(H) ⊆ I(G).

To be an I-map for G, H may not encode all the independencies that G does, but it must not mislead us by encoding independencies not present in G. We term such inverses as being faithful. While the aforementioned heuristic methods do not in general produce faithful inverses, using either a fully-connected inverse, or our method, does.

Second, since a fully-connected graph encodes no conditional independencies and is therefore suboptimal, we require in addition that H be a minimal I-map for G:

Definition 2. 
A graph K is a minimal I-map for a set of independencies I if it is an I-map for I and\nif removal of even a single edge from K renders it not an I-map.\nWe call such inverses minimally faithful, which roughly means that the inverse is a local optimum in\nthe number of true independence assertions it encodes.\nThere will be many minimally faithful inverses for G, each with a varying number of edges. Our\nalgorithm produces a natural inverse in the sense that it either inverts the order of the random choices\nfrom that of the generative model (when it is run in the topological mode), or it preserves the ordering\nof the random choices (when it is run in reverse topological mode):\nDe\ufb01nition 3. A stochastic inverse H for G over variables X is a natural\ninverse if either, for all X \u2208 X there are no edges in H from X to its\ndescendants in G, or, for all X \u2208 X there are no edges in H from X to\nits ancestors in G.\nEssentially, a natural inverse is one for which if we were to perform\nancestral sampling, the variables would be sampled in either a topological\nor reverse-topological ordering, relative to the original model. Consider\nthe inverse networks of G shown in Figure 2. H1 is not a natural inverse\nof G, since there is both an edge A \u2192 C from a parent to a child, and\nan edge C \u2192 B from a child to a parent, relative to G. However, H2 and\nH3 are natural, as they correspond respectively to the reverse-topological\nand topological orderings C, B, A and B, A, C.\nMost heuristic methods, including those of (Stuhlm\u00fcller et al., 2013;\nPaige & Wood, 2016), produce (unfaithful) natural inverses that invert\nthe order of the random choices, giving a reverse-topological ordering.\n2.2 Obtaining a natural minimally faithful inverse\nWe now present NaMI\u2019s graph inversion procedure that given an arbitrary BN structure, G, produces\na natural minimal I-map, H. We illustrate the procedure step-by-step on the example given in Figure\n3. 
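The walkthrough below begins by moralizing G (STEP 0): edge directions are dropped and any two variables that share a child are joined. As a sketch, with the BN represented as a hypothetical node-to-parents dict:

```python
def moralize(parents):
    """Undirected moral graph of a BN: drop edge directions and 'marry'
    any two nodes that share a child. Returns an adjacency dict."""
    adj = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)          # undirected copy of each edge
        for p in ps:
            adj[p] |= ps - {p}     # marry co-parents of v
    return adj

# The example BN of Figure 3 (latents D, I, G, S, L; observed H, J).
G3 = {"D": set(), "I": set(), "G": {"D", "I"}, "S": {"I"},
      "L": {"G"}, "H": {"G", "J"}, "J": {"L", "S"}}
moral = moralize(G3)
# Adds exactly the moralizing edges named in STEP 0: D-I, S-L and G-J.
```

The moral graph is the starting point for the induced graph J that NaMI updates as it simulates elimination steps.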
Here H and J are observed, as indicated by the shaded nodes. Thus, our latent variables are Z = {D, I, G, S, L}, our data is X = {H, J}, and a factorization for p(z | x) is desired.

Figure 2: Illustrating definition of naturalness.

Figure 3: Example BN

The NaMI graph-inversion algorithm is traced in Table 1. Each step incrementally constructs two graphs: an induced graph J and a stochastic inverse H. The induced graph is an undirected graph whose maximally connected subgraphs, or cliques, correspond to the scopes of the intermediate factors produced by simulating variable elimination. The stochastic inverse represents our eventual target, which encodes the inverse dependency structure. It is constructed using information from the partially-constructed induced graph. Specifically, NaMI goes through the following steps for this example.

Table 1: Tracing the NaMI algorithm on the example from Figure 3. S is the set of "frontier" variables that are considered for elimination, v ∈ S the variable eliminated at each step chosen by the greedy min-fill heuristic, J the partially constructed induced graph after each step with black nodes indicating eliminated variables, and H the partially constructed stochastic inverse.

STEP | S    | v
-----|------|---
0    | ∅    | ∅
1    | D, I | D
2    | I    | I
3    | G, S | S
4    | G    | G
5    | L    | L

(The J and H columns of Table 1 are graph drawings, omitted here.)

STEP 0: The partial induced graph and stochastic inverse are initialized. The initial induced graph is formed by taking the directed graph for the forward model, G, removing the directionality of the edges, and adding additional edges between variables that share a child in G—in this example, edges D−I, S−L and G−J. This process is known as moralization. The stochastic inverse begins as disconnected variables, and edges are added to it at each step.

STEP 1: The frontier set of variables to consider for elimination, S, is initialized to the latent variables having no latent parents in G, that is, D, I. To choose which variable to eliminate first, we apply the greedy min-fill heuristic, which is to choose the (possibly non-unique) variable that adds the fewest edges to the induced graph J, in order to produce as compact an inverse as possible under the topological ordering. Specifically, noting that the cliques of J correspond to the scopes of intermediate factors during variable elimination, we want to avoid producing intermediate factors which would require us to add additional edges to J, as doing so will in turn induce additional edges in H at future steps. For this example, if we were to eliminate D, that would produce an intermediate factor, ψD(D, I, G), while if we were to eliminate I, that would produce an intermediate factor, ψI(I, D, G, S). Choosing to eliminate I would thus require adding an edge G–S to the induced graph, as there is no clique I, D, G, S in the current state of J. 
Conversely, eliminating D does not require adding extra edges to J, and so we choose to eliminate D.

The elimination of D is simulated by marking its node in J. The parents of D in the inverse H are set to be its unmarked neighbours in J, that is, I and G. D is then removed from the frontier, and any non-observed children of D in G whose parents have all been marked are added to it—in this case, there are none, as the only child of D, namely G, still has an unmarked parent I.

STEP 2: Variable I is the sole member of the frontier and is chosen for elimination. The elimination of I is simulated by marking its node in J and adding the additional edge G–S. This is required because elimination of I requires the addition of a factor, ψI(I, G, S), that is not currently present in J. The parents of I in the inverse H are set to be its unmarked neighbours in J, G and S. I is then removed from the frontier. Now, G and S are children of I, and both their parents, D and I, have been marked. Therefore, they are added to the frontier.

STEPS 3–5: The process is continued until the end of the fifth step, when all the latent variables, D, I, S, G, L, have been eliminated and the frontier is empty. At this point, H represents a factorization of p(z | x), and we stop here, as only a factorization for the posterior is required for amortized inference. Note, however, that it is possible to continue simulating steps of variable elimination on the observed variables to complete the factorization as p(z | x)p(x).

An important point to note is that NaMI's graph inversion can be run in one of two modes. The "topological mode," which we previously implicitly considered, simulates variable elimination in a topological ordering, producing an inverse that reverses the order of the random choices from the generative model. 
Conversely, NaMI's graph inversion can also be run in "reverse topological mode," which simulates variable elimination in a reverse topological ordering, producing an inverse that preserves the order of random choices in the generative model. We will refer to these approaches as forward-NaMI and reverse-NaMI respectively in the rest of the paper. The rationale for these two modes is that, though they both produce minimally faithful inverses, one may be substantially more compact than the other, remembering that minimality only ensures a local optimum. For an arbitrary graph, it cannot be said in advance which ordering will produce the more compact inverse. However, as the cost of running the inversion algorithm is low, it is generally feasible to try both and pick the one producing the better solution.

Algorithm 1 NaMI Graph Inversion
1: Input: BN structure G, latent variables Z, TOPMODE?
2: J ← MORALIZE(G)
3: Set all vertices of J to be unmarked
4: H ← {VARIABLES(G), ∅}, i.e. unconnected graph
5: UPSTREAM ← "parent" if TOPMODE? else "child"
6: DOWNSTREAM ← "child" if TOPMODE? else "parent"
7: S ← all latent variables without UPSTREAM latents in G
8: while S ≠ ∅ do
9:    Select v ∈ S according to min-fill criterion
10:   Add edges in J between unmarked neighbours of v
11:   Make unmarked neighbours of v in J v's parents in H
12:   Mark v and remove from S
13:   for unmarked latents u DOWNSTREAM of v in G do
14:      Add u to S if all its UPSTREAM latents in G are marked
15:   end for
16: end while
17: return H

The general NaMI graph-reversal procedure is given in Algorithm 1. It is further backed up by the following formal demonstration of correctness, the proof for which is given in Appendix F.

Theorem 1. 
The Natural Minimal I-Map Generator of Algorithm 1 produces inverse factorizations that are natural and minimally faithful.

We further note that NaMI's graph reversal has a running time of order O(nc), where n is the number of latent variables in the graph and c ≪ n is the size of the largest clique in the induced graph. We consequently see that it can be run cheaply for practical problems: the computational cost of generating the inverse is generally dominated by that of training the resulting inference network itself. See Appendix F for more details.

2.3 Using the faithful inverse

Once we have obtained the faithful inverse structure H, the next step is to use it to learn an inference network, qψ(z | x). For this, we use the factorization given by H. Let τ denote the reverse of the order in which variables were selected for elimination by Line 9 in Algorithm 1, such that τ is a permutation of 1, . . . , n and τ(n) is the first variable eliminated. H encodes the factorization

qψ(z | x) = ∏_{i=1}^{n} qi(zτ(i) | PaH(zτ(i)))    (1)

where PaH(zτ(i)) ⊆ {x, zτ(1), . . . , zτ(i−1)} indicates the parents of zτ(i) in H. For each factor qi, we must decide both the class of distributions for zτ(i) | PaH(zτ(i)), and how the parameters for that class are calculated. Once learned, we can both sample from, and evaluate the density of, the inference network for a given dataset by considering each factor in turn.

The most natural choice for the class of distributions for each factor is to use the same distribution family as the corresponding variable in the generative model, such that the supports of these distributions match. For instance, continuing the example from Figure 3, if D ∼ N(0, 1) in the generative model, then a normal distribution would also be used for D | I, G in the inference network. 
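Once a family has been chosen for every factor, sampling from the inference network and scoring its density both proceed factor by factor, as in Eq. (1). A minimal sketch, in which the `Normal` class and the hand-coded parameter maps are hypothetical stand-ins for the learned neural networks:

```python
import math
import random

class Normal:
    """Minimal univariate normal, standing in for a learned factor."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma
    def sample(self):
        return random.gauss(self.mu, self.sigma)
    def log_prob(self, value):
        return (-0.5 * ((value - self.mu) / self.sigma) ** 2
                - math.log(self.sigma) - 0.5 * math.log(2 * math.pi))

def sample_inverse(factors, order, x):
    """Draw z ~ q(z | x) as in Eq. (1): visit variables in the order
    tau(1), ..., tau(n); each factor conditions only on the data and on
    previously sampled variables (its parents in H)."""
    z, log_q = {}, 0.0
    for v in order:
        dist = factors[v](z, x)      # parameters computed from parents
        z[v] = dist.sample()
        log_q += dist.log_prob(z[v])
    return z, log_q

# Hypothetical two-factor inverse q(b | x) q(a | b, x).
factors = {"b": lambda z, x: Normal(x, 1.0),
           "a": lambda z, x: Normal(0.5 * z["b"], 1.0)}
z, log_q = sample_inverse(factors, order=["b", "a"], x=2.0)
```

Density evaluation for a given z works the same way, accumulating each factor's `log_prob` without sampling.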
To establish the mapping from data to the parameters of this distribution, we train neural networks using stochastic gradient ascent methods. For instance, we could set D | {I = i, G = g} ∼ N(µφ(i, g), σφ(i, g)), where µφ and σφ are two densely connected feedforward networks, with learnable parameters φ. In general, it will be important to choose architectures which match the problem at hand well. For example, when perceptual inputs such as images and language are present in the conditioning variables, it is advantageous to first embed them to a lower-dimensional representation using, for example, convolutional neural networks.

Matching the distribution families in the inference network and generative model, whilst a simple and often adequate approximation, can be suboptimal. For example, suppose that for a normally distributed variable in the generative model, the true conditional distribution in the posterior for that variable is multimodal. In this case, using a (single mode) normal factor in the inference network would not suffice. One could instead straightforwardly use, for example, either a mixture of Gaussians or normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016) to parametrize each inference network factor in order to improve expressivity, at the cost of additional implementational complexity.

Figure 4: Results for the relaxed Bernoulli VAE with 30 latent units, compared after 1000 epochs of learning: (a) negative ELBO, and (b) negative AIS estimates, varying inference network factorizations and capacities (total number of parameters); (c) an estimate of the variational gap, that is, the difference between marginal log-likelihood and the ELBO.

In particular, if one were to use a provably universal density estimator to parametrize each inference network factor, such as that introduced in Huang et al. (2018), the resulting NaMI inverse would constitute a universal density estimator of the true posterior.

After the inference network has been parametrized, it can be trained in a number of different ways, depending on the final use case of the network. For example, in the context of amortized stochastic variational inference (SVI) methods such as VAEs (Kingma & Welling, 2014; Rezende et al., 2014), the model pθ(x, z) is learned along with the inference network qψ(z | x) by optimizing a lower bound on the marginal log-likelihood of the data, LELBO = Eqψ(z|x) [ln pθ(x, z) − ln qψ(z | x)]. Stochastic gradient ascent can then be used to optimize LELBO in the same way as in a standard VAE, simulating from qψ(z|x) by considering each factor in turn and using reparameterization (Kingma & Welling, 2014) when the individual factors permit doing so.

A distinct training approach is available when the model p(x, z) is fixed (Papamakarios & Murray, 2015). Here a proposal is learnt for either importance sampling (Le et al., 2017) or sequential Monte Carlo (Paige & Wood, 2016) by using stochastic gradient ascent to minimize the reverse KL-divergence between the inference network qψ(z | x) and the true posterior p(z | x). Up to a constant, the objective is given by LIC = Ep(x,z) [− ln qψ(z | x)].

Using a minimally faithful inverse structure typically improves both the best inference network attainable and the finite-time training performance in both these settings, compared with previous naive approaches. 
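As single-sample Monte Carlo estimators, the two objectives above can be sketched as follows; `log_joint`, `sample_q`, `sample_p`, and `log_q` are hypothetical stand-ins for the parametrized model and inference networks:

```python
def elbo_estimate(log_joint, sample_q):
    """One-sample estimate of L_ELBO = E_q[ln p(x, z) - ln q(z | x)].
    `sample_q` draws z ~ q(. | x) and returns (z, ln q(z | x))."""
    z, log_q_z = sample_q()
    return log_joint(z) - log_q_z

def compiled_inference_loss(log_q, sample_p):
    """One-sample estimate of L_IC = E_{p(x,z)}[-ln q(z | x)] for a
    fixed model: draw (x, z) from the model and score z under q."""
    x, z = sample_p()
    return -log_q(z, x)

# Toy check with closed-form stand-ins (no learning involved):
elbo = elbo_estimate(log_joint=lambda z: -z * z,
                     sample_q=lambda: (1.0, -0.5))          # -1.0 - (-0.5) = -0.5
loss = compiled_inference_loss(log_q=lambda z, x: -2.0,
                               sample_p=lambda: (0.0, 1.0))  # -(-2.0) = 2.0
```

In practice these one-sample estimates are averaged over a minibatch and differentiated with respect to the network parameters.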
In the VAE setting, this can further have a knock-on effect on the quality of the learned\nmodel p\u03b8(x, z), both because a better inference network will give lower variance updates of the\ngenerative network (Rainforth et al., 2018) and because restrictions in the expressiveness of the\ninference network lead to similar restrictions in the generative network (Cremer et al., 2017, 2018).\nIn deep generative models, the BNs may be much larger than the examples shown here. However,\ntypically at the macro-level, where we collapse each vector to a single node, they are quite simple.\nWhen we invert this type of collapsed graph, we must do so with the understanding that the distribution\nover a vector-valued node in the inverse must express dependencies between all its elements in order\nfor the inference network to be faithful.\n3 Experiments\nWe now consider the empirical impact of using NaMI compared with previous approaches. In \u00a73.1,\nwe highlight the importance of using a faithful inverse in the VAE context, demonstrating that doing\nso results in a tighter variational bound and a higher log-likelihood. In \u00a73.2, we use NaMI in the\n\ufb01xed-model setting. Here our results demonstrate the importance of using both a faithful and minimal\ninverse on the ef\ufb01ciency of the learned inference network. Low-level details on the experimental\nsetups can be found in Appendix D and an implementation at https://git.io/fxVQu.\n3.1 Relaxed Bernoulli VAEs\nPrior work has shown that more expressive inference networks give an improvement in amortized\nSVI on sigmoid belief networks and standard VAEs, relative to using the mean-\ufb01eld approximation\n(Uria et al., 2016; Maal\u00f8e et al., 2016; Rezende & Mohamed, 2015; Kingma et al., 2016). Krishnan\net al. (2017) report similar results when using more expressive inverses in deep linear-chain state-\nspace models. 
It is straightforward to see that any minimally faithful inverse for the standard VAE framework (Kingma & Welling, 2014) has a fully connected clique over the latent variables, so that the inference network can take account of the explaining-away effects between the latent variables in the generative model. As such, both forward-NaMI and reverse-NaMI produce the same inverse.

Figure 5: (a) BN structure for a binary tree with d = 3; (b) Stuhlmüller's heuristic inverse; (c) Natural minimally faithful inverse produced by NaMI in topological mode; (d) Most compact inverse when d > 3, given by running NaMI in reverse topological mode; (e) Fully connected inverse.

Figure 6: Results for binary tree Gaussian BNs with depth d = 5, comparing inference network factorizations in the compiled inference setting. Estimates of the KL divergence from the analytical posterior to the inference network on the training and test sets are shown in (a) and (b) respectively. (c) shows the average negative log-likelihood of inference network samples under the analytical posterior, conditioning on five held-out data sets. The results are averaged over 10 runs and 0.75 standard deviations indicated. The drop at 100 epochs is due to decimating the learning rate.

The relaxed Bernoulli VAE (Maddison et al., 2017b; Jang et al., 2017) is a VAE variation that replaces both the prior on the latents and the distribution over the latents given the observations with the relaxed Bernoulli distribution (also known as the Concrete distribution). It can also be understood as
It can also be understood as\na \u201cdeep\u201d continuous relaxation of sigmoid belief networks.\nWe learn a relaxed Bernoulli VAE with 30 latent variables on MNIST, comparing a faithful inference\nnetwork (parametrized with MADE (Germain et al., 2015)) to the mean-\ufb01eld approximation, after\n1000 epochs of learning for ten different sizes of inference network, keeping the size of the generative\nnetwork \ufb01xed. We note that the mean-\ufb01eld inference network has the same structure as the heuristic\none that reverses the edges from the generative model. A tight bound on the marginal likelihood is\nestimated with annealed importance sampling (AIS) (Neal, 1998; Wu et al., 2017).\nThe results shown in Figure 4 indicate that using a faithful inverse on this model produces a signi\ufb01cant\nimprovement in learning over the mean-\ufb01eld inverse. Note that the x-axis indicates the number of\nparameters in the inference network. We observe that for every capacity level, the faithful inference\nnetwork has a lower negative ELBO and AIS estimate than that of the mean-\ufb01eld inference network.\nIn Figure 4c, the variational gap is observed to decrease (or rather, the variational bound tightens) for\nthe faithful inverse as its capacity is increased, whereas it increases for the mean-\ufb01eld inverse. This\nexample illustrates the inadequacy of the mean-\ufb01eld approximation in certain classes of models, in\nthat it can result in signi\ufb01cantly underutilizing the capacity of the model.\n3.2 Binary-tree Gaussian BNs\nGaussian BNs are a class of models in which the conditional distribution of each variable is normally\ndistributed, with a \ufb01xed variance and a mean that is a \ufb01xed linear combination of its parents plus an\noffset. We consider here Gaussian BNs with a binary-tree structured graph and observed leaves (see\nFigure 5a for the case of depth, d = 3). 
In this class of models, the exact posterior can be calculated analytically (Koller & Friedman, 2009, §7.2), and so it forms a convenient test-bed for performance. The heuristic inverses simply invert the edges of the graph (Figure 5b), whereas a natural minimally faithful inverse requires extra edges between subtrees (e.g. Figure 5c) to account for the influence one node can have on others through its parent. For this problem, it turns out that running reverse-NaMI (Figure 5d) produces a more compact inverse than forward-NaMI. This, in fact, turns out to be the most compact possible I-map for any d > 3.

Figure 7: Convergence of reverse KL divergence (used as the training objective) for a Bayesian GMM with K = 3 clusters and N = 200 data points, comparing inference networks with a fixed generative model. The shaded regions indicate 1 standard error in the estimation.
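Since a linear-Gaussian BN defines a joint multivariate normal over all of its nodes, the analytical posterior referred to above is just Gaussian conditioning on the observed leaves. A minimal numpy sketch for a depth-2 tree (the edge weight 0.8 and unit conditional variances are illustrative, not the paper's settings):

```python
import numpy as np

# With x = W x + eps, eps ~ N(0, S), we get x ~ N(0, A S A^T) for
# A = (I - W)^{-1}; conditioning that joint Gaussian on the observed
# leaves yields the exact posterior over the latents.
# Toy tree: node 0 is the latent root, nodes 1-2 are observed leaves
# with conditional mean 0.8 * x0.
W = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.8, 0.0, 0.0]])
S = np.eye(3)                      # unit conditional variances
A = np.linalg.inv(np.eye(3) - W)
Sigma = A @ S @ A.T                # joint covariance; the joint mean is zero

# Standard Gaussian conditioning: posterior over x0 given (x1, x2) = y.
lat, obs = [0], [1, 2]
y = np.array([1.0, 2.0])
K = Sigma[np.ix_(lat, obs)] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
post_mean = K @ y                                  # posterior mean of x0
post_cov = Sigma[np.ix_(lat, lat)] - K @ Sigma[np.ix_(obs, lat)]
```

The same computation scales to any depth; for larger trees one would exploit the sparse tree structure (message passing) rather than dense matrix inversion.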
Nonetheless, all three inversion methods have significantly fewer edges than the fully connected inverse (Figure 5e).

The model is fixed and the inference network is learnt from samples from the generative model, minimizing the "reverse" KL divergence, namely that from the posterior to the inference network, KL(pθ(z|x)||qψ(z|x)), as per Paige & Wood (2016). We compared learning across the inverses produced by using Stuhlmüller's heuristic, forward-NaMI, reverse-NaMI, and taking the fully connected inverse. The fully connected inference network was parametrized using MADE (Germain et al., 2015), and the forward-NaMI one with a novel MADE variant that modifies the masking matrix to exactly capture the tree-structured dependencies (see Appendix E.2). As the same MADE approaches cannot be used for the heuristic and reverse-NaMI inference networks, these were instead parametrized with a separate neural network for each variable's density function. The inference network sizes were kept constant across approaches.

Results are given in Figure 6 for depth d = 5, averaged over 10 runs. Figures 6a and 6b show an estimate of KL(pθ(z|x)||qψ(z|x)) using the training and test sets respectively. From this, we observe that it is necessary to model at least the edges in an I-map for the inference network to be able to recover the posterior, and that convergence is faster with fewer edges in the inference network. Despite the more compact reverse-NaMI inverse converging faster than the forward-NaMI one, the latter seems to converge to a better final solution.
This may be because the MADE approach could not be used for the reverse-NaMI inverse; we leave this as a subject for future investigation.

Figure 6c shows the average negative log-likelihood of 200 samples from the inference networks evaluated under the analytical posterior, conditioning on five fixed datasets sampled from the generative model that were not seen during learning. It is thus a measure of how successful inference amortization has been. All three faithful inference networks have significantly lower variance over runs than the unfaithful inference network produced by Stuhlmüller's algorithm.

We also observed during other experimentation that if one decreases the capacity of all methods, learning remains stable for the natural minimally faithful inverse at a threshold where it becomes unstable for the fully connected inverse and for Stuhlmüller's inverse.

3.3 Gaussian Mixture Models

Gaussian mixture models (GMMs) are a clustering model in which each datum in x = {x1, x2, . . . , xN} is assumed to have been generated from one of K clusters, each of which has a Gaussian distribution with parameters {μj, Σj}, j = 1, 2, . . . , K. Each datum xi is associated with a corresponding index zi ∈ {1, . . . , K} that gives the identity of that datum's cluster. The indices z′ = {zi} are drawn i.i.d. from a categorical distribution with parameter φ. Prior distributions are placed on θ = {μ1, Σ1, . . . , μK, ΣK} and φ, so that the latent variables are z = {z′, θ, φ}. The goal of inference is then to determine the posterior p(z | x), or some statistic of it.

As per the previous experiment, this falls into the fixed-model setting.
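Training data for the amortization artifact in this setting comes from ancestral sampling of the generative model. A minimal one-dimensional sketch (the Dirichlet concentration, prior scale, and fixed unit cluster variances are simplifying assumptions; the model above places priors on full {μj, Σj}):

```python
import numpy as np

def sample_gmm(n, k, rng):
    """Ancestral sample from a toy Bayesian GMM: mixture weights phi from a
    Dirichlet prior, 1-D cluster means from a Gaussian prior, then cluster
    indices z_i and observations x_i. Hyperparameters are illustrative.
    """
    phi = rng.dirichlet(np.ones(k))        # mixture weights, sum to 1
    mu = rng.normal(0.0, 5.0, size=k)      # 1-D cluster means
    sigma = np.ones(k)                     # fixed unit cluster scales
    z = rng.choice(k, size=n, p=phi)       # cluster indices
    x = rng.normal(mu[z], sigma[z])        # observations
    return phi, mu, z, x

phi, mu, z, x = sample_gmm(n=200, k=3, rng=np.random.default_rng(0))
# Each draw yields one dataset x together with its latents (phi, mu, z),
# i.e. one (input, target) pair for training the inference network.
```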
We factor the fully-connected inverse as q(θ|x) q(φ|θ, x) q(z′|φ, θ, x). It turns out that applying reverse-NaMI decouples the dependence between the indices, z′, and produces a much more compact factorization, q(θ|x, φ) ∏_{i=1}^{N} q(zi|xi, φ, θ) q(φ|x), than either the fully-connected or forward-NaMI inverses for this model. The inverse structure produced by Stuhlmüller's heuristic algorithm is very similar to the reverse-NaMI structure for this problem and is omitted.

We train our amortization artifact over datasets with N = 200 samples and K = 3 clusters. The inference network terms with distributions over vectors were parametrized by MADE, and we compare the results for the fully-connected and reverse-NaMI inverses. We hold the neural network capacities constant across methods and average over 10 runs, the results for which are shown in Figure 7. We see that learning is faster for the minimally faithful reverse-NaMI method, relative to the fully-connected inverse, and converges to a better solution, in agreement with the other experiments.

3.4 Minimal and Non-minimal Faithful Inverses

To further examine the hypothesis that a non-minimal faithful inverse has slower learning and converges to a worse solution relative to a minimal one, we performed the setup of Experiment 3.2 with depth d = 4, comparing the forward-NaMI network to two additional networks that added 12 and 16 connections to forward-NaMI (holding the total capacity fixed). The additional edges are shown in Figure 8.

Figure 8: Additional edges over forward-NaMI: (a) 12 skip edges; (b) 16 skip edges.
Note that the regular forward-NaMI edges are omitted from Figure 8 for visual clarity.

Figure 9 shows the average negative log-likelihood (NLL) under the true posterior of samples generated by the inference network, based on 5 datasets not seen during training. It appears that the more edges are added beyond minimality, the slower the initial learning and the worse the solution converged to.

To further explain why minimality is crucial, we note that adding edges beyond minimality means that there will be factors that condition on variables whose probabilistic influence is blocked by the other variables. This effectively adds an input of random noise to these factors, which is why we then see slower learning and convergence to a worse solution.

4 Discussion

We have presented NaMI, a tractable framework that, given the BN structure for a generative model, produces a natural factorization for its inverse that is a minimal I-map for the model. We have argued that this should be used to guide the design of the coarse-grain structure of the inference network in amortized inference. Having empirically analyzed the implications of using NaMI, we find that it learns better inference networks than previous heuristic approaches. We further found that, in the context of VAEs, improved inference networks have a knock-on effect, improving the learned generative networks as well.

Our framework opens new possibilities for learning structured deep generative models that combine traditional Bayesian modeling via probabilistic graphical models with deep neural networks.
This allows us to leverage our typically strong knowledge of which variables affect which others, while not overly relying on our weak knowledge of the exact functional form these relationships take. To see this, note that if we forgo the convenience of mean-field assumptions, we can impose arbitrary structure on a generative model simply by controlling its parameterization. The only requirement on the generative network for evaluating the ELBO is that we can evaluate the network density at a given input. Recent advances in normalizing flows (Huang et al., 2018; Chen et al., 2018) mean it is possible to construct flexible and general-purpose distributions that satisfy this requirement and are amenable to the application of dependency constraints from our graphical model. This obviates the need to make assumptions such as conjugacy, as done by, for example, Johnson et al. (2016). NaMI provides a critical component of such a framework, as it allows one to ensure that the inference network respects the structural assumptions imposed on the generative network, without which a tight variational bound cannot be achieved.

Figure 9: Average NLL of inference network samples under the analytical posterior, comparing forward-NaMI to the skip (+12 edges) and skip (+16 edges) variants.

Acknowledgments

We would like to thank (in alphabetical order) Rob Cornish, Rahul Krishnan, Brooks Paige, and Hongseok Yang for their thoughtful help and suggestions.

SW and AG gratefully acknowledge support from the EPSRC AIMS CDT through grant EP/L015987/2. RZ acknowledges support under DARPA D3M, under Cooperative Agreement FA8750-17-2-0093. NS was supported by EPSRC/MURI grant EP/N019474/1. TR and YWT are supported in part by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 617071.
TR further acknowledges the support of the ERC StG IDIU. FW was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1, DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006, an Intel Big Data Center grant, and DARPA D3M, under Cooperative Agreement FA8750-17-2-0093.

References

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. International Conference on Learning Representations, 2016.

Chen, Tian Qi, Rubanova, Yulia, Bettencourt, Jesse, and Duvenaud, David. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018.

Cremer, Chris, Morris, Quaid, and Duvenaud, David. Reinterpreting importance-weighted autoencoders. International Conference on Learning Representations Workshop Track, 2017.

Cremer, Chris, Li, Xuechen, and Duvenaud, David. Inference suboptimality in variational autoencoders. Proceedings of the International Conference on Machine Learning, 2018.

Fishelson, Maáyan and Geiger, Dan. Optimizing exact genetic linkage computations. Journal of Computational Biology, 11(2-3):263–275, 2004.

Gan, Zhe, Li, Chunyuan, Henao, Ricardo, Carlson, David E, and Carin, Lawrence. Deep temporal sigmoid belief networks for sequence modeling. Advances in Neural Information Processing Systems, 2015.

Germain, Mathieu, Gregor, Karol, Murray, Iain, and Larochelle, Hugo. MADE: Masked autoencoder for distribution estimation. Proceedings of the International Conference on Machine Learning, 2015.

Gershman, Samuel J and Goodman, Noah D. Amortized inference in probabilistic reasoning. In Proceedings of the Annual Conference of the Cognitive Science Society, 2014.

Huang, Chin-Wei, Krueger, David, Lacoste, Alexandre, and Courville, Aaron. Neural autoregressive flows. Proceedings of the International Conference on Machine Learning, 2018.

Jang, Eric, Gu, Shixiang, and Poole, Ben.
Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations, 2017.

Johnson, Matthew J, Duvenaud, David, Wiltschko, Alexander B, Datta, Sandeep R, and Adams, Ryan P. Composing graphical models with neural networks for structured representations and fast inference. arXiv preprint arXiv:1603.06277v2 [stat.ML], 2016.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Kingma, Diederik P, Salimans, Tim, and Welling, Max. Improving variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 2016.

Koller, Daphne and Friedman, Nir. Probabilistic Graphical Models. MIT Press, 2009. ISBN 9780262013192.

Krishnan, Rahul G, Shalit, Uri, and Sontag, David. Structured inference networks for nonlinear state space models. Proceedings of the National Conference on Artificial Intelligence (AAAI), 2017.

Le, Tuan Anh, Baydin, Atilim Gunes, and Wood, Frank. Inference compilation and universal probabilistic programming. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2017.

Le, Tuan Anh, Igl, Maximilian, Jin, Tom, Rainforth, Tom, and Wood, Frank. Auto-encoding Sequential Monte Carlo. In International Conference on Learning Representations, 2018.

Maaløe, Lars, Sønderby, Casper Kaae, Sønderby, Søren Kaae, and Winther, Ole. Auxiliary deep generative models. In Proceedings of the International Conference on Machine Learning, 2016.

Maddison, Chris J, Lawson, John, Tucker, George, Heess, Nicolas, Norouzi, Mohammad, Mnih, Andriy, Doucet, Arnaud, and Teh, Yee. Filtering variational objectives. In Advances in Neural Information Processing Systems, 2017a.

Maddison, Chris J, Mnih, Andriy, and Teh, Yee Whye. The Concrete distribution: A continuous relaxation of discrete random variables.
In International Conference on Learning Representations, 2017b.

Naesseth, Christian A, Linderman, Scott W, Ranganath, Rajesh, and Blei, David M. Variational Sequential Monte Carlo. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2018.

Neal, Radford M. Learning stochastic feedforward networks. Department of Computer Science, University of Toronto, 1990.

Neal, Radford M. Annealed importance sampling (Technical Report 9805, revised). Department of Statistics, University of Toronto, 1998.

Paige, Brooks and Wood, Frank. Inference networks for Sequential Monte Carlo in graphical models. In Proceedings of the International Conference on Machine Learning, 2016.

Papamakarios, George and Murray, Iain. Distilling intractable generative models. In Probabilistic Integration Workshop at Neural Information Processing Systems, 2015.

Rainforth, Tom, Kosiorek, Adam R, Le, Tuan Anh, Maddison, Chris J, Igl, Maximilian, Wood, Frank, and Teh, Yee Whye. Tighter variational bounds are not necessarily better. Proceedings of the International Conference on Machine Learning, 2018.

Ranganath, Rajesh, Tang, Linpeng, Charlin, Laurent, and Blei, David M. Deep exponential families. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2015.

Rezende, Danilo and Mohamed, Shakir. Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, 2014.

Ritchie, Daniel, Horsfall, Paul, and Goodman, Noah D. Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735, 2016.

Stuhlmüller, Andreas, Taylor, Jacob, and Goodman, Noah. Learning stochastic inverses.
In Advances in Neural Information Processing Systems, 2013.

Uria, Benigno, Côté, Marc-Alexandre, Gregor, Karol, Murray, Iain, and Larochelle, Hugo. Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):1–37, 2016.

van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, 2016a.

van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016b.

Wu, Yuhuai, Burda, Yuri, Salakhutdinov, Ruslan, and Grosse, Roger. On the quantitative analysis of decoder-based generative models. In International Conference on Learning Representations, 2017.