{"title": "Graphical Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6069, "page_last": 6080, "abstract": "We propose Graphical Generative Adversarial Networks (Graphical-GAN) to model structured data. Graphical-GAN conjoins the power of Bayesian networks on compactly representing the dependency structures among random variables and that of generative adversarial networks on learning expressive dependency functions. We introduce a structured recognition model to infer the posterior distribution of latent variables given observations. We generalize the Expectation Propagation (EP) algorithm to learn the generative model and recognition model jointly. Finally, we present two important instances of Graphical-GAN, i.e. Gaussian Mixture GAN (GMGAN) and State Space GAN (SSGAN), which can successfully learn the discrete and temporal structures on visual datasets, respectively.", "full_text": "Graphical Generative Adversarial Networks\n\nChongxuan Li\u2217\n\nlicx14@mails.tsinghua.edu.cn\n\nMax Welling\u2020\n\nM.Welling@uva.nl\n\nJun Zhu\u2217\n\ndcszj@mail.tsinghua.edu.cn\n\nBo Zhang\u2217\n\ndcszb@mail.tsinghua.edu.cn\n\nAbstract\n\nWe propose Graphical Generative Adversarial Networks (Graphical-GAN) to\nmodel structured data. Graphical-GAN conjoins the power of Bayesian networks on\ncompactly representing the dependency structures among random variables and that\nof generative adversarial networks on learning expressive dependency functions.\nWe introduce a structured recognition model to infer the posterior distribution of\nlatent variables given observations. We generalize the Expectation Propagation\n(EP) algorithm to learn the generative model and recognition model jointly. Finally,\nwe present two important instances of Graphical-GAN, i.e. 
Gaussian Mixture\nGAN (GMGAN) and State Space GAN (SSGAN), which can successfully learn\nthe discrete and temporal structures on visual datasets, respectively.\n\n1 Introduction\n\nDeep implicit models [29] have shown promise in synthesizing realistic images [10, 33, 2] and\ninferring latent variables [26, 11]. However, these approaches do not explicitly model the underlying\nstructures of the data, which are common in practice (e.g., temporal structures in videos). Probabilistic\ngraphical models [18] provide principled ways to incorporate prior knowledge about the data\nstructures, but these models often lack the capability to deal with complex data like images.\nTo conjoin the benefits of both worlds, we propose a flexible generative modelling framework\ncalled Graphical Generative Adversarial Networks (Graphical-GAN). On one hand, Graphical-GAN\nemploys Bayesian networks [18] to represent the structures among variables. On the other hand,\nGraphical-GAN uses deep implicit likelihood functions [10] to model complex data.\nGraphical-GAN is sufficiently flexible to model structured data, but the inference and learning are\nchallenging due to the presence of deep implicit likelihoods and complex structures. We build a\nstructured recognition model [17] to approximate the true posterior distribution. We study two families\nof recognition models, i.e. the mean-field posteriors [14] and the inverse factorizations [39].\nWe generalize the Expectation Propagation (EP) [27] algorithm to learn the generative model and\nrecognition model jointly. Motivated by EP, we minimize a local divergence between the generative\nmodel and recognition model for each individual local factor defined by the generative model. 
The\nlocal divergences are estimated via the adversarial technique [10] to deal with the implicit likelihoods.\nGiven a speci\ufb01c scenario, the generative model is determined a priori by context or domain knowledge\nand the proposed inference and learning algorithms are applicable to arbitrary Graphical-GAN. As\ninstances, we present Gaussian Mixture GAN (GMGAN) and State Space GAN (SSGAN) to learn\nthe discrete and temporal structures on visual datasets, respectively. Empirically, these models can\n\u2217Department of Computer Science & Technology, Institute for Arti\ufb01cial Intelligence, BNRist Center, THBI\nLab, State Key Lab for Intell. Tech. & Sys., Tsinghua University. Correspondence to: J. Zhu.\n\u2020University of Amsterdam, and the Canadian Institute for Advanced Research (CIFAR).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(a) Generative models\n\n(b) Recognition models\n\nFigure 1: (a) Generative models of GMGAN (left panel) and SSGAN (right panel). (b) Recognition\nmodels of GMGAN and SSGAN. The grey and white units denote the observed and latent variables,\nrespectively. The arrows denote dependencies between variables. \u03b8 and \u03c6 denote the parameters in\nthe generative model and recognition model, respectively. We omit \u03b8 and \u03c6 in SSGAN for simplicity.\n\ninfer the latent structures and generate structured samples. 
Further, Graphical-GAN outperforms the\nbaseline models on inference, generation and reconstruction tasks consistently and substantially.\nOverall, our contributions are: (1) we propose Graphical-GAN, a general generative modelling\nframework for structured data; (2) we present two instances of Graphical-GAN to learn the discrete\nand temporal structures, respectively; and (3) we empirically evaluate Graphical-GAN on generative\nmodelling of structured data and achieve good qualitative and quantitative results.\n\n2 General Framework\n\nIn this section, we present the model de\ufb01nition, inference method and learning algorithm.\n\n2.1 Model De\ufb01nition\n\nLet X and Z denote the observable variables and latent variables, respectively. We assume\nthat we have N i.i.d. samples from a generative model with the joint distribution pG(X, Z) =\npG(Z)pG(X|Z), where G is the associated directed acyclic graph (DAG). According to the local\nstructures of G, the distribution of a single data point can be further factorized as follows:\n\n|Z|(cid:89)\n\npG(X, Z) =\n\n|X|(cid:89)\n\np(zi|paG(zi))\n\np(xj|paG(xj)),\n\n(1)\n\ni=1\n\nj=1\n\nwhere paG(x) denotes the parents of x in the associated graph G. Note that only latent variables can\nbe the parent of a latent variable and see Fig 1 (a) for an illustration. Following the factorization in\nEqn. (1), we can sample from the generative model ef\ufb01ciently via ancestral sampling.\nGiven the dependency structures, the dependency functions among the variables can be parameterized\nas deep neural networks to \ufb01t complicated data. As for the likelihood functions, we consider implicit\nprobabilistic models [29] instead of prescribed probabilistic models. Prescribed models [17] de\ufb01ne\nthe likelihood functions for X with an explicit speci\ufb01cation. In contrast, implicit models [10]\ndeterministically transform Z to X and the likelihood can be intractable. 
We focus on implicit models\nbecause they have been proven effective on image generation [33, 2] and the learning algorithms for\nimplicit models can be easily extended to prescribed models. We also directly compare with existing\nstructured prescribed models [7] in Sec. 5.1. Following the well-established literature, we refer to our\nmodel as Graphical Generative Adversarial Networks (Graphical-GAN).\nThe inference and learning of Graphical-GAN are nontrivial. On one hand, Graphical-GAN employs\ndeep implicit likelihood functions, which make the inference of the latent variables intractable\nand likelihood-based learning infeasible. On the other hand, Graphical-GAN involves\ncomplex structures, which requires the inference and learning algorithm to exploit the structural\ninformation explicitly. To address these problems, we propose structured recognition models and a\nsample-based message passing algorithm, as detailed in Sec. 2.2 and Sec. 2.3, respectively.\n\n2.2 Inference Method\n\nWe leverage recent advances in amortized inference for deep generative models [17, 9, 8, 40] to infer\nthe latent variables given the data. Basically, these approaches introduce a recognition model, which\nis a family of distributions of a simple form, to approximate the true posterior. The recognition model\nis shared by all data points and often parameterized as a deep neural network.\nThe problem is more complicated in our case because we need to further consider the graphical\nstructure during the inference procedure. Naturally, we introduce a structured recognition model with\nan associated graph H as the approximate posterior, whose distribution is formally given by:\n\nqH(Z|X) = \u220f_{i=1}^{|Z|} q(zi|paH(zi)). (2)\n\nGiven data points from the true data distribution q(X), we can obtain samples following the joint\ndistribution qH(X, Z) = q(X)qH(Z|X) efficiently via ancestral sampling. 
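To make the factorized sampling concrete, here is a minimal, hypothetical sketch (not the paper's code) of ancestral sampling over a DAG factorization as in Eqns. (1) and (2): visit variables in topological order and draw each one conditioned on its already-sampled parents. The toy graph and the sampler functions below are illustrative assumptions only.

```python
import random

def ancestral_sample(parents, samplers):
    """Draw one joint sample by visiting variables in topological order.

    parents:  dict mapping each variable name to a list of its parent names (a DAG).
    samplers: dict mapping each variable name to a function that takes a dict of
              sampled parent values and returns a value for that variable.
    """
    values = {}

    def visit(v):
        if v in values:
            return
        for p in parents[v]:      # sample all parents before the child
            visit(p)
        values[v] = samplers[v]({p: values[p] for p in parents[v]})

    for v in parents:
        visit(v)
    return values

# Toy chain k -> h -> x mimicking GMGAN's structure: a discrete mixture index k,
# a Gaussian latent h centred at mu_k, and a deterministic "generator" standing
# in for an implicit likelihood.
random.seed(0)
mu = [-5.0, 5.0]
sample = ancestral_sample(
    parents={"k": [], "h": ["k"], "x": ["h"]},
    samplers={
        "k": lambda pa: random.randrange(2),             # k ~ Cat(uniform)
        "h": lambda pa: random.gauss(mu[pa["k"]], 1.0),  # h | k ~ N(mu_k, 1)
        "x": lambda pa: 2.0 * pa["h"] + 1.0,             # x = G(h), implicit likelihood
    },
)
```

Because x is produced by a deterministic transform of h, its density is never evaluated, which is exactly what makes the likelihood intractable while sampling stays trivial.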
Considering different\ndependencies among the variables, or equivalently Hs, we study two types of recognition models:\nthe mean-field posteriors [14] and the inverse factorizations [39].\nThe mean-field assumption has been widely adopted in variational inference methods [14] because\nof its simplicity. In such methods, all of the dependency structures among the latent variables are\nignored and the approximate posterior can be factorized as follows:\n\nqH(Z|X) = \u220f_{i=1}^{|Z|} q(zi|X), (3)\n\nwhere the associated graph H has a fully factorized structure.\nThe inverse factorizations [39] approach views the original graphical model as a forward factorization\nand samples the latent variables given the observations efficiently by inverting G step by step.\nFormally, the inverse factorization is defined as follows:\n\nqH(Z|X) = \u220f_{i=1}^{|Z|} q(zi|\u2202G(zi) \u2229 z>i), (4)\n\nwhere \u2202G(zi) denotes the Markov blanket of zi on G and z>i denotes all z after zi in a certain order,\nwhich is defined from leaves to root according to the structure of G. See the formal algorithm to build\nH based on G in Appendix A.\nGiven the structure of the approximate posterior, we also parameterize the dependency functions\nas neural networks of similar sizes to those in the generative models. Both posterior families are\ngenerally applicable to arbitrary Graphical-GANs and we use them in the two different instances,\nrespectively. See Fig. 1 (b) for an illustration.\n\n2.3 Learning Algorithm\n\nLet \u03b8 and \u03c6 denote the parameters in the generative model, p, and the recognition model, q, respectively.\nOur goal is to learn \u03b8 and \u03c6 jointly via divergence minimization, which is formulated as:\n\nmin_{\u03b8,\u03c6} D(q(X, Z)||p(X, Z)), (5)\n\nwhere we omit the subscripts of the associated graphs in p and q for simplicity. We restrict D\nto be in the f-divergence family [5], that is, D(q(X, Z)||p(X, Z)) = \u222b p(X, Z) f(q(X, Z)/p(X, Z)) dXdZ, where\nf is a convex function of the likelihood ratio. The Kullback-Leibler (KL) divergence and the\nJensen-Shannon (JS) divergence are included.\nNote that we cannot optimize Eqn. (5) directly because the likelihood ratio is unknown given implicit\np(X, Z). To this end, ALI [8, 9] introduces a parametric discriminator to estimate the divergence\nby discriminating the samples from the two models. We can directly apply ALI to Graphical-GAN by\ntreating all variables as a whole; we refer to this as the global baseline (see Appendix B for the formal\nalgorithm). The global baseline uses a single discriminator that takes all variables as input. It may be\nsub-optimal in practice because the capability of a single discriminator is insufficient to distinguish\ncomplex data, which makes the estimate of the divergence unreliable. Intuitively, the problem will\nbe easier if we exploit the data structures explicitly when discriminating the samples. This intuition\nmotivates us to propose a local algorithm like Expectation Propagation (EP) [27], which is known\nas a deterministic approximation algorithm with analytic and computational advantages over other\napproximations, including Variational Inference [21].\nFollowing EP, we start from the factorization of p(X, Z) in terms of a set of factors FG:\n\np(X, Z) \u221d \u220f_{A\u2208FG} p(A). (6)\n\nGenerally, we can choose any reasonable FG,3 but here we specify that FG consists of the families\n(x, paG(x)) and (z, paG(z)) for all x and z in the model. We assume that the recognition model can\nalso be factorized in the same way. Namely, we have\n\nq(X, Z) \u221d \u220f_{A\u2208FG} q(A). (7)\n\nInstead of minimizing Eqn. (5), EP iteratively minimizes a local divergence in terms of each factor\nindividually. 
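The per-factor idea can be sketched in code before it is formalized: draw joint samples once, project each sample onto every factor A, and score each projection with that factor's own discriminator; averaging gives a Monte Carlo estimate of the objective derived as Eqn. (12) below. This is a hypothetical illustration, not the paper's implementation; the constant discriminators stand in for untrained networks.

```python
import math

def local_objective(q_samples, p_samples, factors, discriminators):
    """Monte Carlo estimate of the averaged per-factor GAN objective.

    q_samples, p_samples: lists of dicts, each a full joint sample of all variables.
    factors:              list of tuples of variable names, one tuple per factor A.
    discriminators:       dict mapping a factor to a function scoring the projected
                          sub-sample with a probability in (0, 1).
    """
    total = 0.0
    for A in factors:
        D = discriminators[A]
        project = lambda s: tuple(s[v] for v in A)  # keep only the factor's variables
        total += sum(math.log(D(project(s))) for s in q_samples) / len(q_samples)
        total += sum(math.log(1.0 - D(project(s))) for s in p_samples) / len(p_samples)
    return total / len(factors)                      # average over the local factors

# With untrained (constant 0.5) discriminators the estimate equals 2*log(0.5),
# the usual GAN value at the indistinguishable point, for any samples and factors.
q = [{"x": 0.1, "z": 0.2}, {"x": 0.3, "z": 0.4}]
p = [{"x": 0.5, "z": 0.6}]
factors = [("x", "z"), ("z",)]
value = local_objective(q, p, factors, {A: (lambda t: 0.5) for A in factors})
```

Note that the joint samples are drawn once and reused by every factor, which is the efficiency argument made for Eqn. (11).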
Formally, for factor A, we're interested in the following divergence [27, 28]:\n\nD(q(A)q(\u00afA)||p(A)p(\u00afA)), (8)\n\nwhere p(\u00afA) denotes the marginal distribution over the complement \u00afA of A. EP [27] further\nassumes that q(\u00afA) \u2248 p(\u00afA) to make the expression tractable. Though the approximation cannot be\njustified theoretically, empirical results [28] suggest that the gap is small if the approximate posterior\nis a good fit to the true one. Given the approximation, for each factor A, the objective function\nchanges to:\n\nD(q(A)q(\u00afA)||p(A)q(\u00afA)). (9)\n\nHere we make the same assumption because q(\u00afA) will be cancelled in the likelihood ratio if D\nbelongs to the f-divergence family, and we can ignore the other factors when checking factor A, which reduces the\ncomplexity of the problem. For instance, we can approximate the JS divergence for factor A as:\n\nDJS(q(X, Z)||p(X, Z)) \u2248 Eq[log q(A)/m(A)] + Ep[log p(A)/m(A)], (10)\n\nwhere m(A) = (p(A) + q(A))/2. See Appendix C for the derivation. As we are doing amortized\ninference, we further average the divergences over all local factors as:\n\n(1/|FG|) \u2211_{A\u2208FG} (Eq[log q(A)/m(A)] + Ep[log p(A)/m(A)]) = Eq[(1/|FG|) \u2211_{A\u2208FG} log q(A)/m(A)] + Ep[(1/|FG|) \u2211_{A\u2208FG} log p(A)/m(A)]. (11)\n\nThe equality holds due to the linearity of the expectation. The expression in Eqn. (11) provides an\nefficient solution where we can obtain samples over the entire variable space once and repeatedly\nproject the samples into each factor. Finally, we can estimate the local divergences using individual\ndiscriminators and the entire objective function is as follows:\n\nmax_\u03c8 (1/|FG|) Eq[\u2211_{A\u2208FG} log DA(A)] + (1/|FG|) Ep[\u2211_{A\u2208FG} log(1 \u2212 DA(A))], (12)\n\nwhere DA is the discriminator for the factor A and \u03c8 denotes the parameters in all discriminators.\nThough we assume that q(X, Z) shares the same factorization with p(X, Z) as in Eqn. 
(7) when\nderiving the objective function, the result in Eqn. (12) does not specify the form of q(X, Z). This is\nbecause we do not need to compute q(A) explicitly; instead, we directly estimate the likelihood\nratio based on samples. This makes it possible for Graphical-GAN to use an arbitrary q(X, Z),\nincluding the two recognition models presented in Sec. 2.2, as long as we can sample from it quickly.\nGiven the divergence estimate, we perform stochastic gradient descent to update the parameters. We\nuse the reparameterization trick [17] and the Gumbel-Softmax trick [12] to estimate the gradients with\ncontinuous and discrete random variables, respectively. We summarize the procedure in Algorithm 1.\n3For instance, we can specify that FG has only one factor that involves all variables, which reduces to ALI.\n\n3 Two Instances\n\nWe consider two common and typical scenarios involving structured data in practice. In the first one, the dataset consists of images with discrete attributes\nor classes but the groundtruth for an individual sample is unknown. In the second one, the dataset consists of sequences of images with temporal dependency within each sequence. We present\ntwo important instances of Graphical-GAN, i.e. Gaussian Mixture GAN (GMGAN) and State Space GAN (SSGAN),\nto deal with these two scenarios, respectively. These instances show the abilities of our general framework to deal\nwith discrete latent variables and complex structures, respectively.\nGMGAN We assume that the data consists of K mixtures and hence use a mixture of Gaussians\nprior. Formally, the generative process of GMGAN is:\n\nAlgorithm 1 Local algorithm for Graphical-GAN\nrepeat\n\u2022 Get a minibatch of samples from p(X, Z)\n\u2022 Get a minibatch of samples from q(X, Z)\n\u2022 Approximate the divergence D(q(X, Z)||p(X, Z))\nusing Eqn. 
(12) and the current value of \u03c8\n\u2022 Update \u03c8 to maximize the divergence\n\u2022 Get a minibatch of samples from p(X, Z)\n\u2022 Get a minibatch of samples from q(X, Z)\n\u2022 Approximate the divergence D(q(X, Z)||p(X, Z)) using Eqn. (12) and the current value of \u03c8\n\u2022 Update \u03b8 and \u03c6 to minimize the divergence\nuntil convergence or reaching a certain threshold\n\nk \u223c Cat(\u03c0), h|k \u223c N(\u00b5k, \u03a3k), x|h = G(h),\n\nwhere Z = (k, h), and \u03c0 and G are the coefficient vector and the generator, respectively. We assume\nthat \u03c0 and the \u03a3k are fixed as the uniform prior and identity matrices, respectively. Namely, we only\nhave a few extra trainable parameters, i.e. the means of the mixtures, \u00b5k.\nWe use the inverse factorization as the recognition model because it preserves the dependency\nrelationships in the model. The resulting approximate posterior is a simple inverse chain as follows:\n\nh|x = E(x), q(k|h) = \u03c0k N(h|\u00b5k, \u03a3k) / \u2211k' \u03c0k' N(h|\u00b5k', \u03a3k'),\n\nwhere E is the extractor that maps data points to the latent variables.\nIn the global baseline, a single network is used to discriminate the (x, h, k) tuples. In our local algorithm, two separate networks are introduced to discriminate the (x, h) and (h, k) pairs, respectively.\nSSGAN We assume that there are two types of latent variables. One is invariant across time, denoted\nas h, and the other varies across time, denoted as vt for time stamp t = 1, ..., T. Further, SSGAN\nassumes that the vt form a Markov chain. 
Formally, the generative process of SSGAN is:\n\nv1 \u223c N(0, I), h \u223c N(0, I),\n\u03b5t \u223c N(0, I), \u2200t = 1, 2, ..., T \u2212 1,\nvt+1|vt = O(vt, \u03b5t), \u2200t = 1, 2, ..., T \u2212 1,\nxt|h, vt = G(h, vt), \u2200t = 1, 2, ..., T,\n\nwhere Z = (h, v1, ..., vT), and O and G are the transition operator and the generator, respectively.\nThey are shared across time under the stationarity and output-independence assumptions, respectively.\nFor simplicity, we use the mean-field recognition model as the approximate posterior:\n\nh|x1, x2, ..., xT = E1(x1, x2, ..., xT),\nvt|x1, x2, ..., xT = E2(xt), \u2200t = 1, 2, ..., T,\n\nwhere E1 and E2 are the extractors that map the data points to h and vt, respectively. E2 is also shared\nacross time.\nIn the global baseline, a single network is used to discriminate the (x1, ..., xT, v1, ..., vT, h) samples.\nIn our local algorithm, two separate networks are introduced to discriminate the (vt, vt+1) pairs and\n(xt, vt, h) tuples, respectively. Both networks are shared across time as well.\n\n4 Related Work\n\nGeneral framework The works of [13, 16, 22] are the closest to ours on structured deep generative\nmodels. Johnson et al. [13] introduce structured Bayesian priors to Variational Auto-Encoders\n(VAE) [17] and propose efficient inference algorithms with conjugate exponential family structure.\nLin et al. [22] consider a model similar to that of [13] and derive an amortized variational message passing\nalgorithm to simplify and generalize [13]. Compared to [13, 22], Graphical-GAN is more flexible in\nthe model definition and learning methods, and hence can deal with natural data.\nAdversarial Message Passing (AMP) [16] also considers structured implicit models, but there exist\nseveral key differences that make our work unique. Theoretically, Graphical-GAN and AMP optimize\ndifferent local divergences. As presented in Sec. 
2.3, we follow the recipe of EP precisely to optimize\nD(q(A)q(\u00afA)||p(A)q(\u00afA)) and naturally derive our algorithm, which involves only the factors defined\nby p(X, Z), e.g. A = (zi, paG(zi)). On the other hand, AMP optimizes another local divergence,\nD(q(A')||p(A)), where A' is a factor defined by q(X, Z), e.g. A' = (zi, paH(zi)). In general, A'\ncan be different from A because the DAGs G and H have different structures. Further, this theoretical\ndifference really matters in practice. In AMP, the two factors involved in the local divergence are\ndefined over different domains and hence may have different dimensionalities in general. Therefore, it\nremains unclear how to implement AMP4 because a discriminator cannot take two types of inputs\nwith different dimensionalities. In fact, no empirical evidence is reported in AMP [16]. In contrast,\nGraphical-GAN is easy to implement by considering only the factors defined by p(X, Z) and achieves\nexcellent empirical results (see Sec. 5).\nThere is much work on the learning of implicit models. f-GAN [31] and WGAN [2] generalize\nthe original GAN using the f-divergence and the Wasserstein distance, respectively. The work of [40]\nminimizes a penalized form of the Wasserstein distance from the optimal transport point of view and\nnaturally considers generative modelling and inference together. The Wasserstein distance\ncan also be used in Graphical-GAN to generalize our current algorithms, and we leave this for\nfuture work. The recent works of [34] and [41] perform Bayesian learning for GANs. In comparison,\nGraphical-GAN focuses on learning a probabilistic graphical model with latent variables instead of\nposterior inference on global parameters.\nInstances Several methods have learned discrete structures in an unsupervised manner. Makhzani\net al. 
[24] extend an autoencoder to a generative model by matching the aggregated posterior to a prior\ndistribution and shows the ability to cluster handwritten digits. [4] introduce some interpretable codes\nindependently from the other latent variables and regularize the original GAN loss with the mutual\ninformation between the codes and the data. In contrast, GMGAN explicitly builds a hierarchical\nmodel with top-level discrete codes and no regularization is required. The most direct competitor [7]\nextends VAE [17] with a mixture of Gaussian prior and is compared with GMGAN in Sec. 5.1.\nThere exist extensive prior methods on synthesizing videos but most of them condition on input\nframes [38, 32, 25, 15, 46, 43, 6, 42]. Three of these methods [44, 35, 42] can generate videos\nwithout input frames. In [44, 35], all latent variables are generated jointly and without structure. In\ncontrast, SSGAN explicitly disentangles the invariant latent variables from the variant ones and builds\na Markov chain on the variant ones, which makes it possible to do motion analogy and generalize to\nlonger sequences. MoCoGAN [42] also exploits the temporal dependency of the latent variables via\na recurrent neural network but it requires heuristic regularization terms and focuses on generation. In\ncomparison, SSGAN is an instance of the Graphical-GAN framework, which provides theoretical\ninsights and a recognition model for inference.\nCompared with all instances, Graphical-GAN does not focus on a speci\ufb01c structure, but provides a\ngeneral way to deal with arbitrary structures that can be encoded as Bayesian networks.\n\n5 Experiments\n\nWe implement our model using the TensorFlow [1] library.5 In all experiments, we optimize the JS-\ndivergence. We use the widely adopted DCGAN architecture [33] in all experiments to fairly compare\nGraphical-GAN with existing methods. We evaluate GMGAN on the MNIST [20], SVHN [30],\nCIFAR10 [19] and CelebA [23] datasets. 
We evaluate SSGAN on the Moving MNIST [38] and 3D\nchairs [3] datasets. See Appendix D for further details of the model and datasets.\nIn our experiments, we are going to show that\n\n4Despite our best efforts to contact the authors we did not receive an answer of the issue.\n5Our source code is available at https://github.com/zhenxuan00/graphical-gan.\n\n6\n\n\f(a) GAN-G\n\n(b) GMGAN-G (K = 10) (c) GMGAN-L (K = 10)\n\n(d) GMVAE (K = 10)\n\nFigure 2: Samples on the MNIST dataset. The results of (a) are comparable to those reported in [8].\nThe mixture k is \ufb01xed in each column of (b) and (c). k is \ufb01xed in each row of (d), which is from [7].\n\n(a) (K = 50)\n\n(b) (K = 30)\n\n(c) (K = 100)\n\nFigure 4: Part of samples of GMGAN-L on SVHN (a) CIFAR10 (b) and CelebA (c) datasets. The\nmixture k is \ufb01xed in each column. See the complete results in Appendix E.\n\n\u2022 Qualitatively, Graphical-GAN can infer the latent structures and generate structured samples\n\nwithout any regularization, which is required by existing models [4, 43, 6, 42];\n\n\u2022 Quantitatively, Graphical-GAN can outperform all baseline methods [7\u20139] in terms of\ninference accuracy, sample quality and reconstruction error consistently and substantially.\n\n5.1 GMGAN Learns Discrete Structures\n\n(a) GAN-G (b) GMGAN-L (c) GAN-G (d) GMGAN-L\n\nWe focus on the unsupervised learning\nsetting in GMGAN. Our assumption is\nthat there exist discrete structures, e.g.\nclasses and attributes, in the data but\nthe ground truth is unknown. We com-\npare Graphical-GAN with three existing\nmethods, i.e. ALI [8, 9], GMVAE [38]\nand the global baseline. For simplic-\nity, we denote the global baseline and\nour local algorithm as GMGAN-G and\nGMGAN-L, respectively. Following this,\nwe also denote ALI as GAN-G.\nWe \ufb01rst compare the samples of all mod-\nels on the MNIST dataset in Fig. 2. As\nfor sample quality, GMGAN-L has less meaningless samples compared with GAN-G (i.e. 
ALI), and\nhas sharper samples than those of the GMVAE. Besides, as for clustering performance, GMGAN-L is\nsuperior to GMGAN-G and GMVAE with less ambiguous clusters. We then demonstrate the ability\nof GMGAN-L to deal with more challenging datasets. The samples on the SVHN, CIFAR10 and\nCelebA datasets are shown in Fig. 4. Given a \ufb01xed mixture k, GMGAN-L can generate samples with\nsimilar semantics and visual factors, including the object classes, backgrounds and attributes like\n\nFigure 3: Reconstruction on the MNIST and SVHN\ndatasets. Each odd column shows the test inputs and the\nnext even column shows the corresponding reconstruction.\n(a) and (c) are comparable to those reported in [8, 9].\n\n7\n\n\fTable 1: The clustering accuracy (ACC) [37], inception score (IS) [36] and mean square error\n(MSE) results for inference, generation and reconstruction tasks, respectively. The results of our\nimplementation are averaged over 10 (ACC) or 5 (IS and MSE) runs with different random seeds.\n\nAlgorithm\nGMVAE\nCatGAN\nGAN-G\nGMM (our implementation)\nGAN-G+GMM (our implementation)\nGMGAN-G (our implementation)\nGMGAN-L (ours)\n\nACC on MNIST\n92.77 (\u00b11.60) [7]\n\n-\n\n90.30 [37]\n68.33(\u00b10.21)\n70.27(\u00b10.50)\n91.62 (\u00b11.91)\n93.03 (\u00b11.65)\n\nIS on CIFAR10 MSE on MNIST\n\n-\n-\n\n-\n\n5.34 (\u00b10.05) [45]\n\n5.26 (\u00b10.05)\n5.41 (\u00b10.08)\n5.94 (\u00b10.06)\n\n-\n-\n-\n-\n\n0.071 (\u00b10.001)\n0.056 (\u00b10.001)\n0.044 (\u00b10.001)\n\nSSGAN-L 3DCNN ConcatX ConcatZ\n\nSSGAN-\nL\n\n3DCNN\n\nConcatX\n\nConcatZ\n\nFigure 5: Samples on the Moving MNIST and\n3D chairs datasets when T = 4. Each row in a\nsub\ufb01gure represents a video sample.\n\nFigure 6: Samples (\ufb01rst 12 frames) on the\nMoving MNIST dataset when T = 16.\n\n\u201cwearing glasses\u201d. We also show the samples of GMGAN-L by varying K and linearly interpolating\nthe latent variables in Appendix E.\nWe further present the reconstruction results in Fig. 3. 
GMGAN-L outperforms GAN-G signi\ufb01cantly\nin terms of preserving the same semantics and similar visual appearance. Intuitively, this is because\nthe Gaussian mixture prior helps the model learn a more spread latent space with less ambiguous\nareas shared by samples in different classes. We empirically verify the intuition by visualizing the\nlatent space via the t-SNE algorithm in Appendix E.\nFinally, we compare the models on inference, generation and reconstruction tasks in terms of three\nwidely adopted metrics in Tab. 1. As for the clustering accuracy, after clustering the test samples, we\n\ufb01rst \ufb01nd the sample that is nearest to the centroid of each cluster and use the label of that sample as\nthe prediction of the testing samples in the same cluster following [37]. GAN-G cannot cluster the\ndata directly, and hence we train a Gaussian mixture model (GMM) on the latent space of GAN-G\nand the two-stage baseline is denoted as GAN-G + GMM. We also train a GMM on the raw data as\nthe simplest baseline. For the GMM implementation, we use the sklearn package and the settings are\nsame as our Gaussian mixture prior. AAE [24] achieves higher clustering accuracy while it is less\ncomparable to our method. Nevertheless, GMGAN-L outperforms all baselines consistently, which\nagrees with the qualitative results. We also provide the clustering results on the CIFAR10 dataset in\nAppendix E.\n\n5.2 SSGAN Learns Temporal Structures\n\nWe denote the SSGAN model trained with the local algorithm as SSGAN-L. We construct three types\nof baseline models, which are trained with the global baseline algorithm but use discriminators with\ndifferent architectures. The ConcatX baseline concatenates all input frames together and processes\nthe input as a whole image with a 2D CNN. The ConcatZ baseline processes the input frames\nindependently using a 2D CNN and concatenates the features as the input for fully connected layers\nto obtain the latent variables. 
The 3DCNN baseline uses a 3D CNN to process the whole input\ndirectly. In particular, the 3DCNN baseline is similar to existing generative models [44, 35]. The\nkey difference is that we omit the two-stream architecture proposed in [44] and the singular value\nclipping proposed in [35] for fair comparison, as our contribution is orthogonal to these techniques.\n\nFigure 7: Motion analogy results. Each odd row\nshows an input and the next even row shows the\nsample.\n\nFigure 8: 16 of 200 frames generated by SSGAN-L. The frame indices are 47-50, 97-100, 147-150\nand 197-200 from left to right in each row.\n\nAlso note that our problem is more challenging than those in existing methods [44, 35] because the\ndiscriminator in Graphical-GAN needs to discriminate the latent variables besides the video frames.\nAll models can generate reasonable samples of length 4 on both the Moving MNIST and 3D chairs\ndatasets, as shown in Fig. 5. However, if the structure of the data gets complicated, i.e. T = 16\non Moving MNIST and T = 31 on 3D chairs, all baseline models fail while SSGAN-L can still\nsuccessfully generate meaningful videos, as shown in Fig. 6 and Appendix F, respectively. Intuitively,\nthis is because a single discriminator with limited capability cannot provide a reliable divergence estimate\nin practice. See the reconstruction results of SSGAN-L in Appendix F.\nCompared with existing GAN models [44, 35, 42] on videos, SSGAN-L can learn interpretable\nfeatures thanks to the factorial structure in each frame. We present the motion analogy results on\nthe 3D chairs dataset in Fig. 7. We extract the variant features v, i.e. the motion, from an input\ntest video and provide a fixed invariant feature h, i.e. the content, to generate samples. 
The\nsamples can track the motion of the corresponding input and share the same content at the same time.\nExisting methods [43, 6] on learning interpretable features rely on regularization terms to ensure the\ndisentanglement while SSGAN uses a purely adversarial loss.\nFinally, we show that though trained on videos of length 31, SSGAN can generate much longer\nsequences of 200 frames in Fig. 8 thanks to the Markov structure, which again demonstrates the\nadvantages of SSGAN over existing generative models [44, 35, 42].\n\n6 Conclusion\n\nThis paper introduces a \ufb02exible generative modelling framework called Graphical Generative Ad-\nversarial Networks (Graphical-GAN). Graphical-GAN provides a general solution to utilize the\nunderlying structural information of the data. Empirical results of two instances show the promise of\nGraphical-GAN on learning interpretable representations and generating structured samples. Possible\nextensions to Graphical-GAN include: generalized learning and inference algorithms, instances with\nmore complicated structures (e.g., trees) and semi-supervised learning for structured data.\n\nAcknowledgments\n\nThe work was supported by the National Key Research and Development Program of China (No.\n2017YFA0700900), the National NSF of China (Nos. 61620106010, 61621136008, 61332007), the\nMIIT Grant of Int. Man. Comp. Stan (No. 2016ZXFB00001), the Youth Top-notch Talent Support\nProgram, Tsinghua Tiangong Institute for Intelligent Computing, the NVIDIA NVAIL Program and a\nProject from Siemens. This work was done when C. Li visited the university of Amsterdam. During\nthis period, he was supported by China Scholarship Council.\n\nReferences\n[1] Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu\nDevin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for\nlarge-scale machine learning. 2016.\n\n[2] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. 
Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[3] Mathieu Aubry, Daniel Maturana, Alexei A Efros, Bryan C Russell, and Josef Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3762–3769, 2014.

[4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[5] Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.

[6] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4417–4426, 2017.

[7] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.

[8] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[9] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.
In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[11] Ferenc Huszár. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.

[12] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

[13] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.

[14] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[15] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.

[16] Theofanis Karaletsos. Adversarial message passing for graphical models. arXiv preprint arXiv:1612.05048, 2016.

[17] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[18] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

[19] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[21] Yingzhen Li, José Miguel Hernández-Lobato, and Richard E Turner. Stochastic expectation propagation. In Advances in Neural Information Processing Systems, pages 2323–2331, 2015.

[22] Wu Lin, Nicolas Hubacher, and Mohammad Emtiyaz Khan.
Variational message passing with structured inference networks. arXiv preprint arXiv:1803.05589, 2018.

[23] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[24] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[25] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.

[26] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.

[27] Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.

[28] Tom Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.

[29] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[30] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[31] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[32] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh.
Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.

[33] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[34] Yunus Saatci and Andrew G Wilson. Bayesian GAN. In Advances in Neural Information Processing Systems, pages 3622–3631, 2017.

[35] Masaki Saito and Eiichi Matsumoto. Temporal generative adversarial nets. arXiv preprint arXiv:1611.06624, 2016.

[36] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[37] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

[38] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852, 2015.

[39] Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048–3056, 2013.

[40] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

[41] Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. arXiv preprint arXiv:1702.08896, 2017.

[42] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993, 2017.

[43] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee.
Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.

[44] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613–621, 2016.

[45] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. 2016.

[46] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pages 91–99, 2016.