{"title": "Exponential Family Estimation via Adversarial Dynamics Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 10979, "page_last": 10990, "abstract": "We present an efficient algorithm for maximum likelihood estimation (MLE) of exponential family models, with a general parametrization of the energy function that includes neural networks. We exploit the primal-dual view of the MLE with a kinetics augmented model to obtain an estimate associated with an adversarial dual sampler. To represent this sampler, we introduce a novel neural architecture, dynamics embedding, that generalizes Hamiltonian Monte-Carlo (HMC). The proposed approach inherits the flexibility of HMC while enabling tractable entropy estimation for the augmented model. By learning both a dual sampler and the primal model simultaneously, and sharing parameters between them, we obviate the requirement to design a separate sampling procedure once the model has been trained, leading to more effective learning. We show that many existing estimators, such as contrastive divergence, pseudo/composite-likelihood, score matching, minimum Stein discrepancy estimator, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow, are special cases of the proposed approach, each expressed by a different (fixed) dual sampler. 
An empirical investigation shows that adapting the sampler during MLE can significantly improve on state-of-the-art estimators.", "full_text": "Exponential Family Estimation via\nAdversarial Dynamics Embedding\n\n\u21e4Bo Dai1, \u21e4Zhen Liu2, \u21e4Hanjun Dai1, Niao He3, Arthur Gretton4, Le Song5,6, Dale Schuurmans1,7\n\n1Google Research, Brain Team, 2Mila, University of Montreal,\n\n3University of Illinois at Urbana Champaign, 4University College London,\n5Georgia Institute of Technology, 6Ant Financial, 7University of Alberta\n\nAbstract\n\nWe present an ef\ufb01cient algorithm for maximum likelihood estimation (MLE) of\nexponential family models, with a general parametrization of the energy function\nthat includes neural networks. We exploit the primal-dual view of the MLE with\na kinetics augmented model to obtain an estimate associated with an adversarial\ndual sampler. To represent this sampler, we introduce a novel neural architecture,\ndynamics embedding, that generalizes Hamiltonian Monte-Carlo (HMC). The\nproposed approach inherits the \ufb02exibility of HMC while enabling tractable entropy\nestimation for the augmented model. By learning both a dual sampler and the\nprimal model simultaneously, and sharing parameters between them, we obviate\nthe requirement to design a separate sampling procedure once the model has\nbeen trained, leading to more effective learning. We show that many existing\nestimators, such as contrastive divergence, pseudo/composite-likelihood, score\nmatching, minimum Stein discrepancy estimator, non-local contrastive objectives,\nnoise-contrastive estimation, and minimum probability \ufb02ow, are special cases of the\nproposed approach, each expressed by a different (\ufb01xed) dual sampler. 
An empirical\ninvestigation shows that adapting the sampler during MLE can signi\ufb01cantly improve\non state-of-the-art estimators1.\n\nIntroduction\n\n1\nThe exponential family is one of the most important classes of distributions in statistics and machine\nlearning, encompassing undirected graphical models (Wainwright and Jordan, 2008) and energy-\nbased models (LeCun et al., 2006; Wu et al., 2018), which include, for example, Markov random\n\ufb01elds (Kinderman and Snell, 1980), conditional random \ufb01elds (Lafferty et al., 2001) and language\nmodels (Mnih and Teh, 2012). Despite the \ufb02exibility of this family and the many useful properties it\npossesses (Brown, 1986), most such distributions are intractable because the partition function does\nnot possess an analytic form. This leads to dif\ufb01culty in evaluating, sampling and learning exponential\nfamily models, hindering their application in practice. In this paper, we consider a longstanding\nquestion:\n\nCan a simple yet effective algorithm be developed for estimating general exponen-\ntial family distributions?\n\nThere has been extensive prior work addressing this question. Many approaches focus on approximat-\ning maximum likelihood estimation (MLE), since it is well studied and known to possess desirable\nstatistical properties, such as consistency, asymptotic unbiasedness, and asymptotic normality (Brown,\n1986). One prominent example is contrastive divergence (CD) (Hinton, 2002) and its variants (Tiele-\nman and Hinton, 2009; Du and Mordatch, 2019). It approximates the gradient of the log-likelihood by\na stochastic estimator that uses samples generated from a few Markov chain Monte Carlo (MCMC)\nsteps. This approach has two shortcomings: \ufb01rst and foremost, the stochastic gradient is biased,\n\n\u21e4indicates equal contribution. 
Email: {bodai, hadai}@google.com, zhen.liu.2@umontreal.ca.
1The code repository is available at https://github.com/lzzcd001/ade-code.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

which can lead to poor estimates; second, CD and its variants require careful design of the MCMC transition kernel, which can be challenging.

Given these difficulties with MLE, numerous learning criteria have been proposed to avoid the partition function. Pseudo-likelihood estimators (Besag, 1975) approximate the joint distribution by the product of conditional distributions, each of which only represents the distribution of a single random variable conditioned on the others. However, the partition function of each factor is still generally intractable. Score matching (Hyvärinen, 2005) minimizes the Fisher divergence between the empirical distribution and the model. Unfortunately, it requires third-order derivatives for optimization, which becomes prohibitive for large models (Kingma and LeCun, 2010; Li et al., 2019). Noise-contrastive estimation (Gutmann and Hyvärinen, 2010) recasts the problem as ratio estimation between the target distribution and a pre-defined auxiliary distribution. However, the auxiliary distribution must cover the support of the data with an analytical expression that still allows efficient sampling; this requirement is difficult to satisfy in practice, particularly in high-dimensional settings. Minimum probability flow (Sohl-Dickstein et al., 2011) exploits the observation that, ideally, the empirical distribution will be the stationary distribution of transition dynamics defined under an optimal model. The model can then be estimated by matching these two distributions.
Even though\nthis idea is inspiring, it is challenging to construct appropriate dynamics that yield ef\ufb01cient learning.\nIn this paper, we introduce a novel algorithm, Adversarial Dynamics Embedding (ADE), that directly\napproximates the MLE while achieving computational and statistical ef\ufb01ciency. Our development\nstarts with the primal-dual view of the MLE (Dai et al., 2019) that provides a natural objective for\njointly learning both a sampler and a model, as a remedy for the expensive and biased MCMC steps\nin the CD algorithm. To parameterize the dual distribution, Dai et al. (2019) applies a naive transport\nmapping, which makes entropy estimation dif\ufb01cult and requires learning an extra auxiliary model,\nincurring additional computational and memory cost.\nWe overcome these shortcomings by considering a different approach, inspired by the properties of\nHamiltonian Monte-Carlo (HMC) (Neal, 2011):\n\ni) HMC forms a stationary distribution with independent potential and kinetic variables;\nii) HMC can approximate the exponential family arbitrarily closely.\n\nAs in HMC, we consider an augmented model with latent kinetic variables in Section 3.1, and\nintroduce a novel neural architecture in Section 3.2, called dynamics embedding, that mimics sampling\nand represents the dual distribution via parameters of the primal model. This approach shares with\nHMC the advantage of a tractable entropy function for the augmented model, while enriching\nthe \ufb02exibility of sampler without introducing extra parameters. In Section 3.3 we develop a max-\nmin objective that allows the shared parameters in primal model and dual sampler to be learned\nsimultaneously, which improves computational and sample ef\ufb01ciency. 
We further show that the proposed estimator subsumes CD, pseudo-likelihood, score matching, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow as special cases with hand-designed dual samplers in Section 4. Finally, in Section 5 we find that the proposed approach can outperform current state-of-the-art estimators in a series of experiments.

2 Preliminaries

Exponential family and energy-based model  The natural form of the exponential family over Ω ⊆ R^d is defined as

p_{f_0}(x) = exp(f_0(x) + log p_0(x) − A_{p_0}(f_0)),   A_{p_0}(f_0) := log ∫_Ω exp(f_0(x)) p_0(x) dx,   (1)

where f_0(x) = w^⊤ φ(x). The sufficient statistic φ(·) : Ω → R^k can be any general parametric model, e.g., a neural network. The (w, φ) are the parameters to be learned from observed data.

The exponential family definition (1) includes the energy-based model (LeCun et al., 2006) as a special case, by setting f_0(x) = φ(x) with k = 1, which has been generalized to the infinite-dimensional case (Sriperumbudur et al., 2017). The p_0(x) is fixed and covers the support Ω, which is usually unknown in practical high-dimensional problems. Therefore, we focus on learning f(x) = f_0(x) + log p_0(x) jointly with p_0(x), which is more difficult: in particular, the doubly dual embedding approach (Dai et al., 2019) is no longer applicable.

Given a sample D = {x_i}_{i=1}^N and denoting f ∈ F as the valid parametrization family, an exponential family model can be estimated by maximum log-likelihood, i.e.,

max_{f∈F} L(f) := Ê_D[f(x)] − A(f),   A(f) = log ∫_Ω exp(f(x)) dx,   (2)

with gradient ∇_f L(f) = Ê_D[∇_f f(x)] − E_{p_f(x)}[∇_f f(x)].
Since A(f) and E_{p_f(x)}[∇_f f(x)] are both intractable, solving the MLE for a general exponential family model is very difficult.

Dynamics-based MCMC  Dynamics-based MCMC is a general and effective tool for sampling. The idea is to represent the target distribution as the solution to a set of (stochastic) differential equations, which allows samples from the target distribution to be obtained by simulating along the dynamics defined by the differential equations.

HMC (Neal, 2011) is a representative algorithm in this category, which exploits the well-known Hamiltonian dynamics. Specifically, given a target distribution p_f(x) ∝ exp(f(x)), the Hamiltonian is defined as H(x, v) = −f(x) + k(v), where k(v) = ½ v^⊤v is the kinetic energy. The Hamiltonian dynamics generate (x, v) over time t by following

[dx/dt, dv/dt] = [∂_v H(x, v), −∂_x H(x, v)] = [v, ∇_x f(x)].   (3)

Asymptotically as t → ∞, x visits the underlying space according to the target distribution. In practice, to reduce discretization error, an acceptance-rejection step is introduced. The finite-step dynamics-based MCMC sampler can be used for approximating E_{p_f(x)}[∇_f f(x)] in ∇_f L(f), which leads to the CD algorithm (Hinton, 2002; Zhu and Mumford, 1998).

Primal-dual view of MLE  The Fenchel duality of A(f) has been exploited (Rockafellar, 1970; Wainwright and Jordan, 2008; Dai et al., 2019) as another way to address the intractability of the log-partition function.

Theorem 1 (Fenchel dual of log-partition (Wainwright and Jordan, 2008))  Let H(q) := −∫_Ω q(x) log q(x) dx. Then:

A(f) = max_{q∈P} ⟨q(x), f(x)⟩ + H(q),   p_f(x) = argmax_{q∈P} ⟨q(x), f(x)⟩ + H(q),   (4)

where P denotes the space of distributions and ⟨f, g⟩ = ∫_Ω f(x) g(x) dx.

Plugging the Fenchel dual of A(f) into the MLE (2), we arrive at a max-min reformulation

max_{f∈F} min_{q∈P} Ê_D[f(x)] − E_{q(x)}[f(x)] − H(q),   (5)

which bypasses the explicit computation of the partition function. Another byproduct of the primal-dual view is that the dual distribution can be used for inference; however, in vanilla estimators this usually requires expensive sampling algorithms.

The dual sampler q(·) plays a vital role in the primal-dual formulation of the MLE in (5). To achieve better performance, we have several principal requirements in parameterizing the dual distribution:

i) the parametrization family needs to be flexible enough to achieve small error in solving the inner minimization problem;

ii) the entropy of the parametrized dual distribution should be tractable.

Moreover, as shown in (4) in Theorem 1, the optimal dual sampler q(·) is determined by the primal potential function f(·). This leads to the third requirement:

iii) the parametrized dual sampler should explicitly incorporate the primal model f.

Such a dependence can potentially reduce both the memory and learning sample complexity.

A variety of techniques have been developed for distribution parameterization, such as reparametrized latent variable models (Kingma and Welling, 2014; Rezende et al., 2014), transport mapping (Goodfellow et al., 2014), and normalizing flow (Rezende and Mohamed, 2015; Dinh et al., 2017; Kingma et al., 2016).
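On a finite state space, Theorem 1 can be verified directly: A(f) reduces to a log-sum-exp, and the maximizer of ⟨q, f⟩ + H(q) is the softmax distribution. The following small numerical check (our illustration, not part of the paper) confirms that the dual optimum attains exactly A(f):

```python
# Illustration only: on 5 discrete states, A(f) = logsumexp(f) and the dual
# maximizer of <q, f> + H(q) is q* = softmax(f), attaining the value A(f).
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)                               # arbitrary potential values

A = np.log(np.sum(np.exp(f)))                        # log-partition A(f)
q_star = np.exp(f - A)                               # softmax: optimal dual q*
dual_value = q_star @ f - q_star @ np.log(q_star)    # <q*, f> + H(q*)

print(np.isclose(A, dual_value))                     # True: the dual attains A(f)

# any suboptimal q attains a smaller value, consistent with the max in (4)
q_unif = np.ones(5) / 5
print(q_unif @ f - q_unif @ np.log(q_unif) <= A)     # True
```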
However, none of these satisfies the requirements of flexibility and a tractable density simultaneously, nor do they offer a principled way to couple the parameters of the dual sampler with the primal model.

3 Adversarial Dynamics Embedding

By augmenting the original exponential family with kinetic variables, we can parametrize the dual sampler with a dynamics embedding that satisfies all three requirements without affecting the MLE, allowing the primal potential function and dual sampler to both be trained adversarially. We start with the embedding of classical Hamiltonian dynamics (Neal, 2011; Caterini et al., 2018) for the dual sampler parametrization, as a concrete example, then discuss its generalization in latent space and the stochastic Langevin dynamics embedding. This technique is extended to other dynamics, with their own advantages, in Appendix B.

3.1 Primal-Dual View of Augmented MLE

As noted, it is difficult to find a parametrization of q(x) in (5) that simultaneously satisfies all three requirements. Therefore, instead of directly tackling (5) in the original model, and inspired by HMC, we consider the augmented exponential family p(x, v) with an auxiliary momentum variable, i.e.,

p(x, v) = exp(f(x) − ½ v^⊤v) / Z(f),   Z(f) = ∫∫ exp(f(x) − ½ v^⊤v) dx dv.   (6)

The MLE of such a model can be formulated as

max_f L(f) := Ê_{x∼D}[log ∫ p(x, v) dv] = Ê_{x∼D} E_{p(v|x)}[f(x) − ½ v^⊤v − log p(v|x)] − log Z(f),   (7)

where the last equation comes from the true posterior p(v|x) = N(0, I) due to the independence of
This independence also induces the equivalent MLE as proved in Appendix A.\nTheorem 2 (Equivalent MLE) The MLE of the augmented model is the same as the original MLE.\nApplying the Fenchel dual to Z (f ) of the augmented model (6), we derive a primal-dual formulation\nof (7), leading to the objective,\n\n(8)\nThe q (x, v) in (8) contains momentum v as the latent variable. One can also exploit the latent variable\n\nL (f ) / minq(x,v)2P bEx\u21e0D [f (x)] Eq(x,v)\u21e5f (x) \n\nmodel for q (x) =R q (x|v) q (v) dv in (5). However, the H (q) in (5) requires marginalization, which\nis intractable in general, and usually estimated through variational inference with the introduction of\nan extra posterior model q (v|x). Instead, by considering the speci\ufb01cally designed augmented model,\n(8) eliminates these extra variational steps.\nSimilarly, one can consider the latent variable augmented model with multiple momenta, i.e.,\n\n2 v>v log q (x, v)\u21e4 .\n\n(7)\n\np\u21e3x,vi T\n\ni=1\u2318 =\n\nL (f ) / minq(x,{vi}T\n\ni\n\ni=1\nZ(f )\n\n2\u2318\n2 kvik2\n\nexp\u21e3f (x)PT\ni=1)2PbEx\u21e0D [f (x)] Eq(x,{vi}T\n\n, leading to the optimization\n\ni=1)hf (x) PT\n\ni=1\n\ni\n\n2 vi2\n\n2 log q\u21e3x,vi T\n\ni=1\u2318i .\n\n3.2 Representing Dual Sampler via Primal Model\nWe now introduce the Hamiltonian dynamics embedding to represent the dual sampler q (\u00b7), as well\nas its generalization and special instantiation that satisfy all three of the principal requirements.\nThe vanilla HMC is derived by discretizing the Hamiltonian dynamics (3) with a leapfrog integrator.\nSpeci\ufb01cally, in a single time step, the sample (x, v) moves towards (x0, v0) according to\n\n(9)\n\n(10)\n\nwhere \u2318 is de\ufb01ned as the leapfrog stepsize. Let us denote the one-step leapfrog as (x0, v0) =\n\nv 1\n2 = v + \u2318\n\nx0 = x + \u2318v 1\n\n(x0, v0) = Lf,\u2318 (x, v) :=0@\nxT , vT = Lf,\u2318 Lf,\u2318 . . . 
Lf,\u2318x0, v0 .\n\n2rxf (x)\n2rxf (x0)\n\nv0 = v 1\n\n2 + \u2318\n\n\u2713 (x, v). After T iterations, we obtain\n\n1A ,\n\n2\n\nLf,\u2318 (x, v) and assume thex0, v0 \u21e0 q0\n\n(11)\nNote that this can be viewed as a neural network with a special architecture, which we term Hamilto-\nnian (HMC) dynamics embedding. Such a representation explicitly characterizes the dual sampler by\nthe primal model, i.e., the potential function f, meeting the dependence requirement.\nThe \ufb02exibility of the distributions HMC embedding actually is ensured by the nature of the dynamics-\nbased samplers. In the limiting case, the proposed neural network (11) reduces to a gradient \ufb02ow,\nwhose stationary distribution is exactly the model distribution:\n\np (x, v) = argmaxq(x,v)2P Eq(x,v)\u21e5f (x) \n\n2 v>v log q (x, v)\u21e4 .\n\nThe approximation strength of the HMC embedding is formally justi\ufb01ed as follows:\nTheorem 3 (HMC embeddings as gradient \ufb02ow) In continuous time, i.e. with in\ufb01nitesimal step-\nsize \u2318 ! 0, the density of particles (xt, vt), denoted qt (x, v), follows the Fokker-Planck equation\n(12)\n0, which has a stationary distribution p (x, v) / exp (H (x, v)) with the\nwith G = \uf8ff 0\nmarginal distribution p(x) / exp (f (x)).\n\n@t = r \u00b7 (qt (x, v) GrH (x, v)) ,\n\n@qt(x,v)\n\nI\n\nI\n\n4\n\n\fv0 = v 1\n\nDetails of the proofs are given in Appendix A. Note that this stationary distribution result is an\ninstance of the more general dynamics described in Ma et al. (2015), showing the \ufb02exility of the\ninduced distributions. As demonstrated in Theorem 3, the neural parametrization formed by the HMC\nembedding is able to well approximate an exponential family distribution on continuous variables.\nRemark (Generalized HMC dynamics in latent space) The leapfrog operation in vanilla HMC\nworks directly in the original observation space, which could be high-dimensional and noisy. 
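Before moving to the generalized dynamics, the vanilla leapfrog layer (10) and its unrolling (11) can be sketched in a few lines; the toy quadratic potential below is our own illustration (so that the Hamiltonian is known in closed form), and the near-conservation of H is what underlies the tractable density of Theorem 4:

```python
# Sketch of the leapfrog layer L_{f,eta} in eq (10), unrolled T times as in (11).
# Toy potential f(x) = -x^2/2 (our assumption), so grad f(x) = -x and
# H(x, v) = x^2/2 + v^2/2.
import numpy as np

def leapfrog(x, v, grad_f, eta):
    """One leapfrog layer: volume-preserving, so it has unit Jacobian."""
    v_half = v + 0.5 * eta * grad_f(x)
    x_new = x + eta * v_half
    v_new = v_half + 0.5 * eta * grad_f(x_new)
    return x_new, v_new

grad_f = lambda x: -x

rng = np.random.default_rng(0)
x, v = rng.normal(size=1000), rng.normal(size=1000)   # (x^0, v^0) ~ q^0_theta
H0 = 0.5 * x**2 + 0.5 * v**2

eta, T = 0.05, 20
for _ in range(T):                                     # the unrolled "network"
    x, v = leapfrog(x, v, grad_f, eta)

H1 = 0.5 * x**2 + 0.5 * v**2
print(np.max(np.abs(H1 - H0)))   # small: leapfrog nearly conserves H
```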
We generalize the leapfrog update rule to the latent space and form a new dynamics as follows,

(x′, v′) = L_{f,η,S,g}(x, v) :=
v^{1/2} = v ⊙ exp(S_v(∇_x f(x), x)) + (η/2) g_v(∇_x f(x), x)
x′ = x ⊙ exp(S_x(v^{1/2})) + η g_x(v^{1/2})
v′ = v^{1/2} ⊙ exp(S_v(∇_x f(x′), x′)) + (η/2) g_v(∇_x f(x′), x′),   (13)

where v ∈ R^l is the momentum evolving in an l-dimensional latent space and ⊙ denotes the element-wise product. Specifically, the terms S_v(∇_x f(x), x) and S_x(v^{1/2}) rescale v and x coordinatewise. The term g_v(∇_x f(x), x) ↦ R^l can be understood as projecting the gradient information to the essential latent space where the momentum is evolving. Then, for updating x, the latent momentum is projected back to the original space via g_x(v^{1/2}) ↦ Ω. With these generalized leapfrog updates, the dynamical system avoids operating in the high-dimensional noisy input space, and becomes more computationally efficient. We emphasize that the proposed generalized leapfrog parametrization (13) is different from the one used in Levy et al. (2018), which is inspired by the real-NVP flow (Dinh et al., 2017).

By the generalized HMC embedding (13), we have a flexible layer (x′, v′) = L_{f,η,S,g}(x, v), where (S_v, S_x, g_v, g_x) will be learned in addition to the stepsize. Obviously, the classic HMC layer L_{f,η}(x, v) is a special case of L_{f,η,S,g}(x, v), obtained by setting (S_v, S_x) to zero and (g_v, g_x) to identity functions.

Remark (Stochastic Langevin dynamics)  Stochastic Langevin dynamics can also be recovered from the leapfrog step by resampling the momentum in every step. Specifically, the sample (x, ξ) moves according to

(x′, v′) = L^ξ_{f,η}(x) :=  { v′ = ξ + (η/2) ∇_x f(x);  x′ = x + η v′ },  with ξ ∼ q_θ(ξ).   (14)

Hence, stochastic Langevin dynamics resample ξ to replace the momentum in the leapfrog (10), ignoring the accumulated gradients. By unfolding T updates, we obtain

(x^T, {v^i}_{i=1}^T) = L^{ξ^{T−1}}_{f,η} ∘ L^{ξ^{T−2}}_{f,η} ∘ ⋯ ∘ L^{ξ^0}_{f,η}(x⁰)   (15)

as the derived neural network. Similarly, we can also generalize the stochastic Langevin updates L^ξ_{f,η} to a low-dimensional latent space by introducing g_v(∇_x f(x), x) and g_x(v′) correspondingly.

One of the major advantages of the proposed distribution parametrization is that its density value is also tractable, leading to tractable entropy estimation in (8) and (9). In particular, we have the following:

Theorem 4 (Density value evaluation)  If (x⁰, v⁰) ∼ q⁰_θ(x, v), then after T vanilla HMC steps (10),

q^T(x^T, v^T) = q⁰_θ(x⁰, v⁰).   (16)

For (x^T, v^T) from the generalized leapfrog steps (13), we have

q^T(x^T, v^T) = q⁰_θ(x⁰, v⁰) ∏_{t=1}^T (Σ_x(x_t) Σ_v(v_t))^{−1},   (17)

where Σ_v(v_t) = |det(diag(exp(2 S_v(∇_x f(x_t), x_t))))| and Σ_x(x_t) = |det(diag(exp(S_x(v_t^{1/2}))))|.   (18)

For (x^T, {v^i}_{i=1}^T) from the Langevin dynamics (14) with (x⁰, {ξ^i}_{i=0}^{T−1}) ∼ q⁰_θ(x, ξ) ∏_{i=1}^{T−1} q_{θ_i}(ξ^i), we have

q^T(x^T, {v^i}_{i=1}^T) = q⁰_θ(x⁰, ξ⁰) ∏_{i=1}^{T−1} q_{θ_i}(ξ^i).   (19)

The proof of Theorem 4 can be found in Appendix A.

The proposed dynamics embedding satisfies all three requirements: it defines a flexible family of distributions with computable entropy, and couples the learning of the dual sampler with the primal model, leading to memory and sample efficient learning algorithms, as we introduce in the next section.
3.3 Coupled Model and Sampler Learning

By plugging the T-step Hamiltonian dynamics embedding (10) into the primal-dual MLE of the augmented model (8) and applying the density value evaluation (16), we obtain the proposed optimization, which learns the primal potential f and the dual sampler adversarially:

max_{f∈F} min_Θ ℓ(f, Θ) := Ê_D[f] − E_{(x⁰,v⁰)∼q⁰_θ(x,v)}[f(x^T) − ½ ‖v^T‖²] − H(q⁰_θ).   (20)

Here Θ denotes the learnable components in the dynamics embedding, e.g., the initialization q⁰_θ, the stepsize η in the HMC/Langevin updates, and the adaptive part (S_v, S_x, g_v, g_x) in the generalized HMC. The parametrization of the initial distribution is discussed in Appendix C. Compared to the optimization in GANs (Goodfellow et al., 2014; Arjovsky et al., 2017; Dai et al., 2017), besides the reversal of min-max in (20), the major difference is that our "generator" (the dual sampler) shares parameters with the "discriminator" (the primal potential function). In our formulation, the updates of the potential function automatically push the generator toward the target distribution, thus accelerating learning efficiency. Meanwhile, the tunable parameters in the dynamics embedding are learned adversarially, further promoting the efficiency of the dual sampler.

Algorithm 1 MLE via Adversarial Dynamics Embedding (ADE)
1: Initialize Θ¹ randomly, set the number of steps T.
2: for iteration k = 1, . . . , K do
3:   Sample mini-batch {x_i}_{i=1}^m from dataset D and {(x_i⁰, v_i⁰)}_{i=1}^m from q⁰_θ(x, v).
4:   for iteration t = 1, . . . , T do
5:     Compute (x^t, v^t) = L(x^{t−1}, v^{t−1}) for each pair of {(x_i⁰, v_i⁰)}_{i=1}^m.
6:   end for
7:   [Learning the sampler] Θ^{k+1} = Θ^k − γ_k ∇̂_Θ ℓ(f_k; Θ^k)
8:   [Estimating the exponential family] f_{k+1} = f_k + γ_k ∇̂_f ℓ(f_k; Θ^k).
9: end for
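The coupled max-min updates can be sketched on a deliberately simplified 1-D instance; everything below is our illustration, not the paper's implementation: we assume a Gaussian family f_μ(x) = −(x − μ)²/2, a sampler built from a learnable Gaussian initialization x⁰ ∼ N(θ, 1) pushed through T Langevin-style steps, and we drop the momentum and entropy terms (constant in this scalar parametrization). Because the unrolled dynamics are linear in (μ, θ), the back-propagation-through-time Jacobians are available analytically:

```python
# Toy sketch of Algorithm 1 (our simplification): primal ascent on mu,
# dual descent on the sampler parameter theta, gradients taken THROUGH the
# unrolled sampler (analytic BPTT for this linear toy dynamics).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=500)      # dataset D, true mean 3.0

mu = 0.0        # primal model: f_mu(x) = -(x - mu)^2 / 2, grad_x f = mu - x
theta = 0.0     # dual sampler init: x^0 ~ N(theta, 1)
eta, T, lr = 0.2, 10, 0.1
c = 1.0 - eta / 2.0                         # per-step contraction toward mu

for k in range(400):
    x = theta + rng.normal(size=500)        # x^0 ~ q^0_theta
    for _ in range(T):                      # embedded Langevin-style sampler
        x = x + 0.5 * eta * (mu - x) + np.sqrt(eta) * rng.normal(size=500)
    J_mu, J_th = 1.0 - c**T, c**T           # d x^T/d mu, d x^T/d theta (exact here)
    # primal step: ascend l(f, Theta), differentiating through the sampler
    g_mu = np.mean(data - mu) - np.mean((x - mu) * (1.0 - J_mu))
    # dual step: the sampler chases the current model
    g_th = np.mean((mu - x) * J_th)
    mu += lr * g_mu
    theta += lr * g_th

print(mu)   # converges near the data mean 3.0
```

The shared role of f in both players is visible above: the sampler update uses ∇_x f through the unrolled steps, so improving the model automatically improves the sampler, mirroring the discussion of (20).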
These benefits will be empirically demonstrated in Section 5.

A similar optimization can be derived for the generalized HMC (13) with density (17). For the T-step stochastic Langevin dynamics embedding (14), we apply the density value (19) to (9), which also leads to a max-min optimization with multiple momenta.

We use stochastic gradient descent to estimate f for the exponential families as well as the parameters of the dynamics embedding Θ adversarially. Note that since the generated sample (x^T_f, v^T_f) depends on f, the gradient w.r.t. f should also take these variables into account via back-propagation through time (BPTT), i.e.,

∇_f ℓ(f; Θ) = Ê_D[∇_f f(x)] − E_{q⁰}[∇_f f(x^T)] − E_{q⁰}[∇_x f(x^T)^⊤ ∇_f x^T − (v^T)^⊤ ∇_f v^T].   (21)

We illustrate MLE via HMC adversarial dynamics embedding in Algorithm 1. The same technique can be applied to the alternative dynamics-embedding parametrized dual samplers in Appendix B. Considering the dynamics embedding as an adaptive sampler that automatically adapts to different models and datasets, the updates for Θ can be understood as learning to sample.

4 Related Work

Connections to other estimators  The primal-dual view of the MLE also allows us to establish connections between the proposed estimator, adversarial dynamics embedding (ADE), and existing approaches, including contrastive divergence (CD) (Hinton, 2002), pseudo-likelihood (PL) (Besag, 1975), conditional composite likelihood (CL) (Lindsay, 1988), score matching (SM) (Hyvärinen, 2005), minimum (diffusion) Stein kernel discrepancy estimator (DSKD) (Barp et al., 2019), non-local contrastive objectives (NLCO) (Vickrey et al., 2010), minimum probability flow (MPF) (Sohl-Dickstein et al., 2011), and noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010).
As summarized in Table 1, these existing estimators can be recast as special cases of ADE, by replacing the adaptive dual sampler with hand-designed samplers, which can lead to extra error and inferior solutions. Appendix D gives detailed derivations of the connections.

Table 1: (Fixed) dual samplers used in alternative estimators. We denote p_D as the empirical data distribution, x_{−i} as x without the i-th coordinate, p_n as the prefixed noise distribution, T_f(x′|x) as the HMC/Langevin transition kernel, T_{D,f}(x) as the Stein variational gradient descent, and A(x, x′) as the acceptance ratio.

Estimator | Dual sampler q(x)
CD    | ∫ ∏_{i=1}^T T_f(x^i|x^{i−1}) A(x^i, x^{i−1}) p_D(x⁰) dx⁰ ⋯ dx^{T−1}
PL    | (1/d) Σ_{i=1}^d p_f(x_i|x_{−i}) p_D(x_{−i})
CL    | (1/m) Σ_{i=1}^m p_f(x_{A_i}|x_{−A_i}) p_D(x_{−A_i}),  with ∪_{i=1}^m A_i = [d] and A_i ∩ A_j = ∅
SM    | ∫ T_f(x′|x) p_D(x) dx
DSKD  | x′ = T_{D,f}(x)
NLCO  | Σ_{i=1}^m ∫ p_{(f,i)}(x) p(S_i|x′) p_D(x′) dx′,  p_{(f,i)}(x) = exp(f(x))/Z_i(f), x ∈ S_i
MPF   | ∫ T_f(x′|x) exp(½ (f(x′) − f(x))) p_D(x) dx
NCE   | q(x) = ½ p_D(x) + ½ p_n(x),  with ratio exp(f(x))/(exp(f(x)) + p_n(x))

Exploiting deep models for energy-based model estimation has been investigated in Kim and Bengio (2016); Dai et al. (2017); Liu and Wang (2017); Dai et al. (2019). However, the parametrization of the dual sampler should be both flexible and tractable to achieve better performance. Existing work is limited in one aspect or another. Kim and Bengio (2016) parameterized the sampler via a deep directed graphical model, whose approximation ability is restrictive and whose entropy is intractable. Dai et al. (2017) proposed algorithms relying either on a heuristic approximation or a lower bound of the entropy, and requiring learning an extra auxiliary component besides the dual sampler. Dai et al.
(2019) applied the Fenchel dual representation twice to reformulate the entropy term, but the\nalgorithm requires knowing a proposal distribution with the same support, which is impractical\nfor high-dimensional data. By contrast, ADE achieves both suf\ufb01cient \ufb02exibility and tractability by\nexploiting the augmented model and a novel parametrization within the primal-dual view.\nLearning to sample ADE also shares some similarity with meta learning for sampling (Levy et al.,\n2018; Feng et al., 2017; Song et al., 2017; Gong et al., 2019), where the sampler is parametrized\nvia a neural network and learned through certain objectives. The most signi\ufb01cant difference lies\nin the ultimate goal: we focus on exponential family model estimation, where the learned sampler\nassists with this objective. By contrast, learning to sample techniques target on a sampler for\na \ufb01xed model. This fundamentally distinguishes ADE from methods that only learn samplers.\nMoreover, ADE exploits an augmented model that yields tractable entropy estimation, which has not\nbeen fully investigated in previous literature.\n\n5 Experiments\nIn this section, we test ADE on several synthetic datasets in Section 5.1 and real-world image datasets\nin Section 5.2. The details of each experiment setting can be found in Appendix F.\n\n5.1 Synthetic experiments\nWe compare ADE with SM, CD, and primal-dual MLE with the normalizing planar \ufb02ow (Rezende\nand Mohamed, 2015) sampler (NF) to investigate the claimed bene\ufb01ts. SM, CD and primal-dual\nwith NF can be viewed as special cases of our method, with either a \ufb01xed sampler or restricted\nparametrized q\u2713. Thus, this also serves as an ablation study of ADE to verify the signi\ufb01cance of\nits different subcomponents. 
We keep the model sizes the same in NF and ADE (10 planar layers). Then we perform 5 stochastic Langevin steps to obtain the final samples x^T, with standard Gaussian noise in each step and without incurring extra memory cost. For fairness, we conduct CD with 15 steps. This setup is preferable to CD with an extra acceptance-rejection step. We emphasize that, in comparison to SM and CD, ADE learns the sampler and exploits the gradients through the sampler. In comparison to primal-dual with NF, dynamics embedding achieves more flexibility without introducing extra parameters. Complete experiment details are given in Appendix F.1.

In Figure 1, we visualize the learned distribution using both the learned dual sampler and the unnormalized exponential model on several synthetic datasets. Overall, the sampler almost perfectly recovers the distribution, and the learned f captures the landscape of the distribution. We also plot the convergence behavior in Figure 2. We observe that the samples converge smoothly to the true data distribution. As the learned sampler depends on f, this figure also indirectly suggests good convergence behavior for f. More results for the learned models can be found in Figure 5 in Appendix G.

A quantitative comparison of the samplers in terms of the MMD (Gretton et al., 2012) is given in Table 2.
To compute the MMD, for NF and ADE we use 1,000 samples from their samplers with a Gaussian kernel. The kernel bandwidth is chosen using the median trick (Dai et al., 2016). For SM, since no such sampler is available, we use vanilla HMC to draw samples from the learned model f, and use them to estimate the MMD as in Dai et al. (2019). As we can see from Table 2, ADE obtains the best MMD in most cases, which demonstrates the flexibility of dynamics embedding compared to normalizing flow, and the effectiveness of adversarial training compared to SM and CD.

Table 2: Comparison on synthetic data using maximum mean discrepancy (MMD ×1e−3), reporting SM, NF, CD-15, and ADE on the 2spirals, Banana, circles, cos, Cosine, Funnel, line, moons, Multiring, pinwheel, Ring, Spiral, Uniform, and swissroll datasets.

Figure 1: The learned samplers for different synthetic datasets ((a) 2spirals, (b) Cosine, (c) moons, (d) Multiring, (e) pinwheel, (f) Spiral) are illustrated in the first row, where × denotes training data and • denotes the ADE samples. The learned potential functions f are illustrated in the second row.

Figure 2: Convergence behavior of the sampler on the moons, Multiring, and pinwheel synthetic datasets.

We also investigate the parameter recovery of ADE on multivariate Gaussians of different dimensions, where the potential functions are known. The empirical results can be found in Table 5 in Appendix G. In this simple task, SM is provably consistent and achieves the same estimator as the MLE (Hyvärinen, 2005).
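The MMD criterion used above takes only a few lines to compute; the following generic sketch (a simple biased V-statistic with a Gaussian kernel and median-heuristic bandwidth, using stand-in Gaussian samples since the synthetic datasets are not reproduced here) illustrates how a well-matched sampler scores lower than a mismatched one:

```python
# Illustration of the MMD comparison: Gaussian kernel, bandwidth set by the
# median trick on pooled pairwise squared distances. 1-D stand-in data only.
import numpy as np

def mmd2(X, Y):
    """Biased V-statistic estimate of squared MMD between 1-D samples X, Y."""
    Z = np.concatenate([X, Y])[:, None]
    d2 = (Z - Z.T) ** 2                    # pairwise squared distances
    sigma2 = np.median(d2[d2 > 0])         # median-heuristic bandwidth
    K = np.exp(-d2 / sigma2)
    n = len(X)
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()

rng = np.random.default_rng(0)
X = rng.normal(0, 1, 500)                  # stand-in for data samples
Y_good = rng.normal(0, 1, 500)             # a sampler matching the data
Y_bad = rng.normal(2, 1, 500)              # a poorly matched sampler
print(mmd2(X, Y_good) < mmd2(X, Y_bad))    # True: lower MMD = better fit
```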
The objective of ADE can be non-convex due to the learned sampler parametrization, so it loses these theoretical guarantees and incurs extra cost. Nevertheless, ADE still achieves comparable performance.

5.2 Real-world Image Datasets
We apply ADE to MNIST and CIFAR-10 data. In both cases, we use a CNN architecture for the discriminator, following Miyato et al. (2018), with spectral normalization added to the discriminator layers. For the discriminator in the CIFAR-10 experiments, we replace all downsampling operations with average pooling, as in Du and Mordatch (2019). We parametrize the initial distribution p0(x, v) with a deep Gaussian latent variable model (Deep LVM), specified in Appendix C. The output sample is clipped to [0, 1] after each HMC step and after the Deep LVM initialization. The detailed architectures and experimental configurations are described in Appendix F.2.

Table 3: Inception scores of different models on CIFAR-10 (unconditional).

Model                                                  Inception Score
WGAN-GP (Gulrajani et al., 2017)                                  6.50
Spectral GAN (Miyato et al., 2018)                                7.42
Langevin PCD (Du and Mordatch, 2019)                              6.02
Langevin PCD (10 ensemble) (Du and Mordatch, 2019)                6.78
ADE: Deep LVM init w/o HMC                                        7.26
ADE: Deep LVM init w/ HMC                                         7.55

Figure 3: Generated images on MNIST and CIFAR-10, and a comparison between the energies of generated samples and real images: (a) samples on MNIST, (b) histogram on MNIST, (c) samples on CIFAR-10, (d) histogram on CIFAR-10. The blue histogram shows the distribution of f(x) on generated samples, and the orange histogram shows f(x) on test samples. The learned potential function f(x) matches the empirical dataset well.

We report the inception scores in Table 3. For ADE, we train with the Deep LVM as the initial q0θ, with and without HMC steps, as an ablation study.
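The per-step clipping used in the image experiments can be sketched as a leapfrog move on the augmented state (x, v) followed by a projection of x onto [0, 1]. This is a sketch under stated assumptions: the potential gradient, step size, and number of leapfrog steps are illustrative placeholders, and the acceptance step is omitted.

```python
import numpy as np

def hmc_step_clipped(x, v, grad_f, step_size=0.05, n_leapfrog=3):
    """One HMC-style move on the augmented state (x, v): leapfrog
    integration for the potential f, then clip x to [0, 1] to keep the
    samples in the valid pixel range."""
    v = v + 0.5 * step_size * grad_f(x)          # half-step on momentum
    for i in range(n_leapfrog):
        x = x + step_size * v                    # full step on position
        if i < n_leapfrog - 1:
            v = v + step_size * grad_f(x)        # full step on momentum
    v = v + 0.5 * step_size * grad_f(x)          # final half-step
    x = np.clip(x, 0.0, 1.0)                     # project onto pixel range
    return x, v
```

The same clipping is applied to the output of the Deep LVM initialization, so every intermediate sample stays a valid image.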
The HMC embedding greatly improves the performance over samples generated by the initial q0θ alone. The proposed ADE not only achieves better performance than the fixed Langevin PCD for energy-based models reported in Du and Mordatch (2019), but also enables the generator to outperform the Spectral GAN.

We show some of the generated images in Figure 3(a) and (c); additional sampled images can be found in Figures 6 and 7 in Appendix G. We also plot the (unnormalized) potential distribution of the generated samples and of the real images for MNIST and CIFAR-10 (using 1,000 data points each) in Figure 3(b) and (d). The energy distributions of the generated and real images overlap significantly, demonstrating that the obtained energy functions have successfully learned the desired distributions.

Since ADE learns an energy-based model, the learned model and sampler can also be used for image completion. To further illustrate the versatility of ADE, we provide several image completions on MNIST in Figure 4. Specifically, we estimate the model with ADE on fully observed images. For the input images, we mask the lower half with uniform noise. To complete the corrupted images, we run the learned dual sampler to update the lower half of each image while keeping the upper half fixed. We visualize the output of each of the 20 HMC runs in Figure 4. Further details are given in Appendix F.2.

Figure 4: Image completion with the ADE-learned model and sampler on MNIST.

6 Conclusion

We proposed Adversarial Dynamics Embedding (ADE) to efficiently perform MLE for general exponential families. In particular, by utilizing the primal-dual formulation of the MLE for an augmented distribution with auxiliary kinetic variables, we incorporate the parametrization of the dual sampler into the estimation process in a fully differentiable way.
This approach allows for shared parameters between the primal and dual, achieving better estimation quality and inference efficiency. We also established the connection between ADE and existing estimators. Our empirical results on both synthetic and real data illustrate the advantages of the proposed approach.

Acknowledgments

We thank David Duvenaud, Arnaud Doucet, George Tucker and the Google Brain team, as well as the anonymous reviewers, for their insightful comments and suggestions. L.S. was supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351, and CAREER IIS-1350983.

References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In International Conference on Machine Learning, 2017.

Alessandro Barp, Francois-Xavier Briol, Andrew B. Duncan, Mark Girolami, and Lester Mackey. Minimum Stein discrepancy estimators. arXiv preprint arXiv:1906.08283, 2019.

Dimitri Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.

Julian Besag. Statistical analysis of non-lattice data. The Statistician, 24:179–195, 1975.

Christos Boutsidis, Petros Drineas, Prabhanjan Kambadur, Eugenia-Maria Kontopoulou, and Anastasios Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra and its Applications, 533:95–117, 2017.

Lawrence D. Brown. Fundamentals of Statistical Exponential Families, volume 9 of Lecture Notes-Monograph Series. Institute of Mathematical Statistics, Hayward, CA, 1986.

Anthony L. Caterini, Arnaud Doucet, and Dino Sejdinovic. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, 2018.

Bo Dai, Niao He, Hanjun Dai, and Le Song.
Provable Bayesian inference via particle mirror descent. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 985–994, 2016.

Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao, and Le Song. Coupled variational Bayes via optimization embedding. In Advances in Neural Information Processing Systems, 2018.

Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel exponential family estimation via doubly dual embedding. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 2321–2330, 2019.

Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. In International Conference on Learning Representations, 2017.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized Stein variational gradient descent. In Conference on Uncertainty in Artificial Intelligence, 2017.

Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta-learning for stochastic gradient MCMC. In International Conference on Learning Representations, 2019.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models.
In International Conference on Learning Representations, 2019.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.

Insu Han, Dmitry Malioutov, and Jinwoo Shin. Large-scale log-determinant computation through stochastic Chebyshev expansions. In International Conference on Machine Learning, pages 908–917, 2015.

Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Aapo Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Aapo Hyvärinen. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.

Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their Applications. American Mathematical Society, Providence, RI, 1980.

Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1×1 convolutions. In Advances in Neural Information Processing Systems, 2018.

Diederik P. Kingma and Yann LeCun.
Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, 2010.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, volume 18, pages 282–289, San Francisco, CA, 2001. Morgan Kaufmann.

Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

Daniel Levy, Matthew D. Hoffman, and Jascha Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. In International Conference on Learning Representations, 2018.

Bruce G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):221–239, 1988.

Qiang Liu and Dilin Wang. Learning deep energy models: Contrastive divergence vs. amortized MLE. arXiv preprint arXiv:1707.00797, 2017.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Radford M. Neal. MCMC using Hamiltonian dynamics.
Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.

R. Tyrrell Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton, NJ, 1970.

Jascha Sohl-Dickstein, Peter Battaglino, and Michael R. DeWeese. Minimum probability flow learning. In Proceedings of the 28th International Conference on Machine Learning, pages 905–912, 2011.

Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.

Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(1):1830–1888, 2017.

Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the International Conference on Machine Learning, 2008.

Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040. ACM, 2009.

David Vickrey, Cliff Chiung-Yu Lin, and Daphne Koller. Non-local contrastive objectives. In Proceedings of the International Conference on Machine Learning, 2010.

Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference.
Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Li Wenliang, Dougal Sutherland, Heiko Strathmann, and Arthur Gretton. Learning deep kernels for exponential family densities. In International Conference on Machine Learning, 2019.

Ying Nian Wu, Jianwen Xie, Yang Lu, and Song-Chun Zhu. Sparse and deep generalizations of the FRAME model. Annals of Mathematical Sciences and Applications, 3(1):211–254, 2018.

Linfeng Zhang, Weinan E, and Lei Wang. Monge-Ampère flow for generative modeling. arXiv preprint arXiv:1809.10188, 2018.

Song Chun Zhu and David Mumford. GRADE: Gibbs reaction and diffusion equations. In Sixth International Conference on Computer Vision, pages 847–854. IEEE, 1998.