{"title": "On Fenchel Mini-Max Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10427, "page_last": 10439, "abstract": "Inference, estimation, sampling and likelihood evaluation are four primary goals of probabilistic modeling. Practical considerations often force modeling approaches to make compromises between these objectives. We present a novel probabilistic learning framework, called Fenchel Mini-Max Learning (FML), that accommodates all four desiderata in a flexible and scalable manner. Our derivation is rooted in classical maximum likelihood estimation, and it overcomes a longstanding challenge that prevents unbiased estimation of unnormalized statistical models. By reformulating MLE as a mini-max game, FML enjoys an unbiased training objective that (i) does not explicitly involve the intractable normalizing constant and (ii) is directly amendable to stochastic gradient descent optimization. To demonstrate the utility of the proposed approach, we consider learning unnormalized statistical models, nonparametric density estimation and training generative models, with encouraging empirical results presented.", "full_text": "On Fenchel Mini-Max Learning\n\nChenyang Tao1, Liqun Chen1, Shuyang Dai1, Junya Chen1,2, Ke Bai1, Dong Wang1,\n\nJianfeng Feng3, Wenlian Lu2, Georgiy Bobashev4, Lawrence Carin1\n1 Electrical & Computer Engineering, Duke University, Durham, NC, USA\n\n2 School of Mathematical Sciences, Fudan University, Shanghai, China\n\n3 ISTBI, Fudan University, Shanghai, China\n\n4 RTI International, Research Triangle Park, NC, USA\n\n{chenyang.tao, lcarin}@duke.edu\n\nAbstract\n\nInference, estimation, sampling and likelihood evaluation are four primary goals of\nprobabilistic modeling. Practical considerations often force modeling approaches\nto make compromises between these objectives. 
We present a novel probabilistic learning framework, called Fenchel Mini-Max Learning (FML), that accommodates all four desiderata in a flexible and scalable manner. Our derivation is rooted in classical maximum likelihood estimation, and it overcomes a longstanding challenge that prevents unbiased estimation of unnormalized statistical models. By reformulating MLE as a mini-max game, FML enjoys an unbiased training objective that (i) does not explicitly involve the intractable normalizing constant and (ii) is directly amenable to stochastic gradient descent optimization. To demonstrate the utility of the proposed approach, we consider learning unnormalized statistical models, nonparametric density estimation and training generative models, with encouraging empirical results presented.\n\n1 Introduction\n\nWhen learning a probabilistic model, we are typically interested in one or more of the following operations:\n\n\u2022 Inference: Represent observation x \u2208 R^p with an informative feature vector z \u2208 R^d, ideally with d \u226a p; z is often a latent variable in a model of x.\n\u2022 Estimation: Given a statistical model p\u03b8(x) for data x, learn model parameters \u03b8 that best describe the observed (training) data.\n\u2022 Sampling: Efficiently synthesize samples from p\u03b8(x) given learned \u03b8, with drawn x \u223c p\u03b8(x) faithful to the training data.\n\u2022 Likelihood evaluation: With learned \u03b8 for model p\u03b8(x), calculate the likelihood of new x.\n\nOne often makes trade-offs between these goals, as a result of practical considerations (e.g., computational efficiency); see Table S1 in the Supplementary Material (SUPP) for a brief summary. We are particularly interested in the case for which the model \u02dcp\u03b8(x) is unnormalized; i.e., \u222b \u02dcp\u03b8(x) dx = Z(\u03b8) \u2260 1, with Z(\u03b8) difficult to compute [49].\n\nMaximum likelihood estimation (MLE) is widely employed in the training of probabilistic models [11, 22], in which the expected log-likelihood log p\u03b8(x) is optimized wrt \u03b8, based on the training examples. For unnormalized model density function \u02dcp\u03b8(x) = exp(\u2212\u03c8\u03b8(x)), where \u03c8\u03b8(x) is the potential function and \u03b8 are the model parameters, the likelihood is p\u03b8(x) = \u02dcp\u03b8(x)/Z(\u03b8). The partition function Z(\u03b8) is typically not represented in closed-form when considering a flexible choice of \u03c8\u03b8(x), such as a deep neural network. This makes the learning of unnormalized models particularly challenging, as the gradient computation requires an evaluation of the integral. In practice, this integral is approximated with averaging over a finite number of Monte Carlo samples. However, using the existing finite-sample Monte Carlo estimate of Z\u03b8 will lead to a biased approximation of the log-likelihood objective (see Section 2.1). This issue is aggravated as the dimensionality of the problem grows.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nMany studies have been devoted to addressing the challenge of estimation with unnormalized statistical models. Geyer [23, 24] proposed Markov chain Monte Carlo MLE (MCMC-MLE), which employs a likelihood-ratio trick. Contrastive divergence (CD) [33] directly estimates the gradient by taking MCMC samples. Hyv\u00e4rinen [36] proposed score matching (SM) to estimate an unnormalized density, bypassing the need to take MCMC samples. 
Noise contrastive estimation (NCE) learns\nthe parameters for unnormalized statistical models via discriminating empirical data against noise\nsamples [28, 29]. This concept can be further generalized under the Bregman divergence [27]. More\nrecently, dynamic dual embedding (DDE) explored a primal-dual view of MLE [15, 16], while Stein\nimplicit learning (SIL) [46, 41] and kernel score estimation [60] match the landscape of the potential\nwith that of kernel-smoothed empirical observations. However, these approaches are susceptible to\npoor scalability (SM, MCMC-MLE), biased estimation (CD), and computational (DDE, SIL) and\nstatistical (NCE) ef\ufb01ciency issues.\nConcerning design of models that yield realistic drawn samples, considerable recent focus has been\nplaced on implicit generative models [48], which include the generative adversarial network (GAN)\n[25, 51, 4, 61], the generative moment matching network (GMMN) [42, 19], implicit MLE (IMLE)\n[39], among others. In this setting one typically doesn\u2019t have an explicit \u02dcp\u03b8(x) or p\u03b8(x), and the goal\nis to build a model of the data generation process directly. Consequently, such schemes typically have\ndif\ufb01culty addressing the aforementioned likelihood goal. Additionally, such models often involve\ntraining strategies that are challenging due to instabilities or expressiveness, such as adversarial\nestimation (GAN) and kernelized formulation (GMMN).\nFor these reasons, likelihood-based models remain popular. Among them variational inference\n(VI) [6] and generative \ufb02ows (FLOW) [56, 53] are two of the most promising directions, and have\nundergone rapid development recently [66]. Despite this progress, challenges remain. The variational\nbound employed by VI is often not suf\ufb01ciently tight in practice (undermining the likelihood goal),\nand there exist model identi\ufb01ability issues [62]. 
In FLOW a trade-off has to be made between the computational cost and model expressiveness.\n\nThis paper presents a novel strategy for MLE learning for unnormalized statistical models, that allows efficient parameter estimation and accurate likelihood approximation. Importantly, while competing solutions can only yield stochastic upper/lower bounds, our treatment allows unbiased estimation of log-likelihood and model parameters. Further, this setup can be used for effective sampling goals, and it has the ability to perform inference. This work makes the following contributions: (i) Derivation of a mini-max formulation of MLE, resulting in an unbiased log-likelihood estimator directly amenable to stochastic gradient descent (SGD) optimization, with convergence guarantees. (ii) Amortized likelihood estimation with deep neural networks, enabling direct likelihood prediction and feature extraction (inference). (iii) Development of a novel training scheme for latent-variable models, presenting a competitive alternative to VI. (iv) We show that our models compare favorably to existing alternatives in likelihood-based distribution learning, both in terms of model estimation and sample generation.\n\nFigure 1: Comparison of popular likelihood approximations: Monte-Carlo estimator (MC) (e.g., contrastive divergence (CD) [33]), Renyi [40], importance-weighted ELBO [10], and the proposed FML. Cheap approximations often lead to biased estimate of likelihood, a point FML seeks to fix. [Plot: \u201cLikelihood Approximations\u201d; horizontal axis x, vertical axis likelihood; curves p(x), FML, MC, Renyi, ELBO.]\n\n2 Fenchel Mini-Max Learning\n\n2.1 Preliminaries\n\nMaximum likelihood estimation Given a family of parameterized probability density functions {p\u03b8(x)}_{\u03b8\u2208\u0398} and a set of empirical observations {xi}_{i=1}^n, MLE seeks to identify the most probable model \u02c6\u03b8MLE via maximizing the expected model log-likelihood, i.e., \u02c6L(\u03b8) \u225c (1/n) \u2211_{i=1}^n log p\u03b8(xi). For flexible choices of p\u03b8(x), such as an unnormalized explicit-variable model p\u03b8(x) \u221d exp(\u2212\u03c8\u03b8(x)) or latent variable model of the form p\u03b8(x) = \u222b p\u03b8(x|z)p(z) dz, direct optimization wrt MLE loss is typically computationally infeasible. Instead, relatively inexpensive likelihood approximations are often used to derive surrogate objectives.\n\nVariational inference Consider a latent variable model p\u03b8(x, z) = p\u03b8(x|z)p(z). To avoid direct numerical estimation of p\u03b8(x), VI instead maximizes the variational lower bound to the marginal log-likelihood: ELBO(p\u03b8(x, z), q\u03b2(z|x)) = E_{q\u03b2(z|x)}[log(p\u03b8(x, z)/q\u03b2(z|x))], where q\u03b2(z|x) is an approximation to the true posterior p\u03b8(z|x). This bound tightens as q\u03b2(z|x) approaches the true posterior p\u03b8(z|x). For estimation, we seek parameters \u03b8 that maximize the ELBO, and the commensurately learned parameters \u03b2 are used in a subsequent inference task with new data. 
However, with such learning, samples drawn x \u223c p\u02c6\u03b8(x|z) with z \u223c p(z) may not be as close to the training data as desired [12].\n\nAdversarial distribution matching Adversarial learning [25, 4] exploits the fact that many discrepancy measures have a dual formulation D(pd, p\u03b8) = max_D {V_D(pd, p\u03b8; D)}, where V_D(pd, p\u03b8; D) is a variational objective that can be estimated with samples from the true distribution pd(x) and the model distribution p\u03b8(x), and D(x) is an auxiliary function commonly known as the critic (or discriminator). To match draws from p\u03b8(x) to the data (sampled implicitly from pd(x)) wrt D(pd, p\u03b8), one solves a mini-max game between the model p\u03b8(x) and critic D(x): p\u2217_\u03b8 = arg min_{p\u03b8} {max_D {V_D(pd, p\u03b8; D)}}. In adversarial distribution matching, draws from p\u03b8(x) are often modeled via a deterministic function G\u03b8(z) that transforms samples from a (simple) source distribution p(z) (e.g., Gaussian) to the (complex) target distribution. This practice bypasses the difficulties involved when specifying a flexible yet easy to sample likelihood. However, it makes difficult the goal of subsequent likelihood estimation and inference of the latent z for new data x.\n\nFenchel conjugacy Let f(t) be a proper convex, lower-semicontinuous function; then its convex conjugate function f\u2217(v) is defined as f\u2217(v) = sup_{t\u2208D(f)} {tv \u2212 f(t)} where D(f) denotes the domain of function f [34]. f\u2217 is also known as the Fenchel conjugate of f, which is again convex and lower-semicontinuous. The Fenchel conjugate pair (f, f\u2217) are dual to each other, in the sense that f\u2217\u2217 = f, i.e., f(t) = sup_{v\u2208D(f\u2217)} {vt \u2212 f\u2217(v)}. As a concrete example, (\u2212log(t), \u22121 \u2212 log(\u2212v)) gives such a pair, as we exploit in the next section.\n\nAlgorithm 1 Fenchel Mini-Max Learning\nEmpirical data distribution \u02c6pd = {xi}_{i=1}^n\nProposal q(x), learning rate schedule {\u03b7t}\nInitialize parameters \u03b8, b\nfor t = 1, 2, \u00b7\u00b7\u00b7 do\n  Sample {x_{t,j}}_{j=1}^m \u223c \u02c6pd(x), {x\u2032_{t,j}}_{j=1}^m \u223c q(x)\n  u_{t,j} = \u03c8\u03b8(x_{t,j}) + b\n  I_{t,j} = exp(\u03c8\u03b8(x_{t,j}) \u2212 \u03c8\u03b8(x\u2032_{t,j}) \u2212 log q(x\u2032_{t,j}))\n  J_t = \u2211_j {u_{t,j} + exp(\u2212u_{t,j}) I_{t,j}}\n  [\u03b8, b] = [\u03b8, b] \u2212 \u03b7_t \u2207_{[\u03b8,b]} J_t\n  % Update proposal q(x) if needed\nend for\n\nBiased finite sample Monte-Carlo for unnormalized statistical models For unnormalized statistical model \u02dcp\u03b8(x) = exp(\u2212\u03c8\u03b8(x)), the naive Monte-Carlo estimator for the log-likelihood is given by log \u02c6p\u03c8(x) = \u2212\u03c8\u03b8(x) \u2212 log \u02c6Z\u03b8, where \u02c6Z\u03b8 = (1/m) \u2211_{j=1}^m exp(\u2212\u03c8\u03b8(X\u2032_j)) is the finite-sample estimator for the normalizing constant Z\u03b8 = \u222b e^{\u2212\u03c8\u03b8(x\u2032)} dx\u2032, with {X\u2032_j} i.i.d. uniform samples on \u2126. Via the Jensen\u2019s inequality (i.e., E_X[log f(X)] \u2264 log(E_X[f(X)])), it is readily seen that E_{X\u2032_{1:m}}[log \u02c6Z\u03b8] \u2264 log(E_{X\u2032_{1:m}}[\u02c6Z\u03b8]) = log Z\u03b8, which implies the naive MC estimator gives an upper bound of the log-likelihood, i.e., E_{X\u2032_{1:m}}[log \u02c6p\u03c8(x)] \u2265 log p\u03c8(x). The inability to take infinite samples makes unbiased estimation of unnormalized statistical models a long-standing challenge posed to the statistical community, especially for high-dimensional problems [9].\n\n2.2 Mini-Max formulation of MLE for unnormalized statistical models\n\nFor unnormalized statistical model \u02dcp\u03b8(x) = exp(\u2212\u03c8\u03b8(x)), we rewrite model log-likelihood as\n\nlog p\u03b8(x) = log( e^{\u2212\u03c8\u03b8(x)} / \u222b e^{\u2212\u03c8\u03b8(x\u2032)} dx\u2032 ) = \u2212log( \u222b e^{\u03c8\u03b8(x)\u2212\u03c8\u03b8(x\u2032)} dx\u2032 ). (1)\n\nRecalling the Fenchel conjugate of \u2212log(t), we have \u2212log(t) = max_u {\u2212u \u2212 exp(\u2212u)t + 1}, and the optimal value of u is u\u2217_t = log(t). Plugging this into (1) yields the following expression\n\n\u2212log p\u03b8(x) = min_{u_x} { u_x + exp(\u2212u_x) \u222b e^{\u03c8\u03b8(x)\u2212\u03c8\u03b8(x\u2032)} dx\u2032 \u2212 1 }. (2)\n\nSince u\u2217_x = log( \u222b e^{\u03c8\u03b8(x)\u2212\u03c8\u03b8(x\u2032)} dx\u2032 ) = \u2212log p\u03b8(x), we have exp(\u2212u\u2217_x) = p\u03b8(x). Consequently, the auxiliary dual variable u is an estimate of the negative log-likelihood. The key insight here is that we have turned the numerical integration problem into an optimization problem. This may seem like a step backward at first sight, as we are still summing over the support and we have a dual variable to optimize. The payoff is that we can now sidestep the log term and estimate the log-likelihood in an unbiased manner using finite MC samples, a major step-up over existing estimators. 
As argued below and verified experimentally, this extra optimization can be executed efficiently and robustly. This implies we are able to more accurately estimate unnormalized statistical models at a comparable budget, without compromising training stability.\n\nDenote I(x; \u03c8\u03b8) = \u222b e^{\u03c8\u03b8(x)\u2212\u03c8\u03b8(x\u2032)} dx\u2032. To estimate I(x; \u03c8\u03b8) more efficiently, we may introduce a proposal distribution q(x) with tractable likelihood and leverage an importance weighted estimator: I(x; \u03c8\u03b8) = E_{X\u2032\u223cq}[exp(\u03c8\u03b8(x) \u2212 \u03c8\u03b8(X\u2032) \u2212 log q(X\u2032))]. We discuss the practical choice of proposal distribution in more detail in Section 2.4. Putting everything together, we have the following mini-max formulation of MLE for unnormalized statistical models:\n\n\u02c6\u03b8MLE = arg max_\u03b8 {\u2212min_u {\u2211_i J\u03b8(xi; ui, \u03c8)}}, (3)\n\nwhere J\u03b8(x; u, \u03c8) \u225c u + exp(\u2212u) I(x; \u03c8\u03b8).\n\nIn practice, we can model all {ui} with only one additional free parameter as u\u03b8(x) = \u03c8\u03b8(x) + b\u03b8, where b\u03b8 models the log-partition function, i.e., b\u03b8 \u225c log Z\u03b8; we make explicit here that u is a function of \u03b8, i.e., u\u03b8(x). Note that b\u03b8 is the log-partition parameter to be learned, which minimizes the objective if and only if it equals the true log-partition. Although model parameters \u03b8 are shared between u\u03b8(x; b\u03b8) and \u03c8\u03b8(x), they are fixed in the u-updates. Hence, when alternating between updating \u03b8 and u in (3), the update of u corresponds to refining the update of the log-partition function b\u03b8 for fixed \u03b8, followed by updating \u03b8 with b fixed; we have isolated learning the partition function (the min_u step) and the model parameters (the max_\u03b8 step) [footnote 1]. 
We call this new formulation Fenchel Mini-Max Learning (FML), and summarize its pseudocode in Algorithm 1. For complex distributions, we also optimize the proposal q(x) to enable efficient & robust learning with the importance weighted estimator.\n\nConsidering the form of J(x; u, \u03c8\u03b8), one may observe that the learning signal comes from contrasting data samples xi with a random draw X\u2032 under the current model potential \u03c8\u03b8(x) (e.g., the term \u03c8\u03b8(xi) \u2212 \u03c8\u03b8(X\u2032)). Figure 1 compares our FML to other popular likelihood approximation schemes. Unlike existing solutions, FML targets the exact likelihood without explicitly using a finite-sample estimator for the partition function. Instead, FML optimizes an objective where the untransformed integral directly appears, which leads to an unbiased estimator provided the minimization is solved accurately.\n\n2.3 Gradient analysis of FML\n\nTo further understand the workings of FML, we inspect the gradient of model parameters. In classical MLE learning, we have \u2207 log p\u03b8(x) = \u2207p\u03b8(x)/p\u03b8(x). That is to say, in MLE the gradient of the likelihood is normalized by the model evidence. A key observation is that, while \u2207p\u03b8(x) is difficult to compute, because of the partition function, we can easily acquire an unbiased gradient estimate of the inverse likelihood 1/p\u03b8(x) using Monte-Carlo samples,\n\n\u2207{1/p\u03b8(x)} = \u2207{\u222b exp(\u03c8\u03b8(x) \u2212 \u03c8\u03b8(x\u2032)) dx\u2032} = \u222b \u2207{exp(\u03c8\u03b8(x) \u2212 \u03c8\u03b8(x\u2032))} dx\u2032, (4)\n\nwhich only differs from \u2207 log p\u03b8(x) by a factor of negative inverse likelihood\n\n\u2207{1/p\u03b8(x)} = \u2212\u2207p\u03b8(x)/(p\u03b8(x))^2 = \u2212\u2207 log p\u03b8(x)/p\u03b8(x). (5)\n\nNow considering the gradient of FML, we have\n\n\u2207J\u03b8(x; \u02c6ux, \u03c8) = \u2212\u2207{exp(\u2212\u02c6ux) \u222b e^{\u03c8\u03b8(x)\u2212\u03c8\u03b8(x\u2032)} dx\u2032} = \u2212\u02c6p\u03b8(x) \u2207{1/p\u03b8(x)} = (\u02c6p\u03b8(x)/p\u03b8(x)) \u2207 log p\u03b8(x) \u2248 \u2207 log p\u03b8(x), (6)\n\nwhere \u02c6ux denotes an approximate solution to the Fenchel maximization game (2) and \u02c6p\u03b8 \u225c exp(\u2212\u02c6ux) is an approximation of the likelihood based on our previous analysis. We denote \u03be \u225c \u02c6p\u03b8(x)/p\u03b8(x), and refer to log \u03be as the approximation error. If this approximation \u02c6p\u03b8 is sufficiently accurate then \u03be \u2248 1, which implies the FML gradient is a good approximation to the gradient of true likelihood.\n\nWhen we model the auxiliary variable as u(x) = \u03c8\u03b8(x) + b, then the FML gradient \u2207J\u03b8(x; u, \u03c8) differs from \u2207 log p\u03b8(x) by a common multiplicative factor \u03be = exp(b \u2212 b\u03b8) for all x \u2208 \u2126. Next we show SGD is insensitive to this approximation error; FML still converges to the same solution of MLE even if \u03be deviates from 1 differently at each iteration.\n\n[Footnote 1] In practice, we find that instead of separated updates, simultaneous gradient descent of \u03b8 and b also works well.\n\n2.4 Choice of proposal distribution\n\nLike all importance-weighted estimators, the efficiency of FML critically depends on the choice of proposal q(x). A poor match between the proposal and integrand can lead to extremely high variance [52], which compromises learning. In order to keep the variance in check, a general guiding principle for choosing a good q(x) is to make it close to the data distribution pd. Note this practice differs from the optimal minimal variance proposal, which is proportional to the integrand. However, this choice does not require constantly updating the proposal to adapt to the current parameter, which brings both robustness and computational savings. To obtain such a static proposal matched to the data distribution, we can pre-train a parameterized tractable sampler q\u03c6(x) with empirical data samples by maximizing the empirical model log-likelihood \u2211_i log q\u03c6(xi), with \u03c6 parameterizing the proposal. Note that we only require the proposal q(x) to be similar to the data distribution, using a rough approximation to facilitate the learning of an unnormalized model that more accurately characterizes the data. The proposal does not necessarily need to capture every minute detail of the target distribution, as such simpler models are generally preferable for better computational efficiency, provided adequate approximation and coverage can be achieved. Popular choices of parameterized proposal include generative flows [53] and mixtures of Gaussians [44]. We leave a more detailed specification of our treatment to the Supplementary Material (SUPP).\n\n2.5 Convergence results\n\nIn modern machine learning, first order stochastic gradient descent (SGD) is a popular choice, and in many cases the only feasible approach, for large-scale problems. 
In the case of MLE, let h(\u03b8; \u03c9) be an unbiased stochastic gradient estimator for \u02c6L(\u03b8), i.e., E_{\u03c9\u223cp(\u03c9)}[h(\u03b8; \u03c9)] = \u2207\u02c6L(\u03b8). Here we have used \u03c9 \u223c p(\u03c9) to denote the source of randomness for h(\u03b8; \u03c9). SGD finds a solution by using the following iterative procedure \u03b8_{t+1} = \u03b8_t + \u03b7_t h(\u03b8_t; \u03c9_t), where {\u03b7t} is a pre-determined sequence commonly known as the learning-rate schedule and {\u03c9t} are iid draws from p(\u03c9). Then under common technical assumptions on h(\u03b8; \u03c9) and {\u03b7t}, if there exists a unique minimizer \u03b8\u2217 then the SGD solution \u02c6\u03b8SGD \u225c lim_{t\u2192\u221e} \u03b8t will converge to it [57].\n\nNow consider FML\u2019s naive stochastic gradient estimator \u02dch(\u03b8; \u03c9) = e^{\u2212u(X)} \u2207 exp(\u03c8\u03b8(X) \u2212 \u03c8\u03b8(X\u2032)), where X \u223c \u02c6pd, X\u2032 \u223c U(\u2126); the contrast \u03c8\u03b8(x) \u2212 \u03c8\u03b8(x\u2032) between real and synthetic data is evident. Based on the analysis from the last section, we have the decomposition \u02dch(\u03b8; \u03c9) = \u03be h(\u03b8; \u03c9), where h(\u03b8; \u03c9) is the unbiased stochastic gradient term and \u03be relates to the (unknown) approximation error. Using the same learning rate schedule, we are updating the model parameters with effective random step-sizes \u02dc\u03b7t \u225c \u03bet \u03b7t relative to SGD with MLE, where \u03bet depends on the current approximation error. We formalize this as the generalized SGD problem described below.\n\nProblem 2.1 (Generalized SGD). Let h(\u03b8; \u03c9), \u03c9 \u223c p(\u03c9) be an unbiased stochastic gradient estimator for objective f(\u03b8), let {\u03b7t > 0} be the fixed learning rate schedule, and let {\u03bet > 0} be the random perturbations to the learning rate. 
We want to solve for \u2207f(\u03b8) = 0 with the iterative scheme \u03b8_{t+1} = \u03b8_t + \u02dc\u03b7t h(\u03b8_t; \u03c9_t), where {\u03c9t} are iid draws and \u02dc\u03b7t = \u03b7t \u03bet is the randomized learning rate.\n\nProposition 2.2 (Generalized stochastic approximation). Under the standard regularity conditions listed in Assumption D.1 in the SUPP, we further assume \u2211_t E[\u02dc\u03b7t] = \u221e and \u2211_t E[\u02dc\u03b7t^2] < \u221e. Then \u03b8n \u2192 \u03b8\u2217 with probability 1 from any initial point \u03b80.\n\nRemark. This is a straightforward generalization of the Robbins-Monro theory. The original proof still applies by simply replacing expectation wrt the deterministic sequence {\u03b7t} with the randomized sequence {\u02dc\u03b7t}. Assumptions \u2211_t E[\u02dc\u03b7t] = \u221e and \u2211_t E[\u02dc\u03b7t^2] < \u221e can be satisfied by saying {log \u03bet} is bounded. The u-updates used in FML force {log \u03bet} to stay close to zero, thereby enforcing the boundedness condition. Although such assumptions are too strong for deep neural nets, empirically FML converges to very reasonable solutions. We discuss more general theories in the SUPP.\n\nCorollary 2.3. Under the assumptions of Prop. 2.2, FML converges to \u02c6\u03b8MLE with SGD.\n\n3 FML for Latent Variable Models and Sampling Distributions\n\n3.1 Likelihood-free modeling & latent variable models\n\nOne can reformulate generative adversarial networks (GANs) [25, 30] into a latent-variable model, by introducing arbitrarily small Gaussian perturbations. Specifically, X\u2032 = G\u03b8(Z) + \u03c3\u03b6, where \u03b6 \u223c N(0, 1) is standard Gaussian, and \u03c3 is the noise standard deviation. This gives the joint likelihood p^\u2020_\u03b8(x, z) = N(G\u03b8(z), \u03c3^2)p(z). It is well known the marginal likelihood p^\u2020_\u03b8(x) converges to p\u03b8(x) as \u03c3 goes to zero [4]. 
As such, we can always use a latent-variable model to approximate the likelihood of an implicitly defined distribution p\u03b8(x), which is easy to sample from. It also allows us to associate generator parameters \u03b8 to likelihood-based losses.\n\n3.2 Fenchel reformulation of marginal likelihood\n\nReplacing the log term with its Fenchel dual, we have the following alternative expression for the marginal likelihood: log p\u03b8(x) = log(\u222b p\u03b8(x, z) dz) = min_{u_x} {u_x + exp(\u2212u_x) I(x; p\u03b8) \u2212 1}, where I(x; p\u03b8) \u225c \u222b p\u03b8(x, z) dz. Note that, different from the last section, here estimate \u02c6ux provides a direct approximation to the marginal likelihood log p\u03b8(x) rather than its negative. By analogy with variational inference (VI), an approximate posterior q\u03b2(z|x) can also be introduced, assuming the role of proposal distribution for the integral term. Model parameter \u03b8 can be learned via the following mini-max setup\n\nmax_\u03b8 {min_u {J(u; p\u03b8, q\u03b2)}}, J(u; p\u03b8, q\u03b2) \u225c E_{X\u223cpd}[u_X + exp(\u2212u_X) I(X; p\u03b8, q\u03b2)], (7)\n\nwhere I(x; p\u03b8, q\u03b2) \u225c E_{q\u03b2}[p\u03b8(x, Z)/q\u03b2(Z|x)] is the importance weighted estimator with proposal q\u03b2(z|x), and u \u2208 R^n is a vector modeling the marginal likelihood log p\u03b8(xi) for each training example xi with ui. A good proposal encodes the association between x and z (this is expanded upon in the SUPP); as such, we also refer to q\u03b2 as the inference distribution. We will return to the optimization of inference parameter \u03b2 in Section 3.3. Our analysis from Sections 2.3 to 2.5 also applies in the latent variable case and is not repeated here. To further stabilize the training, annealed training can be considered, replacing integrand p\u03b8(x, z)/q\u03b2(z|x) with p^{\u03c4t}_\u03b8(x|z)p(z)/q\u03b2(z|x), as in Neal [49]. Here {\u03c4t} is the annealing schedule, monotonically increasing wrt time t going from \u03c40 = 0 to \u03c4\u221e = 1.\n\n3.3 Optimization of inference distribution\n\nThe choice of proposal distribution q\u03b2(z|x) is important for the statistical efficiency of FML. To address this issue, we propose to encourage a more informative proposal via regularizing the vanilla FML objective. In particular, we consider regularizing with the mutual information I_p \u225c E_p[log(p(X, Z)/(p(X)p(Z)))]. Let us denote our model distribution p\u03b8(x, z) as \u03c1 and the approximate joint q\u03b2(x, z) \u225c q\u03b2(z|x)pd(x) as q, and the respective mutual information are denoted as I_\u03c1 and I_q. It is necessary to regularize both I_\u03c1 and I_q, since I_q directly encourages a more informative proposal, while the \u201cinformativeness\u201d is upper bounded by I_\u03c1 [2]. In other words, this encourages the proposal to approach the posterior.\n\nDirect estimation of I_\u03c1 and I_q is infeasible, due to the absence of analytical expressions for the marginals p\u03b8(x) and q\u03b2(z). Instead we use their respective lower bounds [5, 2] D_\u03c1(\u03b8, \u03b2) \u225c E_{(X,Z)\u223cp\u03b8}[log q\u03b2(Z|X)] and D_q(\u03b2|\u03b8) \u225c E_{(X,Z)\u223cq\u03b2}[log p\u03b8(X|Z)] as our regularizer (see the SUPP for details). Note these bounds become tight as the proposal q\u03b2(z|x) approaches the true posterior p\u03b8(z|x) (Lemma 5.1, Chen et al. [13]). We then solve the following regularized mini-max game\n\nmax_{\u03b8,\u03b2} {min_u {J(u, \u03b8, \u03b2)} \u2212 \u03bbq D_q(\u03b2|\u03b8) \u2212 \u03bb\u03c1 D_\u03c1(\u03b8, \u03b2)}. (8)\n\nHere the nonnegative {\u03bb\u03c1, \u03bbq} are the regularization strengths, and we have used notation D_q(\u03b2|\u03b8) to highlight the fact this term does not contribute to the gradient of model parameter \u03b8. 
Solving (8) using standard simultaneous gradient descent/ascent as in standard GAN training is observed to be efficient and stable in practice.\n\n3.4 Amortized inference of marginal likelihoods\n\nUnlike the explicit likelihood case from Section 2, the marginal likelihoods {log p\u03b8(xi)} are no longer directly related by an explicit potential function \u03c8\u03b8(x). Individually updating ui for each sample xi is computationally inefficient: (i) it does not scale to large datasets; (ii) parameters are not shared across samples; (iii) it does not permit efficient prediction of the likelihood at test time for a new observation xnew. Motivated by its success in variational inference, we propose to employ the amortization technique to tackle the above issues [14]. When optimizing some objective function with distinct parameters \u03b6i associated with each training example xi, e.g., L(\u03b8, \u03b6) = \u2211_i \u2113\u03b8(xi; \u03b6i), amortized learning replaces these parameters with a parameterized function \u03b6\u03c6(x) with \u03c6 as the amortization parameters. The optimization is then carried out wrt the amortized objective L(\u03b8, \u03c6) = \u2211_i \u2113\u03b8(xi; \u03b6\u03c6(xi)) instead. Contextualized under our FML, we amortize the marginal likelihood estimate {ui} with a parameterized function u\u03c6(x), and optimize max_\u03b8 {min_\u03c6 {E_{X\u223cpd}[J(u\u03c6; p\u03b8, q\u03b2)]}} instead of (7). Since E_{pd}[log p\u03b8] = min_u {E_{pd}[J(u_X; p\u03b8, q\u03b2)]} \u2264 min_\u03c6 {E_{pd}[J(u\u03c6(x); p\u03b8, q\u03b2)]}, amortized latent FML effectively optimizes an upper bound of the likelihood loss. This bound tightens as the function family u\u03c6 becomes more expressive, which makes expressive deep neural networks an appealing choice for u\u03c6 [35]. 
To further improve parameter efficiency, we note parameter \u03c6 can be shared with\nthe proposal parameter \u03b2 used by q\u03b2(z|x).\n3.5 Sampling From Unnormalized Distribution\nThere are problems for which we are given an unnormalized distribution p\u03c8\u2217 (x) \u221d exp[\u2212\u03c8\u2217(x)]\nand no data samples; we would like to model p\u03c8\u2217 (x) in the sense that we\u2019d like to efficiently sample\nfrom it. This problem arises, for example, in reinforcement learning [31], among others. To address\nthis problem under FML, we propose to parameterize a sampler X = G\u03b8(Z), Z \u223c p(z) and a\nnonparametric potential function \u03c8\u03b8(x) (see footnote 2). FML is used to estimate the model likelihood via solving\n\nmax_\u03c8 {\u2212 min_b {F(\u03c8, b; \u03b8)}}, F(\u03c8, b; \u03b8) \u225c EZ\u223cp(z)[J (G\u03b8(Z), u\u03c8,b, \u03c8)], (9)\n\nwhere u\u03c8,b(x) = \u03c8\u03b8(x) + b is our estimate for \u2212 log p\u03b8(x) implicitly defined by G\u03b8(z).\nTo match model samples to the target distribution, G\u03b8(z) is trained to minimize the KL-divergence\n\nKL(p\u03b8 \u2016 p\u03c8\u2217 ) = EX\u223cp\u03b8 [log p\u03b8(X) \u2212 log p\u03c8\u2217 (X)] = EX\u223cp\u03b8 [log p\u03b8(X) + \u03c8\u2217(X)] + log Z\u03c8\u2217.\n\nSince the last term is independent of model parameter \u03b8, we obtain the KL-based training objective\nJKL(\u03b8; \u03c8, b, \u03c8\u2217) \u225c EZ\u223cp(z)[\u03c8\u2217(G\u03b8(Z)) \u2212 u\u03c8,b(G\u03b8(Z))] by replacing log p\u03b8(x) with our FML\nestimate. 
Due to the dependence of ub(x) on \u03b8, the \ufb01nal learning procedure is\n\n[\u03c8t, bt] = [\u03c8t\u22121, bt\u22121] \u2212 \u03b7t\u2207[\u03c8,b]F(\u03c8t\u22121, bt\u22121; \u03b8t), \u03b8t+1 \u2190 \u03b8t \u2212 \u03b7t\u2207\u03b8JKL(\u03b8t; \u03c8t, bt, \u03c8\u2217).\n\n4 Related Work\nFenchel duality In addition to optimization schemes, the Fenchel duality also \ufb01nds successful\napplications in probabilistic modeling. Prominent examples include divergence minimization [3] and\nlikelihood-ratio estimation [50], and more recently adversarial learning [51]. In discrete learning,\nFagan and Iyengar [20] employed it to speedup extreme classi\ufb01cation. To the best of the authors\u2019\nknowledge, Fenchel duality has not been applied previously to likelihoods with latent variables.\nNonparametric density estimation To combat the biased estimation of the partition function, Burda\net al. [9] proposed a conservative estimator, which partly alleviates this issue. Parallel to our work,\nDai et al. [16] explored Fenchel duality in the setting of MLE for an unnormalized statistical model\nestimation, under the name dynamics dual embedding (DDE), which seeks optimal embedding in the\nspace of probability measures. The authors used parameterized Hamiltonian \ufb02ows for distribution\nembeddings, which limits its scalability and expressiveness. In particular, DDE fails if the search\nspace does not contain the target distribution, while our formulation only requires the support of the\nproposal distribution to cover that of the target.\nAdversarial distribution learning The proposed FML framework is complementary to the develop-\nment of GANs. FML prioritizes the learning of a potential function, while GANs have focused on the\ntraining of a sampler. Both schemes are derived via learning by contrast. 
Notably, f-GANs contrast\nthe difference between likelihoods under respective models, while our FML contrasts data samples\nwith proposal samples under the current model potential. Synergies can be explored between the two\nschemes.\nApproximate inference Compared with VI, FML optimizes a direct estimate of the marginal\nlikelihood instead of a variational bound. While tighter bounds can be achieved for VI via importance\nre-weighting [10], flexible posteriors [47] and alternative evidence scores [62], these strategies do not\nnecessarily improve performance [55]. Another fundamental difference is that while VI discards all\nconditional likelihoods after the ELBO evaluation, FML consolidates them into an estimate of the\nmarginal likelihood through SGD.\nSampling unnormalized potentials This is one of the fundamental topics in statistics and computer\nscience [45]. Recent studies have explored the use of deep neural samplers for this purpose: Feng et al.\n[21] train the sampler with kernel Stein variational gradients, and Li et al. [38] adversarially update\nthe sampler based on the adaptive contrast technique [47]. 
FML provides an expressive, scalable and\nnumerically stable solution based on the simulation of a Langevin gradient flow.\n\nFootnote 2: With slight abuse of notation, we assume \u03c8\u03b8(x) is parameterized by \u03c8 to avoid notation collision with\nsampler G\u03b8(z).\n\nTable 1: Quantitative evaluation on toy models.\nModel | Parameter estimation error \u2020 \u2193: banana kidney rings river wave | Likelihood consistency score \u2191: banana kidney rings river wave\nMC | 3.46 3.9 4.71 1.71 1.78 | 0.961 0.881 0.508 0.702 0.619\nSM [36] | 7.79 2.75 3.62 1.64 2.61 | \u00d7 \u00d7 \u00d7 \u00d7 \u00d7\nNCE [28] | 3.88 4.81 2.85 1.20 2.5 | 0.968 0.882 0.557 0.721 0.759\nKEF [59] | \u00d7 \u00d7 \u00d7 \u00d7 \u00d7 | 0.973 0.755 0.183 0.436 0.265\nDDE [16] | 6.59 7.31 24.9 29.1 25.7 | 0.944 0.830 0.426 0.520 0.186\nFML (ours) | 3.05 1.9 2.59 1.13 1.27 | 0.974 0.901 0.562 0.731 0.782\n\nFigure 2: FML predicted likelihood using nonparametric potentials.\n\nModel wine-red wine-white yeast htru2\nKDE 7.74 7.74 3.01 15.47\nGMM 7.42 7.97 4.82 22.06\nDDE 7.45 7.18 3.79 18.83\nFLOW 7.09 7.75 3.31 20.48\nNCE 7.29 7.98 4.84 22.05\nFML 8.45 8.20 4.96 22.15\n\n5 Experiments\nTo validate the proposed FML framework and benchmark it against state-of-the-art methods, we\nconsider a wide range of experiments, using synthetic and real-world datasets. All experiments\nare implemented with TensorFlow and executed on a single NVIDIA TITAN X GPU. Details of\nthe experimental setup are provided in the SUPP, due to space limits, and our code is available from\nhttps://www.github.com/chenyang-tao/FML. For the evaluation metrics reported, \u2191 indicates a higher\nscore is considered better, and vice versa with \u2193. 
Our goal is to verify FML works favorably or\nsimilarly compared with competing solutions under the same setup, not to beat state-of-the-art results.\n5.1 Estimating unnormalized statistical models\nWe compare FML with competing solutions on parameter\nestimation and likelihood prediction with\nunnormalized statistical models. We report \u00d7 if a\nmethod is unable to compute or fails to reach a\nreasonable result. Grid search is used for KDE to\noptimize the kernel bandwidth.\n\nTable 2: Log-likelihood evaluation on UCI datasets \u2191.\n\nParameter estimation for unnormalized models We first benchmark the performance on parameter\nestimation with a number of representative toy models, including continuous distributions with\nvarying dimensionality (see SUPP for details). The exact parametric form of the potential function\nis given, and the task is to estimate the parameter values that generate the samples. We use 1,000\nand 5,000 samples, respectively, for training and evaluation. To assess performance, we repeat\neach experiment 10 times and report the mean absolute error \u2016\u03b8\u0302 \u2212 \u03b8\u2217\u2016_1, where \u03b8\u0302 and \u03b8\u2217 denote the\nparameter estimate and ground-truth, respectively. We benchmark FML against naive Monte-Carlo,\nscore matching, noise contrastive estimation and dual dynamics embedding, with results reported in\nTable 1. FML provides comparable, if not better, performance on all the models considered.\nNonparametric likelihood prediction In the absence of an explicit parametric model of the likelihood,\na deep neural network is used as a nonparametric model of the potential. To evaluate model\nperformance, we consider the likelihood consistency score, defined as the correlation between the\nlearned nonparametric potential and the ground truth potential, i.e., corr(log p\u03b8\u2217 (X), log p\u03b8\u0302(X)),\nwhere the expectation is taken wrt ground-truth samples. 
The results are summarized in Table 1.\nIn Figure 2, we also visualize the nonparametric FML estimates of the likelihood compared with\nground truth. Note SM proved computationally unstable in all cases, and DDE has to be trained with\na smaller learning rate, due to stability issues.\nIn addition to the toy examples, we also evaluate the proposed FML on real datasets from the UCI\ndata repository [17]. To evaluate model performance, we randomly split the data into ten folds, and\nuse seven of them for training and three of them for evaluation. To cope with the high-dimensionality\nof the data, we use a GMM proposal for both NCE and FML. The averaged log-likelihood on the test\nset is reported in Table 2, and the proposed FML shows an advantage over its counterparts.\n5.2 Latent variable models and generative modeling\nOur next experiment considers FML-based training for latent variable models and generative modeling\ntasks. In particular, we directly benchmark FML against the VAE [37], for modeling complex\ndistributions, such as images and natural language, for real-world applications. We focus on evaluating\nthe model\u2019s ability to (ef\ufb01ciently) synthesize realistic samples. 
Additionally, we demonstrate how\nFML can assist the training of generative adversarial nets by following the variational annealing setup\ndescribed in Tao et al. [63], with results summarized in Table 4. Our FML-based solution outperforms\nDFM [64], which is based on a DAE score estimator [1], in terms of FID, while giving similar performance in IS.\n\nTable 3: VAE quantitative results.\nMNIST IS\u2191 FID\u2193 \u2212log p\u0302 \u2193\nVAE 8.08 24.3 103.7\nFML 8.30 22.7 101.5\n\nFigure 3: Sampled images from FML-trained models (panels: MNIST, CelebA, Cifar10).\n\nTable 4: GAN quantitative results.\nCifar10 IS\u2191 FID\u2193\nGAN 6.29 37.4\nDFM 6.93 30.7\nFML 6.91 30.0\n\nTable 5: Results on language models, with the example\nsynthesized text representative of typical results.\nModel PPL \u2193 BLEU-2 \u2191 BLEU-3 \u2191 BLEU-4 \u2191 BLEU-5 \u2191\nEMNLP WMT news\nVAE 12.5 76.1 46.8 23.1 11.6\nFML 11.6 77.2 47.4 24.3 12.2\nMS COCO\nVAE 9.5 82.1 60.7 38.9 24.8\nFML 8.6 84.2 64.4 40.3 25.2\nSampled sentences from respective models on WMT news\nVAE: \u201cChina\u2019s economic crisis, the number of US exports,\nwhich is still in recent years of the UK\u2019s population.\u201d\n\nImage datasets We applied FML-training to a number of popular\nimage datasets including MNIST, CelebA, and Cifar10. The following\nmetrics are considered for quantitative evaluation of model performance:\n(i) Inception Score (IS) [58], (ii) Fr\u00e9chet Inception Distance\n(FID) [32], and (iii) negative log-likelihood estimates [65]. See Table 3\nfor quantitative evaluations (for additional results on CelebA, see the\nSUPP), and images sampled from the FML-trained models are presented in Figure 3 for qualitative\nassessment. 
FML-based training consistently improves model performance wrt quantitative measures,\nwhich is also verified by our human evaluation (see SUPP).\nNatural language models We further apply\nFML to the learning of natural language models.\nThe following two benchmark datasets are\nconsidered: (i) EMNLP WMT news [26] and\n(ii) MS COCO [43]. In accordance with standard\nliterature in language modeling, we report\nboth perplexity (PPL) [8] and BLEU [54] scores.\nNote PPL is an evaluation metric based on the\nlikelihood. Quantitative results along with sentence\nsamples generated from trained models\nare reported in Table 5. FML-based training\nleads to consistently improved performance wrt\nboth PPL and BLEU; it also typically generates\nmore coherent sentences compared with its\ncounterpart.\n5.3 Sampling unnormalized distributions\nOur final experiment considers an application in reinforcement\nlearning (RL) with an FML-trained neural sampler. We\nbenchmark the effectiveness of our FML-based sampling\nscheme described in Sec 3.5 by comparing it with the\nSVGD sampler used in the state-of-the-art soft Q-learning\nimplementation [31]. We examine the performance on three\ncontinuous control tasks, namely swimmer, hopper and\nreacher, defined in OpenAI gym [7] and rllab [18] environments,\nwith results summarized in Figure 4. Figure 4(a)\noverlays samples from the FML-trained policy network on\nthe potential of the model estimated optimal policy,\nverifying FML\u2019s capability to capture complex multi-modal\ndistributions. The evolution of policy rewards wrt training iterations is provided in Figure 4(b-d), and\nFML-based policy updates improve on original SVGD updates.\n6 Conclusion\nWe have developed a scalable and flexible learning scheme for probabilistic modeling. Rooted\nin classical MLE learning, our solution handles inference, estimation, sampling and likelihood\nevaluation in a unified framework, without major compromises. 
Empirical evidence verifies that the\nproposed method delivers competitive performance on a wide range of tasks.\n\nTable 5 (continued), sampled sentence on WMT news, FML: \u201cIn addition, police officials have also found a new\ninvestigation into the area where they could take a\nfurther notice of a similar investigation into.\u201d\n\nFigure 4: Soft Q-Learning with FML. (a) Policy samples; (b-d) reward vs. training steps (million) on Hopper-v1, Swimmer-rllab and Reacher-v1, comparing SQL-FML and SQL-SVGD.\n\nAcknowledgements\nThe authors would like to thank the anonymous reviewers for their insightful comments. This research\nwas supported in part by DARPA, DOE, NIH, ONR, NSF and RTI Internal research & development\nfunds. J Chen was partially supported by China Scholarship Council (CSC). W Lu and J Feng\nwere supported by the Shanghai Municipal Science and Technology Major Project and ZJ Lab (No.\n2018SHZDZX01).\n\nReferences\n[1] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563\u20133593, 2014.\n\n[2] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In ICML, pages 159\u2013168, 2018.\n\n[3] Yasemin Altun and Alex Smola. Unifying divergence minimization and statistical inference via convex duality. In COLT, 2006.\n\n[4] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein generative adversarial networks. In ICML, 2017.\n\n[5] Toby Berger. Rate distortion theory: A mathematical basis for data compression. 1971.\n\n[6] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859\u2013877, 2017.\n\n[7] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. 
arXiv preprint arXiv:1606.01540, 2016.\n\n[8] Peter F Brown, Vincent J Della Pietra, Robert L Mercer, Stephen A Della Pietra, and Jennifer C\nLai. An estimate of an upper bound for the entropy of english. Computational Linguistics, 18\n(1):31\u201340, 1992.\n\n[9] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Accurate and conservative estimates of\n\nmrf log-likelihood using reverse annealing. In AISTATS, pages 102\u2013110, 2015.\n\n[10] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In\n\nICLR, 2016.\n\n[11] George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury Paci\ufb01c Grove,\n\nCA, 2002.\n\n[12] Liqun Chen, Shuyang Dai, Yunchen Pu, Erjin Zhou, Chunyuan Li, Qinliang Su, Changyou\nChen, and Lawrence Carin. Symmetric variational autoencoder and connections to adversarial\nlearning. In AISTATS, 2018.\n\n[13] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Info-\nGAN: Interpretable representation learning by information maximizing generative adversarial\nnets. In NIPS, 2016.\n\n[14] Chris Cremer, Xuechen Li, and David Duvenaud.\n\nautoencoders. arXiv preprint arXiv:1801.03558, 2018.\n\nInference suboptimality in variational\n\n[15] Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel\nexponential family estimation via doubly dual embedding. arXiv preprint arXiv:1811.02228,\n2018.\n\n[16] Bo Dai, Hanjun Dai, Niao He, Arthur Gretton, Le Song, and Dale Schuurmans. Exponential\nfamily estimation via dynamics embedding. In NIPS Bayesian Deep Learning Workshop, 2018.\n\n[17] Dua Dheeru and E\ufb01 Karra Taniskidou. UCI machine learning repository, 2017. URL http:\n\n//archive.ics.uci.edu/ml.\n\n10\n\n\f[18] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep\n\nreinforcement learning for continuous control. 
In ICML, pages 1329\u20131338, 2016.\n\n[19] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural\n\nnetworks via maximum mean discrepancy optimization. In UAI, 2015.\n\n[20] Francois Fagan and Garud Iyengar. Unbiased scalable softmax optimization. arXiv preprint\n\narXiv:1803.08577, 2018.\n\n[21] Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized stein\n\nvariational gradient descent. In UAI, 2017.\n\n[22] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning,\n\nvolume 1. Springer series in statistics New York, NY, USA:, 2001.\n\n[23] Charles J Geyer. Markov chain Monte Carlo maximum likelihood. 1991.\n\n[24] Charles J Geyer. On the convergence of monte carlo maximum likelihood calculations. Journal\n\nof the Royal Statistical Society. Series B (Methodological), pages 261\u2013274, 1994.\n\n[25] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\n\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.\n\n[26] Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation\n\nvia adversarial training with leaked information. In AAAI, 2018.\n\n[27] Michael Gutmann and Jun-ichiro Hirayama. Bregman divergence as general framework to\n\nestimate unnormalized statistical models. arXiv preprint arXiv:1202.3727, 2012.\n\n[28] Michael Gutmann and Aapo Hyv\u00e4rinen. Noise-contrastive estimation: A new estimation\n\nprinciple for unnormalized statistical models. In AISTATS, pages 297\u2013304, 2010.\n\n[29] Michael U Gutmann and Aapo Hyv\u00e4rinen. Noise-contrastive estimation of unnormalized\nstatistical models, with applications to natural image statistics. Journal of Machine Learning\nResearch, 13(Feb):307\u2013361, 2012.\n\n[30] Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Likelihood-free\n\ninference via classi\ufb01cation. 
Statistics and Computing, 28(2):411\u2013425, 2018.\n\n[31] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning\n\nwith deep energy-based policies. In ICML, 2017.\n\n[32] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.\nGans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS,\n2017.\n\n[33] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural\n\nComputation, 14(8):1771\u20131800, 2002.\n\n[34] Jean-Baptiste Hiriart-Urruty and Claude Lemar\u00e9chal. Fundamentals of convex analysis. Springer\n\nScience & Business Media, 2012.\n\n[35] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks,\n\n4(2):251\u2013257, 1991.\n\n[36] Aapo Hyv\u00e4rinen. Estimation of non-normalized statistical models by score matching. Journal\n\nof Machine Learning Research, 6(Apr):695\u2013709, 2005.\n\n[37] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.\n\n[38] Chunyuan Li, Ke Bai, Jianqiao Li, Guoyin Wang, Changyou Chen, and Lawrence Carin.\n\nAdversarial learning of a sampler based on an unnormalized distribution. 2019.\n\n[39] Ke Li and Jitendra Malik. Implicit maximum likelihood estimation, 2018.\n\n[40] Yingzhen Li and Richard E Turner. R\u00e9nyi divergence variational inference. In NIPS, 2016.\n\n11\n\n\f[41] Yingzhen Li and Richard E Turner. Gradient estimators for implicit models. In ICLR, 2018.\n\n[42] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML,\n\n2015.\n\n[43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr\nDoll\u00e1r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European\nconference on computer vision, pages 740\u2013755. Springer, 2014.\n\n[44] Bruce G Lindsay. Mixture models: theory, geometry and applications. 
In NSF-CBMS regional\n\nconference series in probability and statistics, pages i\u2013163. JSTOR, 1995.\n\n[45] Jun S Liu. Monte Carlo strategies in scienti\ufb01c computing. Springer Science & Business Media,\n\n2008.\n\n[46] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian\n\ninference algorithm. In NIPS, 2016.\n\n[47] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes:\n\nunifying variational autoencoders and generative adversarial networks. In ICML, 2017.\n\n[48] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv\n\npreprint arXiv:1610.03483, 2016.\n\n[49] Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125\u2013139,\n\n2001.\n\n[50] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence function-\nals and the likelihood ratio by penalized convex risk minimization. In NIPS, pages 1089\u20131096,\n2008.\n\n[51] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural\n\nsamplers using variational divergence minimization. In NIPS, 2016.\n\n[52] Art B. Owen. Monte Carlo theory, methods and examples. 2013.\n\n[53] George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive \ufb02ow for density\n\nestimation. In NIPS, pages 2335\u20132344, 2017.\n\n[54] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic\nevaluation of machine translation. In Proceedings of the 40th annual meeting on association for\ncomputational linguistics, pages 311\u2013318. Association for Computational Linguistics, 2002.\n\n[55] Tom Rainforth, Tuan Anh Le, Maximilian Igl Chris J Maddison, and Yee Whye Teh Frank\n\nWood. Tighter variational bounds are not necessarily better. In NIPS workshop. 2017.\n\n[56] Danilo Jimenez Rezende and Shakir Mohamed. 
Variational inference with normalizing \ufb02ows.\n\nIn ICML, 2015.\n\n[57] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of\n\nMathematical Statistics, 22:400, 1951.\n\n[58] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.\n\nImproved techniques for training GANs. In NIPS, 2016.\n\n[59] Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyv\u00e4rinen, and Revant Kumar.\nDensity estimation in in\ufb01nite dimensional exponential families. The Journal of Machine\nLearning Research, 18(1):1830\u20131888, 2017.\n\n[60] Dougal J Sutherland, Heiko Strathmann, Michael Arbel, and Arthur Gretton. Ef\ufb01cient and\n\nprincipled score estimation with nystr\u00f6m kernel exponential families. In AISTATS, 2018.\n\n[61] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin Duke. Chi-\n\nsquare generative adversarial network. In ICML, 2018.\n\n12\n\n\f[62] Chenyang Tao, Liqun Chen, Ruiyi Zhang, Ricardo Henao, and Lawrence Carin Duke. Varia-\n\ntional inference and model selection with generalized evidence bounds. In ICML, 2018.\n\n[63] Chenyang Tao, Shuyang Dai, Liqun Chen, Ke Bai, Junya Chen, Chang Liu, Georgiy Bobashev,\nand Lawrence Carin. Variational annealing of GANs: A Langevin perspective. In ICML, 2019.\n\n[64] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with\n\ndenoising feature matching. In ICLR, 2017.\n\n[65] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis\n\nof decoder-based generative models. In ICLR. 2017.\n\n[66] Cheng Zhang, Judith B\u00fctepage, Hedvig Kjellstr\u00f6m, and Stephan Mandt. Advances in variational\n\ninference. CoRR, abs/1711.05597, 2017. 
URL http://arxiv.org/abs/1711.05597.\n\n13\n\n\f", "award": [], "sourceid": 5511, "authors": [{"given_name": "Chenyang", "family_name": "Tao", "institution": "Duke University"}, {"given_name": "Liqun", "family_name": "Chen", "institution": "Duke University"}, {"given_name": "Shuyang", "family_name": "Dai", "institution": "Duke University"}, {"given_name": "Junya", "family_name": "Chen", "institution": "Duke U"}, {"given_name": "Ke", "family_name": "Bai", "institution": "Duke University"}, {"given_name": "Dong", "family_name": "Wang", "institution": "Duke University"}, {"given_name": "Jianfeng", "family_name": "Feng", "institution": "Fudan University"}, {"given_name": "Wenlian", "family_name": "Lu", "institution": "Fudan University"}, {"given_name": "Georgiy", "family_name": "Bobashev", "institution": "RTI International"}, {"given_name": "Lawrence", "family_name": "Carin", "institution": "Duke University"}]}