{"title": "Coupled Variational Bayes via Optimization Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 9690, "page_last": 9700, "abstract": "Variational inference plays a vital role in learning graphical models, especially on large-scale datasets. Much of its success depends on a proper choice of auxiliary distribution class for posterior approximation. However, how to pursue an auxiliary distribution class that achieves both good approximation ability and computation efficiency remains a core challenge. In this paper, we proposed coupled variational Bayes which exploits the primal-dual view of the ELBO with the variational distribution class generated by an optimization procedure, which is termed optimization embedding. This flexible function class couples the variational distribution with the original parameters in the graphical models, allowing end-to-end learning of the graphical models by back-propagation through the variational distribution. Theoretically, we establish an interesting connection to gradient flow and demonstrate the extreme flexibility of this implicit distribution family in the limit sense. Empirically, we demonstrate the effectiveness of the proposed method on multiple graphical models with either continuous or discrete latent variables comparing to state-of-the-art methods.", "full_text": "Coupled Variational Bayes via\n\nOptimization Embedding\n\n\u2217Bo Dai1,2, \u2217\u2217Hanjun Dai1, Niao He3, Weiyang Liu1, Zhen Liu1,\n\nJianshu Chen4, Lin Xiao5, Le Song1,6\n\n1Georgia Institute of Technology, 2Google Brain, 3University of Illinois at Urbana Champaign\n\n4Tencent AI, 5Microsoft Research, 6Ant Financial\n\nAbstract\n\nVariational inference plays a vital role in learning graphical models, especially on\nlarge-scale datasets. Much of its success depends on a proper choice of auxiliary\ndistribution class for posterior approximation. 
However, how to pursue an auxiliary distribution class that achieves both good approximation ability and computational efficiency remains a core challenge. In this paper, we propose coupled variational Bayes, which exploits the primal-dual view of the ELBO with the variational distribution class generated by an optimization procedure, termed optimization embedding. This flexible function class couples the variational distribution with the original parameters in the graphical models, allowing end-to-end learning of the graphical models by back-propagation through the variational distribution. Theoretically, we establish an interesting connection to gradient flow and demonstrate the extreme flexibility of this implicit distribution family in the limit sense. Empirically, we demonstrate the effectiveness of the proposed method on multiple graphical models with either continuous or discrete latent variables, compared with state-of-the-art methods.

1 Introduction

Probabilistic models with Bayesian inference provide a powerful tool for modeling data with complex structure and capturing uncertainty. Latent variables increase the flexibility of the models, while making inference intractable. Typically, one resorts to approximate inference such as sampling [Neal, 1993, Neal et al., 2011, Doucet et al., 2001] or variational inference [Wainwright and Jordan, 2003, Minka, 2001]. Sampling algorithms enjoy good asymptotic theoretical properties, but they are also known to suffer from slow convergence, especially for complex models. As a result, variational inference algorithms have become more and more attractive, especially driven by recent developments in stochastic approximation methods [Hoffman et al., 2013].
Variational inference methods approximate the intractable posterior distributions by a family of distributions.
Choosing a proper variational distribution family is one of the core problems in variational inference. For example, the mean-field approximation exploits the distributions generated by an independence assumption. Such an assumption reduces the computational complexity; however, it often leads to a distribution family that is too restricted to recover the exact posterior [Turner and Sahani, 2011]. Mixture models and nonparametric families [Jaakkola and Jordan, 1999, Gershman et al., 2012, Dai et al., 2016a] are the natural generalizations. By introducing more components into the parametrization, the distribution family becomes more and more flexible, and the approximation error is reduced. However, the computational cost increases, since each update requires evaluating the log-likelihood and/or its derivatives for every component, which can limit the scalability of variational inference. Inspired by the flexibility of deep neural networks, many neural-network-parametrized distributions [Kingma and Welling, 2013, Mnih and Gregor, 2014]

*indicates equal contributions.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

and tractable flows [Rezende and Mohamed, 2015, Kingma et al., 2016, Tomczak and Welling, 2016, Dinh et al., 2016] have been introduced as alternative families in the variational inference framework. The compromises made in designing such neural networks for computational tractability restrict the expressive ability of the approximating distribution family. Finally, the introduction of the variational distribution also brings extra separate parameters to be learned from data. As we know, the more flexible the approximation model is, the more samples are required for fitting such a model.
Therefore, besides the approximation error and computational tractability, sample efficiency should also be taken into account when designing the variational distribution family.
In summary, most existing works suffer from a trade-off between approximation accuracy, computational efficiency, and sample complexity. It remains open to design a variational inference approach that enjoys all three aspects. This paper provides a method towards such a solution, called coupled variational Bayes (CVB). The proposed approach hinges upon two key components: i) the primal-dual view of the ELBO; and ii) the optimization embedding technique for generating variational distributions. The primal-dual view of the ELBO avoids the computation of the determinant of the Jacobian in flow-based models and makes arbitrary flow parametrizations applicable, therefore reducing the approximation error. The optimization embedding generates an interesting class of variational distributions, derived from the updating rule of an optimization procedure. This distribution class reduces the number of separate parameters by coupling the variational distribution with the original parameters in the graphical models. Therefore, we can back-propagate the gradient w.r.t. the original parameters through the variational distributions, which promotes the sample efficiency of the learning procedure. We formally justify that in the continuous-time case, this technique implicitly provides a sufficiently flexible approximating distribution family from the gradient-flow view, implying that the CVB algorithm also guarantees zero approximation error in the limit sense. These advantages are further demonstrated in our numerical experiments.
In the remainder of this paper, we first provide a preliminary introduction to the problem setting of directed graphical models and the variational auto-encoder (VAE) framework in Section 2.
We present our coupled variational Bayes in Section 3, which leverages the optimization embedding in the primal-dual view of the ELBO to couple the variational distribution with the original graphical models. We build up the connections of the proposed method with existing flow formulations in Section 4. We demonstrate the empirical performance of the proposed algorithm in Section 5.

2 Background

Variational inference and learning  Consider a probabilistic generative model, p_θ(x, z) = p_θ(x|z)p(z), where x ∈ ℝ^d denotes the observed variables and z ∈ ℝ^r the latent variables². Given the dataset D = {x_i}_{i=1}^N, one learns the parameter θ in the model by maximizing the marginal likelihood, i.e., log ∫ p_θ(x, z) dz. However, the integral is intractable in general. Variational inference [Jordan et al., 1998] maximizes the evidence lower bound (ELBO) of the marginal likelihood by introducing an approximate posterior distribution, i.e.,

log p_θ(x) = log ∫ p_θ(x, z) dz ≥ E_{z∼q_φ(z|x)}[ log p_θ(x, z) − log q_φ(z|x) ],   (1)

where φ denotes the parameters of the variational distribution. There are two major issues in solving such an optimization: i) an appropriate parametrization of the introduced variational distribution, and ii) efficient algorithms for updating the parameters {θ, φ}. By adopting different variational distributions and exploiting different optimization algorithms, plenty of variants of variational inference and learning algorithms have been proposed.
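For concreteness, the right-hand side of (1) can be estimated by a single-sample reparameterized Monte Carlo average. The sketch below uses a toy one-dimensional conjugate model (the model, values, and function names are illustrative assumptions, not the paper's setup) and shows that a mismatched q leaves a positive gap to log p_θ(x):

```python
import math
import random

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi * std * std) - (v - mean) ** 2 / (2 * std * std)

def elbo(x, mu, sigma, n=20000, seed=0):
    """Monte Carlo estimate of E_{z~q}[log p(x|z) + log p(z) - log q(z|x)]
    with the reparameterization z = mu + sigma * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        z = mu + sigma * rng.gauss(0.0, 1.0)
        acc += log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0) - log_normal(z, mu, sigma)
    return acc / n

# Toy model: p(z) = N(0,1), p(x|z) = N(z,1), so the marginal is exactly N(0, 2).
x = 1.3
log_px = log_normal(x, 0.0, math.sqrt(2.0))   # exact log-marginal of this conjugate model
gap = log_px - elbo(x, 0.0, 1.0)              # ELBO gap = KL(q || p(z|x)) > 0 for mismatched q
print(gap > 0.0)
```

The gap here equals the KL divergence between the chosen q and the exact posterior, which is exactly the quantity a richer variational family is meant to shrink.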
Among the existing algorithms, optimizing the objective with stochastic gradient descent [Hoffman et al., 2013, Titsias and Lázaro-Gredilla, 2014, Dai et al., 2016a] has become the dominant approach due to its scalability to large-scale datasets. However, how to select the variational distribution family has not been answered satisfactorily yet.
Reparametrized density  Kingma and Welling [2013], Mnih and Gregor [2014] exploit a recognition model or inference network to parametrize the variational distribution. A typical inference network is a stochastic mapping from the observation x to the latent variable z with a set of global parameters φ, e.g., q_φ(z|x) := N(z | μ_{φ1}(x), diag(σ²_{φ2}(x))), where μ_{φ1}(x) and σ_{φ2}(x) are often parametrized by deep neural networks. Practically, such reparametrizations have a closed-form entropy in general, and thus the gradient computation and optimization are relatively easy. However, such a parametrization cannot perfectly fit the posterior when it does not fall into the known distribution family, therefore resulting in extra approximation error to the true posterior.

²We mainly discuss continuous latent variables in the main text; however, the proposed algorithm extends easily to discrete latent variables, as we show in Appendix B.

Tractable flow-based models  Parametrizing the variational distribution with flows has been proposed to mitigate the limited expressive ability of the variational distribution. Specifically, assuming a series of invertible transformations {T_t : ℝ^r → ℝ^r}_{t=1}^T and z⁰ ∼ q₀(z|x), we have z^T = T_T ∘ T_{T−1} ∘ ⋯ ∘ T₁(z⁰) following the distribution q_T(z|x) = q₀(z|x) ∏_{t=1}^T |det(∂T_t/∂z^{t−1})|^{−1} by the change-of-variables formula.
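This change-of-variables bookkeeping, which is exactly the computation CVB later avoids, can be illustrated with a one-dimensional affine flow (a hedged sketch; the maps and constants below are illustrative assumptions):

```python
import math

# Two invertible 1-D affine maps T_t(z) = a_t * z + b_t stand in for the flow;
# the pushed-forward density is q_T(z_T) = q_0(z_0) * prod_t |dT_t/dz|^{-1}.
transforms = [(2.0, 0.5), (3.0, -1.0)]          # (a_t, b_t), invertible since a_t != 0

def q0(z):                                       # base density q_0 = N(0, 1)
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def push_forward(z0):
    z, log_det = z0, 0.0
    for a, b in transforms:
        z = a * z + b                            # z^t = T_t(z^{t-1})
        log_det += math.log(abs(a))              # dT_t/dz = a_t for an affine map
    return z, q0(z0) * math.exp(-log_det)

# The composed map is z -> 6z + 0.5, so z_T ~ N(0.5, 6^2); check the density agrees.
z_T, dens = push_forward(0.3)
exact = math.exp(-0.5 * ((z_T - 0.5) / 6.0) ** 2) / (6.0 * math.sqrt(2.0 * math.pi))
print(abs(dens - exact) < 1e-12)
```

For general neural transformations the per-step Jacobian is no longer a scalar, and its determinant is what makes this accounting expensive.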
The flow-based parametrization generalizes the reparametrization trick beyond known distributions. However, a general parametrization of the transformation may violate the invertibility requirement and result in expensive or even infeasible calculation of the Jacobian and its determinant. Therefore, several carefully designed, simple parametric forms of T have been proposed to balance the invertibility requirement against the tractability of the Jacobian [Rezende and Mohamed, 2015, Kingma et al., 2016, Tomczak and Welling, 2016, Dinh et al., 2016], at the expense of the flexibility of the corresponding variational distribution families.
3 Coupled Variational Bayes
In this section, we first consider variational inference from a primal-dual view, by which we can avoid the computation of the determinant of the Jacobian. Then, we propose the optimization embedding, which generates the variational distribution by the adopted optimization algorithm. It automatically produces a nonparametric distribution class, which is flexible enough to approximate the posterior. More importantly, the optimization embedding couples the implicit variational distribution with the original graphical models, making the training more efficient. We introduce the key components below. Due to space limitations, we postpone the proof details of all the theorems in this section to Appendix A.
3.1 A Primal-Dual View of the ELBO in Functional Space
As introduced above, the flow-based parametrization brings more flexibility in representing distributions. However, calculating the determinant of the Jacobian introduces extra computational cost and an invertibility requirement on the parametrization. In this section, we start from the primal-dual perspective of the ELBO, which provides a mechanism to avoid such computation and requirements, therefore making arbitrary flow parametrizations applicable for inference.
As Zellner [1988], Dai et al.
[2016a] show, when the family of variational distributions includes all valid distributions P, the ELBO matches the marginal likelihood, i.e.,

L(θ) := E_{x∼D}[ log ∫ p_θ(x, z) dz ] = max_{q(z|x)∈P} E_{x∼D} E_{z∼q(z|x)}[ log p_θ(x|z) − KL(q(z|x) || p(z)) ],   (2)

where p_θ(x, z) = p_θ(x|z)p(z), E_{x∼D}[·] denotes the expectation over the empirical distribution on observations, and the maximand, denoted ℓ_θ(q), stands for the objective for the variational distribution in the density space P under the probabilistic model with θ. Denote q*_θ(z|x) := argmax_{q(z|x)∈P} ℓ_θ(q) = p_θ(x, z) / ∫ p_θ(x, z) dz. The ultimate objective L(θ) then depends solely on θ, i.e.,

L(θ) = E_{x∼D} E_{z∼q*_θ(z|x)}[ log p_θ(x, z) − log q*_θ(z|x) ],   (3)

which can be updated by stochastic gradient descent.
This would then require routinely solving the subproblem max_{q∈P} ℓ_θ(q). Since the optimization is taken over the whole distribution space, it is intractable in general. Traditionally, one may introduce special parametric forms of distributions or flows for the sake of computational tractability, thus limiting the approximation ability.
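The tightness claim in (2)-(3), namely that the bound collapses to log p_θ(x) at q*_θ(z|x), can be checked numerically on a toy conjugate-Gaussian model (the model and numbers below are illustrative assumptions of this sketch): at the exact posterior, the integrand log p_θ(x, z) − log q*(z|x) equals log p_θ(x) for every single sample z.

```python
import math
import random

# Toy conjugate model: p(z) = N(0, 1), p_theta(x|z) = N(z, 1), so the exact
# posterior is q*(z|x) = N(x/2, 1/2) and log p(x) = log N(x; 0, 2).
def log_normal_var(v, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (v - mean) ** 2 / (2.0 * var)

x = 0.7
rng = random.Random(1)
for _ in range(5):
    z = x / 2.0 + math.sqrt(0.5) * rng.gauss(0.0, 1.0)        # z ~ q*(z|x)
    pointwise = (log_normal_var(x, z, 1.0) + log_normal_var(z, 0.0, 1.0)
                 - log_normal_var(z, x / 2.0, 0.5))           # log p(x,z) - log q*(z|x)
    # At the exact posterior the integrand equals log p(x) for EVERY z (zero variance).
    assert abs(pointwise - log_normal_var(x, 0.0, 2.0)) < 1e-9
print("ELBO is tight at q = p(z|x)")
```

This zero-variance property is special to the exact posterior; any restricted family pays the KL gap instead.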
In what follows, we introduce an equivalent primal-dual view of ℓ_θ(q) in Theorem 1, which yields a promising opportunity to achieve both approximation ability and computational tractability.
Theorem 1 (Equivalent reformulation of L(θ))  We can reformulate L(θ) equivalently as

L(θ) = max_{z_{x,ξ}∈ℝ^r} min_{ν∈H₊} E_{x∼D}[ E_{ξ∼p_ξ(·)}[ log p_θ(x|z_{x,ξ}) − log ν(x, z_{x,ξ}) ] + E_{z∼p(z)}[ ν(x, z) ] ] − 1,   (4)

where H₊ = {h : ℝ^d × ℝ^r → ℝ₊}, p_ξ(·) denotes some simple distribution, and the optimal dual function is ν*_θ(x, z) = q*_θ(z|x) / p(z).

The primal-dual formulation of L(θ) is derived based on Fenchel duality and the interchangeability principle [Dai et al., 2016b, Shapiro et al., 2014]. With the primal-dual view of the ELBO, we are able to represent the distributional operation on q by local variables z_{x,ξ}, which provides an implicit nonparametric transformation from (x, ξ) ∈ ℝ^d × Ξ to z_{x,ξ} ∈ ℝ^r. Meanwhile, with the help of the dual function ν(x, z), we can also avoid computing the determinant of the Jacobian matrix of the transformation, which is in general infeasible for an arbitrary transformation.
3.2 Optimization Embedding
In this section, inspired by the local-variable representation of the variational distribution in Theorem 1, we construct a special variational distribution family which integrates the variational distribution q, i.e., the transformation on local variables, with the original parameters θ of the graphical models.
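Before constructing this family, note that the Fenchel-duality step behind Theorem 1, KL(q||p) = max_{ν>0} E_q[log ν] − E_p[ν] + 1 with maximizer ν* = q/p, can be sanity-checked on small discrete distributions (an illustrative sketch; the particular distributions are assumptions of the example):

```python
import math

# Fenchel-dual form of the KL divergence behind Theorem 1:
#   KL(q || p) = max_{nu > 0}  E_q[log nu] - E_p[nu] + 1,   attained at nu* = q/p.
q = [0.5, 0.3, 0.2]
p = [0.2, 0.3, 0.5]
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def dual_value(nu):
    return (sum(qi * math.log(ni) for qi, ni in zip(q, nu))
            - sum(pi * ni for pi, ni in zip(p, nu)) + 1.0)

nu_star = [qi / pi for qi, pi in zip(q, p)]
print(abs(dual_value(nu_star) - kl) < 1e-12)   # the optimum attains the KL exactly
print(dual_value([1.0, 1.0, 1.0]) <= kl)       # any other nu only lower-bounds it
```

Replacing the KL term in the ELBO by this maximization over ν is what turns (2) into the saddle-point problem (4).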
We emphasize that optimization embedding is a general technique for representing variational distributions and can also be combined with the original ELBO, as shown in Appendix B.
As shown in Theorem 1, we switch from handling the distribution q(z|x) ∈ P to handling each local variable. Specifically, given x ∼ D and ξ ∼ p(ξ), with a fixed ν ∈ H₊,

z*_{x,ξ;θ} = argmax_{z_{x,ξ}∈ℝ^r} log p_θ(x|z_{x,ξ}) − log ν(x, z_{x,ξ}).   (5)

For complex graphical models, it is difficult to obtain the global optimum of (5). We can approach z*_{x,ξ;θ} by applying the mirror descent algorithm (MDA) [Beck and Teboulle, 2003, Nemirovski et al., 2009]. Specifically, denoting the initialization as z⁰_{x,ξ}, in the t-th iteration we update the variables until convergence via the prox-mapping operator

z^t_{x,ξ;θ} = argmax_{z∈ℝ^r} ⟨z, η_t g(x, z^{t−1}_{x,ξ;θ})⟩ − D_ω(z^{t−1}_{x,ξ;θ}, z),   (6)

where g(x, z^{t−1}_{x,ξ;θ}) = ∇_z log p_θ(x|z^{t−1}_{x,ξ;θ}) − ∇_z log ν(x, z^{t−1}_{x,ξ;θ}), and D_ω(z₁, z₂) = ω(z₂) − [ω(z₁) + ⟨∇ω(z₁), z₂ − z₁⟩] is the Bregman divergence generated by a continuous and strongly convex function ω(·). In fact, we have a closed-form solution to the prox-mapping operator (6).
Theorem 2 (The closed form of MDA)  Recall that ω(·) is strongly convex and denote f(·) = ∇ω(·); then f⁻¹(·) exists.
Therefore, the solution to (6) is

z^t_{x,ξ;θ} = f⁻¹( f(z^{t−1}_{x,ξ;θ}) + η_t g(x, z^{t−1}_{x,ξ;θ}) ).   (7)

Proper choices of the Bregman divergence can exploit the geometry of the feasible domain and yield faster convergence. For example, if z lies in a general continuous space, one may use ω(z) = ½‖z‖², so that D_ω(·, ·) is the Euclidean distance on ℝ^r with f(z) = z and f⁻¹(z) = z; if z lies in a simplex, one may use ω(z) = Σ_{i=1}^r z_i log z_i, so that D_ω(·, ·) is the KL-divergence on the r-dimensional simplex with f(z) = log z and f⁻¹(z) = exp(z).
Assume we run the update (7) for T iterations; the mirror descent algorithm then outputs z^T_{x,ξ;θ} for each pair (x, ξ). Therefore, it naturally establishes another nonparametric function mapping from ℝ^d × Ξ to ℝ^r that approximates the sampler of the variational distribution point-wise, i.e., z^T_θ(x, ξ) ≈ z*_{x,ξ;θ}, ∀(x, ξ) ∈ ℝ^d × Ξ. Since this approximating function is generated by the mirror descent algorithm, we name the corresponding function class the optimization embedding. Most importantly, the optimization-embedded function explicitly depends on θ, which makes end-to-end learning possible by back-propagation through the variational distribution. The detailed advantages of using the optimization embedding for learning are explained in Section 3.3.
Before that, we first justify the approximation ability of the optimization embedding by connecting it to the gradient flow that minimizes the KL-divergence with a special ν(x, z) in the limit case. For simplicity, we mainly focus on the basic case where f(z) = z.
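The closed form (7) can be checked directly against the prox-mapping definition (6) for the entropic geometry (a small sketch; the particular vectors and step size are illustrative assumptions):

```python
import math

# Theorem 2 in action: the prox-mapping
#   z^t = argmax_z <z, eta*g> - D_omega(z^{t-1}, z)
# is solved in closed form by z^t = f^{-1}( f(z^{t-1}) + eta*g ) with f = grad(omega).
# Entropic case sketched here: omega(z) = sum_i z_i log z_i, so f = log, f^{-1} = exp.

def omega(z):
    return sum(zi * math.log(zi) for zi in z)

def bregman(z1, z2):
    # D_omega(z1, z2) = omega(z2) - omega(z1) - <grad omega(z1), z2 - z1>
    grad1 = [math.log(zi) + 1.0 for zi in z1]
    return omega(z2) - omega(z1) - sum(g * (b - a) for g, a, b in zip(grad1, z1, z2))

def prox_objective(z, z_prev, g, eta):
    return sum(eta * gi * zi for gi, zi in zip(g, z)) - bregman(z_prev, z)

z_prev, g, eta = [0.4, 0.6], [1.0, -0.5], 0.1
z_new = [math.exp(math.log(zi) + eta * gi) for zi, gi in zip(z_prev, g)]   # eq. (7)

# The closed form should dominate nearby perturbations of itself.
best = prox_objective(z_new, z_prev, g, eta)
ok = all(prox_objective([z_new[0] + d0, z_new[1] + d1], z_prev, g, eta) <= best + 1e-12
         for d0 in (-0.01, 0.01) for d1 in (-0.01, 0.01))
print(ok)
```

Because the objective in (6) is strictly concave in z for a strongly convex ω, the stationary point recovered by (7) is the unique maximizer, which is what the perturbation check confirms numerically.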
For a fixed x and a sample ξ ∼ p(ξ), the particle z^T_θ(x, ξ) is recursively constructed by the transform T_x(z) = z + η g(x, z). We show that the density of these particles evolves as a gradient flow.
Theorem 3 (Optimization embedding as gradient flow)  For a continuous time t = ηT and infinitesimal step size η → 0, the density of the particles z^t ∈ ℝ^r, denoted q_t(z|x), follows the nonlinear Fokker-Planck equation

∂q_t(z|x)/∂t = −∇ · ( q_t(z|x) g_t(x, z) ),   (8)

if g_t(x, z) := ∇_z log p_θ(x|z) − ∇_z log ν*_t(x, z) with ν*_t(x, z) = q_t(z|x)/p(z). The process defined by (8) is a gradient flow of the KL-divergence in the space of measures equipped with the 2-Wasserstein metric.

Algorithm 1 Coupled Variational Bayes (CVB)
1: Initialize θ, V and W (the parameters of ν and z⁰) randomly; set the number of steps T and the mirror function f. Set z⁰(x, ξ) = h_W(x, ξ).
2: for iteration k = 1, . . . , K do
3:   Sample a mini-batch {x_i}_{i=1}^m from the dataset D, {z_i}_{i=1}^m from the prior p(z), and {ξ_i}_{i=1}^m from p(ξ).
4:   for iteration t = 1, . . . , T do
5:     Compute z^t_θ(x, ξ) for each pair in {x_i, ξ_i}_{i=1}^m.
6:     Descend V with ∇_V (1/m) Σ_{i=1}^m [ ν_V(x_i, z_i) − log ν_V(x_i, z^t_θ(x_i, ξ_i)) ].
7:   end for
8:   Ascend θ by the stochastic gradient (11).
9:   Ascend W by ∇_W (1/m) Σ_{i=1}^m [ log p_θ(x_i | z^T_θ(x_i, ξ_i)) − log ν_V(x_i, z^T_θ(x_i, ξ_i)) ].
10: end for

From this gradient-flow view of the optimization embedding, we can see that in the limit case the optimization embedding z^T_θ(x, ξ) is flexible enough to approximate the posterior accurately.

3.3 Algorithm
Applying the optimization embedding to ℓ_θ(q), we arrive at the approximate surrogate optimization of L(θ) in (2):

max_θ L̃(θ),  L̃(θ) := min_{ν∈H₊} E_{x∼D}[ E_{ξ∼p(ξ)}[ log p_θ(x|z^T_θ(x, ξ)) − log ν(x, z^T_θ(x, ξ)) ] + E_{z∼p(z)}[ ν(x, z) ] ].   (9)

We can apply a stochastic gradient algorithm to (9) with the unbiased gradient estimator given as follows.
Theorem 4 (Unbiased gradient estimator)  Denote

ν*_θ(x, z) = argmin_{ν∈H₊} E_{x∼D}E_{z∼p(z)}[ ν(x, z) ] − E_{x∼D}E_{ξ∼p(ξ)}[ log ν(x, z^T_θ(x, ξ)) ];   (10)

then we have the unbiased gradient estimator w.r.t.
θ as

∂L̃(θ)/∂θ = E_{x∼D}E_{ξ∼p(ξ)}[ ∂log p_θ(x|z)/∂θ |_{z=z^T_θ(x,ξ)} + ∂log p_θ(x|z)/∂z |_{z=z^T_θ(x,ξ)} · ∂z^T_θ(x,ξ)/∂θ ] − E_{x∼D}E_{ξ∼p(ξ)}[ ∂log ν*_θ(x,z)/∂z |_{z=z^T_θ(x,ξ)} · ∂z^T_θ(x,ξ)/∂θ ].   (11)

As we can see from the gradient estimator (11), besides the direct effect of θ through the log-likelihood (the first term in (11)), as in the traditional VAE method with separate variational parameters, the estimator also accounts explicitly for the effect through the variational distribution in the remaining terms. Such dependence through the optimization embedding will potentially accelerate learning in terms of sample complexity. The computation of the second term resembles back-propagation through time (BPTT) in learning recurrent neural networks, and can be easily implemented in TensorFlow or PyTorch.
Practical extension  With the functional primal-dual view of the ELBO and the optimization embedding, we are ready to derive the practical CVB algorithm. The CVB algorithm can easily incorporate parametrizations of each component to balance approximation flexibility, computational cost, and sample efficiency. The introduced parameters can also be trained by SGD within the CVB framework. For example, the optimization embedding requires an initialization z⁰_{x,ξ}. Besides random initialization, we can introduce a parametrized function z⁰_{x,ξ} = h_W(x, ξ), with W denoting its parameters. We can parametrize ν(x, z) by deep neural networks with parameters V.
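The coupling that estimator (11) exploits, gradients flowing through the unrolled inner updates into z^T_θ, can be sketched with a hand-rolled forward-mode derivative (the one-dimensional model, fixed dual function, step count, and step size below are illustrative assumptions, not the paper's parametrization):

```python
import math

# Unrolled optimization embedding on a toy model: p(z) = N(0,1),
# p_theta(x|z) = N(theta*z, 1), dual held fixed so the inner objective is the
# log-joint, f(z) = z, constant step size eta.  z_T is then an explicit function
# of theta, so d log p_theta(x|z_T)/d theta picks up a term through the embedding.

def embed(theta, x, xi, T=20, eta=0.1):
    """T gradient steps on log p_theta(x|z) + log p(z) from z0 = xi.
    Returns z_T and dz_T/dtheta, accumulated by the forward-mode chain rule."""
    z, dz = xi, 0.0
    for _ in range(T):
        grad_z = theta * (x - theta * z) - z                  # inner gradient g(x, z)
        dgrad_dtheta = x - 2.0 * theta * z - (theta * theta + 1.0) * dz
        z, dz = z + eta * grad_z, dz + eta * dgrad_dtheta
    return z, dz

def log_lik(theta, x, z):
    return -0.5 * math.log(2.0 * math.pi) - (x - theta * z) ** 2 / 2.0

def total_grad(theta, x, xi):
    z, dz = embed(theta, x, xi)
    # partial derivative in theta plus the path through z_T, as in (11)
    return z * (x - theta * z) + theta * (x - theta * z) * dz

theta, x, xi = 0.8, 1.5, 0.3
eps = 1e-6
fd = (log_lik(theta + eps, x, embed(theta + eps, x, xi)[0])
      - log_lik(theta - eps, x, embed(theta - eps, x, xi)[0])) / (2.0 * eps)
print(abs(fd - total_grad(theta, x, xi)) < 1e-4)
```

In a deep-learning framework the same effect is obtained for free by building z^T inside the computation graph and calling the framework's autodiff, which is the BPTT analogy made in the text.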
To guarantee positive outputs of ν_V(x, z), we can use positive activation functions, e.g., Gaussian, exponential, or multi-quadratic, in the last layer. However, the neural network parametrization may induce non-convexity and thus lose the guarantee of global convergence in both (9) and (10), which leads to bias in the estimator (11) and potential instability in training. Empirically, to reduce the effect of the neural network parametrization, we update the parameters of ν within the optimization embedding simultaneously, implicitly pushing z^T to follow the gradient flow. Taking the introduced parameters into account, we obtain the CVB algorithm illustrated in Algorithm 1.
Moreover, we have only discussed the optimization embedding built from basic mirror descent. In fact, other optimization algorithms, e.g., accelerated gradient descent, gradient descent with momentum, and adaptive gradient methods (Adagrad, RMSProp), can also be used for constructing the variational distributions. For the variants of CVB for parametrized continuous/discrete latent variable models and a hybrid model with Langevin dynamics, please refer to Appendix B and Appendix C.
4 Related Work
Connections to Langevin dynamics and Stein variational gradient descent  As we show in Theorem 3, the optimization embedding can be viewed as a discretization of a nonlinear Fokker-Planck equation, which can be interpreted as a gradient flow of the KL-divergence under the 2-Wasserstein metric with a special ν(x, z). It resembles the gradient flow associated with Langevin dynamics [Otto, 2001].
However, Langevin dynamics is governed by a linear Fokker-Planck equation and results in a stochastic update rule, i.e., z^t = z^{t−1} + η∇log p_θ(x, z^{t−1}) + √(2η) ξ^{t−1} with ξ^{t−1} ∼ N(0, 1), thus differing from our deterministic update given the initialization z⁰.
Similar to the optimization embedding, Stein variational gradient descent (SVGD) also exploits a nonlinear Fokker-Planck equation. However, these two gradient flows follow from different PDEs and correspond to different metric spaces, thus also resulting in different deterministic updates. Unlike the optimization embedding, SVGD performs interactive updates between samples and requires keeping a fixed number of samples throughout the whole process.
Connection to adversarial variational Bayes (AVB)  AVB [Mescheder et al., 2017] can also exploit arbitrary flows and avoid the calculations related to the determinant of the Jacobian via a variational technique. Compared with the primal-dual view of the ELBO in CVB, AVB is derived based on classification-based density-ratio estimation for the KL-divergence in the ELBO [Goodfellow et al., 2014, Sugiyama et al., 2012]. The most important difference is that CVB couples the adversarial component with the original model through the optimization embedding, which is flexible enough to approximate the true posterior and promotes the sample efficiency of learning.
Connection to deep unfolding  The optimization embedding is closely related to deep unfolding for inference and learning on graphical models. Existing schemes unfold either a point estimation through optimization [Domke, 2012, Hershey et al., 2014, Chen et al., 2015, Belanger et al., 2017, Chien and Lee, 2018], or expectation-maximization [Greff et al., 2017], or loopy BP [Stoyanov et al., 2011].
In contrast, we exploit the optimization embedding through a flow point-wise, so that it handles the distribution in a nonparametric way and ensures enough flexibility for approximation.
5 Experiments
In this section, we empirically justify the benefits of the proposed coupled variational Bayes in terms of flexibility and sample efficiency, and we illustrate its generative ability. The algorithms are executed on a machine with an Intel Core i7-4790K CPU and GTX 1080Ti GPUs. Additional experimental results, including variants of CVB for discrete latent variable models and more results on real-world datasets, can be found in Appendix D. The implementation is released at https://github.com/Hanjun-Dai/cvb.
5.1 Flexibility in Posterior Approximation
We first justify the flexibility of the optimization embedding in CVB on a simple synthetic dataset [Mescheder et al., 2017]. It contains 4 data points, each representing a one-hot 2 × 2 binary image with non-zero entries at different positions. The generative model is a multivariate independent Bernoulli distribution with a Gaussian prior, i.e., p_θ(x|z) = ∏_{i=1}^4 π_i(z)^{x_i} and p(z) = N(0, I) with z ∈ ℝ², where π_i(z) is parametrized by a 4-layer fully-connected neural network with 64 hidden units in each latent layer. For CVB, we set f(z) = z in the optimization embedding. We emphasize that the optimization embedding is nonparametric and generated automatically via mirror descent. The dual function ν(x, z) is parametrized

Figure 1: Distribution of the latent variables for VAE and CVB on the synthetic dataset. (1) vanilla VAE; (2) CVB.

Figure 2: Convergence speed comparison in terms of the number of epochs on MNIST. We report the objective values of each method on a held-out test set.
The CVB achieves faster convergence than the other competitors in both the r = 8 and r = 32 cases.
by a (4 + 2)-64-64-1 neural network. The number of steps T in the optimization embedding is set to 5 in this case.
To demonstrate the flexibility of the optimization embedding, we compare the proposed CVB with the vanilla VAE using a diagonal Gaussian posterior. The separate encoder in VAE is parametrized by reversing the structure of the decoder. We visualize the posteriors obtained by VAE and CVB in Figure 1. While VAE generates a mixture of 4 Gaussians, consistent with its parametrization assumption, the proposed CVB divides the latent space with a complex distribution. Clearly, this shows that CVB is more flexible in terms of approximation ability.

5.2 Efficiency in Sample Complexity
To verify the sample efficiency of CVB, we compare its performance on the statically binarized MNIST dataset with current state-of-the-art algorithms, including VAE with inverse autoregressive flow (VAE+IAF) [Kingma et al., 2016], adversarial variational Bayes (AVB) [Mescheder et al., 2017], and the vanilla VAE with a Gaussian assumption on the posterior distribution (VAE) [Kingma and Welling, 2013]. In this experiment, we use a Gaussian as the initialization in CVB. We follow the same setting as AVB [Mescheder et al., 2017], where the conditional generative model p(x|z) is a Bernoulli parametrized with a 3-layer convolutional neural network (CNN), and the inference model is also a CNN, parametrized reversely to the generative model. Experiments for AVB and VAE+IAF are conducted based on the code provided by Mescheder et al. [2017]³, where the default neural network structures are adopted. For all methods, in each epoch, the batch size is set to 100 and the initial learning rate to 0.0001.
We illustrate the convergence speed in terms of test objective values against the number of epochs in Figure 2.
As we can see, in both cases, with the dimension of the latent variable r = 8 and r = 32, the proposed CVB, represented by the red curve, converges to a lower test objective value at a much faster speed. We also compare the final approximate log-likelihood, evaluated by importance sampling, with the best baseline results reported in the original papers in Table 1. In this case, the objective function is too optimistic about the actual likelihood. This could be caused by the Monte Carlo estimation of the Fenchel dual of the KL divergence, which is noisy compared to the closed-form KL divergence in the vanilla VAE. We can see that the proposed CVB still performs comparably with the other alternatives. These results justify the benefits of parameter coupling through optimization embedding, especially in high dimensions.

Table 1: The log-likelihood comparison between CVB and competitors on the MNIST dataset. We can see that the proposed CVB achieves comparable performance on the MNIST dataset.

    Methods                    log p(x) ≈
    AVB + AC (8-dim)           −89.6  [Mescheder et al., 2017]
    AVB + AC (32-dim)          −80.2  [Mescheder et al., 2017]
    DRAW + VGP                 −79.9  [Tran et al., 2015]
    VAE + IAF                  −79.1  [Kingma et al., 2016]
    convVAE + HVI (T = 16)     −81.9  [Salimans et al., 2015]
    VAE + HVI (T = 16)         −85.5  [Salimans et al., 2015]
    VAE + NF (T = 80)          −85.1  [Rezende and Mohamed, 2015]
    CVB (8-dim)                −93.5
    CVB (32-dim)               −84.0

3The code can be found at https://github.com/LMescheder/AdversarialVariationalBayes.

5.3 Generative Ability
We conduct experiments on real-world datasets, MNIST and CelebA, to demonstrate the generative ability of the model learned by CVB.
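Generation from such a learned model is plain ancestral sampling: draw z from the prior and push it through the decoder to get Bernoulli means for the pixels. A minimal sketch under toy assumptions — the hypothetical linear-sigmoid `decoder` below stands in for the trained (de)convolutional networks used in these experiments:

```python
import numpy as np

def ancestral_sample(decoder, latent_dim, n, rng=None):
    """Ancestral sampling from a latent-variable model: z ~ N(0, I),
    then x_i ~ Bernoulli(pi_i(z)) independently for each pixel."""
    rng = np.random.default_rng(rng)
    z = rng.standard_normal((n, latent_dim))     # prior draws
    probs = decoder(z)                           # Bernoulli means pi(z)
    return (rng.random(probs.shape) < probs).astype(np.uint8)

# Hypothetical stand-in decoder: sigmoid(W z) mapping a 2-dim latent
# to the Bernoulli means of a 4-pixel binary image.
W = np.array([[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0], [0.0, -4.0]])
decoder = lambda z: 1.0 / (1.0 + np.exp(-z @ W.T))
x_samples = ancestral_sample(decoder, latent_dim=2, n=16, rng=0)  # 16 binary "images"
```

Reconstruction works the same way, except the prior draw is replaced by the latent code inferred for a test image (in CVB, the output of the optimization embedding).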
For additional generated images, please refer to Appendix D.
MNIST We use the model specified in Section 5.2. The images generated and reconstructed by the model learned with the variant of CVB in Appendix B.1, together with the training samples, are illustrated in the first row of Figure 3.
CelebA We use the variant of CVB in Appendix B.4 to train a generative model with a deep deconvolution network on the CelebA dataset, with a 64-dimensional latent space and an N(0, I) prior [Mescheder et al., 2017]. We use a convolutional neural network architecture similar to DCGAN. We illustrate the results in the second row of Figure 3.
We can see that the learned models produce realistic images and reasonable reconstructions on both the MNIST and CelebA datasets.

Figure 3: The training data, randomly generated images, and reconstructed images produced by the CVB-learned models on the MNIST and CelebA datasets. Each row shows (a) training data, (b) randomly generated samples, and (c) reconstructions; in the reconstruction column, the odd rows correspond to the test samples, and the even rows correspond to the reconstructed images.

6 Conclusion

We propose coupled variational Bayes, which is designed based on the primal-dual view of the ELBO and the optimization embedding technique. The primal-dual view of the ELBO allows us to bypass the difficulty of computing the Jacobian for non-invertible transformations and makes it possible to apply arbitrary transformations for variational inference. The optimization embedding technique automatically generates a nonparametric variational distribution and couples it with the original parameters in the generative model, which plays a key role in reducing the sample complexity.
Numerical experiments demonstrate the superiority of CVB in approximation ability, computational efficiency, and sample complexity.
We believe the optimization embedding is an important and general technique, the first of its kind in the literature, and could be of independent interest. We provide several variants of the optimization embedding in Appendix B. It can also be applied to other models, e.g., generative adversarial models and adversarial training, and deserves further investigation.

Acknowledgments

Part of this work was done when BD was with Georgia Tech. NH is supported in part by NSF CCF-1755829 and NSF CMMI-1761699. LS is supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF IIS-1841351 EAGER, NSF CCF-1836822, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA, Amazon AWS, Google Cloud and Siemens.

References

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

David Belanger, Bishan Yang, and Andrew McCallum. End-to-end learning for structured prediction energy networks. arXiv preprint arXiv:1703.05667, 2017.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, second edition, 1999.

Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1765–1773. 2015.

Jen-Tzung Chien and Chao-Hsi Lee. Deep unfolding for topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):318–331, 2018.

Bo Dai, Niao He, Hanjun Dai, and Le Song.
Provable Bayesian inference via particle mirror descent. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 985–994, 2016a.

Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. CoRR, abs/1607.04579, 2016b.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318–326, 2012.

A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.

Samuel Gershman, Matt Hoffman, and David M. Blei. Nonparametric variational inference. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 663–670, New York, NY, USA, 2012. ACM.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In Advances in Neural Information Processing Systems, pages 6694–6704, 2017.

John R Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574, 2014.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

Tommi S. Jaakkola and Michael I. Jordan. Learning in graphical models, chapter Improving the Mean Field Approximation via the Use of Mixture Distributions, pages 163–173. MIT Press, Cambridge, MA, USA, 1999.
ISBN 0-262-60032-3.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

M. I. Jordan, Z. Gharamani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models, pages 105–162. Kluwer Academic, 1998.

Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush. Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550, 2018.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118–3126, 2017.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Joseph Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. In International Conference on Machine Learning, pages 3400–3409, 2018.

Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.

T. Minka. Expectation Propagation for approximate Bayesian inference. PhD thesis, MIT Media Labs, Cambridge, USA, 2001.

Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Dept.
of Computer Science, University of Toronto, 1993. CRG-TR-93-1.

Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574–1609, January 2009. ISSN 1052-6234.

Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. 2001.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

R. T. Rockafellar and R. J-B. Wets. Variational Analysis. Springer Verlag, 1998.

Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.

Alexander Shapiro, Darinka Dentcheva, et al. Lectures on Stochastic Programming: Modeling and Theory, volume 16. SIAM, 2014.

Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 725–733, 2011.

Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979. JMLR Workshop and Conference Proceedings, 2014.

Jakub M Tomczak and Max Welling. Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016.

Dustin Tran, Rajesh Ranganath, and David M Blei.
The variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.

Richard E Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time-series models. Bayesian Time Series Models, 1(3.1):3–1, 2011.

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September 2003.

Arnold Zellner. Optimal Information Processing and Bayes's Theorem. The American Statistician, 42(4), November 1988.