{"title": "Wasserstein Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2473, "page_last": 2482, "abstract": "This paper introduces Wasserstein variational inference, a new form of approximate Bayesian inference based on optimal transport theory. Wasserstein variational inference uses a new family of divergences that includes both f-divergences and the Wasserstein distance as special cases. The gradients of the Wasserstein variational loss are obtained by backpropagating through the Sinkhorn iterations. This technique results in a very stable likelihood-free training method that can be used with implicit distributions and probabilistic programs. Using the Wasserstein variational inference framework, we introduce several new forms of autoencoders and test their robustness and performance against existing variational autoencoding techniques.", "full_text": "Wasserstein Variational Inference\n\nLuca Ambrogioni*\nRadboud University\n\nl.ambrogioni@donders.ru.nl\n\nUmut G\u00fc\u00e7l\u00fc*\n\nRadboud University\n\nu.guclu@donders.ru.nl\n\nYa\u011fmur G\u00fc\u00e7l\u00fct\u00fcrk\nRadboud University\n\ny.gucluturk@donders.ru.nl\n\nMax Hinne\n\nUniversity of Amsterdam\n\nm.hinne@uva.nl\n\nEric Maris\n\nRadboud University\n\ne.maris@donders.ru.nl\n\nMarcel A. J. van Gerven\n\nRadboud University\n\nm.vangerven@donders.ru.nl\n\nAbstract\n\nThis paper introduces Wasserstein variational inference, a new form of approximate\nBayesian inference based on optimal transport theory. Wasserstein variational\ninference uses a new family of divergences that includes both f-divergences and the\nWasserstein distance as special cases. The gradients of the Wasserstein variational\nloss are obtained by backpropagating through the Sinkhorn iterations. This technique\nresults in a very stable likelihood-free training method that can be used with\nimplicit distributions and probabilistic programs. 
Using the Wasserstein variational\ninference framework, we introduce several new forms of autoencoders and test their\nrobustness and performance against existing variational autoencoding techniques.\n\n1 Introduction\n\nVariational Bayesian inference is gaining a central role in machine learning. Modern stochastic\nvariational techniques can be easily implemented using differentiable programming frameworks\n[1\u20133]. As a consequence, complex Bayesian inference is becoming almost as user friendly as deep\nlearning [4, 5]. This is in sharp contrast with old-school variational methods that required model-specific\nmathematical derivations and imposed strong constraints on the possible family of models\nand variational distributions. Given the speed of this transition it is not surprising that modern\nvariational inference research is still influenced by some legacy effects from the days when analytical\ntractability was the main concern. One of the most salient examples of this is the central role of\nthe (reverse) KL divergence [6, 7]. While several other divergence measures have been suggested\n[8\u201312], the reverse KL divergence still dominates both research and applications. Recently, optimal\ntransport divergences such as the Wasserstein distance [13, 14] have gained substantial popularity\nin the generative modeling literature as they can be shown to be well-behaved in several situations\nwhere the KL divergence is either infinite or undefined [15\u201318]. For example, the distribution of\nnatural images is thought to span a sub-manifold of the original pixel space [15]. In these situations\nWasserstein distances are considered to be particularly appropriate because they can be used for\nfitting degenerate distributions that cannot be expressed in terms of densities [15].\nIn this paper we introduce the use of optimal transport methods in variational Bayesian inference. 
To\nthis end, we define the new c-Wasserstein family of divergences, which includes both Wasserstein\nmetrics and all f-divergences (which in turn include both the forward and the reverse KL divergence) as special cases. Using this\nfamily of divergences we introduce the new framework of Wasserstein variational inference, which\nexploits the celebrated Sinkhorn iterations [19, 20] and automatic differentiation. Wasserstein variational\ninference provides a stable gradient-based black-box method for solving Bayesian inference\nproblems even when the likelihood is intractable and the variational distribution is implicit [21, 22].\nImportantly, as opposed to most other implicit variational inference methods [21\u201324], our approach\ndoes not rely on potentially unstable adversarial training [25].\n\n*These authors contributed equally to this paper.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1.1 Background on joint-contrastive variational inference\n\nWe start by briefly reviewing the framework of joint-contrastive variational inference [23, 21, 9].\nFor notational convenience we will express distributions in terms of their densities. Note however\nthat those densities could be degenerate. For example, the density of a discrete distribution can be\nexpressed in terms of delta functions. The posterior distribution of the latent variable z given the\nobserved data x is p(z|x) = p(z, x)/p(x). While the joint probability p(z, x) is usually tractable, the\nevaluation of p(x) often involves an intractable integral or summation. The central idea of variational\nBayesian inference is to minimize a divergence functional between the intractable posterior p(z|x)\nand a tractable parametrized family of variational distributions. This form of variational inference\nis sometimes referred to as posterior-contrastive. 
Conversely, in joint-contrastive inference the\ndivergence to minimize is defined between two structured joint distributions. For example, using the\nreverse KL we have the following cost functional:\n\n$$D_{KL}(q(x, z) \\| p(x, z)) = E_{q(x,z)}\\left[\\log \\frac{q(x, z)}{p(x, z)}\\right] , \\qquad (1)$$\n\nwhere q(x, z) = q(z|x)k(x) is the product between the variational posterior and the sampling\ndistribution of the data. Usually k(x) is approximated as the re-sampling distribution of a finite\ntraining set, as in the case of variational autoencoders (VAE) [26]. The advantage of this joint-contrastive\nformulation is that it does not require the evaluation of the intractable distribution p(z|x).\nJoint-contrastive variational inference can be seen as a generalization of amortized inference [21].\n\n1.2 Background on optimal transport\n\nIntuitively speaking, optimal transport divergences quantify the distance between two probability\ndistributions as the cost of transporting probability mass from one to the other. Let \u0393[p, q] be the\nset of all bivariate probability measures on the product space X \u00d7 X whose marginals are p and q\nrespectively. An optimal transport divergence is defined by the following optimization:\n\n$$W_c(p, q) = \\inf_{\\gamma \\in \\Gamma[p,q]} \\int c(x_1, x_2)\\, d\\gamma(x_1, x_2) , \\qquad (2)$$\n\nwhere c(x1, x2) is the cost of transporting probability mass from x1 to x2. When the cost is a metric\nfunction the resulting divergence is a proper distance and it is usually referred to as the Wasserstein\ndistance. We will denote the Wasserstein distance as W (p, q).\nSolving the optimization problem in Eq. 2 has super-cubic complexity. 
Recent\nwork showed that this complexity can be greatly reduced by adopting entropic regularization [20].\nWe begin by defining a new set of joint distributions:\n\n$$U_{\\epsilon}[p, q] = \\{ \\gamma \\in \\Gamma[p, q] \\mid D_{KL}(\\gamma(x, y) \\| p(x) q(y)) \\leq \\epsilon^{-1} \\} . \\qquad (3)$$\n\nThese distributions are characterized by having the mutual information between the two variables\nbounded by the regularization parameter $\\epsilon^{-1}$. Using this family of distributions we can define the\nentropy regularized optimal transport divergence:\n\n$$W_{c,\\epsilon}(p, q) = \\inf_{u \\in U_{\\epsilon}[p,q]} \\int c(x_1, x_2)\\, du(x_1, x_2) . \\qquad (4)$$\n\nThis regularization turns the optimal transport into a strictly convex problem. When p and q are\ndiscrete distributions the regularized optimal transport cost can be efficiently obtained using the\nSinkhorn iterations [19, 20]. The $\\epsilon$-regularized optimal transport divergence is then given by:\n\n$$W_{c,\\epsilon}(p, q) = \\lim_{t \\to \\infty} S^{\\epsilon}_t[p, q, c] , \\qquad (5)$$\n\nwhere the function $S^{\\epsilon}_t[p, q, c]$ gives the output of the t-th Sinkhorn iteration. The pseudocode of\nthe Sinkhorn iterations is given in Algorithm 1. Note that all the operations in this algorithm are\ndifferentiable.\n\nAlgorithm 1 Sinkhorn Iterations. 
C: Cost matrix, t: Number of iterations, \u03b5: Regularization strength\nprocedure SINKHORN(C, t, \u03b5)\n    K = exp(\u2212C/\u03b5), n, m = shape(C)\n    r = ones(n, 1)/n, c = ones(m, 1)/m, u_0 = r, \u03c4 = 0\n    while \u03c4 \u2264 t do\n        a = K^T u_\u03c4    \u25b7 Juxtaposition denotes matrix product\n        b = c/a    \u25b7 \"/\" denotes entrywise division\n        u_{\u03c4+1} = r/(K b), \u03c4 = \u03c4 + 1\n    v = c/(K^T u_t), S^\u03b5_t = sum(u_t \u2217 (K \u2217 C) v)    \u25b7 \"*\" denotes entrywise product\n    return S^\u03b5_t\n\n2 Wasserstein variational inference\n\nWe can now introduce the new framework of Wasserstein variational inference for general-purpose\napproximate Bayesian inference. We begin by introducing a new family of divergences that includes\nboth optimal transport divergences and f-divergences as special cases. Subsequently, we develop a\nblack-box and likelihood-free variational algorithm based on automatic differentiation through the\nSinkhorn iterations.\n\n2.1 c-Wasserstein divergences\n\nTraditional divergence measures such as the KL divergence depend explicitly on the distributions\np and q. Conversely, optimal transport divergences depend on p and q only through the constraints\nof an optimization problem. We will now introduce the family of c-Wasserstein divergences that\ngeneralize both forms of dependencies. A c-Wasserstein divergence has the following form:\n\n$$W_C(p, q) = \\inf_{\\gamma \\in \\Gamma[p,q]} \\int C^{p,q}(x_1, x_2)\\, d\\gamma(x_1, x_2) , \\qquad (6)$$\n\nwhere the real-valued functional $C^{p,q}(x_1, x_2)$ depends both on the two scalars x1 and x2 and on the\ntwo distributions p and q. Note that we are writing this dependency in terms of the densities only for\nnotational convenience and that this dependency should be interpreted in terms of distributions. The\ncost functional $C^{p,q}(x_1, x_2)$ is assumed to respect the following requirements:\n\n1. $C^{p,p}(x_1, x_2) \\geq 0, \\forall x_1, x_2 \\in \\mathrm{supp}(p)$\n2. 
$C^{p,p}(x, x) = 0, \\forall x \\in \\mathrm{supp}(p)$\n3. $E_{\\gamma}[C^{p,q}(x_1, x_2)] \\geq 0, \\forall \\gamma \\in \\Gamma[p, q]$ ,\n\nwhere supp(p) denotes the support of the distribution p. From these requirements we can derive the\nfollowing theorem:\nTheorem 1. The functional $W_C(p, q)$ is a (pseudo-)divergence, meaning that $W_C(p, q) \\geq 0$ for all\np and q and $W_C(p, p) = 0$ for all p.\n\nProof. From property 1 and property 2 it follows that, when p is equal to q, $C^{p,p}(x_1, x_2)$ is a non-negative\nfunction of x1 and x2 that vanishes when x1 = x2. In this case, the infimum in Eq. 6 is\nattained by the diagonal transport $\\gamma(x_1, x_2) = p(x_1) \\delta(x_1 - x_2)$. In fact:\n\n$$W_C(p, p) = \\int\\!\\!\\int C^{p,p}(x_1, x_2)\\, p(x_1)\\, \\delta(x_1 - x_2)\\, dx_1\\, dx_2 = \\int C^{p,p}(x_1, x_1)\\, p(x_1)\\, dx_1 = 0 . \\qquad (7)$$\n\nThis is a global minimum since property 3 implies that $W_C(p, q)$ is always non-negative.\n\nAll optimal transport divergences are part of the c-Wasserstein family, where $C^{p,q}(x_1, x_2)$ reduces to a\nnon-negative valued function c(x1, x2) independent from p and q.\nProving property 3 for an arbitrary cost functional can be a challenging task. The following theorem\nprovides a criterion that is often easier to verify:\n\nTheorem 2. Let f : R \u2192 R be a convex function such that f (1) = 0. The cost functional\n$C^{p,q}(x, y) = f(g(x, y))$ respects property 3 when $E_{\\gamma}[g(x, y)] = 1$ for all \u03b3 \u2208 \u0393[p, q].\n\nProof. The result follows directly from Jensen\u2019s inequality, since $E_{\\gamma}[f(g(x, y))] \\geq f(E_{\\gamma}[g(x, y)]) = f(1) = 0$.\n\n2.2 Stochastic Wasserstein variational inference\n\nWe can now introduce the general framework of Wasserstein variational inference. The loss functional\nis a c-Wasserstein divergence between p(x, z) and q(x, z):\n\n$$L_C[p, q] = W_C(p(z, x), q(z, x)) = \\inf_{\\gamma \\in \\Gamma[p,q]} \\int C^{p,q}(x_1, z_1; x_2, z_2)\\, d\\gamma(x_1, z_1; x_2, z_2) . \\qquad (8)$$\n\nFrom Theorem 1 it follows that this variational loss is always minimized when p is equal to q. 
Note\nthat we are allowing members of the c-Wasserstein divergence family to be pseudo-divergences,\nmeaning that $L_C[p, q]$ could be 0 even if p \u2260 q. It is sometimes convenient to work with pseudo-divergences\nwhen some features of the data are not deemed to be relevant.\nWe can now derive a black-box Monte Carlo estimate of the gradient of Eq. 8 that can be used together\nwith gradient-based stochastic optimization methods [27]. A Monte Carlo estimator of Eq. 8 can be\nobtained by computing the discrete c-Wasserstein divergence between two empirical distributions:\n\n$$L_C[p_n, q_n] = \\inf_{\\gamma} \\sum_{j,k} C^{p,q}(x_1^{(j)}, z_1^{(j)}, x_2^{(k)}, z_2^{(k)})\\, \\gamma(x_1^{(j)}, z_1^{(j)}, x_2^{(k)}, z_2^{(k)}) , \\qquad (9)$$\n\nwhere $(x_1^{(j)}, z_1^{(j)})$ and $(x_2^{(k)}, z_2^{(k)})$ are sampled from p(x, z) and q(x, z) respectively. In the case of\nthe Wasserstein distance, we can show that this estimator is asymptotically unbiased:\nTheorem 3. Let $W(p_n, q_n)$ be the Wasserstein distance between two empirical distributions $p_n$ and\n$q_n$. For n tending to infinity, there is a positive number s such that\n\n$$E_{pq}[W(p_n, q_n)] \\lesssim W(p, q) + n^{-1/s} . \\qquad (10)$$\n\nProof. Using the triangle inequality and the linearity of the expectation we obtain:\n\n$$E_{pq}[W(p_n, q_n)] \\leq E_p[W(p_n, p)] + W(p, q) + E_q[W(q, q_n)] . \\qquad (11)$$\n\nIn [28] it was proven that for any distribution u:\n\n$$E_u[W(u_n, u)] \\leq n^{-1/s_u} , \\qquad (12)$$\n\nwhen $s_u$ is larger than the upper Wasserstein dimension (see definition 4 in [28]). The result follows\nwith $s = \\max(s_p, s_q)$.\n\nUnfortunately the Monte Carlo estimator is biased for finite values of n. In order to eliminate the bias\nwhen p is equal to q, we use the following modified loss:\n\n$$\\tilde{L}_C[p_n, q_n] = L_C[p_n, q_n] - (L_C[p_n, p_n] + L_C[q_n, q_n])/2 . \\qquad (13)$$\n\nIt is easy to see that the expectation of this new loss is zero when p is equal to q. 
Furthermore:\n\n$$\\lim_{n \\to \\infty} \\tilde{L}_C[p_n, q_n] = L_C[p, q] . \\qquad (14)$$\n\nAs we discussed in Section 1.2, the entropy-regularized version of the optimal transport cost in Eq. 9\ncan be approximated by truncating the Sinkhorn iterations. Importantly, the Sinkhorn iterations are\ndifferentiable and consequently we can compute the gradient of the loss using automatic differentiation\n[17]. The approximated gradient of the $\\epsilon$-regularized loss can be written as\n\n$$\\nabla L_C[p_n, q_n] = \\nabla S^{\\epsilon}_t[p_n, q_n, C^{p,q}] , \\qquad (15)$$\n\nwhere the function $S^{\\epsilon}_t[p_n, q_n, C^{p,q}]$ is the output of t steps of the Sinkhorn algorithm with regularization\n$\\epsilon$ and cost function $C^{p,q}$. Note that the cost is a functional of p and q and consequently the\ngradient contains the term $\\nabla C^{p,q}$. Also note that this approximation converges to the real gradient of\nEq. 8 for n \u2192 \u221e and \u03b5 \u2192 0 (however the Sinkhorn algorithm becomes unstable when \u03b5 \u2192 0).\n\n3 Examples of c-Wasserstein divergences\n\nWe will now introduce two classes of c-Wasserstein divergences that are suitable for deep Bayesian\nvariational inference. Moreover, we will show that the KL divergence and all f-divergences are part\nof the c-Wasserstein family.\n\n3.1 A metric divergence for latent spaces\n\nIn order to apply optimal transport divergences to a Bayesian variational problem we need to assign a\nmetric, or more generally a transport cost, to the latent space of the Bayesian model. The geometry of\nthe latent space should depend on the geometry of the observable space since differences in the latent\nspace are only meaningful as far as they correspond to differences in the observables. 
The simplest\nway to assign a geometric transport cost to the latent space is to pull back a metric function from the\nobservable space:\n\n$$C^{p}_{PB}(z_1, z_2) = d_x(g_p(z_1), g_p(z_2)) , \\qquad (16)$$\n\nwhere $d_x(x_1, x_2)$ is a metric function in the observable space and $g_p(z)$ is a deterministic function\nthat maps z to the expected value of p(x|z). In our notation the subscript p in $g_p$ denotes the fact\nthat the distribution p(z|x) and the function $g_p$ depend on a common set of parameters which are\noptimized during variational inference. The resulting pullback cost function is a proper metric if $g_p$ is\na diffeomorphism (i.e. a differentiable map with differentiable inverse) [29].\n\n3.2 Autoencoder divergences\n\nAnother interesting special case of c-Wasserstein divergence can be obtained by considering the\ndistribution of the residuals of an autoencoder. Consider the case where the expected value of q(z|x)\nis given by the deterministic function $h_q(x)$. We can define the latent autoencoder cost functional as\nthe transport cost between the latent residuals of the two models:\n\n$$C^{q}_{LA}(x_1, z_1; x_2, z_2) = d(z_1 - h_q(x_1), z_2 - h_q(x_2)) , \\qquad (17)$$\n\nwhere d is a distance function. It is easy to check that this cost functional defines a proper c-Wasserstein\ndivergence since it is non-negative valued and it is equal to zero when p is equal to q\nand x1, z1 are equal to x2, z2. Similarly, we can define the observable autoencoder cost functional as\nfollows:\n\n$$C^{p}_{OA}(x_1, z_1; x_2, z_2) = d(x_1 - g_p(z_1), x_2 - g_p(z_2)) , \\qquad (18)$$\n\nwhere again $g_p(z)$ gives the expected value of the generator. In the case of a deterministic generator,\nthis expression reduces to\n\n$$C^{p}_{OA}(x_1, z_1; x_2, z_2) = d(0, x_2 - g_p(z_2)) . \\qquad (19)$$\n\nNote that the transport optimization is trivial in this special case since the cost does not depend on x1\nand z1. 
In this case, the resulting divergence is just the average reconstruction error:\n\n$$\\inf_{\\gamma \\in \\Gamma[p,q]} \\int d(0, x_2 - g_p(z_2))\\, d\\gamma = E_{q(x,z)}[d(0, x - g_p(z))] . \\qquad (20)$$\n\nAs expected, this is a proper (pseudo-)divergence since it is non-negative valued and $x - g_p(z)$ is\nalways equal to zero when x and z are sampled from p(x, z).\n\n3.3 f-divergences\n\nWe can now show that all f-divergences are part of the c-Wasserstein family. Consider the following\ncost functional:\n\n$$C^{p,q}_{f}(x_1, x_2) = f\\left(\\frac{p(x_2)}{q(x_2)}\\right) , \\qquad (21)$$\n\nwhere f is a convex function such that f (1) = 0. From Theorem 2 it follows that this cost functional\ndefines a valid c-Wasserstein divergence. We can now show that the c-Wasserstein divergence defined\nby this functional is the f-divergence defined by f. In fact\n\n$$\\inf_{\\gamma \\in \\Gamma[p,q]} \\int f\\left(\\frac{p(x_2)}{q(x_2)}\\right) d\\gamma(x_1, x_2) = E_{q(x_2)}\\left[f\\left(\\frac{p(x_2)}{q(x_2)}\\right)\\right] , \\qquad (22)$$\n\nsince $q(x_2)$ is the marginal of all $\\gamma(x_1, x_2)$ in \u0393[p, q].\n\n4 Wasserstein variational autoencoders\n\nWe will now use the concepts developed in the previous sections in order to define a new form of\nautoencoder. VAEs are generative deep amortized Bayesian models where the parameters of both the\nprobabilistic model and the variational model are learned by minimizing a joint-contrastive divergence\n[26, 30, 31]. Let $D_p$ and $D_q$ be parametrized probability distributions and $g_p(z)$ and $h_q(x)$ be the\noutputs of deep networks determining the parameters of these distributions. 
The probabilistic model\n(decoder) of a VAE has the following form:\n\n$$p(z, x) = D_p(x | g_p(z))\\, p(z) . \\qquad (23)$$\n\nThe variational model (encoder) is given by:\n\n$$q(z, x) = D_q(z | h_q(x))\\, k(x) . \\qquad (24)$$\n\nWe can define a large family of objective functions of VAEs by combining the cost functionals defined\nin the previous section. The general form is given by the following total autoencoder cost functional:\n\n$$C^{p,q}_{w,f}(x_1, z_1; x_2, z_2) = w_1 d_x(x_1, x_2) + w_2 C^{p}_{PB}(z_1, z_2) + w_3 C^{q}_{LA}(x_1, z_1; x_2, z_2) + w_4 C^{p}_{OA}(x_1, z_1; x_2, z_2) + w_5 C^{p,q}_{f}(x_1, z_1; x_2, z_2) , \\qquad (25)$$\n\nwhere w is a vector of non-negative valued weights, $d_x(x_1, x_2)$ is a metric on the observable space\nand f is a convex function.\n\n5 Connections with related methods\n\nIn the previous sections we showed that variational inference based on f-divergences is a special\ncase of Wasserstein variational inference. We will now discuss several theoretical links with some recent\nvariational methods.\n\n5.1 Operator variational inference\n\nWasserstein variational inference can be shown to be a special case of a generalized version of\noperator variational inference [10]. The (amortized) operator variational objective is defined as\nfollows:\n\n$$L_{OP} = \\sup_{f \\in F} \\zeta(E_{q(x,z)}[O^{p,q} f]) , \\qquad (26)$$\n\nwhere F is a set of test functions and \u03b6(\u00b7) is a positive valued function. The dual representation of the\noptimization problem in the c-Wasserstein loss (Eq. 
6) is given by the following expression:\n\n$$W_C(p, q) = \\sup_{f \\in L_C[p,q]} \\left[ E_{p(x,z)}[f(x, z)] - E_{q(x,z)}[f(x, z)] \\right] , \\qquad (27)$$\n\nwhere\n\n$$L_C[p, q] = \\{ f : X \\to R \\mid f(x_1, z_1) - f(x_2, z_2) \\leq C^{p,q}(x_1, z_1; x_2, z_2) \\} . \\qquad (28)$$\n\nConverting the expectation over p to an expectation over q using importance sampling, we obtain the\nfollowing expression:\n\n$$W_C(p, q) = \\sup_{f \\in L_C[p,q]} E_{q(x,z)}\\left[\\left(\\frac{p(x, z)}{q(x, z)} - 1\\right) f(x, z)\\right] , \\qquad (29)$$\n\nwhich has the same form as the operator variational loss in Eq. 26 with \u03b6(x) = x and $O^{p,q} = p/q - 1$.\nNote that the fact that \u03b6(\u00b7) is not positive valued is irrelevant since the optimum of Eq. 27 is always\nnon-negative. This is a generalized form of operator variational loss where the functional family can\nnow depend on p and q. In the case of optimal transport divergences, where $C^{p,q}(x_1, z_1; x_2, z_2) = c(x_1, z_1; x_2, z_2)$,\nthe resulting loss is a special case of the regular operator variational loss.\n\n5.2 Wasserstein autoencoders\n\nThe recently introduced Wasserstein autoencoder (WAE) uses a regularized optimal transport divergence\nbetween p(x) and k(x) in order to train a generative model [32]. The regularized loss has the\nfollowing form:\n\n$$L_{WA} = E_{q(x,z)}[c_x(x, g_p(z))] + \\lambda D(p(z) \\| q(z)) , \\qquad (30)$$\n\nwhere $c_x$ does not depend on p and q and $D(p(z) \\| q(z))$ is an arbitrary divergence. This loss was\nnot derived from a variational Bayesian inference problem. Instead, the WAE loss is derived as a\nrelaxation of an optimal transport loss between p(x) and k(x):\n\n$$L_{WA} \\approx W_{c_x}(p(x), k(x)) . \\qquad (31)$$\n\nWhen $D(p(z) \\| q(z))$ is a c-Wasserstein divergence, we can show that $L_{WA}$ is a Wasserstein\nvariational inference loss and consequently that Wasserstein autoencoders are approximate Bayesian\nmethods. In fact:\n\n$$E_{q(x,z)}[c_x(x, g_p(z))] + \\lambda W_{C_z}(p(z), q(z)) = \\inf_{\\gamma \\in \\Gamma[p,q]} \\int \\left[ c_x(x_2, g_p(z_2)) + \\lambda C^{p,q}_{z}(z_1, z_2) \\right] d\\gamma . 
\\qquad (32)$$\n\nIn the original paper the regularization term $D(p(z) \\| q(z))$ is either the Jensen-Shannon divergence\n(optimized using adversarial training) or the maximum mean discrepancy (optimized using a reproducing\nkernel Hilbert space estimator). Our reformulation suggests another way of training the latent\nspace using a metric optimal transport divergence and the Sinkhorn iterations.\n\n6 Experimental evaluation\n\nWe will now demonstrate experimentally the effectiveness and robustness of Wasserstein variational\ninference. We focused our analysis on variational autoencoding problems on the MNIST dataset. We\ndecided to use simple deep architectures and to avoid any form of structural and hyper-parameter\noptimization for three main reasons. First and foremost, our main aim is to show that Wasserstein\nvariational inference works off-the-shelf without user tuning. Second, it allows us to run a large\nnumber of analyses and consequently to systematically investigate the performance of several variants\nof the Wasserstein autoencoder on several datasets. Finally, it minimizes the risk of inducing a bias\nthat disfavors the baselines. In our first experiment, we assessed the performance of our Wasserstein\nvariational autoencoder against VAE, ALI and WAE. We used the same neural architecture for all\nmodels. The generative models were parametrized by three-layered fully connected networks (100-300-500-1568)\nwith ReLU nonlinearities in the hidden layers. Similarly, the variational models\nwere parametrized by three-layered ReLU networks (784-500-300-100). The cost functional of our\nWasserstein variational autoencoder (see Eq. 25) had the weights w1, w2, w3 and w4 different from\nzero. Conversely, in this experiment w5 was set to zero, meaning that we did not use an f-divergence\ncomponent. We refer to this model as 1111. We trained 1111 using t = 20 Sinkhorn iterations. 
We\nevaluated three performance metrics: 1) mean squared reconstruction error in the latent space, 2)\npixelwise mean squared reconstruction error in the image space and 3) sample quality estimated as the\nsmallest Euclidean distance with an image in the validation set. Variational autoencoders are known\nto be sensitive to the fine tuning of the parameter regulating the relative contribution of the latent and\nthe observable component of the loss. For each method, we optimized this parameter on the validation\nset. VAE, ALI and WAE losses have a single independent parameter \u03b1: the relative contribution\nof the two components of the loss (VAE-loss/WAE-loss = \u03b1*latent-loss + (1 - \u03b1)*observable-loss,\nALI-loss = \u03b1*generator-loss + (1 - \u03b1)*discriminator-loss). In the case of our 1111 model\nwe reduced the optimization to a single parameter by giving equal weights to the two latent and the\ntwo observable losses (loss = \u03b1*latent-loss + (1 - \u03b1)*observable-loss). We estimated the errors of\nall methods with respect to all metrics with \u03b1 ranging from 0.1 to 0.9 in steps of 0.1. For each\nmodel we selected an optimal value of \u03b1 by minimizing the sum of the three error metrics in the\nvalidation set (individually re-scaled by z-scoring). Fig. 1 shows the test set squared errors both in\nthe latent and in the observable space for the optimized models. Our model has better performance\nthan both VAE and ALI with respect to all error metrics. WAE has lower observable error but higher\nsample error and slightly higher latent error. All differences are statistically significant (p<0.001,\npaired t-test).\nIn our second experiment we tested several other forms of Wasserstein variational autoencoders\non three different datasets. 
We denote different versions of our autoencoder with a binary string\ndenoting which weights were set to zero or one.\n\nFigure 1: Comparison between Wasserstein variational inference, VAE, ALI and WAE.\n\nTable 1: Detailed analysis on MNIST, Fashion-MNIST and Quick Sketch.\n\n         MNIST                        Fashion-MNIST                Quick Sketch\nModel    Latent  Observable  Sample   Latent  Observable  Sample   Latent  Observable  Sample\nALI      1.0604  0.1419      0.0631   1.0179  0.1210      0.0564   1.0337  0.3477      0.1157\nVAE      1.1807  0.0406      0.1766   1.7671  0.0214      0.0567   0.9445  0.0758      0.0687\n1001     0.9256  0.0710      0.0448   0.9453  0.0687      0.0277   0.9777  0.1471      0.0654\n0110     1.0052  0.0227      0.0513   1.4886  0.0244      0.0385   0.8894  0.0568      0.0743\n0011     1.0030  0.0273      0.0740   1.0033  0.0196      0.0447   1.0016  0.0656      0.1204\n1100     1.0145  0.0268      0.0483   1.3748  0.0246      0.0291   1.0364  0.0554      0.0736\n1111     0.8991  0.0293      0.0441   0.9053  0.0258      0.0297   0.8822  0.0642      0.0699\nh-ALI    0.8865  0.0289      0.0462   0.9026  0.0260      0.0300   0.8961  0.0674      0.0682\nh-VAE    0.9007  0.0292      0.0442   0.9072  0.0227      0.0306   0.8983  0.0638      0.0677\n\nFor example, we denote the purely metric\nversion without autoencoder divergences as 1100. We also included two hybrid models obtained by\ncombining our loss (1111) with the VAE and the ALI losses. These methods are special cases of\nWasserstein variational autoencoders with non-zero w5 weight and where the f function is chosen to\ngive either the reverse KL divergence or the Jensen-Shannon divergence respectively. Note that this\nfifth component of the loss was not obtained from the Sinkhorn iterations. As can be seen in Table 1,\nmost versions of the Wasserstein variational autoencoder perform better than both VAE and ALI on\nall datasets. The 0011 version has good reconstruction errors but significantly lower sample quality as it does\nnot explicitly train the marginal distribution of x. 
Interestingly, the purely metric 1100 version has a\nsmall reconstruction error even if the cost functional is solely defined in terms of the marginals over x\nand z. Also interestingly, the hybrid methods h-VAE and h-ALI perform well. This result\nis promising as it suggests that the Sinkhorn loss can be used for stabilizing adversarial methods.\n\nFigure 2: Observable reconstructions (A) and samples (B). Panel labels: ALI, VAE, 1111, Data.\n\n7 Conclusions\n\nIn this paper we showed that Wasserstein variational inference offers an effective and robust method\nfor black-box (amortized) variational Bayesian inference. Importantly, Wasserstein variational inference\nis a likelihood-free method and can be used together with implicit variational distributions and\ndifferentiable variational programs [22, 21]. These features make Wasserstein variational inference\nparticularly suitable for probabilistic programming, where the aim is to combine declarative general\npurpose programming and automatic probabilistic inference.\n\nReferences\n[1] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303\u20131347, 2013.\n\n[2] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. International Conference on Artificial Intelligence and Statistics, 2014.\n\n[3] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.\n\n[4] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430\u2013474, 2017.\n\n[5] D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.\n\n[6] D. M. Blei, A. 
Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859\u2013877, 2017.\n\n[7] C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt. Advances in variational inference. arXiv preprint arXiv:1711.05597, 2017.\n\n[8] Y. Li and R. E. Turner. R\u00e9nyi divergence variational inference. Advances in Neural Information Processing Systems, 2016.\n\n[9] L. Ambrogioni, U. G\u00fc\u00e7l\u00fc, J. Berezutskaya, E. W.P. van den Borne, Y. G\u00fc\u00e7l\u00fct\u00fcrk, M. Hinne, E. Maris, and M. A.J. van Gerven. Forward amortized inference for likelihood-free variational marginalization. arXiv preprint arXiv:1805.11542, 2018.\n\n[10] R. Ranganath, D. Tran, J. Altosaar, and D. Blei. Operator variational inference. Advances in Neural Information Processing Systems, 2016.\n\n[11] A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei. Variational inference via chi upper bound minimization. Advances in Neural Information Processing Systems, 2017.\n\n[12] R. Bamler, C. Zhang, M. Opper, and S. Mandt. Perturbative black box variational inference. Advances in Neural Information Processing Systems, pages 5086\u20135094, 2017.\n\n[13] C. Villani. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.\n\n[14] G. Peyr\u00e9 and M. Cuturi. Computational Optimal Transport. arXiv preprint arXiv:1803.00567, 2018.\n\n[15] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. International Conference on Machine Learning, 2017.\n\n[16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems, 2017.\n\n[17] A. Genevay, G. Peyr\u00e9, and M. Cuturi. Learning generative models with Sinkhorn divergences. International Conference on Artificial Intelligence and Statistics, pages 1608\u20131617, 2018.\n\n[18] G. 
Montavon, K. M\u00fcller, and M. Cuturi. Wasserstein training of restricted Boltzmann machines. Advances in Neural Information Processing Systems, 2016.\n\n[19] R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343\u2013348, 1967.\n\n[20] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 2013.\n\n[21] F. Husz\u00e1r. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.\n\n[22] D. Tran, R. Ranganath, and D. M. Blei. Hierarchical implicit models and likelihood-free variational inference. arXiv preprint arXiv:1702.08896, 2017.\n\n[23] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. International Conference on Learning Representations, 2017.\n\n[24] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.\n\n[25] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. International Conference on Learning Representations, 2017.\n\n[26] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.\n\n[27] D. Fouskakis and D. Draper. Stochastic optimization: A review. International Statistical Review, 70(3):315\u2013349, 2002.\n\n[28] J. Weed and F. Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.\n\n[29] D. Burago, I. D. Burago, and S. Ivanov. A Course in Metric Geometry, volume 33. American Mathematical Society, 2001.\n\n[30] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. 
Variational autoencoder for deep learning of images, labels and captions. Advances in Neural Information Processing Systems, 2016.\n\n[31] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.\n\n[32] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. International Conference on Learning Representations, 2018.\n", "award": [], "sourceid": 1244, "authors": [{"given_name": "Luca", "family_name": "Ambrogioni", "institution": "Donders Institute"}, {"given_name": "Umut", "family_name": "G\u00fc\u00e7l\u00fc", "institution": "Donders Institute for Brain, Cognition and Behaviour, Radboud University"}, {"given_name": "Ya\u011fmur", "family_name": "G\u00fc\u00e7l\u00fct\u00fcrk", "institution": "Donders Institute for Brain, Cognition and Behaviour, Radboud University"}, {"given_name": "Max", "family_name": "Hinne", "institution": "University of Amsterdam"}, {"given_name": "Marcel", "family_name": "van Gerven", "institution": "Radboud Universiteit"}, {"given_name": "Eric", "family_name": "Maris", "institution": "Donders Institute"}]}