{"title": "On the Convergence and Robustness of Training GANs with Regularized Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 7091, "page_last": 7101, "abstract": "Generative Adversarial Networks (GANs) are one of the most practical methods for learning data distributions. A popular GAN formulation is based on the use of Wasserstein distance as a metric between probability distributions. Unfortunately, minimizing the Wasserstein distance between the data distribution and the generative model distribution is a computationally challenging problem as its objective is non-convex, non-smooth, and even hard to compute. In this work, we show that obtaining gradient information of the smoothed Wasserstein GAN formulation, which is based on regularized Optimal Transport (OT), is computationally effortless and hence one can apply first order optimization methods to minimize this objective. Consequently, we establish theoretical convergence guarantee to stationarity for a proposed class of GAN optimization algorithms. Unlike the original non-smooth formulation, our algorithm only requires solving the discriminator to approximate optimality. We apply our method to learning MNIST digits as well as CIFAR-10 images. Our experiments show that our method is computationally efficient and generates images comparable to the state of the art algorithms given the same architecture and computational power.", "full_text": "On the Convergence and Robustness of Training\n\nGANs with Regularized Optimal Transport\n\nMaziar Sanjabi\n\nUniversity of Southern California\n\nsanjabi@usc.edu\n\nMeisam Razaviyayn\n\nUniversity of Southern California\n\nrazaviya@usc.edu\n\nJimmy Ba\n\nUniversity of Toronto\n\njimmy@cs.toronto.edu\n\nJason D. Lee\n\nUniversity of Southern California\njasonlee@marshall.usc.edu\n\nAbstract\n\nGenerative Adversarial Networks (GANs) are one of the most practical methods\nfor learning data distributions. 
A popular GAN formulation is based on the use of Wasserstein distance as a metric between probability distributions. Unfortunately, minimizing the Wasserstein distance between the data distribution and the generative model distribution is a computationally challenging problem as its objective is non-convex, non-smooth, and even hard to compute. In this work, we show that obtaining gradient information of the smoothed Wasserstein GAN formulation, which is based on regularized Optimal Transport (OT), is computationally effortless and hence one can apply first order optimization methods to minimize this objective. Consequently, we establish a theoretical convergence guarantee to stationarity for a proposed class of GAN optimization algorithms. Unlike the original non-smooth formulation, our algorithm only requires solving the discriminator to approximate optimality. We apply our method to learning MNIST digits as well as CIFAR-10 images. Our experiments show that our method is computationally efficient and generates images comparable to the state of the art algorithms given the same architecture and computational power.

1 Introduction

Generative Adversarial Networks (GANs) have gained popularity for unsupervised learning due to their unique ability to learn the generation of realistic samples. In the absence of labels, GANs aim at finding the mapping from a known distribution, e.g. Gaussian, to an unknown data distribution, which is only represented by empirical samples. In order to measure the mapping quality, various metrics between probability distributions have been proposed. While the original work on GANs proposed the Jensen-Shannon distance [21], other works have proposed other metrics such as f-divergences [37, 16]. Recently, a seminal work by [4] re-surged the Wasserstein distance [44] as a metric for measuring the distance between distributions.
One major advantage of the Wasserstein distance, compared to Jensen-Shannon, is its continuity. In addition, the Wasserstein distance is differentiable with respect to the generator parameters almost everywhere [4]. As a result, it is more appealing from an optimization perspective. From the generator's perspective, however, this objective function is not smooth with respect to the generator parameters. As we will see in this paper, this non-smoothness results in difficulties for optimization algorithms. Hence, we propose to use a smooth surrogate for the Wasserstein distance.

The introduction of the Wasserstein distance as a metric for GANs re-surged interest in the field of optimal transport [44]. [4] provided a game representation for their proposed Wasserstein GAN formulation based on the dual form of the resulting optimal transport problem. In this game representation, the discriminator is comprised of a 1-Lipschitz function and aims at differentiating between real and fake samples. From an optimization perspective, enforcing the Lipschitz constraint is challenging. Therefore, many heuristics, such as weight clipping [4] and gradient norm penalty [22], have been developed to impose such a constraint on the discriminator. As we will see, our smoothed surrogate objective results in a natural penalty term that softly imposes various constraints such as Lipschitzness.

Studying the convergence of algorithms for optimizing GANs is an active area of research. The algorithms and analyses developed for optimizing GANs can be divided into two general categories based on the amount of effort spent on solving the discriminator problem. In the first category, which puts the same emphasis on the discriminator and the generator problem, a simultaneous or successive generator-discriminator (stochastic) gradient descent-ascent update is used for solving both problems.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
These approaches are inspired by the mirror/proximal gradient descent methods developed for solving convex-concave games [34]. Although the GAN problem does not conform to the convex-concave paradigm in general, researchers have found this procedure successful in some practical GANs, and unsuccessful in others [30]. The theoretical convergence guarantees for these methods are local and based on limiting assumptions which are typically not satisfied/verifiable in almost all practical GANs. More precisely, they either assume some (local) stability of the iterates or a local/global convex-concave structure [33, 31, 14]. In all of these works, similar to our setting, some form of regularization is necessary for obtaining convergence. Compared to these methods, we do not limit the number of discriminator steps to one, but our convergence is global and only depends on the quality of the discriminator.

As opposed to the first line of work, some developed algorithms put more emphasis on the discriminator problem [32, 27, 23]. For example, [27] proved global convergence to optimality for a first order update rule on the specific problem of learning a one dimensional mixture of two Gaussians when the discriminator problem is solved to global optimality at each step. Another line of analysis, which also prioritizes the discriminator problem, is based on the strategy of learning the discriminator much faster than the generator [23]. These analyses, which are inspired by variants of the two time scale dynamic proposed by [8], do not directly require convex-concavity of the objective function. However, they require some kind of local (global) stability which is difficult to achieve unless there is a local (global) convex-concave structure.
Compared to these results, our convergence analysis is agnostic to the method used for solving the discriminator problem, provided that the discriminator problem is solved to optimality within some accuracy. Based on our analysis, the amount of this accuracy then dictates the closeness to stationarity. Therefore, our result suggests that in order to obtain a better quality solution, it is not enough to just increase the number of steps for the generator; one also needs to maintain high enough accuracy in the discriminator. It also suggests that the simple descent-ascent update rule might not converge, as has also been observed before in the literature; see, e.g., [29]. Therefore, one should use algorithms similar to the two time scale approaches that give the discriminator an increasingly larger advantage. Note that, unlike [27], we do not assume a perfect discriminator, which is not feasible in practice. It is also worth noting that the dual formulation of our regularized Wasserstein GAN is an unconstrained smooth convex problem in the functional domain. Therefore, it is theoretically feasible to solve it non-parametrically to any accuracy with a polynomial number of steps in the functional space. To the best of our knowledge, our convergence analysis is the first result in the literature with mild assumptions that proves global convergence to a stationary solution with a polynomial number of generator steps and with approximate solutions to the discriminator at each step. Notice that this is possible thanks to the regularizer added to the discriminator problem in our formulation.

1.1 Related works and contributions

We study the problem of solving Wasserstein GANs from an optimization perspective. In short, we solidify the intuition that the use of regularized Wasserstein distance is beneficial in learning GANs [13, 7, 42, 18, 41] through a rigorous and novel algorithmic convergence analysis.
There are three steps for obtaining such a convergence guarantee.

• We prove that the regularized Wasserstein distance, when used in GAN problems, is smooth with respect to the generator parameters.

• We also prove that by approximately solving the regularized Wasserstein distance (discriminator steps), we can control the error in the computation of the (stochastic) gradients for the generator. Such error control could not be achieved for the original Wasserstein distance or other GAN formulations in general; see Proposition 3 in [9].

• Having approximate first order information and smoothness, we prove the convergence of the vanilla stochastic gradient descent (SGD) method to a stationary solution. Our results suggest that the stationarity of the final solution depends not only on the number of steps in the generator, but also on the quality of solving the discriminator problem.

Note that our convergence result relies on the smoothness of the regularized Wasserstein distance with respect to the generator parameters. The use of regularization as a means for smoothing has a long history in the optimization literature [35]. In the optimal transport literature, regularization has been used as a means to derive faster methods for computing the optimal transport. The most prominent example is the Sinkhorn distance [13], which is based on regularizing the optimal transport problem with a negative entropy term. There are many efficient algorithms proposed for finding the Sinkhorn distance [25, 43, 3]. Recently, Blondel et al. [7] noted that using strongly convex regularizers on the optimal transport problem results in an unconstrained dual formulation which is computationally easier to solve.
This unconstrained form is essential in using parametric methods, such as neural networks, for solving regularized optimal transport problems, as also observed in [42]. In [42], a regularized optimal transport problem with a very small regularization weight is considered as an objective for learning GANs. Choosing a small regularization parameter is important as strong regularization introduces bias. Although our convergence guarantee applies to this objective, our smoothness analysis predicts that a small regularization weight leads to an unstable algorithm. This fact has also been observed in our experiments. Thus, we use the Sinkhorn loss [19, 41, 5], for which our convergence guarantee holds. We show that this loss does not introduce bias into finding the correct generative model regardless of the amount of regularization. Finally, using the insights from our theoretical analysis, we put together the pieces of different methods for solving GANs with regularized optimal transport [42, 18, 41] to get an algorithm that is competitive both in terms of computational efficiency and quality with the state of the art methods.

2 Background

Given a cost function c : R^d × R^d → R, the optimal transport cost between two distributions p and q can be defined as

d_c(p, q) = min_{π ∈ Π(p,q)} ∫_Y ∫_X π(x, y) c(x, y) dx dy,   (1)

where Π(p, q) is the set of all joint distributions having marginal distributions p and q, i.e. ∫_X π(x, y) dx = p(y) and ∫_Y π(x, y) dy = q(x). Note that X and Y define the spaces of all possible x's and y's respectively. In practice, the distributions are replaced by their empirical samples, thus X and Y have finite supports. In such cases we still use integrals to represent finite sums. Throughout the paper, X and Y are assumed to have finite support, unless otherwise noted. The optimal transport cost (1) can be used as an objective for learning generative models.
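On finite supports, problem (1) is a linear program over the entries of the transport plan. As a quick illustration (not part of the paper; it assumes NumPy and SciPy are available, and the tiny instance is made up), one can solve (1) directly with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Finite-support OT, eq. (1): minimize <pi, C> subject to the marginal
# constraints sum_y pi(x, y) = q(x) and sum_x pi(x, y) = p(y).
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # cost matrix c(x_i, y_j)
q = np.array([0.5, 0.5])            # marginal over x
p = np.array([0.5, 0.5])            # marginal over y

n, m = C.shape
A_eq, b_eq = [], []
for i in range(n):                  # row sums: sum_j pi[i, j] = q[i]
    row = np.zeros((n, m)); row[i, :] = 1.0
    A_eq.append(row.ravel()); b_eq.append(q[i])
for j in range(m):                  # column sums: sum_i pi[i, j] = p[j]
    col = np.zeros((n, m)); col[:, j] = 1.0
    A_eq.append(col.ravel()); b_eq.append(p[j])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
print(res.fun)  # optimal transport cost d_c(p, q); here 0 (identity plan)
```

The LP view also explains why the unregularized objective is non-smooth in the cost: the optimal plan can jump between vertices of the transport polytope, which is exactly the behavior the regularization in Section 2.2 removes.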
To be more specific, we assume that we have a base distribution q and, among a parameterized family of functions {G_θ, θ ∈ Θ}, we aim at learning a mapping G_θ* such that the cost d_c(G_θ*(q), p)¹ is minimized. In other words, the problem of generative model learning is

min_{θ ∈ Θ} h_0(θ) = d_c(G_θ(q), p) = min_{π ∈ Π(p,q)} ∫_X ∫_Y π(x, y) c(G_θ(x), y) dy dx.   (2)

2.1 Generative adversarial networks with W_c objective

In [4] the authors propose to use the dual formulation of the generative problem (2), as it is easier to parametrize the dual functions instead of the transport plan π. They call this formulation Wasserstein GAN (WGAN). Based on the Kantorovich theorem [44], the dual form of (2) can be written as

min_θ max_{α,β} E_{x∼q}[φ_α(G_θ(x))] − E_{y∼p}[ψ_β(y)]
s.t. φ_α(G_θ(x)) − ψ_β(y) ≤ c(G_θ(x), y), ∀(x, y)   (3)

where for practical considerations we have assumed that the dual functions/discriminators, φ and ψ, belong to a set of parametric functions with parameters α and β respectively. Note that the inner problem has a constraint over the functions φ and ψ. In the case where c is a distance, then φ = ψ and the constraint enforces the 1-Lipschitz constraint on the function φ = ψ with respect to c. This 1-Lipschitzness is not easy to enforce.

¹ Throughout the paper we have the hidden technical assumption that G_θ is a one-to-one mapping on X. This is a reasonable assumption since the space Y is usually a low dimensional manifold in a higher dimensional space which can be approximated by a mapping of low dimensional code words x ∈ X. Therefore, the mappings are from low dimensions to high dimensions.
In practice, it is usually imposed heuristically by adding some regularizer [22].

2.2 Regularized optimal transport

For any strongly convex function I(π), we can define the regularized optimal transport as

d_{c,λ}(p, q) = min_{π ∈ Π(p,q)} H(π, θ),  where  H(π, θ) = ∫∫ π(x, y) c(x, y) dx dy + λ I(π).   (4)

We also define d̄_{c,λ}(p, q) = d_{c,λ}(p, q) − λ I(π*), where π* is the optimal solution of (4). Note that ∇_θ d_{c,λ}(G_θ(q), p) = ∇_θ d̄_{c,λ}(G_θ(q), p). Among all strongly convex regularizers, the following two are the most popular ones [7]:

KL: I(π) = ∫∫ π(x, y) log( π(x, y) / (q(x) p(y)) ) dx dy,   norm-2: I(π) = (1/2) ∫∫ π(x, y)² / (q(x) p(y)) dx dy.

When c is a proper distance, one can show desirable properties for d_{c,λ} and d̄_{c,λ}; see [13] and Appendix A. It is also possible to prove the uniform convergence of d_{c,λ}(p, q) to d_c(p, q) as λ → 0 when p and q have finite support; see [7] and Appendix B. In the case of continuous distributions, [42] proves a point-wise convergence between the two distances as λ → 0.

2.3 Dual formulation for regularized optimal transport

The dual formulation for regularized optimal transport has also been covered in other recent works [42, 7]. Thus, we just present a summary of the results in Lemma 2.1 and highlight the important parts in the remarks that follow; see Appendix D for a more comprehensive discussion.

Lemma 2.1. Let φ(x) and ψ(y) be the dual variables for the constraints in the regularized optimal transport problem (4). Let us also define the violation function V(x, y) = φ(x) − ψ(y) − c(x, y). Then, the dual of the regularized optimal transport is:

d_{c,λ}(p, q) = max_{φ,ψ} E_{x∼q}[φ(x)] − E_{y∼p}[ψ(y)] − E_{x,y∼q×p}[f_λ(V(x, y))],   (5)

where f_λ(v) = (λ/e) e^{v/λ} for KL regularization and f_λ(v) = (v₊)²/(2λ) for norm-2 regularization. Furthermore, given the optimal dual variables φ and ψ, the optimal primal transport plan can be computed as π(x, y) = q(x) p(y) M_λ(V(x, y)), where M_λ(v) = (1/e) e^{v/λ} for KL regularization and M_λ(v) = v₊/λ for norm-2 regularization.

Remark 2.1.1. The dual of the regularized optimal transport is a large scale unconstrained concave maximization which can be solved as one, but it is also amenable to the use of parametric methods, i.e., neural networks, for representing the dual functions.

Note that V(x, y) in Lemma 2.1 represents the amount of violation of the hard constraint in the original dual formulation (3). Therefore, by adding the regularizer in the primal, we are relaxing the hard constraint in the dual representation to a soft penalty in the objective function. By looking at the problem from this perspective, one can find similarities between our approach and the one in [22], where the authors drop the 1-Lipschitz constraint on the discriminator and try to softly enforce it by regularizing the objective using the Jacobian of the discriminator function.

Remark 2.1.2. Lemma 2.1 also provides a mapping that translates the dual solutions φ and ψ to a corresponding pseudo-transport plan

π(x, y) = q(x) p(y) M_λ(V(x, y)).   (6)

Note that although π may not be a feasible transport plan, (6) can be used to compute an approximate gradient of the generator problem, as discussed in Section 4.

In what follows, we assume that at each iteration of the procedure for finding the optimal generator, we have access to an oracle which solves the resulting dual of regularized optimal transport to some predefined accuracy. We say a solution (φ, ψ) is an ε-accurate solution for (5) if the value it achieves is within ε of the optimal value of (5). Such an oracle could be realized through convex optimization methods and non-parametric representations; see Appendix D.2.
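For finite supports and the KL regularizer, such a non-parametric oracle can be sketched concretely: exact alternating maximization of the dual (5) in φ and ψ reduces to Sinkhorn-style matrix scaling, and the plan is recovered via (6). The sketch below assumes NumPy and a small random instance; it is an illustration of the oracle, not the paper's parametric discriminator:

```python
import numpy as np

def dual_oracle_kl(C, q, p, lam=0.5, iters=1000):
    """Solve the KL-regularized dual (5) by exact alternating maximization.

    With u = exp(phi/lam), v = exp(-psi/lam) and the kernel
    K(x, y) = q(x) p(y) exp(-c(x, y)/lam - 1), each coordinate-ascent
    step has a closed form (a Sinkhorn scaling), and the pseudo-plan (6)
    is pi(x, y) = u(x) K(x, y) v(y).
    """
    K = np.outer(q, p) * np.exp(-C / lam - 1.0)
    u, v = np.ones_like(q), np.ones_like(p)
    for _ in range(iters):
        u = q / (K @ v)        # maximize over phi with psi fixed
        v = p / (K.T @ u)      # maximize over psi with phi fixed
    phi, psi = lam * np.log(u), -lam * np.log(v)
    pi = u[:, None] * K * v[None, :]
    return phi, psi, pi

rng = np.random.default_rng(0)
C = rng.random((4, 5))                       # cost matrix c(x_i, y_j)
q, p = np.full(4, 0.25), np.full(5, 0.2)     # finite-support marginals
phi, psi, pi = dual_oracle_kl(C, q, p)
print(np.abs(pi.sum(axis=1) - q).max(),      # row-marginal violation
      np.abs(pi.sum(axis=0) - p).max())      # column-marginal violation
```

Parametrizing φ and ψ with neural networks replaces these closed-form scalings with gradient ascent steps on (5), which is the route taken in the experiments.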
However, due to practical computational barriers, we opt for parametric realizations of the oracle, i.e. neural networks.

3 Smoothness of the generative objective

Given two fixed distributions q and p, let us define h_λ(θ) = d_{c,λ}(G_θ(q), p). In this section we prove that h_λ(θ) is smooth with respect to θ, in contrast to the original metric h_0(θ). This is particularly important when we only solve the inner optimal transport problem to some accuracy. Due to space limitations, we only state our result on the smoothness of h_λ(θ) when the regularizer is the KL divergence; a similar result for the norm-2 regularizer can be found in Theorem E.1 in the Appendix. The only difference when changing the regularizer comes from the fact that these two regularizers are strongly convex with respect to different norms.

Theorem 3.1. Assume X and Y are compact, p and q have bounded entropy, c is bounded from below, and there exist non-negative constants L₁ and L₀ such that:

• For any feasible θ₁, θ₂: sup_{x,y} ‖∇_θ c(G_{θ₁}(x), y) − ∇_θ c(G_{θ₂}(x), y)‖ ≤ L₁ ‖θ₁ − θ₂‖,

• For any feasible θ₁: sup_{x,y} ‖∇_θ c(G_{θ₁}(x), y)‖ ≤ L₀.

Then the function h_λ(θ) is L-Lipschitz smooth² with L = L₁ + L₀²/λ. Moreover, for any two parameters θ₁ and θ₂, ‖π*(θ₁) − π*(θ₂)‖₁ ≤ (L₀/λ) ‖θ₁ − θ₂‖, where π*(θ) = argmin_{π ∈ Π(p,q)} H(π, θ).

Remark 3.1.1. Theorem 3.1 holds for both continuous and discrete distributions.
In addition, when X and Y have finite support, the entropies of p and q are automatically bounded.

The proof of this theorem is inspired by [35] and is relegated to Appendix C, where we first prove the result for distributions with finite support, and then extend the proof to continuous distributions. Note that, unlike in the non-regularized original formulation, small changes in θ result in a small change in the corresponding optimal transport plan of the regularized formulation. Consequently, after updating θ, solving the inner problem becomes easier as the optimal discriminator has not moved far from the last iterate. It is also worth noticing that the assumptions of Theorem 3.1 are satisfied when the functions c and G are smooth and the domain of x, y is compact.

4 Solving the generator problem to stationarity using first order methods

First order methods, including SGD and its variants such as Adam [24] or SVRG [2], are the work-horse of large scale optimization. These methods are built on top of an oracle that can generate a close approximation of the (stochastic) gradients. Unfortunately, the original non-regularized GAN objective h_0(θ) is non-smooth. Moreover, it is impossible to obtain guaranteed good quality approximations of its sub-gradients even if we solve the discriminator problem with high accuracy; see Proposition 3 in [9]. In contrast, we proved that h_λ(θ) is smooth. Next we prove that one can obtain decent quality estimates of its gradient by solving the corresponding regularized dual problem approximately.

Theorem 4.1. Under the same assumptions as in Theorem 3.1, let (φ, ψ) be an ε-accurate solution to the dual formulation of regularized optimal transport for a given θ. Let π be the transport plan corresponding to (φ, ψ), derived using (6).
Let us also define m(x, y) = π(x, y)/(q(x) p(y)) and G = E_{x,y∼q×p}[m(x, y) ∇_θ c(G_θ(x), y)]. Then,

‖G − ∇h_λ(θ)‖ ≤ δ = O(√(ε/λ)).   (7)

² A function f(θ) : R^d → R is said to be L-Lipschitz smooth if it is differentiable and its derivative is L-Lipschitz, i.e. ∀ θ₁, θ₂ : ‖∇f(θ₁) − ∇f(θ₂)‖ ≤ L ‖θ₁ − θ₂‖.

We only prove this result for the case where X and Y have finite support. This is the scenario that happens in practice, where the distributions are replaced by their finite samples. While we believe a more general version of this result could be proved for continuous distributions by delicately repeating the same steps, such a proof goes beyond the scope of this paper. See Appendix G for the proof. Also note that it is possible to verify the quality of the discriminator/dual solutions. Due to space limitations, we relegate the discussion on verification to Appendix D.3.

The above theorem guarantees that, using the dual solver, we can generate approximate (stochastic) gradients for h_λ(θ). In other words, the discriminator steps in solving GANs can be viewed as a way of obtaining approximate gradient information for h_λ(θ). Using the above approximate (stochastic) gradients, one can provide algorithms with guaranteed convergence to approximate stationary solutions for GANs. We describe one such algorithm based on vanilla mini-batch SGD and state its convergence guarantee.

Algorithm 1 Oracle based Non-Convex SGD for GANs

INPUT: q, p, λ, S, θ₀, {α_t > 0}_{t=0}^{T−1}
for t = 0, ..., T − 1 do
  Call the oracle to find an ε-approximate maximizer (φ_t, ψ_t) for the dual formulation.
  Sample i.i.d.
points x¹_t, ..., x^S_t ∼ q and y¹_t, ..., y^S_t ∼ p and compute

  g_t = (1/S²) Σ_{i,j} [ π_t(G_θ(x^i_t), y^j_t) / (q(x^i_t) p(y^j_t)) ] ∇_θ c(G_θ(x^i_t), y^j_t),

  where π_t is computed using (φ_t, ψ_t) based on (6).
  Update θ_{t+1} ← θ_t − α_t g_t
end for

Remark 4.1.1. In Algorithm 1, if we define G_t = E[g_t | π_t, θ_t], then Theorem 4.1 simply states that ‖G_t − ∇h_λ(θ_t)‖ ≤ δ = O(√(ε/λ)).

The following theorem establishes the convergence of Algorithm 1 to an approximate stationary solution of h_λ.

Theorem 4.2. Let L be the Lipschitz constant of the gradient of h_λ. Set Δ = h_λ(θ₀) − inf_θ h_λ(θ) and let G_t = E[g_t | π_t, θ_t]. Furthermore, assume ‖G_t − ∇h_λ(θ_t)‖ ≤ δ and E[‖g_t − G_t‖² | π_t, θ_t] ≤ σ², ∀t.

• If T < 2LΔ/σ², setting α_t = 1/L, we have (1/T) Σ_{t=1}^T E[‖∇h_λ(θ_t)‖²] ≤ 2LΔ/T + δ² + σ².

• If T ≥ 2LΔ/σ², setting α_t = √(2Δ/(L σ² T)), we have (1/T) Σ_{t=1}^T E[‖∇h_λ(θ_t)‖²] ≤ √(8LΔσ²/T) + δ².

The proof of this theorem is inspired by [20] and is presented in Appendix H.

Remark 4.2.1. The second regime in Theorem 4.2 results in the following asymptotic convergence rate of the expected norm of the gradient as T → ∞:

min_{t=1,...,T} E[‖∇_θ h_λ(θ_t)‖²] ≤ O(√(LΔσ²/T)) + O(ε/λ).   (8)

It is worth noting that our convergence analysis also guarantees the convergence of the algorithm in [42] for generative learning, which is similar to Algorithm 1.

Remark 4.2.2. When the error in gradient approximation at step t is δ_t, Theorem 4.2 is still valid with δ² = (1/T) Σ_{t=1}^T δ_t².
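As a concrete toy illustration of the estimator g_t and the generator update in Algorithm 1, the sketch below uses a scalar generator G_θ(x) = θx and squared-distance cost c(u, y) = (u − y)², with the pseudo-plan (6) under KL regularization. All names, the sample values, and the zero dual potentials are illustrative assumptions, not the paper's parametric discriminator:

```python
import numpy as np

lam = 1.0
theta = 1.0
S = 2
x = np.array([1.0, 2.0])   # x^i_t ~ q (base samples)
y = np.array([1.0, 2.0])   # y^j_t ~ p (data samples)
phi = np.zeros(S)          # illustrative dual potentials phi_t(x^i)
psi = np.zeros(S)          # illustrative dual potentials psi_t(y^j)

Gx = theta * x                                   # generator outputs G_theta(x^i)
C = (Gx[:, None] - y[None, :]) ** 2              # c(G_theta(x^i), y^j)
V = phi[:, None] - psi[None, :] - C              # violation V at the generated points
m = np.exp(V / lam - 1.0)                        # pi_t / (q p) = M_lam(V), via (6)
grad_c = 2.0 * (Gx[:, None] - y[None, :]) * x[:, None]   # d c / d theta by chain rule
g_t = (m * grad_c).sum() / S**2                  # mini-batch estimator of Algorithm 1
theta_next = theta - 0.1 * g_t                   # generator step with alpha_t = 0.1
print(g_t)
```

In practice the potentials (φ_t, ψ_t) come from the approximately solved discriminator problem rather than being zero; the point of the sketch is only the shape of the weighted double sum.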
Thus, the algorithm only needs to keep the average error in solving the inner problem small enough.

4.1 Sinkhorn loss: a more robust generative objective

In practice, when using the regularized Wasserstein distance h_λ(θ) as an objective for generative models, one needs to use a very small value of λ, as noted by [42]. This is because a large λ would introduce bias into measuring the Wasserstein distance. In fact, choosing a large λ may lead to undesired solutions; see Corollary F.0.2 for an example. A naïve approach to deal with this bias is to reduce λ. However, a reduced λ has three dire effects: (i) based on Theorem 3.1, any change in the generator parameters would result in large changes in the optimal discriminator parameters; (ii) according to (8), a smaller λ requires a smaller error ε in solving the discriminator to obtain the same convergence guarantee, so solving the discriminator problem requires more effort; (iii) the Lipschitz smoothness constant of h_λ(θ) increases as λ decreases, so we have to choose a smaller step-size, based on Theorem 4.2, which means slower convergence. In our experiments we observed that this situation worsens with the complexity and scale of the problem. A proposed solution in the literature is to use the Sinkhorn loss [19, 41] to reduce the bias in measuring the distance between the two distributions without reducing λ, and to get an objective which is meaningful even for large values of λ. The Sinkhorn loss between two distributions p and q is defined as

L̄_λ(p, q) = 2 d̄_{c,λ}(p, q) − d̄_{c,λ}(p, p) − d̄_{c,λ}(q, q).   (9)

In [19], Genevay et al. proved that, when c is a distance, as λ → ∞, L̄_λ converges to the Maximum Mean Discrepancy (MMD) [15]; and as λ → 0, L̄_λ converges to 2 d_c(p, q). Next, we present a result that shows the robustness of L̄_λ with respect to the choice of λ ∈ (0, ∞) in identifying the true generator parameters; see the proof in Appendix J.

Lemma 4.3.
Assume c is symmetric, i.e., c(x, y) = c(y, x). If there exists θ₀ for which q = G_{θ₀}(p), then θ₀ is a stationary solution of L̄_λ(G_θ(p), q). Moreover, L̄_λ(G_{θ₀}(p), q) = 0 for any λ > 0.

Notice that the above does not hold for d_{c,λ}(G_θ(p), q) unless λ → 0. Based on the above lemma, we opt to use the following Sinkhorn loss as our generative objective:

min_θ ĥ_λ(θ) = L̄_λ(G_θ(q), p) = 2 d̄_{c,λ}(G_θ(q), p) − d̄_{c,λ}(G_θ(q), G_θ(q)) − d̄_{c,λ}(p, p).   (10)

Note that only the first two terms in ĥ_λ(θ) depend on θ. To compute approximate gradients for these two terms, we need to call the discriminator oracle twice, to approximately solve d_{c,λ}(G_θ(q), p) and d_{c,λ}(G_θ(q), G_θ(q)). With the returned discriminator solutions, we have two approximate gradients, one for each term. If each of the gradients has error δ, obtained by applying (7), the error in approximating the overall gradient is bounded by δ̂ = 3δ. If we further assume that sub-sampling generates a stochastic gradient of variance σ² for each term, the overall variance of the noise in the gradient is bounded by σ̂² = 5σ². We summarize the SGD based method for solving (10) in Algorithm 2 in the Appendix. With these assumptions, we can easily extend the convergence result of Theorem 4.2 to Algorithm 2.

Corollary 4.3.1. By replacing δ and σ in the convergence guarantees of Theorem 4.2 with δ̂ and σ̂ respectively, we obtain a convergence guarantee for the SGD based method, described above, for solving the generative Sinkhorn loss optimization problem (10).

5 Experiments

In this section we test a family of methods which we collectively call Smoothed WGAN (SWGAN). They all use one of the two variants of the regularized OT formulation, i.e.
h_λ(θ) and ĥ_λ(θ), as their objective. We differentiate between the two objectives by explicitly mentioning the Sinkhorn loss when it is used. We also investigate the choice of different cost functions, i.e. L1 and Cosine distances. Unlike [41, 18], which solve regularized OT with a non-parametric approach, we use neural networks to solve the OT (discriminator) problem, similar to [42]. In contrast with [41, 18], which use large batch-sizes to get unbiased gradient estimates, our gradient estimates are always unbiased due to the use of neural networks as the discriminator. As benchmarks, we compare the SWGAN methods with the gradient penalty WGAN (WGAN-GP) [22] and other methods that use the regularized OT objective [41, 19]. All algorithms were implemented in TensorFlow [1].

5.1 Learning handwritten digits

In this section we apply SWGAN methods to learn handwritten digits on the MNIST data set. Our main goal is to see the effect of different choices of objective and cost function on the performance of SWGAN methods. For details of hyper-parameters and network structures see Appendix L.3.

The first and second rows of Fig. 1 correspond to SWGAN methods with L1 and Cosine cost respectively. As the SWGAN formulation allows flexibility in the choice of the cost function, we apply these costs to different representations of the images. In Fig. 1 (a), (b), (e) and (f), the cost function is applied in the pixel domain (no latent representation), while in Fig.
1 (c), (d), (g) and (h), the cost is applied to a lower dimensional latent representation of the image [41, 18], parameterized by a Convolutional Neural Network.

[Figure 1: Generated MNIST samples using different SWGAN and benchmark methods. Rows: SWGAN L1 and SWGAN Cosine. Panels: (a) Pixel, (b) Pixel Sinkhorn loss, (c) Latent, (d) Latent Sinkhorn loss, (e) Pixel, (f) Pixel Sinkhorn loss, (g) Latent, (h) Latent Sinkhorn loss, (i) Latent Sinkhorn solver [19], (j) WGAN-GP [22].]

As proposed by [41, 18], this latent representation can be adversarially trained to improve the quality of the final results; see Appendix L.1 for more details. Comparing SWGAN methods with and without the latent representation, we find that the ones with the latent representation perform better. This might be due to the existence of many bad local minima of the cost function in the high dimensional pixel space which the algorithm cannot avoid. In contrast, the methods with lower dimensional latent representations seem to have an easier time avoiding such bad local minima when the latent representation is also being updated. Based on Lemma 4.3, we conjecture that in these cases the ground truth is the only solution that is stationary regardless of the latent representation. Thus, updating the latent representation once in a while prevents the generative parameters from converging to local minima, i.e. over-fitting to a specific representation. Note that this conjecture is not a direct result of Lemma 4.3.

It is also interesting to note the difference between Fig. 1 (a) and (f), where (f) outperforms (a), which produces many faint images. We believe that the change of objective from regularized OT to the Sinkhorn loss helps the method find a better stationary solution, which is closer to the underlying ground truth as predicted by Lemma 4.3.
This difference is more pronounced in the experiments on CIFAR-10.
We have also included samples generated by other methods [18, 22] in the last row of Fig. 1. Compared to these methods, especially [18], which uses the Sinkhorn algorithm to solve regularized OT, SWGAN methods are capable of generating higher-quality images. We also noted that SWGAN methods qualitatively converge faster than other methods. We will formalize this comparison in the experiments on CIFAR-10 using the inception score [40].

5.2 Generating tiny color images

To further investigate the performance of SWGAN, we apply it to model 32x32 color images from CIFAR-10 [26]. We compare the SWGAN approach with WGAN-GP [22], OT-GAN [41] and the Sinkhorn solver [18]. All the methods are trained using the same architecture and a batch-size of 150; see Appendix M for the details and a list of hyper-parameters. We use the inception score [40] to compare the quality of generated samples. Learning CIFAR-10 images is a more challenging problem than MNIST, and, as we predicted in Section 4.1, SWGAN methods with the regularized OT objective cannot generate high-quality samples, even with carefully tuned hyper-parameters; see Fig. 4 in the Appendix.
Due to the high computational cost, we only evaluate the latent Sinkhorn-loss SWGAN with L1 and Cosine costs on CIFAR-10. As can be seen in Fig. 2 (c), given the same architecture and computational power, the SWGAN methods have faster convergence compared to other algorithms. In Fig. 2 all the methods have been run for roughly the same amount of time. Note that OT-GAN [41] is slower as it uses two batches for each label, i.e. fake and real, and requires more computations. We also depict samples of the generated images by SWGAN methods in Fig. 2 (a) and (b).
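The inception score [40] used for this comparison can be sketched as follows. This is a minimal NumPy sketch under the assumption that `probs` holds the class probabilities p(y|x) that a pretrained Inception classifier assigns to each generated sample; the split-and-average protocol used in practice is omitted.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ).

    probs: (n_samples, n_classes) array of classifier outputs p(y|x),
    each row summing to one; p(y) is the marginal over the batch.
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

The score ranges from 1 to the number of classes: it is high when each sample is classified confidently (low-entropy p(y|x)) while the batch as a whole covers many classes (high-entropy p(y)).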
We noted that SWGAN with the L1 cost converges faster than the Cosine distance in terms of inception scores, but the samples from the Cosine-distance model are more visually appealing than the ones from L1.

Figure 2: Generated CIFAR-10 samples from (a) SWGAN Cosine and (b) SWGAN L1, and (c) inception scores.

Acknowledgments
MS and JDL acknowledge support from ARO W911NF-11-1-0303. The authors would like to thank the anonymous reviewers whose comments/suggestions helped improve the quality/clarity of the paper.

References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] Z. Allen-Zhu and Y. Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, pages 1080–1089, 2016.
[3] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1961–1971, 2017.
[4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[5] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.
[6] P. Bernhard and A. Rapaport. On a theorem of Danskin with an application to a theorem of von Neumann-Sion. Nonlinear Analysis, 24(8):1163–1182, 1995.
[7] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. arXiv preprint arXiv:1710.06276, 2017.
[8] V. S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.
[9] O. Bousquet, S. Gelly, I. Tolstikhin, C.-J. Simon-Gabriel, and B.
Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.
[10] G. Carlier, V. Duval, G. Peyré, and B. Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385–1418, 2017.
[11] G. Carlier, V. Duval, G. Peyré, and B. Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385–1418, 2017.
[12] F. Cicalese, L. Gargano, and U. Vaccaro. How to find a joint probability distribution of minimum entropy (almost), given the marginals. arXiv preprint arXiv:1701.05243, 2017.
[13] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[14] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.
[15] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
[16] F. Farnia and D. Tse. A convex duality framework for GANs. In Advances in Neural Information Processing Systems, pages 5254–5263, 2018.
[17] S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.
[18] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440–3448, 2016.
[19] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617, 2018.
[20] S. Ghadimi and G. Lan.
Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[22] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
[23] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] P. A. Knight. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.
[26] A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.
[27] J. Li, A. Madry, J. Peebles, and L. Schmidt. Towards understanding the dynamics of generative adversarial networks. arXiv preprint arXiv:1706.09884, 2017.
[28] J. H. Lim and J. C. Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
[29] L. Mescheder. On the convergence properties of GAN training. arXiv preprint arXiv:1801.04406, 2018.
[30] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406, 2018.
[31] L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems, pages 1823–1833, 2017.
[32] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks.
arXiv preprint arXiv:1611.02163, 2016.
[33] V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.
[34] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
[35] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[36] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[37] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[38] E. Posner. Random coding strategies for minimum entropy. IEEE Transactions on Information Theory, 21(4):388–391, 1975.
[39] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[40] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[41] T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. arXiv preprint arXiv:1803.05573, 2018.
[42] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283, 2017.
[43] A. Thibault, L. Chizat, C. Dossal, and N. Papadakis. Overrelaxed Sinkhorn–Knopp algorithm for regularized optimal transport. arXiv preprint arXiv:1711.01851, 2017.
[44] C. Villani.
Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.