{"title": "Dualing GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 5606, "page_last": 5616, "abstract": "Generative adversarial nets (GANs) are a promising technique for modeling a distribution from samples. It is however well known that GAN training suffers from instability due to the nature of its saddle point formulation. In this paper, we explore ways to tackle the instability problem by dualizing the discriminator. We start from linear discriminators in which case conjugate duality provides a mechanism to reformulate the saddle point objective into a maximization problem, such that both the generator and the discriminator of this \u2018dualing GAN\u2019 act in concert. We then demonstrate how to extend this intuition to non-linear formulations. For GANs with linear discriminators our approach is able to remove the instability in training, while for GANs with nonlinear discriminators our approach provides an alternative to the commonly used GAN training algorithm.", "full_text": "Dualing GANs

Yujia Li1∗    Alexander Schwing3    Kuan-Chieh Wang1,2    Richard Zemel1,2

1Department of Computer Science, University of Toronto
2Vector Institute
3Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
{yujiali, wangkua1, zemel}@cs.toronto.edu
aschwing@illinois.edu

Abstract

Generative adversarial nets (GANs) are a promising technique for modeling a distribution from samples. It is however well known that GAN training suffers from instability due to the nature of its saddle point formulation. In this paper, we explore ways to tackle the instability problem by dualizing the discriminator. We start from linear discriminators, in which case conjugate duality provides a mechanism to reformulate the saddle point objective into a maximization problem, such that both the generator and the discriminator of this 'dualing GAN' act in concert.
We then demonstrate how to extend this intuition to non-linear formulations. For GANs with linear discriminators our approach is able to remove the instability in training, while for GANs with nonlinear discriminators our approach provides an alternative to the commonly used GAN training algorithm.

1 Introduction

Generative adversarial nets (GANs) [5] are, among others like variational auto-encoders [10] and auto-regressive models [19], a promising technique for modeling a distribution from samples. A lot of empirical evidence shows that GANs are able to learn to generate images with good visual quality at unprecedented resolution [22, 17], and recently there has been a lot of research interest in GANs, to better understand their properties and the training process.

Training GANs can be viewed as a duel between a discriminator and a generator. Both players are instantiated as deep nets. The generator is required to produce realistic-looking samples that cannot be differentiated from real data by the discriminator. In turn, the discriminator does as good a job as possible to tell the samples apart from real data. Due to the complexity of the optimization problem, training GANs is notoriously hard, and usually suffers from problems such as mode collapse, vanishing gradients, and divergence. Moreover, the training procedures are very unstable and sensitive to hyper-parameters. Therefore, a number of techniques have been proposed to address these issues, some empirically justified [17, 18], and others more theoretically motivated [15, 1, 16, 23].

This tremendous amount of recent work, together with the wide variety of heuristics applied by practitioners, indicates that many questions regarding the properties of GANs are still unanswered. In this work we provide another perspective on the properties of GANs, aiming toward better training algorithms in some cases.
Our study in this paper is motivated by the alternating gradient update between discriminator and generator, employed during training of GANs. This form of update is one source of instability, and it is known to diverge even for some simple problems [18]. Ideally, when the discriminator is optimized to optimality, the GAN objective is a deterministic function of the generator. In this case, the optimization problem would be much easier to solve. This motivates our idea to dualize parts of the GAN objective, offering a mechanism to better optimize the discriminator. Interestingly, our dual formulation provides a direct relationship between the GAN objective and the maximum mean-discrepancy framework discussed in [6]. When restricted to linear discriminators, where we can find the optimal discriminator by solving the dual, this formulation permits the derivation of an optimization algorithm that monotonically increases the objective. Moreover, for non-linear discriminators we can apply trust-region type optimization techniques to obtain more accurate discriminators. Our work brings to the table some additional optimization techniques beyond stochastic gradient descent; we hope this encourages other researchers to pursue this direction.

∗Now at DeepMind.

2 Background

In generative training we are interested in modeling of and sampling from an unknown distribution P, given a set D = {x1, . . . , xN} ∼ P of datapoints, for example images. GANs use a generator network Gθ(z) parameterized by θ, that maps samples z drawn from a simple distribution, e.g., Gaussian or uniform, to samples in the data space x̂ = Gθ(z). A separate discriminator Dw(x) parameterized by w maps a point x in the data space to the probability of it being a real sample. The discriminator is trained to minimize a classification loss, typically the cross-entropy, and the generator is trained to maximize the same loss.
On sets of real data samples {x1, ..., xn} and noise samples {z1, ..., zn}, using the (averaged) cross-entropy loss results in the following joint optimization problem:

max_θ min_w f(θ, w),  where  f(θ, w) = −(1/2n) Σ_i log Dw(xi) − (1/2n) Σ_i log(1 − Dw(Gθ(zi))).  (1)

We adhere to the formulation of a fixed batch of samples for clarity of the presentation, but also point out how this process is adapted to the stochastic optimization setting later in the paper as well as in the supplementary material.

To solve this saddle point optimization problem, ideally, we want to solve for the optimal discriminator parameters w∗(θ) = argmin_w f(θ, w), in which case the GAN program given in Eq. (1) can be reformulated as a maximization for θ using max_θ f(θ, w∗(θ)). However, typical GAN training only alternates two gradient updates, w ← w − ηw ∇w f(θ, w) and θ ← θ + ηθ ∇θ f(θ, w), usually with just one step for each of θ and w in each round. In this case, the objective maximized by the generator is f(θ, w) instead. This objective is always an upper bound on the correct objective f(θ, w∗(θ)), since w∗(θ) is the optimal w for θ. Maximizing an upper bound has no guarantee on maximizing the correct objective, which leads to instability. Therefore, many practically useful techniques have been proposed to circumvent the difficulties of the original program definition presented in Eq. (1).

Another widely employed technique is a separate loss −Σ_i log(Dw(Gθ(zi))) to update θ, in order to avoid vanishing gradients during early stages of training when the discriminator can get too strong.
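The instability of gradient-based saddle-point updates can be seen on a toy bilinear problem. The sketch below is an illustration (not the paper's code), using simultaneous gradient steps on f(θ, w) = θ·w, whose unique saddle point is (0, 0); the iterates spiral outward instead of converging:

```python
import numpy as np

# Toy saddle-point problem: max_theta min_w f(theta, w) = theta * w.
# Simultaneous gradient updates (an idealization of the alternating
# scheme described above) move the iterates away from the saddle point
# (0, 0) instead of toward it -- the instability discussed in the text.
def gradient_play(theta, w, lr=0.1, steps=100):
    for _ in range(steps):
        g_theta, g_w = w, theta  # df/dtheta = w,  df/dw = theta
        theta, w = theta + lr * g_theta, w - lr * g_w
    return theta, w

theta, w = gradient_play(1.0, 1.0)
print(theta**2 + w**2)  # > 2: the squared norm grew by (1 + lr^2) per step
```

Each step multiplies the squared distance to the saddle point by exactly 1 + lr², so no step size fixes this; it only changes the rate of divergence.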
This technique can be combined with our approach, but in what follows, we keep the elegant formulation of the GAN program specified in Eq. (1).

3 Dualing GANs

The main idea of 'Dualing GANs' is to represent the discriminator program min_w f(θ, w) in Eq. (1) using its dual, max_λ g(θ, λ). Hereby, g is the dual objective of f w.r.t. w, and λ are the dual variables. Instead of gradient descent on f to update w, we solve the dual instead. This results in a maximization problem max_θ max_λ g(θ, λ).

Using the dual is beneficial for two reasons. First, note that for any λ, g(θ, λ) is a lower bound on the objective with optimal discriminator parameters f(θ, w∗(θ)). Staying in the dual domain, it is then guaranteed that optimization of g w.r.t. θ makes progress in terms of the original program. Second, the dual problem usually involves a much smaller number of variables, and can therefore be solved much more easily than the primal formulation. This provides opportunities to obtain more accurate estimates for the discriminator parameters w, which is in turn beneficial for stabilizing the learning of the generator parameters θ. In the following, we start by studying linear discriminators, before extending our technique to training with non-linear discriminators. Also, we use cross-entropy as the classification loss, but emphasize that other convex loss functions, e.g., the hinge-loss, can be applied equivalently.

3.1 Linear Discriminator

We start from linear discriminators that use a linear scoring function F(w, x) = wᵀx, i.e., the discriminator Dw(x) = pw(y = 1|x) = σ(F(w, x)) = 1/[1 + exp(−wᵀx)]. Here, y = 1 indicates real data, while y = −1 for a generated sample.
The distribution pw(y = −1|x) = 1 − pw(y = 1|x) characterizes the probability of x being a generated sample.

We only require the scoring function F to be linear in w, and any (nonlinear) differentiable features φ(x) can be used in place of x in this formulation. Substituting the linear scoring function into the objective given in Eq. (1) results in the following program for w:

min_w (C/2)||w||²₂ + (1/2n) Σ_i log(1 + exp(−wᵀxi)) + (1/2n) Σ_i log(1 + exp(wᵀGθ(zi))).  (2)

Here we also added an L2-norm regularizer on w. We note that the program presented in Eq. (2) is convex in the discriminator parameters w. Hence, we can equivalently solve it in the dual domain as discussed in the following claim, with proof provided in the supplementary material.

Claim 1. The dual program to the task given in Eq. (2) reads as follows:

max_λ g(θ, λ) = −(1/2C) ||Σ_i λ_xi xi − Σ_i λ_zi Gθ(zi)||²₂ + (1/2n) Σ_i H(2n λ_xi) + (1/2n) Σ_i H(2n λ_zi),
s.t. ∀i, 0 ≤ λ_xi ≤ 1/2n, 0 ≤ λ_zi ≤ 1/2n,  (3)

with binary entropy H(u) = −u log u − (1 − u) log(1 − u).
The optimal solution w∗ to the original problem can be obtained from the optimal λ∗_xi and λ∗_zi via

w∗ = (1/C) (Σ_i λ∗_xi xi − Σ_i λ∗_zi Gθ(zi)).

Remarks: Intuitively, considering the last two terms of the program given in Claim 1 as well as its constraints, we aim at assigning weights λ_x, λ_z close to half of 1/2n to as many data points and to as many artificial samples as possible. More carefully investigating the first part, which can at most reach zero, reveals that we aim to match the empirical data observation Σ_i λ_xi xi and the generated artificial sample observation Σ_i λ_zi Gθ(zi). Note that this resembles the moment matching property obtained in other maximum likelihood models. Importantly, this objective also resembles the (kernel) maximum mean discrepancy (MMD) framework, where the empirical squared MMD is estimated via ||(1/n) Σ_i xi − (1/n) Σ_i Gθ(zi)||²₂. Generative models that learn to minimize the MMD objective, like the generative moment matching networks [13, 3], can therefore be included in our framework, using fixed λ's and proper scaling of the first term.

Combining the result obtained in Claim 1 with the training objective for the generator yields the task max_{θ,λ} g(θ, λ) for training of GANs with linear discriminators. Hence, instead of searching for a saddle point, we strive to find a maximizer, a task which is presumably easier.
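Claim 1 is easy to check numerically. The sketch below (an illustration on assumed toy data, not the paper's code) solves the primal program of Eq. (2) and the box-constrained dual of Eq. (3) with SciPy, then verifies that the recovered w∗ and the optimal values agree:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, C = 8, 3, 1.0
X = rng.normal(1.0, 1.0, size=(n, d))   # "real" data points x_i
G = rng.normal(-1.0, 1.0, size=(n, d))  # generated points G_theta(z_i), held fixed

def primal(w):
    # Eq. (2): L2 regularizer plus averaged logistic losses
    return (C / 2) * w @ w + np.mean(np.log1p(np.exp(-X @ w))) / 2 \
                           + np.mean(np.log1p(np.exp(G @ w))) / 2

def H(u):  # binary entropy, clipped for numerical safety at the box boundary
    u = np.clip(u, 1e-12, 1 - 1e-12)
    return -u * np.log(u) - (1 - u) * np.log1p(-u)

def neg_dual(lam):
    # negative of g(theta, lambda) from Eq. (3), to be minimized
    lx, lz = lam[:n], lam[n:]
    v = X.T @ lx - G.T @ lz
    g = -v @ v / (2 * C) + np.sum(H(2 * n * lx)) / (2 * n) \
                         + np.sum(H(2 * n * lz)) / (2 * n)
    return -g

bounds = [(1e-9, 1 / (2 * n) - 1e-9)] * (2 * n)       # 0 <= lambda <= 1/2n
res_d = minimize(neg_dual, np.full(2 * n, 1 / (4 * n)), bounds=bounds)
w_dual = (X.T @ res_d.x[:n] - G.T @ res_d.x[n:]) / C  # w* from Claim 1
res_p = minimize(primal, np.zeros(d))                 # direct primal solve

print(np.max(np.abs(w_dual - res_p.x)))  # ~0: primal and dual solutions agree
print(primal(res_p.x) + res_d.fun)       # ~0: duality gap vanishes
```

Since Eq. (2) is convex, strong duality makes the gap vanish; the entropy terms keep the optimal λ's strictly inside the box, so a generic bound-constrained solver suffices here.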
The price to pay is the restriction to linear discriminators and the fact that every randomly drawn artificial sample zi has its own dual variable λ_zi.

In the non-stochastic optimization setting, where we optimize for fixed sets of data samples {xi} and randomizations {zi}, it is easy to design a learning algorithm for GANs with linear discriminators that monotonically improves the objective g(θ, λ) based on line search. Although this approach is not practical for very large data sets, such a property is convenient for smaller scale data sets. In addition, linear models are favorable in scenarios in which we know informative features that we want the discriminator to pay attention to.

When optimizing with mini-batches we introduce new data samples {xi} and randomizations {zi} in every iteration. In the supplementary material we show that this corresponds to maximizing a lower bound on the full expectation objective. Since the dual variables vary from one mini-batch to the next, we need to solve for the newly introduced dual variables to a reasonable accuracy. For the small minibatch sizes common in the deep learning literature, such as 100, calling a constrained optimization solver on the dual problem is quite cheap. We used Ipopt [20], which typically solves this dual problem to good accuracy in negligible time; other solvers can also be used and may lead to improved performance.

Utilizing a log-linear discriminator reduces the model's expressiveness and complexity. We therefore now propose methods to alleviate this restriction.

3.2 Non-linear Discriminator

General non-linear discriminators use non-convex scoring functions F(w, x), parameterized by a deep net. The non-convexity of F makes it hard to directly convert the problem into its dual form. Therefore, our approach for training GANs with non-convex discriminators is based on repeatedly linearizing and dualizing the discriminator locally.
At first sight this seems restrictive; however, we will show that a specific setup of this technique recovers the gradient direction employed in the regular GAN training mechanism while providing additional flexibility.

We consider locally approximating the primal objective f around a point wk using a model function m_{k,θ}(s) ≈ f(θ, wk + s). We phrase the update w.r.t. the discriminator parameters w as a search for a step s, i.e., wk+1 = wk + s, where k indicates the current iteration. In order to guarantee the quality of the approximation, we introduce a trust-region constraint (1/2)||s||²₂ ≤ Δk ∈ R+, where Δk specifies the trust-region size. More concretely, we search for a step s by solving

min_s m_{k,θ}(s)  s.t.  (1/2)||s||²₂ ≤ Δk,  (4)

given generator parameters θ. Rather than optimizing the GAN objective f(θ, w) with stochastic gradient descent, we can instead employ this model function and use the algorithm outlined in Alg. 1. It proceeds by first performing a gradient ascent w.r.t. the generator parameters θ. Afterwards, we find a step s by solving the program given in Eq. (4). We then apply this step, and repeat.

Different model functions m_{k,θ}(s) result in variants of the algorithm. If we choose m_{k,θ}(s) = f(θ, wk + s), model m and function f are identical but the program given in Eq. (4) is hard to solve. Therefore, in the following, we propose two model functions that we have found to be useful. The first one is based on linearization of the cost function f(θ, w) and recovers the step s employed by gradient-based discriminator updates in standard GAN training. The second one is based on linearization of the score function F(w, x) while keeping the loss function intact; this second approximation is hence accurate in a larger region.
Many more models m_{k,θ}(s) exist and we leave further exploration of this space to future work.

(A). Cost function linearization: A local approximation to the cost function f(θ, w) can be constructed by using the first order Taylor approximation

m_{k,θ}(s) = f(wk, θ) + ∇w f(wk, θ)ᵀ s.

Such a model function is appealing because step 2 of Alg. 1, i.e., minimization of the model function subject to trust-region constraints as specified in Eq. (4), has the analytically computable solution

s = −(√(2Δk) / ||∇w f(wk, θ)||₂) ∇w f(wk, θ).

Consequently, step 3 of Alg. 1 is a step of length √(2Δk) in the negative gradient direction of the cost function f(θ, w). We can use the trust region parameter Δk to tune the step size, just like it is common to specify the step size for standard GAN training. As mentioned before, using the first order Taylor approximation as our model m_{k,θ}(s) recovers the same direction that is employed during standard GAN training. The value of the Δk parameter can be fixed or adapted; see the supplementary material for more details.

Using the first order Taylor approximation as a model is not the only choice. While some choices like quadratic approximation are fairly obvious, we present another intriguing option in the following.

(B). Score function linearization: Instead of linearizing the entire cost function as demonstrated in the previous part, we can choose to only linearize the score function F, locally around wk, via

F(wk + s, x) ≈ F̂(s, x) = F(wk, x) + sᵀ ∇w F(wk, x),  ∀x.

Note that the overall objective f is itself a nonlinear function of F.
Substituting the approximation for F into the overall objective results in the following model function:

m_{k,θ}(s) = (C/2)||wk + s||²₂ + (1/2n) Σ_i log(1 + exp(−F(wk, xi) − sᵀ ∇w F(wk, xi)))
             + (1/2n) Σ_i log(1 + exp(F(wk, Gθ(zi)) + sᵀ ∇w F(wk, Gθ(zi)))).  (5)

This approximation keeps the nonlinearities of the surrogate loss function intact, therefore we expect it to be more accurate than linearization of the whole cost function f(θ, w). When F is already linear in w, linearization of the score function introduces no approximation error, and the formulation naturally reduces to the discussion presented in Sec. 3.1; non-negligible errors are introduced when linearizing the whole cost function f in this case.

Algorithm 1 GAN optimization with model function.
Initialize θ, w0, k = 0 and iterate:
1. One or few gradient ascent steps on f(θ, wk) w.r.t. generator parameters θ
2. Find step s using min_s m_{k,θ}(s) s.t. (1/2)||s||²₂ ≤ Δk
3. Update wk+1 ← wk + s
4. k ← k + 1

For general non-linear discriminators, however, no analytic solution can be computed for the program given in Eq. (4) when using this model. Nonetheless, the model function fulfills m_{k,θ}(0) = f(wk, θ) and it is convex in s. Exploiting this convexity, we can derive the dual for this trust-region optimization problem as presented in the following claim. The proof is included in the supplementary material.

Claim 2. The dual program to min_s m_{k,θ}(s) s.t.
(1/2)||s||²₂ ≤ Δk with model function as in Eq. (5) is:

max_λ (C/2)||wk||²₂ − (1/(2(C + λT))) ||−C wk + Σ_i λ_xi ∇w F(wk, xi) − Σ_i λ_zi ∇w F(wk, Gθ(zi))||²₂
      + (1/2n) Σ_i H(2n λ_xi) + (1/2n) Σ_i H(2n λ_zi) − Σ_i λ_xi F_xi + Σ_i λ_zi F_zi − λT Δk
s.t. ∀i, 0 ≤ λ_xi ≤ 1/2n, 0 ≤ λ_zi ≤ 1/2n, λT ≥ 0,

where F_xi = F(wk, xi) and F_zi = F(wk, Gθ(zi)). The optimal s∗ to the original problem can be expressed through the optimal λ∗_T, λ∗_xi, λ∗_zi as

s∗ = (1/(C + λ∗_T)) (Σ_i λ∗_xi ∇w F(wk, xi) − Σ_i λ∗_zi ∇w F(wk, Gθ(zi))) − (C/(C + λ∗_T)) wk.

Combining the dual formulation with the maximization over the generator parameters θ results in a maximization as opposed to a search for a saddle point. However, unlike the linear case, it is not possible to design an algorithm that is guaranteed to monotonically increase the cost function f(θ, w). The culprit is step 3 of Alg. 1, which adapts the model m_{k,θ}(s) in every iteration.

Intuitively, the program illustrated in Claim 2 aims at choosing dual variables λ_xi, λ_zi such that the weighted means of derivatives as well as scores match. Note that this program searches for a direction s as opposed to searching for the weights w, hence the term −C wk inside the squared norm.

In practice, we use Ipopt [20] to solve the dual problem. The form of this dual is more ill-conditioned than the linear case.
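For intuition, the trust-region subproblem of Eq. (4) with the score-linearized model of Eq. (5) can also be solved directly with an off-the-shelf constrained solver. The sketch below uses an assumed toy score function F(w, x) = 2 tanh(wᵀx) rather than a deep net, and checks two properties stated in the text: the model is exact at s = 0, and the constrained step does not increase the model:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, d, C, delta = 6, 4, 0.1, 0.05
X = rng.normal(0.5, 1.0, size=(n, d))   # real samples x_i
Z = rng.normal(-0.5, 1.0, size=(n, d))  # generated samples G_theta(z_i), held fixed
wk = 0.1 * rng.normal(size=d)           # current discriminator iterate w_k

F = lambda w, A: 2.0 * np.tanh(A @ w)                              # toy nonlinear score
dF = lambda w, A: (2.0 * (1 - np.tanh(A @ w) ** 2))[:, None] * A   # rows: grad_w F

def f_reg(w):
    # L2-regularized discriminator loss: cross-entropy of Eq. (1) plus C/2 ||w||^2
    return (C / 2) * w @ w + np.mean(np.log1p(np.exp(-F(w, X)))) / 2 \
                           + np.mean(np.log1p(np.exp(F(w, Z)))) / 2

def m(s):
    # Eq. (5): linearize the score F around w_k but keep the logistic loss intact
    fx = F(wk, X) + dF(wk, X) @ s
    fz = F(wk, Z) + dF(wk, Z) @ s
    return (C / 2) * (wk + s) @ (wk + s) + np.mean(np.log1p(np.exp(-fx))) / 2 \
                                         + np.mean(np.log1p(np.exp(fz))) / 2

# Trust-region step of Eq. (4): min_s m(s)  s.t.  (1/2)||s||^2 <= delta
cons = [{"type": "ineq", "fun": lambda s: delta - 0.5 * s @ s}]
s_star = minimize(m, np.zeros(d), method="SLSQP", constraints=cons).x

print(abs(m(np.zeros(d)) - f_reg(wk)))  # 0: the model is exact at s = 0
print(m(s_star) <= m(np.zeros(d)))      # the constrained step never increases m
```

Because m is convex in s and s = 0 is feasible, any solver that terminates at a feasible point at least as good as the start reproduces the monotone model decrease that the dual of Claim 2 exploits.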
The solution found by Ipopt sometimes contains errors; however, we found the errors to be generally tolerable and not to affect the performance of our models.

4 Experiments

In this section, we empirically study the proposed dual GAN algorithms. In particular, we show the stable and monotonic training for linear discriminators and study its properties. For nonlinear GANs we show good quality samples and compare them with standard GAN training methods. Overall the results show that our proposed approaches work across a range of problems and provide good alternatives to the standard GAN training method.

4.1 Dual GAN with linear discriminator

We explore the dual GAN with linear discriminator on a synthetic 2D dataset generated by sampling points from a mixture of 5 2D Gaussians, as well as the MNIST [12] dataset. Through these experiments we show that (1) with the proposed dual GAN algorithm, training is very stable; (2) the dual variables λ can be used as an extra informative signal for monitoring the training process; (3) features matter, and we can train good generative models even with linear discriminators when we have good features. In all experiments, we compare our proposed dual GAN with the standard GAN when training the same generator and discriminator models. Additional experimental details and results are included in the supplementary material.

The discussion of linear discriminators presented in Sec. 3.1 works with any feature representation φ(x) in place of x, as long as φ is differentiable to allow gradients to flow through it.

Figure 1: We show the learning curves and samples from two models of the same architecture, one optimized in dual space (left), and one in the primal space (i.e., typical GAN) up to 5000 iterations. Samples are shown at different points during training, as well as at the very end (right top - dual, right bottom - primal). Despite having similar sample qualities in the end, they demonstrate drastically different training behavior. In the typical GAN setup, loss oscillates and has no clear trend, whereas in the dual setup, loss monotonically increases and shows much smaller oscillation. Sample quality is nicely correlated with the dual objective during training.

Figure 2: Training GANs with linear discriminators on the simple 5-Gaussians dataset. Here we are showing typical runs with the compared methods (not cherry-picked). Top: training curves and samples from a single experiment: left - dual with full batch, middle - dual with minibatch, right - standard GAN with minibatch. The real data from this dataset are drawn in blue, generated samples in green. Below: distribution of λ's during training for the two dual GAN experiments, as a histogram at each x-value (iteration) where intensity depicts frequency for values ranging from 0 to 1 (red are data, and green are samples).

For the simple 5-Gaussian dataset, we use RBF features based on 100 sample training points. For the MNIST dataset, we use a convolutional neural net, and concatenate the hidden activations of all layers as the features. The dual GAN formulation has a single hyper-parameter C, but we found the algorithm not to be sensitive to it, and set it to 0.0001 in all experiments. We used Adam [9] with fixed learning rate and momentum to optimize the generator.

Stable Training: The main results illustrating stable training are provided in Fig. 1 and 2, where we show the learning curves as well as model samples at different points during training. Both the dual GAN and the standard GAN use minibatches of the same size, and for the synthetic dataset we did an extra experiment doing full-batch training. From these curves we can see the stable monotonic increase of the dual objective, contrasted with the standard GAN's spiky training curves.
On the synthetic data, we see that increasing the minibatch size leads to significantly improved stability. In the supplementary material we include an extra experiment to quantify the stability of the proposed method on the synthetic dataset.

Dataset     | mini-batch size | generator learnrate | generator momentum | discriminator learnrate* | C           | generator architecture                 | max iterations
5-Gaussians | randint[20,200] | enr([0,10])         | rand[.1,.9]        | enr([0,6])               | enr([0,10]) | fc-small, fc-large                     | randint[400,2000]
MNIST       | randint[20,200] | enr([0,10])         | rand[.1,.9]        | enr([0,6])               | enr([0,10]) | fc-small, fc-large, dcgan, dcgan-no-bn | 20000

Table 1: Ranges of hyperparameters for the sensitivity experiment. randint[a,b] means samples were drawn from uniformly distributed integers in the closed interval [a,b]; similarly rand[a,b] for real numbers. enr([a,b]) is shorthand for exp(-randint[a,b]), which was used for hyperparameters commonly explored in log-scale. For generator architectures, for the 5-Gaussians dataset we tried two 3-layer fully-connected networks, with 20 and 40 hidden units. For MNIST, we tried two 3-layer fully-connected networks, with 256 and 1024 hidden units, and a DCGAN-like architecture with and without batch normalization.

Figure 3: Results for the hyperparameter sensitivity experiment (left: 5-Gaussians, right: MNIST). For the 5-Gaussians dataset, the x-axis represents the number of modes covered. For MNIST, the x-axis represents discretized Inception score. Overall, the proposed dual GAN results concentrate significantly more mass on the right side, demonstrating its better robustness to hyperparameters than standard GANs.

Sensitivity to Hyperparameters: Sensitivity to hyperparameters is another important aspect of training stability. Successful GAN training typically requires carefully tuned hyperparameters, making it difficult for non-experts to adopt these generative models.
In an attempt to quantify this sensitivity, we investigated the robustness of the proposed method to the hyperparameter choice. For both the 5-Gaussians and MNIST datasets, we randomly sampled 100 hyperparameter settings from the ranges specified in Table 1, and compared learning using both the proposed dual GAN and the standard GAN. On the 5-Gaussians dataset, we evaluated the performance of the models by how well the model samples covered the 5 modes. We defined successfully covering a mode as having > 100 out of 1000 samples falling within a distance of 3 standard deviations to the center of the Gaussian. Our dual linear GAN succeeded in 49% of the experiments (note that there are a significant number of bad hyperparameter combinations in the search range), while the standard GAN succeeded in only 32%, demonstrating that our method is significantly easier to train and tune. On MNIST, the mean Inception scores were 2.83 and 1.99 for the proposed method and standard GAN training, respectively. A more detailed breakdown of mode coverage and Inception score can be found in Figure 3.

Distribution of λ During Training: The dual formulation allows us to monitor the training process through a unique perspective by monitoring the dual variables λ. Fig. 2 shows the evolution of the distribution of λ during training for the synthetic 2D dataset. At the beginning of training the λ's are on the low side, as the generator is not good and the λ's are encouraged to be small to minimize the moment matching cost. As the generator improves, more attention is devoted to the entropy term in the dual objective, and the λ's start to converge to the value of 1/(4n).

Comparison of Different Features: The qualitative differences of the learned models with different features can be observed in Fig. 4. In general, the more information the features carry about the data, the better the learned generative models.
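The mode-coverage criterion used in the sensitivity experiment above is easy to state in code. The sketch below is an illustration with assumed mode centers and standard deviation, not the paper's evaluation script:

```python
import numpy as np

# A mode counts as covered if more than `min_count` of the samples fall
# within 3 standard deviations of its center (the > 100 of 1000 rule above).
def modes_covered(samples, centers, std, min_count=100):
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    closest = np.argmin(dists, axis=1)                  # nearest mode per sample
    within = dists[np.arange(len(samples)), closest] <= 3 * std
    counts = np.bincount(closest[within], minlength=len(centers))
    return int(np.sum(counts > min_count))

rng = np.random.default_rng(3)
# 5 illustrative Gaussian centers on a circle (assumed layout, std = 0.1)
angles = 2 * np.pi * np.arange(5) / 5
centers = 4.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
good = np.concatenate([rng.normal(c, 0.1, size=(200, 2)) for c in centers])
collapsed = rng.normal(centers[0], 0.1, size=(1000, 2))  # mode collapse

print(modes_covered(good, centers, 0.1), modes_covered(collapsed, centers, 0.1))
```

A well-spread sampler covers all 5 modes, while a collapsed one covers only the single mode it latched onto.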
On MNIST, even with random features and linear discriminators we can learn reasonably good generative models. On the other hand, these results also indicate that if the features are bad then it is hard to learn good models.

Figure 4: Samples from dual linear GAN using pretrained and random features on MNIST (top row: trained features, bottom row: random features). Each column shows a set of different features, utilizing all layers in a convnet and then successive single layers in the network (All, Conv1, Conv2, Conv3, Fc4, Fc5).

Score Type                | GAN       | Score Lin | Cost Lin  | Real Data
Inception (end)           | 5.61±0.09 | 5.40±0.12 | 5.43±0.10 | 10.72±0.38
Internal classifier (end) | 3.85±0.08 | 3.52±0.09 | 4.42±0.09 | 8.03±0.07
Inception (avg)           | 5.59±0.38 | 5.44±0.08 | 5.16±0.37 | -
Internal classifier (avg) | 3.64±0.47 | 3.70±0.27 | 4.04±0.37 | -

Table 2: Inception Score [18] for different GAN training methods. Since the score depends on the classifier, we used code from [18] as well as our own small convnet CIFAR-10 classifier for evaluation (achieves 83% accuracy). All scores are computed using 10,000 samples. The top pair are scores on the final models. GANs are known to be unstable, and results are sometimes cherry-picked. So, the bottom pair are scores averaged across models sampled from different iterations of training after it stopped improving.
This leads us to the nonlinear discriminators presented below, where the discriminator features are learned together with the last layer, which may be necessary for more complicated problem domains where features are potentially difficult to engineer.

4.2 Dual GAN with non-linear discriminator

Next we assess the applicability of our proposed technique for non-linear discriminators, and focus on training models on MNIST and CIFAR-10 [11].

As discussed in Sec. 3.2, when the discriminator is non-linear, we can only approximate the discriminator locally. Therefore we do not have monotonic convergence guarantees. However, through better approximation and optimization of the discriminator we may expect the proposed dual GAN to work better than standard gradient-based GAN training in some cases. Since GAN training is sensitive to hyperparameters, to make the comparison fair, we tuned the parameters for both the standard GANs and our approaches extensively and compare the best results for each.

Fig. 5 and 6 show the samples generated by models learned using the different approaches. Visually, samples from our proposed approaches are on par with those from the standard GANs. As an extra quantitative metric for performance, we computed the Inception Score [18] for each of them on CIFAR-10 in Table 2. The Inception Score is a surrogate metric which highly depends on the network architecture. Therefore we computed the score using our own classifier as well as the one proposed in [18]. As can be seen in Table 2, both score and cost linearization are competitive with standard GANs.
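The Inception Score of [18] is exp(E_x KL(p(y|x) || p(y))), computed from a classifier's predictive distributions over generated samples. A minimal sketch of that formula, applied here to synthetic predictive distributions rather than a real classifier:

```python
import numpy as np

# Inception-score-style metric from [18]: exp(E_x KL(p(y|x) || p(y))),
# where p(y) is the marginal of the classifier's predictions p(y|x).
def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0)  # marginal class distribution over all samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident, diverse predictions score near the number of classes;
# completely uninformative (uniform) predictions score 1.
confident = np.eye(10)[np.arange(100) % 10]  # one-hot over 10 classes
uniform = np.full((100, 10), 0.1)
print(inception_score(confident), inception_score(uniform))  # ~10.0  1.0
```

This makes concrete why the score is only a surrogate: it rewards confident and diverse classifier outputs, so it inherits any bias of the classifier used, which is why the paper reports scores under two different classifiers.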
From the training curves we can also see that score linearization does best in terms of approximating the objective, and both score and cost linearization oscillate less than standard GANs.

5 Related Work

A thorough review of the research devoted to generative modeling is beyond the scope of this paper. In this section we focus on GANs [5] and review the most related work that has not been discussed throughout the paper.

[Figure 5 panels: Score Linearization | Cost Linearization | GAN]
Figure 5: Nonlinear discriminator experiments on MNIST, and their training curves, showing the primal objective, the approximation, and the discriminator accuracy. We show typical runs of the compared methods (not cherry-picked).

[Figure 6 panels: Score Linearization | Cost Linearization | GAN]
Figure 6: Nonlinear discriminator experiments on CIFAR-10; learning curves and samples organized by class are provided in the supplementary material.

Our dual formulation reveals a close connection to moment-matching objectives widely seen in many other models. MMD [6] is one such related objective, and has been used in deep generative models in [13, 3]. [18] proposed a range of techniques to improve GAN training, including the use of feature matching. Similar techniques are also common in style transfer [4]. In addition, moment-matching objectives are very common for exponential-family models [21]. Common to all these works is the use of fixed moments. The Wasserstein objective proposed for GAN training in [1] can also be thought of as a form of moment matching, where the features are part of the discriminator and are thus adaptive.
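To make the fixed-moment objective of [6, 13, 3] concrete, here is a minimal sketch of the biased squared-MMD estimator with an RBF kernel; the function names, bandwidth choice, and toy data are illustrative, not code from any of the cited works:

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), evaluated pairwise
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of MMD^2(P, Q) from samples x ~ P, y ~ Q:
    E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
same = rng.normal(size=(200, 2))
shifted = rng.normal(loc=3.0, size=(200, 2))
print(mmd2_biased(same, same))     # → 0.0 (identical samples)
print(mmd2_biased(same, shifted))  # clearly positive: distributions differ
```

Here the kernel (and hence the implied feature map) is fixed; the distinction drawn in this section is that in the Wasserstein objective and in our dual linear GAN the features are instead adapted during training.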
The main difference between our dual GAN with linear discriminators and other forms of adaptive moment matching is that we adapt the weighting of features by optimizing non-parametric dual parameters, while other works mostly adopt a parametric model to adapt features.
Duality has also been studied to understand and improve GAN training. [16] pioneered the use of duality to derive new GAN training objectives from other divergences. [1] also used duality to derive a practical training objective from other distance metrics. In contrast to previous work, rather than devising new objectives, we apply duality to the original GAN objective and aim to better optimize the discriminator.
Beyond what has already been discussed, a range of other techniques has been developed to improve or extend GAN training, e.g., [8, 7, 22, 2, 23, 14], to name a few.

6 Conclusion
To conclude, we introduced ‘Dualing GANs,’ a framework which considers duality-based formulations for the duel between the discriminator and the generator. Using the dual formulation provides opportunities to better train the discriminator. This helps remove the instability in training for linear discriminators, and we also adapted this framework to non-linear discriminators. The dual formulation also provides connections to other techniques. In particular, we discussed a close link to moment-matching techniques, and showed that the cost-function linearization for non-linear discriminators recovers the original gradient direction in standard GANs. We hope that our results spur further research in this direction to obtain a better understanding of the GAN objective and its intricacies.

Acknowledgments: This material is based upon work supported in part by the National Science Foundation under Grant No. 1718221, and grants from NSERC, Samsung and CIFAR.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[2] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv preprint arXiv:1606.03657, 2016.
[3] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
[4] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. arXiv preprint arXiv:1406.2661, 2014.
[6] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A Kernel Two-Sample Test. JMLR, 2012.
[7] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked Generative Adversarial Networks. arXiv preprint arXiv:1612.04357, 2016.
[8] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016.
[9] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In Proc. ICLR, 2015.
[10] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[11] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 1998.
[13] Y. Li, K. Swersky, and R. Zemel. Generative Moment Matching Networks. arXiv preprint arXiv:1502.02761, 2015.
[14] B. London and A. G. Schwing. Generative Adversarial Structured Networks. In Proc. NIPS Workshop on Adversarial Training, 2016.
[15] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled Generative Adversarial Networks. arXiv preprint arXiv:1611.02163, 2016.
[16] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv preprint arXiv:1606.00709, 2016.
[17] A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434, 2015.
[18] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved Techniques for Training GANs. arXiv preprint arXiv:1606.03498, 2016.
[19] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional Image Generation with PixelCNN Decoders. arXiv preprint arXiv:1606.05328, 2016.
[20] A. Wächter and L. T. Biegler. On the Implementation of a Primal-Dual Interior Point Filter Line Search Algorithm for Large-Scale Nonlinear Programming. Mathematical Programming, 2006.
[21] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[22] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv preprint arXiv:1612.03242, 2016.
[23] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based Generative Adversarial Network. In Proc. ICLR, 2017.