{"title": "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium", "book": "Advances in Neural Information Processing Systems", "page_first": 6626, "page_last": 6637, "abstract": "Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions. TTUR has an individual learning rate for both the discriminator and the generator. Using the theory of stochastic approximation, we prove that the TTUR converges under mild assumptions to a stationary local Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the `Fr\u00e9chet Inception Distance'' (FID) which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs and Improved Wasserstein GANs (WGAN-GP) outperforming conventional GAN training on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark.", "full_text": "GANs Trained by a Two Time-Scale Update Rule\n\nConverge to a Local Nash Equilibrium\n\nMartin Heusel\n\nHubert Ramsauer\n\nThomas Unterthiner\n\nBernhard Nessler\n\nSepp Hochreiter\n\nLIT AI Lab & Institute of Bioinformatics,\n\nJohannes Kepler University Linz\n\n{mhe,ramsauer,unterthiner,nessler,hochreit}@bioinf.jku.at\n\nA-4040 Linz, Austria\n\nAbstract\n\nGenerative Adversarial Networks (GANs) excel at creating realistic images with\ncomplex models for which maximum likelihood is infeasible. However, the con-\nvergence of GAN training has still not been proved. 
We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions. TTUR has an individual learning rate for both the discriminator and the generator. Using the theory of stochastic approximation, we prove that the TTUR converges under mild assumptions to a stationary local Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID), which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs and Improved Wasserstein GANs (WGAN-GP), outperforming conventional GAN training on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark.

1 Introduction

Generative adversarial networks (GANs) [16] have achieved outstanding results in generating realistic images [42, 31, 25, 1, 4] and producing text [21]. GANs can learn complex generative models for which maximum likelihood or variational approximations are infeasible. Instead of the likelihood, a discriminator network serves as objective for the generative model, that is, the generator. GAN learning is a game between the generator, which constructs synthetic data from random variables, and the discriminator, which separates synthetic data from real world data. The generator's goal is to construct data in such a way that the discriminator cannot tell them apart from real world data. Thus, the discriminator tries to minimize the synthetic-real discrimination error while the generator tries to maximize this error. Since training GANs is a game whose solution is a Nash equilibrium, gradient descent may fail to converge [44, 16, 18].
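As a concrete illustration of this failure mode (our own minimal example, not taken from the paper), simultaneous gradient steps on the bilinear game f(x, y) = x·y spiral away from the equilibrium at the origin instead of converging to it:

```python
# Simultaneous gradient descent (in x) / ascent (in y) on f(x, y) = x * y.
# The unique equilibrium is (0, 0), yet the iterates spiral outward.
x, y = 1.0, 1.0
lr = 0.1
for _ in range(100):
    gx, gy = y, x                      # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy    # simultaneous update of both players

# Each step multiplies the squared distance to the origin by exactly
# (1 + lr^2), so the trajectory diverges rather than settling at the saddle.
```

After 100 steps the squared norm has grown from 2 to 2·1.01^100 ≈ 5.41, which is the divergence that the two time-scale analysis in this paper is designed to avoid.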
Only local Nash equilibria are found, because gradient descent is a local optimization method. If there exists a local neighborhood around a point in parameter space where neither the generator nor the discriminator can unilaterally decrease their respective losses, then we call this point a local Nash equilibrium.

To characterize the convergence properties of training general GANs is still an open challenge [17, 18]. For special GAN variants, convergence can be proved under certain assumptions [34, 20, 46].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Left: Original vs. TTUR GAN training on CelebA. Right: Figure from Zhang 2007 [50], which shows the distance of the parameter from the optimum for a one time-scale update of a 4 node network flow problem. When the upper bounds on the errors (α, β) are small, the iterates oscillate and repeatedly return to a neighborhood of the optimal solution (cf. Supplement Section 2.3). However, when the upper bounds on the errors are large, the iterates typically diverge.

A prerequisite for many convergence proofs is local stability [30], which was shown for GANs by Nagarajan and Kolter [39] for a min-max GAN setting. However, Nagarajan and Kolter require for their proof either rather strong and unrealistic assumptions or a restriction to a linear discriminator. Recent convergence proofs for GANs hold for expectations over training samples or for the number of examples going to infinity [32, 38, 35, 2], and thus do not consider mini-batch learning, which leads to a stochastic gradient [47, 23, 36, 33].

Recently, actor-critic learning has been analyzed using stochastic approximation. Prasad et al. [41] showed that a two time-scale update rule ensures that training reaches a stationary local Nash equilibrium if the critic learns faster than the actor.
Convergence was proved via an ordinary differential equation (ODE) whose stable limit points coincide with stationary local Nash equilibria. We adopt this approach for GANs and prove that GANs also converge to a local Nash equilibrium when trained by a two time-scale update rule (TTUR), i.e., when discriminator and generator have separate learning rates. This also leads to better results in experiments. The main premise is that the discriminator converges to a local minimum when the generator is fixed. If the generator changes slowly enough, then the discriminator still converges, since the generator perturbations are small. Besides ensuring convergence, the performance may also improve, since the discriminator must first learn new patterns before they are transferred to the generator. In contrast, a generator which is overly fast drives the discriminator steadily into new regions without capturing its gathered information. In recent GAN implementations, the discriminator often learned faster than the generator. A new objective slowed down the generator to prevent it from overtraining on the current discriminator [44]. The Wasserstein GAN algorithm uses more update steps for the discriminator than for the generator [1]. We compare TTUR and standard GAN training. The left panel of Fig. 1 shows a stochastic gradient example on CelebA for original GAN training (orig), which often leads to oscillations, and for TTUR. The right panel shows an example of a 4 node network flow problem of Zhang et al. [50]: the distance between the actual parameter and its optimum for a one time-scale update rule is shown across iterates. When the upper bounds on the errors are small, the iterates return to a neighborhood of the optimal solution, while for large errors the iterates may diverge (see also Supplement Section 2.3).
Our novel contributions in this paper are: (i) the two time-scale update rule for GANs, (ii) the proof that GANs trained with TTUR converge to a stationary local Nash equilibrium, (iii) the description of Adam as heavy ball with friction and the resulting second order differential equation, (iv) the convergence of GANs trained with TTUR and Adam to a stationary local Nash equilibrium, and (v) the "Fréchet Inception Distance" (FID) to evaluate GANs, which is more consistent than the Inception Score.

Two Time-Scale Update Rule for GANs

We consider a discriminator D(.; w) with parameter vector w and a generator G(.; θ) with parameter vector θ. Learning is based on a stochastic gradient g̃(θ, w) of the discriminator's loss function L_D and a stochastic gradient h̃(θ, w) of the generator's loss function L_G. The loss functions L_D and L_G can be the original ones introduced in Goodfellow et al. [16], their improved versions [18], or recently proposed losses for GANs like the Wasserstein GAN [1].
Our setting is not restricted to min-max GANs, but is also valid for all other, more general GANs for which the discriminator's loss function L_D is not necessarily related to the generator's loss function L_G. The gradients g̃(θ, w) and h̃(θ, w) are stochastic, since they use mini-batches of m real world samples x^(i), 1 ≤ i ≤ m, and m synthetic samples z^(i), 1 ≤ i ≤ m, which are randomly chosen.
If the true gradients are g(θ, w) = ∇_w L_D and h(θ, w) = ∇_θ L_G, then we can define g̃(θ, w) = g(θ, w) + M^(w) and h̃(θ, w) = h(θ, w) + M^(θ) with random variables M^(w) and M^(θ). Thus, the gradients g̃(θ, w) and h̃(θ, w) are stochastic approximations to the true gradients. Consequently, we analyze the convergence of GANs by two time-scale stochastic approximation algorithms. For a two time-scale update rule (TTUR), we use the learning rates b(n) and a(n) for the discriminator and the generator update, respectively:

w_{n+1} = w_n + b(n) (g(θ_n, w_n) + M^(w)_n) ,   θ_{n+1} = θ_n + a(n) (h(θ_n, w_n) + M^(θ)_n) .   (1)

For more details on the following convergence proof and its assumptions see Supplement Section 2.1. To prove convergence of GANs learned by TTUR, we make the following assumptions (the actual assumption ends at the marker ■; the text that follows it consists of comments and explanations):

(A1) The gradients h and g are Lipschitz. ■ Consequently, networks with Lipschitz smooth activation functions like ELUs (α = 1) [11] fulfill the assumption, but ReLU networks do not.

(A2) ∑_n a(n) = ∞, ∑_n a²(n) < ∞, ∑_n b(n) = ∞, ∑_n b²(n) < ∞, a(n) = o(b(n)). ■

(A3) The stochastic gradient errors {M^(θ)_n} and {M^(w)_n} are martingale difference sequences w.r.t. the increasing σ-field F_n = σ(θ_l, w_l, M^(θ)_l, M^(w)_l, l ≤ n), n ≥ 0, with E[‖M^(θ)_n‖² | F^(θ)_n] ≤ B_1 and E[‖M^(w)_n‖² | F^(w)_n] ≤ B_2, where B_1 and B_2 are positive deterministic constants. ■ The original assumption (A3) from Borkar 1997 follows from Lemma 2 in [5] (see also [43]). The assumption is fulfilled in the Robbins-Monro setting, where mini-batches are randomly sampled and the gradients are bounded.

(A4) For each θ, the ODE ẇ(t) = g(θ, w(t)) has a local asymptotically stable attractor λ(θ) within a domain of attraction G_θ such that λ is Lipschitz. The ODE θ̇(t) = h(θ(t), λ(θ(t))) has a local asymptotically stable attractor θ* within a domain of attraction. ■ The discriminator must converge to a minimum for fixed generator parameters and the generator, in turn, must converge to a minimum for this fixed discriminator minimum. Borkar 1997 required unique global asymptotically stable equilibria [7]. The assumption of global attractors was relaxed to local attractors via Assumption (A6) and Theorem 2.7 in Karmakar & Bhatnagar [26]. For more details see Assumption (A6) in Supplement Section 2.1.3. Here, the GAN objectives may serve as Lyapunov functions. These assumptions of locally stable ODEs can be ensured by an additional weight decay term in the loss function which increases the eigenvalues of the Hessian. Therefore, problems with a region-wise constant discriminator that has zero second order derivatives are avoided. For further discussion see Supplement Section 2.1.1 (C3).

(A5) sup_n ‖θ_n‖ < ∞ and sup_n ‖w_n‖ < ∞. ■ Typically ensured by the objective or a weight decay term.

The next theorem has been proved in the seminal paper of Borkar 1997 [7].

Theorem 1 (Borkar). If the assumptions are satisfied, then the updates Eq. (1) converge to (θ*, λ(θ*)) a.s.

The solution (θ*, λ(θ*)) is a stationary local Nash equilibrium [41], since θ* as well as λ(θ*) are local asymptotically stable attractors with g(θ*, λ(θ*)) = 0 and h(θ*, λ(θ*)) = 0. An alternative approach to the proof of convergence using the Poisson equation for ensuring a solution to the fast update rule can be found in Supplement Section 2.1.2. This approach assumes a linear update function in the fast update rule which, however, can be a linear approximation to a nonlinear gradient [28, 29]. For the rate of convergence see Supplement Section 2.2, where Section 2.2.1 focuses on linear and Section 2.2.2 on non-linear updates. For equal time-scales it can only be proven that the updates revisit an environment of the solution infinitely often, which, however, can be very large [50, 12]. For more details on the analysis of equal time-scales see Supplement Section 2.3. The main idea of the proof of Borkar [7] is to use (T, δ) perturbed ODEs according to Hirsch 1989 [22] (see also Appendix Section C of Bhatnagar, Prasad, & Prashanth 2013 [6]).
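To make the two time-scale recursion of Eq. (1) concrete, here is a minimal stochastic sketch; the quadratic toy update directions, noise scale, and step-size schedules are our own hypothetical choices, picked so that assumption (A2) holds and the fast variable has a Lipschitz attractor λ(θ) = θ as in (A4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic toy problem: the fast player w tracks the slow
# player theta (so lambda(theta) = theta), and theta then contracts to 0.
def g(theta, w):          # update direction for the fast (discriminator) variable
    return theta - w

def h(theta, w):          # update direction for the slow (generator) variable
    return -(theta + 0.5 * w)

theta, w = 1.0, -1.0
for n in range(1, 20001):
    b = 0.5 / n ** 0.6    # fast step size b(n)
    a = 0.5 / n ** 0.9    # slow step size a(n); a(n) = o(b(n)) as in (A2)
    # small Gaussian terms play the role of the martingale-difference noise M_n
    w = w + b * (g(theta, w) + 0.01 * rng.standard_normal())
    theta = theta + a * (h(theta, w) + 0.01 * rng.standard_normal())

# Both iterates settle near the stationary point (theta*, lambda(theta*)) = (0, 0).
```

Because the slow variable barely moves on the fast time scale, w effectively converges to λ(θ) before θ changes appreciably, which is exactly the mechanism behind Theorem 1.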
The proof relies on the fact that there eventually is a time point when the perturbation of the slow update rule is small enough (given by δ) to allow the fast update rule to converge. For experiments with TTUR, we aim at finding learning rates such that the slow update is small enough to allow the fast one to converge. Typically, the slow update is the generator and the fast update the discriminator. We have to adjust the two learning rates such that the generator does not affect discriminator learning in an undesired way and does not perturb it too much. However, even a larger learning rate for the generator than for the discriminator may ensure that the discriminator has low perturbations. Learning rates cannot be translated directly into perturbation, since the perturbation of the discriminator by the generator is different from the perturbation of the generator by the discriminator.

2 Adam Follows an HBF ODE and Ensures TTUR Convergence

In our experiments, we aim at using Adam stochastic approximation to avoid mode collapsing. GANs suffer from "mode collapsing", where large masses of probability are mapped onto a few modes that cover only small regions. While these regions represent meaningful samples, the variety of the real world data is lost and only a few prototype samples are generated. Different methods have been proposed to avoid mode collapsing [9, 37]. We obviate mode collapsing by using Adam stochastic approximation [27]. Adam can be described as Heavy Ball with Friction (HBF) (see below), since it averages over past gradients. This averaging corresponds to a velocity that makes the generator resistant to getting pushed into small regions. Adam as an HBF method typically overshoots small local minima that correspond to mode collapse and can find flat minima which generalize well [24]. Fig. 2 depicts the dynamics of HBF, where the ball settles at a flat minimum.
Next, we analyze whether GANs trained with TTUR converge when using Adam. For more details see Supplement Section 3.

Figure 2: Heavy Ball with Friction, where the ball with mass overshoots the local minimum θ+ and settles at the flat minimum θ*.

We recapitulate the Adam update rule at step n, with learning rate a, exponential averaging factors β_1 for the first and β_2 for the second moment of the gradient ∇f(θ_{n−1}):

g_n ← ∇f(θ_{n−1})
m_n ← (β_1/(1 − β_1^n)) m_{n−1} + ((1 − β_1)/(1 − β_1^n)) g_n
v_n ← (β_2/(1 − β_2^n)) v_{n−1} + ((1 − β_2)/(1 − β_2^n)) g_n ⊙ g_n
θ_n ← θ_{n−1} − a m_n/(√v_n + ε) ,   (2)

where the following operations are meant componentwise: the product ⊙, the square root √, and the division / in the last line. Instead of the learning rate a, we introduce the damping coefficient a(n) with a(n) = a n^{−τ} for τ ∈ (0, 1]. Adam has the parameter β_1 for averaging the gradient and β_2, parametrized by a positive α, for averaging the squared gradient. These parameters can be considered as defining a memory for Adam. To characterize β_1 and β_2 in the following, we define the exponential memory r(n) = r and the polynomial memory r(n) = r/∑_{l=1}^n a(l) for some positive constant r. The next theorem describes Adam by a differential equation, which in turn allows us to apply the idea of (T, δ) perturbed ODEs to TTUR. Consequently, learning GANs with TTUR and Adam converges.

Theorem 2. If Adam is used with β_1 = 1 − a(n + 1) r(n), β_2 = 1 − α a(n + 1) r(n) and with ∇f as the full gradient of the lower bounded, continuously differentiable objective f, then for stationary second moments of the gradient, Adam follows the differential equation for Heavy Ball with Friction (HBF):

θ̈_t + a(t) θ̇_t + ∇f(θ_t) = 0 .   (3)

Adam converges for gradients ∇f that are L-Lipschitz.

Proof. Gadat et al. derived a discrete and stochastic version of Polyak's Heavy Ball method [40], the Heavy Ball with Friction (HBF) [15]:

θ_{n+1} = θ_n − a(n + 1) m_n ,
m_{n+1} = (1 − a(n + 1) r(n)) m_n + a(n + 1) r(n) (∇f(θ_n) + M_{n+1}) .   (4)

These update rules are the first moment update rules of Adam [27]. The HBF can be formulated as the differential equation Eq. (3) [15]. Gadat et al. showed that the update rules Eq. (4) converge for loss functions f with at most quadratic growth and stated that convergence can be proved for ∇f that are L-Lipschitz [15]. Convergence has been proved for continuously differentiable f that is quasiconvex (Theorem 3 in Goudou & Munier [19]). Convergence has been proved for ∇f that is L-Lipschitz and bounded from below (Theorem 3.1 in Attouch et al. [3]). Adam normalizes the average m_n by the second moments v_n of the gradient g_n: v_n = E[g_n ⊙ g_n]. m_n is componentwise divided by the square root of the components of v_n. We assume that the second moments of g_n are stationary, i.e., v = E[g_n ⊙ g_n]. In this case the normalization can be considered as additional noise, since the normalization factor randomly deviates from its mean. In the HBF interpretation the normalization by √v corresponds to introducing gravitation.
We obtain

v_n = (1 − β_2)/(1 − β_2^n) ∑_{l=1}^n β_2^{n−l} g_l ⊙ g_l ,   Δv_n = v_n − v = (1 − β_2)/(1 − β_2^n) ∑_{l=1}^n β_2^{n−l} (g_l ⊙ g_l − v) .   (5)

For a stationary second moment v and β_2 = 1 − α a(n + 1) r(n), we have Δv_n ∝ a(n + 1) r(n). We use a componentwise linear approximation to Adam's second moment normalization: 1/√(v + Δv_n) ≈ 1/√v − (1/(2 v ⊙ √v)) ⊙ Δv_n + O(Δ²v_n), where all operations are meant componentwise. If we set M^(v)_{n+1} = −(m_n ⊙ Δv_n)/(2 v ⊙ √v a(n + 1) r(n)), then m_n/√v_n ≈ m_n/√v + a(n + 1) r(n) M^(v)_{n+1} and E[M^(v)_{n+1}] = 0, since E[g_l ⊙ g_l − v] = 0. For a stationary second moment v, the random variables {M^(v)_n} form a martingale difference sequence with a bounded second moment. Therefore {M^(v)_{n+1}} can be subsumed into {M_{n+1}} in the update rules Eq. (4). The factor 1/√v can be componentwise incorporated into the gradient g, which corresponds to rescaling the parameters without changing the minimum.

According to Attouch et al. [3] the energy, that is, a Lyapunov function, is E(t) = 1/2 |θ̇(t)|² + f(θ(t)) with Ė(t) = −a |θ̇(t)|² < 0. Since Adam can be expressed as a differential equation and has a Lyapunov function, the idea of (T, δ) perturbed ODEs [7, 22, 8] carries over to Adam. Therefore the convergence of Adam with TTUR can be proved via two time-scale stochastic approximation analysis like in Borkar [7] for stationary second moments of the gradient.

In the supplement we further discuss the convergence of two time-scale stochastic approximation algorithms with additive noise, linear update functions depending on Markov chains, nonlinear update functions, and updates depending on controlled Markov processes.
Furthermore, the supplement presents work on the rate of convergence for both linear and nonlinear update rules using similar techniques as the local stability analysis of Nagarajan and Kolter [39]. Finally, we elaborate more on equal time-scale updates, which are investigated for saddle point problems and actor-critic learning.

3 Experiments

Performance Measure. Before presenting the experiments, we introduce a quality measure for models learned by GANs. The objective of generative learning is that the model produces data which matches the observed data. Therefore, each distance between the probability of observing real world data p_w(.) and the probability of generating model data p(.) can serve as performance measure for generative models. However, defining appropriate performance measures for generative models is difficult [45]. The best known measure is the likelihood, which can be estimated by annealed importance sampling [49]. However, the likelihood heavily depends on the noise assumptions for the real data and can be dominated by single samples [45]. Other approaches like density estimates have drawbacks, too [45]. A well-performing approach to measure the performance of GANs is the "Inception Score", which correlates with human judgment [44]. Generated samples are fed into an inception model that was trained on ImageNet. Images with meaningful objects are supposed to have low label (output) entropy, that is, they belong to few object classes. On the other hand, the entropy across images should be high, that is, the variance over the images should be large. A drawback of the Inception Score is that the statistics of real world samples are not used and compared to the statistics of synthetic samples. Next, we improve the Inception Score. The equality p(.) = p_w(.) holds except for a non-measurable set if and only if ∫ p(x) f(x) dx = ∫ p_w(x) f(x) dx for a basis f(.)
spanning the function space in which p(.) and p_w(.) live.

Figure 3: FID is evaluated for upper left: Gaussian noise, upper middle: Gaussian blur, upper right: implanted black rectangles, lower left: swirled images, lower middle: salt and pepper noise, and lower right: CelebA dataset contaminated by ImageNet images. The disturbance level rises from zero and increases to the highest level. The FID captures the disturbance level very well by monotonically increasing.

These equalities of expectations are used to describe distributions by moments or cumulants, where f(x) are polynomials of the data x. We generalize these polynomials by replacing x by the coding layer of an inception model in order to obtain vision-relevant features. For practical reasons we only consider the first two polynomials, that is, the first two moments: mean and covariance. The Gaussian is the maximum entropy distribution for given mean and covariance, therefore we assume the coding units to follow a multidimensional Gaussian. The difference of two Gaussians (synthetic and real-world images) is measured by the Fréchet distance [14], also known as Wasserstein-2 distance [48]. We call the Fréchet distance d(., .) between the Gaussian with mean and covariance (m, C) obtained from p(.) and the Gaussian with mean and covariance (m_w, C_w) obtained from p_w(.) the "Fréchet Inception Distance" (FID), which is given by [13]:

d²((m, C), (m_w, C_w)) = ‖m − m_w‖²_2 + Tr(C + C_w − 2 (C C_w)^{1/2}) .

Next we show that the FID is consistent with increasing disturbances and human judgment. Fig. 3 evaluates the FID for Gaussian noise, Gaussian blur, implanted black rectangles, swirled images, salt and pepper noise, and the CelebA dataset contaminated by ImageNet images. The FID captures the disturbance level very well. In the experiments we used the FID to evaluate the performance of GANs. For more details and a comparison between FID and Inception Score see Supplement Section 1, where we show that FID is more consistent with the noise level than the Inception Score.

Model Selection and Evaluation. We compare the two time-scale update rule (TTUR) for GANs with the original GAN training to see whether TTUR improves the convergence speed and performance of GANs. We have selected Adam stochastic optimization to reduce the risk of mode collapsing. The advantage of Adam has been confirmed by MNIST experiments, where Adam indeed considerably reduced the cases for which we observed mode collapsing. Although TTUR ensures that the discriminator converges during learning, practicable learning rates must be found for each experiment. We face a trade-off since the learning rates should be small enough (e.g. for the generator) to ensure convergence but at the same time should be large enough to allow fast learning. For each of the experiments, the learning rates have been optimized to be large while still ensuring stable training, which is indicated by a decreasing FID or Jensen-Shannon divergence (JSD). We further fixed the time point for stopping training to the update step when the FID or Jensen-Shannon divergence of the best models was no longer decreasing. For some models, we observed that the FID diverges or starts to increase at a certain time point. An example of this behaviour is shown in Fig. 5. The performance of generative models is evaluated via the Fréchet Inception Distance (FID) introduced above.
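The FID formula is straightforward to implement once the means and covariances are estimated. The following sketch is our own; it uses the standard reduction of Tr((C C_w)^{1/2}) to a symmetric matrix square root and assumes C and C_w are symmetric positive semi-definite:

```python
import numpy as np

def _sqrtm_psd(mat):
    # Square root of a symmetric positive semi-definite matrix
    # via its eigendecomposition; tiny negative eigenvalues are clipped.
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(m, C, mw, Cw):
    """d^2((m, C), (mw, Cw)) = ||m - mw||_2^2 + Tr(C + Cw - 2 (C Cw)^(1/2))."""
    # Tr((C Cw)^(1/2)) equals Tr((C^(1/2) Cw C^(1/2))^(1/2)) for PSD C, Cw,
    # which avoids taking the square root of a non-symmetric product.
    c_half = _sqrtm_psd(C)
    covmean = _sqrtm_psd(c_half @ Cw @ c_half)
    return float(np.sum((m - mw) ** 2) + np.trace(C + Cw - 2.0 * covmean))
```

As a sanity check, identical Gaussians have distance 0, and shifting the mean by a unit vector under unit covariances yields a squared distance of 1.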
For the One Billion Word experiment, the normalized JSD served as performance measure. For computing the FID, we propagated all images from the training dataset through the pretrained Inception-v3 model following the computation of the Inception Score [44]; however, we use the last pooling layer as coding layer. For this coding layer, we calculated the mean m_w and the covariance matrix C_w. Thus, we approximate the first and second central moment of the function given by the Inception coding layer under the real world distribution. To approximate these moments for the model distribution, we generate 50,000 images, propagate them through the Inception-v3 model, and then compute the mean m and the covariance matrix C. For computational efficiency, we evaluate the FID every 1,000 DCGAN mini-batch updates, every 5,000 WGAN-GP outer iterations for the image experiments, and every 100 outer iterations for the WGAN-GP language model. For the one time-scale updates, a WGAN-GP outer iteration consists of five discriminator mini-batches for the image model and ten discriminator mini-batches for the language model, where we follow the original implementation. For TTUR, however, the discriminator is updated only once per iteration. We repeat the training for each single time-scale (orig) and TTUR learning rate eight times for the image datasets and ten times for the language benchmark. In addition to the mean FID training progress, we show the minimum and maximum FID over all runs at each evaluation time-step. For more details, implementations and further results see Supplement Sections 4 and 6.

Simple Toy Data.
We first want to demonstrate the difference between a single time-scale update rule and TTUR on a simple toy min/max problem where a saddle point should be found. The objective f(x, y) = (1 + x²)(100 − y²) in Fig. 4 (left) has a saddle point at (x, y) = (0, 0) and fulfills assumption A4. The norm ||(x, y)|| measures the distance of the parameter vector (x, y) to the saddle point. We update (x, y) by gradient descent in x and gradient ascent in y, using additive Gaussian noise to simulate a stochastic update. The updates should converge to the saddle point (x, y) = (0, 0) with objective value f(0, 0) = 100 and norm 0. In Fig. 4 (right), the first two rows show one time-scale update rules. The large learning rate in the first row diverges and has large fluctuations. The smaller learning rate in the second row converges, but more slowly than the TTUR in the third row, which has slow x-updates. TTUR with slow y-updates in the fourth row also converges, but more slowly.

Figure 4: Left: Plot of the objective with a saddle point at (0, 0). Right: Training progress with equal learning rates of 0.01 (first row) and 0.001 (second row) for x and y, TTUR with a learning rate of 0.0001 for x vs. 0.01 for y (third row), and a larger learning rate of 0.01 for x vs. 0.0001 for y (fourth row). The columns show the function values (left), norms (middle), and (x, y) (right). TTUR (third row) clearly converges faster than with equal time-scale updates and moves directly to the saddle point, as shown by the norm and in the (x, y)-plot.

DCGAN on Image Data. We test TTUR for the deep convolutional GAN (DCGAN) [42] on the CelebA, CIFAR-10, SVHN, and LSUN Bedrooms datasets. Fig. 5 shows the FID during learning with the original learning method (orig) and with TTUR. The original training method is faster at the beginning, but TTUR eventually achieves better performance.
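The toy saddle-point experiment above can be simulated in a few lines (a minimal sketch; the learning rates follow the third row of Fig. 4, while the noise scale and starting point are our choices):

```python
import numpy as np

def f(x, y):
    """Toy objective with a saddle point at (0, 0) and f(0, 0) = 100."""
    return (1.0 + x**2) * (100.0 - y**2)

def ttur_toy(lr_x=1e-4, lr_y=1e-2, steps=2000, noise=1.0, seed=0):
    """Gradient descent in x, gradient ascent in y, with additive
    Gaussian gradient noise simulating stochastic updates."""
    rng = np.random.default_rng(seed)
    x, y = 0.5, 0.5
    for _ in range(steps):
        gx = 2.0 * x * (100.0 - y**2)   # df/dx
        gy = -2.0 * y * (1.0 + x**2)    # df/dy
        x -= lr_x * (gx + noise * rng.normal())  # slow descent in x
        y += lr_y * (gy + noise * rng.normal())  # fast ascent in y
    return x, y

x, y = ttur_toy()
print(np.hypot(x, y))  # norm ||(x, y)||; stays close to the saddle point
```

With the equal learning rates of the first row (0.01 for both x and y), the x-update factor 1 − 0.02(100 − y²) has magnitude near one for small y, so the iterates fluctuate strongly instead of contracting toward the saddle point.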
DCGAN trained with TTUR consistently reaches a lower FID than the original method, and for CelebA and LSUN Bedrooms all one time-scale runs diverge. For DCGAN the learning rate of the generator is larger than that of the discriminator, which, however, does not contradict the TTUR theory (see Supplement Section 5). In Table 1 we report the best FID with TTUR and one time-scale training for optimized numbers of updates and learning rates. TTUR consistently outperforms standard training and is more stable.

WGAN-GP on Image Data. We used the WGAN-GP image model [21] to test TTUR with the CIFAR-10 and LSUN Bedrooms datasets. In contrast to the original code, where the discriminator is trained five times for each generator update, TTUR updates the discriminator only once; therefore we align the training progress with wall-clock time. The learning rate for the original training was optimized to be large while still leading to stable learning. TTUR can use a higher learning rate for the discriminator since TTUR stabilizes learning. Fig. 6 shows the FID during learning with the original learning method and with TTUR. Table 1 shows the best FID with TTUR and one time-scale training for optimized numbers of iterations and learning rates. Again TTUR reaches lower FIDs than one time-scale training.

Figure 5: Mean FID (solid line) surrounded by a shaded area bounded by the maximum and the minimum over 8 runs for DCGAN on CelebA, CIFAR-10, SVHN, and LSUN Bedrooms. TTUR learning rates are given for the discriminator b and generator a as: "TTUR b a". Top Left: CelebA. Top Right: CIFAR-10, starting at mini-batch update 10k for better visualisation. Bottom Left: SVHN. Bottom Right: LSUN Bedrooms. Training with TTUR (red) is more stable, has much lower variance, and leads to a better FID.

Figure 6: Mean FID (solid line) surrounded by a shaded area bounded by the maximum and the minimum over 8 runs for WGAN-GP on CIFAR-10 and LSUN Bedrooms. TTUR learning rates are given for the discriminator b and generator a as: "TTUR b a". Left: CIFAR-10, starting at minute 20. Right: LSUN Bedrooms. Training with TTUR (red) has much lower variance and leads to a better FID.

WGAN-GP on Language Data. Finally, the One Billion Word Benchmark [10] serves to evaluate TTUR on WGAN-GP. The character-level generative language model is a 1D convolutional neural network (CNN) which maps a latent vector to a sequence of one-hot character vectors of dimension 32, given by the maximum of a softmax output. The discriminator is also a 1D CNN applied to sequences of one-hot vectors of 32 characters. Since the FID criterion only works for images, we measured the performance by the Jensen-Shannon-divergence (JSD) between the model and the real world distribution, as has been done previously [21]. In contrast to the original code, where the critic is trained ten times for each generator update, TTUR updates the discriminator only once; therefore we align the training progress with wall-clock time. The learning rate for the original training was optimized to be large while still leading to stable learning. TTUR can use a higher learning rate for the discriminator since TTUR stabilizes learning. We report for the 4- and 6-gram word evaluation the normalized mean JSD for ten runs for original training and TTUR training in Fig. 7. In Table 1 we report the best JSD at an optimal time-step, where TTUR outperforms the standard training for both measures.
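The JSD evaluation above compares n-gram statistics of generated text against the real data. Its core quantity can be sketched as follows (a minimal sketch; we show the plain Jensen-Shannon divergence in bits over character n-grams, while the normalization of the reported scores follows the WGAN-GP evaluation):

```python
import math
from collections import Counter

def ngram_dist(text, n):
    """Empirical distribution over the character n-grams of a string."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def jsd(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with disjoint support."""
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in set(p) | set(q)}
    def kl(a):  # KL(a || m), skipping zero-probability n-grams
        return sum(pa * math.log2(pa / m[g]) for g, pa in a.items() if pa > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

real = ngram_dist("the quick brown fox jumps over the lazy dog", 4)
fake = ngram_dist("teh quikc brwon fox jmups oevr the lzay dog", 4)
print(jsd(real, fake))  # between 0 (identical) and 1 (disjoint)
```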
The improvement of TTUR on the 6-gram statistics over original training shows that TTUR enables the model to generate more subtle pseudo-words which better resemble real words.

Figure 7: Performance of WGAN-GP models trained with the original (orig) and our TTUR method on the One Billion Word benchmark. The performance is measured by the normalized Jensen-Shannon-divergence based on 4-gram (left) and 6-gram (right) statistics, averaged (solid line) and surrounded by a shaded area bounded by the maximum and the minimum over 10 runs, aligned to wall-clock time and starting at minute 150. TTUR learning (red) clearly outperforms the original one time-scale learning.

Table 1: The performance of DCGAN and WGAN-GP trained with the original one time-scale update rule and with TTUR on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark. During training we compare the performance with respect to the FID and JSD for an optimized number of updates.
TTUR consistently exhibits a better FID and a better JSD.

DCGAN Image
dataset    method  b, a        updates  FID   | method  b = a  updates  FID
CelebA     TTUR    1e-5, 5e-4  225k     12.5  | orig    5e-4   70k      21.4
CIFAR-10   TTUR    1e-4, 5e-4  75k      36.9  | orig    1e-4   100k     37.7
SVHN       TTUR    1e-5, 1e-4  165k     12.5  | orig    5e-5   185k     21.4
LSUN       TTUR    1e-5, 1e-4  340k     57.5  | orig    5e-5   70k      70.4

WGAN-GP Image
dataset    method  b, a        time(m)  FID   | method  b = a  time(m)  FID
CIFAR-10   TTUR    3e-4, 1e-4  700      24.8  | orig    1e-4   800      29.3
LSUN       TTUR    3e-4, 1e-4  1900     9.5   | orig    1e-4   2010     20.5

WGAN-GP Language
n-gram     method  b, a        time(m)  JSD   | method  b = a  time(m)  JSD
4-gram     TTUR    3e-4, 1e-4  1150     0.35  | orig    1e-4   1040     0.38
6-gram     TTUR    3e-4, 1e-4  1120     0.74  | orig    1e-4   1070     0.77

4 Conclusion

For learning GANs, we have introduced the two time-scale update rule (TTUR), which we have proved to converge to a stationary local Nash equilibrium. Then we described Adam stochastic optimization as a heavy ball with friction (HBF) dynamics, which shows that Adam converges and that Adam tends to find flat minima while avoiding small local minima. A second order differential equation describes the learning dynamics of Adam as an HBF system. Via this differential equation, the convergence of GANs trained with TTUR to a stationary local Nash equilibrium can be extended to Adam. Finally, to evaluate GANs, we introduced the 'Fréchet Inception Distance' (FID), which captures the similarity of generated images to real ones better than the Inception Score.
In experiments we have compared GANs trained with TTUR to conventional GAN training with a one time-scale update rule on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark. TTUR outperforms conventional GAN training consistently in all experiments.

Acknowledgment

This work was supported by NVIDIA Corporation, Bayer AG with Research Agreement 09/2017, Zalando SE with Research Agreement 01/2016, Audi.JKU Deep Learning Center, Audi Electronic Venture GmbH, IWT research grant IWT150865 (Exaptation), H2020 project grant 671555 (ExCAPE) and FWF grant P 28660-N31.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv e-prints, arXiv:1701.07875, 2017.

[2] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pages 224–232, 2017.

[3] H. Attouch, X. Goudou, and P. Redont. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(1):1–34, 2000.

[4] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv e-prints, arXiv:1703.10717, 2017.

[5] D. P. Bertsekas and J. N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.

[6] S. Bhatnagar, H. L. Prasad, and L. A. Prashanth. Stochastic Recursive Algorithms for Optimization. Lecture Notes in Control and Information Sciences. Springer-Verlag London, 2013.

[7] V.
S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters,\n\n29(5):291\u2013294, 1997.\n\n[8] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation\nand reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447\u2013469, 2000.\n\n[9] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial\nnetworks. In Proceedings of the International Conference on Learning Representations (ICLR),\n2017. arXiv:1612.02136.\n\n[10] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion\nword benchmark for measuring progress in statistical language modeling. arXiv e-prints,\narXiv:1312.3005, 2013.\n\n[11] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by\nexponential linear units (ELUs). In Proceedings of the International Conference on Learning\nRepresentations (ICLR), 2016. arXiv:1511.07289.\n\n[12] D. DiCastro and R. Meir. A convergent online single time scale actor critic algorithm. J. Mach.\n\nLearn. Res., 11:367\u2013410, 2010.\n\n[13] D. C. Dowson and B. V. Landau. The Fr\u00e9chet distance between multivariate normal distributions.\n\nJournal of Multivariate Analysis, 12:450\u2013455, 1982.\n\n[14] M. Fr\u00e9chet. Sur la distance de deux lois de probabilit\u00e9. C. R. Acad. Sci. Paris, 244:689\u2013692,\n\n1957.\n\n[15] S. Gadat, F. Panloup, and S. Saadane. Stochastic heavy ball. arXiv e-prints, arXiv:1609.04228,\n\n2016.\n\n[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,\nand Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D.\nLawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems\n27, pages 2672\u20132680, 2014.\n\n[17] I. J. Goodfellow. On distinguishability criteria for estimating generative models. 
In Workshop at the International Conference on Learning Representations (ICLR), 2015. arXiv:1412.6515.

[18] I. J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv e-prints, arXiv:1701.00160, 2017.

[19] X. Goudou and J. Munier. The gradient and heavy ball with friction dynamical systems: the quasiconvex case. Mathematical Programming, 116(1):173–191, 2009.

[20] P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. arXiv e-prints, arXiv:1706.03269, 2017.

[21] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv e-prints, arXiv:1704.00028, 2017. Advances in Neural Information Processing Systems 31 (NIPS 2017).

[22] M. W. Hirsch. Convergent activation dynamics in continuous time networks. Neural Networks, 2(5):331–349, 1989.

[23] R. D. Hjelm, A. P. Jacob, T. Che, K. Cho, and Y. Bengio. Boundary-seeking generative adversarial networks. arXiv e-prints, arXiv:1702.08431, 2017.

[24] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[25] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. arXiv:1611.07004.

[26] P. Karmakar and S. Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 2017.

[27] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015. arXiv:1412.6980.

[28] V. R. Konda. Actor-Critic Algorithms. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2002.

[29] V. R. Konda and J. N.
Tsitsiklis. Linear stochastic approximation driven by slowly varying Markov chains. Systems & Control Letters, 50(2):95–102, 2003.

[30] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Recursive Algorithms and Applications. Springer-Verlag New York, second edition, 2003.

[31] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. arXiv e-prints, arXiv:1609.04802, 2016.

[32] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems 31 (NIPS 2017), 2017. arXiv:1705.08584.

[33] J. Li, A. Madry, J. Peebles, and L. Schmidt. Towards understanding the dynamics of generative adversarial networks. arXiv e-prints, arXiv:1706.09884, 2017.

[34] J. H. Lim and J. C. Ye. Geometric GAN. arXiv e-prints, arXiv:1705.02894, 2017.

[35] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems 31 (NIPS 2017), 2017. arXiv:1705.08991.

[36] L. M. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems 31 (NIPS 2017), 2017. arXiv:1705.10461.

[37] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017. arXiv:1611.02163.

[38] Y. Mroueh and T. Sercu. Fisher GAN. In Advances in Neural Information Processing Systems 31 (NIPS 2017), 2017. arXiv:1705.09675.

[39] V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. arXiv e-prints, arXiv:1706.04156, 2017. Advances in Neural Information Processing Systems 31 (NIPS 2017).

[40] B. T. Polyak.
Some methods of speeding up the convergence of iteration methods. USSR\n\nComputational Mathematics and Mathematical Physics, 4(5):1\u201317, 1964.\n\n[41] H. L. Prasad, L. A. Prashanth, and S. Bhatnagar. Two-timescale algorithms for learning Nash\nequilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference\non Autonomous Agents and Multiagent Systems (AAMAS \u201915), pages 1371\u20131379, 2015.\n\n[42] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convo-\nlutional generative adversarial networks. In Proceedings of the International Conference on\nLearning Representations (ICLR), 2016. arXiv:1511.06434.\n\n[43] A. Ramaswamy and S. Bhatnagar. Stochastic recursive inclusion in two timescales with an\n\napplication to the lagrangian dual problem. Stochastics, 88(8):1173\u20131187, 2016.\n\n[44] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved\ntechniques for training GANs. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and\nR. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234\u20132242,\n2016.\n\n[45] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models.\nIn Proceedings of the International Conference on Learning Representations (ICLR), 2016.\narXiv:1511.01844.\n\n[46] I. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Sch\u00f6lkopf. AdaGAN: Boosting\ngenerative models. arXiv e-prints, arXiv:1701.02386, 2017. Advances in Neural Information\nProcessing Systems 31 (NIPS 2017).\n\n[47] R. Wang, A. Cully, H. J. Chang, and Y. Demiris. MAGAN: margin adaptation for generative\n\nadversarial networks. arXiv e-prints, arXiv:1704.03817, 2017.\n\n[48] L. N. Wasserstein. Markov processes over denumerable products of spaces describing large\n\nsystems of automata. Probl. Inform. Transmission, 5:47\u201352, 1969.\n\n[49] Y. Wu, Y. Burda, R. Salakhutdinov, and R. B. Grosse. 
On the quantitative analysis of decoder-based generative models. In Proceedings of the International Conference on Learning Representations (ICLR), 2017. arXiv:1611.04273.

[50] J. Zhang, D. Zheng, and M. Chiang. The impact of stochastic noisy feedback on distributed network utility maximization. In IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, pages 222–230, 2007.