{"title": "Self-supervised GAN: Analysis and Improvement with Multi-class Minimax Game", "book": "Advances in Neural Information Processing Systems", "page_first": 13253, "page_last": 13264, "abstract": "Self-supervised (SS) learning is a powerful approach for representation learning using unlabeled data. Recently, it has been applied to Generative Adversarial Networks (GAN) training. Specifically, SS tasks were proposed to address the catastrophic forgetting issue in the GAN discriminator. In this work, we perform an in-depth analysis to understand how SS tasks interact with learning of generator. From the analysis, we identify issues of SS tasks which allow a severely mode-collapsed generator to excel the SS tasks. To address the issues, we propose new SS tasks based on a multi-class minimax game. The competition between our proposed SS tasks in the game encourages the generator to learn the data distribution and generate diverse samples. We provide both theoretical and empirical analysis to support that our proposed SS tasks have better convergence property. We conduct experiments to incorporate our proposed SS tasks into two different GAN baseline models. Our approach establishes state-of-the-art FID scores on CIFAR-10, CIFAR-100, STL-10, CelebA, Imagenet $32\\times32$ and Stacked-MNIST datasets, outperforming existing works by considerable margins in some cases. Our unconditional GAN model approaches performance of conditional GAN without using labeled data. Our code: \\url{https://github.com/tntrung/msgan}", "full_text": "Self-supervised GAN: Analysis and Improvement\n\nwith Multi-class Minimax Game\n\nNgoc-Trung Tran, Viet-Hung Tran, Ngoc-Bao Nguyen, Linxiao Yang, Ngai-Man Cheung\n\nSingapore University of Technology and Design (SUTD)\n\nCorresponding author: Ngai-Man Cheung \n\nAbstract\n\nSelf-supervised (SS) learning is a powerful approach for representation learning\nusing unlabeled data. 
Recently, it has been applied to Generative Adversarial Networks (GAN) training. Specifically, SS tasks were proposed to address the catastrophic forgetting issue in the GAN discriminator. In this work, we perform an in-depth analysis to understand how SS tasks interact with the learning of the generator. From the analysis, we identify issues of SS tasks which allow a severely mode-collapsed generator to excel at the SS tasks. To address the issues, we propose new SS tasks based on a multi-class minimax game. The competition between our proposed SS tasks in the game encourages the generator to learn the data distribution and generate diverse samples. We provide both theoretical and empirical analysis to support that our proposed SS tasks have a better convergence property. We conduct experiments to incorporate our proposed SS tasks into two different GAN baseline models. Our approach establishes state-of-the-art FID scores on CIFAR-10, CIFAR-100, STL-10, CelebA, ImageNet 32 × 32 and Stacked-MNIST datasets, outperforming existing works by considerable margins in some cases. Our unconditional GAN model approaches the performance of conditional GAN without using labeled data. Our code: https://github.com/tntrung/msgan

1 Introduction

Generative Adversarial Networks (GAN). GANs [12] have become one of the most important methods for learning generative models. GAN has shown remarkable results in various tasks, such as image generation [17, 2, 18], image transformation [16, 53], super-resolution [23], text to image [38, 50], and anomaly detection [41, 26]. The idea behind GAN is a minimax game. It uses a binary classifier, the so-called discriminator, to distinguish the data (real) versus generated (fake) samples. The generator of GAN is trained to fool the discriminator into classifying the generated samples as the real ones.
By having the generator and discriminator compete with each other in this adversarial process, both are able to improve themselves. The end goal is to have the generator capture the data distribution. Although considerable improvement has been made for GAN under the conditional setting [34, 49, 2], i.e., using ground-truth labels to support the learning, the unconditional setup remains very challenging. Fundamentally, using only a single signal (real/fake) to guide the generator to learn the high-dimensional, complex data distribution is very challenging [11, 1, 3, 5, 30, 40].

Self-supervised Learning. Self-supervised learning is an active research area [6, 35, 51, 52, 33, 10] and a paradigm of unsupervised learning. Self-supervised methods encourage the classifier to learn better feature representations with pseudo-labels. In particular, these methods learn image features by training the model to recognize a geometric transformation that was applied to its input image. A simple yet powerful method proposed in [10] uses image rotations by 0, 90, 180, and 270 degrees as the geometric transformations. The model is trained on the 4-way classification task of recognizing one of the four rotations; this task is referred to as the self-supervised task. This simple method is able to close the gap between supervised and unsupervised image classification [10].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Self-supervised Learning for GAN. Recently, self-supervised learning has been applied to GAN training [4, 44]. These works propose auxiliary self-supervised classification tasks to assist the main GAN task (Figure 1).
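As a concrete illustration, the rotation pretext task of [10] described above can be sketched in a few lines. The helper name and the use of NumPy are our own illustrative choices, not code from any of the cited works:

```python
import numpy as np

def rotation_pretext_batch(images):
    """Return the 0/90/180/270-degree rotated copies of a batch of images
    (shape (N, H, W, C)) together with their rotation pseudo-labels 0..3."""
    rotated, labels = [], []
    for k in range(4):  # T_1..T_4 in the paper's notation
        rotated.append(np.rot90(images, k=k, axes=(1, 2)))
        labels.append(np.full(len(images), k, dtype=np.int64))
    return np.concatenate(rotated), np.concatenate(labels)

# A 4-way classifier is then trained to predict y_rot from x_rot.
batch = np.random.rand(8, 32, 32, 3)  # eight CIFAR-10-sized images
x_rot, y_rot = rotation_pretext_batch(batch)
print(x_rot.shape, y_rot.shape)  # (32, 32, 32, 3) (32,)
```

No labeled data is needed: the pseudo-label is the index of the rotation that was applied.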
In particular, their objective functions for learning the discriminator D and the generator G are the multi-task losses shown in (1) and (2), respectively:

\max_{D,C} V_D(D, C, G) = V(D, G) + \lambda_d \Psi(G, C)    (1)

\min_{G} V_G(D, C, G) = V(D, G) - \lambda_g \Phi(G, C)    (2)

V(D, G) = \mathbb{E}_{x \sim P_d} \log\big(D(x)\big) + \mathbb{E}_{x \sim P_g} \log\big(1 - D(x)\big)    (3)

Here, V(D, G) in (3) is the GAN task, which is the original value function proposed in Goodfellow et al. [12]. P_d is the true data distribution, and P_g is the distribution induced by the generator mapping. Ψ(G, C) and Φ(G, C) are the self-supervised (SS) tasks for discriminator and generator learning, respectively (details to be discussed). C is the classifier for the self-supervised task, e.g. the rotation classifier discussed in [10]. Based on this framework, Chen et al. [4] apply the self-supervised task to help the discriminator counter catastrophic forgetting. Empirically, they have shown that the self-supervised task enables the discriminator to learn a more stable and improved representation. Tran et al. [44] propose to improve self-supervised learning with adversarial training.

Despite the encouraging empirical results, an in-depth analysis of the interaction between the SS tasks (Ψ(.) and Φ(.)) and the GAN task (V(D, G)) has not been done before.
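The multi-task structure of (1)-(3) can be sketched numerically, assuming the discriminator outputs probabilities in (0, 1) and the SS terms Ψ, Φ have already been reduced to scalars; all function names here are illustrative, not from the cited implementations:

```python
import numpy as np

EPS = 1e-8  # avoid log(0)

def gan_value(d_real, d_fake):
    """V(D, G) of Eq. (3): E_{x~Pd}[log D(x)] + E_{x~Pg}[log(1 - D(x))]."""
    return np.mean(np.log(d_real + EPS)) + np.mean(np.log(1.0 - d_fake + EPS))

def v_discriminator(d_real, d_fake, psi, lambda_d=1.0):
    """V_D of Eq. (1), maximized over D and C."""
    return gan_value(d_real, d_fake) + lambda_d * psi

def v_generator(d_real, d_fake, phi, lambda_g=1.0):
    """V_G of Eq. (2), minimized over G."""
    return gan_value(d_real, d_fake) - lambda_g * phi

d_real = np.array([0.9, 0.8])  # D's scores on real samples
d_fake = np.array([0.2, 0.1])  # D's scores on generated samples
print(v_discriminator(d_real, d_fake, psi=-0.5))
print(v_generator(d_real, d_fake, phi=-0.5))
```

The weights λ_d and λ_g control how strongly the auxiliary SS terms influence each player relative to the shared GAN value.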
On the one hand, the application of an SS task for discriminator learning is reasonable: the goal of the discriminator is to classify real/fake images, and an additional SS classification task Ψ(G, C) could assist feature learning and enhance the GAN task. On the other hand, the motivation and design of an SS task for generator learning is rather subtle: the goal of generator learning is to capture the data distribution in G, and it is unclear exactly how an additional SS classification task Φ(G, C) could help.

In this work, we conduct in-depth empirical and theoretical analysis to understand the interaction between the self-supervised tasks (Ψ(.) and Φ(.)) and the learning of generator G. Interestingly, our analysis reveals issues of existing works. Specifically, the SS tasks of existing works have a "loophole" that, during generator learning, G could exploit to maximize Φ(G, C) without truly learning the data distribution. We show analytically and empirically that a severely mode-collapsed generator can excel at Φ(G, C). To address this issue, we propose new SS tasks based on a multi-class minimax game. Our proposed new SS tasks of the discriminator and generator compete with each other to reach the equilibrium point. Through this competition, our proposed SS tasks are able to support the GAN task better. Specifically, our analysis shows that our proposed SS tasks enhance the matching between P_d and P_g by leveraging the transformed samples used in the SS classification (rotated images when [10] is applied). In addition, our design couples the GAN task and the SS task. To validate our design, we provide theoretical analysis of the convergence property of our proposed SS tasks. Training a GAN with our proposed self-supervised tasks based on the multi-class minimax game significantly improves the baseline models. Overall, our system establishes state-of-the-art Fréchet Inception Distance (FID) scores.
In summary, our contributions are:

• We conduct in-depth empirical and theoretical analysis to understand the issues of self-supervised tasks in existing works.
• Based on the analysis, we propose new self-supervised tasks based on a multi-class minimax game.
• We conduct extensive experiments to validate our proposed self-supervised tasks.

2 Related works

While training GAN with conditional signals (e.g., ground-truth class labels) has made good progress [34, 49, 2], training GAN in the unconditional setting is still very challenging. In the original GAN [12], only the single signal (real or fake) of samples is provided to train the discriminator and the generator. With these signals, the generator or discriminator may fall into ill-posed settings, and they may get stuck at bad local minima while still satisfying the signal constraints. To overcome these problems, many regularizations have been proposed. One of the most popular approaches is to enforce (towards) the Lipschitz condition of the discriminator. These methods include weight clipping [1], gradient penalty constraints [13, 39, 21, 36, 27] and spectral norm [31]. Constraining the discriminator mitigates vanishing gradients and avoids a sharp decision boundary between the real and fake classes.

Figure 1: The model of (a) SSGAN [4] and (b) our approach. Here, Ψ(C) and Φ(G, C) are the self-supervised value functions in training the discriminator and generator, respectively, as proposed in [4]. Ψ+(G, C) and Φ+(G, C) are the self-supervised value functions proposed in this work.

Using Lipschitz constraints improves the stability of GAN. However, the challenging optimization problem still remains when using a single supervisory signal, similar to the original GAN [12]. In particular, the learning of the discriminator is highly dependent on generated samples.
If the generator collapses to some particular modes of the data distribution, it is only able to create samples around these modes. There is then no competition to train the discriminator around the other modes. As a result, the gradients at these modes may vanish, and it becomes impossible for the generator to model the entire data distribution well. Using additional supervisory signals helps the optimization process. For example, self-supervised learning in the form of an auto-encoder has been proposed. AAE [29] guides the generator towards producing realistic samples. However, an issue with using an auto-encoder is that pixel-wise reconstruction with the ℓ2-norm causes blurry artifacts. VAE/GAN [22], which combines VAE [19] and GAN, is an improved solution: while the discriminator of GAN enables the usage of feature-wise reconstruction to overcome the blur, the VAE constrains the generator to mitigate mode collapse. ALI [8] and BiGAN [7] jointly train the data/latent samples in the GAN framework. InfoGAN [5] infers the disentangled latent representation by maximizing the mutual information. The works [42, 43] combine two different types of supervisory signals: real/fake signals and a self-supervised signal in the form of an auto-encoder. In addition, auto-encoder based methods, including [22, 42, 43], can be considered an approach to mitigate catastrophic forgetting, because they regularize the generator so that generated samples resemble the real ones. This is similar to EWC [20] or IS [48], but the regularization is achieved via the output, not the parameters themselves. Although using a feature-wise distance in the auto-encoder can reconstruct sharper images, it is still challenging to produce very realistic details of textures or shapes.

Several different types of supervisory signals have been proposed.
Instead of using only one discriminator or generator, some works propose ensemble models, such as multiple discriminators [32], a mixture of generators [15, 9], or applying an attacker as a new player in GAN training [28]. Recently, training the model with auxiliary self-supervised constraints [4, 44] via multiple pseudo-classes [10] has helped improve the stability of the optimization process. This approach is appealing: it is simple to implement and does not require more parameters in the networks (except a small head for the classifier). Recent work applies the InfoMax principle to improve GAN [24]. The Variational Autoencoder is another important approach to learning generative models [19, 46].

3 GAN with Auxiliary Self-Supervised Tasks

In [4], a self-supervised (SS) value function (also referred to as a "self-supervised task") was proposed for GAN [12] via image rotation prediction [10]. In their work, they showed that the SS task was useful to mitigate the catastrophic forgetting problem of the GAN discriminator. The objectives of the discriminator and generator in [4] are shown in Eqs. 4 and 5. Essentially, the SS task of the discriminator (denoted by Ψ(C)) is to train the classifier C that maximizes the performance of predicting the rotation applied to the real samples.
Given this classifier C, the SS task of the generator (denoted by Φ(G, C)) is to train the generator G to produce fake samples that maximize the classification performance. The discriminator and classifier are the same network (shared parameters) except for the last layer, in order to implement two different heads: the last fully-connected layer which returns a one-dimensional output (real or fake) for the discriminator, and the one which returns a K-dimensional softmax over pseudo-classes for the classifier. λ_d and λ_g are constants.

\max_{D,C} V(D, C, G) = V(D, G) + \lambda_d \underbrace{\mathbb{E}_{x \sim P_d^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log\big(C_k(x)\big)}_{\Psi(C)}    (4)

\min_{G} V(D, C, G) = V(D, G) - \lambda_g \underbrace{\mathbb{E}_{x \sim P_g^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log\big(C_k(x)\big)}_{\Phi(G,C)}    (5)

Here, the GAN value function V(D, G) (also referred to as the "GAN task") can be the original minimax GAN objective [12] or other improved versions. \mathcal{T} is the set of transformations and T_k \in \mathcal{T} is the k-th transformation. The rotation SS task proposed in [10] is applied, and T_1, T_2, T_3, T_4 are the 0, 90, 180, and 270 degree image rotations, respectively. P_d, P_g are the distributions of real and fake data samples, respectively. P_d^T, P_g^T are the mixture distributions of rotated real and fake data samples (by T_k \in \mathcal{T}), respectively. Let C_k(x) be the k-th softmax output of classifier C; we have \sum_{k=1}^{K} C_k(x) = 1, \forall x.

The models are shown in Fig. 1a. In [4], empirical evidence of improvements has been provided. Note that the goal of Φ(G, C) is to encourage the generator to produce realistic images, because the classifier C is trained with real images and captures features that allow detection of rotation.
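The SS terms in (4) and (5) are simply the expected log-probability that C assigns to the true rotation class of a rotated batch. A minimal sketch, with our own illustrative helper names (in the cited works these are cross-entropy heads on the shared discriminator features):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def ss_value(logits, rot_labels):
    """Psi(C) on rotated real batches, Phi(G, C) on rotated fake batches:
    mean log C_k(x) at the true rotation class k (higher is better)."""
    logp = log_softmax(logits)
    return logp[np.arange(len(rot_labels)), rot_labels].mean()

labels = np.array([0, 1, 2, 3])
confident = np.eye(4) * 10.0        # classifier very sure of each rotation
print(ss_value(confident, labels))  # close to 0, the maximum
print(ss_value(np.zeros((4, 4)), labels))  # log(1/4) for a uniform classifier
```

Maximizing this quantity over C gives Ψ(C); maximizing it over G (with C fixed) gives Φ(G, C).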
However, the interaction of Φ(G, C) with the GAN task V(D, G) has not been adequately analyzed.

4 Analysis on Auxiliary Self-supervised Tasks

We analyze the SS tasks in [4] (Figure 1a). We assume that all networks D, G, C have enough capacity [12]. Refer to Appendix A for the full derivation. Let D* and C* be the optimal discriminator and optimal classifier, respectively, at an equilibrium point. We assume that we have an optimal D* of the GAN task and focus on C* of the SS task. Let p^{T_k}(x) be the probability of sample x under transformation T_k (Figure 2); p_d^{T_k}(x) and p_g^{T_k}(x) denote p^{T_k}(x) for a data sample (x ∼ P_d^{T_k}) and a generated sample (x ∼ P_g^{T_k}), respectively.

Proposition 1 The optimal classifier C* of Eq. 4 is:

C^*_k(x) = \frac{p_d^{T_k}(x)}{\sum_{k=1}^{K} p_d^{T_k}(x)}    (6)

Proof. Refer to our proof in Appendix A for the optimal C*.

Theorem 1 Given the optimal classifier C* for the SS task Ψ(C), at the equilibrium point, maximizing the SS task Φ(G, C*) of Eq. 5 is equal to maximizing:

\Phi(G, C^*) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{x \sim P_g^{T_k}} \left[ \log\left( \frac{p_d^{T_k}(x)}{\sum_{k=1}^{K} p_d^{T_k}(x)} \right) \right] = \frac{1}{K} \sum_{k=1}^{K} V_\Phi^{T_k}    (7)

Proof. Refer to our proof in Appendix A.

Figure 2: The probability distribution p_d^{T_k}(x). Here, samples from P_d are rotated by T_k. The distribution of the rotated samples is p^{T_k}(x). Some rotated samples resemble the original samples, e.g. those on the right of x_2. On the other hand, for some images there is no rotated image resembling them, e.g. x_1 (p_d^{T_j}(x_1) = 0, j ≠ 1). The generator can learn to generate such images, e.g. x_1, to achieve the maximum of Φ(G, C*) without actually learning the entire P_d.

Theorem 1 depicts the learning of generator G given the optimal C*: selecting G (hence P_g) to maximize Φ(G, C*). As C* is trained on real data, Φ(G, C*) encourages G to learn to generate realistic samples. However, we argue that G can maximize Φ(G, C*) without actually learning the data distribution P_d. In particular, it is sufficient for G to maximize Φ(G, C*) by simply learning to produce images whose rotated versions are rare (near zero probability). Some example images are shown in Figure 3a. Intuitively, for these images, the rotation can be easily recognized.

The argument can be developed from Theorem 1. From (7), it can be shown that V_\Phi^{T_k} ≤ 0, since 0 ≤ p_d^{T_k}(x) / \sum_{k=1}^{K} p_d^{T_k}(x) ≤ 1. One way for G to achieve the maximum is to generate x such that p_d^{T_1}(x) ≠ 0 and p_d^{T_j}(x) = 0, j ≠ 1. For these x, the maximum V_\Phi^{T_k} = 0 is attained. Note that T_1 corresponds to the 0-degree rotation, i.e., no rotation. Recall that p_d^{T_k}(x) is the probability distribution of the data transformed by T_k. Therefore, the condition p_d^{T_1}(x) ≠ 0 and p_d^{T_j}(x) = 0, j ≠ 1 means that there is no other rotated image resembling x, or equivalently, rotated x does not resemble any other image (Figure 2). Therefore, the generator can exploit this "loophole" to maximize Φ(G, C*) without actually learning the data distribution. In particular, even a mode-collapsed generator can achieve the maximum of Φ(G, C*) by generating such images.

Empirical evidence. Empirically, our experiments (in Appendix B.2.1) show that the FID of the models when using Φ(G, C) is poor except for very small λ_g. We further illustrate this issue with a toy empirical example using CIFAR-10. We augment the training images x with transformed data T_k(x) to train the classifier C to predict the rotation applied to x. This is the SS task of the discriminator in Figure 1a.
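Proposition 1 and the loophole of Theorem 1 can also be checked numerically on a toy discrete sample space (all probabilities below are made up for illustration):

```python
import numpy as np

# Columns: five "images" x_0..x_4; rows: transformations T_1..T_4.
# x_0 is an image whose rotated versions never occur in the data:
# p_d^{T_j}(x_0) = 0 for j != 1.
p_d_T = np.array([
    [0.30, 0.10, 0.05, 0.03, 0.02],  # T_1 (0 degrees, no rotation)
    [0.00, 0.10, 0.05, 0.03, 0.02],  # T_2 (90 degrees)
    [0.00, 0.10, 0.05, 0.03, 0.02],  # T_3 (180 degrees)
    [0.00, 0.10, 0.05, 0.03, 0.02],  # T_4 (270 degrees)
])

# Proposition 1: C*_k(x) = p_d^{T_k}(x) / sum_k p_d^{T_k}(x).
c_star = p_d_T / p_d_T.sum(axis=0, keepdims=True)

# Theorem 1: the per-sample term log C*_k(x) is at most 0. For x_0 it is
# exactly 0, so a generator emitting only x_0 (severe mode collapse)
# already attains the maximum of Phi(G, C*).
print(c_star[0, 0], np.log(c_star[0, 0]))  # 1.0 0.0
print(c_star[0, 1], np.log(c_star[0, 1]))  # 0.25 and a negative log
```

The collapsed generator "wins" the SS task without ever touching the other four images, which is exactly the failure mode the analysis identifies.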
Given this classifier C, we simulate the SS task of generator learning as follows. To simulate the output of a good generator G_good which generates diverse realistic samples, we take the full test set of CIFAR-10 (10 classes) and compute the cross-entropy loss, i.e. −Φ(G, C), when the images are fed into C. To simulate the output of a mode-collapsed generator G_collapsed, we select samples from one class, e.g. "horse", and compute the cross-entropy loss when they are fed into C. Fig. 3b shows that some G_collapsed can outperform G_good and achieve a smaller −Φ(G, C). E.g. a G_collapsed that produces only "horse" samples outperforms G_good under Φ(G, C). This example illustrates that, while Φ(G, C) may help the generator to create more realistic samples, it does not help the generator to prevent mode collapse. In fact, as part of the multi-task loss (see (5)), Φ(G, C) would undermine the learning of synthesizing diverse samples in the GAN task V(D, G).

5 Proposed method

5.1 Auxiliary Self-Supervised Tasks with Multi-class Minimax Game

In this section, we propose improved SS tasks to address the issue (Fig. 1b). Based on a multi-class minimax game, our classifier learns to distinguish the rotated samples from real data versus those from generated data. Our proposed SS tasks are Ψ+(G, C) and Φ+(G, C) in (8) and (9), respectively.

Figure 3: (a) Left: Example images that achieve minimal loss (or maximal Φ(G, C)). For these images, rotation can be easily recognized: an image with a 90-degree rotated horse is likely due to applying T_2 rather than being an original one. (b) Right (Top): the loss of the original SS task, i.e. −Φ(G, C), computed over a good generator (red) and collapsed generators (green, yellow). Some collapsed generators (e.g. one that generates only "horse") have a smaller loss than the good generator under −Φ(G, C). (c) Right (Bottom): the loss of the proposed MS task, −Φ+(G, C), of a good generator (red) and collapsed generators (green). The good generator has the smallest loss under −Φ+(G, C).

Our discriminator objective is:

\max_{D,C} V(D, C, G) = V(D, G) + \lambda_d \underbrace{\left( \mathbb{E}_{x \sim P_d^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log\big(C_k(x)\big) + \mathbb{E}_{x \sim P_g^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log\big(C_{K+1}(x)\big) \right)}_{\Psi^+(G,C)}    (8)

Eq. 8 means that we simultaneously distinguish generated samples, as the (K+1)-th class, from the rotated real sample classes. Here, C_{K+1}(x) is the (K+1)-th output of classifier C, for the fake class. While the rotated real samples are fixed samples that help prevent the classifier (discriminator) from forgetting, the class K+1 serves as the connecting point between generator and classifier, and the generator can directly challenge the classifier. Our technique resembles the original GAN by Goodfellow et al. [12], but we generalize it to a multi-class minimax game. Our generator objective is:

\min_{G} V(D, C, G) = V(D, G) - \lambda_g \underbrace{\left( \mathbb{E}_{x \sim P_g^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log\big(C_k(x)\big) - \mathbb{E}_{x \sim P_g^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log\big(C_{K+1}(x)\big) \right)}_{\Phi^+(G,C)}    (9)

Ψ+(G, C) and Φ+(G, C) form a multi-class minimax game. Note that, when we mention the multi-class minimax game (or multi-class adversarial training), we refer to the SS tasks. The game for the GAN task is the original one by Goodfellow et al. [12].

5.1.1 Theoretical Analysis

Proposition 2 For a fixed generator G, the optimal solution C* under Eq. 8 is:

C^*_k(x) = \frac{p_d^T(x)}{p_g^T(x)} \cdot \frac{p_d^{T_k}(x)}{\sum_{k=1}^{K} p_d^{T_k}(x)} \cdot C^*_{K+1}(x)    (10)

where p_d^T(x) and p_g^T(x) are the probabilities of sample x under the mixture distributions P_d^T and P_g^T, respectively.

Proof. Refer to our proof in Appendix A for the optimal C*.

Theorem 2 Given the optimal classifier C* obtained from multi-class minimax training Ψ+(G, C), at the equilibrium point, maximizing Φ+(G, C*) is equal to maximizing Eq. 11:

\Phi^+(G, C^*) = -\frac{1}{K} \sum_{k=1}^{K} \mathrm{KL}\big(P_g^{T_k} \,||\, P_d^{T_k}\big) + \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{x \sim P_g^{T_k}} \left[ \log\left( \frac{p_d^{T_k}(x)}{\sum_{k=1}^{K} p_d^{T_k}(x)} \right) \right]    (11)

Proof. Refer to our proof in Appendix A.

Note that the proposed SS task objective (11) differs from the original SS task objective (7) by the KL divergence term. Furthermore, note that KL(P_g^{T_k} || P_d^{T_k}) = KL(P_g || P_d), as rotation T_k is an affine transform and the KL divergence is invariant under affine transforms (our proof in Appendix A). Therefore, the improvement is clear: the proposed SS tasks (Ψ+(.), Φ+(.)) work together to improve the matching of P_g and P_d by leveraging the rotated samples. For a given P_g, feedback is computed not only from KL(P_g || P_d) but also from KL(P_g^{T_k} || P_d^{T_k}) via the rotated samples. Therefore, G has more feedback to improve P_g. We investigate the improvement of our method on the toy dataset as in Section 4. The setup is the same, except that we now replace the models/cost functions of −Φ(G, C) with our proposed −Φ+(G, C) (the designs of G_good and G_collapsed are the same). The loss is shown in Fig. 3c. Comparing Fig. 3c and Fig.
3b, the improvement using our proposed model can be observed: G_good has the lowest loss under our proposed model. Note that, since optimizing the KL divergence is not easy because it is asymmetric and could be biased towards one direction [32], in our implementation we use a slightly modified version, as described in the Appendix.

6 Experiments

We measure the diversity and quality of generated samples via the Fréchet Inception Distance (FID) [14]. FID is computed with 10K real samples and 5K generated samples, exactly as in [31], unless stated otherwise. We report the best FID attained in 300K iterations as in [45, 25, 42, 47]. We integrate our proposed techniques into two baseline models (SSGAN [4] and Dist-GAN [42]). We conduct experiments mainly on CIFAR-10 and STL-10 (resized to 48 × 48 as in [31]). We also provide additional experiments on CIFAR-100, ImageNet 32 × 32 and Stacked-MNIST.

For Dist-GAN [42], we evaluate three versions implemented with different network architectures: the DCGAN architecture [37], the CNN architectures of SN-GAN [31] (referred to as the SN-GAN architecture), and the ResNet architecture [13]. We recall these network architectures in Appendix C. We use the ResNet architecture [13] for the experiments on CIFAR-100 and ImageNet 32 × 32, and the tiny K/4, K/2 architectures [30] for Stacked MNIST. We keep all parameters suggested in the original works and focus on understanding the contribution of our proposed techniques. For SSGAN [4], we use the ResNet architecture as implemented in the official code1.

In our experiments, we use SS to denote the original self-supervised tasks proposed in [4], and MS to denote our proposed self-supervised tasks ("Multi-class minimax game based Self-supervised tasks"). Details of the experimental setup and network parameters are discussed in Appendix B. We have conducted extensive experiments. Setup and results are discussed in Appendix B.
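For reference, the Fréchet distance underlying FID between two Gaussians (m_1, C_1) and (m_2, C_2) is ||m_1 − m_2||^2 + Tr(C_1 + C_2 − 2(C_1 C_2)^{1/2}). A sketch for the diagonal-covariance case, where the matrix square root reduces to an elementwise one (in practice the statistics come from Inception features and the full covariance is used):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).
    The general case needs a matrix square root (e.g. scipy.linalg.sqrtm)."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

mu_r, var_r = np.zeros(4), np.ones(4)          # "real" feature statistics
mu_f, var_f = np.full(4, 0.5), np.ones(4)      # "fake" feature statistics
print(fid_diagonal(mu_r, var_r, mu_f, var_f))  # 1.0
print(fid_diagonal(mu_r, var_r, mu_r, var_r))  # 0.0 (identical statistics)
```

Lower is better: identical feature statistics give a distance of zero.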
In this section, we highlight the main results:

• Comparison between SS and our proposed MS using the same baseline.
• Comparison between our proposed baseline + MS and other state-of-the-art unconditional and conditional GAN. We emphasize that our proposed baseline + MS is unconditional and does not use any label.

6.1 Comparison between SS and our proposed MS using the same baseline

Results are shown in Fig. 4 using Dist-GAN [42] as the baseline. For each experiment and for each approach (SS or MS), we obtain the best λ_g and λ_d using extensive search (see Appendix B.4 for details), and we use the best λ_g and λ_d in the comparison depicted in Fig. 4. In our experiments, we observe that Dist-GAN has stable convergence; therefore, we use it in these experiments. As shown in Fig. 4, our proposed MS outperforms the original SS consistently. More details can be found in Appendix B.4.

6.2 Comparison between our proposed method and other state-of-the-art GAN

Main results are shown in Table 1. Details of this comparison can be found in Appendix B.4. The best λ_g and λ_d as in Figure 4 are used in this comparison. The best FIDs attained in 300K iterations are reported, as in [45, 25, 42, 47]. Note that the SN-GAN method [31] attains its best FID at about 100K iterations with ResNet and diverges afterward. A similar observation is also discussed in [4].

1 https://github.com/google/compare_gan

Figure 4: Compare SS (original SS tasks proposed in [4]) and MS (our proposed Multi-class minimax game based Self-supervised tasks). The baseline is Dist-GAN [42], implemented with SN-GAN networks (CNN architectures in [31]) and ResNet. Two datasets are used, CIFAR-10 and STL-10. For each experiment, we use the best λ_d, λ_g for the models, obtained through extensive search (Appendix B.4). Note that λ_g = 0 is the best for "Baseline + SS" in all experiments. The results suggest consistent improvement using our proposed self-supervised tasks.

Table 1: Comparison with other state-of-the-art GAN on the CIFAR-10 and STL-10 datasets. We report the best FID of the methods. Two network architectures are used: SN-GAN networks (CNN architectures in [31]) and ResNet. The FID scores are extracted from the respective papers when available. SS denotes the original SS tasks proposed in [4]. MS denotes our proposed self-supervised tasks. '*': FID is computed with 10K-10K samples as in [4]. All compared GAN are unconditional, except SAGAN and BigGAN. SSGAN+ is SS-GAN in [4] but using the best parameters we have obtained. In SSGAN+ + MS, we replace the original SS in the authors' code with our proposed MS.

Methods              | SN-GAN: CIFAR-10 | SN-GAN: STL-10 | ResNet: CIFAR-10 | ResNet: STL-10 | CIFAR-10*
GAN-GP [31]          | 37.7             | -              | -                | -              | -
WGAN-GP [31]         | 40.2             | 55.1           | -                | -              | -
SN-GAN [31]          | 25.5             | 43.2           | 21.70 ± .21      | 40.10 ± .50    | 19.73
SS-GAN [4]           | -                | -              | -                | -              | 15.65
Dist-GAN [42]        | 22.95            | 36.19          | 17.61 ± .30      | 28.50 ± .49    | 13.01
GN-GAN [43]          | 21.70            | 30.80          | 16.47 ± .28      | -              | -
SAGAN [49] (cond.)   | -                | -              | 13.4 (best)      | -              | -
BigGAN [2] (cond.)   | -                | -              | 14.73            | -              | -
SSGAN+               | -                | -              | -                | -              | 20.47
Ours(SSGAN+ + MS)    | -                | -              | -                | -              | 19.89
Dist-GAN + SS        | 21.40            | 29.79          | 14.97 ± .29      | 27.98 ± .38    | 12.37
Ours(Dist-GAN + MS)  | 18.88            | 27.95          | 13.90 ± .22      | 27.10 ± .34    | 11.40

As shown in Table 1, our method (Dist-GAN + MS) consistently outperforms the baseline Dist-GAN and other state-of-the-art GAN. These results confirm the effectiveness of our proposed self-supervised tasks based on the multi-class minimax game.

We have also extracted the FID reported in [4], i.e. SSGAN with the original SS tasks proposed there. In this case, we follow exactly their settings and compute FID using 10K real samples and 10K fake samples. Our model achieves a better FID score than SSGAN with exactly the same ResNet architecture on the CIFAR-10 dataset.
See results under the column CIFAR-10\u2217 in Table 1.\nNote that we have tried to reproduce the results of SSGAN using its published code, but we were\nunable to achieve similar results as reported in the original paper [4]. We have performed extensive\nsearch and we use the obtained best parameter to report the results as SSGAN+ in Table 1 (i.e.,\nSSGAN+ uses the published code and the best parameters we obtained). We use this code and\nsetup to compare SS and MS, i.e. we replace the SS code in the system with MS code, and obtain\n\u201cSSGAN+ + MS\u201d. As shown in Table 1, our \u201cSSGAN+ + MS\u201d achieves better FID than SSGAN+.\nThe improvement is consistent with Figure 4 when Dist-GAN is used as the baseline. More detailed\nexperiments can be found in the Appendix. We have also compared SSGAN+ and our system\n(SSGAN+ + MS) on CelebA (64 \u00d7 64). In this experiment, we use a small DCGAN architecture\nprovided in the authors\u2019 code. Our proposed MS outperforms the original SS, with FID improved\nfrom 35.03 to 33.47. This experiment again con\ufb01rms the effectiveness of our proposed MS.\n\n8\n\n0.511.522.53Iterations10520253035FIDSN-GAN for CIFAR-10BaselineBaseline + SS (d = 1.0)Baseline + MS (d = 1.0, g = 0.01)0.511.522.53Iterations10515202530FIDResNet for CIFAR-10BaselineBaseline + SS (d = 0.5)Baseline + MS (d = 0.5, g = 0.10)0.511.522.53Iterations10525303540455055FIDSN-GAN for STL-10BaselineBaseline + SS (d = 1.0)Baseline + MS (d = 0.5, g = 0.05)0.511.522.53Iterations10525303540455055FIDResNet for STL-10BaselineBaseline + SS (d = 1.0)Baseline + MS (d = 1.0, d = 0.01)\fTable 2: Results on CIFAR-100 and ImageNet 32\u00d732. We use baseline model Dist-GAN with\nResNet architecture. We follow the same experiment setup as above. 
SS: proposed in [4]; MS: this\nwork.\n\nDatasets\n\nCIFAR-100 (10K-5K FID)\n\nImageNet 32\u00d732 (10K-10K FID)\n\nSS\n21.02\n17.1\n\nMS\n19.74\n12.3\n\nTable 3: Comparing to state-of-the-art methods on Stacked MNIST with tiny K/4 and K/2 archi-\ntectures [30]. We also follow the same experiment setup of [30]. Baseline model: Dist-GAN. SS:\nproposed in [4]; MS: this work. Our method MS achieves the best results for this dataset with both\narchitectures, outperforming state-of-the-art [42, 17] by a signi\ufb01cant margin.\n\nArch\nK/4, #\nK/4, KL\nK/2, #\nK/2, KL\n\nUnrolled GAN [30] WGAN-GP [13]\n640.1 \u00b1 136.3\n1.97 \u00b1 0.70\n772.4 \u00b1 146.5\n1.35 \u00b1 0.55\n\n372.2 \u00b1 20.7\n4.66 \u00b1 0.46\n817.4 \u00b1 39.9\n1.43 \u00b1 0.12\n\nDist-GAN [42]\n859.5 \u00b1 68.7\n1.04 \u00b1 0.29\n917.9 \u00b1 69.6\n1.06 \u00b1 0.23\n\nPro-GAN [17]\n859.5 \u00b1 36.2\n1.05 \u00b1 0.09\n919.8 \u00b1 35.1\n0.82 \u00b1 0.13\n\n[42]+SS\n906.75 \u00b1 26.15\n0.90 \u00b1 0.13\n957.50 \u00b1 31.23\n0.61 \u00b1 0.15\n\nOurs([42]+MS)\n926.75 \u00b1 32.65\n0.78 \u00b1 0.13\n976.00 \u00b1 10.04\n0.52 \u00b1 0.07\n\nWe conduct additional experiments on CIFAR-100 and ImageNet 32\u00d732 to compare SS and MS\nwith Dist-GAN baseline. We use the same ResNet architecture as Section B.4 on CIFAR-10 for this\nstudy, and we use the best parameters \u03bbd and \u03bbg selected in Section B.4 for ResNet architecture.\nExperimental results in Table 2 show that our MS consistently outperform SS for all benchmark\ndatasets. For ImageNet 32\u00d732 we report the best FID for SS because the model suffers serious mode\ncollapse at the end of training. Our MS achieves the best performance at the end of training.\nWe also evaluate the diversity of our generator on Stacked MNIST [30]. Each image of this dataset is\nsynthesized by stacking any three random MNIST digits. We follow exactly the same experiment\nsetup with tiny architectures K/4, K/2 and evaluation protocol of [30]. 
We measure the quality of methods by the number of covered modes (higher is better) and the KL divergence (lower is better). Refer to [30] for more details. Table 3 shows that our proposed MS outperforms SS in both mode number and KL divergence. Our approach significantly outperforms the state of the art [42, 17]. The means and standard deviations of MS and SS are computed from eight runs (we re-train our GAN model from scratch for each run). The results are reported with the best (λd, λg) of MS: (0.5, 0.2) for the K/4 architecture and (1.0, 1.0) for the K/2 architecture. Similarly, the best (λd, λg) of SS are (0.5, 0.0) for the K/4 architecture and (1.0, 0.0) for the K/2 architecture.
Finally, in Table 1, we compare our FID to SAGAN [49] (a state-of-the-art conditional GAN) and BigGAN [2]. We perform the experiments under the same conditions using the ResNet architecture on the CIFAR-10 dataset. We report the best FID that SAGAN can achieve. As the SAGAN paper [49] does not report CIFAR-10 results, we run the published SAGAN code and select the best parameters to obtain results for CIFAR-10. For BigGAN, we extract the best FID from the original paper. Although our method is unconditional, our best FID is very close to that of these state-of-the-art conditional GANs. This validates the effectiveness of our design. Generated images from our system can be found in Figures 5 and 6 of Appendix B.

7 Conclusion

We provide theoretical and empirical analysis of the auxiliary self-supervised tasks for GANs. Our analysis reveals the limitation of the existing work. To address the limitation, we propose multi-class minimax game based self-supervised tasks. Our proposed self-supervised tasks leverage the rotated samples to provide better feedback in matching the data and generator distributions. Our theoretical and empirical analysis supports the improved convergence of our design. Our proposed SS tasks can be easily incorporated into existing GAN models.
Experiment results suggest that they help boost the performance of baselines implemented with various network architectures on the CIFAR-10, CIFAR-100, STL-10, CelebA, Imagenet 32 × 32, and Stacked-MNIST datasets. The best version of our proposed method establishes state-of-the-art FID scores on all these benchmark datasets.

Acknowledgements

This work was supported by ST Electronics and the National Research Foundation (NRF), Prime Minister's Office, Singapore under the Corporate Laboratory @ University Scheme (Programme Title: STEE Infosec - SUTD Corporate Laboratory). This research was also supported by the National Research Foundation Singapore under its AI Singapore Programme [Award Number: AISG-100E-2018-005]. This research was also supported in part by the Energy Market Authority (EP award no. NRF2017EWT-EP003-061). This project was also supported by SUTD project PIE-SGP-AI-2018-01.

References

[1] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[3] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. CoRR, 2016.

[4] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised gans via auxiliary rotation loss. In CVPR, 2019.

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[6] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction.
In CVPR, 2015.

[7] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[9] Arnab Ghosh, Viveka Kulharia, Vinay P Namboodiri, Philip HS Torr, and Puneet K Dokania. Multi-agent diverse generative adversarial networks. In CVPR, 2018.

[10] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. ICLR, 2018.

[11] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[13] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[15] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. Mgan: Training generative adversarial nets with multiple generators. 2018.

[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.

[17] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation.
arXiv preprint arXiv:1710.10196, 2017.

[18] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.

[19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[20] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.

[21] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans. arXiv preprint arXiv:1705.07215, 2017.

[22] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.

[23] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.

[24] Kwot Sin Lee, Ngoc-Trung Tran, and Ngai-Man Cheung. Infomax-gan: Mutual information maximization for improved adversarial image generation. In NeurIPS 2019 Workshop on Information Theory and Machine Learning, 2019.

[25] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In NIPS, 2017.

[26] Swee Kiat Lim, Yi Loo, Ngoc-Trung Tran, Ngai-Man Cheung, Gemma Roig, and Yuval Elovici. Doping: Generative data augmentation for unsupervised anomaly detection. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2018.

[27] Kanglin Liu.
Varying k-lipschitz constraint for generative adversarial networks. arXiv preprint arXiv:1803.06107, 2018.

[28] Xuanqing Liu and Cho-Jui Hsieh. Rob-gan: Generator, discriminator and adversarial attacker. In CVPR, 2019.

[29] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.

[30] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. ICLR, 2017.

[31] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ICLR, 2018.

[32] Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. Dual discriminator generative adversarial nets. In NIPS, 2017.

[33] Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In ICCV, 2017.

[34] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.

[35] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.

[36] Henning Petzka, Asja Fischer, and Denis Lukovnicov. On the regularization of wasserstein gans. arXiv preprint arXiv:1709.08894, 2017.

[37] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[38] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

[39] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization.
In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.

[40] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, pages 2234–2242, 2016.

[41] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. CoRR, abs/1703.05921, 2017.

[42] Ngoc-Trung Tran, Tuan-Anh Bui, and Ngai-Man Cheung. Dist-gan: An improved gan using distance constraints. In ECCV, 2018.

[43] Ngoc-Trung Tran, Tuan-Anh Bui, and Ngai-Man Cheung. Improving gan with neighbors embedding and gradient matching. In AAAI, 2019.

[44] Ngoc-Trung Tran, Viet-Hung Tran, Ngoc-Bao Nguyen, and Ngai-Man Cheung. An improved self-supervised gan via adversarial training. arXiv preprint arXiv:1905.05469, 2019.

[45] Sitao Xiang and Hao Li. On the effects of batch and weight normalization in generative adversarial networks. arXiv preprint arXiv:1704.03971, 2017.

[46] Linxiao Yang, Ngai-Man Cheung, Jiaying Li, and Jun Fang. Deep clustering by gaussian mixture variational autoencoders with graph embedding. In The IEEE International Conference on Computer Vision (ICCV), October 2019.

[47] Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in gan training. arXiv preprint arXiv:1806.04498, 2018.

[48] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. arXiv preprint arXiv:1703.04200, 2017.

[49] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

[50] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas.
Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In CVPR, 2017.

[51] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.

[52] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.

[53] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.