{"title": "MarginGAN: Adversarial Training in Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10440, "page_last": 10449, "abstract": "A Margin Generative Adversarial Network (MarginGAN) is proposed for semi-supervised learning problems. Like Triple-GAN, the proposed MarginGAN consists of three components---a generator, a discriminator and a classifier, among which two forms of adversarial training arise. The discriminator is trained as usual to distinguish real examples from fake examples produced by the generator. The new feature is that the classifier attempts to increase the margin of real examples and to decrease the margin of fake examples. On the contrary, the purpose of the generator is yielding realistic and large-margin examples in order to fool the discriminator and the classifier simultaneously. Pseudo labels are used for generated and unlabeled examples in training. Our method is motivated by the success of large-margin classifiers and the recent viewpoint that good semi-supervised learning requires a ``bad'' GAN. Experiments on benchmark datasets testify that MarginGAN is orthogonal to several state-of-the-art methods, offering improved error rates and shorter training time as well.", "full_text": "MarginGAN: Adversarial Training in\n\nSemi-Supervised Learning\n\nJinhao Dong\n\nSchool of Computer Science and Technology,\n\nXidian University\n\nXi\u2019an 710126, China\n\njhdong@stu.xidian.edu.cn\n\nTong Lin\u2217\n\nKey Laboratory of Machine Perception, MOE\nSchool of EECS, Peking University, Beijing,\n\n& Peng Cheng Laboratory, Shenzhen\n\nlintong@pku.edu.cn\n\nAbstract\n\nA Margin Generative Adversarial Network (MarginGAN) is proposed for semi-\nsupervised learning problems. Like Triple-GAN, the proposed MarginGAN con-\nsists of three components\u2014a generator, a discriminator and a classi\ufb01er, among\nwhich two forms of adversarial training arise. 
The discriminator is trained as usual to distinguish real examples from fake examples produced by the generator. The new feature is that the classifier attempts to increase the margin of real examples and to decrease the margin of fake examples. Conversely, the purpose of the generator is to yield realistic, large-margin examples in order to fool the discriminator and the classifier simultaneously. Pseudo labels are used for generated and unlabeled examples in training. Our method is motivated by the success of large-margin classifiers and the recent viewpoint that good semi-supervised learning requires a “bad” GAN. Experiments on benchmark datasets show that MarginGAN is orthogonal to several state-of-the-art methods, offering improved error rates and shorter training time as well.\n\n1 Introduction\n\nIn the real world, unlabeled data can usually be obtained relatively easily, while manually labeled data is costly. Therefore, semi-supervised learning (SSL), which learns from large amounts of unlabeled data together with limited labeled data, meets a variety of practical needs.\nPseudo labels are artificial labels that let unlabeled data play the same role as manually annotated data; they are a simple and effective device in semi-supervised learning. Several traditional SSL methods, such as self-training [1–3] and co-training [4], are based on pseudo labels. In the past few years, deep neural networks have made great advances in SSL, and the idea of pseudo labels has been incorporated into deep learning to leverage unlabeled data. In [5] the class with the maximum predicted probability is picked as the pseudo label. Temporal Ensembling, proposed in [6], uses an ensemble prediction as the pseudo label: an exponential moving average of label predictions over different epochs, under different regularization and input augmentation conditions. 
In contrast to [6], where label predictions are averaged, the Mean Teacher approach [7] averages model weights instead. The role pseudo labels play in [5] and in [6, 7] is not exactly the same. Pseudo labels in [5] have the same effect as ground-truth labels in minimizing the cross-entropy loss, whereas pseudo labels in [6, 7] serve as targets for the prediction to achieve consistency regularization, which makes the classifier give consistent outputs for similar data points.\nRecently, generative adversarial networks (GANs) have been applied to SSL with impressive results. The Feature Matching (FM) GAN proposed in [8] substitutes a (K + 1)-class classifier for the original binary discriminator. The aim of the classifier (i.e. the discriminator) is to classify labeled samples into the correct class, unlabeled samples into any of the first K classes, and generated samples into the (K + 1)-th class. As an improvement over feature matching GANs, the method proposed in [9] verifies that good semi-supervised learning requires a “bad” generator. The proposed complement generator yields artificial data points in low-density areas, thus encouraging the classifier to place the class boundaries in these areas and improving the generalization performance.\n\n∗T. Lin is the corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: The architecture of MarginGAN.\n\nAlthough the idea of using pseudo labels is simple and effective in deep learning, incorrect pseudo labels can impair the generalization performance and slow down the training of deep networks. Prior works such as [6, 7] make efforts to improve the quality of pseudo labels. 
Inspired by [9], we propose a method that encourages the generator to yield “bad” examples in SSL, so as to increase tolerance to incorrect pseudo labels and further reduce the error rates.\nTo address the issue caused by incorrect pseudo labels, we present MarginGAN, a GAN model for semi-supervised learning based on the margin theory of classifiers. MarginGAN consists of three components—a generator, a discriminator and a classifier (see Fig. 1 for the architecture). The role of the discriminator is the same as in a standard GAN: distinguishing whether an example comes from the real distribution or was produced by the generator. The multi-class classifier is trained to increase the classification margin of real examples (both labeled and unlabeled samples) and, at the same time, to decrease the margin of generated fake examples. The goal of the generator is to yield bogus examples that look realistic and have large margins, aiming to deceive both the discriminator and the classifier simultaneously.\nThe paper is organized as follows. Section 2 briefly reviews related work, and our proposed MarginGAN is described in Section 3. Experimental results are presented in Section 4. Section 5 concludes this paper.\n\n2 Related Work\n\nThe research on semi-supervised learning dates back to the 1970s, and many classical SSL algorithms have emerged since then. Self-training [1–3] is probably the first SSL algorithm: in each iteration it selects the unlabeled examples with the surest predictions and puts them into the labeled sample set. Co-training [4] trains two classifiers on two different views of the labeled examples, and each classifier puts the unlabeled examples with the highest prediction confidence into the labeled dataset of the other classifier. The above two methods can be regarded as methods using pseudo labels. 
Graph-based semi-supervised learning [10, 11] constructs a neighborhood graph according to the geometric structure of the samples, and propagates labels from labeled samples to unlabeled samples using the adjacency relation on the graph, based on the manifold hypothesis.\nRecently, deep neural networks have made great progress in SSL. As already discussed, the idea of pseudo labels is also used in [5–7]. To reduce the instability brought by pseudo labels, a coefficient α(t) is used in [5] to balance labeled and unlabeled examples; α(t) is increased slowly so that low weights reduce the negative effect of unreliable pseudo labels. In [6, 7] the quality of pseudo labels is improved by keeping an exponential moving average of predictions or of model weights, respectively. Our work focuses on another perspective: the classifier learns from auxiliary generated examples. Besides [9], virtual adversarial training proposed in [12, 13] is similar to our approach in spirit.\nRecent methods leveraging GANs achieve impressive results in SSL. It is worth noting that CatGAN, proposed in [14], shares a similar flavor with our margin-based method. The classifier of CatGAN minimizes the conditional entropy of real samples while maximizing the conditional entropy of generated fake examples. Triple-GAN [15] also adopts a three-player architecture, where the generator and the classifier characterize the conditional distributions between examples and labels, and the discriminator solely focuses on identifying fake example-label pairs. 
Triangle-GAN [16] develops a more complex architecture consisting of two generators and two discriminators. In [17] Structured Generative Adversarial Networks (SGANs) are proposed for semi-supervised conditional generative modeling, which can better manipulate the semantics of generated examples. Besides the methods mentioned above, there have been other efforts in semi-supervised learning using deep generative models. In [18] the Ladder network was extended to SSL. In [19] an unsupervised regularization term was proposed to explicitly enforce that the predictions of the multi-class classifier be mutually exclusive.\n\n3 The Proposed MarginGAN\n\n3.1 Motivation and Intuition\n\nIn a usual GAN model, the goal is to train a generator that produces fake examples realistic enough that a discriminator cannot tell them from real ones. In SSL problems, however, our purpose is to train a high-accuracy classifier that achieves large margins on the training examples. We hope that the generator can yield “informative” examples near the true decision boundary, just like support vectors in SVM models. Here another kind of adversarial training arises: the generator attempts to produce large-margin fake examples, while the classifier aims at achieving small-margin predictions over these fake examples.\nWrong pseudo labels of unlabeled examples (and fake examples) greatly deteriorate the accuracy of prior methods based on pseudo labels, but our MarginGAN exhibits a better tolerance to wrong pseudo labels. Since the discriminator plays the same role as in a usual GAN, we argue that the improved accuracy obtained by MarginGAN comes from the adversarial interaction between the generator and the classifier.\nFirst, the extreme training case in our ablation study (Sec. 4.2) shows that fake examples generated by MarginGAN can aggressively remedy the influence of wrong pseudo labels. 
Because the classifier enforces small margins on fake examples, the generator must yield fake examples near the “correct” decision boundaries. This refines and shrinks the decision boundary around the real examples.\nSecond, we illustrate the large-margin intuition on a four-class problem in Fig. 2. If the classifier chooses to believe the wrong pseudo labels, the decision boundaries have to cross the true gap between the two classes of examples. But wrong pseudo labels lead to reduced margins, which hurts the generalization accuracy. Therefore, large-margin classifiers should ignore those wrong pseudo labels to achieve higher accuracy.\n\n3.2 Margin\n\nDefinition of margins. In machine learning, the margin of a single data point is defined to be the distance from that data point to a decision boundary, and it can be used to bound the generalization error of the classifier. Both support vector machines (SVM) and boosting can be explained with margin-based generalization bounds.\nIn the AdaBoost algorithm, h_t(x) ∈ {1, −1} is a base classifier acquired in iteration t and α_t ≥ 0 is the weight assigned to h_t. The combined classifier f is a weighted majority vote of T\n\nFigure 2: An illustration showing that wrong pseudo labels may force decision boundaries across the true gap between two classes. 
A large-margin classifier would disregard those wrong pseudo labels to get better generalization.\n\nbase classifiers, which is formulated as\n\nf(x) = (Σ_{t=1}^T α_t h_t(x)) / (Σ_{s=1}^T α_s).\n\nIn [20] the margin of an instance-label pair (x, y) is defined as\n\nyf(x) = y (Σ_{t=1}^T α_t h_t(x)) / (Σ_{s=1}^T α_s).    (1)\n\nThe sign of the margin reflects whether or not the prediction of the combined classifier is correct, while the magnitude indicates the prediction confidence. Interestingly, Eq. 1 admits a unified form for both boosting and SVM:\n\ny⟨α, h(x)⟩ / (‖α‖ ‖h(x)‖),\n\nwhere h(x) := [h_1(x), h_2(x), ..., h_T(x)] and α := [α_1, α_2, ..., α_T]. For boosting, the ℓ∞ norm is used for h and the ℓ1 norm for α; for SVM, the ℓ2 norm is used for both h and α.\n\nMargins in semi-supervised learning. In [21] the margin of an unlabeled example is defined as |f(x)|, which can also be written as ỹf(x) with the pseudo label ỹ = sign(f(x)). This simply treats the current prediction as correct and makes the classifier more certain of what it currently predicts. For labeled and unlabeled examples alike, larger margins of data points decrease the upper bound on the generalization error, which brings better generalization performance.\n\nMargins of multi-class classification. We set our problem in the multi-class case: for an instance-label pair (x, y), y ∈ R^k is the ground-truth label in one-hot encoding and C(x) ∈ R^k is the prediction of the multi-class classifier C. The last layer of the classifier network is usually a softmax layer, so the output is a discrete distribution such that Σ_{i=1}^k C_i(x) = 1. 
The margin in multi-class problems is defined as the difference between the probability for the true class and the maximal probability over the false classes:\n\nMargin(x, y) = C_y(x) − max_{i ≠ y} C_i(x).    (2)\n\nIf the margin is a large positive number, the probability of the correct class is peaked in the distribution [C(x)_1, C(x)_2, ..., C(x)_k], indicating that the classifier is confident of its prediction. On the contrary, if the margin is a small positive number, the distribution is flat and the classifier is uncertain of its prediction, which has a similar flavor to CatGAN [14]. The classifier makes a mistaken decision when the margin is negative.\n\n3.3 Architecture Overview\n\nThe original GAN architecture consists of two components, a generator and a discriminator, that play a zero-sum game. The generator G transforms a latent variable z ∼ p(z) into a fake example x̂ ∼ p_g(x̂) such that the generated distribution p_g(x̂) approximates the real data distribution p(x). The discriminator D distinguishes generated fake examples from real examples. To adapt this architecture to semi-supervised learning, we add a classifier C, which forms a three-player game. We retain the discriminator to encourage the generator to produce visually realistic examples. We describe each of the components below. The architecture of MarginGAN is depicted in Fig. 1.\n\n3.4 Discriminator\n\nThe discriminator D of MarginGAN receives three kinds of inputs: labeled examples x drawn from p[l](x), unlabeled examples x̃ drawn from p[u](x̃), and generated examples G(z) = x̂ ∼ p_g(x̂). 
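As a concrete check of the margin definition in Eq. 2, here is a minimal Python sketch; the function name `multiclass_margin` is ours, not from the paper:

```python
def multiclass_margin(probs, y):
    """Margin(x, y) = C_y(x) - max_{i != y} C_i(x), where probs is the softmax output C(x)."""
    other = max(p for i, p in enumerate(probs) if i != y)
    return probs[y] - other

# A peaked prediction on the true class gives a large positive margin,
assert abs(multiclass_margin([0.05, 0.85, 0.05, 0.05], 1) - 0.80) < 1e-9
# a flat prediction gives a margin near zero,
assert abs(multiclass_margin([0.25, 0.25, 0.25, 0.25], 1)) < 1e-9
# and a wrong prediction gives a negative margin.
assert multiclass_margin([0.05, 0.85, 0.05, 0.05], 0) < 0
```

The three cases mirror the discussion above: confident correct, uncertain, and mistaken predictions.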
In our SSL setting, the discriminator should regard both labeled and unlabeled examples as real data points, and discern generated examples as fake data points. We define the loss function for the discriminator as\n\nLoss(D) = −{ E_{x∼p[l](x)}[log(D(x))] + E_{x̃∼p[u](x̃)}[log(D(x̃))] + E_{z∼p(z)}[log(1 − D(G(z)))] }.\n\n3.5 Classifier\n\nWe add a multi-class classifier C to the original GAN, as high-accuracy classification is our purpose in SSL. We develop the classifier from the perspective of margins. The classifier receives the same inputs as the discriminator—labeled examples, unlabeled examples and generated fake examples. In the following we detail the corresponding objective of the classifier for each input.\nFor labeled examples, the classifier has the same objective as ordinary multi-class classifiers. Given an instance-label pair (x, y), the classifier C attempts to minimize the cross-entropy loss between the true label y and the prediction C(x):\n\nLoss_CE(y, C(x)) = −Σ_{i=1}^k y_i log(C(x)_i),\n\nwhere y ∈ R^k is in one-hot encoding and C(x) ∈ R^k is the prediction. The loss function for the labeled examples can then be formulated as\n\nLoss(C[l]) = E_{(x,y)∼p[l](x,y)}[Loss_CE(y, C(x))] = E_{(x,y)∼p[l](x,y)}[ −Σ_{i=1}^k y_i log(C(x)_i) ].    (3)\n\nNote that minimizing the cross-entropy increases the probability of the true class and inhibits the probabilities of the false classes, leading to a larger margin as defined in Eq. 2.\nFor unlabeled examples, the goal of the classifier is to increase the margin of these data points. However, since there is no information about the corresponding true label, we have no idea which class probability should be peaked. 
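Picking the argmax class as a one-hot pseudo label can be sketched in a few lines of Python (the helper name `pseudo_label` is ours):

```python
def pseudo_label(probs):
    """One-hot pseudo label: 1 at the argmax of the current prediction, 0 elsewhere."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

assert pseudo_label([0.1, 0.2, 0.6, 0.1]) == [0.0, 0.0, 1.0, 0.0]
```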
Like [5–7, 21], we leverage a pseudo label ỹ[u] ∈ R^k in one-hot encoding for unlabeled examples. That is, the ỹ[u] vector has a 1 at the single entry corresponding to the class with the maximum predicted probability of the current prediction C(x̃), while all other entries are exactly zero. With pseudo labels, we can increase the margin by minimizing the cross-entropy between ỹ[u] and C(x̃). Intuitively, this objective reinforces the confidence of the current predictions. The loss function is given as\n\nLoss(C[u]) = E_{x̃∼p[u](x̃)}[Loss_CE(ỹ[u], C(x̃))] = E_{x̃∼p[u](x̃)}[ −Σ_{i=1}^k ỹ[u]_i log(C(x̃)_i) ],    (4)\n\nwhich has the same form as Eq. 3.\nWhen it comes to generated examples, the classifier should decrease the margin of these data points and make the prediction distribution flat. The generated examples are another form of unlabeled data, and we use pseudo labels in the same way. In order to decrease the margin of generated examples, we introduce a new loss function, the Inverted Cross Entropy (ICE) between two distributions:\n\nLoss_ICE(p, q) = −Σ_{i=1}^k p_i log(1 − q_i),\n\nwhere p, q ∈ R^k. Minimizing the inverted cross entropy increases the cross-entropy between the pseudo label ỹ[g] and C(G(z)), so that the prediction distribution becomes flat and the margin decreases. The loss function for generated examples is\n\nLoss(C[g]) = E_{z∼p(z)}[Loss_ICE(ỹ[g], C(G(z)))] = E_{z∼p(z)}[ −Σ_{i=1}^k ỹ[g]_i log(1 − C(G(z))_i) ].    (5)\n\nCombining the three loss functions defined in Eq. 
3, 4 and 5, we obtain the integrated loss function of the classifier:\n\nLoss(C) = Loss(C[l]) + Loss(C[u]) + Loss(C[g]).    (6)\n\n3.6 Generator\n\nThe purpose of G is to produce bogus examples that look realistic to the discriminator D and meanwhile improve the generalization of the classifier C. From this perspective, C and D form an alliance competing with G, while G attempts to fool both C and D. On the one hand, just like in the standard GAN, the objective of G is to generate fake data points that D cannot distinguish from real ones. On the other hand, because C increases the margin of real examples and decreases the margin of fake examples, G should compete by yielding data points with large margins to fool C. Therefore, in order to fool both D and C, G tries to yield realistic and large-margin examples simultaneously, such that the generated fake data points cannot easily be separated from real examples. In a nutshell, the loss function of G is formulated as\n\nLoss(G) = −E_{z∼p(z)}[log(D(G(z)))] + E_{z∼p(z)}[Loss_CE(ỹ[g], C(G(z)))].\n\n3.7 Minimax Game\n\nTo adapt the loss function of MarginGAN into a form similar to the original GAN, we combine all the loss functions of the components into a minimax problem:\n\nmin_G max_{D,C} J(G, D, C) = { E_{x∼p[l](x)}[log(D(x))] + E_{x̃∼p[u](x̃)}[log(D(x̃))] + E_{z∼p(z)}[log(1 − D(G(z)))] } − { E_{(x,y)∼p[l](x,y)}[Loss_CE(y, C(x))] + E_{x̃∼p[u](x̃)}[Loss_CE(ỹ[u], C(x̃))] + E_{z∼p(z)}[Loss_ICE(ỹ[g], C(G(z)))] },\n\nwhere the first part is a minimax game between G and D, and the second part is between G and C. Alternatively, we can view the minimax game from the perspective of margins:\n\nmin_G max_{D,C} J(G, D, C) = { E_{x∼p[l](x)}[log(D(x))] + E_{x̃∼p[u](x̃)}[log(D(x̃))] + E_{z∼p(z)}[log(1 − D(G(z)))] } + { E_{(x,y)∼p[l](x,y)}[Margin(x, y)] + E_{x̃∼p[u](x̃)}[Margin(x̃, ỹ[u])] + E_{z∼p(z)}[1 − Margin(G(z), ỹ[g])] },\n\nif we redefine Margin(x, y) := ⟨y, log C(x)⟩ and 1 − Margin(x, y) := ⟨y, log(1 − C(x))⟩. In practice, each of the three networks (D, C and G) is trained in turn with gradient descent over one example or a mini-batch while the other two networks are fixed, the same as the training procedure of a usual GAN.\n\nTable 1: Error rates (%) on MNIST with 100, 600, 1000 and 3000 labeled examples in semi-supervised learning. The results of competing methods come from [5]. Means and standard errors are reported for our method over 5 runs.\n\n# of labels | 100 | 600 | 1000 | 3000\nNN | 25.81 | 11.44 | 10.70 | 6.04\nSVM | 23.44 | 8.85 | 7.77 | 4.21\nCNN | 22.98 | 7.68 | 6.45 | 3.35\nTSVM | 16.81 | 6.16 | 5.38 | 3.45\nDBN-rNCA | — | 8.70 | — | 3.30\nEmbedNN | 16.86 | 5.97 | 5.73 | 3.59\nCAE | 13.47 | 6.30 | 4.77 | 3.22\nMTC | 12.03 | 5.13 | 3.64 | 2.57\ndropNN | 21.89 | 8.57 | 6.59 | 3.72\n+PL | 16.15 | 5.03 | 4.30 | 2.80\n+PL+DAE | 10.49 | 4.01 | 3.46 | 2.69\nMarginGAN (ours) | 3.53 ± 0.57 | 3.03 ± 0.60 | 2.87 ± 0.71 | 2.06 ± 0.20\n\n4 Experiments\n\n4.1 Preliminary Experiment on MNIST\n\nSimilar to our work, pseudo labels are employed in the prior work [5], which reports experiments on MNIST. To show the improvement brought by MarginGAN clearly, we first conduct a preliminary experiment on MNIST. We use the generator and the discriminator from infoGAN [22], and use a simple convolutional network with six layers as the classifier. 
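As a compact recap of the classifier objectives above, here is an illustrative Python sketch of the per-example cross-entropy and inverted cross-entropy losses; the function names and toy numbers are ours, not from the authors' implementation:

```python
from math import log

def cross_entropy(y, p):
    """Loss_CE(y, C(x)) = -sum_i y_i log C(x)_i: minimized by peaked (large-margin) predictions."""
    return -sum(yi * log(pi) for yi, pi in zip(y, p))

def inverted_cross_entropy(y, p):
    """Loss_ICE(y, C(G(z))) = -sum_i y_i log(1 - C(G(z))_i): minimized by flat (small-margin) predictions."""
    return -sum(yi * log(1.0 - pi) for yi, pi in zip(y, p))

y = [0.0, 1.0, 0.0]            # (pseudo) label in one-hot encoding
peaked = [0.10, 0.80, 0.10]    # confident prediction, large margin
flat = [0.34, 0.33, 0.33]      # uncertain prediction, small margin

# The CE term rewards peaked predictions on real examples ...
assert cross_entropy(y, peaked) < cross_entropy(y, flat)
# ... while the ICE term rewards flat predictions on generated examples.
assert inverted_cross_entropy(y, flat) < inverted_cross_entropy(y, peaked)
```

The two asserts capture the intended tug-of-war: the classifier pushes real examples toward large margins and generated examples toward small ones, while the generator pulls in the opposite direction.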
Although the classifier we use might be more powerful than that used in [5], the subsequent ablation study reveals the contribution brought by generated fake examples.\nMNIST consists of a training set of 60,000 images and a test set of 10,000 images, all of 28 × 28 gray-scale pixels. In our settings, we sample 100, 600, 1000 or 3000 labeled examples and use the rest of the training set as unlabeled samples. For training, we first pretrain the classifier with the labeled examples only, until the error rate is lower than 8.0%, 9.3%, 9.5% and 9.7%, corresponding to 100, 600, 1000 and 3000 labeled examples, respectively. Then the unlabeled samples and generated samples join the training process. Table 1 compares our results against other competing methods from [5]. The proposed MarginGAN outperforms these previous pseudo-label-based methods in every setting, which can be attributed to the participation of generated fake examples. Although this comparison with dated algorithms is somewhat unfair, our method does achieve higher accuracies under all settings, and the subsequent ablation study further verifies the improvements of our method.\n\n4.2 Ablation Study on MNIST\n\nTo find out the influence of labeled examples, unlabeled examples and generated fake examples, we ran ablation experiments with one or several types of examples fed as input at a time. In the ablation study, because of the instability of pseudo labels and the lack of labeled examples in some cases, we decreased the learning rate from 0.1 to 0.01. We measured the lowest error rates and the time consumed until training convergence in different settings; the results are reported in Table 2.\n\nUnlabeled examples. Unlabeled examples play an important role in semi-supervised learning. We can see that the addition of unlabeled examples reduces the error rate from 8.21% to 4.54%, an improvement of 3.67 percentage points. 
To see how uncertain the correctness of pseudo labels is, we conducted an extreme experiment: the classifier was pretrained to an error rate of 9.78% (± 0.14%), and then we fed the classifier unlabeled examples alone. In other words, the classifier could not access the labeled examples again. To our surprise, the error rate blew up and quickly reached 89.53%. Incorrect pseudo labels mislead the classifier and hinder its generalization.\n\nTable 2: Ablation study of our algorithm on MNIST. The amount of labeled examples in this experiment is 600. The abbreviations L, U and G correspond to labeled, unlabeled and generated examples, respectively. The last two rows show an extreme training situation.\n\nSettings | Error Rates (%) | Training Time (sec.)\nNormal Training: L | 8.21 ± 0.82 | 408.41 ± 26.17\nNormal Training: L + U | 4.54 ± 0.41 | 1305.64 ± 495.18\nNormal Training: L + U + G | 3.20 ± 0.62 | 367.79 ± 82.82\nExtreme Training: U | 89.53 ± 0.81 | —\nExtreme Training: U + G | 7.40 ± 5.01 | 886.83 ± 193.98\n\nTable 3: Means and standard errors of the error rates (%) on SVHN and CIFAR-10 over 4 runs.\n\nMethods | SVHN (500 labels) | CIFAR-10 (1000 labels) | CIFAR-10 (4000 labels)\nLadder [18] | — | — | 20.04 ± 0.47\nCatGAN [14] | — | — | 19.58 ± 0.58\nFM GANs [8] | — | — | 18.63 ± 2.32\nTriple-GAN [15] | — | — | 18.82 ± 0.32\nSGAN [17] | — | — | 17.26 ± 0.69\nΠ model [7] | 6.83 ± 0.66 | 27.36 ± 1.20 | 13.20 ± 0.27\nMarginGAN (ours) | 6.07 ± 0.43 | 10.39 ± 0.43 | 6.44 ± 0.10\n\nGenerated fake examples. We fed generated examples to the classifier, making it robust to wrong pseudo labels and improving the performance. We can see that, compared with training on labeled and unlabeled samples only, the generated examples further improve the error rate from 4.54% to 3.20%. 
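As a quick sanity check on Table 2, the relative training-time reduction obtained by adding generated examples (L + U versus L + U + G) can be computed directly from the reported means:

```python
# Mean training times from Table 2 (seconds), for 600 labeled examples on MNIST.
t_labeled_unlabeled = 1305.64    # L + U
t_with_generated = 367.79        # L + U + G

reduction = (t_labeled_unlabeled - t_with_generated) / t_labeled_unlabeled
assert abs(reduction - 0.718) < 0.001   # roughly a 71.8% reduction in training time
```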
Moreover, it is worth noting that the generated examples remarkably reduce the training time, by 71.8%. However, when we continue training, the error rate starts to increase and overfitting arises. As the generated images gradually become more realistic, the classifier still reduces their margins, which might harm the performance. Returning to the extreme situation mentioned above, when unlabeled images and generated images are combined after pretraining, the error rate indeed improves (from 9.78% to 7.40%).\n\n4.3 Results on SVHN and CIFAR-10\n\nNext we run our MarginGAN method on two standard SSL datasets—SVHN and CIFAR-10. We employ a 12-block residual network [23] with Shake-Shake regularization [24] as our classifier, which is the same as the ResNet version used in the mean teacher implementation. Our algorithm integrates the generator and the discriminator from infoGAN [22] with this residual network. We also use mean teacher training for averaging model weights over recent training examples.\nThe details of the SVHN and CIFAR-10 datasets are as follows. SVHN contains 73,257 digits for training and 26,032 digits for testing, each digit being a 32 × 32 RGB image. The CIFAR-10 dataset consists of 50,000 training samples and 10,000 test samples of 32 × 32 color images in 10 object classes. On SVHN we randomly select 500 labeled samples, and on CIFAR-10 we train with 1,000 and 4,000 labels, respectively. Table 3 shows the results of the experiments on the SVHN and CIFAR-10 datasets. We can see the improvements brought by our method.\n\n4.4 Generated Fake Images\n\nFig. 3 shows images generated by MarginGAN as the accuracy of the classifier increases. As we can see, these fake images really look “bad”: for instance, most generated digits for MNIST and SVHN are close to the decision boundaries, such that one cannot determine their labels with high confidence. 
This matches the motivation of this paper.\n\n(a) MNIST\n\n(b) SVHN\n\n(c) CIFAR-10\n\nFigure 3: Fake images generated by MarginGAN.\n\n5 Conclusion\n\nIn this work, we presented the Margin Generative Adversarial Network (MarginGAN), which consists of three players—a generator, a discriminator and a classifier. The key is that the classifier can leverage fake examples produced by the generator to improve the generalization performance. Specifically, the classifier aims at maximizing the margins of true examples and minimizing the margins of fake examples. The generator attempts to yield realistic and large-margin examples to fool both the discriminator and the classifier. The experimental results on several benchmarks show that MarginGAN provides improved accuracies and shortened training time as well.\n\nReferences\n[1] H. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE TIT, 1965.\n[2] S. Fralick. Learning to recognize patterns without a teacher. IEEE TIT, 1967.\n[3] A. Agrawala. Learning with a probabilistic teacher. IEEE TIT, 1970.\n[4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT, 1998.\n[5] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. ICML Workshop, 2013.\n[6] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. arXiv:1610.02242, 2016.\n[7] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS, 2017.\n[8] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. NeurIPS, 2016.\n[9] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. Salakhutdinov. Good semi-supervised learning that requires a bad GAN. NeurIPS, 2017.\n[10] A. Blum and S. Chawla. 
Learning from labeled and unlabeled data using graph mincuts. In ICML, pages 19–26, 2001.\n[11] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.\n[12] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv:1704.03976, 2017.\n[13] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. ICLR, 2016.\n[14] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. ICLR, 2016.\n[15] C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. NeurIPS, 2017.\n[16] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. NeurIPS, 2017.\n[17] Z. Deng, H. Zhang, X. Liang, L. Yang, S. Xu, J. Zhu, and E. P. Xing. Structured generative adversarial networks. NeurIPS, 2017.\n[18] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. NeurIPS, 2015.\n[19] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Mutual exclusivity loss for semi-supervised deep learning. ICIP, 2016.\n[20] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.\n[21] K. P. Bennett, A. Demiriz, and R. Maclin. Exploiting unlabeled data in ensemble methods. KDD, 2002.\n[22] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv:1606.03657, 2016.\n[23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.\n[24] X. Gastaldi. Shake-Shake regularization. 
arXiv:1705.07485, 2017.\n", "award": [], "sourceid": 5512, "authors": [{"given_name": "Jinhao", "family_name": "Dong", "institution": "Xidian University"}, {"given_name": "Tong", "family_name": "Lin", "institution": "Peking University"}]}