{"title": "Large Scale Adversarial Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10542, "page_last": 10552, "abstract": "Adversarially trained generative models (GANs) have recently achieved compelling image synthesis results. But despite early successes in using GANs for unsupervised representation learning, they have since been superseded by approaches based on self-supervision. In this work we show that progress in image generation quality translates to substantially improved representation learning performance. Our approach, BigBiGAN, builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. We extensively evaluate the representation learning and generation capabilities of these BigBiGAN models, demonstrating that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as compelling results in unconditional image generation.", "full_text": "Large Scale Adversarial Representation Learning\n\nJeff Donahue\n\nDeepMind\n\njeffdonahue@google.com\n\nKaren Simonyan\n\nDeepMind\n\nsimonyan@google.com\n\nAbstract\n\nAdversarially trained generative models (GANs) have recently achieved compelling\nimage synthesis results. But despite early successes in using GANs for unsuper-\nvised representation learning, they have since been superseded by approaches based\non self-supervision. In this work we show that progress in image generation quality\ntranslates to substantially improved representation learning performance. Our ap-\nproach, BigBiGAN, builds upon the state-of-the-art BigGAN model, extending it to\nrepresentation learning by adding an encoder and modifying the discriminator. 
We\nextensively evaluate the representation learning and generation capabilities of these\nBigBiGAN models, demonstrating that these generation-based models achieve the\nstate of the art in unsupervised representation learning on ImageNet, as well as in\nunconditional image generation. Pretrained BigBiGAN models \u2013 including image\ngenerators and encoders \u2013 are available on TensorFlow Hub1.\n\n1\n\nIntroduction\n\nIn recent years we have seen rapid progress in generative models of visual data. While these models\nwere previously con\ufb01ned to domains with single or few modes, simple structure, and low resolution,\nwith advances in both modeling and hardware they have since gained the ability to convincingly\ngenerate complex, multimodal, high resolution image distributions [1, 17, 18].\nIntuitively, the ability to generate data in a particular domain necessitates a high-level understanding\nof the semantics of said domain. This idea has long-standing appeal as raw data is both cheap \u2013\nreadily available in virtually in\ufb01nite supply from sources like the Internet \u2013 and rich, with images\ncomprising far more information than the class labels that typical discriminative machine learning\nmodels are trained to predict from them. Yet, while the progress in generative models has been\nundeniable, nagging questions persist: what semantics have these models learned, and how can they\nbe leveraged for representation learning?\nThe dream of generation as a means of true understanding from raw data alone has hardly been realized.\nInstead, the most successful approaches for unsupervised learning leverage techniques adopted from\nthe \ufb01eld of supervised learning, a class of methods known as self-supervised learning [4, 35, 32, 9].\nThese approaches typically involve changing or holding back certain aspects of the data in some way,\nand training a model to predict or generate aspects of the missing information. 
For example, [34, 35]\nproposed colorization as a means of unsupervised learning, where a model is given a subset of the\ncolor channels in an input image, and trained to predict the missing channels.\nGenerative models as a means of unsupervised learning offer an appealing alternative to self-supervised tasks in that they are trained to model the full data distribution without requiring any\nmodi\ufb01cation of the original data. One class of generative models that has been applied to representation learning is generative adversarial networks (GANs) [11]. The generator in the GAN\n\n1 Models available at https://tfhub.dev/s?publisher=deepmind&q=bigbigan, with a Colab\nnotebook demo at https://colab.research.google.com/github/tensorflow/hub/blob/master/\nexamples/colab/bigbigan_with_tf_hub.ipynb.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f[Figure 1 diagram: data x \u223c Px is mapped by the encoder E to \u02c6z \u223c E(x); latents z \u223c Pz are mapped by the generator G to \u02c6x \u223c G(z); the discriminator submodules F, H, and J produce the scores sx, sxz, and sz, which are summed into the loss \u2113.]\n\nFigure 1: The structure of the BigBiGAN framework. The joint discriminator D is used to compute\nthe loss \u2113. Its inputs are data-latent pairs, either (x \u223c Px, \u02c6z \u223c E(x)), sampled from the data\ndistribution Px and encoder E outputs, or (\u02c6x \u223c G(z), z \u223c Pz), sampled from the generator G outputs\nand the latent distribution Pz. 
The loss \u2113 includes the unary data term sx and the unary latent term sz,\nas well as the joint term sxz which ties the data and latent distributions.\n\nframework is a feed-forward mapping from randomly sampled latent variables (also called \u201cnoise\u201d)\nto generated data, with learning signal provided by a discriminator trained to distinguish between\nreal and generated data samples, guiding the generator\u2019s outputs to follow the data distribution.\nThe adversarially learned inference (ALI) [8] or bidirectional GAN (BiGAN) [5] approaches were\nproposed as extensions to the GAN framework that augment the standard GAN with an encoder\nmodule mapping real data to latents, the inverse of the mapping learned by the generator.\nIn the limit of an optimal discriminator, [5] showed that a deterministic BiGAN behaves like an\nautoencoder minimizing \u21130 reconstruction costs; however, the shape of the reconstruction error\nsurface is dictated by a parametric discriminator, as opposed to simple pixel-level measures like the\n\u21132 error. Since the discriminator is usually a powerful neural network, the hope is that it will induce\nan error surface which emphasizes \u201csemantic\u201d errors in reconstructions, rather than low-level details.\nIn [5] it was demonstrated that the encoder learned via the BiGAN or ALI framework is an effective\nmeans of visual representation learning on ImageNet for downstream tasks. However, it used a\nDCGAN [26] style generator, incapable of producing high-quality images on this dataset, so the\nsemantics the encoder could model were in turn quite limited. In this work we revisit this approach\nusing BigGAN [1] as the generator, a modern model that appears capable of capturing many of the\nmodes and much of the structure present in ImageNet images. 
Our contributions are as follows:\n\n\u2022 We show that BigBiGAN (BiGAN with BigGAN generator) matches the state of the art in unsupervised representation learning on ImageNet.\n\u2022 We propose a more stable version of the joint discriminator for BigBiGAN.\n\u2022 We perform a thorough empirical analysis and ablation study of model design choices.\n\u2022 We show that the representation learning objective also improves unconditional image generation, and demonstrate state-of-the-art results in unconditional ImageNet generation.\n\u2022 We open source pretrained BigBiGAN models on TensorFlow Hub2.\n\n2See footnote 1.\n\n2 BigBiGAN\n\nThe BiGAN [5] or ALI [8] approaches were proposed as extensions of the GAN [11] framework\nwhich enable the learning of an encoder that can be employed as an inference model [8] or feature\nrepresentation [5]. Given a distribution Px of data x (e.g., images) and a distribution Pz of latents z\n(usually a simple continuous distribution like an isotropic Gaussian N (0, I)), the generator G models\na conditional distribution P (x|z) of data x given latent inputs z sampled from the latent prior Pz,\nas in the standard GAN generator [11]. The encoder E models the inverse conditional distribution\nP (z|x), predicting latents z given data x sampled from the data distribution Px.\nBesides the addition of E, the other modi\ufb01cation to the GAN in the BiGAN framework is a joint\ndiscriminator D, which takes as input data-latent pairs (x, z) (rather than just data x as in a standard\nGAN), and learns to discriminate between pairs from the data distribution and encoder, versus\nthe generator and latent distribution. Concretely, its inputs are pairs (x \u223c Px, \u02c6z \u223c E(x)) and\n(\u02c6x \u223c G(z), z \u223c Pz), and the goal of G and E is to \u201cfool\u201d the discriminator by making the two\njoint distributions PxE and PGz from which these pairs are sampled indistinguishable. 
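The construction of these two kinds of data-latent pairs can be sketched with toy stand-in modules (a minimal NumPy sketch; the linear/tanh modules, dimensions, and variable names below are illustrative, not the paper's architectures):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_z = 8, 4  # toy data/latent dimensions (the paper uses images and 120-d latents)

# Toy stand-ins for generator G, encoder E, and joint discriminator D.
W_g = rng.normal(size=(dim_z, dim_x))
W_e = rng.normal(size=(dim_x, dim_z))
w_d = rng.normal(size=(dim_x + dim_z,))

def G(z):
    """Generator: latent -> data."""
    return np.tanh(z @ W_g)

def E(x):
    """Encoder: data -> latent."""
    return np.tanh(x @ W_e)

def D(x, z):
    """Joint discriminator: scores a (data, latent) pair."""
    return np.concatenate([x, z], axis=-1) @ w_d

# Pairs from the data distribution and encoder: (x ~ Px, z_hat ~ E(x)).
x = rng.normal(size=(16, dim_x))      # stand-in for data samples x ~ Px
score_enc = D(x, E(x))                # D is trained to score these positively

# Pairs from the generator and latent prior: (x_hat ~ G(z), z ~ Pz).
z = rng.normal(size=(16, dim_z))      # latent prior Pz = N(0, I)
score_gen = D(G(z), z)                # ... and these negatively
```

G and E, in turn, are trained to make the two joint distributions indistinguishable under D.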
The adversarial\nminimax objective in [5, 8], analogous to that of the GAN framework [11], was de\ufb01ned as follows:\n\nminG,E maxD {Ex\u223cPx,z\u223cE\u03a6(x) [log(\u03c3(D(x, z)))] + Ez\u223cPz,x\u223cG\u03a6(z) [log(1 \u2212 \u03c3(D(x, z)))]}\n\nUnder this objective, [5, 8] showed that with an optimal D, G and E minimize the Jensen-Shannon\ndivergence between the joint distributions PxE and PGz, and therefore at the global optimum, the\ntwo joint distributions PxE = PGz match, analogous to the results from standard GANs [11].\nFurthermore, [5] showed that in the case where E and G are deterministic functions (i.e., the learned\nconditional distributions PG(x|z) and PE (z|x) are Dirac \u03b4 functions), these two functions are inverses\nat the global optimum: e.g., \u2200x\u2208supp(Px) x = G(E(x)), with the optimal joint discriminator effectively\nimposing \u21130 reconstruction costs on x and z.\nWhile the crux of our approach, BigBiGAN, remains the same as that of BiGAN [5, 8], we have\nadopted the generator and discriminator architectures from the state-of-the-art BigGAN [1] generative\nimage model. Beyond that, we have found that an improved discriminator structure leads to better\nrepresentation learning results without compromising generation (Figure 1). Namely, in addition to\nthe joint discriminator loss proposed in [5, 8] which ties the data and latent distributions together, we\npropose additional unary terms in the learning objective, which are functions only of either the data x\nor the latents z. Although [5, 8] prove that the original BiGAN objective already enforces that the\nlearnt joint distributions match at the global optimum, implying that the marginal distributions of\nx and z match as well, these unary terms intuitively guide optimization in the \u201cright direction\u201d by\nexplicitly enforcing this property. 
For example, in the context of image generation, the unary loss\nterm on x matches the original GAN objective and provides a learning signal which steers only the\ngenerator to match the image distribution independently of its latent inputs. (In our evaluation we\nwill demonstrate empirically that the addition of these terms results in both improved generation and\nrepresentation learning.)\nConcretely, the discriminator loss LD and the encoder-generator loss LEG are de\ufb01ned as follows,\nbased on scalar discriminator \u201cscore\u201d functions s\u2217 and the corresponding per-sample losses \u2113\u2217, with y \u2208 {\u22121, +1}:\n\nsx(x) = \u03b8x\u22a4F\u0398(x)\nsz(z) = \u03b8z\u22a4H\u0398(z)\nsxz(x, z) = \u03b8xz\u22a4J\u0398(F\u0398(x), H\u0398(z))\n\n\u2113EG(x, z, y) = y (sx(x) + sz(z) + sxz(x, z))\nLEG(Px, Pz) = Ex\u223cPx,\u02c6z\u223cE\u03a6(x) [\u2113EG(x, \u02c6z, +1)] + Ez\u223cPz,\u02c6x\u223cG\u03a6(z) [\u2113EG(\u02c6x, z, \u22121)]\n\u2113D(x, z, y) = h(ysx(x)) + h(ysz(z)) + h(ysxz(x, z))\nLD(Px, Pz) = Ex\u223cPx,\u02c6z\u223cE\u03a6(x) [\u2113D(x, \u02c6z, +1)] + Ez\u223cPz,\u02c6x\u223cG\u03a6(z) [\u2113D(\u02c6x, z, \u22121)]\n\nwhere h(t) = max(0, 1 \u2212 t) is a \u201chinge\u201d used to regularize the discriminator [22, 30]3, also used in\nBigGAN [1]. The discriminator D includes three submodules: F, H, and J. F takes only x as input\nand H takes only z, and learned projections of their outputs with parameters \u03b8x and \u03b8z respectively\ngive the scalar unary scores sx and sz. In our experiments, the data x are images and latents z are\nunstructured \ufb02at vectors; accordingly, F is a ConvNet and H is an MLP. 
The joint score sxz tying x\nand z is given by the remaining D submodule, J, a function of the outputs of F and H.\n\n3We also considered an alternative discriminator loss \u2113\u2032D which invokes the \u201chinge\u201d h just once on the sum of the three loss terms \u2013 \u2113\u2032D(x, z, y) = h(y (sx(x) + sz(z) + sxz(x, z))) \u2013 but found that this performed signi\ufb01cantly worse than \u2113D above, which clamps each of the three loss terms separately.\n\n\fThe E and G parameters \u03a6 are optimized to minimize the loss LEG, and the D parameters \u0398 are\noptimized to minimize the loss LD. As usual, the expectations E are estimated by Monte Carlo samples\ntaken over minibatches.\nAs in BiGAN [5] and ALI [8], the discriminator loss LD intuitively trains the discriminator to\ndistinguish between the two joint data-latent distributions from the encoder and the generator, pushing\nit to predict positive values for encoder input pairs (x, E(x)) and negative values for generator\ninput pairs (G(z), z). The generator and encoder loss LEG trains these two modules to fool the\ndiscriminator into incorrectly predicting the opposite, in effect pushing them to create matching joint\ndata-latent distributions. (In the case of deterministic E and G, this requires the two modules to invert\none another [5].)\n\n3 Evaluation\n\nMost of our experiments follow the standard protocol used to evaluate unsupervised learning techniques, \ufb01rst proposed in [34]. We train a BigBiGAN on unlabeled ImageNet, freeze its learned\nrepresentation, and then train a linear classi\ufb01er on its outputs, fully supervised using all of the training\nset labels. 
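The linear-evaluation protocol just described (freeze the representation, then fit a supervised linear classifier on top) can be sketched as follows; the random features with synthetic linearly separable labels are illustrative stand-ins for frozen encoder outputs and ImageNet labels:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_classes = 200, 16, 3

# Stand-ins for frozen encoder features and labels (illustrative only).
features = rng.normal(size=(n, d))            # "frozen" representation: never updated below
true_w = rng.normal(size=(d, n_classes))
labels = (features @ true_w).argmax(axis=1)   # synthetic, linearly separable labels

# Multinomial logistic regression on top of the frozen features,
# trained with plain full-batch gradient descent on the cross-entropy.
W = np.zeros((d, n_classes))
one_hot = np.eye(n_classes)[labels]
for _ in range(500):
    logits = features @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * features.T @ (p - one_hot) / n  # gradient of mean cross-entropy

accuracy = ((features @ W).argmax(axis=1) == labels).mean()
```

Only the linear weights W are learned; the quality of the frozen features determines the attainable accuracy.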
We also measure image generation performance, reporting Inception Score [28] (IS) and\nFr\u00e9chet Inception Distance [15] (FID) as the standard metrics there.\n\n3.1 Ablation\n\nWe begin with an extensive ablation study in which we directly evaluate a number of modeling\nchoices, with results presented in Table 1. Where possible we performed three runs of each variant\nwith different seeds and report the mean and standard deviation for each metric.\nWe start with a relatively fully-\ufb02edged version of the model at 128 \u00d7 128 resolution (row Base), with\nthe G architecture and the F component of D taken from the corresponding 128 \u00d7 128 architectures\nin BigGAN, including the skip connections and shared noise embedding proposed in [1]. z is 120\ndimensions, split into six groups of 20 dimensions fed into each of the six layers of G as in [1]. The\nremaining components of D \u2013 H and J \u2013 are 8-layer MLPs with ResNet-style skip connections\n(four residual blocks with two layers each) and size 2048 hidden layers. The E architecture is the\nResNet-v2-50 ConvNet originally proposed for image classi\ufb01cation in [13], followed by a 4-layer\nMLP (size 4096) with skip connections (two residual blocks) after ResNet\u2019s globally average pooled\noutput. The unconditional BigGAN training setup corresponds to the \u201cSingle Label\u201d setup proposed\nin [23], where a single \u201cdummy\u201d label is used for all images (theoretically equivalent to learning a\nbias in place of the class-conditional batch norm inputs). We then ablate several aspects of the model,\nwith results detailed in the following paragraphs. Additional architectural and optimization details are\nprovided in Appendix A (supplementary material). Full learning curves for many results are included\nin Appendix D (supplementary material).\nLatent distribution Pz and stochastic E. 
As in ALI [8], the encoder E of our Base model is non-deterministic, parametrizing a distribution N (\u00b5, \u03c3). \u00b5 and \u02c6\u03c3 are given by a linear layer at the output\nof the model, and the \ufb01nal standard deviation \u03c3 is computed from \u02c6\u03c3 using a non-negative \u201csoftplus\u201d\nnon-linearity \u03c3 = log(1 + exp(\u02c6\u03c3)) [7]. The \ufb01nal z uses the reparametrized sampling from [19],\nwith z = \u00b5 + \u03f5\u03c3, where \u03f5 \u223c N (0, I). Compared to a deterministic encoder (row Deterministic E)\nwhich predicts z directly without sampling (effectively modeling P (z|x) as a Dirac \u03b4 distribution),\nthe non-deterministic Base model achieves signi\ufb01cantly better classi\ufb01cation performance (at no cost\nto generation). We also compared to using a uniform Pz = U(\u22121, 1) (row Uniform Pz) with E\ndeterministically predicting z = tanh(\u02c6z) given a linear output \u02c6z, as done in BiGAN [5]. This also\nachieves worse classi\ufb01cation results than the non-deterministic Base model.\n\nUnary loss terms. We evaluate the effect of removing one or both unary terms of the loss function\nproposed in Section 2, sx and sz. Removing both unary terms (row No Unaries) corresponds to the\noriginal objective proposed in [5, 8]. It is clear that the x unary term has a large positive effect on\ngeneration performance, with the Base and x Unary Only rows having signi\ufb01cantly better IS and\nFID than the z Unary Only and No Unaries rows. This result makes intuitive sense as it matches\nthe standard generator loss. It also marginally improves classi\ufb01cation performance. The z unary\nterm makes a more marginal difference, likely due to the relative ease of modeling relatively simple\ndistributions like isotropic Gaussians, though it does also result in slightly improved classi\ufb01cation and\ngeneration in terms of FID \u2013 especially without the x term (z Unary Only vs. No Unaries). 
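The stochastic encoder head described above under "Latent distribution Pz and stochastic E" (softplus-rectified standard deviation plus reparametrized sampling) can be sketched as follows; the linear head, feature dimension, and names here are illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(t):
    # sigma = log(1 + exp(sigma_hat)), computed stably as logaddexp(0, t)
    return np.logaddexp(0.0, t)

# Illustrative linear head predicting mu and sigma_hat from an encoder feature.
feat_dim, z_dim = 32, 8
W_mu = rng.normal(size=(feat_dim, z_dim))
W_sigma = rng.normal(size=(feat_dim, z_dim))

def sample_z(features):
    mu = features @ W_mu
    sigma = softplus(features @ W_sigma)   # non-negative standard deviation
    eps = rng.normal(size=mu.shape)        # eps ~ N(0, I)
    return mu + eps * sigma                # reparametrized sample z = mu + eps * sigma

z = sample_z(rng.normal(size=(4, feat_dim)))  # z has shape (4, z_dim)
```

The reparametrization keeps the sampling step differentiable with respect to mu and sigma, which is what lets the encoder be trained through the sampled z.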
On the\nother hand, IS is worse with the z term. This may be due to IS roughly measuring the generator\u2019s\ncoverage of the major modes of the distribution (the classes) rather than the distribution in its entirety,\nthe latter of which may be better captured by FID and more likely to be promoted by a good encoder\nE. The requirement of invertibility in a (Big)BiGAN could be encouraging the generator to produce\ndistinguishable outputs across the entire latent space, rather than \u201ccollapsing\u201d large volumes of latent\nspace to a single mode of the data distribution.\nG capacity. To address the question of the importance of the generator G in representation learning,\nwe vary the capacity of G (with E and D \ufb01xed) in the Small G rows. With a third of the capacity\nof the Base G model (Small G (32)), the overall model is quite unstable and achieves signi\ufb01cantly\nworse classi\ufb01cation results than the higher capacity base model.4 With two-thirds capacity (Small G\n(64)), generation performance is substantially worse (matching the results in [1]) and classi\ufb01cation\nperformance is modestly worse. These results con\ufb01rm that a powerful image generator is indeed\nimportant for learning good representations via the encoder. Assuming this relationship holds in\nthe future, we expect that better generative models are likely to lead to further improvements in\nrepresentation learning.\n\nStandard GAN. We also compare BigBiGAN\u2019s image generation performance against a standard\nunconditional BigGAN with no encoder E and only the standard F ConvNet in the discriminator, with\nonly the sx term in the loss (row No E (GAN)). 
While the standard GAN achieves a marginally better\nIS, the BigBiGAN FID is about the same, indicating that the addition of the BigBiGAN E and joint\nD does not compromise generation with the newly proposed unary loss terms described in Section 2.\n(In comparison, the versions of the model without unary loss term on x \u2013 rows z Unary Only and\nNo Unaries \u2013 have substantially worse generation performance in terms of FID than the standard\nGAN.) We conjecture that the IS is worse for similar reasons that the sz unary loss term leads to\nworse IS. Next we will show that with an enhanced E taking higher input resolutions, generation with\nBigBiGAN in terms of FID is substantially improved over the standard GAN.\nHigh resolution E with varying resolution G. BiGAN [5] proposed an asymmetric setup in which\nE takes higher resolution images than G outputs and D takes as input, showing that an E taking\n128 \u00d7 128 inputs with a 64 \u00d7 64 G outperforms a 64 \u00d7 64 E for downstream tasks. We experiment\nwith this setup in BigBiGAN, raising the E input resolution to 256 \u00d7 256 \u2013 matching the resolution\nused in typical supervised ImageNet classi\ufb01cation setups \u2013 and varying the G output and D input\nresolution in {64, 128, 256}. Our results in Table 1 (rows High Res E (256) and Low/High Res G (*))\nshow that BigBiGAN achieves better representation learning results as the G resolution increases, up\nto the full E resolution of 256 \u00d7 256. However, because the overall model is much slower to train\nwith G at 256 \u00d7 256 resolution, the remainder of our results use the 128 \u00d7 128 resolution for G.\nInterestingly, with the higher resolution E, generation improves signi\ufb01cantly (especially by FID),\ndespite G operating at the same resolution (row High Res E (256) vs. Base). 
This is an encouraging\nresult for the potential of BigBiGAN as a means of improving adversarial image synthesis itself,\nbesides its use in representation learning and inference.\nE architecture. Keeping the E input resolution \ufb01xed at 256, we experiment with varied and often\nlarger E architectures, including several of the ResNet-50 variants explored in [20]. In particular,\nwe expand the capacity of the hidden layers by a factor of 2 or 4, as well as swap the residual block\nstructure to a reversible variant called RevNet [10] with the same number of layers and capacity as\nthe corresponding ResNets. (We use the version of RevNet described in [20].) We \ufb01nd that the base\nResNet-50 model (row High Res E (256)) outperforms RevNet-50 (row RevNet), but as the network\nwidths are expanded, we begin to see improvements from RevNet-50, with double-width RevNet\noutperforming a ResNet of the same capacity (rows RevNet \u00d72 and ResNet \u00d72). We see further gains\nwith an even larger quadruple-width RevNet model (row RevNet \u00d74), which we use for our \ufb01nal\nresults in Section 3.2.\n\n4Though the generation performance by IS and FID in row Small G (32) is very poor at the point we measured \u2013 when its best validation classi\ufb01cation performance (43.59%) is achieved \u2013 this model was performing more reasonably for generation earlier in training, reaching IS 14.69 and FID 60.67.\n\n\fRow | Encoder (E): A. D. C. R. Var. \u03b7 | Gen. (G): C. R. | Loss: sxz sx sz | Pz | IS (\u2191) | FID (\u2193) | Cls. (\u2191)\nBase | S 50 1 128 \u2713 1 | 96 128 | \u2713 \u2713 \u2713 | N | 22.66 \u00b1 0.18 | 31.19 \u00b1 0.37 | 48.10 \u00b1 0.13\nDeterministic E | S 50 1 128 (-) 1 | 96 128 | \u2713 \u2713 \u2713 | N | 22.79 \u00b1 0.27 | 31.31 \u00b1 0.30 | 46.97 \u00b1 0.35\nUniform Pz | S 50 1 128 (-) 1 | 96 128 | \u2713 \u2713 \u2713 | (U) | 22.83 \u00b1 0.24 | 31.52 \u00b1 0.28 | 45.11 \u00b1 0.93\nx Unary Only | S 50 1 128 \u2713 1 | 96 128 | \u2713 \u2713 (-) | N | 23.19 \u00b1 0.28 | 31.99 \u00b1 0.30 | 47.74 \u00b1 0.20\nz Unary Only | S 50 1 128 \u2713 1 | 96 128 | \u2713 (-) \u2713 | N | 19.52 \u00b1 0.39 | 39.48 \u00b1 1.00 | 47.78 \u00b1 0.28\nNo Unaries (BiGAN) | S 50 1 128 \u2713 1 | 96 128 | \u2713 (-) (-) | N | 19.70 \u00b1 0.30 | 42.92 \u00b1 0.92 | 46.71 \u00b1 0.88\nSmall G (32) | S 50 1 128 \u2713 1 | (32) 128 | \u2713 \u2713 \u2713 | N | 3.28 \u00b1 0.18 | 247.30 \u00b1 10.31 | 43.59 \u00b1 0.34\nSmall G (64) | S 50 1 128 \u2713 1 | (64) 128 | \u2713 \u2713 \u2713 | N | 19.96 \u00b1 0.15 | 38.93 \u00b1 0.39 | 47.54 \u00b1 0.33\nNo E (GAN) * | (-) | 96 128 | (-) \u2713 (-) | N | 23.56 \u00b1 0.37 | 30.91 \u00b1 0.23 | (-)\nHigh Res E (256) | S 50 1 (256) \u2713 1 | 96 128 | \u2713 \u2713 \u2713 | N | 23.45 \u00b1 0.14 | 27.86 \u00b1 0.13 | 50.80 \u00b1 0.30\nLow Res G (64) | S 50 1 (256) \u2713 1 | 96 (64) | \u2713 \u2713 \u2713 | N | 19.40 \u00b1 0.19 | 15.82 \u00b1 0.06 | 47.51 \u00b1 0.09\nHigh Res G (256) | S 50 1 (256) \u2713 1 | 96 (256) | \u2713 \u2713 \u2713 | N | 24.70 | 38.58 | 51.49\nResNet-101 | S (101) 1 (256) \u2713 1 | 96 128 | \u2713 \u2713 \u2713 | N | 23.29 | 28.01 | 51.21\nResNet \u00d72 | S 50 (2) (256) \u2713 1 | 96 128 | \u2713 \u2713 \u2713 | N | 23.68 | 27.81 | 52.66\nRevNet | (V) 50 1 (256) \u2713 1 | 96 128 | \u2713 \u2713 \u2713 | N | 23.33 \u00b1 0.09 | 27.78 \u00b1 0.06 | 49.42 \u00b1 0.18\nRevNet \u00d72 | (V) 50 (2) (256) \u2713 1 | 96 128 | \u2713 \u2713 \u2713 | N | 23.21 | 27.96 | 54.40\nRevNet \u00d74 | (V) 50 (4) (256) \u2713 1 | 96 128 | \u2713 \u2713 \u2713 | N | 23.23 | 28.15 | 57.15\nResNet (\u2191 E LR) | S 50 1 (256) \u2713 (10) | 96 128 | \u2713 \u2713 \u2713 | N | 23.27 \u00b1 0.22 | 28.51 \u00b1 0.44 | 53.70 \u00b1 0.15\nRevNet \u00d74 (\u2191 E LR) | (V) 50 (4) (256) \u2713 (10) | 96 128 | \u2713 \u2713 \u2713 | N | 23.08 | 28.54 | 60.15\n\nTable 1: Results for variants of BigBiGAN, given in Inception Score [28] (IS) and Fr\u00e9chet Inception\nDistance [15] (FID) of the generated images, and ImageNet top-1 classi\ufb01cation accuracy percentage\n(Cls.) of a supervised logistic regression classi\ufb01er trained on the encoder features [34], computed\non a split of 10K images randomly sampled from the training set, which we refer to as the \u201ctrainval\u201d\nsplit. The Encoder (E) columns specify the E architecture (A.) as ResNet (S) or RevNet (V), the\ndepth (D., e.g. 50 for ResNet-50), the channel width multiplier (C.), with 1 denoting the original\nwidths from [13], the input image resolution (R.), whether the variance is predicted and a z vector\nis sampled from the resulting distribution (Var.), and the learning rate multiplier \u03b7 relative to the G\nlearning rate. The Generator (G) columns specify the BigGAN G channel multiplier (C.), with 96\ncorresponding to the original width from [1], and output image resolution (R.). The Loss columns\nspecify which terms of the BigBiGAN loss are present in the objective. The Pz column speci\ufb01es the\ninput distribution as a standard normal N (0, 1) or continuous uniform U(\u22121, 1). Changes from the\nBase setup in each row are marked in parentheses. Results with margins of error (written as \u201c\u00b5 \u00b1 \u03c3\u201d)\nare the means and standard deviations over three runs with different random seeds. (Experiments\nrequiring more computation were run only once.) 
(* Result for vanilla GAN (No E (GAN)) selected\nwith early stopping based on best FID; other results selected with early stopping based on validation\nclassi\ufb01cation accuracy (Cls.).)\n\nDecoupled E/G optimization. As a \ufb01nal improvement, we decoupled the E optimizer from that of\nG, and found that simply using a 10\u00d7 higher learning rate for E dramatically accelerates training\nand improves \ufb01nal representation learning results. For ResNet-50 this improves linear classi\ufb01er\naccuracy by nearly 3% (ResNet (\u2191 E LR) vs. High Res E (256)). We also applied this to our largest E\narchitecture, RevNet-50 \u00d74, and saw similar gains (RevNet \u00d74 (\u2191 E LR) vs. RevNet \u00d74).\n\n3.2 Comparison with prior methods\n\nRepresentation learning. We now take our best model by trainval classi\ufb01cation accuracy from the\nabove ablations and present results on the of\ufb01cial ImageNet validation set, comparing against the state\nof the art in recent unsupervised learning literature. For comparison, we also present classi\ufb01cation\nresults for our best performing variant with the smaller ResNet-50-based E. These models correspond\nto the last two rows of Table 1, ResNet (\u2191 E LR) and RevNet \u00d74 (\u2191 E LR).\nResults are presented in Table 2. (For reference, the fully supervised accuracy of these architectures\nis given in Appendix A, Table 1 (supplementary material).) 
Compared with a number of modern self-supervised approaches [25, 3, 34, 32, 9, 14] and combinations thereof [4], our BigBiGAN approach\nbased purely on generative models performs well for representation learning, state-of-the-art among\nrecent unsupervised learning results, improving upon a recently published result from [20] of 55.4%\nto 60.8% top-1 accuracy using rotation prediction pre-training with the same representation learning\narchitecture5 and feature, labeled as AvePool in Table 2, and matches the results of the concurrent\nwork in [14] based on contrastive predictive coding (CPC).\n\n\fMethod | Architecture | Feature | Top-1 | Top-5\nBiGAN [5, 35] | AlexNet | Conv3 | 31.0 | -\nSS-GAN [2] | ResNet-19 | Block6 | 38.3 | -\nMotion Segmentation (MS) [25, 4] | ResNet-101 | AvePool | 27.6 | 48.3\nExemplar (Ex) [6, 4] | ResNet-101 | AvePool | 31.5 | 53.1\nRelative Position (RP) [3, 4] | ResNet-101 | AvePool | 36.2 | 59.2\nColorization (Col) [34, 4] | ResNet-101 | AvePool | 39.6 | 62.5\nCombination of MS+Ex+RP+Col [4] | ResNet-101 | AvePool | - | 69.3\nCPC [32] | ResNet-101 | AvePool | 48.7 | 73.6\nRotation [9, 20] | RevNet-50 \u00d74 | AvePool | 55.4 | -\nEf\ufb01cient CPC [14] | ResNet-170 | AvePool | 61.0 | 83.0\nBigBiGAN (ours) | ResNet-50 | AvePool | 55.4 | 77.4\nBigBiGAN (ours) | ResNet-50 | BN+CReLU | 56.6 | 78.6\nBigBiGAN (ours) | RevNet-50 \u00d74 | AvePool | 60.8 | 81.4\nBigBiGAN (ours) | RevNet-50 \u00d74 | BN+CReLU | 61.3 | 81.9\n\nTable 2: Comparison of BigBiGAN models on the of\ufb01cial ImageNet validation set against recent\ncompeting approaches with a supervised logistic regression classi\ufb01er. BigBiGAN results are selected\nwith early stopping based on highest accuracy on our trainval subset of 10K training set images.\nResNet-50 results correspond to row ResNet (\u2191 E LR) in Table 1, and RevNet-50 \u00d74 corresponds to\nRevNet \u00d74 (\u2191 E LR).\n\nMethod | Steps | IS (\u2191) | FID vs. Train (\u2193) | FID vs. Val. (\u2193)\nBigGAN + SL [23] | 500K | 20.4 (15.4 \u00b1 7.57) | - | 25.3 (71.7 \u00b1 66.32)\nBigGAN + Clustering [23] | 500K | 22.7 (22.8 \u00b1 0.42) | - | 23.2 (22.7 \u00b1 0.80)\nBigBiGAN + SL (ours) | 500K | 25.38 (25.33 \u00b1 0.17) | 22.78 (22.63 \u00b1 0.23) | 23.60 (23.56 \u00b1 0.12)\nBigBiGAN High Res E + SL (ours) | 500K | 25.43 (25.45 \u00b1 0.04) | 22.34 (22.36 \u00b1 0.04) | 22.94 (23.00 \u00b1 0.15)\nBigBiGAN High Res E + SL (ours) | 1M | 27.94 (27.80 \u00b1 0.21) | 20.32 (20.27 \u00b1 0.09) | 21.61 (21.62 \u00b1 0.09)\n\nTable 3: Comparison of our BigBiGAN for unsupervised (unconditional) generation vs. previously\nreported results for unsupervised BigGAN from [23]. We specify the \u201cpseudo-labeling\u201d method as\nSL (Single Label) or Clustering. For comparison we train BigBiGAN for the same number of steps\n(500K) as the BigGAN-based approaches from [23], but also present results from additional training\nto 1M steps in the last row and observe further improvements. All results above include the median\nm as well as the mean \u00b5 and standard deviation \u03c3 across three runs, written as \u201cm (\u00b5 \u00b1 \u03c3)\u201d. The\nBigBiGAN result is selected with early stopping based on best FID vs. Train.\n\nWe also experiment with learning linear classi\ufb01ers on a different rendering of the AvePool feature,\nlabeled BN+CReLU, which boosts our best results with RevNet \u00d74 to 61.3% top-1 accuracy. Given\nthe global average pooling output a, we \ufb01rst compute h = BatchNorm(a), and the \ufb01nal feature\nis computed by concatenating [ReLU(h), ReLU(\u2212h)], sometimes called a \u201cCReLU\u201d (concatenated\nReLU) non-linearity [29]. BatchNorm denotes parameter-free Batch Normalization [16], where the\nscale (\u03b3) and offset (\u03b2) parameters are not learned, so training a linear classi\ufb01er on this feature does\nnot involve any additional learning. 
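The BN+CReLU feature rendering can be sketched as follows (a minimal NumPy sketch: batch statistics stand in for the usual accumulated moving averages, and the 2048-d pooled features are random stand-ins):

```python
import numpy as np

def bn_crelu(a, eps=1e-5):
    """Parameter-free batch norm (no learned gamma/beta) followed by CReLU:
    h = normalize(a); feature = concat([relu(h), relu(-h)])."""
    h = (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)
    return np.concatenate([np.maximum(h, 0.0), np.maximum(-h, 0.0)], axis=1)

rng = np.random.default_rng(0)
pooled = rng.normal(size=(16, 2048))  # stand-in for globally average-pooled encoder features
feat = bn_crelu(pooled)               # feature dimension is doubled: (16, 4096)
```

Since ReLU(h) \u2212 ReLU(\u2212h) = h, no information is lost by the rendering, and the feature dimension is doubled.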
The CReLU non-linearity retains all the information in its inputs and doubles the feature dimension, each of which likely contributes to the improved results.

Finally, in Appendix C (supplementary material) we consider evaluating representations by zero-shot k nearest neighbors classification, achieving 43.3% top-1 accuracy in this setting. Qualitative examples of nearest neighbors are presented in Appendix C, Figure 11 (supplementary material).

Unsupervised image generation. In Table 3 we show results for unsupervised generation with BigBiGAN, comparing to the BigGAN-based [1] unsupervised generation results from [23]. Note that these results differ from those in Table 1 due to the use of the data augmentation method of [23]6 (rather than the ResNet-style preprocessing used for all results in our Table 1 ablation study).

5Our RevNet ×4 architecture matches the widest architectures used in [20], labeled as ×16 there.

Figure 2: Selected reconstructions from an unsupervised BigBiGAN model (Section 3.3). Top row images are real data x ∼ Px; bottom row images are generated reconstructions of the above image x computed by G(E(x)). Unlike most explicit reconstruction costs (e.g., pixel-wise), the reconstruction cost implicitly minimized by a (Big)BiGAN [5, 8] tends to emphasize more semantic, high-level details. Additional reconstructions are presented in Appendix B (supplementary material).
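The G(E(x)) reconstruction pipeline referenced in the Figure 2 caption can be illustrated with toy stand-in maps (purely hypothetical: linear E and G in place of the actual BigBiGAN networks), showing how a lower-dimensional latent makes reconstructions lossy rather than pixel-perfect:

```python
import numpy as np

rng = np.random.default_rng(0)
D_X, D_Z = 8, 4                    # toy data and latent dimensions (D_Z < D_X)
W_E = rng.normal(size=(D_Z, D_X))  # stand-in "encoder" weights
W_G = np.linalg.pinv(W_E)          # stand-in "generator": pseudo-inverse of E

def E(x):
    """Toy encoder: data -> predicted latent."""
    return W_E @ x

def G(z):
    """Toy generator: latent -> data."""
    return W_G @ z

x = rng.normal(size=D_X)           # a "data instance"
x_rec = G(E(x))                    # reconstruction, as in Figure 2
# Because the latent is lower-dimensional, G(E(x)) projects x onto a
# subspace: information is lost, so the reconstruction is not exact.
```

In this linear caricature, what survives reconstruction is exactly what the encoder's latent captures, loosely mirroring how BigBiGAN reconstructions preserve semantics rather than pixel detail.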
The lighter augmentation from [23] results in better image generation performance under the IS and FID metrics. The improvements are likely due in part to the fact that this augmentation, on average, crops larger portions of the image, thus yielding generators that typically produce images encompassing most or all of a given object. This tends to result in more representative samples of any given class (giving better IS) and more closely matches the statistics of full center crops (as used in the real data statistics to compute FID). Besides this preprocessing difference, the approaches in Table 3 have the same configurations as used in the Base or High Res E (256) row of Table 1.

These results show that BigBiGAN significantly improves both IS and FID over the baseline unconditional BigGAN generation results with the same (unsupervised) "labels" (a single fixed label in the SL (Single Label) approach; row BigBiGAN + SL vs. BigGAN + SL). We see further improvements using a high resolution E (row BigBiGAN High Res E + SL), surpassing the previous unsupervised state of the art (row BigGAN + Clustering) under both IS and FID. (Note that the image generation results remain comparable: the generated image resolution is still 128 × 128 here, despite the higher resolution E input.) The alternative "pseudo-labeling" approach from [23], Clustering, which uses labels derived from unsupervised clustering, is complementary to BigBiGAN, and combining both could yield further improvements.
Finally, observing that results continue to improve significantly with training beyond 500K steps, we also report results at 1M steps in the final row of Table 3.

3.3 Reconstruction

As shown in [5, 8], the (Big)BiGAN E and G can reconstruct data instances x by computing the encoder's predicted latent representation E(x) and then passing this predicted latent back through the generator to obtain the reconstruction G(E(x)). We present BigBiGAN reconstructions in Figure 2. These reconstructions are far from pixel-perfect, likely due in part to the fact that no reconstruction cost is explicitly enforced by the objective; reconstructions are not even computed at training time. However, they may provide some intuition for what features the encoder E learns to model. For example, when the input image contains a dog, person, or a food item, the reconstruction is often a different instance of the same "category" with similar pose, position, and texture, such as a similar species of dog facing the same direction. The extent to which these reconstructions tend to retain the high-level semantics of the inputs rather than the low-level details suggests that BigBiGAN training encourages the encoder to model the former more so than the latter. Additional reconstructions are presented in Appendix B (supplementary material).

6See the "distorted" preprocessing method from the Compare GAN framework: https://github.com/google/compare_gan/blob/master/compare_gan/datasets.py.

4 Related work

A number of approaches to unsupervised representation learning from images based on self-supervision have proven very successful. Self-supervision generally involves learning from tasks designed to resemble supervised learning in some way, but in which the "labels" can be created automatically from the data itself with no manual effort.
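As a concrete illustration of labels created automatically from the data, the rotation-prediction task of [9] can be set up in a few lines (a hedged sketch; the helper name and array shapes are illustrative, not from any cited implementation):

```python
import numpy as np

def rotation_task(images):
    """Build a self-supervised rotation-prediction dataset.

    images: array of square images, shape (N, H, W, C).
    Each image is rotated by 0, 90, 180, and 270 degrees; the number
    of quarter-turns serves as an automatically created label.
    """
    inputs, labels = [], []
    for img in images:
        for k in range(4):                      # k quarter-turns
            inputs.append(np.rot90(img, k, axes=(0, 1)))
            labels.append(k)
    return np.stack(inputs), np.array(labels)
```

A classifier trained to predict these labels requires no manual annotation, which is the defining property of self-supervised tasks.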
An early example is relative location prediction [3], where a model is trained on input pairs of image patches and predicts their relative locations. Contrastive predictive coding (CPC) [32, 14] is a recent related approach where, given an image patch, a model predicts which patches occur in other image locations. Other approaches include colorization [34, 35], motion segmentation [25], rotation prediction [9, 2], GAN-based discrimination [26, 2], and exemplar matching [6]. Rigorous empirical comparisons of many of these approaches have also been conducted [4, 20]. A key advantage offered by BigBiGAN and other approaches based on generative models, relative to most self-supervised approaches, is that their input may be the full-resolution image or other signal, with no cropping or modification of the data needed (though such modifications may be beneficial as data augmentation). This means the resulting representation can typically be applied directly to full data in the downstream task with no domain shift.

A number of relevant autoencoder and GAN variants have also been proposed. Associative compression networks (ACNs) [12] learn to compress at the dataset level by conditioning data on other previously transmitted data which are similar in code space, resulting in models that can "daydream" semantically similar samples, similar to BigBiGAN reconstructions. VQ-VAEs [33] pair a discrete (vector quantized) encoder with an autoregressive decoder to produce faithful reconstructions with a high compression factor, and demonstrate representation learning results in reinforcement learning settings. In the adversarial space, the adversarial autoencoder [24] proposed an autoencoder-style encoder-decoder pair trained with a pixel-level reconstruction cost, replacing the KL-divergence regularization of the prior used in VAEs [19] with a discriminator.
In another proposed VAE-GAN hybrid [21], the pixel-space reconstruction error used in most VAEs is replaced with feature space distance from an intermediate layer of a GAN discriminator. Other hybrid approaches like AGE [31] and α-GAN [27] add an encoder to stabilize GAN training. An interesting difference between many of these approaches and the BiGAN [8, 5] framework is that BiGAN does not train the encoder or generator with an explicit reconstruction cost. Though it can be shown that (Big)BiGAN implicitly minimizes a reconstruction cost, qualitative reconstruction results (Section 3.3) suggest that this reconstruction cost is of a different flavor, emphasizing high-level semantics over pixel-level details.

5 Discussion

We have shown that BigBiGAN, an unsupervised learning approach based purely on generative models, achieves state-of-the-art results in image representation learning on ImageNet. Our ablation study lends further credence to the hope that powerful generative models can be beneficial for representation learning, and in turn that learning an inference model can improve large-scale generative models. In the future we hope that representation learning can continue to benefit from further advances in generative models and inference models alike, as well as scaling to larger image databases.

Acknowledgments

The authors would like to thank Aidan Clark, Olivier Hénaff, Aäron van den Oord, Sander Dieleman, and many other colleagues at DeepMind for useful discussions and feedback on this work.

References

[1] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.

[2] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised GANs via auxiliary rotation loss. In CVPR, 2019.

[3] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.

[4] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.

[5] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.

[6] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. In NeurIPS, 2014.

[7] Charles Dugas, Yoshua Bengio, François Belisle, Claude Nadeau, and Rene Garcia. Incorporating second-order functional knowledge for better option pricing. In NeurIPS, 2000.

[8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.

[9] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

[10] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In NeurIPS, 2017.

[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.

[12] Alex Graves, Jacob Menick, and Aäron van den Oord. Associative compression networks. In arXiv:1804.02476, 2018.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.

[14] Olivier J. Hénaff, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aäron van den Oord. Data-efficient image recognition with contrastive predictive coding. In arXiv:1905.09272, 2019.

[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.

[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv:1502.03167, 2015.

[17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.

[18] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In arXiv:1807.03039, 2018.

[19] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In arXiv:1312.6114, 2013.

[20] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In arXiv:1901.09005, 2019.

[21] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.

[22] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. In arXiv:1705.02894, 2017.

[23] Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, and Sylvain Gelly. High-fidelity image generation with fewer labels. In ICML, 2019.

[24] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In ICLR, 2016.

[25] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.

[26] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[27] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. In arXiv:1706.04987, 2017.

[28] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In arXiv:1606.03498, 2016.

[29] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In ICML, 2016.

[30] Dustin Tran, Rajesh Ranganath, and David M. Blei. Hierarchical implicit models and likelihood-free variational inference. In NeurIPS, 2017.

[31] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. It takes (only) two: Adversarial generator-encoder networks. In arXiv:1704.02304, 2017.

[32] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. In arXiv:1807.03748, 2018.

[33] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In arXiv:1711.00937, 2017.

[34] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.

[35] Richard Zhang, Phillip Isola, and Alexei A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2016.