On Adversarial Mixup Resynthesis

Advances in Neural Information Processing Systems (NeurIPS 2019), pp. 4346–4357

Christopher Beckham^{1,3}, Sina Honari^{1,3}, Vikas Verma^{1,6,†}, Alex Lamb^{1,2}, Farnoosh Ghadiri^{1,3}, R Devon Hjelm^{1,2,5}, Yoshua Bengio^{1,2,*} & Christopher Pal^{1,3,4,‡,*}

^1 Mila - Québec Artificial Intelligence Institute, Montréal, Canada
^2 Université de Montréal, Canada
^3 Polytechnique Montréal, Canada
^4 Element AI, Montréal, Canada
^5 Microsoft Research, Montréal, Canada
^6 Aalto University, Finland

firstname.lastname@mila.quebec
† vikas.verma@aalto.fi, ‡ christopher.pal@polymtl.ca

Abstract

In this paper, we explore new approaches to combining information encoded within the learned representations of auto-encoders.
We explore models that are capable of combining the attributes of multiple inputs such that a resynthesised output is trained to fool an adversarial discriminator for real versus synthesised data. Furthermore, we explore the use of such an architecture in the context of semi-supervised learning, where we learn a mixing function whose objective is to produce interpolations of hidden states, or masked combinations of latent representations that are consistent with a conditioned class label. We show quantitative and qualitative evidence that such a formulation is an interesting avenue of research.¹

1 Introduction

The auto-encoder is a fundamental building block in unsupervised learning. Auto-encoders are trained to reconstruct their inputs after being processed by two neural networks: an encoder which encodes the input to a high-level representation or bottleneck, and a decoder which performs the reconstruction using that representation as input. One primary goal of the auto-encoder is to learn representations of the input data which are useful (Bengio, 2012), which may help in downstream tasks such as classification (Zhang et al., 2017; Hsu et al., 2019) or reinforcement learning (van den Oord et al., 2017; Ha & Schmidhuber, 2018). The representations of auto-encoders can be encouraged to contain more 'useful' information by restricting the size of the bottleneck, through the use of input noise (e.g., in denoising auto-encoders, Vincent et al., 2008), through regularisation of the encoder function (Rifai et al., 2011), or by introducing a prior (Kingma & Welling, 2013).
Other goals include learning interpretable representations (Chen et al., 2016; Jang et al., 2016), disentanglement of latent variables (Liu et al., 2017; Thomas et al., 2017) or maximisation of mutual information (Chen et al., 2016; Belghazi et al., 2018; Hjelm et al., 2019) between the input and the code.

We know that data augmentation greatly helps when it comes to increasing the generalisation performance of models. A practical intuition for why this is the case is that by generating additional samples, we are training our model on a set of examples that better covers those in the test set. In the case of images, we already have a variety of transformation techniques at our disposal, such as random flipping, crops, rotations, and colour jitter. While indispensable, there are other regularisation techniques one can also consider.

¹ Code provided here: https://github.com/christopher-beckham/amr
* Author is a Canada CIFAR AI Chair

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Adversarial mixup resynthesis involves mixing the latent codes used by auto-encoders through an arbitrary mixing mechanism that is able to recombine codes from different inputs to produce novel examples. These novel examples are made to look realistic via the use of adversarial learning. We show the gradual mixing between two real examples of shoes (far left and far right).

Mixup (Zhang et al., 2018) is a regularisation technique which encourages deep neural networks to behave linearly between pairs of data points. These methods artificially augment the training set by producing random convex combinations between pairs of examples and their corresponding labels, and training the network on these combinations. This has the effect of creating smoother decision boundaries, which was shown to have a positive effect on generalisation performance.
Arguably, however, the downside of mixup is that these random convex combinations between images may not look realistic, due to the interpolations being performed on a per-pixel level.

In Verma et al. (2018) and Yaguchi et al. (2019), these random convex combinations are computed in the hidden space of the network. This procedure can be viewed as using the high-level representation of the network to produce novel training examples. Though mixing-based methods have been shown to improve strong baselines in supervised learning (Zhang et al., 2018; Verma et al., 2018) and semi-supervised learning (Verma et al., 2019a; Berthelot et al., 2019; Verma et al., 2019b), there has been relatively little exploration of these methods in the context of unsupervised learning.

This kind of mixing (in latent space) may encourage representations which are more amenable to the idea of systematic generalisation – we would like our model to be able to compose new examples from unseen combinations of latent factors despite only seeing a very small subset of those combinations in training (Bahdanau et al., 2018). Therefore, in this paper we explore the use of such a mechanism in the context of auto-encoders through an exploration of various mixing functions. These mixing functions could consist of continuous interpolations between latent vectors such as in Verma et al. (2018), genetically-inspired recombination such as crossover, or even a deep neural network which learns the mixing operation.
To ensure that the output of the decoder given the mixed representation resembles the data distribution at the pixel level, we leverage adversarial learning (Goodfellow et al., 2014), where here we train a discriminator to distinguish between decoded mixed and real data points. This gives us the ability to simulate novel data points (through exponentially many combinations of latent factors not present in the training set), and also to improve the learned representation, as we will demonstrate on downstream tasks later in this paper. Figure 1 shows one example of such mixing.

2 Formulation

The auto-encoder serves as the baseline for our work since its encoder allows us to infer latent variables, and therefore also allows us to compute mixing operations between those variables. Subsequently, the decoder allows us to visualise these mixed latent variables and (through an adversarial framework) enables us to leverage those mixes to improve representations learned by the auto-encoder.

Let us consider an auto-encoder model F(·), with the encoder part denoted as f(·) and the decoder g(·). In an auto-encoder we wish to minimise the reconstruction error, which is simply:

$$\min_{F} \; \mathbb{E}_{x \sim p(x)} \, \|x - g(f(x))\|^2$$

Because auto-encoders that are trained by pixel-space reconstruction produce low quality images (characterised by blurriness), we augment this baseline by adding an adversarial game to the reconstruction (as done in Larsen et al. (2016)). In turn, the discriminator D tries to distinguish between real and reconstructed x, and the auto-encoder tries to construct 'realistic' reconstructions so as to fool the discriminator.
This formulation serves as our baseline (to make this clear throughout this work, we call this 'AE + GAN'), which can be written as:

$$\min_{F} \; \mathbb{E}_{x \sim p(x)} \; \lambda \|x - g(f(x))\|^2 + \ell_{\text{GAN}}(D(g(f(x))), 1) \quad (1)$$
$$\min_{D} \; \mathbb{E}_{x \sim p(x)} \; \ell_{\text{GAN}}(D(x), 1) + \ell_{\text{GAN}}(D(g(f(x))), 0), \quad (2)$$

where ℓ_GAN is a GAN-specific loss function. In our case, ℓ_GAN is the binary cross-entropy loss, which corresponds to the Jensen-Shannon GAN (Goodfellow et al., 2014).

Figure 2: The unsupervised version of adversarial mixup resynthesis (AMR). In addition to the auto-encoder loss functions, we have a mixing function Mix (called 'mixer' in the figure) which creates some combination between the latent variables h1 and h2, which is subsequently decoded into an image intended to be realistic-looking by fooling the discriminator. Subsequently, the discriminator's job is to distinguish real samples from generated ones and from mixes.

What we would like to do is to encode an arbitrary pair of inputs h1 = f(x1) and h2 = f(x2) into their latent representations, perform some combination between them through a function we denote Mix(h1, h2) (more on this soon), run the result through the decoder g(·), and then minimise some loss function which encourages the resulting decoded mix to look realistic. With this in mind, we propose adversarial mixup resynthesis (AMR), where part of the auto-encoder's objective is to produce mixes which, when decoded, are indistinguishable from real images.
The generator and the discriminator of AMR are trained with the following mixture of loss components:

$$\min_{F} \; \mathbb{E}_{x, x' \sim p(x)} \; \underbrace{\lambda \|x - g(f(x))\|^2}_{\text{reconstruction}} + \underbrace{\ell_{\text{GAN}}(D(g(f(x))), 1)}_{\text{fool } D \text{ with reconstruction}} + \underbrace{\ell_{\text{GAN}}(D(g(\text{Mix}(f(x), f(x')))), 1)}_{\text{fool } D \text{ with mixes}}$$
$$\min_{D} \; \mathbb{E}_{x, x' \sim p(x)} \; \underbrace{\ell_{\text{GAN}}(D(x), 1)}_{\text{label } x \text{ as real}} + \underbrace{\ell_{\text{GAN}}(D(g(f(x))), 0)}_{\text{label reconstruction as fake}} + \underbrace{\ell_{\text{GAN}}(D(g(\text{Mix}(f(x), f(x')))), 0)}_{\text{label mixes as fake}} \quad (3)$$

The AMR model is shown in Figure 2. There are many ways one could combine the two latent representations, and we denote this function Mix(h1, h2). Manifold mixup (Verma et al., 2018) implements mixing in the hidden space through convex combinations:

$$\text{Mix}_{\text{mixup}}(h_1, h_2) = \alpha h_1 + (1 - \alpha) h_2, \quad (4)$$

where α ∈ [0, 1] is sampled from a Uniform(0, 1) distribution. We can interpret this as interpolating along line segments, as shown in Figure 3 (left).

We also explore a strategy in which we randomly retain some components of the hidden representation from h1 and use the rest from h2. In this case we randomly sample a binary mask m ∈ {0, 1}^k (where k denotes the number of feature maps) and perform the following operation:

$$\text{Mix}_{\text{Bern}}(h_1, h_2) = m h_1 + (1 - m) h_2, \quad (5)$$

where m is sampled from a Bernoulli(p) distribution (p can simply be sampled uniformly) and the multiplication is element-wise.
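Both mixing functions are only a few lines each. The following is a minimal NumPy sketch (the released code is in PyTorch; the helper names and latent shapes here are our own illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_mixup(h1, h2, alpha=None):
    """Equation 4: convex combination of two latent codes."""
    if alpha is None:
        alpha = rng.uniform(0.0, 1.0)  # alpha ~ Uniform(0, 1)
    return alpha * h1 + (1.0 - alpha) * h2

def mix_bern(h1, h2, p=None):
    """Equation 5: binary crossover, one Bernoulli draw per feature map."""
    if p is None:
        p = rng.uniform(0.0, 1.0)  # p itself can be sampled uniformly
    k = h1.shape[0]  # number of feature maps
    # Mask of shape (k, 1, 1, ...) so it broadcasts over spatial dimensions.
    m = rng.binomial(1, p, size=(k,) + (1,) * (h1.ndim - 1))
    return m * h1 + (1 - m) * h2

# Toy latent codes: 2 feature maps of spatial size 4x4.
h1, h2 = np.ones((2, 4, 4)), np.zeros((2, 4, 4))
mixed = mix_mixup(h1, h2, alpha=0.25)  # every entry equals 0.25
crossed = mix_bern(h1, h2)             # each feature map comes wholly from h1 or h2
```

With these toy inputs the contrast between the two mixers is easy to see: mixup blends every entry, while the Bernoulli mixer swaps whole feature maps between the two codes.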
This formulation is interesting in the sense that it is very reminiscent of crossover in biological reproduction: the auto-encoder has to organise feature maps in such a way that any recombination between sets of feature maps must decode into realistic-looking images.

2.1 Mixing with k examples

We can generalise the above mixing functions to operate on more than just two examples.

Figure 3: Left: mixup (Equation 4), with interpolated points in blue corresponding to line segments between the three points shown in red. Middle: triplet mixup (Equation 6). Right: Bernoulli mixup (Equation 5).

Figure 4: The supervised version of Bernoulli mixup. In this, we learn an embedding function embed(y) (an MLP) which maps y to Bernoulli parameters p ∈ [0, 1]^k, from which a Bernoulli mask m ∼ Bernoulli(p) is sampled. The resulting mix is then simply mh1 + (1 − m)h2. Intuitively, the embedding function can be thought of as a function which decides what feature maps need to be recombined from h1 and h2 in order to produce a mix which satisfies the attribute vector y.

For instance, in the case of mixup (Equation 4), if we were to mix between examples {h_1, ..., h_k}, we can simply sample α ∼ Dirichlet(1, ..., 1),² where α ∈ [0, 1]^k and Σ_{i=1}^{k} α_i = 1, and compute the dot product between this and the hidden states:

$$\alpha_1 h_1 + \cdots + \alpha_k h_k = \sum_{j=1}^{k} \alpha_j h_j. \quad (6)$$

One can think of this process as being equivalent to doing multiple iterations (or, in biological terms, generations) of mixing. For example, in the case of a large k:

$$\alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3 + \cdots = \ldots \bigl( \underbrace{\underbrace{(\alpha_1 h_1 + \alpha_2 h_2)}_{\text{first iteration}} + \alpha_3 h_3}_{\text{second iteration}} \bigr) \ldots
We show the k = 3 case in Figure 3 (middle).

² Another way to say this is that for mixing k examples, we sample α from a (k − 1)-simplex. This means that when k = 2 we are sampling from a 1-simplex (a line segment), when k = 3 we are sampling from a 2-simplex (a triangle), and so forth.

2.2 Using labels

While it is interesting to generate new examples via random mixing strategies in the hidden states, we also explore a supervised formulation in which we learn a function that can produce specific kinds of mixes between two examples such that they are consistent with a particular class label. We make this possible by backpropagating through a classifier network p(y|x) which branches off the end of the discriminator, i.e., an auxiliary classifier GAN (Odena et al., 2017).

Let us assume that for some image x, we have a set of associated binary attributes y, where y ∈ {0, 1}^k (and k ≥ 1). We introduce an embedding function embed(y), which is an MLP (whose parameters are learned in unison with the auto-encoder) that maps y to Bernoulli parameters p ∈ [0, 1]^k. These parameters are used to sample a Bernoulli mask m ∼ Bernoulli(p) to produce a new combination trained to have the class label y (for the sake of convenience, we can summarise the embedding and sampling steps as simply Mix_sup(h1, h2, y)). Note that the conditioning class label should be semantically meaningful with respect to both of the conditioned hidden states. For example, if we're producing mixes based on the gender attribute and both h1 and h2 are male, it would not make sense to condition on the 'female' label, since the class mixer only recombines rather than adding new information.
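The embed-then-mask step described above can be sketched in a few lines. In this sketch the 'MLP' is a single random linear layer with a sigmoid, and the class name, shapes, and attribute vectors are all illustrative assumptions; in the paper the embedding is trained jointly with the auto-encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ClassMixer:
    """Maps an attribute vector y to Bernoulli parameters p in [0, 1]^k,
    then mixes two latent codes with a mask m ~ Bernoulli(p).

    A single untrained linear layer stands in for the paper's learned MLP.
    """
    def __init__(self, n_attrs, k):
        self.W = rng.normal(0.0, 0.1, size=(n_attrs, k))
        self.b = np.zeros(k)

    def __call__(self, h1, h2, y_mix):
        p = sigmoid(y_mix @ self.W + self.b)         # Bernoulli parameters
        m = rng.binomial(1, p)                       # one mask entry per feature map
        m = m.reshape((-1,) + (1,) * (h1.ndim - 1))  # broadcast over spatial dims
        return m * h1 + (1 - m) * h2

mixer = ClassMixer(n_attrs=3, k=2)
h1, h2 = np.ones((2, 4, 4)), np.zeros((2, 4, 4))
y1, y2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0])
alpha = rng.uniform()
y_mix = alpha * y1 + (1 - alpha) * y2                # convex label combination
h_mix = mixer(h1, h2, y_mix)
```

In training, the decoded h_mix would then be pushed through the discriminator's classifier branch, whose loss supervises the embedding's parameters.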
To enforce this constraint, during training we simply make the conditioning label a convex combination ỹ_mix = αy_1 + (1 − α)y_2 as well, using α ∼ Uniform(0, 1). This is summarised in Figure 4.

Concretely, the auto-encoder and discriminator, in addition to their unsupervised losses described in Equation 3, try to minimise their respective supervised losses:

$$\min_{F} \; \mathbb{E}_{(x_1, y_1) \sim p(x,y),\, (x_2, y_2) \sim p(x,y),\, \alpha \sim U(0,1)} \; \underbrace{\ell_{\text{GAN}}(D(g(\tilde{h}_{\text{mix}})), 1)}_{\text{fool } D \text{ with mix}} + \underbrace{\ell_{\text{cls}}(p(y \mid g(\tilde{h}_{\text{mix}})), \tilde{y}_{\text{mix}})}_{\text{make mix's class consistent}}$$
$$\min_{D} \; \mathbb{E}_{(x_1, y_1) \sim p(x,y),\, (x_2, y_2) \sim p(x,y),\, \alpha \sim U(0,1)} \; \underbrace{\ell_{\text{GAN}}(D(g(\tilde{h}_{\text{mix}})), 0)}_{\text{label mixes as fake}} \quad (7)$$

where ỹ_mix = αy_1 + (1 − α)y_2 and h̃_mix = Mix_sup(f(x_1), f(x_2), ỹ_mix).

3 Related work

Our method can be thought of as an extension of auto-encoders that allows for sampling through mixing operations, such as continuous interpolations and masking operations. Variational auto-encoders (VAEs, Kingma & Welling, 2013) can also be thought of as a similar extension of auto-encoders, using the outputs of the encoder as parameters for an approximate posterior q(z|x) which is matched to a prior distribution p(z) through the evidence lower bound objective (ELBO). At test time, new data points are sampled by passing samples from the prior, z ∼ p(z), through the decoder. The fundamental difference here is that the output of the encoder is constrained to come from a pre-defined prior distribution, whereas we impose no constraint, at least not in the probabilistic sense.

The ACAI algorithm (adversarially constrained auto-encoder interpolation) is another approach which involves sampling interpolations as part of an unsupervised objective (Berthelot et al., 2019).
ACAI uses a discriminator network to predict the mixing coefficient α from the decoded output of the mixed representation, and the auto-encoder tries to 'fool' the discriminator by making it predict either α = 0 or α = 1, making interpolated points indistinguishable from real ones. One of the main differences is that in our framework the discriminator output is agnostic to the mixing function used, so rather than trying to predict the parameter(s) of the mix (in this case, α), it is only required to predict whether the mix is real or fake (1/0). On a more technical level, the type of GAN they employ is the least squares GAN (Mao et al., 2017), whereas we use the JSGAN (Goodfellow et al., 2014) and spectral normalisation (Miyato et al., 2018) to impose a Lipschitz constraint on the discriminator, which is known to be very effective in minimising stability issues in training.

The GAIA algorithm (Sainburg et al., 2018) uses a BEGAN framework with an additional interpolation-based adversarial objective. In this work, the mixing function involves interpolating with an α ∼ N(µ, σ), where µ is defined as the midpoint between the two hidden states h1 and h2. For their supervised formulation, the authors use a simple technique in which average latent vectors are computed over images with particular attributes. For example, h̄_female and h̄_glasses could denote the average latent vectors over all images of women and all images of people wearing glasses, respectively. One can then perform arithmetic over these different vectors to produce novel images, e.g. h̄_female + h̄_glasses. However, this approach is crude in the sense that these vectors are confounded by and correlated with other irrelevant attributes in the dataset.
Conversely, in our technique, we learn a mixing function which tries to produce combinations between latent states consistent with a class label by backpropagating through the classifier branch of the discriminator. If the resulting mix contains confounding attributes, then the mixing function would be penalised for doing so.

What primarily differentiates our work from theirs is that we perform an exploration into different kinds of mixing functions, including a semi-supervised variant which uses an MLP to produce mixes consistent with a class label. In addition to systematic generalisation, our work is partly motivated by processes which occur in sexual reproduction; for example, Bernoulli mixup can be seen as the analogue to crossover in the genetic algorithm setting, similar to how dropout (Srivastava et al., 2014) can be seen as being analogous to random mutations. We find this connection to be appealing, as there has been some interest in leveraging concepts from evolution and biology in deep learning, for instance meta-learning (Bengio et al., 1991), dropout (as previously mentioned), biologically plausible deep learning (Bengio et al., 2015) and evolutionary strategies for reinforcement learning (Such et al., 2017; Salimans et al., 2017).

4 Results

In this section we evaluate the classification accuracy of AMR on various datasets by training a linear classifier on the latent features of the unsupervised variant of the model. We also evaluate our model on a disentanglement task, which is also unsupervised. Finally, we demonstrate some qualitative results.

4.1 Classification of learned features

One way to evaluate the usefulness of the learned representation is to evaluate its performance on some downstream tasks.
Similar to what was done in ACAI, we modify our training procedure by attaching a linear classification network to the output of the encoder and train it in unison with the other objectives. The classifier does not contribute any gradients back into the auto-encoder, so it simply acts as a probe (Alain & Bengio, 2016) whose accuracy can be monitored over time to quantify the usefulness of the representation learned by the encoder.

We employ the following datasets for classification: MNIST (Deng, 2012), KMNIST (Clanuwat et al., 2018), and SVHN (Netzer et al., 2011). We perform three runs for each experiment, and from each run we collect the highest accuracy on the validation set over the entire course of training, from which we compute the mean and standard deviation. Hyperparameter tuning on λ was performed manually (this essentially controls the trade-off between the reconstruction and adversarial losses), and we experimented with a reasonable range of values (i.e., {2, 5, 10, 20, 50}). We experiment with three mixing functions: mixup (Equation 4), Bernoulli mixup (Equation 5),³ and the various higher-order versions with k > 2 (see Section 2.1). The number of epochs we trained for depends on the dataset (since some datasets converged faster than others) and we indicate this in each table's caption.

In Table 1 we show results on relatively simple datasets – MNIST, KMNIST, and SVHN – with an encoding dimension of d_h = 32 (more concretely, a bottleneck of two feature maps of spatial dimension 4 × 4). In Table 2 we explore the effect of data ablation on SVHN with the same encoding dimension but randomly retaining 1k, 5k, 10k, and 20k examples in the training set, to examine the efficacy of AMR in the low-data setting.
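The probe setup reduces to fitting a linear softmax classifier on encoder features that are treated as constants, so no gradients flow back into the auto-encoder. A minimal NumPy stand-in (the synthetic features and function name are illustrative assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear_probe(features, labels, n_classes, lr=0.1, epochs=200):
    """Fit a linear softmax classifier on frozen encoder features.

    `features` is treated as a constant array (the probe sends no gradient
    into the encoder), so only W and b are updated.
    """
    n, d = features.shape
    W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                      # softmax cross-entropy gradient
        W -= lr * features.T @ grad                      # no gradient reaches `features`
        b -= lr * grad.sum(axis=0)
    return W, b

# Two linearly separable clusters standing in for encoder outputs.
feats = np.vstack([rng.normal(-2, 0.5, (50, 8)), rng.normal(2, 0.5, (50, 8))])
labs = np.array([0] * 50 + [1] * 50)
W, b = train_linear_probe(feats, labs, n_classes=2)
acc = ((feats @ W + b).argmax(axis=1) == labs).mean()
```

In the actual experiments the probe is trained jointly with the other objectives (with the gradient blocked at the encoder output), and its validation accuracy is what the tables below report.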
Lastly, in Table 3 we evaluate AMR in a higher dimensional setting, trying out SVHN with d_h = 256 (i.e., a spatial dimension of 16 × 4 × 4) and CIFAR10 with d_h = 256 and d_h = 1024 (a spatial dimension of 64 × 4 × 4). These encoding dimensions were chosen so as to conform to ACAI's experimental setup.

In terms of training hyperparameters, we used ADAM (Kingma & Ba, 2014) with a learning rate of 10⁻⁴, β1 = 0.5 and β2 = 0.99, and an L2 weight decay of 10⁻⁵. For architectural details, please consult the README file in the code repository.⁴

4.2 Disentanglement

Lastly, we run experiments on the DSprite (Matthey et al., 2017) dataset, a 2D sprite dataset whose images are generated with six known (ground truth) latent factors. Latent encodings produced by auto-encoders trained on this dataset can be used in conjunction with a disentanglement metric (see Higgins et al. (2017); Kim & Mnih (2018)), which measures the extent to which the learned encodings are able to recover the ground truth latent factors. These results are shown in Table 4. We can see that among the AMR methods, Bernoulli mixing performs the best, especially the triplet formulation. β-VAE performs the best overall, and this may be in part due to the fact that the prior distribution on the latent encoding is an independent Gaussian, which may encourage those variables to behave more independently.

³ Due to time / resource constraints, we were unable to explore Bernoulli mixup as exhaustively as mixup, and therefore we have not shown k > 3 results for this algorithm.
⁴ The architectures we used were based off a public PyTorch reimplementation of ACAI, which may not be exactly the same as the original implemented in TensorFlow. See the anonymized Github link for more details.

Table 1: Classification accuracy results when training a linear classifier probe on top of the auto-encoder's encoder output (d_h = 32).
Each experiment was run thrice. († = results taken from the original paper.) MNIST, KMNIST, and SVHN were trained for 2k, 5k, and 4.5k epochs, respectively. AE+GAN = adversarial reconstruction auto-encoder (Equation 2); AMR = adversarial mixup resynthesis (ours); ACAI = adversarially constrained auto-encoder interpolation (Berthelot et al., 2019).

Method   Mix     k   MNIST (λ)           KMNIST (λ)           SVHN (λ)
AE+GAN   -       -   97.52 ± 0.29 (5)    76.18 ± 1.79 (10)    37.01 ± 2.22 (5)
AMR      mixup   2   98.01 ± 0.10 (10)   80.39 ± 3.11 (10)    43.98 ± 3.05 (10)
AMR      Bern    2   97.76 ± 0.58 (10)   81.54 ± 3.46 (10)    38.31 ± 2.68 (10)
AMR      mixup   3   97.61 ± 0.15 (20)   77.20 ± 0.43 (10)    47.34 ± 3.79 (10)
ACAI     mixup   2   98.66 ± 0.36 (2)    84.67 ± 1.16 (10)    34.74 ± 1.12 (2)
ACAI†    mixup   2   98.25 ± 0.11 (N/A)  -                    34.47 ± 1.14 (N/A)

Table 2: Classification accuracy results when training a linear classifier probe on top of the auto-encoder's encoder output (d_h = 32) for various training set sizes for SVHN (1k, 5k, 10k, and 20k, for 6k, 6k, 6k, and 4k epochs respectively).

Method   Mix     k   SVHN(1k) (λ)        SVHN(5k) (λ)        SVHN(10k) (λ)       SVHN(20k) (λ)
AE+GAN   -       -   22.71 ± 0.73 (10)   25.35 ± 0.44 (10)   26.18 ± 0.81 (10)   29.21 ± 1.01 (20)
AMR      mixup   2   21.89 ± 0.19 (10)   25.41 ± 1.15 (20)   30.87 ± 0.74 (10)   36.27 ± 3.76 (10)
AMR      Bern    2   22.59 ± 1.31 (20)   26.07 ± 1.87 (20)   30.12 ± 2.37 (10)   35.98 ± 0.56 (10)
AMR      mixup   3   22.96 ± 0.69 (10)   29.92 ± 3.37 (10)   31.87 ± 0.68 (10)   37.04 ± 2.32 (10)
ACAI     mixup   2   24.15 ± 1.65 (10)   29.58 ± 1.08 (10)   29.56 ± 0.97 (2)    31.23 ± 0.31 (5)

Table 3: Classification accuracy results
on SVHN (d_h = 256) and CIFAR10 (d_h ∈ {256, 1024}). These configurations were trained for 4k, 3k, and 8k epochs, respectively. († = results from original paper.)

Method   Mix     k   SVHN (256) (λ)      CIFAR10 (256) (λ)   CIFAR10 (1024) (λ)
AE+GAN   -       -   59.00 ± 0.12 (5)    53.08 ± 0.28 (50)   59.93 ± 0.60 (50)
AMR      mixup   2   71.51 ± 1.35 (5)    54.24 ± 0.42 (50)   60.80 ± 0.79 (50)
AMR      Bern    2   58.64 ± 2.18 (10)   52.40 ± 0.51 (50)   59.81 ± 0.56 (50)
AMR      mixup   3   73.33 ± 3.23 (5)    54.94 ± 0.37 (50)   61.68 ± 0.67 (50)
AMR      mixup   4   74.69 ± 1.11 (5)    54.68 ± 0.33 (50)   61.72 ± 0.20 (50)
AMR      mixup   6   73.85 ± 0.84 (5)    52.95 ± 0.92 (50)   60.34 ± 0.82 (50)
AMR      mixup   8   75.71 ± 1.29 (5)    53.07 ± 1.04 (50)   59.75 ± 1.04 (50)
ACAI     mixup   2   68.64 ± 1.50 (2)    50.06 ± 1.33 (20)   57.42 ± 1.29 (20)
ACAI†    mixup   2   85.14 ± 0.20 (N/A)  52.77 ± 0.45 (N/A)  63.99 ± 0.47 (N/A)

4.3 Qualitative results (unsupervised)

Due to space constraints, we show qualitative results in the supplementary material. We compare interpolations (between our technique, ACAI, AE+GAN, and pixel-space interpolation) on three datasets: SVHN (Netzer et al., 2011), CelebA (Liu et al., 2015), and Zappos shoes (Yu & Grauman, 2014, 2017). It can easily be seen that AMR produces realistic-looking mixes with significantly less 'ghosting' or 'artifacting' than is exhibited by the baselines. The supplementary material also explains an extra 'consistency loss' term which was used to improve the quality of the interpolation trajectory between two images.

Table 4: Results on DSprite using the disentanglement metric proposed in Kim & Mnih (2018).
For β-VAE (Higgins et al., 2017), we show the results corresponding to the best-performing β values. For AMR, λ = 1, since this performed the best.

Method         Mix     k   Accuracy
VAE (β = 100)  -       -   68.00 ± 3.89
AE+GAN         -       -   45.12 ± 2.68
AMR            mixup   2   49.00 ± 6.72
AMR            Bern    2   53.00 ± 1.59
AMR            mixup   3   51.13 ± 4.95
AMR            Bern    3   56.00 ± 0.91

4.4 Qualitative results (supervised)

We present some qualitative results with the supervised formulation. We train our supervised AMR variant using a subset of the attributes in CelebA ('is male', 'is wearing heavy makeup', and 'is wearing lipstick'). We consider pairs of examples {(x1, y1), (x2, y2)} (where one example is male and the other female), produce random convex combinations of the attributes ỹ_mix = αy_1 + (1 − α)y_2, and decode their resulting mixes Mix_sup(f(x1), f(x2), ỹ_mix). This can be seen in Figure 5.

Figure 5: Interpolations produced by the class mixer function for the set of binary attributes {male, heavy makeup, lipstick}. For each image, the left-most face is x1 and the right-most face x2, with faces in between consisting of mixes Mix_sup(f(x1), f(x2), ỹ_mix) of a particular attribute mix ỹ_mix, shown below each column (where red denotes 'off' and green denotes 'on').

5 Discussion

The results we present generally show there are benefits to mixing. In Table 1 we obtain the best results across SVHN, with k = 3 mixup performing the best. ACAI also performed quite competitively, achieving the best results on MNIST and KMNIST. In Table 2 we find that the triplet formulation of mixup (i.e.
k = 3) performed the best for 20k, 10k, and 5k examples. In Table 3 we experiment with values of k > 3 and find that higher-order mixing performs the best amongst our experiments, for instance k = 8 mixup for SVHN (256), k = 3 mixup for CIFAR10 (256) and k = 4 mixup for CIFAR10 (1024). Bernoulli mixup with k = 2 tends to be inferior to mixup with k = 2, although one can see from Figure 3 that in that regime it generates nowhere near as many possible mixes as mixup, and it would certainly be worth exploring this mixing algorithm for higher values of k. While we were not able to achieve ACAI's quoted results for those configurations, our own implementation of it has the benefit of having fewer confounding factors at play, due to it falling under the same experimental setup as our proposed method.

Although we have shown that mixing is in general beneficial for improving unsupervised representations, in some cases performance gains are only on the order of a few percentage points, as in the case of CIFAR10. This may be due to the fact that it is relatively more difficult to generate realistic mixes for 'natural' datasets such as CIFAR10. Even if we took a relatively simpler dataset such as CelebA, it would be much easier to generate mixes if the faces were constrained in pose and orientation than if they were allowed to freely vary (this pose and orientation 'mismatch' can be seen in some of the CelebA interpolations in the appendix). Perhaps this would justify mixing in a vector latent space rather than a spatial one. Lastly, in order to further establish the efficacy of these techniques, they should also be evaluated in the context of supervised or semi-supervised learning, such as in Verma et al.
(2018).

A remaining concern we would like to address is the more theoretical aspects of the different mixing functions and whether any interesting mathematical implications arise from their use, since it is not yet clear which mixing function should be preferred beyond running a hyperparameter search. Although Bernoulli mixup was not explored as thoroughly, the disentanglement results in Table 4 appear to favour it, and we have also shown how it can be used to perform class-conditional mixes, with a mixing function determining which feature maps from a pair of examples should be combined to produce a mix consistent with a particular set of attributes. This could be leveraged as a data augmentation tool to produce examples for under-represented classes.

While our work has dealt with mixing at the feature level, there has been some work using mixup-related strategies at the spatial level. For example, 'cutmix' (Yun et al., 2019) proposes a mixing scheme in input space where contiguous spatial regions of one image are combined with regions from another image. Conversely, 'dropblock' (Ghiasi et al., 2018) proposes to drop contiguous spatial regions in feature space. One could, however, combine these two ideas in a new mixing function that mixes spatial regions between pairs of examples in feature space. We believe we have only just scratched the surface in terms of the kinds of mixing functions one can utilise.

One could expand on these results by experimenting with deeper classifiers on top of the bottlenecks, or by considering the fully-supervised case, back-propagating the classifier's gradients into the auto-encoder. Note that while the use of mixup to augment supervised learning was explored in Verma et al. (2018), in their algorithm artificial examples are created by mixing hidden states and their respective labels for a classifier.
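To make the distinction between the mixing functions concrete, the following is a minimal NumPy sketch of mixup-style and Bernoulli-style mixing of latent codes. The function names are our own, and sampling convex weights from a uniform Dirichlet for k > 2 is an illustrative assumption rather than the paper's exact implementation:

```python
import numpy as np

def mixup_mix(codes, alphas=None, rng=None):
    """Mixup-style mixing: a convex combination of k latent codes.
    codes: array of shape (k, d). alphas: k non-negative weights summing to 1;
    if omitted, drawn uniformly from the simplex (Dirichlet(1, ..., 1))."""
    codes = np.asarray(codes, dtype=float)
    rng = rng or np.random.default_rng()
    if alphas is None:
        alphas = rng.dirichlet(np.ones(codes.shape[0]))
    # Contract the weight vector against the first axis of `codes`.
    return np.tensordot(np.asarray(alphas, dtype=float), codes, axes=1)

def bernoulli_mix(code1, code2, p=0.5, rng=None):
    """Bernoulli-style mixing: each latent dimension is copied wholesale
    from code1 with probability p, otherwise from code2."""
    rng = rng or np.random.default_rng()
    mask = rng.binomial(1, p, size=np.shape(code1))
    return mask * code1 + (1 - mask) * code2
```

In both cases the mix lives in latent space and would then be decoded and passed to the discriminator; Bernoulli mixing selects each dimension (or, in a convolutional setting, each feature map) from one input or the other, whereas mixup blends them continuously.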
If our formulation were to be used in the supervised case, no label mixing would be needed, since the discriminator only tries to distinguish between real latent points and mixed ones. Furthermore, if it were to be used in the semi-supervised case, any unlabeled examples can simply be used to minimise the unsupervised parts of the network (namely, the reconstruction loss and the adversarial component), without the need to backprop through the linear classifier using pseudo-labels (this would at least avoid the need to devise a schedule determining at what rate or confidence pseudo-examples should be mixed in with real training examples).

6 Conclusion

In conclusion, we present adversarial mixup resynthesis, a study in which we explore different ways of combining the representations learned by autoencoders through the use of mixing functions. We motivated this technique as a way to address the issue of systematic generalisation, in which we would like a learner to perform well over new and unseen configurations of latent features learned from the training distribution. We examined the performance of these new mixing-induced representations on downstream tasks using linear classifiers and achieved promising results. Our next step is to further quantify performance on downstream tasks with more sophisticated datasets and model architectures.

Acknowledgments

We thank Compute Canada for GPU access, and NVIDIA for donating a DGX-1 used for this research. We also thank Huawei for their support. Vikas Verma was supported by Academy of Finland project 13312683 / Raiko Tapani AT kulut.

References

Alain, Guillaume and Bengio, Yoshua. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.

Bahdanau, Dzmitry, Murty, Shikhar, Noukhovitch, Michael, Nguyen, Thien Huu, de Vries, Harm, and Courville, Aaron C.
Systematic generalization: What is required and can it be learned? CoRR, abs/1811.12889, 2018. URL http://arxiv.org/abs/1811.12889.

Belghazi, Mohamed Ishmael, Baratin, Aristide, Rajeshwar, Sai, Ozair, Sherjil, Bengio, Yoshua, Courville, Aaron, and Hjelm, Devon. Mutual information neural estimation. In Dy, Jennifer and Krause, Andreas (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 531–540, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/belghazi18a.html.

Bengio, Y, Bengio, S, and Cloutier, J. Learning a synaptic learning rule. In IJCNN-91, International Joint Conference on Neural Networks, volume 2. IEEE, 1991.

Bengio, Yoshua. Deep learning of representations for unsupervised and transfer learning. In Guyon, Isabelle, Dror, Gideon, Lemaire, Vincent, Taylor, Graham, and Silver, Daniel (eds.), Proceedings of ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proceedings of Machine Learning Research, pp. 17–36, Bellevue, Washington, USA, 02 Jul 2012. PMLR. URL http://proceedings.mlr.press/v27/bengio12a.html.

Bengio, Yoshua, Lee, Dong-Hyun, Bornschein, Jorg, Mesnard, Thomas, and Lin, Zhouhan. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.

Berthelot, David, Carlini, Nicholas, Goodfellow, Ian, Papernot, Nicolas, Oliver, Avital, and Raffel, Colin. MixMatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.

Berthelot, David, Raffel, Colin, Roy, Aurko, and Goodfellow, Ian. Understanding and improving interpolation in autoencoders via an adversarial regularizer. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1fQSiCcYm.

Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter.
InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.

Clanuwat, Tarin, Bober-Irizar, Mikel, Kitamoto, Asanobu, Lamb, Alex, Yamamoto, Kazuaki, and Ha, David. Deep learning for classical Japanese literature, 2018.

Deng, Li. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

Ghiasi, Golnaz, Lin, Tsung-Yi, and Le, Quoc V. DropBlock: A regularization method for convolutional networks. CoRR, abs/1810.12890, 2018. URL http://arxiv.org/abs/1810.12890.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Ha, David and Schmidhuber, Jürgen. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pp. 2451–2463. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution; https://worldmodels.github.io.

Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, and Lerchner, Alexander. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.

Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, and Bengio, Yoshua. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bklr3j0cKX.

Hsu, Kyle, Levine, Sergey, and Finn, Chelsea. Unsupervised learning via meta-learning.
In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1My6sR9tX.

Jang, Eric, Gu, Shixiang, and Poole, Ben. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

Kim, Hyunjik and Mnih, Andriy. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.

Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, Larochelle, Hugo, and Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. In Balcan, Maria Florina and Weinberger, Kilian Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1558–1566, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/larsen16.html.

Liu, Ming-Yu, Breuel, Thomas, and Kautz, Jan. Unsupervised image-to-image translation networks. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 700–708. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6672-unsupervised-image-to-image-translation-networks.pdf.

Liu, Ziwei, Luo, Ping, Wang, Xiaogang, and Tang, Xiaoou. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

Mao, Xudong, Li, Qing, Xie, Haoran, Lau, Raymond YK, Wang, Zhen, and Paul Smolley, Stephen. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.

Matthey, Loic, Higgins, Irina, Hassabis, Demis, and Lerchner, Alexander.
dSprites: Disentanglement testing Sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

Miyato, Takeru, Kataoka, Toshiki, Koyama, Masanori, and Yoshida, Yuichi. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Odena, Augustus, Olah, Christopher, and Shlens, Jonathon. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 2642–2651. JMLR.org, 2017.

Rifai, Salah, Vincent, Pascal, Muller, Xavier, Glorot, Xavier, and Bengio, Yoshua. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pp. 833–840, USA, 2011. Omnipress. ISBN 978-1-4503-0619-5. URL http://dl.acm.org/citation.cfm?id=3104482.3104587.

Sainburg, Tim, Thielk, Marvin, Theilman, Brad, Migliori, Benjamin, and Gentner, Timothy. Generative adversarial interpolative autoencoding: adversarial training on latent space interpolations encourage convex latent distributions. CoRR, abs/1807.06650, 2018. URL http://arxiv.org/abs/1807.06650.

Salimans, Tim, Ho, Jonathan, Chen, Xi, Sidor, Szymon, and Sutskever, Ilya. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting.
The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Such, Felipe Petroski, Madhavan, Vashisht, Conti, Edoardo, Lehman, Joel, Stanley, Kenneth O., and Clune, Jeff. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. CoRR, abs/1712.06567, 2017. URL http://arxiv.org/abs/1712.06567.

Thomas, Valentin, Pondard, Jules, Bengio, Emmanuel, Sarfati, Marc, Beaudoin, Philippe, Meurs, Marie-Jean, Pineau, Joelle, Precup, Doina, and Bengio, Yoshua. Independently controllable factors. CoRR, abs/1708.01289, 2017. URL http://arxiv.org/abs/1708.01289.

van den Oord, Aaron, Vinyals, Oriol, and Kavukcuoglu, Koray. Neural discrete representation learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6306–6315. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7210-neural-discrete-representation-learning.pdf.

Verma, Vikas, Lamb, Alex, Beckham, Christopher, Najafi, Amir, Mitliagkas, Ioannis, Courville, Aaron, Lopez-Paz, David, and Bengio, Yoshua. Manifold mixup: Better representations by interpolating hidden states. arXiv preprint arXiv:1806.05236, 2018.

Verma, Vikas, Lamb, Alex, Kannala, Juho, Bengio, Yoshua, and Lopez-Paz, David. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825, 2019a.

Verma, Vikas, Qu, Meng, Lamb, Alex, Bengio, Yoshua, Kannala, Juho, and Tang, Jian. GraphMix: Regularized training of graph neural networks for semi-supervised learning. arXiv preprint arXiv:1909.11715, 2019b.

Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders.
In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp. 1096–1103, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390294. URL http://doi.acm.org/10.1145/1390156.1390294.

Yaguchi, Yoichi, Shiratani, Fumiyuki, and Iwaki, Hidekazu. MixFeat: Mix feature in latent space learns discriminative space, 2019. URL https://openreview.net/forum?id=HygT9oRqFX.

Yu, A. and Grauman, K. Fine-grained visual comparisons with local learning. In Computer Vision and Pattern Recognition (CVPR), Jun 2014.

Yu, A. and Grauman, K. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In International Conference on Computer Vision (ICCV), Oct 2017.

Yun, Sangdoo, Han, Dongyoon, Oh, Seong Joon, Chun, Sanghyuk, Choe, Junsuk, and Yoo, Youngjoon. CutMix: Regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899, 2019.

Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N., and Lopez-Paz, David. mixup: Beyond empirical risk minimization. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb.

Zhang, Richard, Isola, Phillip, and Efros, Alexei A. Split-brain autoencoders: Unsupervised learning by cross-channel prediction.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1058–1067, 2017.