{"title": "Invariant Representations without Adversarial Training", "book": "Advances in Neural Information Processing Systems", "page_first": 9084, "page_last": 9093, "abstract": "Representations of data that are invariant to changes in specified factors are useful for a wide range of problems: removing potential biases in prediction problems, controlling the effects of covariates, and disentangling meaningful factors of variation. Unfortunately, learning representations that exhibit invariance to arbitrary nuisance factors yet remain useful for other tasks is challenging. Existing approaches cast the trade-off between task performance and invariance in an adversarial way, using an iterative minimax optimization. We show that adversarial training is unnecessary and sometimes counter-productive; we instead cast invariant representation learning as a single information-theoretic objective that can be directly optimized. We demonstrate that this approach matches or exceeds performance of state-of-the-art adversarial approaches for learning fair representations and for generative modeling with controllable transformations.", "full_text": "Invariant Representations without Adversarial\n\nTraining\n\nDaniel Moyer, Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan\n\nInformation Sciences Institute\n\nUniversity of Southern California\n\n{moyerd, gaos, brekelma}@usc.edu\n\n{gregv, galstyan}@isi.edu\n\nAbstract\n\nRepresentations of data that are invariant to changes in speci\ufb01ed factors are useful\nfor a wide range of problems: removing potential biases in prediction problems,\ncontrolling the effects of covariates, and disentangling meaningful factors of varia-\ntion. Unfortunately, learning representations that exhibit invariance to arbitrary nui-\nsance factors yet remain useful for other tasks is challenging. Existing approaches\ncast the trade-off between task performance and invariance in an adversarial way,\nusing an iterative minimax optimization. 
We show that adversarial training is\nunnecessary and sometimes counter-productive; we instead cast invariant\nrepresentation learning as a single information-theoretic objective that can be directly\noptimized. We demonstrate that this approach matches or exceeds performance\nof state-of-the-art adversarial approaches for learning fair representations and for\ngenerative modeling with controllable transformations.\n\n1 Introduction\n\nThe removal of unwanted information is a surprisingly common task. Transform-invariant features in\ncomputer vision, \u201cfair\u201d encodings from the algorithmic fairness community, and two-stage regressions\noften used in scientific studies are all cases of the same general concept: we wish to remove the effect\nof some outside variable c on our data x while still retaining the information relevant to our original task. In the context of\nrepresentation learning, we wish to map x into an encoding z that is uninformative of c, yet also\noptimal for our task loss L(z, . . . ).\nThese objectives are often operationalized as an independence constraint z \u22a5 c. Encodings satisfying\nthis condition are invariant under changes in c, and are thus called \u201cinvariant representations\u201d. In practice\nthese constraints are often relaxed to other measures; in recent works an adversary\u2019s ability to predict\nc from z has been used as such a proxy [15, 21], transplanting adversarial losses from the generative\nliterature to encoder/decoder settings.\nIn the present work we instead relax z \u22a5 c to a penalty on the mutual information I(z, c). We provide\nan analysis of this loss, showing that:\n\n1. I(z, c) admits a useful variational upper bound. This is in contrast to the usual lower bounds,\ne.g. bounds on I(z, y) for some labels y.\n\n2. 
When placed alongside the Variational Auto Encoder (VAE) and the Variational Information\nBottleneck (VIB) frameworks, the upper bound on I(z, c) produces a computationally\ntractable form for learning c-agnostic encodings and predictors.\n\n3. The adversarial approach can be also derived as a procedure to minimize I(z, c), but does\n\nnot provide an upper bound.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fOur proposed methods have the practical advantage of only requiring c at training time, and not at test\ntime. They are thus viable for production settings where accessing c is expensive (requiring human\nlabeling), impossible (requiring underlying transformations), or legally inadvisable (sharing protected\ndata). On the other hand, our method also produces a conditional decoder taking both c and z as\ninputs. While for some purposes this might be discarded at test time, it also can be manipulated to\nfunction similarly to Fader Networks [14], in that we can generate realistic looking transformations of\nan input image where some class label has been altered. We empirically test our proposed c-agnostic\nVAE and VIB methods on two standard \u201cfair prediction\u201d tasks, as well as an unsupervised learning\ntask demonstrating Fader-like capabilities.\n\n1.1 Related Work\n\nThe removal of covariate factors from scienti\ufb01c data has a long history. Observational studies\nin the sciences often cannot control for every factor in\ufb02uencing subjects, thus a large amount of\nliterature has been generated on the topic of removing such factors after data collection. Simple\nstatistical techniques usually involve modifying analyses with corresponding covariate effects [18]\nor sometimes multi-level regressions [7]. 
Modern methods have included more nuanced feature\ngeneration and more complex models, but follow in the same vein [5, 6]. \u201cRegressing out\u201d or\n\u201ccontrolling for\u201d unwanted factors can be effective, but it places strong constraints on later analyses\n(namely, the observation of the same unwanted covariates).\nA similar concept also has deep roots in computer vision, where transform-invariant features and/or\nmethods have been sought after for some time. Often these methods were designed for specific cases,\ne.g. scale-invariant or rotation-invariant features. Early examples include Steerable Filters [8, 11],\nand later SIFT [16]. For many image transformations, data augmentation has become a standard\npractical tool to encourage invariance.\nRecent work has provided a group-theoretic [4] analysis of the removal of covariate information\n(either by design or by augmentation), in which equivalences are drawn between finding invariant\nfeatures and finding the quotient space of the domain over a covariate group action. More generally,\nan empirical solution was proposed by Lample et al. [14], who removed specific visual features from\na latent representation through adversarial training.\nMore recently the algorithmic fairness community has investigated fair methods [12] and fair\nrepresentations [22]. Derived in part from the desire to avoid discriminating against protected classes of\nindividuals (and in part to avoid breaking laws and/or to avoid being the subject of a civil suit), the\nobjective of these methods has been to preserve task accuracy (usually classification or regression)\nwhile removing bias against the protected class of individuals.\nParticularly relevant to our work are the recent methods proposed by Louizos et al. [15] and Xie et\nal. [21], which have a similar problem setup. 
Both methods generate representations that make it\ndif\ufb01cult for an adversary to recover the protected class but are still useful for a classi\ufb01cation task.\nLouizos et al. propose the \u201cVariational Fair Auto-Encoder\u201d (VFAE), which, as its name suggests,\nmodi\ufb01es the VAE of Kingma and Welling [13] to produce fair1 encodings, as well as a supervised\ncase providing fair classi\ufb01cations. Xie et al. combine this concept with adversarial training, adding\n(inverted) gradient information to produce fair representations and classi\ufb01cations. This adversarial\nsolution coincides exactly with the conceptual framework used in a computer vision application for\nconstructing Fader Networks [14].\nCompressive encoding in a learning context is also well studied. In particular, the Information\nBottleneck [20] and its modern successor Variational Information Bottleneck (VIB) [3, 2] both\nprovide compressive encodings, aiming for \u201crelevance\u201d with respect to a target variable (usually a\nlabel). An unsupervised method, CorEx, also has a similar extension (Anchored Corex [9]), in which\nlatent factors can be driven toward speci\ufb01c targets. Our work could be thought of as adding \u201cnegative\nanchors\u201d or aiming for \u201cirrelevance\u201d with respect to protected classes.\nModels including \u201cnuisance\u201d factors were also considered by Soatto and Chiuso [19], in which the\nauthors propose de\ufb01nitions of both nuisances and invariant representations. The authors follow the\ngroup theoretic concept of nuisances. Achille and Soatto [1] directly utilize these results, providing\n1The de\ufb01nition of \u201cfair\u201d in an algorithmic setting is of some debate. A fair encoding in this paper is\nuninformative of protected classes. 
We offer no opinion on whether this is truly \u201cfair\u201d in a general or legal sense,\ntaking the word at face value as used by Louizos et al.\n\nthe same criterion and relaxation for invariance we will use here, minimal mutual information between\nthe representation z and the covariate c. While nuisances form only a small subsection of the paper,\nthe authors propose and test a sampling based approach to learning invariant representations. Their\nmethod is predicated on the ability to sample from the nuisance distribution (e.g. adding occlusions\nin images). We optimize a similar objective, but avoid such practical constraints.\n\n2 Model\n\nConsider a general task that includes an encoding of observed data x into latent variables z through the\nconditional likelihood q(z|x) (an encoder). Further assume that we observe a variable c which exhibits\nstatistical dependence with x (possibly non-linear dependence). We would like to find a q(z|x) that\nminimizes our loss function L from the original task, but that also produces a z independent of c.\nThis is clearly a difficult optimization; independence is a very strong condition. A natural relaxation\nof this is the minimization of the mutual information I(z, c). We can write our relaxed objective as\n\nmin_q L(q, x) + \u03bbI(z, c)    (1)\n\nwhere \u03bb is a trade-off parameter between the two objectives. L might involve other variables as well,\ne.g. labels y. Without details of L and its associated task, we can still provide insight into I(z, c).\nBefore continuing it is important to note that all entropic quantities related to z are from the encoding\ndistribution q unless explicitly stated otherwise. In some cases entropies depend on prior distributions,\np(z), and this will be explicitly noted.\nFrom properties of mutual information, we have that I(z, c) = I(z, x) \u2212 I(z, x|c) + I(z, c|x). Here,\nwe note that q(z|x) is the function that we are optimizing over, and thus the distribution of z solely\ndepends on x. Thus,\n\nI(z, c|x) = H(z|x) \u2212 H(z|x, c) = H(z|x) \u2212 H(z|x) = 0.    (2)\n\nUsing Mutual Information properties and a variational inequality, we can then write the following:\n\nI(z, c) = I(z, x) \u2212 I(z, x|c)    (3)\n= I(z, x) \u2212 H(x|c) + H(x|z, c)    (4)\n\u2264 I(z, x) \u2212 H(x|c) \u2212 Ex,c,z\u223cq[log p(x|z, c)]    (5)\n= Ez,x[log q(z|x) \u2212 log q(z)] \u2212 H(x|c) \u2212 Ex,c,z\u223cq[log p(x|z, c)]    (6)\n= Ex[ KL[ q(z|x) \u2225 q(z) ] ] \u2212 H(x|c) \u2212 Ex,c,z\u223cq[log p(x|z, c)].    (7)\n\nH(x|c) is a constant and can be ignored. In Eq. 5 we introduce the variational distribution p(x|z, c)\nwhich will play the traditional role of the decoder. I(z, c) is thus bounded up to a constant by a\ndivergence and a reconstruction error.\nThe result is similar in appearance to the bound from Variational Auto-Encoders [13], wherein\nwe balance the divergence between q(z|x) and a prior p(z) against the reconstruction error. Here\nour penalty on I(z, c) amounts to encouraging q(z|x) to be close to its marginal q(z), i.e. to vary\nless across inputs x, no matter the form of q(z|x) or q(z). From a coding viewpoint our penalty\nencourages the compression of x out of z using the I(z, x) term from Eq. 5.\nIn both interpretations, these penalties are tempered by conditional reconstruction error. This provides\nadditional intuition; by adding a copy of c into the reconstruction, we ensure that compressing away\ninformation in z about c is not penalized. In other words, conditional reconstruction combined with\ncompressing regularization leads to invariance w.r.t. the conditional input.\n\n2.1 Invariant Codes through VAE\n\nWe apply our proposed penalty to the VAE of Kingma and Welling [13], inspired by the similarity of\nthe penalty in Eq. 7 to the VAE loss function. 
The original VAE stems from the classical unsupervised\ntask of constructing latent factors, z, so that p(z), p(x|z) define a generative model that maximizes\nthe log likelihood of the data Ex[log p(x)]. This generally intractable expression is lower bounded\nusing Jensen\u2019s inequality and a variational approximation:\n\nlog p(x) \u2265 \u2212KL[ q(z|x) \u2225 p(z) ] + Ez\u223cq(z|x)[log p(x|z)].    (8)\n\nKingma and Welling [13] frame q(z|x) and p(x|z) as an encoder/decoder pair. They then provide a\nre-parameterization trick that, when used with standard function approximators (neural networks),\nallows for efficient estimation of latent codes z. In short, the re-parameterization is the following:\n\nq(z|x) = g\u03b8(x) + \u03b5,    \u03b5 \u223c N(0, \u03c3(\u03b8))    (9)\n\nwhere g\u03b8 is a deterministic function (neural network) with parameters \u03b8, and \u03b5 is an independent\nrandom variable from a Normal distribution2 also parameterized by \u03b8.\nWe can reformulate Kingma and Welling\u2019s VAE to include our penalty on I(z, c). Define data {xi}, i = 1, ..., N,\nand latent factors {zi}, but also define observed {ci} upon which x may have non-trivial dependence.\nThat is,\n\np(x, z, c) = p(z, c)p(x|z, c).    (10)\n\nThe invariant coding task is to find q(z|x), p(x|z, c) that maximize E(x,c)[log p(x|c)], subject to\nz \u22a5 c under z \u223c q (i.e. subject to the estimated code z being invariant to c). We make the same\nrelaxation as in Eq. 
1 to formulate our objective:\n\nmax E(x,c)[log p(x|c)] \u2212 \u03bbI(z, c).    (11)\n\nStarting with the first term, we can derive a familiar looking encoder/decoder loss function that now\nincludes c.\n\nlog p(x|c) = log \u222b p(x, z|c) dz    (12)\n= log \u222b [p(x, z|c) / q(z|x)] q(z|x) dz = log Ez\u223cq[ p(x, z|c) / q(z|x) ]    (13)\n\u2265 Ez\u223cq[log p(x, z|c) \u2212 log q(z|x)]    (14)\n= Ez\u223cq[log p(z|c) \u2212 log q(z|x) + log p(x|z, c)].    (15)\n\nBecause p(z|c) is a prior, we can make the assumption that p(z|c) = p(z), the prior marginal\ndistribution over z. This is a willful model misspecification: for an arbitrary encoder, the latent\nfactors, z, are probably not independent of c. However, practically we wish to find z that are\nindependent of c, thus it is reasonable to include such a prior belief in our generative model. Taking\nthis assumption, we have\n\nlog p(x|c) \u2265 Ez\u223cq[log p(z) \u2212 log q(z|x)] + Ez\u223cq[log p(x|z, c)]    (16)\n= \u2212KL[ q(z|x) \u2225 p(z) ] + Ez\u223cq[log p(x|z, c)].    (17)\n\nThis is almost exactly the same as the VAE objective in Eq. 8, except our decoder p(x|z, c) requires\nc as well as z. Putting this together with the penalty term Eq. 7, we have the following variational\nbound on the combined objective (up to a constant):\n\nE(x,c)[log p(x|c)] \u2212 \u03bbI(z, c) \u2265\nE(x,c)[ \u2212KL[ q(z|x) \u2225 p(z) ] \u2212 \u03bbKL[ q(z|x) \u2225 q(z) ] + (1 + \u03bb)Ez\u223cq[log p(x|z, c)] ].    (18)\n\nWe use this bound to learn c-invariant auto-encoders.\n\n2.1.1 Derivation of an approximation for the Conditional-Marginal divergence\n\nEquation 18 is our desired loss function for learning invariant codes z in an unsupervised context.\nUnfortunately it contains q(z), the empirical marginal distribution of latent code z, which is difficult\nto compute. 
Using the re-parameterization trick, q(z) becomes a mixture distribution and this allows\nus to approximate its divergence from q(z|x).\n\nKL[q(z|x) \u2225 q(z)] = \u2212H(q(z|x)) \u2212 \u222b q(z|x) log \u222b q(z|x\u2032)p(x\u2032) dx\u2032 dz    (19)\n\u2248 \u2212H(q(z|x)) \u2212 \u222b q(z|x) log (1/B) \u2211x\u2032 q(z|x\u2032) dz    (20)\n= \u2212H(q(z|x)) \u2212 Ez\u223cq|x[log \u2211x\u2032 q(z|x\u2032)] \u2212 log B    (21)\n\u2264 \u2212H(q(z|x)) \u2212 \u2211x\u2032 Ez\u223cq|x[log q(z|x\u2032)] \u2212 log B    (22)\n= \u2211x\u2032 [\u2212H(q(z|x)) \u2212 Ez\u223cq|x[log q(z|x\u2032)]] + (B \u2212 1)H(q(z|x)) \u2212 log B    (23)\n= \u2211x\u2032 KL[q(z|x) \u2225 q(z|x\u2032)] + (B \u2212 1)H(q(z|x)) \u2212 log B    (24)\n\nIn Eq. 24, each KL[q(z|x) \u2225 q(z|x\u2032)] is a KL divergence between Gaussians and H(q(z|x)) is a Gaussian entropy.\nWe can thus approximate KL[q(z|x) \u2225 q(z)] from its pairwise distances KL[q(z|x) \u2225 q(z|x\u2032)], which\nall have a closed form due to the reparameterization trick. While this requires O(B^2) operations\nfor batch size B, we can reduce pairwise Gaussian KL divergence to matrix algebra, making this\ncomputation fast in practice.\n\n2In the original paper, this was defined more generally; here we only consider the Normal distribution case.\n\nThis further provides insight into the previously proposed Variational Fair Auto-Encoder of Louizos\net al. [15]. In that paper, the authors add a Maximum Mean Discrepancy penalty as a somewhat ad hoc\nregularizer. This nevertheless works in practice quite well, as it encourages the statistical moments\nof each q(z|c) to be the same over the varying values of c. 
Our condition of KL[q(z|x) \u2225 q(z)] has\nequivalent minima, and shares the \u201cq-regularizing\u201d flavor of the MMD penalty.\n\n2.1.2 Alternate derivation leads to adversarial loss\n\nIn Equation 3 we used the identity I(z, c) = I(z, x) \u2212 I(z, x|c) + I(z, c|x), with the caveat that the\nthird term I(z, c|x) is zero. We could have instead used another identity, I(z, c) = H(c) \u2212 H(c|z).\nHere, the first term is constant, but expanding the second term provides the following:\n\nH(c|z) = Ec,z\u223cq[\u2212 log p(c|z)]    (25)\n= inf_{r(c|z)} Ec,z\u223cq[\u2212 log r(c|z)]    (26)\n\nE(x,c)[log p(x|c)] \u2212 \u03bbI(z, c) \u2265\nE(x,c)[ \u2212KL[ q(z|x) \u2225 p(z) ] + Ez\u223cq[log p(x|z, c)] ] + \u03bb inf_{r(c|z)} Ec,z\u223cq[\u2212 log r(c|z)]    (27)\n\nThe last inequality is again up to a constant term. Interpreting this in machine learning parlance,\nanother possible approach for minimizing I(z, c) is to optimize the conditional distribution p(c|z)\nso that r, the lowest entropy predictor of c given z, has the highest entropy (i.e. is as inaccurate as\npossible at predicting c). This is often operationalized by adversarial learning, and subsequent error\nmay be due in part to the adversary not achieving the infimum. Practically speaking, this may indicate\nthat over-training adversaries would benefit performance by bringing the adversarial gradient closer\nto the infimum adversary\u2019s gradient.\n\n2.2 Supervised Invariant Codes through the Variational Information Bottleneck\n\nLearned encodings are often used for downstream supervised prediction tasks. Just as in Variational\nFair Auto-encoders [15], we can model both at the same time to offer c-invariant predictions. Our\nformulation of this problem fits into the Information Bottleneck framework [20] and mirrors the\nVariational Information Bottleneck (VIB) [3].\nConceptually, VAEs have strong connections to the Information Bottleneck [3]. 
Stepping out of the\ngenerative context, we can \u201creroute\u201d our decoder to a label variable y. This gives us the following\ncomputational model:\n\nx \u2212\u2212q(z|x)\u2212\u2192 z \u2212\u2212p(y|z)\u2212\u2192 y    (28)\n\nThe bottleneck paradigm prescribes optimizing over q and p so that I(z, y) is maximal while\nminimizing I(x, z) (\u201cmaintaining information about y with maximal compression of x into z\u201d). As\nillustrated by Alemi et al. [3], this can be approximated using variational inference.\nWe can produce c-invariant codes in the supervised Information Bottleneck context using the\nrelaxation from Eq. 1. Beginning with the bottleneck objective max_{q,p} I(z, y) \u2212 \u03b2I(x, z) and then\nincluding the minimization of I(z, c), we have\n\nmax_{p,q} I(z, y) \u2212 \u03b2I(x, z) \u2212 \u03bbI(z, c)    (29)\n\nWe can then apply the same bound as in Eq. 5 to obtain, up to constant H(x|c), the following:\n\nmax_{p,q} I(z, y) \u2212 (\u03b2 + \u03bb)I(x, z) + \u03bbE[log p(x|z, c)].    (30)\n\nIn this objective we have a maximization of likelihood p(x|z, c). This is a decoder loss, adding a\nthird branch to our network. Following the derivation in Alemi et al. [3] as well as a similar path as\nin Section 2.1, the variational bound on the objective is\n\nI(z, y) \u2212 (\u03b2 + \u03bb)I(x, z) + \u03bbE[log p(x|z, c)] \u2265\nE(x,c)[ Ez,y[log p(y|z)] \u2212 (\u03b2 + \u03bb)KL[ q(z|x) \u2225 q(z) ] + \u03bbEz[log p(x|z, c)] ].    (31)\n\nWe use Eq. 31 to learn c-invariant predictors. Optimization is performed over three function\napproximations: one encoder q(z|x), one conditional decoder p(x|z, c), and one predictor p(y|z).\nWe further must compute KL[ q(z|x) \u2225 q(z) ] from the I(x, z) penalty term. Instead of following\nAlemi et al. [3], we again use the approximation to KL[ q(z|x) \u2225 q(z) ] from Eq. 24.\n\n3 Computation and Empirical Evaluation\n\nWe have two losses: the modified VAE loss (Eq. 
18) and the modified VIB loss (Eq. 31). In both we have to learn\nan encoder and decoder pair q(z|x) and p(x|z, c). We use feed forward networks to approximate\nthese functions. For q(z|x) we use the Gaussian reparameterization trick, and for p(x|z, c) we\nsimply concatenate c onto z as extra input features to be decoded. In the modified VIB we also\nhave a predictor branch p(y|z), which we also parametrize with a feed forward network. Specific\narchitectures (e.g. number of layers and nodes per layer for each branch) vary by domain.\nWe evaluate the performance of our proposed invariance penalty on two datasets with a \u201cfair\nclassification\u201d task. We also demonstrate \u201cFader Network\u201d-like capabilities for manipulating specified\nfactors in generative modeling on the MNIST dataset.\n\n3.1 Fair Classification\n\nFor each fair classification dataset/task we evaluated both prediction accuracy and adversarial error in\npredicting c from the latent code. We compare against the Variational Fair Autoencoder (VFAE) [15],\nand the adversarial method proposed in Xie et al. [21]. Both datasets are from the UCI repository.\nThe preprocessing for both datasets follows Zemel et al. [22], which is also the source for the\npre-processing in our baselines [15, 21].\nThe first dataset is the German dataset, containing 1000 samples of personal financial data. The\nobjective is to predict whether a person has a good credit score, and the protected class is Age (which,\nas per [22], is binarized). The second dataset is the Adult dataset, containing 45,222 data points of\nUS census data. The objective is to predict whether or not a person has over 50,000 dollars saved in\nthe bank. The protected factor for the Adult dataset is Gender3.\nWherever possible we use architectural constraints from previous papers. All encoders and decoders\nare single layer, as specified by Louizos et al. 
[15] (including those in the baselines), and for both\ndatasets we use 64 hidden units in our method as in Xie et al., while for VFAE we use their described\narchitecture. We use a latent space of 30 dimensions for each case. We train using Adam with the\nsame hyperparameter settings as in Xie et al., and a batch size of 128. Optimization and parameter\ntuning are done via a held-out validation set.\n\n3In some papers the protected factor for the Adult dataset is reported as Age, but those papers also reference\nZemel et al. [22] as the processing and experimental scheme, which specifies Gender.\n\nGerman Dataset     Adv. Loss    Pred Acc.\nMaj. Class         0.725        0.695\nVFAE [15]          0.717        0.720\nXie et al. [21]    0.811        0.695\nProposed           0.698        0.710\n\nAdult Dataset      Adv. Loss    Pred Acc.\nMaj. Class         0.675        0.752\nVFAE [15]          0.882        0.842\nXie et al. [21]    0.888        0.831\nProposed           0.675        0.844\n\nFigure 1: On the left we display the adversarial loss (the accuracy of the adversary on c) and predictive\naccuracy on y for three methods, plus the majority-class baseline, on both Adult and German datasets.\nFor adv. loss lower is better, while for pred. acc. higher is better. On the right we plot adversarial\nloss by varying adversarial strength (indicated by color), parameterized by the number of layers from\nzero (logistic regression) to three. All evaluations are performed on the hold-out test sets.\n\nFigure 2: t-SNE plots for the latent encodings of (Left to Right) the VFAE, Xie et al., and our\nproposed method on the Adult dataset (first 1000 pts., test split). The value of the c variable is\nprovided as color, where red is the majority class.\n\nFor each tested method we train a discriminator to predict c from generated latent codes z. These\ndiscriminators are trained independently from the encoder/decoder/within-method adversaries. We\nuse the architecture from Xie et al. 
[21] for these post-hoc adversaries, which describes a three-layer\nfeed-forward network trained using batch normalization and Adam (using \u03b3 = 1 and a learning rate\nof 0.001), with 64 hidden units per layer, using absolute error. We generalize this to four adversaries,\nincreasing in the number of hidden layers. Each discriminator is trained post-hoc for each model,\neven in cases with a discriminator in the model (e.g. the model proposed by Xie et al. [21]).\n\n3.2 Unsupervised Learning\n\nWe demonstrate a form of unsupervised image manipulation inspired by Fader Networks [14] on the\nMNIST dataset. We use the digit label as the covariate class c, which pushes all non-class stylistic\ninformation into the latent space while attempting to remove information about the exact digit being\nwritten. This allows us to manipulate the decoder at test time to produce different artificial digits\nbased on the style of one digit. We use 2 hidden layers with 512 nodes for both the encoder and the\ndecoder.\n\n4 Results\n\nFor the German dataset, shown in the top table of Figure 1, the methods are roughly equivalent. All\nmethods have comparable predictive accuracy, while the VFAE and the proposed method have\ncompetitive adversarial loss. In general, however, the smaller dataset does not differentiate the\nmethods.\n\nFigure 3: We demonstrate the ability to generate stylistically similar images of varying classes using\nthe MNIST dataset. The left column is mapped into z that is invariant to its digit label c. We then can\ngenerate an image using z and any other specified digit, c\u2032, as shown on the right.\n\nFor the larger Adult dataset, shown in the bottom table of Figure 1, all three methods again have\ncomparable predictive accuracy. However, against stronger adversaries each baseline has very high\nloss. 
Our proposed method has comparable accuracy with the VFAE, while providing the best\nadversarial error across all four adversarial difficulty levels.\nWe further visualized a projection of the latent codes z using t-SNE [17]; invariant representations\nshould produce inseparable embeddings for each class. All methods have large red-only regions; this\nis somewhat expected for the majority class. However, both baseline methods have blue-only regions,\nwhile the proposed method has only a heterogeneous region4.\nFigure 3 demonstrates our ability to manipulate the conditional decoder. The left column contains the\nactual images (randomly selected from the test set), while the right columns contain images generated\nusing the decoder. Particularly notable are the transfer of azimuth and thickness, and the failure of\nsome styles to transfer to some digits (usually curved to straight digits or vice versa).\n\n5 Discussion\n\nAs shown analytically in Section 2.1.2, in the optimal case adversarial training can perform as well as\nour derived method; it is also intuitively simple and allows for more nuanced tuning. However, it\nintroduces an extra layer of complexity (indeed, a second optimization problem) into our system. In\nthis particular case of invariant representation, our results lead us to believe that adversarial training\nis unnecessary.\nThis does not mean that adversarial training for invariant representations is strictly worse in practice.\nThere are certainly cases where training an adversary may be easier or less restrictive than other\nmethods, and due to its shared literature with Generative Adversarial Networks [10], there may be\ntraining heuristics or other techniques that can improve performance.\n\n4Previous versions of this paper had severely contorted latent codes for the Xie et al. baseline. Further\ninvestigation showed this to be a convergence issue. Mild performance improvements were also observed.\n\nOn the other hand, we believe that our derivations here shed light on why these methods might fail.\nWe believe specific failure modes of adversarial training can be attributed to Eq. 27, where the\nadversary fails to achieve the infimum. Bad approximations (i.e. weak or poorly trained adversaries)\nmay provide bad gradient information to the system, leading to poor performance of the encoder\nagainst a post-hoc adversary.\nOur experimental results do not match those reported in Xie et al. While in general their method has\ncomparable performance for predictive accuracy, we do not find that their adversarial error is low;\ninstead, we find that the encoder/adversary pair becomes stuck in local minima. We also find that\nthe adversary trained alongside the encoder performs badly against the encoder (i.e. the adversary\ncannot predict c well), but a post-hoc trained adversary performs very well, easily predicting c (as\ndemonstrated by our experiments).\nIt may be that we have inadvertently built a stronger adversary. We have attempted to follow\nthe authors\u2019 experimental design as closely as possible, using the same architecture and the same\nadversary (using the gradient-flip trick and 3-layer feed forward networks). With the details provided\nwe could not replicate their reported adversarial error for their method, nor for the VFAE method.\nHowever, we are able to reproduce the adversarial error reported in Louizos et al., which uses logistic\nregression. 
In general for stronger adversaries the adversarial loss will increase, but the relative\nrankings should remain roughly the same.\n\n6 Conclusion\n\nWe have derived a variational upper bound for the mutual information between latent representations\nand covariate factors. Provided a dataset with labeled covariates, we can train both supervised and\nunsupervised learning methods that are invariant to these factors without the use of adversarial\ntraining. After training our method can be used in production without requiring covariate labels.\nFinally, our approach also enables manipulation of speci\ufb01ed factors when generating realistic data.\nOur direct, information-theoretic optimization approach avoids the pitfalls inherent in adversarial\nlearning for invariant representation and produces results that match or exceed capabilities of these\nstate-of-the-art methods.\n\nAcknowledgements\n\nThis work was supported by DARPA grants W911NF-16-1-0575 and FA8750-17-C-0106, as well as\nthe NSF Graduate Research Fellowship Program Grant Number DGE-1418060. We would like to\nthank the conference organizers, area chairs, and especially the anonymous reviewers for their work\nand helpful input. We also would like to thank Ayush Jaiswal for several insightful conversations,\nand Ishaan Gulrajani for \ufb01nding and correcting a bug in our evaluation code.\n\nReferences\n\n[1] A. Achille and S. Soatto. On the emergence of invariance and disentangling in deep representa-\n\ntions. arXiv preprint arXiv:1706.01350, 2017.\n\n[2] A. Achille and S. Soatto. Information dropout: Learning optimal representations through noisy\n\ncomputation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.\n\n[3] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck.\n\narXiv preprint arXiv:1612.00410, 2016.\n\n[4] T. Cohen and M. Welling. Group equivariant convolutional networks. 
In International Conference on Machine Learning, pages 2990\u20132999, 2016.\n\n[5] R. A. Feis et al. ICA-based artifact removal diminishes scan site differences in multi-center resting-state fMRI. Frontiers in Neuroscience, 9:395, 2015.\n\n[6] J.-P. Fortin et al. Harmonization of multi-site diffusion tensor imaging data. NeuroImage, 161:149\u2013170, 2017.\n\n[7] R. P. Freckleton. On the misuse of residuals in ecology: regression of residuals vs. multiple regression. Journal of Animal Ecology, 71(3):542\u2013545, 2002.\n\n[8] W. T. Freeman, E. H. Adelson, et al. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891\u2013906, 1991.\n\n[9] R. J. Gallagher, K. Reing, D. Kale, and G. V. Steeg. Anchored correlation explanation: Topic modeling with minimal domain knowledge. arXiv preprint arXiv:1611.10277, 2016.\n\n[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672\u20132680, 2014.\n\n[11] H. Greenspan, S. Belongie, R. Goodman, P. Perona, S. Rakshit, and C. H. Anderson. Overcomplete steerable pyramid filters and rotation invariance. 1994.\n\n[12] F. Kamiran and T. Calders. Classifying without discriminating. In 2nd International Conference on Computer, Control and Communication, pages 1\u20136. IEEE, 2009.\n\n[13] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.\n\n[14] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pages 5969\u20135978, 2017.\n\n[15] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.\n\n[16] D. G. Lowe. 
Object recognition from local scale-invariant features. In IEEE International Conference on Computer Vision, volume 2, pages 1150\u20131157. IEEE, 1999.\n\n[17] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579\u20132605, 2008.\n\n[18] S. W. Raudenbush. Random effects models. The handbook of research synthesis, 421, 1994.\n\n[19] S. Soatto and A. Chiuso. Visual representations: Defining properties and deep approximations. arXiv preprint arXiv:1411.7676, 2014.\n\n[20] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.\n\n[21] Q. Xie, Z. Dai, Y. Du, E. Hovy, and G. Neubig. Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems, pages 585\u2013596, 2017.\n\n[22] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325\u2013333, 2013.\n", "award": [], "sourceid": 5443, "authors": [{"given_name": "Daniel", "family_name": "Moyer", "institution": "University of Southern California"}, {"given_name": "Shuyang", "family_name": "Gao", "institution": "ISI USC"}, {"given_name": "Rob", "family_name": "Brekelmans", "institution": "University of Southern California"}, {"given_name": "Aram", "family_name": "Galstyan", "institution": "USC Information Sciences Inst"}, {"given_name": "Greg", "family_name": "Ver Steeg", "institution": "University of Southern California"}]}